If we set an environment variable, we can stub out calls to SES and send
them to our own stub app. If the environment variable is not set, things
work as normal.
To be used alongside
https://github.com/alphagov/notifications-email-provider-stub
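Roughly how the switch might look (the SES_STUB_URL variable name and
the boto3 client setup here are illustrative, not the actual code):

```python
import os

import boto3


def get_ses_client():
    # if the stub URL is set, point boto3 at our stub app instead of real SES
    stub_url = os.environ.get("SES_STUB_URL")
    if stub_url:
        return boto3.client("ses", region_name="eu-west-1", endpoint_url=stub_url)
    # otherwise things work as normal and we talk to AWS SES
    return boto3.client("ses", region_name="eu-west-1")
```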
We have seen the reporting app run out of memory multiple times when
dealing with overnight tasks. The app runs 11 worker threads and we
reduce this to 2 worker threads to put less pressure on a single
instance.
The number 2 was chosen because most of the tasks processed by the
reporting app only take a few minutes, and usually only one or two take
more than an hour. With 2 processes across our current 2 instances, a
long-running task should hopefully only wait behind a few short-running
tasks before being picked up, so we shouldn't see a large increase in
the overall time taken to run all our overnight reporting tasks.
On top of reducing the concurrency for the reporting app, we also set
CELERYD_PREFETCH_MULTIPLIER=1. We do this as suggested by the celery
docs because this app deals with long running tasks.
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit
The change in prefetch multiplier should again optimise the overall time
it takes to process our tasks by ensuring that tasks are given to
instances that have (or will soon have) spare workers to deal with them,
rather than committing to putting all the tasks on certain workers in
advance.
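For reference, a sketch of the two settings together (using the Celery
3.1-style names from the linked docs; exactly where they live in our
config may differ):

```python
# sketch of the reporting worker's Celery settings
CELERYD_CONCURRENCY = 2          # down from 11 worker processes per instance
CELERYD_PREFETCH_MULTIPLIER = 1  # each worker reserves only its current task
```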
Note, another suggestion from the optimisation docs is to start setting
`ACKS_LATE` on the long running tasks.
This setting would effectively change us from prefetching 1 task per
worker to prefetching 0 tasks per worker and further optimise how we
distribute our tasks across instances. However, we decided not to try
this setting as we weren't sure whether it would conflict with our
visibility_timeout. We decided not to spend the time investigating but
it may be worth revisiting in the future, as long as tasks are
idempotent.
Overall, this commit takes us from potentially having all 18 of our
reporting tasks get fetched onto a single instance to now having a
process that will ensure tasks are distributed more fairly across
instances based on when they have available workers to process the
tasks.
The reporting worker tasks fetch large amounts of data from the db, do
some processing, then store the results back in the database. As the
reporting worker
only processes the create nightly billing/stats table tasks, which
aren't high performance or high volume, we're fine with the performance
hit from restarting the worker between every task (which based on
limited local testing takes about a second or so).
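I believe the restart-per-task behaviour comes down to something like
the max-tasks-per-child setting (again using the 3.1-style name; treat
this as an assumption rather than the exact change):

```python
# recycle each reporting worker process after every task so memory used by
# the large nightly billing/stats queries is released back to the OS
CELERYD_MAX_TASKS_PER_CHILD = 1
```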
Restarting the worker between tasks causes some real funky shit with
the app_context (used for accessing current_app.logger). To access
flask's global state we use the standard way of importing
`from flask import current_app`. However,
processes after the first one don't have the current_app available on
shut down (they're fine during the actual task running), and are unable
to call `with current_app.app_context()` to create it. They _are_ able
to call `with app.app_context()` to create it, where `app` is the
initial app that we pass in to `NotifyCelery.init_app`.
NotifyCelery.init_app is only called once, in the master process - I
think the application state is then stored and passed to the celery
workers. It looks like the teardown might clear it, and then it never
gets set up again for the new workers? Unsure.
To fix this, store a copy of the initial flask app on the NotifyCelery
object and then use that from within the shutdown signal logging
function.
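A rough sketch of the workaround (the attribute name and signal handler
are illustrative, not the exact code):

```python
from celery import Celery
from celery.signals import worker_process_shutdown


class NotifyCelery(Celery):
    def init_app(self, app):
        # (rest of the existing init_app setup omitted)
        # keep a reference to the initial flask app so the shutdown hook can
        # use it even when current_app is no longer available
        self._flask_app = app


notify_celery = NotifyCelery()


@worker_process_shutdown.connect
def log_worker_process_shutdown(sender=None, pid=None, exitcode=None, **kwargs):
    # current_app isn't available here for worker processes after the first
    # one, so push a context from the stored app instead
    flask_app = notify_celery._flask_app
    with flask_app.app_context():
        flask_app.logger.info(
            "Celery worker process %s shutting down with exit code %s", pid, exitcode
        )
```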
Nothing's ever easy ¯\_(ツ)_/¯
We temporarily updated the provider resting points in
https://github.com/alphagov/notifications-api/pull/2804.
This puts the provider resting points back to their original value now
that both providers seem to be functioning well.
We are seeing issues with one of our providers. Move all traffic to the
other. We have done this with the provider load balancer, however the
balancer will move traffic back towards these resting points by 10
percent every hour, which we want to stop happening until we are
confident our provider has fixed their issues.
In the long run, we should add functionality to pause our load balancer
behaviour.
Moving it from 4:15am to 3:00am. This will mean we do more deleting before
the 'work day' starts and improve our DB performance.
I've looked at the last 14 days of logs for the
`create-nightly-notification-status` subtasks and the
`create-nightly-billing` subtasks. The latest they appear to finish is
2.30AM. There were some outliers but I believe these were people running
the tasks in the middle of the day as a manual process.
Obviously, this still means there is a risk that those tasks conflict
with `delete-notifications-older-than-retention`, even more so now that
we have moved this to 3am.
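For reference, the schedule change itself is just a crontab tweak along
these lines (a sketch; the real beat schedule entry may differ slightly):

```python
from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    "delete-notifications-older-than-retention": {
        "task": "delete-notifications-older-than-retention",
        # was crontab(hour=4, minute=15); moved earlier so the deletes
        # finish before the working day starts
        "schedule": crontab(hour=3, minute=0),
    },
}
```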
Based on this https://github.com/alphagov/notifications-api/pull/2788
where some concerns were raised. This should be a quicker fix to get
the deletions to run sequentially for all the notification types. Note,
email goes first as it's the most important and makes up the largest
numbers (we wouldn't want it to start with SMS, fail halfway for a
reason that only affects SMS and for that to affect the email deletion).
We hope that by running sequentially we will reduce conflicts writing
to the same index, which should reduce the total time it takes to
finish deleting all notification types older than their retention time.
There is a risk that, whilst each job is quicker, the jobs now run
sequentially rather than potentially overlapping and so will take
longer overall. We will need to monitor to see.
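In rough terms, the parent task now does something like this (the
function names are illustrative, not the real code):

```python
NOTIFICATION_TYPES_IN_DELETION_ORDER = ["email", "sms", "letter"]


def delete_notifications_for_type(notification_type):
    ...  # existing per-type delete query


def delete_notifications_older_than_retention():
    # run the deletes one notification type at a time, email first, so a
    # failure that only affects one type can't block the email deletion and
    # so we avoid concurrent writes to the same index
    for notification_type in NOTIFICATION_TYPES_IN_DELETION_ORDER:
        delete_notifications_for_type(notification_type)
```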
Once a contact list is gone from the database there’s no way to
reference it again. Any jobs that used it will have made their own copy.
So we can clean it up, meaning we’re not storing personal data longer
than we need to.
Instead of saving the email notification to the db, add it to a queue to save later.
This is an attempt to alleviate pressure on the db from the api requests.
This initial PR is to trial it and see if we see an improvement in api performance and a reduction in queue pool errors. If we are happy with this we could remove the hard coding of the service id.
In a nutshell:
- If POST /v2/notification/email is from our high volume service (hard coded for now) then create the notification and send it to a queue, to be persisted to the db later.
- create a save_api_email task to persist the notification
- return the notification
- New worker app to process the save_api_email tasks.
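Very roughly, the new request path looks like this (the helper names
and the placeholder service id are illustrative):

```python
import uuid

HIGH_VOLUME_SERVICE_ID = uuid.UUID("00000000-0000-0000-0000-000000000000")  # placeholder


def post_email_notification(service_id, notification):
    if service_id == HIGH_VOLUME_SERVICE_ID:
        # don't write to the db in the request path - put the notification
        # on a queue and let the new worker app persist it later
        queue_save_api_email_task(notification)
    else:
        # existing behaviour: persist straight to the db
        save_notification_to_db(notification)

    # either way the notification is returned to the caller immediately
    return notification


def queue_save_api_email_task(notification):
    ...  # e.g. a save_api_email celery task sent to the new queue


def save_notification_to_db(notification):
    ...  # existing persistence code
```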
We moved from sending statsd metrics to hosted graphite to sending to
one that is running on the paas. Therefore we no longer need to send
statsd metrics to a particular prefix at the statsd app as it is only
receiving statsd metrics from our apps (not other users like would have
been the case with HostedGraphite).
This should change no behaviour as the only place the environment
variable was being used was in the gunicorn config and it was an empty
string which is the default behaviour anyway as per:
https://docs.gunicorn.org/en/stable/settings.html#statsd-prefix
This will allow us to accept two different secrets and therefore allow
us to rotate the secret that the admin client is sending to the API.
Due to how the notifications-python-client throws exceptions, we run
into exactly the same issue with not being able to distinguish if a
`TokenDecodeError` is thrown because the token was encrypted with a
different secret key or because there was a different error when
decoding. I've copied the TODO from `requires_auth` as this is exactly
the same issue.
I've also added a test case for functionality that was missing for an
out of date admin token (old IAT).
And support this change across our code. Note, this is a halfway step
where it is now a list rather than a string, but still only supports a
single secret, ie one item in the list.
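The validation ends up looking roughly like this (a sketch assuming the
python client's decode_jwt_token helper and TokenDecodeError; the
surrounding function is illustrative):

```python
from notifications_python_client.authentication import decode_jwt_token
from notifications_python_client.errors import TokenDecodeError


def token_matches_any_admin_secret(token, admin_client_secrets):
    # try each configured secret in turn, so tokens signed with either the
    # old or the new secret are accepted while we rotate
    for secret in admin_client_secrets:
        try:
            decode_jwt_token(token, secret)
            return True
        except TokenDecodeError:
            # TODO (copied from requires_auth): we can't tell whether this
            # failed because the token used a different secret or because of
            # some other decode error - try the next secret regardless
            continue
    return False
```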
These alerts are sent to our postal provider, and usually arrive as they are getting ready to go home for the day or the weekend.
Which means they get missed/overlooked. They have agreed to get the alert an hour earlier; perhaps that will improve the response time.
Added the queue and task names for the new template preview task to the
config. Also added the new bucket name that template preview will use
for the sanitised letters to the config for all environments.
we generally aim to share the load between the two providers equally
(more or less). When one provider has struggled, we deprioritise them;
this commit adds a function that gradually restores balance. It checks
every five minutes: if it's been more than an hour since the providers
were last changed then it adjusts them towards a 50/50 split. Except
it's not quite 50/50 due to #reasons (we want to slightly favour MMG);
it's actually 60/40. That's defined in a new dict in config.py.
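In rough terms the new function does something like this (the dict
mirrors the new config entry; the function name and the 10-point step
are assumptions based on the description above and the earlier note
about reverting 10 percent per hour):

```python
from datetime import datetime, timedelta

# the new dict in config.py - not quite 50/50 because we want to slightly
# favour MMG
SMS_PROVIDER_RESTING_POINTS = {"mmg": 60, "firetext": 40}


def adjust_provider_priorities_towards_resting_points(priorities, last_updated_at):
    # runs every five minutes, but only acts if nothing has changed the
    # priorities for over an hour
    if datetime.utcnow() - last_updated_at < timedelta(hours=1):
        return priorities

    adjusted = {}
    for provider, resting_point in SMS_PROVIDER_RESTING_POINTS.items():
        current = priorities[provider]
        if current < resting_point:
            adjusted[provider] = min(current + 10, resting_point)
        else:
            adjusted[provider] = max(current - 10, resting_point)
    return adjusted
```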
these URLs never change, and it led to surprising issues where an
updated default MMG_URL wasn't actually respected on PaaS. These URLs
aren't private and don't need to be stored in credentials.
By not defining them in the manifest, we expect them to use the default
unless `cf set-env` has been specifically used to modify them in an app.
we don't use it since we wrote our own provider stubs for performance
tests.
this removes it from the api - it's still in the DB and will be
retrieved by queries, but is set to disabled on prod
the nightly tasks need to run after the create nightly notification
status task - so that test notifications are still there to record
stats for, and to stop the risk of deleting notifications part-way
through recording stats for them.
There's a bug in pysftp that appears to cause quadratic performance loss. See https://github.com/paramiko/paramiko/issues/1141 for more details.
As a temporary band-aid fix, lower the size of the files we're sending.
* The `_should_record_notification_in_history_table` function stopped being
used in this commit: c23ae15f32
* `NOTIFICATIONS_ALERT` stopped being used in this commit: 5aa37f09b6
we build up one personalisation dict, and then pass it in to all the
different templates - so be careful editing things. also of note, we
check if the agreement_signed_on_behalf_of is set, and send a different
template with slightly different wording to the person who clicked the
confirm button.
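Loosely, the flow is something like this (template keys and field names
are illustrative):

```python
def send_agreement_confirmation_emails(agreement, send_email):
    # one shared personalisation dict is passed to every template, so be
    # careful: adding, renaming or removing a key here affects all of them
    personalisation = {
        "organisation_name": agreement.organisation_name,
        "signed_by_name": agreement.signed_by_name,
        "on_behalf_of_name": agreement.signed_on_behalf_of,
    }

    if agreement.signed_on_behalf_of:
        # the person who clicked confirm signed on someone else's behalf, so
        # they get a template with slightly different wording
        send_email("signed-on-behalf-of-template", personalisation)
    else:
        send_email("signed-by-template", personalisation)
```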
Added a scheduled task to run once a day and check if there were any
letters from before 17.30 that still have a status of 'created'. This
logs an exception instead of trying to fix the error because the fix
will be different depending on which bucket the letter is in.
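A rough sketch of the check (the query helper and exact cutoff handling
are illustrative):

```python
from datetime import datetime, time

from flask import current_app


def check_for_letters_still_in_created():
    # letters created before 17.30 should have moved past 'created' by now;
    # if any haven't, something has gone wrong earlier in the pipeline
    cutoff = datetime.combine(datetime.utcnow().date(), time(17, 30))
    stuck_letters = get_letters_created_before_with_status(cutoff, "created")  # illustrative query

    if stuck_letters:
        # just log it - the fix differs depending on which bucket the letter
        # is in, so a person needs to investigate rather than the task
        # trying to repair anything itself
        current_app.logger.exception(
            "%s letters created before 17.30 still have status 'created'",
            len(stuck_letters),
        )
```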
Added a task which runs twice a day on weekdays and checks for letters that have
been in the state of `pending-virus-check` for over 90 minutes. This is
just logging an exception for now, not trying to fix things, since we
will need to manually check where the issue was.