notifications-api

mirror of https://github.com/GSA/notifications-api.git synced 2025-12-23 00:41:35 -05:00

Author	SHA1	Message	Date
Ben Thorner	779b8e941f	Rewrite broadcast Zendesk alert at approval time The new alert happens earlier but is otherwise the same: - We only create a ticket in Production. - We only create a ticket on approval. I took this opportunity to refactor the alert as a private function and test this specifically in detail to avoid lots of repetitive mocks, which are required when calling the main "update" function. One test I haven't preserved was for when the "names" array is empty, as this was added for a legacy data integrity scenario [^1]. [^1]: `bf0bf4e31c`	2022-04-05 12:57:08 +01:00
Ben Thorner	3988a6cd07	Include exception info in SMS warning log This makes it easier to debug failures when adding a new provider.	2022-03-30 13:36:56 +01:00
Ben Thorner	b439fd0718	Add boilerplate for Reach SMS callbacks This is enough to update a notification in DB: 1. First create a notification in the UI and send it. 2. Then reset its attributes to pretend it's for Reach. update notifications set sent_at = null, sent_by = null, notification_status='sending' where id='some-uuid'; 3. Change "notification_id" to "<some-uuid>" in the code. 4. Call the boilerplate endpoint for Reach callbacks. curl -X POST localhost:6011/notifications/sms/reach Interestingly there's no foreign key constraint on "sent_by" in the DB, so this just works: the notification is updated.	2022-03-24 16:56:33 +00:00
Leo Hemsted	2fbe9e85ac	Merge pull request #3479 from alphagov/auto-retry-stuck-av-letters automatically retry letters stuck in pending-virus-scan	2022-03-15 11:43:42 +00:00
Leo Hemsted	9e8df8b623	remove "letters stuck pending av" runbook there's not anything we know we need to do now that we resolve stuck letters automatically. Letters could still get into this state, so it's worth alerting us. However, we don't have anything concrete that we know how to fix these letters, so we should just remove the runbook entirely.	2022-03-10 14:10:01 +00:00
David McDonald	0d952b4d8c	Reduce timeout for service callback attempt to 5 seconds It is currently 60 seconds but we have had two incidents in the past week where there is a connection error talking to a service and the request takes up to 60 seconds before failing. When this happens, if there are a few of these callbacks then all of them will completely hog the service callback worker and build up a big queue of all the other service callbacks. 5 seconds has been chosen as that is still a pretty decent length time for a simple web request that should just be giving them a little bit of information for them to store. 5 seconds should be a sufficient enough reduction that we dramatically reduce this problem for the moment. Open to this number being changed in the future based on how we see it perform.	2022-03-08 13:05:32 +00:00
Leo Hemsted	00259893f1	automatically retry letters stuck in pending-virus-scan Since sept 2019 we've had to log on to production around once every twenty days to restart the virus scan task for a letter. Most of the time this is just a case of making sure the file is in the scan bucket, and then triggering the task. If the file isn't in the scan bucket we'd need to do some more manual investigation to find out exactly where the file got stuck, but I can only remember times when it's been in the scan bucket. So if the file is in the scan bucket, we can just check that with code and kick the task off automatically.	2022-03-07 18:31:46 +00:00
Katie Smith	514bd48614	Update flake8-bugbear from 20.11.1 to 22.1.11 And ignore a warning, since I did not think that in this case "Using .strip() with multi-character strings is misleading the reader".	2022-03-02 16:51:09 +00:00
Ben Thorner	a69d1635a1	Update FactStatus table in bulk for each service Previously we were looping over data from the Notifications/History table and then shovelling it into the status table, one row at a time - plus an extra delete to clean up any existing data. This replaces that with a batch insertion, similar to how we archive notifications [1], but using a simple subquery (via "from_select" [2]) instead of a temporary table. To make the select compatible with the insert, I've used "literal" to inject the constant pieces of data, so each row has everything it needs to go into the status table. [1]: `9ce6d2fe92/app/dao/notifications_dao.py (L295)` [2]: https://docs.sqlalchemy.org/en/14/core/dml.html#sqlalchemy.sql.expression.Insert.from_select	2022-02-16 13:40:05 +00:00
Ben Thorner	ef231d5de7	Fix task name and action in status task logs	2022-02-16 11:45:45 +00:00
Ben Thorner	7f4b140f97	Rename function to make it consistent This is consistent with the new "on_date" function. It was going off the edge of my screen before in some parts of the code.	2022-02-09 17:39:08 +00:00
Ben Thorner	1213463b8e	Only aggregate status when necessary for a service This takes a similar approach to the nightly deletion task so that we only create sub-tasks when there are actually notifications to aggregate for a given type and day [1]. We're making this change to stop the duplication errors we're getting at the moment and ensure the task can scale to more messages and more services. There are two parts to this: - Each subtask should now run within the 5 minute visibility timeout. However, they may still be duplicated if the parent task overruns [2]. - The parent task creates a minimal number of subtasks, and the query to determine this is very fast for a normal process day (milliseconds). Since all tasks will run quickly, there should be no more duplication. In order to test this more nuanced task, I rewrote the tests: - One test checks the subtask is called correctly. - One test checks we create all the right subtasks. [1]: https://github.com/alphagov/notifications-api/pull/3381 [2]: https://docs.google.com/document/d/1MaP6Nyy3nJKkuh_4lP1wuDm19X8LZITOLRd9n3Ax-xg/edit#heading=h.q3intzwqhfzl	2022-02-09 17:39:07 +00:00
Ben Thorner	c8db58d0e8	Reorder loops for creation status agg sub tasks This will help tailor the innermost loop on services.	2022-02-09 17:39:06 +00:00
Ben Thorner	d6678b6a70	Remove unnecessary logs from status aggregation These can be inferred elsewhere: - Task creation is obvious from task execution. If we're concerned about a specific service, we can check the updated times on the DB records, since all records are recreated each time this runs. - Task starting is already logged. - Task completion is already logged. The number of rows updated can also be inferred from the DB. The log I've found useful is the one about fetching the data, and I've also added another to time how long it takes to insert the data, as both could be sources of poor performance. Arguably we should use metrics for this sort of thing, but logs are easier in practice for the metric systems we have.	2022-02-09 17:39:05 +00:00
Ben Thorner	018a253b6f	Revert "Revert running status aggregation in parallel" This reverts commit `0f6dea0deb`.	2022-02-09 17:39:00 +00:00
Chris Hill-Scott	7f72d3a60f	Bump utils to 53.0.0 Changes: 53.0.0 --- * `notifications_utils.columns.Columns` has moved to `notifications_utils.insensitive_dict.InsensitiveDict` * `notifications_utils.columns.Rows` has moved to `notifications_utils.recipients.Rows` * `notifications_utils.columns.Cell` has moved to `notifications_utils.recipients.Cell` 52.0.0 --- * Deprecate the following unused `redis_client` functions: - `redis_client.increment_hash_value` - `redis_client.decrement_hash_value` - `redis_client.get_all_from_hash` - `redis_client.set_hash_and_expire` - `redis_client.expire` 51.3.1 --- * Bump govuk-bank-holidays to cache holidays for next year.	2022-02-08 09:45:10 +00:00
Rebecca Law	09c8fbe982	Merge pull request #3418 from alphagov/letters-too-long Mark letters as validation-failed if the templated letter is too long.	2022-02-02 08:30:50 +00:00
Rebecca Law	c01c81326c	Update log message to something a little easier to read and query for.	2022-01-24 12:25:53 +00:00
Leo Hemsted	246016a894	don't log if we don't delete anything for a service we try and delete for lots of services. this includes services that don't actually have anything to delete that day. that might be because they had a custom data retention so we always go to check them, or because they only sent test notifications (which we'll delete but not include in the count in the log line). we don't really need to see log lines saying that we didn't delete anything for that service - that's just a long list of boring log messages that will hide the actual interesting stuff - which services we did delete content for.	2022-01-21 11:04:37 +00:00
Ben Thorner	0f6dea0deb	Revert running status aggregation in parallel The top-level task didn't run successfully after this was deployed due to the worker being killed due to heavy disk usage. While the more parallel version does log much more, it doesn't totally explain the disk behaviour. Nonetheless, reverting it is sensible to give us the time we need to investigate more.	2022-01-20 12:22:33 +00:00
Rebecca Law	6cd7a23d3c	If there is an invalid letter that has not been updated to `validation-failed` because the `update-validation-failed-for-templated-letter` has not been picked up off the letter-tasks queue and the `collate-letter-pdfs-to-be-sent` has started. 1. The number of letters that we send to DVLA will not be correct (see `20ead82463/app/celery/letters_pdf_tasks.py (L136)`) This may raise an alert with DVLA when they find we have sent them fewer letters than we have reported. 2. When we get the PDF from S3 we will get a file not found `20ead82463/app/celery/letters_pdf_tasks.py (L244)` The error will not prevent the collate task from completing but we will see an alert email for the exception and raise questions. Although this situation is very unlikely because we have a 15 minute window between the last letter deadline date and the time we kick off the collate task we should still mitigate these issues. I updated the queries to only return letters with billable_units > 0, all valid letters should have at least 1 billable unit.	2022-01-19 08:31:19 +00:00
Rebecca Law	841a4fc22f	Mark letters as validation-failed if the templated letter is too long. It is possible that the personalisation for a templated letter can make the letter exceed 10 pages or 5 sheets. We are not validating the letters posted via the API for this validation error. It is only possible to validate the letter once we create the PDF in notifications-template-preview. This means that the letter can only get a validation-failed status after the client has received a 201 from the POST to /v2/notifications. NOTE: we only validate the preview row of a CSV for this validation error, this change will mean that it is possible for a letter to be marked as validation-failed after a successful file upload. A new task to update the notification to `validation-failed` has been added to the API. If we find that the letter is too long once we have created the PDF we call the `update-validation-failed-for-templated-letter` task rather than `update-billable-units-for-letter` task. New work flow for a letter in brief: API - receives POST /v2/notifications :: save to db :: put CREATE_LETTERS_PDF task on queue for template preview to consume TEMPLATE-PREVIEW - consumes task CREATE_LETTERS_PDF :: create PDF :: count pages of PDF :: IF page count exceeds 10 pages put in the letters-invalid-pdf S3 bucket with metadata (similar to the precompiled letters) put `update-validation-failed-for-templated-letter` task on the queue for the API to consume ELSE put PDF in the `letters-pdf` bucket put `update-billable-units-for-letter` task on the queue API - consumes `update-billable-units-for-letter` OR `update-validation-failed-for-templated-letter` task :: IF `update-billable-units-for-letter` task: update billable units for notification as usual :: ELSE `update-validation-failed-for-templated-letter`: update notification_status = `validation-failed` ADMIN - view notification page for letter :: show validation letter for templated letter There will be 3 PRs in order to make this change, one for the API, template-preview and the admin app. Deployment plan Deploy Admin first Deploy API Deploy template-preview Related PRs: alphagov/notifications-template-preview#619 alphagov/notifications-admin#4107 https://www.pivotaltracker.com/story/show/169209742	2022-01-19 08:29:48 +00:00
Ben Thorner	9686595fa8	Minor tweaks to address comments on the PR To address: - https://github.com/alphagov/notifications-api/pull/3425#discussion_r786867994 - https://github.com/alphagov/notifications-api/pull/3425#discussion_r786853329 - https://github.com/alphagov/notifications-api/pull/3425#discussion_r786848793 - https://github.com/alphagov/notifications-api/pull/3425#discussion_r786214794	2022-01-18 16:56:53 +00:00
Ben Thorner	9182ebf4e5	Parallelise status aggregation by service and day This follows a similar approach as [1]. Recently we've seen lots of errors from this task, which we think are a consequence of it doing too much work and tripping Celery's visibility timeout. While we can optimise the query [2], it's likely the errors will return as the number of live services grows. Parallelising the aggregation now will make it more futureproof. [1]: https://github.com/alphagov/notifications-api/pull/3397 [2]: https://github.com/alphagov/notifications-api/pull/3417	2022-01-12 15:47:59 +00:00
Ben Thorner	d772ae6b46	Standardise logs for status aggregation tasks This will make it easier to parallelise by service later on.	2022-01-12 15:47:57 +00:00
Ben Thorner	4feed950c4	DRY-up loops to kick off status aggregation tasks This will make it easier to parallelise by service in the following commits, since we only have one loop to change.	2022-01-12 15:47:56 +00:00
Ben Thorner	ddbf556486	Rewrite task to aggregate status by service This is a step towards parallelising the task by service and day.	2022-01-12 15:47:53 +00:00
Ben Thorner	a7b39a930c	Add comment to explain status aggregation approach This relates to the performance optimisation work we're doing [1]. Before optimising the task, it's worth asking if we can do less - the comment explains why it has to be this way. Some references to back up the comment: - We do status updates in either table [2]. - We don't allow duplicate receipts for emails [3]. - We don't allow duplicate receipts for SMS [4]. - We don't expect duplicate receipts for letters. This is something we would need to revisit if we want to support additional status updates - we could reject based on the age of the notification, rather than the status. [1]: https://github.com/alphagov/notifications-api/pull/3417 [2]: `20ead82463/app/dao/notifications_dao.py (L538)` [3]: `20ead82463/app/celery/process_ses_receipts_tasks.py (L58)` [4]: `20ead82463/app/dao/notifications_dao.py (L129-L135)`	2022-01-10 18:15:54 +00:00
Leo Hemsted	228d72dc8f	update log messages in delete task. less prose, clearer output. (hopefully)	2021-12-14 15:24:35 +00:00
Leo Hemsted	49cc1b643f	split delete task up into per service we really don't gain anything by running each service delete in sequence - we get the services, and then just loop through them deleting per service. By deleting per service in separate tasks, we can take advantage of parallelism. the only thing we lose is some log lines but I don't think we're that interested in them. only set query limit at the move_notifications dao function - the task doesn't really care about the technical implementation of how it deletes the notifications	2021-12-14 15:24:34 +00:00
Ben Thorner	c8cf057eba	Record providers we time out notifications for This will help us monitor issues with delivery receipts and keep track of provider performance over time. I'm not concerned about performance here: - The number of notifications to time out is usually small. - This task only runs once a day. - Calls to StatsD are quick and cheap.	2021-12-14 13:04:39 +00:00
Ben Thorner	87cd40d00a	Scale timeout task to work on arbitrary volumes Previously this was limited to 500K notifications. While we don't expect to reach this limit, it's not impossible e.g. if we had a repeat of the incident where one of our providers stopped sending us status updates. Although that's not great, it's worse if our code can't cope with the unexpectedly high volume. This reuses the technique we have elsewhere [1] to keep processing in batches until there's nothing left. Specifying a cutoff point means the total amount of work to do can't keep growing. [1]: `2fb432adaf/app/dao/notifications_dao.py (L441)`	2021-12-13 17:14:28 +00:00
Ben Thorner	76aeab24ce	Rewrite DAO timeout method to take cutoff_time Previously we specified the period and calculated the cutoff time in the function. Passing it in means we can run the method multiple times and avoid getting "new" notifications to time out in the time it takes to process each batch.	2021-12-13 16:56:21 +00:00
Ben Thorner	ab4cb029df	Remove alert for email / sms in created In response to [1]. [1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988 It turns out the code that inspired this new alert - in the old "timeout-sending-notifications" task - was actually redundant as we already have a task to "replay" notifications still in "created", which is much better than just alerting about them. It's possible the replayed notifications will also fail, but in both cases we should see some kind of error due to this, so I don't think we're losing anything by not having an alert.	2021-12-06 14:11:42 +00:00
Ben Thorner	04da017558	DRY-up conditionally creating callback tasks This removes 3 duplicate instances of the same code, which is still tested implicitly via test_process_ses_receipt_tasks [1]. In the next commit we'll make this test more explicit, to reflect that it's now being reused elsewhere and shouldn't change arbitrarily. We do lose the "print" statement from the command instance of the code, but I think that's a very tolerable loss. [1]: `16ec8ccb8a/tests/app/celery/test_process_ses_receipts_tasks.py (L94)`	2021-12-06 14:11:34 +00:00
Ben Thorner	0318229216	Stop 'timing out' old 'created' notifications This is being replaced with a new alert and runbook [1]. It's not always appropriate to change the status to 'technical-failure', and the new alert means we'll act to fix the underlying issue promptly. We'll look at tidying up the remaining code in the next commits. [1]: https://github.com/alphagov/notifications-manuals/wiki/Support-Runbook#deal-with-email-or-sms-still-in-created	2021-12-06 14:00:36 +00:00
Ben Thorner	f96ba5361a	Add new task to alert about created email / sms This will log an error when email or SMS notifications have been stuck in 'created' for too long - normally they should be 'sending' in seconds, noting that we have a goal of < 10s wait time for most notifications being processed our platform. In the next commits we'll decouple similar functionality from the existing 'timeout-sending-notifications' task.	2021-12-06 14:00:31 +00:00
Leo Hemsted	f6d210f1e6	put delete tasks on the reporting worker they share a lot with the reporting tasks (creating ft_billing and ft_notification_status), in that they're run nightly, take a long time, and we see error messages if they get run multiple times (due to visibility timeout). The periodic app has two concurrent processes - previously there was just one delete task, which would use one of those processes, while the other process would pick up anything else on the queue (at that time of night, the regular provider switch checks and scheduled job checks). However, when we switched to running the three delete notification types separately, we saw visibility timeout issues - three tasks would be created, all three would be picked up by one celery instance, the two worker processes would start on two of them, and the third would sit on the box, wait longer than the visibility timeout to be picked up (and acknowledged), and so SQS would assume the task was lost and replay it. it's queues all the way down! By putting them on the reporting worker we can take advantage of tuning that app (for example setting the prefetch multiplier to one) which is designed to run large tasks. We've also got more concurrent workers on this box, so we can run all three tasks at once.	2021-12-03 13:28:16 +00:00
Leo Hemsted	6bbec9f103	make delete notification tasks parallel by notification type we used to do this until apr 2020. Let's try doing it again. Back then, we had problems with timing. We did two things in spring 2020: We moved to using an intermediary temp table [1] We stopped the tasks being parallelised [2] However, it turned out the real time saving was from changing what services we delete for [3]. The task was actually CPU-bound rather than DB-bound, so that's probably why having the tasks in parallel wasn't helping, since they were all competing for the same CPU. It's worth trying the parallel steps again now that we're no longer CPU bound. Note: Temporary tables are in their own postgres schema, and are only viewable by the current session (session == connection. Each celery worker process has its own db connection). We don't need to worry about separate workers both trying to use the same table at once. I've also added a "DROP ON COMMIT" directive to the table definition just to ensure it doesn't persist past the task even if there's an exception. (This also drops on rollback). Cronitor looks at the three functions separately so we don't need to worry about the main task taking milliseconds where it used to take hours as it isn't monitored itself. I've also removed some unnecessary redundant exception logs. [1] https://github.com/alphagov/notifications-api/pull/2767 [2] https://github.com/alphagov/notifications-api/pull/2798 [3] https://github.com/alphagov/notifications-api/pull/3381	2021-12-01 14:28:08 +00:00
Ben Thorner	bffca39223	Merge pull request #3373 from alphagov/centralise-celery-180213914 Use central NotifyCelery base class in utils	2021-11-17 12:09:28 +00:00
Ben Thorner	666ac1ab4f	Log activity on all periodic Celery tasks As stated in the comment, this would have been helpful during an incident to give further reassurance that a task had at least started running - at the time the only evidence for this was the Cronitor dashboard itself, which we don't often look at. I've removed other, equivalent "starting" logs, but kept those that provide additional information in the log message.	2021-11-17 09:48:03 +00:00
Ben Thorner	528223ed61	Use central NotifyCelery base class in utils Note that the new base class doesn't include a bespoke feature we had here: 'log_on_worker_shutdown'. We've agreed it's reasonable to remove it for now as it was introduced many years ago and its use case is unclear - we can always add it back if needed.	2021-11-16 13:58:12 +00:00
Chris Hill-Scott	0236318189	Republish gov.uk/alerts every night to clear down planned tests We have made it so that gov.uk/alerts shows a ‘1 planned test’ banner for the whole of the day when there has been an operator test on that day. We need to remove the banner when the day is over. The most straightforward way to do this is to republish the site at the start of every day. The gov.uk/alerts code[1] will work out if there are or aren’t any planned tests to show that day. 1. `5a274af6d0/app/models/alerts.py (L38-L44)`	2021-11-15 14:23:32 +00:00
Ben Thorner	48e1482d90	Merge pull request #3366 from alphagov/celery-extend-request-id-180213914 Extend request tracing to cover Celery logs	2021-11-12 11:10:38 +00:00
Ben Thorner	d66c68d6d6	Merge pull request #3364 from alphagov/celery-headers-request-id-180213914 Move Celery task Request ID injection into headers	2021-11-12 11:10:29 +00:00
Ben Thorner	ac06529128	Enable request tracing on Celery success/fail logs Previously these logs wouldn't have a Request ID attached since the Celery hooks run after the __call__ method where we enable request tracing for normal application logs. For the failure log especially it will be useful to have this feature.	2021-11-10 18:04:20 +00:00
Ben Thorner	369a9f7521	Refactor queue_name and request_id into properties This reduces the complexity of the original functions, which will go up a bit in the next commit.	2021-11-10 18:04:19 +00:00
Ben Thorner	89a8dd1a03	Move Celery task Request ID injection into headers Previously we passed along this piece of state via the kwargs for a task, but this runs the risk of the task accidentally receiving the extra kwarg unless we've covered all the code paths that could invoke it directly e.g. retries don't invoke __call__. This switches to using Celery "headers" to pass the extra state. It turns out that Celery has two "header" concepts, which leads to some confusion and even a bug with the framework [1]: - In older (pre v4.4) versions of Celery, the "headers" specified by apply_async() would become _the_ headers in the message that gets passed around workers, etc. These would be available later on via "self.request.headers". - Since Celery protocol v2, the meaning of "headers" in the message changed to become (basically) _all_ metadata about the task [2], with the "headers" option in apply_async() being merged [3] into the big dict of metadata. This makes using headers a bit confusing unfortunately, since the data structure we put in is subtly different to what comes out in the request context. Nonetheless, it still works. I've added some comments to try and clarify it. Note that one of the original tests is no longer necessary, since we don't need to worry about argument passing styles with headers. [1]: https://github.com/celery/celery/issues/4875 [2]: `663e4d3a0b (diff-07a65448b2db3252a9711766beec23372715cd7597c3e309bf53859eabc0107fR343)` [3]: `681a922220/celery/app/amqp.py (L495)`	2021-11-10 18:03:40 +00:00
Katie Smith	3d4796c924	Add task to resanitise and replace a PDF for precompiled letter This adds a task which is designed to be used if we want to recreate the PDF for a precompiled letter (either one that has been created using the API or one that has been uploaded through the website). The task takes the `notification_id` of the letter and passes template preview the details it needs in order to sanitise the original file and then replace the version in the letters-pdf bucket with the freshly sanitised version.	2021-11-10 09:51:31 +00:00
Ben Thorner	cdb43fbaf6	Only loop timeout task if there's more work Previously this would repeat the task even if the current iteration of the loop had processed a non-full batch. This could cause the task to error incorrectly if one or two notifications breach the timeout threshold in between iterations.	2021-11-09 15:41:14 +00:00
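
The batching behaviour described in the timeout commits above (`cdb43fbaf6`, and later `76aeab24ce` and `87cd40d00a`) can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function names, the in-memory list standing in for the DB, the `timed-out` status value, `BATCH_SIZE` and `max_batches` are all assumptions made for the example. It shows a loop that only repeats while the previous batch was full, with a fixed cutoff time so the amount of work cannot grow between iterations.

```python
from datetime import datetime, timedelta

BATCH_SIZE = 3  # hypothetical; kept small so the example is easy to follow


def time_out_batch(notifications, cutoff_time, limit):
    """Mark up to `limit` stale 'sending' rows as timed out and return them.

    `notifications` is a plain list of dicts standing in for a DB table.
    """
    stale = [
        n for n in notifications
        if n["status"] == "sending" and n["created_at"] < cutoff_time
    ][:limit]
    for n in stale:
        n["status"] = "timed-out"  # illustrative status value only
    return stale


def time_out_all(notifications, cutoff_time, batch_size=BATCH_SIZE, max_batches=100):
    """Process batches until one comes back non-full, then stop."""
    total = 0
    for _ in range(max_batches):
        batch = time_out_batch(notifications, cutoff_time, batch_size)
        total += len(batch)
        if len(batch) < batch_size:
            # A non-full batch means nothing older than the cutoff is left.
            return total
    # Only reached if every batch was full, i.e. there may still be work left.
    raise RuntimeError("gave up: still finding full batches")
```

Because the cutoff is computed once and passed in, notifications that cross the threshold while the loop is running are left for the next run rather than triggering another iteration.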


1003 Commits