notifications-api

mirror of https://github.com/GSA/notifications-api.git synced 2025-12-24 01:11:38 -05:00

Author	SHA1	Message	Date
Ryan Ahearn	e3ad01119d	Replace celery[sqs] with celery[redis]	2022-09-29 08:59:17 -04:00
jimmoffet	c636eac964	replace m2crypto with oscrypto	2022-09-23 15:57:06 -07:00
jimmoffet	ea3eefa81c	test branch for notify-api-alt temporary deploy	2022-09-23 11:56:39 -07:00
jimmoffet	4c86024f21	clean up comments	2022-09-20 20:22:12 -07:00
jimmoffet	a03de0dd56	remove outdated validatesns library and replace with maintainable code	2022-09-20 20:11:09 -07:00
jimmoffet	f1aec54665	clean up comments and method dupes	2022-09-15 15:48:37 -07:00
jimmoffet	b0f819dbd9	canada UK ses callbacks monster mash	2022-09-15 14:59:13 -07:00
Ryan Ahearn	806e2ad2dc	Review and update uses of PRNG	2022-08-19 15:26:12 +00:00
Ryan Ahearn	e77cedb039	Clean up xml finding from static-scan	2022-08-18 17:52:44 +00:00
Christa Hartsock	af6495cd4c	Get tests passing locally When we cloned the repository and started making modifications, we didn't initially keep tests in step. This commit tries to get us to a clean test run by skipping tests that are failing and removing some that we no longer expect to use (MMG, Firetext), with the intention that we will come back in future and update or remove them as appropriate. To find all tests skipped, search for `@pytest.mark.skip(reason="Needs updating for TTS:`. There will be a brief description of the work that needs to be done to get them passing, if known. Delete that line to make them run in a standard test run (`make test`).	2022-07-07 15:41:15 -07:00
Jim Moffet	aa4ec532a4	implement SNS	2022-06-17 11:16:23 -07:00
Jim Moffet	59b72f4853	add devcontainer configs and docker network orchestration	2022-06-13 13:16:32 -07:00
Ben Thorner	458e997706	Recalculate billing rows for 10 days (prev. 4) This effectively reverts [^1], which was only a temporary change. I suspect the performance problem will go away with [^2]. While we've clearly been managing without this change, it resulted in several rows being left as incorrect when letter receipts were delayed. It makes sense for us to run this task for the same period as we do to aggregate statuses, as status affects billing. [^1]: `e5c76ffda7` [^2]: https://github.com/alphagov/notifications-api/pull/3542	2022-05-17 17:38:08 +01:00
Ben Thorner	c27107fa74	Remove support for Reach provider This provider was never active and support was never completed, so there's little value in keeping all this potentially confusing code.	2022-04-29 12:28:08 +01:00
Ben Thorner	779b8e941f	Rewrite broadcast Zendesk alert at approval time The new alert happens earlier but is otherwise the same: - We only create a ticket in Production. - We only create a ticket on approval. I took this opportunity to refactor the alert as a private function and test this specifically in detail to avoid lots of repetitive mocks, which are required when calling the main "update" function. One test I haven't preserved was for when the "names" array is empty, as this was added for a legacy data integrity scenario [^1]. [^1]: `bf0bf4e31c`	2022-04-05 12:57:08 +01:00
Ben Thorner	3988a6cd07	Include exception info in SMS warning log This makes it easier to debug failures when adding a new provider.	2022-03-30 13:36:56 +01:00
Ben Thorner	b439fd0718	Add boilerplate for Reach SMS callbacks This is enough to update a notification in DB: 1. First create a notification in the UI and send it. 2. Then reset its attributes to pretend it's for Reach. update notifications set sent_at = null, sent_by = null, notification_status='sending' where id='some-uuid'; 3. Change "notification_id" to "<some-uuid>" in the code. 4. Call the boilerplate endpoint for Reach callbacks. curl -X POST localhost:6011/notifications/sms/reach Interestingly there's no foreign key constraint on "sent_by" in the DB, so this just works: the notification is updated.	2022-03-24 16:56:33 +00:00
Leo Hemsted	2fbe9e85ac	Merge pull request #3479 from alphagov/auto-retry-stuck-av-letters automatically retry letters stuck in pending-virus-scan	2022-03-15 11:43:42 +00:00
Leo Hemsted	9e8df8b623	remove "letters stuck pending av" runbook there's not anything we know we need to do now that we resolve stuck letters automatically. Letters could still get into this state, so it's worth alerting us. However, we don't have anything concrete that we know how to fix these letters, so we should just remove the runbook entirely.	2022-03-10 14:10:01 +00:00
David McDonald	0d952b4d8c	Reduce timeout for service callback attempt to 5 seconds It is currently 60 seconds but we have had two incidents in the past week where there is a connection error talking to a service and the request takes up to 60 seconds before failing. When this happens, if there are a few of these callbacks then all of them will completely hog the service callback worker and build up a big queue of all the other service callbacks. 5 seconds has been chosen as that is still a pretty decent length time for a simple web request that should just be giving them a little bit of information for them to store. 5 seconds should be a sufficient enough reduction that we dramatically reduce this problem for the moment. Open to this number being changed in the future based on how we see it perform.	2022-03-08 13:05:32 +00:00
Leo Hemsted	00259893f1	automatically retry letters stuck in pending-virus-scan Since sept 2019 we've had to log on to production around once every twenty days to restart the virus scan task for a letter. Most of the time this is just a case of making sure the file is in the scan bucket, and then triggering the task. If the file isn't in the scan bucket we'd need to do some more manual investigation to find out exactly where the file got stuck, but I can only remember times when it's been in the scan bucket. So if the file is in the scan bucket, we can just check that with code and kick the task off automatically.	2022-03-07 18:31:46 +00:00
Katie Smith	514bd48614	Update flake8-bugbear from 20.11.1 to 22.1.11 And ignore a warning, since I did not think that in this case "Using .strip() with multi-character strings is misleading the reader".	2022-03-02 16:51:09 +00:00
Ben Thorner	a69d1635a1	Update FactStatus table in bulk for each service Previously we were looping over data from the Notifications/History table and then shovelling it into the status table, one row at a time - plus an extra delete to clean up any existing data. This replaces that with a batch insertion, similar to how we archive notifications [1], but using a simple subquery (via "from_select" [2]) instead of a temporary table. To make the select compatible with the insert, I've used "literal" to inject the constant pieces of data, so each row has everything it needs to go into the status table. [1]: `9ce6d2fe92/app/dao/notifications_dao.py (L295)` [2]: https://docs.sqlalchemy.org/en/14/core/dml.html#sqlalchemy.sql.expression.Insert.from_select	2022-02-16 13:40:05 +00:00
Ben Thorner	ef231d5de7	Fix task name and action in status task logs	2022-02-16 11:45:45 +00:00
Ben Thorner	7f4b140f97	Rename function to make it consistent This is consistent with the new "on_date" function. It was going off the edge of my screen before in some parts of the code.	2022-02-09 17:39:08 +00:00
Ben Thorner	1213463b8e	Only aggregate status when necessary for a service This takes a similar approach to the nightly deletion task so that we only create sub-tasks when there are actually notifications to aggregate for a given type and day [1]. We're making this change to stop the duplication errors we're getting at the moment and ensure the task can scale to more messages and more services. There are two parts to this: - Each subtask should now run within the 5 minute visibility timeout. However, they may still be duplicated if the parent task overruns [2]. - The parent task creates a minimal number of subtasks, and the query to determine this is very fast for a normal process day (milliseconds). Since all tasks will run quickly, there should be no more duplication. In order to test this more nuanced task, I rewrote the tests: - One test checks the subtask is called correctly. - One test checks we create all the right subtasks. [1]: https://github.com/alphagov/notifications-api/pull/3381 [2]: https://docs.google.com/document/d/1MaP6Nyy3nJKkuh_4lP1wuDm19X8LZITOLRd9n3Ax-xg/edit#heading=h.q3intzwqhfzl	2022-02-09 17:39:07 +00:00
Ben Thorner	c8db58d0e8	Reorder loops for creation status agg sub tasks This will help tailor the innermost loop on services.	2022-02-09 17:39:06 +00:00
Ben Thorner	d6678b6a70	Remove unnecessary logs from status aggregation These can be inferred elsewhere: - Task creation is obvious from task execution. If we're concerned about a specific service, we can check the updated times on the DB records, since all records are recreated each time this runs. - Task starting is already logged. - Task completion is already logged. The number of rows updated can also be inferred from the DB. The log I've found useful is the one about fetching the data, and I've also added another to time how long it takes to insert the data, as both could be sources of poor performance. Arguably we should use metrics for this sort of thing, but logs are easier in practice for the metric systems we have.	2022-02-09 17:39:05 +00:00
Ben Thorner	018a253b6f	Revert "Revert running status aggregation in parallel" This reverts commit `0f6dea0deb`.	2022-02-09 17:39:00 +00:00
Chris Hill-Scott	7f72d3a60f	Bump utils to 53.0.0 Changes: 53.0.0 --- * `notifications_utils.columns.Columns` has moved to `notifications_utils.insensitive_dict.InsensitiveDict` * `notifications_utils.columns.Rows` has moved to `notifications_utils.recipients.Rows` * `notifications_utils.columns.Cell` has moved to `notifications_utils.recipients.Cell` 52.0.0 --- * Deprecate the following unused `redis_client` functions: - `redis_client.increment_hash_value` - `redis_client.decrement_hash_value` - `redis_client.get_all_from_hash` - `redis_client.set_hash_and_expire` - `redis_client.expire` 51.3.1 --- * Bump govuk-bank-holidays to cache holidays for next year.	2022-02-08 09:45:10 +00:00
Rebecca Law	09c8fbe982	Merge pull request #3418 from alphagov/letters-too-long Mark letters as validation-failed if the templated letter is too long.	2022-02-02 08:30:50 +00:00
Rebecca Law	c01c81326c	Update log message to something a little easier to read and query for.	2022-01-24 12:25:53 +00:00
Leo Hemsted	246016a894	don't log if we don't delete anything for a service we try and delete for lots of services. this includes services that don't actually have anything to delete that day. that might be because they had a custom data retention so we always go to check them, or because they only sent test notifications (which we'll delete but not include in the count in the log line). we don't really need to see log lines saying that we didn't delete anything for that service - that's just a long list of boring log messages that will hide the actual interesting stuff - which services we did delete content for.	2022-01-21 11:04:37 +00:00
Ben Thorner	0f6dea0deb	Revert running status aggregation in parallel The top-level task didn't run successfully after this was deployed due to the worker being killed due to heavy disk usage. While the more parallel version does log much more, it doesn't totally explain the disk behaviour. Nonetheless, reverting it is sensible to give us the time we need to investigate more.	2022-01-20 12:22:33 +00:00
Rebecca Law	6cd7a23d3c	If there is an invalid letter that has not been updated to `validation-failed` because the `update-validation-failed-for-templated-letter` has not been picked up off the letter-tasks queue and the `collate-letter-pdfs-to-be-sent` has started. 1. The number of letters that we send to DVLA will not be correct (see `20ead82463/app/celery/letters_pdf_tasks.py (L136)`) This may raise an alert with DVLA when they find we have sent them fewer letters than we have reported. 2. When we get the PDF from S3 we will get a file not found `20ead82463/app/celery/letters_pdf_tasks.py (L244)` The error will not prevent the collate task from completing but we will see an alert email for the exception and raise questions. Although this situation is very unlikely because we have a 15 minute window between the last letter deadline date and the time we kick off the collate task we should still mitigate these issues. I updated the queries to only return letters with billable_units > 0, all valid letters should have at least 1 billable unit.	2022-01-19 08:31:19 +00:00
Rebecca Law	841a4fc22f	Mark letters as validation-failed if the templated letter is too long. It is possible that the personalisation for a templated letter can make the letter exceed 10 pages or 5 sheets. We are not validating the letters posted via the API for this validation error. It is only possible to validate the letter once we create the PDF in notifications-template-preview. This means that the letter can only get a validation-failed status after the client has received a 201 from the POST to /v2/notifications. NOTE: we only validate the preview row of a CSV for this validation error, this change will mean that it is possible for a letter to be marked as validation-failed after a successful file upload. A new task to update the notification to `validation-failed` has been added to the API. If we find that the letter is too long once we have created the PDF we call the `update-validation-failed-for-templated-letter` task rather than `update-billable-units-for-letter` task. New work flow for a letter in brief: API - receives POST /v2/notifications :: save to db :: put CREATE_LETTERS_PDF task on queue for template preview to consume TEMPLATE-PREVIEW - consumes task CREATE_LETTERS_PDF :: create PDF :: count pages of PDF :: IF page count exceeds 10 pages put in the letters-invalid-pdf S3 bucket with metadata (similar to the precompiled letters) put `update-validation-failed-for-templated-letter` task on the queue for the API to consume ELSE put PDF in the `letters-pdf` bucket put `update-billable-units-for-letter` task on the queue API - consumes `update-billable-units-for-letter` OR `update-validation-failed-for-templated-letter` task :: IF `update-billable-units-for-letter` task: update billable units for notification as usual :: ELSE `update-validation-failed-for-templated-letter`: update notification_status = `validation-failed` ADMIN - view notification page for letter :: show validation letter for templated letter There will be 3 PRs in order to make this change, one for the API, template-preview and the admin app. Deployment plan Deploy Admin first Deploy API Deploy template-preview Related PRs: alphagov/notifications-template-preview#619 alphagov/notifications-admin#4107 https://www.pivotaltracker.com/story/show/169209742	2022-01-19 08:29:48 +00:00
Ben Thorner	9686595fa8	Minor tweaks to address comments on the PR To address: - https://github.com/alphagov/notifications-api/pull/3425#discussion_r786867994 - https://github.com/alphagov/notifications-api/pull/3425#discussion_r786853329 - https://github.com/alphagov/notifications-api/pull/3425#discussion_r786848793 - https://github.com/alphagov/notifications-api/pull/3425#discussion_r786214794	2022-01-18 16:56:53 +00:00
Ben Thorner	9182ebf4e5	Parallelise status aggregation by service and day This follows a similar approach as [1]. Recently we've seen lots of errors from this task, which we think are a consequence of it doing too much work and tripping Celery's visibility timeout. While we can optimise the query [2], it's likely the errors will return as the number of live services grows. Parallelising the aggregation now will make it more futureproof. [1]: https://github.com/alphagov/notifications-api/pull/3397 [2]: https://github.com/alphagov/notifications-api/pull/3417	2022-01-12 15:47:59 +00:00
Ben Thorner	d772ae6b46	Standardise logs for status aggregation tasks This will make it easier to parallelise by service later on.	2022-01-12 15:47:57 +00:00
Ben Thorner	4feed950c4	DRY-up loops to kick off status aggregation tasks This will make it easier to parallelise by service in the following commits, since we only have one loop to change.	2022-01-12 15:47:56 +00:00
Ben Thorner	ddbf556486	Rewrite task to aggregate status by service This is a step towards parallelising the task by service and day.	2022-01-12 15:47:53 +00:00
Ben Thorner	a7b39a930c	Add comment to explain status aggregation approach This relates to the performance optimisation work we're doing [1]. Before optimising the task, it's worth asking if we can do less - the comment explains why it has to be this way. Some references to back up the comment: - We do status updates in either table [2]. - We don't allow duplicate receipts for emails [3]. - We don't allow duplicate receipts for SMS [4]. - We don't expect duplicate receipts for letters. This is something we would need to revisit if we want to support additional status updates - we could reject based on the age of the notification, rather than the status. [1]: https://github.com/alphagov/notifications-api/pull/3417 [2]: `20ead82463/app/dao/notifications_dao.py (L538)` [3]: `20ead82463/app/celery/process_ses_receipts_tasks.py (L58)` [4]: `20ead82463/app/dao/notifications_dao.py (L129-L135)`	2022-01-10 18:15:54 +00:00
Leo Hemsted	228d72dc8f	update log messages in delete task. less prose, clearer output. (hopefully)	2021-12-14 15:24:35 +00:00
Leo Hemsted	49cc1b643f	split delete task up into per service we really don't gain anything by running each service delete in sequence - we get the services, and then just loop through them deleting per service. By deleting per service in separate tasks, we can take advantage of parallelism. the only thing we lose is some log lines but I don't think we're that interested in them. only set query limit at the move_notifications dao function - the task doesn't really care about the technical implementation of how it deletes the notifications	2021-12-14 15:24:34 +00:00
Ben Thorner	c8cf057eba	Record providers we time out notifications for This will help us monitor issues with delivery receipts and keep track of provider performance over time. I'm not concerned about performance here: - The number of notifications to time out is usually small. - This task only runs once a day. - Calls to StatsD are quick and cheap.	2021-12-14 13:04:39 +00:00
Ben Thorner	87cd40d00a	Scale timeout task to work on arbitrary volumes Previously this was limited to 500K notifications. While we don't expect to reach this limit, it's not impossible e.g. if we had a repeat of the incident where one of our providers stopped sending us status updates. Although that's not great, it's worse if our code can't cope with the unexpectedly high volume. This reuses the technique we have elsewhere [1] to keep processing in batches until there's nothing left. Specifying a cutoff point means the total amount of work to do can't keep growing. [1]: `2fb432adaf/app/dao/notifications_dao.py (L441)`	2021-12-13 17:14:28 +00:00
Ben Thorner	76aeab24ce	Rewrite DAO timeout method to take cutoff_time Previously we specified the period and calculated the cutoff time in the function. Passing it in means we can run the method multiple times and avoid getting "new" notifications to time out in the time it takes to process each batch.	2021-12-13 16:56:21 +00:00
Ben Thorner	ab4cb029df	Remove alert for email / sms in created In response to [1]. [1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988 It turns out the code that inspired this new alert - in the old "timeout-sending-notifications" task - was actually redundant as we already have a task to "replay" notifications still in "created", which is much better than just alerting about them. It's possible the replayed notifications will also fail, but in both cases we should see some kind of error due to this, so I don't think we're losing anything by not having an alert.	2021-12-06 14:11:42 +00:00
Ben Thorner	04da017558	DRY-up conditionally creating callback tasks This removes 3 duplicate instances of the same code, which is still tested implicitly via test_process_ses_receipt_tasks [1]. In the next commit we'll make this test more explicit, to reflect that it's now being reused elsewhere and shouldn't change arbitrarily. We do lose the "print" statement from the command instance of the code, but I think that's a very tolerable loss. [1]: `16ec8ccb8a/tests/app/celery/test_process_ses_receipts_tasks.py (L94)`	2021-12-06 14:11:34 +00:00
Ben Thorner	0318229216	Stop 'timing out' old 'created' notifications This is being replaced with a new alert and runbook [1]. It's not always appropriate to change the status to 'technical-failure', and the new alert means we'll act to fix the underlying issue promptly. We'll look at tidying up the remaining code in the next commits. [1]: https://github.com/alphagov/notifications-manuals/wiki/Support-Runbook#deal-with-email-or-sms-still-in-created	2021-12-06 14:00:36 +00:00
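
Commits `76aeab24ce` and `87cd40d00a` above describe passing a fixed cutoff time into the timeout method and then processing in batches until nothing is left. The following is only a rough, in-memory sketch of that idea, not the repository's code: the function name `timeout_in_batches`, the dict-based rows, the `batch_size` value and the `temporary-failure` status are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def timeout_in_batches(notifications, cutoff_time, batch_size=2):
    """Time out 'sending' rows created before cutoff_time, one batch at a time.

    Hypothetical stand-in for a DAO call: `notifications` is a plain list
    of dicts rather than a database table.
    """
    total = 0
    while True:
        # Select at most one batch of rows that are still eligible.
        batch = [
            n for n in notifications
            if n["status"] == "sending" and n["created_at"] < cutoff_time
        ][:batch_size]
        if not batch:
            # Nothing left before the cutoff, so the loop terminates.
            return total
        for n in batch:
            n["status"] = "temporary-failure"  # assumed status name
        total += len(batch)


# The cutoff is computed once by the caller, so rows created while the
# batches are running can never become eligible part-way through.
now = datetime(2021, 12, 13, 17, 0)
cutoff = now - timedelta(days=3)
rows = [{"status": "sending", "created_at": now - timedelta(days=4)} for _ in range(5)]
rows.append({"status": "sending", "created_at": now})
timed_out = timeout_in_batches(rows, cutoff)
```

Because the cutoff is fixed before the first batch, the total amount of work is bounded, which is the property the commit messages call out.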


1117 Commits