notifications-api

mirror of https://github.com/GSA/notifications-api.git synced 2025-12-20 23:41:17 -05:00

Author	SHA1	Message	Date
Kenneth Kehl	359ac9d967	merge from main	2023-05-10 09:58:03 -07:00
Kenneth Kehl	08c1ad75c8	notify-260 remove server-side timezone handling	2023-05-10 08:39:50 -07:00
Kenneth Kehl	3fb113a83e	notify-152 sms delivery receipts	2023-05-04 07:56:24 -07:00
Steven Reilly	ff4190a8eb	Remove letters-related code (#175) This deletes a big ol' chunk of code related to letters. It's not everything (there are still a few things that might be tied to sms/email), but it's the heart of letters function. SMS and email function should be untouched by this. Areas affected: - Things obviously about letters - PDF tasks, used for precompiling letters - Virus scanning, used for those PDFs - FTP, used to send letters to the printer - Postage stuff	2023-03-02 20:20:31 -05:00
Ryan Ahearn	041cd08097	Clean up more mmg and firetext references	2022-12-22 09:31:12 -05:00
Ryan Ahearn	45c3e3c277	Remove unused `is_delivery_slow_for_providers` method	2022-11-30 13:50:49 -05:00
stvnrlly	9e7ee1c0f8	migrate bst_date to local_date	2022-11-21 11:49:59 -05:00
stvnrlly	213f699c99	time adjustments in tests	2022-11-14 14:23:54 -05:00
stvnrlly	b50cb4712f	tz utility swap and many test updates	2022-11-10 12:33:25 -05:00
stvnrlly	e9fdfd59f4	clean flake8 except provider code	2022-10-19 16:16:26 +00:00
jimmoffet	f1aec54665	clean up comments and method dupes	2022-09-15 15:48:37 -07:00
jimmoffet	b0f819dbd9	canada UK ses callbacks monster mash	2022-09-15 14:59:13 -07:00
Christa Hartsock	64b30feb08	Remove pytest from non-test file	2022-07-07 16:22:21 -07:00
Christa Hartsock	af6495cd4c	Get tests passing locally When we cloned the repository and started making modifications, we didn't initially keep tests in step. This commit tries to get us to a clean test run by skipping tests that are failing and removing some that we no longer expect to use (MMG, Firetext), with the intention that we will come back in future and update or remove them as appropriate. To find all tests skipped, search for `@pytest.mark.skip(reason="Needs updating for TTS:`. There will be a brief description of the work that needs to be done to get them passing, if known. Delete that line to make them run in a standard test run (`make test`).	2022-07-07 15:41:15 -07:00
Jim Moffet	aa4ec532a4	implement SNS	2022-06-17 11:16:23 -07:00
Ben Thorner	ed379a3724	Fix out-of-date rows in ft_notification_status This can happen in the following scenario (primarily for letters): 1. A service has a mixture of "delivered" and "sending" letters, which the status task aggregates into two rows: sending \| 123 delivered \| 456 2. After the 7 day retention has passed, only the "delivered" letters will be archived [^1]. 3. The status task now looks at the history table [^2], which means it only sees the "delivered" letters. 4. The "sending" letters are eventually "delivered" and archived (before the 10 day aggregation cutoff). 5. But the status aggregation task doesn't run. This commit fixes (5). [^1]: https://github.com/alphagov/notifications-api/pull/3063 [^2]: `f87ebb094d/app/dao/fact_notification_status_dao.py (L51)`	2022-05-11 11:04:56 +01:00
Ben Thorner	7bffe9ee50	Log Service ID when we get a duplicate receipt This will make it easier to group these logs if a service complains about the issue.	2022-03-21 15:43:08 +00:00
Ben Thorner	966c4db8c6	Fix getting service IDs for status aggregation Addresses [1]. Previously the query would always use UTC midnight, even after we had switched to BST (+1h). We store timestamps as naive UTC in our DB - without a timezone - but we want the query to work in terms of GMT / BST so we adjust for that - BST midnight is 11PM in UTC. [1]: https://github.com/alphagov/notifications-api/pull/3437#discussion_r791998690	2022-02-10 10:51:45 +00:00
Ben Thorner	7f4b140f97	Rename function to make it consistent This is consistent with the new "on_date" function. It was going off the edge of my screen before in some parts of the code.	2022-02-09 17:39:08 +00:00
Ben Thorner	1213463b8e	Only aggregate status when necessary for a service This takes a similar approach to the nightly deletion task so that we only create sub-tasks when there are actually notifications to aggregate for a given type and day [1]. We're making this change to stop the duplication errors we're getting at the moment and ensure the task can scale to more messages and more services. There are two parts to this: - Each subtask should now run within the 5 minute visibility timeout. However, they may still be duplicated if the parent task overruns [2]. - The parent task creates a minimal number of subtasks, and the query to determine this is very fast for a normal process day (milliseconds). Since all tasks will run quickly, there should be no more duplication. In order to test this more nuanced task, I rewrote the tests: - One test checks the subtask is called correctly. - One test checks we create all the right subtasks. [1]: https://github.com/alphagov/notifications-api/pull/3381 [2]: https://docs.google.com/document/d/1MaP6Nyy3nJKkuh_4lP1wuDm19X8LZITOLRd9n3Ax-xg/edit#heading=h.q3intzwqhfzl	2022-02-09 17:39:07 +00:00
David McDonald	1d8fafcdf4	Remove unused functions Can't see these being used anywhere so lets get rid of them	2022-02-07 15:58:04 +00:00
Rebecca Law	6cd7a23d3c	If there is an invalid letter that has not been updated to `validation-failed` because the `update-validation-failed-for-templated-letter` has not been picked up off the letter-tasks queue and the `collate-letter-pdfs-to-be-sent` has started. 1. The number of letters that we send to DVLA will not be correct (see `20ead82463/app/celery/letters_pdf_tasks.py (L136)`) This may raise an alert with DVLA when they find we have sent them fewer letters than we have reported. 2. When we get the PDF from S3 we will get a file not found `20ead82463/app/celery/letters_pdf_tasks.py (L244)` The error will not prevent the collate task from completing but we will see an alert email for the exception and raise questions. Although this situation is very unlikely because we have a 15 minute window between the last letter deadline date and the time we kick off the collate task we should still mitigate these issues. I updated the queries to only return letters with billable_units > 0, all valid letters should have at least 1 billable unit.	2022-01-19 08:31:19 +00:00
Ben Thorner	de9ae08ecc	Downgrade log for letter deletion exceptions If the S3 object is missing [1], then that's what we want, so we don't need such a severe log for it, but we still want to know as it's not expected. This is separate to more general "ClientError" exceptions, which could mean anything. There weren't any tests to cover missing S3 objects, so I've added one. I don't think we need a test for ClientErrors: - If there was no handler, the task would fail and we'd learn about it that way. - The scope of the calling task is now much smaller, so it matters less than it used to [2]. [1]: `81a79e56ce/app/letters/utils.py (L52)` [2]: `f965322f25`	2021-12-20 12:45:48 +00:00
Leo Hemsted	49cc1b643f	split delete task up into per service we really don't gain anything by running each service delete in sequence - we get the services, and then just loop through them deleting per service. By deleting per service in separate tasks, we can take advantage of parallelism. the only thing we lose is some log lines but I don't think we're that interested in them. only set query limit at the move_notifications dao function - the task doesn't really care about the technical implementation of how it deletes the notifications	2021-12-14 15:24:34 +00:00
Ben Thorner	11278c47f5	Replace log with StatsD gauge for slow delivery A gauge is more useful as we can visualise it and combine it with other stats - we already have other stats for the total number of notifications sent by provider, and we can extrapolate the number of slow notifications using this, if needed. We also still have logs to say the task is running, as well as a log in the calling code when we actually make a switch [1], so we're not losing anything by removing the log here. [1]: `a9306c4557/app/celery/scheduled_tasks.py (L117)`	2021-12-14 13:03:43 +00:00
Ben Thorner	2adaaac3ae	Remove redundant conditions for update query Filtering by ID is enough, noting the other conditions were the same between both queries.	2021-12-13 17:03:07 +00:00
Ben Thorner	c8ebb365d4	Make limit of DAO timeout function more obvious We're going to iterate how we use the function with a limit, so we shouldn't say it's "temporary" anymore. We don't need to change the default, but having it in the function parameters makes it easier to see the function doesn't time out all notifications, just some.	2021-12-13 17:01:41 +00:00
Ben Thorner	76aeab24ce	Rewrite DAO timeout method to take cutoff_time Previously we specified the period and calculated the cutoff time in the function. Passing it in means we can run the method multiple times and avoid getting "new" notifications to time out in the time it takes to process each batch.	2021-12-13 16:56:21 +00:00
Ben Thorner	3bcaf8330e	Simplify comment for DAO timeout function	2021-12-13 16:39:55 +00:00
Ben Thorner	2fb432adaf	Merge pull request #3383 from alphagov/email-sms-created-alert-180344153 Add new log / alert for 'created' email / SMS	2021-12-13 12:56:05 +00:00
David McDonald	7d8eed8228	Optimise queries run for creating pagination links We have been running in to the problem in pallets/flask-sqlalchemy#518 where our page loads very slow when viewing a single page of notifications for a service in the admin app. Tracing this back and using SQL explain analyze I can see that getting the notifications takes about a second but the second query to count how many notifications there are (to work out if there is a next page of pagination) can take up to 100 seconds. As suggested in that issue, we do the pagination ourselves. Our pagination doesn't need us to know exactly how many notifications there are, just whether there are any on the next page and that can be done without running the slow query to count how many notifications in total by using `count_pages=False`. This commit is analogous to `c68d1a2f23`. The only difference is that in that case, the pagination links are used to show prev and/or next links in the admin app. In this case, the pagination links are only used to see if there is a page 2, and if there is, say that we are only showing the first 50 results.	2021-12-10 17:47:27 +00:00
David McDonald	1973994516	Merge pull request #3391 from alphagov/pagination-approach-change Pagination approach change for `get_notifications_for_service`	2021-12-09 10:43:14 +00:00
Ben Thorner	ab4cb029df	Remove alert for email / sms in created In response to [1]. [1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988 It turns out the code that inspired this new alert - in the old "timeout-sending-notifications" task - was actually redundant as we already have a task to "replay" notifications still in "created", which is much better than just alerting about them. It's possible the replayed notifications will also fail, but in both cases we should see some kind of error due to this, so I don't think we're losing anything by not having an alert.	2021-12-06 14:11:42 +00:00
Ben Thorner	97b58ed4c3	Remove unnecessary _timeout partial function It's no longer necessary to have a separate function that's now only called once. While sometimes the separation can bring clarity, here I think it's clearer to have all the code in one place, and avoid the functools complexity we had before.	2021-12-06 14:00:37 +00:00
Ben Thorner	0318229216	Stop 'timing out' old 'created' notifications This is being replaced with a new alert and runbook [1]. It's not always appropriate to change the status to 'technical-failure', and the new alert means we'll act to fix the underlying issue promptly. We'll look at tidying up the remaining code in the next commits. [1]: https://github.com/alphagov/notifications-manuals/wiki/Support-Runbook#deal-with-email-or-sms-still-in-created	2021-12-06 14:00:36 +00:00
Ben Thorner	f96ba5361a	Add new task to alert about created email / sms This will log an error when email or SMS notifications have been stuck in 'created' for too long - normally they should be 'sending' in seconds, noting that we have a goal of < 10s wait time for most notifications being processed our platform. In the next commits we'll decouple similar functionality from the existing 'timeout-sending-notifications' task.	2021-12-06 14:00:31 +00:00
David McDonald	c68d1a2f23	Optimise queries run for creating pagination links We have been running in to the problem in https://github.com/pallets/flask-sqlalchemy/issues/518 where our page loads very slow when viewing a single page of notifications for a service in the admin app. Tracing this back and using SQL explain analyze I can see that getting the notifications takes about a second but the second query to count how many notifications there are (to work out if there is a next page of pagination) can take up to 100 seconds. As suggested in that issue, we do the pagination ourselves. Our pagination doesn't need us to know exactly how many notifications there are, just whether there are any on the next page and that can be done without running the slow query to count how many notifications in total by using `count_pages=False`.	2021-12-03 17:32:39 +00:00
Leo Hemsted	6bbec9f103	make delete notification tasks parallel by notification type we used to do this until apr 2020. Let's try doing it again. Back then, we had problems with timing. We did two things in spring 2020: We moved to using an intermediary temp table [1] We stopped the tasks being parallelised [2] However, it turned out the real time saving was from changing what services we delete for [3]. The task was actually CPU-bound rather than DB-bound, so that's probably why having the tasks in parallel wasn't helping, since they were all competing for the same CPU. It's worth trying the parallel steps again now that we're no longer CPU bound. Note: Temporary tables are in their own postgres schema, and are only viewable by the current session (session == connection. Each celery worker process has its own db connection). We don't need to worry about separate workers both trying to use the same table at once. I've also added a "DROP ON COMMIT" directive to the table definition just to ensure it doesn't persist past the task even if there's an exception. (This also drops on rollback). Cronitor looks at the three functions separately so we don't need to worry about the main task taking milliseconds where it used to take hours as it isn't monitored itself. I've also removed some unnecessary redundant exception logs. [1] https://github.com/alphagov/notifications-api/pull/2767 [2] https://github.com/alphagov/notifications-api/pull/2798 [3] https://github.com/alphagov/notifications-api/pull/3381	2021-12-01 14:28:08 +00:00
Rebecca Law	101498ec84	Improve query performance Add a filter to the `app.dao.notifications_dao.is_delivery_slow_for_providers` query to improve its performance. Adding Notifications.notification_type = 'sms' to the query improves performance; some analysis shows a 500ms improvement, which is a good thing, especially when the query is run once a minute.	2021-11-30 16:42:32 +00:00
Leo Hemsted	bab659c677	reduce number of services we try and delete notifications for TLDR: Don't return as many services, and only return their IDs and not the whole service objects. Context: the delete notifications nightly task has been taking longer and longer, and to delete all three notification types in sequence it now takes up to 8 hours. This is because we were retrieving all services, loading them into memory on the worker, and then trying to delete notifications for each service in turn. While it does use a fair chunk of IOPS/CPU on our postgres db, we're not anywhere close to capacity on those (20% CPU, 4k IOPS out of 30k max)[1] The real issue appears to be that the task is CPU bound on the periodic worker - we see the worker spike up to 100% CPU regularly across the whole 3am-11am period. We also noticed that for each notification type the task first processes services with custom data retention (not many but some of the biggest users), then deals with all other services. We can see from looking at kibana that, for example, the task starts at 3am, and the custom data retention service email deletions are finished by 3:12am. The rest of the emails don't get deleted until 5am, so we knew that the problem is with how it handles the other services. There are currently 17000 services in the database. On a typical day, ~800 services will have notifications that are over 7 days old and need to be deleted. By only returning these services, we reduce the amount of data transfer and serialisation that needs to happen. It takes about two minutes to retrieve the distinct service ids from the notifications table for sms notifications, but that is only 5% the size of the full list so cuts down on a lot of processing Also, by only returning service_ids rather than the whole `Service` model we avoid sqlalchemy needing to do lots of data serialisation, when we were only using the `Service.id` field from that result anyway. 
[1] https://admin.cloud.service.gov.uk/organisations/55b1eb7d-e4c5-4359-9466-dd3ca5b0e457/spaces/80d769ff-7b01-49a4-9fa4-f87edd5328f9/services/6093d337-6918-4b97-9709-97529114eb90/metrics [2] https://grafana-paas.cloudapps.digital/d/_GlGBNbmk/notify-apps?orgId=2&refresh=5s&var-space=production&var-app=notify-delivery-worker-periodic&from=now-24h&to=now [3] https://kibana.logit.io/s/9423a789-282c-4113-908d-0be3b1bc9d1d/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(message),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'%22Deleting%20email%20notifications%20for%20services%20without%20flexible%20data%20retention%22')),sort:!('@timestamp',desc))	2021-11-24 16:18:40 +00:00
Rebecca Law	30a5852685	Update the query to only return the count from the table since that is all we care about. https://www.pivotaltracker.com/story/show/180262357	2021-11-17 14:46:52 +00:00
David McDonald	c98996a461	Improve log message searchability for duplicate receipts There were two problems with the existing message. 1. There was no space between the new status and the time taken which made reading and searching harder 2. The key bits of information (before and after status) were separated by the time taken (which will always be unique) meaning you couldn't do an easy search for a message that is say in delivered being attempted to be set to temporary-failure.	2021-11-12 14:06:38 +00:00
Ben Thorner	77c8c0a501	Optimise query to get notifications to "time out" From experimenting in production we found a "!=" caused the engine to use a sequential scan, whereas explicitly listing all the types ensured an index scan was used. We also found that querying for many (over 100K) items leads to the task stalling - no logs, but no evidence of it running either - so we also add a limit to the query. Since the query now only returns a subset of notifications, we need to ensure the subsequent "update" query operates on the same batch. Also, as a temporary measure, we have a loop in the task code to ensure it operates on the total set of notifications to "time out", which we assume is less than 500K for the time being.	2021-11-09 13:50:32 +00:00
Katie Smith	8a34dccda0	Remove redundant join This was left over from when we needed to tell if a notification was sent by a crown or non-crown service.	2021-05-06 09:34:46 +01:00
Rebecca Law	4f196316aa	Change the query to get the services to purge to use query on the db.Model rather than db.session.query. `service_ids_to_purge` is a list of `row` object rather than a list of `UUID`. NOTE: db.session.query(Service).filter(Service.id.notin_(services_with_data_retention)).all() would have also worked. It seems that only selecting attributes from the db.Model has caused the change.	2021-04-29 13:32:36 +01:00
Rebecca Law	85895a9e8b	Revert "Scheduled weekly dependency update for week 16"	2021-04-28 10:17:16 +01:00
Rebecca Law	f941768d8c	Change the query to get the services to purge to use query on the db.Model rather than db.session.query. `service_ids_to_purge` is a list of `row` object rather than a list of `UUID`. NOTE: db.session.query(Service).filter(Service.id.notin_(services_with_data_retention)).all() would have also worked. It seems that only selecting attributes from the db.Model has caused the change.	2021-04-27 08:36:34 +01:00
Rebecca Law	d4009ffc52	Rename database management functions. Rename @transactional to @autocommit. Rename nested_transaction to transaction.	2021-04-19 10:56:00 +01:00
Ben Thorner	c76e789f1e	Reduce extra S3 ops when working with letter PDFs Previously we did some unnecessary work: - Collate task. This had one S3 request to get a summary of the object, which was then used in another request to get the full object. We only need the size of the object, which is included in the summary [1]. - Archive task. This had one S3 request to get a summary of the object, which was then used to make another request to delete it. We still need both requests, but we can remove the S3.Object in the middle. [1]: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#objectsummary	2021-03-16 12:53:13 +00:00
Ben Thorner	ff7eebc90a	Simplify deleting old letters Previously we made a call to S3 to list objects for a letter, even though we already had the precise key of the single object to hand. This removes the one usage of "get_s3_bucket_objects" and uses the filename directly in the call to remove the object.	2021-03-15 17:18:20 +00:00
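
Note on commits `c68d1a2f23` and `7d8eed8228`: both describe deciding whether a next page exists without running a slow total-count query. The following is a minimal sketch of one common way to do that, not the repository's actual code: it fetches one row more than the page size and checks whether the extra row came back. The table layout and the names `fetch_page` and `make_demo_db` are hypothetical, and an in-memory SQLite database stands in for the real Postgres table.

```python
import sqlite3


def fetch_page(conn, service_id, page, page_size=50):
    """Return (rows, has_next) for one page without a COUNT(*) query.

    Hypothetical helper: asks for page_size + 1 rows, so the presence of
    the extra row tells us whether another page exists.
    """
    offset = (page - 1) * page_size
    rows = conn.execute(
        "SELECT id FROM notifications WHERE service_id = ? "
        "ORDER BY id DESC LIMIT ? OFFSET ?",
        (service_id, page_size + 1, offset),
    ).fetchall()
    has_next = len(rows) > page_size
    # Drop the extra look-ahead row before returning the page.
    return rows[:page_size], has_next


def make_demo_db(n):
    """Build an in-memory table with n notifications for service 'svc'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE notifications (id INTEGER PRIMARY KEY, service_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO notifications (service_id) VALUES (?)", [("svc",)] * n
    )
    return conn
```

With 120 demo rows, `fetch_page(make_demo_db(120), "svc", 1)` returns 50 rows and reports that a next page exists, while page 3 returns the remaining 20 rows and reports none. The cost of each call is bounded by the page size rather than by the total number of notifications.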

423 Commits