Previously this was limited to 500K notifications. While we don't
expect to reach this limit, it's not impossible, e.g. if we had a
repeat of the incident where one of our providers stopped sending
us status updates. Although that's not great, it's worse if our
code can't cope with the unexpectedly high volume.
This reuses the technique we have elsewhere [1] to keep processing
in batches until there's nothing left. Specifying a cutoff point
means the total amount of work to do can't keep growing.
[1]: 2fb432adaf/app/dao/notifications_dao.py (L441)
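A minimal sketch of the pattern (the helper name, batch size and
loop shape here are illustrative, not the real DAO code):

```python
def timeout_in_batches(dao_timeout_batch, cutoff_time, batch_size=10000):
    """Time out notifications older than cutoff_time, batch by batch,
    until a batch comes back smaller than batch_size (nothing left).

    Fixing cutoff_time once, up front, bounds the total work: rows
    that become old enough while we're processing are left for the
    next run rather than extending this one.
    """
    total = 0
    while True:
        updated = dao_timeout_batch(cutoff_time, limit=batch_size)
        total += updated
        if updated < batch_size:
            return total
```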
We're going to iterate on how we use the function with a limit, so
we shouldn't say it's "temporary" anymore. We don't need to change
the default, but having it in the function parameters makes it easier
to see that the function doesn't time out all notifications, just some.
Previously we specified the period and calculated the cutoff time
inside the function. Passing it in means we can run the method
multiple times without picking up "new" notifications that become
eligible to time out while each batch is being processed.
Previously most of the assertions were being run *before* we had
actually called the function. There was also a redundant block of
assertions that just asserted the initial state of the test data.
We have been running into the problem in
pallets/flask-sqlalchemy#518, where
our page loads very slowly when viewing a single page of notifications
for a service in the admin app. Tracing this back with SQL
EXPLAIN ANALYZE, I can see that getting the notifications takes about
a second, but the second query to count how many notifications there
are (to work out if there is a next page of pagination) can take up
to 100 seconds.
As suggested in that issue, we do the pagination ourselves.
Our pagination doesn't need us to know exactly how many notifications
there are, just whether there are any on the next page, and that can
be done without running the slow query to count the total number of
notifications by using `count_pages=False`.
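Knowing whether a next page exists only needs one extra row, not a
full count. A minimal sketch of the idea (the helper is made up; our
real change just passes `count_pages=False` to the existing DAO
function):

```python
def paginate_without_count(query, page, per_page):
    # Fetch one row more than the page needs: if it comes back, a
    # next page exists, and we never run the expensive COUNT query.
    items = query.limit(per_page + 1).offset((page - 1) * per_page).all()
    has_next = len(items) > per_page
    return items[:per_page], has_next
```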
This commit is analogous to
c68d1a2f23
The only difference is that in that case, the pagination links are
used to show prev and/or next links in the admin app. In this case,
the pagination links are only used to see if there is a page 2, and
if there is, say that we are only showing the first 50 results.
This includes performance improvements for RecipientCSV, which may
reduce the processing time in some edge cases - this depends on
whether the Admin app rejects CSVs with these edge cases.
In response to [1].
[1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988
It turns out the code that inspired this new alert - in the old
"timeout-sending-notifications" task - was actually redundant as
we already have a task to "replay" notifications still in "created",
which is much better than just alerting about them.
It's possible the replayed notifications will also fail, but in
both cases we should see some kind of error due to this, so I don't
think we're losing anything by not having an alert.
This removes 3 duplicate instances of the same code, which is still
tested implicitly via test_process_ses_receipt_tasks [1]. In the
next commit we'll make this test more explicit, to reflect that it's
now being reused elsewhere and shouldn't change arbitrarily.
We do lose the "print" statement from the command instance of the
code, but I think that's a very tolerable loss.
[1]: 16ec8ccb8a/tests/app/celery/test_process_ses_receipts_tasks.py (L94)
This now matches the behaviour of the test above it: mocking out
the DAO function in order to focus on the specific behaviour of
the function under test.
These scenarios are already covered by the DAO tests. It's enough
to just check the DAO function is called as expected.
While sometimes it can be better to have more end-to-end tests, the
convention across much of this app is to do unit tests.
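For example, the shape of such a test with pytest-mock (all names
here are illustrative):

```python
def test_timeout_notifications_calls_dao_with_limit(mocker):
    # The DAO's own behaviour is covered by the DAO tests; here we
    # only check that it's called as expected.
    mock_dao = mocker.patch(
        "app.celery.nightly_tasks.dao_timeout_notifications",  # hypothetical path
        return_value=0,
    )

    timeout_notifications()  # the function under test

    mock_dao.assert_called_once()
```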
I find it really difficult to visually parse test files unless we
have a consistent convention for how we name our test functions.
In most of our tests the name of the test function starts with the
name of the function under test.
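For example, tests for a hypothetical `dao_timeout_notifications`
would all share its name as a prefix:

```python
def test_dao_timeout_notifications_skips_recent_notifications():
    ...

def test_dao_timeout_notifications_respects_the_limit():
    ...
```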
It's no longer necessary to have a separate function that's now
only called once. While sometimes the separation can bring clarity,
here I think it's clearer to have all the code in one place, and
avoid the functools complexity we had before.
This is so we can use it to address issues highlighted by the new
alert, if it's not possible to actually send the notifications,
e.g. if they are somehow 'invalid'.
Previously this was added for a one-off use case [1]. This rewrites
the task to operate on arbitrary notification IDs instead of client
refs, which aren't always present for notifications we may want to
send / replay callbacks for. Since the task may now need to work on
notifications from more than one service, I had to restructure it to
cope with multiple callback APIs.
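A rough sketch of the new shape (every helper name here is
illustrative, not the real code):

```python
from collections import defaultdict

def send_delivery_status_callbacks(notification_ids):
    # Notifications may belong to different services, so group them
    # and look up each service's callback API once.
    notifications = dao_get_notifications_by_ids(notification_ids)  # hypothetical
    by_service = defaultdict(list)
    for notification in notifications:
        by_service[notification.service_id].append(notification)

    for service_id, group in by_service.items():
        callback_api = get_callback_api_for_service(service_id)  # hypothetical
        if callback_api is None:
            continue  # services without a callback API get nothing
        for notification in group:
            send_callback(notification, callback_api)  # hypothetical
```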
Note that, in the test, I've chosen to do a chain of invocations and
assertions, rather than duplicate a load of boilerplate or introduce
funky parametrize flags for a service with/out a callback API. We'll
refactor this in a later commit.
[1]: e95740a6b5
This will log an error when email or SMS notifications have been
stuck in 'created' for too long - normally they should be 'sending'
in seconds, noting that we have a goal of < 10s wait time for most
notifications being processed on our platform.
In the next commits we'll decouple similar functionality from the
existing 'timeout-sending-notifications' task.
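A sketch of the check (the task name, threshold and imports are
illustrative):

```python
from datetime import datetime, timedelta

from flask import current_app

from app import notify_celery  # illustrative app-level imports
from app.models import Notification

@notify_celery.task(name="check-for-notifications-stuck-in-created")
def check_for_notifications_stuck_in_created():
    threshold = datetime.utcnow() - timedelta(minutes=1)  # illustrative
    stuck = Notification.query.filter(
        Notification.notification_type.in_(["email", "sms"]),
        Notification.status == "created",
        Notification.created_at < threshold,
    ).count()
    if stuck:
        current_app.logger.error(
            "%s notifications have been in 'created' for over a minute",
            stuck,
        )
```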
Just so other people don’t have to merge these changes.
The breaking changes don’t affect this repo because the API doesn’t:
- check the service guestlist before sending a message
- do any visual preview of emergency alert messages
> **51.0.0**
> - Initial argument to RecipientCSV renamed from whitelist to guestlist, in other words consuming code should call RecipientCSV(guestlist=['test@example.com'])
> - RecipientCSV.whitelist property renamed to RecipientCSV.guestlist
>
> **50.0.0**
> - Make icon in broadcast_preview_template.jinja2 an inline SVG (requires changes to the CSS of consumer code)
>
> **49.1.0**
> - Add ttl_in_seconds argument to RequestCache.set to let users specify a custom TTL
This commit also changes the format of the line in the requirements
file, copying https://github.com/alphagov/notifications-admin/pull/4074/files
These pagination fields - the total count of notifications and the
link to the final page of results - don't appear to be used anywhere
in the admin app, and this route is only used by the admin app.
Therefore it is safe to remove them.
We remove them because calculating the total number of notifications,
or the final page number of results, can be particularly slow for
services with very many notifications: for example, 100 seconds for
a service with 500k notifications sent in the past 7 days.
Given neither are being used, this will give us the potential in
the next commit to reduce the number of slow queries and improve
page load times.
Note, I've kept the scope small by only introducing the new
pagination function for this one endpoint, but there could be scope
in future to move all pagination over to the new function if
appropriate.
They share a lot with the reporting tasks (creating ft_billing and
ft_notification_status), in that they're run nightly, take a long
time, and we see error messages if they get run multiple times (due
to the visibility timeout).
The periodic app has two concurrent processes - previously there was
just one delete task, which would use one of those processes, while the
other process would pick up anything else on the queue (at that time of
night, the regular provider switch checks and scheduled job checks).
However, when we switched to running the three delete notification types
separately, we saw visibility timeout issues - three tasks would be
created, all three would be picked up by one celery instance, the two
worker processes would start on two of them, and the third would sit on
the box, wait longer than the visibility timeout to be picked up (and
acknowledged), and so SQS would assume the task was lost and replay it.
it's queues all the way down!
By putting them on the reporting worker we can take advantage of the
tuning on that app (for example setting the prefetch multiplier to
one), which is designed to run large tasks. We've also got more
concurrent workers on this box, so we can run all three tasks at once.
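The relevant tuning is roughly this (a sketch of the Celery settings
involved, not our actual config):

```python
# Reserving only the task each process is about to run means a long
# delete task can't sit unacknowledged in the internal queue past
# the SQS visibility timeout.
worker_prefetch_multiplier = 1

# Illustrative: enough worker processes to run all three delete
# tasks at once.
worker_concurrency = 4
```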
This is a similar PR to https://github.com/alphagov/notifications-api/pull/2284.
When using flask-sqlalchemy to get a `Pagination` object, by default
it will run two queries:
1. Get the page of results that you are asking for
2. If the number of results is equal to the page size, then it will
issue a second query that will count the total number of results
Getting the total number of results is only useful if
- you need to show how many results there are
- you need to know if there is a next page of results (flask-sqlalchemy
uses the total to work out how many pages there are altogether,
which may not be the most efficient way of working out if there
is a next page or not but that is what it currently does).
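Simplified, `paginate()` does something like this (paraphrasing the
flask-sqlalchemy source referenced below):

```python
# Query 1: the page of results itself.
items = query.limit(per_page).offset((page - 1) * per_page).all()

# Query 2: only skipped for a short first page; otherwise a COUNT
# over the whole result set, which is the slow part.
if page == 1 and len(items) < per_page:
    total = len(items)
else:
    total = query.order_by(None).count()
```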
Looking at the `get_notifications` route, it does
not use `paginated_notifications.total` or
`paginated_notifications.has_next` and therefore we have no use
for the second query to get the total number of results.
We can stop this additional query by setting `count_pages=False`,
which will hopefully give us some performance improvements, in
particular for services which send a lot of notifications.
Flask-SQLAlchemy references:
818c947b66/src/flask_sqlalchemy/__init__.py (L478)
818c947b66/src/flask_sqlalchemy/__init__.py (L399)
Note, I have checked the other uses of `get_notifications_for_service`
and the other cases are currently using the total or next page so
this approach is not something we can take with them.
We used to do this until April 2020. Let's try doing it again.
Back then, we had problems with timing. We did two things in spring
2020:
- We moved to using an intermediary temp table [1]
- We stopped the tasks being parallelised [2]
However, it turned out the real time saving was from changing what
services we delete for [3]. The task was actually CPU-bound rather than
DB-bound, so that's probably why having the tasks in parallel wasn't
helping, since they were all competing for the same CPU. It's worth
trying the parallel steps again now that we're no longer CPU bound.
Note: Temporary tables are in their own postgres schema, and are only
viewable by the current session (session == connection. Each celery
worker process has its own db connection). We don't need to worry about
separate workers both trying to use the same table at once.
I've also added an "ON COMMIT DROP" clause to the table definition,
just to ensure it doesn't persist past the task even if there's an
exception. (This also drops on rollback.)
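The table definition ends up looking roughly like this (illustrative
SQL with the columns and conditions simplified):

```python
from sqlalchemy import text

db.session.execute(
    text(
        """
        CREATE TEMPORARY TABLE notifications_to_delete
        ON COMMIT DROP
        AS SELECT id FROM notifications
        WHERE notification_type = :type AND created_at < :cutoff
        """
    ),
    {"type": notification_type, "cutoff": cutoff},
)
```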
Cronitor looks at the three functions separately, so we don't need
to worry about the main task taking milliseconds where it used to
take hours, as the main task itself isn't monitored.
I've also removed some redundant exception logs.
[1] https://github.com/alphagov/notifications-api/pull/2767
[2] https://github.com/alphagov/notifications-api/pull/2798
[3] https://github.com/alphagov/notifications-api/pull/3381
TL;DR
After a chat with some team members we've decided to double the concurrency of the delivery-worker-reporting app from 2 to 4. Looking at the memory usage during the reporting task runs, we don't believe this to be a risk. There are some other things to look at, but this could be a quick win in the short term.
Longer read:
Every night we have 2 "reporting" tasks that run.
- create-nightly-billing starts at 00:15
- populates data for ft_billing for the previous days.
- 4 days for email
- 4 days for sms
- 10 days for letters
- create-nightly-notification-status starts at 00:30
- populates data for ft_notification_status for the previous days.
- 4 days for email
- 4 days for sms
- 10 days for letters
These tasks are picked up by the `notify-delivery-worker-reporting` app; we run 3 instances with a concurrency of 2.
This means that we have 6 worker threads that pick up the 18 tasks created at 00:15 and 00:30.
Each celery main thread picks up 10 tasks off the queue; the 2 worker threads start working on a task each and acknowledge those tasks to SQS. Meanwhile the other 8 tasks wait in the internal celery queue and no acknowledgement is sent to SQS. As each task completes, a worker picks up a new task from the internal queue and acknowledges it.
If a task is kept in the Celery internal queue for longer than the 5 minute visibility timeout, SQS will assume the task has not completed and put it back on the queue, therefore creating a duplicate task.
At some point all the tasks are completed, some of them twice.
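The arithmetic behind "picks up 10 tasks", assuming Celery's
defaults:

```python
worker_concurrency = 2           # worker processes per instance
worker_prefetch_multiplier = 4   # Celery's default

# Messages a main process reserves without acknowledging them:
prefetched = worker_concurrency * worker_prefetch_multiplier  # 8

# Plus the tasks actually executing (acknowledged when they start):
in_hand = prefetched + worker_concurrency  # 10
```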
This adds a filter to the `app.dao.notifications_dao.is_delivery_slow_for_providers` query to improve its performance. Adding `Notification.notification_type == 'sms'` to the query improves performance: some analysis shows a 500ms improvement, which is worthwhile given the query runs once a minute.
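A sketch of the change (simplified; the real query has other
conditions):

```python
# Restricting the query to SMS rows up front is what gives the
# ~500ms improvement described above.
query = query.filter(Notification.notification_type == "sms")
```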
When we first started recording the details of the agreements that
were signed by organisations, we stored a copy of the signed agreement
in Google Drive. Later, we switched to storing the details in the
database instead.
This adds a command which is designed to be run once, and which
updates the database for the organisations which had the details of
who accepted the agreement, and when, stored in Google Drive.