Commit Graph

8564 Commits

Author SHA1 Message Date
Ben Thorner
a9306c4557 Merge pull request #3398 from alphagov/infinity-timeout-180344153
Scale timeout task to work on arbitrary volumes
2021-12-14 11:21:26 +00:00
Ben Thorner
c1f0c24d82 Trim down tests for DAO timeout function a bit
The first test is enough to cover that "created" and "delivered"
notifications aren't affected by this function.
2021-12-13 17:17:41 +00:00
Ben Thorner
87cd40d00a Scale timeout task to work on arbitrary volumes
Previously this was limited to 500K notifications. While we don't
expect to reach this limit, it's not impossible e.g. if we had a
repeat of the incident where one of our providers stopped sending
us status updates. Although that's not great, it's worse if our
code can't cope with the unexpectedly high volume.

This reuses the technique we have elsewhere [1] to keep processing
in batches until there's nothing left. Specifying a cutoff point
means the total amount of work to do can't keep growing.

[1]: 2fb432adaf/app/dao/notifications_dao.py (L441)
2021-12-13 17:14:28 +00:00
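A minimal sketch of the batched timeout described above, assuming the app's `db` session and `Notification` model; the function and column names are illustrative rather than the repo's exact code.

```python
from datetime import datetime, timedelta

from app import db  # assumed project module
from app.models import Notification  # assumed project module


def timeout_notifications(cutoff_time, limit=100000):
    """Mark stale 'sending'/'pending' notifications as failed, one batch at a time."""
    updated_total = 0
    while True:
        # Fetch one batch of stale notification IDs. Because the cutoff is fixed
        # up front, notifications created while we loop can never be swept up,
        # so the total amount of work is bounded.
        ids = [
            row.id
            for row in db.session.query(Notification.id)
            .filter(
                Notification.created_at < cutoff_time,
                Notification.status.in_(["sending", "pending"]),
            )
            .limit(limit)
            .all()
        ]
        if not ids:
            break
        db.session.query(Notification).filter(Notification.id.in_(ids)).update(
            {"status": "temporary-failure", "updated_at": datetime.utcnow()},
            synchronize_session=False,
        )
        db.session.commit()
        updated_total += len(ids)
    return updated_total


# Example: time out anything still sending after 4 days.
# timeout_notifications(cutoff_time=datetime.utcnow() - timedelta(days=4))
```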
Ben Thorner
2adaaac3ae Remove redundant conditions for update query
Filtering by ID is enough, noting the other conditions were the
same between both queries.
2021-12-13 17:03:07 +00:00
Ben Thorner
c8ebb365d4 Make limit of DAO timeout function more obvious
We're going to iterate on how we use the function with a limit, so we
shouldn't say it's "temporary" anymore. We don't need to change the
default, but having it in the function parameters makes it easier to
see the function doesn't time out all notifications, just some.
2021-12-13 17:01:41 +00:00
Ben Thorner
76aeab24ce Rewrite DAO timeout method to take cutoff_time
Previously we specified the period and calculated the cutoff time
in the function. Passing it in means we can run the method multiple
times and avoid getting "new" notifications to time out in the time
it takes to process each batch.
2021-12-13 16:56:21 +00:00
Ben Thorner
b81a66da50 Fix assertions in tests for timeout DAO function
Previously most of the assertions were being run *before* we had
actually called the function. There was also a redundant block of
assertions that just asserted the initial state of the test data.
2021-12-13 16:48:30 +00:00
Ben Thorner
3bcaf8330e Simplify comment for DAO timeout function 2021-12-13 16:39:55 +00:00
Ben Thorner
2fb432adaf Merge pull request #3383 from alphagov/email-sms-created-alert-180344153
Add new log / alert for 'created' email / SMS
2021-12-13 12:56:05 +00:00
David McDonald
c25585fd60 Merge pull request #3394 from alphagov/dont-count-pages
Improve response times for figuring out pagination links by doing it ourselves rather than using Flask-SQLAlchemy
2021-12-13 11:43:02 +00:00
David McDonald
eba625a9f5 Merge pull request #3389 from alphagov/improve-pagination-queries
Set `count_pages` as False to stop running a redundant query
2021-12-13 11:41:33 +00:00
Ben Thorner
6fe4cd932a Merge pull request #3396 from alphagov/bump-utils-177535141
Bump utils to 51.2.1
2021-12-13 09:58:29 +00:00
David McDonald
7d8eed8228 Optimise queries run for creating pagination links
We have been running in to the problem in
pallets/flask-sqlalchemy#518 where
our page loads very slowly when viewing a single page of notifications
for a service in the admin app. Tracing this back and using SQL
explain analyze I can see that getting the notifications takes about
a second but the second query to count how many notifications there
are (to work out if there is a next page of pagination) can take up
to 100 seconds.

As suggested in that issue, we do the pagination ourselves.
Our pagination doesn't need us to know exactly how many notifications
there are, just whether there are any on the next page and that can
be done without running the slow query to count how many
notifications in total by using `count_pages=False`.

This commit is analogous to
c68d1a2f23

The only difference is that in that case, the pagination links are
used to show prev and/or next links in the admin app. In this case,
the pagination links are only used to see if there is a page 2, and
if there is, say that we are only showing the first 50 results.
2021-12-10 17:47:27 +00:00
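A rough sketch of the "do the pagination ourselves" idea: fetch one row more than the page size, so we know whether a next page exists without ever running a COUNT(*). The helper name is made up; `query` stands in for the notifications query.

```python
def paginate_without_count(query, page, page_size=50):
    # Ask for page_size + 1 rows: the extra row only tells us a next page exists.
    rows = query.limit(page_size + 1).offset((page - 1) * page_size).all()
    has_next = len(rows) > page_size
    return rows[:page_size], has_next
```

Because the admin app only needs to know whether a page 2 exists, the single extra row is enough and the expensive count query never runs.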
Ben Thorner
a7560af9c4 Bump utils to 51.2.1
This includes performance improvements for RecipientCSV, which may
reduce the processing time in some edge cases - this depends on whether
the Admin app rejects CSVs with these edge cases.
2021-12-10 16:38:28 +00:00
David McDonald
edadeb9131 Use get_prev_next_pagination_links when searching by to field
The only change in behaviour is that we are no longer including a
`last` pagination link.

This is OK because the frontend doesn't use it, just the prev and
next links as per
https://github.com/alphagov/notifications-admin/blob/master/app/main/views/jobs.py#L248
2021-12-10 12:29:55 +00:00
David McDonald
6ac4e67f78 Add test for pagination behaviour
We already have a test case for over 50 results, but this adds
one for 50 (i.e. a single page of results or fewer)
2021-12-10 12:29:12 +00:00
David McDonald
ec6ed3958c Move get_prev_next_pagination_links to utils
This will mean it can later be reused wherever we want
2021-12-10 12:26:57 +00:00
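For illustration only, a guess at the shape of a prev/next-only links helper; the real `get_prev_next_pagination_links` now lives in notifications-utils and its signature may well differ.

```python
from flask import url_for


def get_prev_next_pagination_links(current_page, has_next, endpoint, **url_kwargs):
    # Only 'prev' and 'next' are produced: no 'last' link, so no total count is needed.
    links = {}
    if current_page > 1:
        links["prev"] = url_for(endpoint, page=current_page - 1, **url_kwargs)
    if has_next:
        links["next"] = url_for(endpoint, page=current_page + 1, **url_kwargs)
    return links
```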
Ben Thorner
a8edfeb941 Remove command to replay callbacks
In response to [1].

I've already removed the runbook that referred to this.

[1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r765644576
2021-12-09 10:46:19 +00:00
David McDonald
1973994516 Merge pull request #3391 from alphagov/pagination-approach-change
Pagination approach change for `get_notifications_for_service`
2021-12-09 10:43:14 +00:00
Chris Hill-Scott
0481b14803 Merge pull request #3392 from alphagov/bump-util-51
Bump notifications-utils to 51.0.0
2021-12-06 16:51:38 +00:00
Ben Thorner
ab4cb029df Remove alert for email / sms in created
In response to [1].

[1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988

It turns out the code that inspired this new alert - in the old
"timeout-sending-notifications" task - was actually redundant as
we already have a task to "replay" notifications still in "created",
which is much better than just alerting about them.

It's possible the replayed notifications will also fail, but in
both cases we should see some kind of error due to this, so I don't
think we're losing anything by not having an alert.
2021-12-06 14:11:42 +00:00
Ben Thorner
9bd2a9b427 Extract tests for conditionally creating callback
This will help ensure the function doesn't change arbitrarily, now
that it's used in multiple other places.
2021-12-06 14:11:41 +00:00
Ben Thorner
04da017558 DRY-up conditionally creating callback tasks
This removes 3 duplicate instances of the same code, which is still
tested implicitly via test_process_ses_receipt_tasks [1]. In the
next commit we'll make this test more explicit, to reflect that it's
now being reused elsewhere and shouldn't change arbitrarily.

We do lose the "print" statement from the command instance of the
code, but I think that's a very tolerable loss.

[1]: 16ec8ccb8a/tests/app/celery/test_process_ses_receipts_tasks.py (L94)
2021-12-06 14:11:34 +00:00
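A sketch of the extracted helper, assuming the DAO lookup and Celery task that the surrounding commits refer to; the exact names and payload in the repo may differ.

```python
from app.celery.service_callback_tasks import send_delivery_status_to_service  # assumed
from app.dao.service_callback_api_dao import (  # assumed
    get_service_delivery_status_callback_api_for_service,
)


def check_and_queue_callback_task(notification):
    # Only queue a delivery-status callback if the service has registered a callback API.
    callback_api = get_service_delivery_status_callback_api_for_service(
        service_id=notification.service_id
    )
    if callback_api:
        send_delivery_status_to_service.apply_async(
            [str(notification.id)], queue="service-callbacks"
        )
```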
Ben Thorner
aea555fce2 Make test for timeout with callbacks consistent
This now matches the behaviour of the test above it: mocking out
the DAO function in order to focus on the specific behaviour of
the function under test.
2021-12-06 14:00:42 +00:00
Ben Thorner
5bf3fe6c0f Clarify no callbacks are sent in timeout test
This now complements the test below it, which we will refactor to
be consistent in the next commit.
2021-12-06 14:00:41 +00:00
Ben Thorner
8b7e81958d Delete duplicate 'timeout' tests for notifications
These scenarios are already covered by the DAO tests. It's enough
to just check the DAO function is called as expected.

While sometimes it can be better to have more end-to-end tests, the
convention across much of this app is to do unit tests.
2021-12-06 14:00:40 +00:00
Ben Thorner
c3e11d676f Remove unnecessary test_request_context manager
This doesn't affect how the tests run and just adds complexity.
2021-12-06 14:00:39 +00:00
Ben Thorner
05bd26d444 Fix names for a few tests in test_nightly_tasks.py
I find it really difficult to visually parse test files unless we
have a consistent convention for how we name our test functions.
In most of our tests the name of the test function starts with the
name of the function under test.
2021-12-06 14:00:38 +00:00
Ben Thorner
97b58ed4c3 Remove unnecessary _timeout partial function
It's no longer necessary to have a separate function that's now
only called once. While sometimes the separation can bring clarity,
here I think it's clearer to have all the code in one place, and
avoid the functools complexity we had before.
2021-12-06 14:00:37 +00:00
Ben Thorner
0318229216 Stop 'timing out' old 'created' notifications
This is being replaced with a new alert and runbook [1]. It's not
always appropriate to change the status to 'technical-failure', and
the new alert means we'll act to fix the underlying issue promptly.

We'll look at tidying up the remaining code in the next commits.

[1]: https://github.com/alphagov/notifications-manuals/wiki/Support-Runbook#deal-with-email-or-sms-still-in-created
2021-12-06 14:00:36 +00:00
Ben Thorner
2acc4ee67d Repurpose command to replay notification callbacks
This is so we can use it to address issues highlighted by the new
alert, if it's not possible to actually send the notifications e.g.
if they are somehow 'invalid'.

Previously this was added for a one-off use case [1]. This rewrites
the task to operate on arbitrary notification IDs instead of client
refs, which aren't always present for notifications we may want to
send / replay callbacks for. Since the task may now need to work on
notifications from more than one service, I had to restructure it to cope
with multiple callback APIs.

Note that, in the test, I've chosen to do a chain of invocations and
assertions, rather than duplicate a load of boilerplate or introduce
funky parametrize flags for a service with/out a callback API. We'll
refactor this in a later commit.

[1]: e95740a6b5
2021-12-06 14:00:35 +00:00
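A sketch of the repurposed replay command, reusing the conditional-callback helper sketched earlier and an assumed DAO lookup; the real command's structure and names may differ.

```python
from app.dao.notifications_dao import get_notification_by_id  # assumed


def replay_service_callbacks(notification_ids):
    # The IDs may span several services; the per-service callback API lookup
    # happens inside the helper for each notification.
    for notification_id in notification_ids:
        notification = get_notification_by_id(notification_id)
        check_and_queue_callback_task(notification)
```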
Ben Thorner
f96ba5361a Add new task to alert about created email / sms
This will log an error when email or SMS notifications have been
stuck in 'created' for too long - normally they should be 'sending'
in seconds, noting that we have a goal of < 10s wait time for most
notifications being processed on our platform.

In the next commits we'll decouple similar functionality from the
existing 'timeout-sending-notifications' task.
2021-12-06 14:00:31 +00:00
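A sketch of a periodic check that logs an error for email/SMS stuck in 'created', assuming the app's Celery instance, logger and `Notification` model; the task name and the 10-minute threshold are illustrative.

```python
from datetime import datetime, timedelta

from flask import current_app

from app import db, notify_celery  # assumed project modules
from app.models import Notification  # assumed project module


@notify_celery.task(name="check-for-email-sms-stuck-in-created")  # name is a guess
def check_for_notifications_stuck_in_created():
    cutoff = datetime.utcnow() - timedelta(minutes=10)  # illustrative threshold
    stuck = (
        db.session.query(Notification)
        .filter(
            Notification.notification_type.in_(["email", "sms"]),
            Notification.status == "created",
            Notification.created_at < cutoff,
        )
        .count()
    )
    if stuck:
        current_app.logger.error(
            "%s email/sms notifications have been in 'created' since before %s", stuck, cutoff
        )
```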
Chris Hill-Scott
f011254667 Bump notifications-utils to 51.0.0
Just so other people don’t have to merge these changes.

The breaking changes don’t affect this repo because the API doesn’t:
- check the service guestlist before sending a message
- do any visual preview of emergency alert messages

> **51.0.0**
> - Initial argument to RecipientCSV renamed from whitelist to guestlist, in other words consuming code should call RecipientCSV(guestlist=['test@example.com'])
> - RecipientCSV.whitelist property renamed to RecipientCSV.guestlist
>
> **50.0.0**
> - Make icon in broadcast_preview_template.jinja2 an inline SVG (requires changes to the CSS of consumer code)
>
> **49.1.0**
> Add ttl_in_seconds argument to RequestCache.set to let users specify a custom TTL

This commit also changes the format of the line in the requirements
file, copying https://github.com/alphagov/notifications-admin/pull/4074/files
2021-12-06 09:34:15 +00:00
David McDonald
e8dd136678 Document area that may be doing pagination links when not needed 2021-12-03 17:32:40 +00:00
David McDonald
c68d1a2f23 Optimise queries run for creating pagination links
We have been running in to the problem in
https://github.com/pallets/flask-sqlalchemy/issues/518 where
our page loads very slowly when viewing a single page of notifications
for a service in the admin app. Tracing this back and using SQL
explain analyze I can see that getting the notifications takes about
a second but the second query to count how many notifications there
are (to work out if there is a next page of pagination) can take up
to 100 seconds.

As suggested in that issue, we do the pagination ourselves.
Our pagination doesn't need us to know exactly how many notifications
there are, just whether there are any on the next page and that can
be done without running the slow query to count how many
notifications in total by using `count_pages=False`.
2021-12-03 17:32:39 +00:00
David McDonald
989ef9c21a Remove last and total keys from pagination links
These don't appear to be used anywhere in the admin app and this
route is only used by the admin app. Therefore it is safe to remove
them.

We remove them because calculating the total number of notifications
or the final page number of results can be particularly slow for
services with many many notifications, for example 100 seconds
for a service with 500k notifications sent in the past 7 days.

Given neither are being used, this will give us the potential in
the next commit to reduce the number of slow queries and improve
page load times.

Note, I've kept the scope small by only introducing the new
pagination function for this one endpoint but there could be scope
in future to get all pagination using the new function if
appropriate.
2021-12-03 17:26:49 +00:00
David McDonald
a62e63fcef Add tests for existing pagination behaviour
No functionality change, just documenting what already exists
2021-12-03 17:21:14 +00:00
Leo Hemsted
a8cad79def Merge pull request #3390 from alphagov/reporting-q
put delete tasks on the reporting worker
2021-12-03 16:08:42 +00:00
Leo Hemsted
f6d210f1e6 put delete tasks on the reporting worker
they share a lot with the reporting tasks (creating ft_billing and
ft_notification_status), in that they're run nightly, take a long time,
and we see error messages if they get run multiple times (due to
visibility timeout).

The periodic app has two concurrent processes - previously there was
just one delete task, which would use one of those processes, while the
other process would pick up anything else on the queue (at that time of
night, the regular provider switch checks and scheduled job checks).
However, when we switched to running the three delete notification types
separately, we saw visibility timeout issues - three tasks would be
created, all three would be picked up by one celery instance, the two
worker processes would start on two of them, and the third would sit on
the box, wait longer than the visibility timeout to be picked up (and
acknowledged), and so SQS would assume the task was lost and replay it.

it's queues all the way down!

By putting them on the reporting worker we can take advantage of tuning
that app (for example setting the prefetch multiplier to one) which is
designed to run large tasks. We've also got more concurrent workers on
this box, so we can run all three tasks at once.
2021-12-03 13:28:16 +00:00
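A sketch of routing the nightly delete tasks to the queue the reporting worker consumes; the queue and task names here are illustrative, not the deployed configuration.

```python
from celery import Celery

celery_app = Celery("notifications")

# The reporting worker is tuned for long-running tasks (prefetch multiplier of 1,
# more concurrent processes), so the delete tasks are pointed at its queue.
celery_app.conf.task_routes = {
    "delete-email-notifications": {"queue": "reporting-tasks"},
    "delete-sms-notifications": {"queue": "reporting-tasks"},
    "delete-letter-notifications": {"queue": "reporting-tasks"},
}
```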
Leo Hemsted
595ff134d7 Merge pull request #3388 from alphagov/parallel-deletes
make delete notification tasks parallel by notification type
2021-12-02 10:33:53 +00:00
David McDonald
ad274ee887 Set count_pages as False to stop running a redundant query
This is a similar PR to https://github.com/alphagov/notifications-api/pull/2284.

When using flask-sqlalchemy to get a `Pagination` object, by default
it will run two queries

1. Get the page of results that you are asking for
2. If the number of results is equal to the page size, then it will
  issue a second query that will count the total number of results

Getting the total number of results is only useful if
- you need to show how many results there are
- you need to know if there is a next page of results (flask-sqlalchemy
  uses the total to work out how many pages there are altogether,
  which may not be the most efficient way of working out if there
  is a next page or not but that is what it currently does).

Looking at the `get_notifications` route, it does
not use `paginated_notifications.total` or
`paginated_notifications.has_next` and therefore we have no use
for the second query to get the total number of results.

We can stop this additional query by setting `count_pages=False`
which will hopefully give us some performance improvements, in
particular for services which send a lot of notifications.

Flask sqlalchemy references:
818c947b66/src/flask_sqlalchemy/__init__.py (L478)
818c947b66/src/flask_sqlalchemy/__init__.py (L399)

Note, I have checked the other uses of `get_notifications_for_service`
and the other cases are currently using the total or next page so
this approach is not something we can take with them.
2021-12-01 16:51:05 +00:00
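A sketch of the change, using the `count_pages` keyword named in the commit above; the exact `paginate()` signature depends on the Flask-SQLAlchemy version the repo pins, and the query shown is simplified.

```python
from app.models import Notification  # assumed project module


def get_notifications_for_service(service_id, page=1, page_size=50):
    query = Notification.query.filter(Notification.service_id == service_id)
    # count_pages=False skips the second query that counts every matching row,
    # which is the slow part for services that send a lot of notifications.
    return query.paginate(page=page, per_page=page_size, count_pages=False)
```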
Leo Hemsted
6bbec9f103 make delete notification tasks parallel by notification type
we used to do this until April 2020. Let's try doing it again.
Back then, we had problems with timing. We did two things in spring
2020:

We moved to using an intermediary temp table [1]
We stopped the tasks being parallelised [2]

However, it turned out the real time saving was from changing what
services we delete for [3]. The task was actually CPU-bound rather than
DB-bound, so that's probably why having the tasks in parallel wasn't
helping, since they were all competing for the same CPU. It's worth
trying the parallel steps again now that we're no longer CPU bound.

Note: Temporary tables are in their own postgres schema, and are only
viewable by the current session (session == connection. Each celery
worker process has its own db connection). We don't need to worry about
separate workers both trying to use the same table at once.

I've also added an "ON COMMIT DROP" directive to the table definition
just to ensure it doesn't persist past the task even if there's an
exception. (This also drops on rollback).

Cronitor looks at the three functions separately so we don't need to worry
about the main task taking milliseconds where it used to take hours as
it isn't monitored itself.

I've also removed some redundant exception logs.

[1] https://github.com/alphagov/notifications-api/pull/2767
[2] https://github.com/alphagov/notifications-api/pull/2798
[3] https://github.com/alphagov/notifications-api/pull/3381
2021-12-01 14:28:08 +00:00
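A sketch of fanning the nightly delete out into one task per notification type, with the `ON COMMIT DROP` temporary table mentioned above; the task names, queue name and 7-day window are illustrative.

```python
from sqlalchemy import text

from app import db, notify_celery  # assumed project modules


@notify_celery.task(name="delete-notifications-older-than-retention")
def delete_notifications_older_than_retention():
    # One sub-task per notification type, so the three deletes run in parallel.
    for notification_type in ("email", "sms", "letter"):
        delete_notifications_for_type.apply_async([notification_type], queue="reporting-tasks")


@notify_celery.task(name="delete-notifications-for-type")
def delete_notifications_for_type(notification_type):
    # The temp table is private to this worker's DB session and is dropped
    # automatically on commit (or rollback), even if the task raises.
    db.session.execute(
        text(
            "CREATE TEMPORARY TABLE notifications_to_delete ON COMMIT DROP AS "
            "SELECT id FROM notifications "
            "WHERE notification_type = :nt AND created_at < now() - interval '7 days'"
        ),
        {"nt": notification_type},
    )
    db.session.execute(
        text("DELETE FROM notifications WHERE id IN (SELECT id FROM notifications_to_delete)")
    )
    db.session.commit()
```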
Ben Thorner
6435b57cd1 Merge pull request #3384 from alphagov/fix-cronitor-test-180344153
Fix flakey Cronitor test using caplog fixture
2021-12-01 12:44:27 +00:00
Rebecca Law
c78e6c8571 Merge pull request #3386 from alphagov/increase-concurrency-for-reporting-app
Increase the concurrency for the delivery-worker-reporting
2021-12-01 11:51:50 +00:00
Rebecca Law
ddc03a9f5c Merge pull request #3385 from alphagov/improve-is_provider_slow-query
Improve query performance
2021-12-01 11:41:51 +00:00
Rebecca Law
e7efeec309 Increase the concurrency for the delivery-worker-reporting
TL;DR
After a chat with some team members we've decided to double the concurrency of the delivery-worker-reporting app from 2 to 4. Looking at the memory usage during the reporting task runs, we don't believe this to be a risk. There are some other things to look at, but this could be a quick win in the short term.

Longer read:
Every night we have 2 "reporting" tasks that run.
- create-nightly-billing starts at 00:15
  - populates data for ft_billing for the previous days.
  - 4 days for email
  - 4 days for sms
  - 10 days for letters
- create-nightly-notification-status starts at 00:30
  - populates data for ft_notification_status
  - 4 days for email
  - 4 days for sms
  - 10 days for letters

These tasks are picked up by the `notify-delivery-worker-reporting` app; we run 3 instances with a concurrency of 2.
This means that we have 6 worker threads that pick up the 18 tasks created at 00:15 and 00:30.
Each celery main thread picks up 10 tasks off the queue; the 2 worker threads start working on a task each and acknowledge those tasks to SQS. Meanwhile the other 8 tasks wait in the internal celery queue and no acknowledgement is sent to SQS for them. As each task is completed, a worker picks up a new task and acknowledges it.
If a task sits in the Celery internal queue for longer than 5 minutes (the SQS visibility timeout), SQS will assume the task has not completed and put it back on the queue, therefore creating a duplicate task.
At some point all the tasks are completed, some of them twice.
2021-12-01 11:40:18 +00:00
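A sketch of the Celery settings in play for this worker, using standard Celery config keys; the values are for illustration and are not the deployed settings.

```python
from celery import Celery

celery_app = Celery("notify-delivery-worker-reporting")

celery_app.conf.update(
    worker_concurrency=4,           # doubled from 2, per the commit above
    worker_prefetch_multiplier=1,   # each process reserves one task at a time
    broker_transport_options={
        "visibility_timeout": 310,  # seconds before SQS re-delivers an unacknowledged task
    },
)
```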
Rebecca Law
101498ec84 Improve query performance
This adds a filter to the `app.dao.notifications_dao.is_delivery_slow_for_providers` query to improve its performance. Adding Notifications.notification_type = 'sms' to the query improves performance: some analysis shows a 500ms improvement, which is a good thing especially when the query is run once a minute.
2021-11-30 16:42:32 +00:00
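A sketch of the added condition, assuming the app's `Notification` model; the rest of the real `is_delivery_slow_for_providers` query is omitted.

```python
from app.models import Notification  # assumed project module


def restrict_to_sms(query):
    # Filtering to SMS rows up front lets Postgres discard most of the table
    # before the query's more expensive conditions run.
    return query.filter(Notification.notification_type == "sms")
```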
Katie Smith
ad313065bf Merge pull request #3368 from alphagov/org-agreement-details
Add command to populate organisation table with agreement details
2021-11-30 14:30:56 +00:00
Katie Smith
250ce38cf2 Remove unnecessary list-routes command
Since this was added, Flask now comes with a built-in command to list
the routes, `flask routes`, so this is not needed.
2021-11-30 11:11:49 +00:00
Katie Smith
6d9f2c27d9 Add command to populate organisation table with agreement details
When we first started recording the details of the agreements that
were signed by organisations, we stored a copy of the signed agreement
in Google drive. Later, we switched to storing the details in the
database instead.

This adds a command which is designed to be run once and which updates
the database for the organisations whose agreement details (who accepted
the agreement and when) were stored in Google drive.
2021-11-30 11:11:49 +00:00