Previously this was limited to 500K notifications. While we don't
expect to reach this limit, it's not impossible, e.g. if we had a
repeat of the incident where one of our providers stopped sending
us status updates. Although that's not great, it's worse if our
code can't cope with the unexpectedly high volume.
This reuses the technique we have elsewhere [1] to keep processing
in batches until there's nothing left. Specifying a cutoff point
means the total amount of work to do can't keep growing.
[1]: 2fb432adaf/app/dao/notifications_dao.py (L441)
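A minimal sketch of the pattern (the helper name, batch size and
loop shape here are illustrative, not the real DAO code):

```python
def timeout_in_batches(dao_timeout_batch, cutoff_time, batch_size=10000):
    """Time out notifications older than cutoff_time, batch by batch,
    until a batch comes back smaller than batch_size (nothing left).

    Fixing cutoff_time once, up front, bounds the total work: rows
    that become old enough while we're processing are left for the
    next run rather than extending this one.
    """
    total = 0
    while True:
        updated = dao_timeout_batch(cutoff_time, limit=batch_size)
        total += updated
        if updated < batch_size:
            return total
```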
We're going to iterate on how we use the function with a limit, so
we shouldn't say it's "temporary" anymore. We don't need to change
the default, but having it in the function parameters makes it easier
to see that the function doesn't time out all notifications, just some.
Previously we specified the period and calculated the cutoff time
inside the function. Passing it in means we can run the method
multiple times without picking up "new" notifications that become
eligible to time out while each batch is being processed.
Previously most of the assertions were being run *before* we had
actually called the function. There was also a redundant block of
assertions that just asserted the initial state of the test data.
We have been running into the problem in
pallets/flask-sqlalchemy#518, where
our page loads very slowly when viewing a single page of notifications
for a service in the admin app. Tracing this back with SQL
EXPLAIN ANALYZE, I can see that getting the notifications takes about
a second, but the second query to count how many notifications there
are (to work out if there is a next page of pagination) can take up
to 100 seconds.
As suggested in that issue, we do the pagination ourselves.
Our pagination doesn't need us to know exactly how many notifications
there are, just whether there are any on the next page, and that can
be done without running the slow query to count the total number of
notifications by using `count_pages=False`.
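Knowing whether a next page exists only needs one extra row, not a
full count. A minimal sketch of the idea (the helper is made up; our
real change just passes `count_pages=False` to the existing DAO
function):

```python
def paginate_without_count(query, page, per_page):
    # Fetch one row more than the page needs: if it comes back, a
    # next page exists, and we never run the expensive COUNT query.
    items = query.limit(per_page + 1).offset((page - 1) * per_page).all()
    has_next = len(items) > per_page
    return items[:per_page], has_next
```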
This commit is analogous to
c68d1a2f23
The only difference is that in that case, the pagination links are
used to show prev and/or next links in the admin app. In this case,
the pagination links are only used to see if there is a page 2, and
if there is, say that we are only showing the first 50 results.
This includes performance improvements for RecipientCSV, which may
reduce the processing time in some edge cases - this depends on
whether the Admin app rejects CSVs with these edge cases.
In response to [1].
[1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988
It turns out the code that inspired this new alert - in the old
"timeout-sending-notifications" task - was actually redundant as
we already have a task to "replay" notifications still in "created",
which is much better than just alerting about them.
It's possible the replayed notifications will also fail, but in
both cases we should see some kind of error due to this, so I don't
think we're losing anything by not having an alert.
This removes 3 duplicate instances of the same code, which is still
tested implicitly via test_process_ses_receipt_tasks [1]. In the
next commit we'll make this test more explicit, to reflect that it's
now being reused elsewhere and shouldn't change arbitrarily.
We do lose the "print" statement from the command instance of the
code, but I think that's a very tolerable loss.
[1]: 16ec8ccb8a/tests/app/celery/test_process_ses_receipts_tasks.py (L94)
This now matches the behaviour of the test above it: mocking out
the DAO function in order to focus on the specific behaviour of
the function under test.
These scenarios are already covered by the DAO tests. It's enough
to just check the DAO function is called as expected.
While sometimes it can be better to have more end-to-end tests, the
convention across much of this app is to do unit tests.
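For example, the shape of such a test with pytest-mock (all names
here are illustrative):

```python
def test_timeout_notifications_calls_dao_with_limit(mocker):
    # The DAO's own behaviour is covered by the DAO tests; here we
    # only check that it's called as expected.
    mock_dao = mocker.patch(
        "app.celery.nightly_tasks.dao_timeout_notifications",  # hypothetical path
        return_value=0,
    )

    timeout_notifications()  # the function under test

    mock_dao.assert_called_once()
```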
I find it really difficult to visually parse test files unless we
have a consistent convention for how we name our test functions.
In most of our tests the name of the test function starts with the
name of the function under test.
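For example, tests for a hypothetical `dao_timeout_notifications`
would all share its name as a prefix:

```python
def test_dao_timeout_notifications_skips_recent_notifications():
    ...

def test_dao_timeout_notifications_respects_the_limit():
    ...
```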
It's no longer necessary to have a separate function that's now
only called once. While sometimes the separation can bring clarity,
here I think it's clearer to have all the code in one place, and
avoid the functools complexity we had before.
This is so we can use it to address issues highlighted by the new
alert, if it's not possible to actually send the notifications,
e.g. if they are somehow 'invalid'.
Previously this was added for a one-off use case [1]. This rewrites
the task to operate on arbitrary notification IDs instead of client
refs, which aren't always present for notifications we may want to
send / replay callbacks for. Since the task may now need to work on
notifications from more than one service, I had to restructure it to
cope with multiple callback APIs.
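A rough sketch of the new shape (every helper name here is
illustrative, not the real code):

```python
from collections import defaultdict

def send_delivery_status_callbacks(notification_ids):
    # Notifications may belong to different services, so group them
    # and look up each service's callback API once.
    notifications = dao_get_notifications_by_ids(notification_ids)  # hypothetical
    by_service = defaultdict(list)
    for notification in notifications:
        by_service[notification.service_id].append(notification)

    for service_id, group in by_service.items():
        callback_api = get_callback_api_for_service(service_id)  # hypothetical
        if callback_api is None:
            continue  # services without a callback API get nothing
        for notification in group:
            send_callback(notification, callback_api)  # hypothetical
```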
Note that, in the test, I've chosen to do a chain of invocations and
assertions, rather than duplicate a load of boilerplate or introduce
funky parametrize flags for a service with/out a callback API. We'll
refactor this in a later commit.
[1]: e95740a6b5
This will log an error when email or SMS notifications have been
stuck in 'created' for too long - normally they should be 'sending'
in seconds, noting that we have a goal of < 10s wait time for most
notifications being processed on our platform.
In the next commits we'll decouple similar functionality from the
existing 'timeout-sending-notifications' task.
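A sketch of the check (the task name, threshold and imports are
illustrative):

```python
from datetime import datetime, timedelta

from flask import current_app

from app import notify_celery  # illustrative app-level imports
from app.models import Notification

@notify_celery.task(name="check-for-notifications-stuck-in-created")
def check_for_notifications_stuck_in_created():
    threshold = datetime.utcnow() - timedelta(minutes=1)  # illustrative
    stuck = Notification.query.filter(
        Notification.notification_type.in_(["email", "sms"]),
        Notification.status == "created",
        Notification.created_at < threshold,
    ).count()
    if stuck:
        current_app.logger.error(
            "%s notifications have been in 'created' for over a minute",
            stuck,
        )
```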
Just so other people don’t have to merge these changes.
The breaking changes don’t affect this repo because the API doesn’t:
- check the service guestlist before sending a message
- do any visual preview of emergency alert messages
> **51.0.0**
> - Initial argument to RecipientCSV renamed from whitelist to guestlist, in other words consuming code should call RecipientCSV(guestlist=['test@example.com'])
> - RecipientCSV.whitelist property renamed to RecipientCSV.guestlist
>
> **50.0.0**
> - Make icon in broadcast_preview_template.jinja2 an inline SVG (requires changes to the CSS of consumer code)
>
> **49.1.0**
> - Add ttl_in_seconds argument to RequestCache.set to let users specify a custom TTL
This commit also changes the format of the line in the requirements
file, copying https://github.com/alphagov/notifications-admin/pull/4074/files
These pagination fields - the total count of notifications and the
link to the final page of results - don't appear to be used anywhere
in the admin app, and this route is only used by the admin app.
Therefore it is safe to remove them.
We remove them because calculating the total number of notifications,
or the final page number of results, can be particularly slow for
services with very many notifications: for example, 100 seconds for
a service with 500k notifications sent in the past 7 days.
Given neither are being used, this will give us the potential in
the next commit to reduce the number of slow queries and improve
page load times.
Note, I've kept the scope small by only introducing the new
pagination function for this one endpoint, but there could be scope
in future to move all pagination over to the new function if
appropriate.
They share a lot with the reporting tasks (creating ft_billing and
ft_notification_status), in that they're run nightly, take a long
time, and we see error messages if they get run multiple times (due
to the visibility timeout).
The periodic app has two concurrent processes - previously there was
just one delete task, which would use one of those processes, while the
other process would pick up anything else on the queue (at that time of
night, the regular provider switch checks and scheduled job checks).
However, when we switched to running the three delete notification types
separately, we saw visibility timeout issues - three tasks would be
created, all three would be picked up by one celery instance, the two
worker processes would start on two of them, and the third would sit on
the box, wait longer than the visibility timeout to be picked up (and
acknowledged), and so SQS would assume the task was lost and replay it.
it's queues all the way down!
By putting them on the reporting worker we can take advantage of the
tuning on that app (for example setting the prefetch multiplier to
one), which is designed to run large tasks. We've also got more
concurrent workers on this box, so we can run all three tasks at once.
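The relevant tuning is roughly this (a sketch of the Celery settings
involved, not our actual config):

```python
# Reserving only the task each process is about to run means a long
# delete task can't sit unacknowledged in the internal queue past
# the SQS visibility timeout.
worker_prefetch_multiplier = 1

# Illustrative: enough worker processes to run all three delete
# tasks at once.
worker_concurrency = 4
```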
This is a similar PR to https://github.com/alphagov/notifications-api/pull/2284.
When using flask-sqlalchemy to get a `Pagination` object, by default
it will run two queries:
1. Get the page of results that you are asking for
2. If the number of results is equal to the page size, then it will
issue a second query that will count the total number of results
Getting the total number of results is only useful if
- you need to show how many results there are
- you need to know if there is a next page of results (flask-sqlalchemy
uses the total to work out how many pages there are altogether,
which may not be the most efficient way of working out if there
is a next page or not but that is what it currently does).
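Simplified, `paginate()` does something like this (paraphrasing the
flask-sqlalchemy source referenced below):

```python
# Query 1: the page of results itself.
items = query.limit(per_page).offset((page - 1) * per_page).all()

# Query 2: only skipped for a short first page; otherwise a COUNT
# over the whole result set, which is the slow part.
if page == 1 and len(items) < per_page:
    total = len(items)
else:
    total = query.order_by(None).count()
```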
Looking at the `get_notifications` route, it does
not use `paginated_notifications.total` or
`paginated_notifications.has_next` and therefore we have no use
for the second query to get the total number of results.
We can stop this additional query by setting `count_pages=False`,
which will hopefully give us some performance improvements, in
particular for services which send a lot of notifications.
Flask-SQLAlchemy references:
818c947b66/src/flask_sqlalchemy/__init__.py (L478)
818c947b66/src/flask_sqlalchemy/__init__.py (L399)
Note, I have checked the other uses of `get_notifications_for_service`
and the other cases are currently using the total or next page so
this approach is not something we can take with them.
We used to do this until April 2020. Let's try doing it again.
Back then, we had problems with timing. We did two things in spring
2020:
- We moved to using an intermediary temp table [1]
- We stopped the tasks being parallelised [2]
However, it turned out the real time saving was from changing what
services we delete for [3]. The task was actually CPU-bound rather than
DB-bound, so that's probably why having the tasks in parallel wasn't
helping, since they were all competing for the same CPU. It's worth
trying the parallel steps again now that we're no longer CPU bound.
Note: Temporary tables are in their own postgres schema, and are only
viewable by the current session (session == connection. Each celery
worker process has its own db connection). We don't need to worry about
separate workers both trying to use the same table at once.
I've also added an "ON COMMIT DROP" clause to the table definition,
just to ensure it doesn't persist past the task even if there's an
exception. (This also drops on rollback.)
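The table definition ends up looking roughly like this (illustrative
SQL with the columns and conditions simplified):

```python
from sqlalchemy import text

db.session.execute(
    text(
        """
        CREATE TEMPORARY TABLE notifications_to_delete
        ON COMMIT DROP
        AS SELECT id FROM notifications
        WHERE notification_type = :type AND created_at < :cutoff
        """
    ),
    {"type": notification_type, "cutoff": cutoff},
)
```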
Cronitor looks at the three functions separately, so we don't need
to worry about the main task taking milliseconds where it used to
take hours, as the main task itself isn't monitored.
I've also removed some redundant exception logs.
[1] https://github.com/alphagov/notifications-api/pull/2767
[2] https://github.com/alphagov/notifications-api/pull/2798
[3] https://github.com/alphagov/notifications-api/pull/3381
TL;DR
After a chat with some team members we've decided to double the concurrency of the delivery-worker-reporting app from 2 to 4. Looking at the memory usage during the reporting task runs, we don't believe this to be a risk. There are some other things to look at, but this could be a quick win in the short term.
Longer read:
Every night we have 2 "reporting" tasks that run.
- create-nightly-billing starts at 00:15
- populates data for ft_billing for the previous days.
- 4 days for email
- 4 days for sms
- 10 days for letters
- create-nightly-notification-status starts at 00:30
- populates data for ft_notification_status for the previous days.
- 4 days for email
- 4 days for sms
- 10 days for letters
These tasks are picked up by the `notify-delivery-worker-reporting` app; we run 3 instances with a concurrency of 2.
This means that we have 6 worker threads that pick up the 18 tasks created at 00:15 and 00:30.
Each celery main thread picks up 10 tasks off the queue; the 2 worker threads start working on a task each and acknowledge those tasks to SQS. Meanwhile the other 8 tasks wait in the internal celery queue and no acknowledgement is sent to SQS. As each task completes, a worker picks up a new task from the internal queue and acknowledges it.
If a task is kept in the Celery internal queue for longer than the 5 minute visibility timeout, SQS will assume the task has not completed and put it back on the queue, therefore creating a duplicate task.
At some point all the tasks are completed, some of them twice.
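The arithmetic behind "picks up 10 tasks", assuming Celery's
defaults:

```python
worker_concurrency = 2           # worker processes per instance
worker_prefetch_multiplier = 4   # Celery's default

# Messages a main process reserves without acknowledging them:
prefetched = worker_concurrency * worker_prefetch_multiplier  # 8

# Plus the tasks actually executing (acknowledged when they start):
in_hand = prefetched + worker_concurrency  # 10
```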
This adds a filter to the `app.dao.notifications_dao.is_delivery_slow_for_providers` query to improve its performance. Adding `Notification.notification_type == 'sms'` to the query improves performance: some analysis shows a 500ms improvement, which is worthwhile given the query runs once a minute.
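A sketch of the change (simplified; the real query has other
conditions):

```python
# Restricting the query to SMS rows up front is what gives the
# ~500ms improvement described above.
query = query.filter(Notification.notification_type == "sms")
```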
When we first started recording the details of the agreements that
were signed by organisations, we stored a copy of the signed agreement
in Google Drive. Later, we switched to storing the details in the
database instead.
This adds a command which is designed to be run once, and which
updates the database for the organisations which had the details of
who accepted the agreement, and when, stored in Google Drive.