Commit Graph

4857 Commits

Author SHA1 Message Date
Ben Thorner
c8cf057eba Record providers we time out notifications for
This will help us monitor issues with delivery receipts and keep
track of provider performance over time.

I'm not concerned about performance here:

- The number of notifications to time out is usually small.
- This task only runs once a day.
- Calls to StatsD are quick and cheap.
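A minimal sketch of what recording this might look like, assuming a `statsd_client` with an `incr` method; the function name and stat key are assumptions, not the actual code:

    from collections import Counter

    def record_timed_out_providers(timed_out_notifications, statsd_client):
        # Count the timed-out notifications per provider, then emit one
        # StatsD increment per provider so we can chart them over time.
        counts = Counter(n.sent_by for n in timed_out_notifications if n.sent_by)
        for provider, count in counts.items():
            statsd_client.incr(f"timeout-sending.{provider}", count)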
2021-12-14 13:04:39 +00:00
Ben Thorner
11278c47f5 Replace log with StatsD gauge for slow delivery
A gauge is more useful as we can visualise it and combine it with
other stats - we already have other stats for the total number of
notifications sent by provider, and we can extrapolate the number
of slow notifications using this, if needed.

We also still have logs to say the task is running, as well as a
log in the calling code when we actually make a switch [1], so
we're not losing anything by removing the log here.

[1]: a9306c4557/app/celery/scheduled_tasks.py (L117)
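As a rough sketch, the gauge might be emitted per provider along these lines (the stat key and the `statsd_client` interface are assumptions):

    def record_slow_delivery(slow_counts_by_provider, statsd_client):
        # slow_counts_by_provider: e.g. {"provider-a": 12, "provider-b": 3}
        # A gauge records the current value, so it can be graphed and
        # combined with the existing per-provider send totals.
        for provider, count in slow_counts_by_provider.items():
            statsd_client.gauge(f"slow-delivery.{provider}", count)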
2021-12-14 13:03:43 +00:00
Ben Thorner
87cd40d00a Scale timeout task to work on arbitrary volumes
Previously this was limited to 500K notifications. While we don't
expect to reach this limit, it's not impossible e.g. if we had a
repeat of the incident where one of our providers stopped sending
us status updates. Although that's not great, it's worse if our
code can't cope with the unexpectedly high volume.

This reuses the technique we have elsewhere [1] to keep processing
in batches until there's nothing left. Specifying a cutoff point
means the total amount of work to do can't keep growing.

[1]: 2fb432adaf/app/dao/notifications_dao.py (L441)
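Roughly, the batching technique reads like this sketch (the DAO helper name, the limit, and the timeout period are assumptions):

    from datetime import datetime, timedelta

    def timeout_notifications():
        # Fix the cutoff once, so repeated batches don't keep picking up
        # notifications that cross the threshold while we're processing.
        cutoff_time = datetime.utcnow() - timedelta(days=3)
        while True:
            # hypothetical DAO helper returning how many rows it updated
            updated_count = dao_timeout_notifications(cutoff_time, limit=100_000)
            if not updated_count:
                break  # nothing left to time out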
2021-12-13 17:14:28 +00:00
Ben Thorner
2adaaac3ae Remove redundant conditions for update query
Filtering by ID is enough, noting the other conditions were the
same between both queries.
2021-12-13 17:03:07 +00:00
Ben Thorner
c8ebb365d4 Make limit of DAO timeout function more obvious
We're going to iterate on how we use the function with a limit, so we
shouldn't say it's "temporary" anymore. We don't need to change the
default, but having it in the function parameters makes it easier to
see the function doesn't time out all notifications, just some.
2021-12-13 17:01:41 +00:00
Ben Thorner
76aeab24ce Rewrite DAO timeout method to take cutoff_time
Previously we specified the period and calculated the cutoff time
in the function. Passing it in means we can run the method multiple
times and avoid getting "new" notifications to time out in the time
it takes to process each batch.
2021-12-13 16:56:21 +00:00
Ben Thorner
3bcaf8330e Simplify comment for DAO timeout function 2021-12-13 16:39:55 +00:00
Ben Thorner
2fb432adaf Merge pull request #3383 from alphagov/email-sms-created-alert-180344153
Add new log / alert for 'created' email / SMS
2021-12-13 12:56:05 +00:00
David McDonald
c25585fd60 Merge pull request #3394 from alphagov/dont-count-pages
Improve response times for figuring out pagination links by doing the pagination ourselves rather than using Flask-SQLAlchemy
2021-12-13 11:43:02 +00:00
David McDonald
eba625a9f5 Merge pull request #3389 from alphagov/improve-pagination-queries
Set `count_pages` as False to stop running of redundant query
2021-12-13 11:41:33 +00:00
David McDonald
7d8eed8228 Optimise queries run for creating pagination links
We have been running into the problem described in
pallets/flask-sqlalchemy#518, where
our page loads very slowly when viewing a single page of notifications
for a service in the admin app. Tracing this back and using SQL
EXPLAIN ANALYZE, I can see that getting the notifications takes about
a second, but the second query to count how many notifications there
are (to work out if there is a next page of pagination) can take up
to 100 seconds.

As suggested in that issue, we do the pagination ourselves.
Our pagination doesn't need to know exactly how many notifications
there are, just whether there are any on the next page, and that can
be done without running the slow query to count the total number of
notifications, by using `count_pages=False`.

This commit is analogous to
c68d1a2f23

The only difference is that in that case, the pagination links are
used to show prev and/or next links in the admin app. In this case,
the pagination links are only used to see if there is a page 2, and
if there is, say that we are only showing the first 50 results.
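The underlying idea, sketched here with hypothetical names, is to fetch one row more than the page size, so we know whether a next page exists without a COUNT(*):

    def paginate_without_count(query, page, per_page):
        # Fetch per_page + 1 rows: the extra row tells us whether there
        # is a next page, without counting the whole result set.
        offset = (page - 1) * per_page
        rows = query.limit(per_page + 1).offset(offset).all()
        has_next = len(rows) > per_page
        return rows[:per_page], has_next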
2021-12-10 17:47:27 +00:00
David McDonald
edadeb9131 Use get_prev_next_pagination_links when searching by to field
The only change in behaviour is that we are no longer including a
`last` pagination link.

This is OK because the frontend doesn't use it, just the prev and
next links as per
https://github.com/alphagov/notifications-admin/blob/master/app/main/views/jobs.py#L248
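For illustration only (the real helper may differ), the prev/next links can be built without knowing the total number of results:

    from flask import url_for

    def get_prev_next_pagination_links(current_page, next_page_exists, endpoint, **kwargs):
        # Only ever emits 'prev' and 'next'; there is no 'last' link,
        # so no total count is needed.
        links = {}
        if current_page > 1:
            links["prev"] = url_for(endpoint, page=current_page - 1, **kwargs)
        if next_page_exists:
            links["next"] = url_for(endpoint, page=current_page + 1, **kwargs)
        return links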
2021-12-10 12:29:55 +00:00
David McDonald
ec6ed3958c Move get_prev_next_pagination_links to utils
This will mean it can later be reused wherever we want.
2021-12-10 12:26:57 +00:00
Ben Thorner
a8edfeb941 Remove command to replay callbacks
In response to [1].

I've already removed the runbook that referred to this.

[1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r765644576
2021-12-09 10:46:19 +00:00
David McDonald
1973994516 Merge pull request #3391 from alphagov/pagination-approach-change
Pagination approach change for `get_notifications_for_service`
2021-12-09 10:43:14 +00:00
Ben Thorner
ab4cb029df Remove alert for email / sms in created
In response to [1].

[1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988

It turns out the code that inspired this new alert - in the old
"timeout-sending-notifications" task - was actually redundant as
we already have a task to "replay" notifications still in "created",
which is much better than just alerting about them.

It's possible the replayed notifications will also fail, but in
both cases we should see some kind of error due to this, so I don't
think we're losing anything by not having an alert.
2021-12-06 14:11:42 +00:00
Ben Thorner
04da017558 DRY-up conditionally creating callback tasks
This removes 3 duplicate instances of the same code, which is still
tested implicitly via test_process_ses_receipt_tasks [1]. In the
next commit we'll make this test more explicit, to reflect that it's
now being reused elsewhere and shouldn't change arbitrarily.

We do lose the "print" statement from the command instance of the
code, but I think that's a very tolerable loss.

[1]: 16ec8ccb8a/tests/app/celery/test_process_ses_receipts_tasks.py (L94)
2021-12-06 14:11:34 +00:00
Ben Thorner
97b58ed4c3 Remove unnecessary _timeout partial function
It's no longer necessary to have a separate function that's now
only called once. While sometimes the separation can bring clarity,
here I think it's clearer to have all the code in one place, and
avoid the functools complexity we had before.
2021-12-06 14:00:37 +00:00
Ben Thorner
0318229216 Stop 'timing out' old 'created' notifications
This is being replaced with a new alert and runbook [1]. It's not
always appropriate to change the status to 'technical-failure', and
the new alert means we'll act to fix the underlying issue promptly.

We'll look at tidying up the remaining code in the next commits.

[1]: https://github.com/alphagov/notifications-manuals/wiki/Support-Runbook#deal-with-email-or-sms-still-in-created
2021-12-06 14:00:36 +00:00
Ben Thorner
2acc4ee67d Repurpose command to replay notification callbacks
This is so we can use it to address issues highlighted by the new
alert, if it's not possible to actually send the notifications e.g.
if they are somehow 'invalid'.

Previously this was added for a one-off use case [1]. This rewrites
the task to operate on arbitrary notification IDs instead of client
refs, which aren't always present for notifications we may want to
send / replay callbacks for. Since the task may now need to work on
notifications from more than one service, I had to restructure it to
cope with multiple callback APIs.

Note that, in the test, I've chosen to do a chain of invocations and
assertions, rather than duplicate a load of boilerplate or introduce
funky parametrize flags for a service with/without a callback API. We'll
refactor this in a later commit.

[1]: e95740a6b5
2021-12-06 14:00:35 +00:00
Ben Thorner
f96ba5361a Add new task to alert about created email / sms
This will log an error when email or SMS notifications have been
stuck in 'created' for too long - normally they should be 'sending'
within seconds, noting that we have a goal of < 10s wait time for most
notifications processed on our platform.

In the next commits we'll decouple similar functionality from the
existing 'timeout-sending-notifications' task.
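As a sketch of the check (the DAO helper, the threshold, and the log message are assumptions):

    from flask import current_app

    def check_for_stuck_notifications():
        # Hypothetical helper: find email/sms notifications that have been
        # in 'created' for longer than the threshold, and alert via an
        # error log that our monitoring will pick up.
        stuck = dao_get_notifications_stuck_in_created(older_than_minutes=60)
        if stuck:
            current_app.logger.error(
                "%s email/sms notifications still in 'created' after 60 minutes", len(stuck)
            )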
2021-12-06 14:00:31 +00:00
David McDonald
e8dd136678 Document an area that may be building pagination links when not needed 2021-12-03 17:32:40 +00:00
David McDonald
c68d1a2f23 Optimise queries run for creating pagination links
We have been running into the problem described in
https://github.com/pallets/flask-sqlalchemy/issues/518, where
our page loads very slowly when viewing a single page of notifications
for a service in the admin app. Tracing this back and using SQL
EXPLAIN ANALYZE, I can see that getting the notifications takes about
a second, but the second query to count how many notifications there
are (to work out if there is a next page of pagination) can take up
to 100 seconds.

As suggested in that issue, we do the pagination ourselves.
Our pagination doesn't need to know exactly how many notifications
there are, just whether there are any on the next page, and that can
be done without running the slow query to count the total number of
notifications, by using `count_pages=False`.
2021-12-03 17:32:39 +00:00
David McDonald
989ef9c21a Remove last and total keys from pagination links
These don't appear to be used anywhere in the admin app and this
route is only used by the admin app. Therefore it is safe to remove
them.

We remove them because calculating the total number of notifications
or the final page number of results can be particularly slow for
services with very many notifications, for example 100 seconds
for a service with 500k notifications sent in the past 7 days.

Given neither is being used, this will give us the potential in
the next commit to reduce the number of slow queries and improve
page load times.

Note, I've kept the scope small by only introducing the new
pagination function for this one endpoint, but there could be scope
in future to move all pagination to the new function if
appropriate.
2021-12-03 17:26:49 +00:00
Leo Hemsted
f6d210f1e6 put delete tasks on the reporting worker
they share a lot with the reporting tasks (creating ft_billing and
ft_notification_status), in that they're run nightly, take a long time,
and we see error messages if they get run multiple times (due to
visibility timeout).

The periodic app has two concurrent processes - previously there was
just one delete task, which would use one of those processes, while the
other process would pick up anything else on the queue (at that time of
night, the regular provider switch checks and scheduled job checks).
However, when we switched to running the three delete notification types
separately, we saw visibility timeout issues - three tasks would be
created, all three would be picked up by one celery instance, the two
worker processes would start on two of them, and the third would sit on
the box, wait longer than the visibility timeout to be picked up (and
acknowledged), and so SQS would assume the task was lost and replay it.

it's queues all the way down!

By putting them on the reporting worker we can take advantage of tuning
that app (for example setting the prefetch multiplier to one) which is
designed to run large tasks. We've also got more concurrent workers on
this box, so we can run all three tasks at once.
2021-12-03 13:28:16 +00:00
David McDonald
ad274ee887 Set count_pages as False to stop running of redundant query
This is a similar PR to https://github.com/alphagov/notifications-api/pull/2284.

When using flask-sqlalchemy to get a `Pagination` object, by default
it will run two queries

1. Get the page of results that you are asking for
2. If the number of results is equal to the page size, then it will
  issue a second query that will count the total number of results

Getting the total number of results is only useful if
- you need to show how many results there are
- you need to know if there is a next page of results (flask-sqlalchemy
  uses the total to work out how many pages there are altogether,
  which may not be the most efficient way of working out if there
  is a next page or not but that is what it currently does).

Looking at the `get_notifications` route, it does
not use `paginated_notifications.total` or
`paginated_notifications.has_next` and therefore we have no use
for the second query to get the total number of results.

We can stop this additional query by setting `count_pages=False`
which will hopefully give us some performance improvements, in
particular for services which send a lot of notifications.

Flask sqlalchemy references:
818c947b66/src/flask_sqlalchemy/__init__.py (L478)
818c947b66/src/flask_sqlalchemy/__init__.py (L399)

Note, I have checked the other uses of `get_notifications_for_service`
and the other cases are currently using the total or next page so
this approach is not something we can take with them.
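A hedged sketch of what the call site might look like once the count is skipped; the exact signature of `get_notifications_for_service` is an assumption based on the message above:

    paginated_notifications = notifications_dao.get_notifications_for_service(
        service_id,
        page=page,
        page_size=page_size,
        count_pages=False,  # skip the second "SELECT count(*)" query
    )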
2021-12-01 16:51:05 +00:00
Leo Hemsted
6bbec9f103 make delete notification tasks parallel by notification type
we used to do this until apr 2020. Let's try doing it again.
Back then, we had problems with timing. We did two things in spring
2020:

We moved to using an intermediary temp table [1]
We stopped the tasks being parallelised [2]

However, it turned out the real time saving was from changing what
services we delete for [3]. The task was actually CPU-bound rather than
DB-bound, so that's probably why having the tasks in parallel wasn't
helping, since they were all competing for the same CPU. It's worth
trying the parallel steps again now that we're no longer CPU bound.

Note: Temporary tables are in their own postgres schema, and are only
viewable by the current session (session == connection. Each celery
worker process has its own db connection). We don't need to worry about
separate workers both trying to use the same table at once.

I've also added a "DROP ON COMMIT" directive to the table definition
just to ensure it doesn't persist past the task even if there's an
exception. (This also drops on rollback).

Cronitor looks at the three functions separately so we don't need to worry
about the main task taking milliseconds where it used to take hours as
it isn't monitored itself.

I've also removed some redundant exception logs.

[1] https://github.com/alphagov/notifications-api/pull/2767
[2] https://github.com/alphagov/notifications-api/pull/2798
[3] https://github.com/alphagov/notifications-api/pull/3381
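In SQL terms, the session-local temp table might look something like this sketch (table and column names are assumptions):

    from sqlalchemy import text

    def delete_old_notifications(db_session, notification_type, cutoff):
        # ON COMMIT DROP means the table disappears when the transaction
        # ends, whether it commits or rolls back. Temp tables are only
        # visible to the current connection, so parallel workers on
        # separate connections never collide.
        db_session.execute(
            text(
                "CREATE TEMPORARY TABLE notifications_to_delete ON COMMIT DROP AS "
                "SELECT id FROM notifications "
                "WHERE notification_type = :type AND created_at < :cutoff"
            ),
            {"type": notification_type, "cutoff": cutoff},
        )
        db_session.execute(
            text("DELETE FROM notifications WHERE id IN (SELECT id FROM notifications_to_delete)")
        )
        db_session.commit()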
2021-12-01 14:28:08 +00:00
Rebecca Law
101498ec84 Improve query performance
Add a filter to the `app.dao.notifications_dao.is_delivery_slow_for_providers` query to improve performance. Adding Notifications.notification_type = 'sms' to the query improves performance: analysis shows around a 500ms improvement, which is a good thing, especially as the query is run once a minute.
2021-11-30 16:42:32 +00:00
Katie Smith
250ce38cf2 Remove unnecessary list-routes command
Since this was added, Flask has gained a built-in command to list
the routes, `flask routes`, so this is not needed.
2021-11-30 11:11:49 +00:00
Katie Smith
6d9f2c27d9 Add command to populate organisation table with agreement details
When we first started recording the details of the agreements that
were signed by organisations, we stored a copy of the signed agreement
in Google drive. Later, we switched to storing the details in the
database instead.

This adds a command which is designed to be run once and which updates
the database for the organisations which had the details of who accepted
the agreement and when stored in Google drive.
2021-11-30 11:11:49 +00:00
David McDonald
16ec8ccb8a Merge pull request #3382 from alphagov/2dp
Report processing stats to 2dp rather than 1dp
2021-11-26 09:48:05 +00:00
David McDonald
648490bf62 Report processing stats as floats rather than 1dp
We are starting to see lots of 100.0% values in the current table,
and we think this looks suspiciously good, so it is beneficial to
change it to 2dp so that we get a few more non-100.0% values.

For the admin app to be able to show things to 2dp, we need to
give at least 2dp of accuracy otherwise we are losing 1dp of
granularity.

The approach is to just give all the granularity available by
returning the exact result from the DB and then the admin can
choose how many dps to use.
2021-11-25 17:14:34 +00:00
Leo Hemsted
bab659c677 reduce number of services we try and delete notifications for
TLDR: Don't return as many services, and only return their IDs and not
the whole service objects.

Context:

the delete notifications nightly task has been taking longer and longer,
and to delete all three notification types in sequence it now takes up
to 8 hours.

This is because we were retrieving all services, loading them into
memory on the worker, and then trying to delete notifications for each
service in turn.

While it does use a fair chunk of IOPS/CPU on our postgres db, we're not
anywhere close to capacity on those (20% CPU, 4k IOPS out of 30k max)[1]

The real issue appears to be that the task is CPU bound on the periodic
worker - we see the worker spike up to 100% CPU regularly across the
whole 3am-11am period.

We also noticed that for each notification type the task first processes
services with custom data retention (not many but some of the biggest
users), then deals with all other services. We can see from looking at
kibana that, for example, the task starts at 3am, and the custom data
retention service email deletions are finished by 3:12am. The rest of
the emails don't get deleted until 5am, so we knew that the problem is
with how it handles the other services.

There are currently 17000 services in the database. On a typical day,
~800 services will have notifications that are over 7 days old and need
to be deleted. By only returning these services, we reduce the amount of
data transfer and serialisation that needs to happen. It takes about two
minutes to retrieve the distinct service ids from the notifications
table for sms notifications, but the result is only 5% of the size of
the full list, so it cuts down on a lot of processing.

Also, by only returning service_ids rather than the whole `Service`
model we avoid sqlalchemy needing to do lots of data serialisation, when
we were only using the `Service.id` field from that result anyway.

[1] https://admin.cloud.service.gov.uk/organisations/55b1eb7d-e4c5-4359-9466-dd3ca5b0e457/spaces/80d769ff-7b01-49a4-9fa4-f87edd5328f9/services/6093d337-6918-4b97-9709-97529114eb90/metrics
[2] https://grafana-paas.cloudapps.digital/d/_GlGBNbmk/notify-apps?orgId=2&refresh=5s&var-space=production&var-app=notify-delivery-worker-periodic&from=now-24h&to=now
[3] https://kibana.logit.io/s/9423a789-282c-4113-908d-0be3b1bc9d1d/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(message),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'%22Deleting%20email%20notifications%20for%20services%20without%20flexible%20data%20retention%22')),sort:!('@timestamp',desc))
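A rough sketch of the ids-only query (model and column names are assumptions based on the message above):

    from datetime import datetime, timedelta

    def dao_get_service_ids_with_old_notifications(session, days=7):
        # Return just the distinct service ids with notifications older
        # than the retention window, instead of loading every Service row.
        cutoff = datetime.utcnow() - timedelta(days=days)
        rows = (
            session.query(Notification.service_id)
            .filter(Notification.created_at < cutoff)
            .distinct()
            .all()
        )
        return [row.service_id for row in rows]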
2021-11-24 16:18:40 +00:00
David McDonald
18776e4160 Merge pull request #3377 from alphagov/zero-case-performance-page
Fix division by zero error on performance page
2021-11-22 13:44:32 +00:00
David McDonald
106187ba04 Fix division by zero error on performance page
For preview and staging environments, we often send no messages
in a single day. This is currently causing a `DivisionByZero` error
that is rendering the page with no results. This makes it impossible
to look at preview/staging and see if the performance page is
working correctly or not.

(psycopg2.errors.DivisionByZero) division by zero

[SQL: SELECT CAST(ft_processing_time.bst_date AS TEXT) AS date, ft_processing_time.messages_total AS ft_processing_time_messages_total, ft_processing_time.messages_within_10_secs AS ft_processing_time_messages_within_10_secs, (ft_processing_time.messages_within_10_secs / CAST(ft_processing_time.messages_total AS FLOAT)) * %(param_1)s AS percentage
FROM ft_processing_time
WHERE ft_processing_time.bst_date >= %(bst_date_1)s AND ft_processing_time.bst_date <= %(bst_date_2)s ORDER BY ft_processing_time.bst_date]
[parameters: {'param_1': 100, 'bst_date_1': datetime.date(2021, 11, 12), 'bst_date_2': datetime.date(2021, 11, 19)}]
(Background on this error at: http://sqlalche.me/e/14/9h9h)

I've fixed this by falling back to 100.0% for days when we send
no messages. There's maybe an argument that it should be N/A rather
than 100%, but I think it doesn't really matter, as this is only
going to affect preview and staging: we will never have a day
with no messages sent in production.
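One way to express the fallback in the query, sketched with SQLAlchemy 1.3-style `case` syntax and an assumed model name:

    from sqlalchemy import Float, case, cast

    percentage = case(
        [(FtProcessingTime.messages_total == 0, 100.0)],
        else_=(
            cast(FtProcessingTime.messages_within_10_secs, Float)
            / cast(FtProcessingTime.messages_total, Float)
        ) * 100,
    ).label("percentage")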
2021-11-22 11:11:52 +00:00
Chris Hill-Scott
c0742fe83d Pass polygons through if they’re small already
If a polygon is smaller than the largest polygon in our dataset of
simplified polygons then we’re only throwing away useful detail by
simplifying it.

We should still simplify larger polygons as a fallback, to avoid sending
anything to the CBC that we’re not sure it will like.

The thresholds here are low: we can raise them as we test and experiment
more.

Here’s some data about the Flood Warning Service polygons

Percentile | 80% | 90%   | 95%    | 98%     | 99%     | 99.9%
-----------|-----|-------|--------|---------|---------|---------
Point count| 226 | 401.9 | 640.45 | 1015.38 | 1389.07 | 3008.609

Percentile    | 80% | 90%   | 95%    | 98%     | 99%     | 99.9%
--------------|-----|-------|--------|---------|---------|---------
Polygon count | 2   | 3     | 5      | 8       | 10      | 40.469
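By analogy (using shapely rather than the real notifications-utils API; the threshold and tolerance values are illustrative assumptions):

    from shapely.geometry import Polygon

    def maybe_simplify(polygon: Polygon, max_point_count=250, tolerance=0.001):
        # Pass already-small polygons through untouched; only simplify
        # ones above the threshold, as a fallback for anything the CBC
        # might not accept.
        if len(polygon.exterior.coords) <= max_point_count:
            return polygon
        return polygon.simplify(tolerance, preserve_topology=True)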
2021-11-18 15:48:45 +00:00
Chris Hill-Scott
4feb3fdc10 Bump utils
This new version of utils implements the transformation of our polygons
to a Cartesian plane. In other words, it converts them from being
defined in spherical degrees to metres.

For the API this means our simplification will be slightly more
accurate.
2021-11-18 15:43:39 +00:00
Rebecca Law
30a5852685 Update the query to only return the count from the table since that is
all we care about.

https://www.pivotaltracker.com/story/show/180262357
2021-11-17 14:46:52 +00:00
Ben Thorner
bffca39223 Merge pull request #3373 from alphagov/centralise-celery-180213914
Use central NotifyCelery base class in utils
2021-11-17 12:09:28 +00:00
Ben Thorner
666ac1ab4f Log activity on all periodic Celery tasks
As stated in the comment, this would have been helpful during an
incident to give further reassurance that a task had at least
started running - at the time the only evidence for this was the
Cronitor dashboard itself, which we don't often look at.

I've removed other, equivalent "starting" logs, but kept those
that provide additional information in the log message.
2021-11-17 09:48:03 +00:00
Ben Thorner
528223ed61 Use central NotifyCelery base class in utils
Note that the new base class doesn't include a bespoke feature we
had here: 'log_on_worker_shutdown'. We've agreed it's reasonable
to remove it for now as it was introduced many years ago and its
use case is unclear - we can always add it back if needed.
2021-11-16 13:58:12 +00:00
Chris Hill-Scott
0236318189 Republish gov.uk/alerts every night to clear down planned tests
We have made it so that gov.uk/alerts shows a ‘1 planned test’ banner
for the whole of the day when there has been an operator test on that
day.

We need to remove the banner when the day is over.

The most straightforward way to do this is to republish the site at the
start of every day. The gov.uk/alerts code[1] will work out if there are
or aren’t any planned tests to show that day.

1. 5a274af6d0/app/models/alerts.py (L38-L44)
2021-11-15 14:23:32 +00:00
David McDonald
608ef12573 Merge pull request #3367 from alphagov/better-log-message
Improve log message searchability for duplicate receipts
2021-11-15 09:38:30 +00:00
David McDonald
c98996a461 Improve log message searchability for duplicate receipts
There were two problems with the existing message.

1. There was no space between the new status and the time taken
   which made reading and searching harder
2. The key bits of information (the before and after statuses) were
   separated by the time taken (which will always be unique), meaning
   you couldn't do an easy search for, say, a message in delivered
   that we are attempting to set to temporary-failure.
2021-11-12 14:06:38 +00:00
Ben Thorner
48e1482d90 Merge pull request #3366 from alphagov/celery-extend-request-id-180213914
Extend request tracing to cover Celery logs
2021-11-12 11:10:38 +00:00
Ben Thorner
d66c68d6d6 Merge pull request #3364 from alphagov/celery-headers-request-id-180213914
Move Celery task Request ID injection into headers
2021-11-12 11:10:29 +00:00
Ben Thorner
4a577eca62 Merge pull request #3359 from alphagov/improve-clarify-botocore-exception-180017131
Improve and clarify large task error handling
2021-11-12 11:10:17 +00:00
Ben Thorner
1872854a4e Improve and clarify large task error handling
Previously we were catching one type of exception if something went
wrong adding a notification to the queue for high volume services.
In reality there are two types of exception so this adds a second
handler to cover both.

For context, this is code we changed experimentally as part of the
upgrade to Celery 5 [1]. At the time we didn't check how the new
exception compared to the old one. It turns out they behaved the
same and we were always vulnerable to the scenario now covered by
the second exception, where the behaviour has changed in Celery 5 -
testing with a large task invocation gives...

Before (Celery 3, large-ish task):

    'process_job.apply_async(["a" * 200000])'...

    boto.exception.SQSError: SQSError: 400 Bad Request
    <?xml version="1.0"?><ErrorResponse xmlns="http://queue.amazonaws.com/doc/2012-11-05/"><Error><Type>Sender</Type><Code>InvalidParameterValue</Code><Message>One or more parameters are invalid. Reason: Message must be shorter than 262144 bytes.</Message><Detail/></Error><RequestId>96162552-cd96-5a14-b3a5-7f503300a662</RequestId></ErrorResponse>

Before (Celery 3, very large task):

    <hangs forever>

After (Celery 5, large-ish task):

    botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the SendMessage operation: One or more parameters are invalid. Reason: Message must be shorter than 262144 bytes.

After (Celery 5, very large task):

    botocore.parsers.ResponseParserError: Unable to parse response (syntax error: line 1, column 0), invalid XML received. Further retries may succeed:
    b'HTTP content length exceeded 1662976 bytes.'

[1]: 29c92a9e54
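Sketched below (the queue name and fallback are assumptions; `process_job` and the exception classes are the ones shown in the logs above):

    from botocore.exceptions import ClientError
    from botocore.parsers import ResponseParserError
    from flask import current_app

    try:
        process_job.apply_async([notification_data], queue="high-volume")
    except (ClientError, ResponseParserError):
        current_app.logger.exception("Task too large to queue, falling back")
        # fall back to the non-queued code path here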
2021-11-11 17:37:50 +00:00
Ben Thorner
ac06529128 Enable request tracing on Celery success/fail logs
Previously these logs wouldn't have a Request ID attached since the
Celery hooks run after the __call__ method where we enable request
tracing for normal application logs. For the failure log especially
it will be useful to have this feature.
2021-11-10 18:04:20 +00:00
Ben Thorner
369a9f7521 Refactor queue_name and request_id into properties
This reduces the complexity of the original functions, which will
go up a bit in the next commit.
2021-11-10 18:04:19 +00:00