Just so other people don’t have to merge these changes.
The breaking changes don’t affect this repo because the API doesn’t:
- check the service guestlist before sending a message
- do any visual preview of emergency alert messages
> **51.0.0**
> - Initial argument to RecipientCSV renamed from whitelist to guestlist, in other words consuming code should call RecipientCSV(guestlist=['test@example.com'])
> - RecipientCSV.whitelist property renamed to RecipientCSV.guestlist
>
> **50.0.0**
> - Make icon in broadcast_preview_template.jinja2 an inline SVG (requires changes to the CSS of consumer code)
>
> **49.1.0**
> - Add ttl_in_seconds argument to RequestCache.set to let users specify a custom TTL
This commit also changes the format of the line in the requirements
file, copying https://github.com/alphagov/notifications-admin/pull/4074/files
We have been running into the problem in
https://github.com/pallets/flask-sqlalchemy/issues/518 where
our page loads very slowly when viewing a single page of notifications
for a service in the admin app. Tracing this back and using SQL
explain analyze I can see that getting the notifications takes about
a second but the second query to count how many notifications there
are (to work out if there is a next page of pagination) can take up
to 100 seconds.
As suggested in that issue, we do the pagination ourselves.
Our pagination doesn't need to know exactly how many notifications
there are, just whether there are any on the next page, and that can
be determined without running the slow count query by passing
`count_pages=False`.
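The underlying idea is roughly this (a minimal sketch assuming a plain
SQLAlchemy query object; the helper name is made up, the real code exposes
it via the `count_pages=False` flag):

```python
def paginate_without_count(query, page, page_size=50):
    # Fetch one row more than a full page: if we get it, a next page exists.
    # This avoids running COUNT(*) over the whole notifications table.
    offset = (page - 1) * page_size
    rows = query.limit(page_size + 1).offset(offset).all()
    return rows[:page_size], len(rows) > page_size
```

The caller only needs the boolean to decide whether to render a "next page"
link.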
These don't appear to be used anywhere in the admin app and this
route is only used by the admin app. Therefore it is safe to remove
them.
We remove them because calculating the total number of notifications,
or the final page number of results, can be particularly slow for
services with very many notifications: for example, 100 seconds
for a service with 500k notifications sent in the past 7 days.
Given neither are being used, this will give us the potential in
the next commit to reduce the number of slow queries and improve
page load times.
Note, I've kept the scope small by only introducing the new
pagination function for this one endpoint, but there could be scope
in future to move all pagination over to the new function if
appropriate.
they share a lot with the reporting tasks (creating ft_billing and
ft_notification_status), in that they're run nightly, take a long time,
and we see error messages if they get run multiple times (due to
visibility timeout).
The periodic app has two concurrent processes - previously there was
just one delete task, which would use one of those processes, while the
other process would pick up anything else on the queue (at that time of
night, the regular provider switch checks and scheduled job checks).
However, when we switched to running the three delete notification types
separately, we saw visibility timeout issues - three tasks would be
created, all three would be picked up by one celery instance, the two
worker processes would start on two of them, and the third would sit on
the box, wait longer than the visibility timeout to be picked up (and
acknowledged), and so SQS would assume the task was lost and replay it.
it's queues all the way down!
By putting them on the reporting worker we can take advantage of tuning
that app (for example setting the prefetch multiplier to one) which is
designed to run large tasks. We've also got more concurrent workers on
this box, so we can run all three tasks at once.
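For reference, the tuning mentioned above is a standard Celery setting; a
minimal sketch (the real apps set this in their own config):

```python
from celery import Celery

app = Celery("reporting")
# With a prefetch multiplier of 1, each worker process reserves only one
# message from SQS at a time, so tasks don't sit unacknowledged on a busy box
# and get replayed after the visibility timeout.
app.conf.worker_prefetch_multiplier = 1
```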
We used to do this until April 2020. Let's try doing it again.
Back then, we had problems with timing. We did two things in spring
2020:
- We moved to using an intermediary temp table [1]
- We stopped the tasks being parallelised [2]
However, it turned out the real time saving was from changing what
services we delete for [3]. The task was actually CPU-bound rather than
DB-bound, so that's probably why having the tasks in parallel wasn't
helping, since they were all competing for the same CPU. It's worth
trying the parallel steps again now that we're no longer CPU bound.
Note: Temporary tables are in their own postgres schema, and are only
viewable by the current session (session == connection. Each celery
worker process has its own db connection). We don't need to worry about
separate workers both trying to use the same table at once.
I've also added an "ON COMMIT DROP" directive to the table definition
just to ensure it doesn't persist past the task even if there's an
exception. (This also drops the table on rollback.)
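A rough sketch of the pattern, assuming raw SQL run through the session
(table and column names are illustrative, not the real ones):

```python
from sqlalchemy import text

def collect_notifications_to_delete(session, service_id, cutoff):
    # The temp table is only visible to this session/connection, and
    # ON COMMIT DROP removes it at the end of the transaction, even if
    # the task raises and the transaction is rolled back.
    session.execute(
        text(
            "CREATE TEMPORARY TABLE notifications_to_delete ON COMMIT DROP AS "
            "SELECT id FROM notifications "
            "WHERE service_id = :service_id AND created_at < :cutoff"
        ),
        {"service_id": service_id, "cutoff": cutoff},
    )
```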
Cronitor looks at the three functions separately so we don't need to worry
about the main task taking milliseconds where it used to take hours as
it isn't monitored itself.
I've also removed some redundant exception logs.
[1] https://github.com/alphagov/notifications-api/pull/2767
[2] https://github.com/alphagov/notifications-api/pull/2798
[3] https://github.com/alphagov/notifications-api/pull/3381
TL;DR
After a chat with some team members we've decided to double the concurrency of the delivery-worker-reporting app from 2 to 4. Looking at the memory usage during the reporting task runs, we don't believe this to be a risk. There are some other things to look at, but this could be a quick win in the short term.
Longer read:
Every night we have 2 "reporting" tasks that run.
- create-nightly-billing starts at 00:15
  - populates data for ft_billing for the previous days:
    - 4 days for email
    - 4 days for sms
    - 10 days for letters
- create-nightly-notification-status starts at 00:30
  - populates data for ft_notification_status:
    - 4 days for email
    - 4 days for sms
    - 10 days for letters
These tasks are picked up by the `notify-delivery-worker-reporting` app; we run 3 instances with a concurrency of 2.
This means that we have 6 worker threads that pick up the 18 tasks created at 00:15 and 00:30.
Each celery main thread picks up 10 tasks off the queue; the 2 worker threads each start working on a task and acknowledge it to SQS. Meanwhile the other 8 tasks wait in the internal celery queue and no acknowledgement is sent to SQS. As each task is completed, a worker thread picks up a new task and acknowledges it.
If a task is kept in the Celery internal queue for longer than 5 minutes, the SQS visibility timeout expires, SQS assumes the task has not completed and puts it back on the queue, therefore creating a duplicate task.
At some point all the tasks are completed; some are completed twice.
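For context, the 5-minute window is the SQS visibility timeout configured on
the broker; a hedged sketch of the relevant Celery option (the real value and
where it's set may differ):

```python
from celery import Celery

app = Celery("reporting")
# If a reserved message isn't acknowledged within this window, SQS assumes the
# consumer has died and makes the message visible to other consumers again.
app.conf.broker_transport_options = {"visibility_timeout": 300}  # seconds
```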
Adding a filter to the `app.dao.notifications_dao.is_delivery_slow_for_providers` query to improve its performance. Adding `Notifications.notification_type = 'sms'` to the query improves performance: some analysis shows a ~500ms improvement, which is worthwhile given the query runs once a minute.
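A rough sketch of the shape of such a query (the import path, column names
and surrounding query are assumptions for illustration, not the real DAO
code):

```python
from sqlalchemy import func

from app.models import Notification  # assumed import path for the model

def count_slow_sms_for_provider(session, provider_name, sent_after):
    # Narrowing to sms rows up front lets postgres discard email and letter
    # notifications early, which is where the speed-up comes from.
    return (
        session.query(func.count(Notification.id))
        .filter(
            Notification.notification_type == "sms",
            Notification.sent_by == provider_name,
            Notification.sent_at >= sent_after,
        )
        .scalar()
    )
```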
When we first started recording the details of the agreements that
were signed by organisations, we stored a copy of the signed agreement
in Google drive. Later, we switched to storing the details in the
database instead.
This adds a command, designed to be run once, which updates the
database for the organisations whose agreement details (who accepted
the agreement and when) were stored in Google Drive.
This appears to not be thread safe: it started failing when run in
parallel with other tests in this PR [1]. We don't get much out of
using caplog over patching - it just proves our logging config isn't
swallowing the error logs, which we shouldn't need to test here.
[1]: https://github.com/alphagov/notifications-api/pull/3383
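For reference, a minimal sketch of the patching alternative (the fixture,
logger attribute and function under test are all placeholders):

```python
from unittest import mock

def test_logs_an_error_when_the_task_fails(notify_api):
    # Placeholder names throughout: patch the logger directly instead of
    # relying on caplog capturing records when tests run in parallel.
    with mock.patch.object(notify_api.logger, "exception") as mock_exception:
        run_task_that_should_log_an_exception()

    assert mock_exception.call_count == 1
```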
We are starting to see lots of 100.0% values in the current table,
which looks suspiciously good, so we think it is beneficial to change
it to 2dp so that we get a few more non-100.0% values.
For the admin app to be able to show things to 2dp, we need to
give at least 2dp of accuracy, otherwise we are losing 1dp of
granularity (for example, 99.96% rounded to 1dp already shows as
100.0%).
The approach is to just give all the granularity available by
returning the exact result from the DB and then the admin can
choose how many dps to use.
TLDR: Don't return as many services, and only return their IDs and not
the whole service objects.
Context:
the delete notifications nightly task has been taking longer and longer,
and to delete all three notification types in sequence it now takes up
to 8 hours.
This is because we were retrieving all services, loading them into
memory on the worker, and then trying to delete notifications for each
service in turn.
While it does use a fair chunk of IOPS/CPU on our postgres db, we're not
anywhere close to capacity on those (20% CPU, 4k IOPS out of 30k max)[1]
The real issue appears to be that the task is CPU bound on the periodic
worker - we see the worker spike up to 100% CPU regularly across the
whole 3am-11am period.
We also noticed that for each notification type the task first processes
services with custom data retention (not many but some of the biggest
users), then deals with all other services. We can see from looking at
kibana that, for example, the task starts at 3am, and the custom data
retention service email deletions are finished by 3:12am. The rest of
the emails don't get deleted until 5am, so we knew that the problem was
with how it handles the other services.
There are currently 17000 services in the database. On a typical day,
~800 services will have notifications that are over 7 days old and need
to be deleted. By only returning these services, we reduce the amount of
data transfer and serialisation that needs to happen. It takes about two
minutes to retrieve the distinct service ids from the notifications
table for sms notifications, but that is only 5% of the size of the full
list, so it cuts down on a lot of processing.
Also, by only returning service_ids rather than whole `Service`
models, we avoid sqlalchemy doing lots of data serialisation, since
we were only using the `Service.id` field from that result anyway.
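A hedged sketch of the shape of the new query (the import path is an
assumption and the real dao function will differ):

```python
from app.models import Notification  # assumed import path for the model

def get_service_ids_with_notifications_to_delete(session, notification_type, older_than):
    # Pull back only the distinct service ids that still have old rows, rather
    # than loading every Service object into the worker's memory.
    rows = (
        session.query(Notification.service_id)
        .filter(
            Notification.notification_type == notification_type,
            Notification.created_at < older_than,
        )
        .distinct()
        .all()
    )
    return [row.service_id for row in rows]
```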
[1] https://admin.cloud.service.gov.uk/organisations/55b1eb7d-e4c5-4359-9466-dd3ca5b0e457/spaces/80d769ff-7b01-49a4-9fa4-f87edd5328f9/services/6093d337-6918-4b97-9709-97529114eb90/metrics
[2] https://grafana-paas.cloudapps.digital/d/_GlGBNbmk/notify-apps?orgId=2&refresh=5s&var-space=production&var-app=notify-delivery-worker-periodic&from=now-24h&to=now
[3] https://kibana.logit.io/s/9423a789-282c-4113-908d-0be3b1bc9d1d/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(message),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'%22Deleting%20email%20notifications%20for%20services%20without%20flexible%20data%20retention%22')),sort:!('@timestamp',desc))
For preview and staging environments, we often send no messages
in a single day. This currently causes a `DivisionByZero` error,
which means the page renders with no results. That makes it impossible
to look at preview/staging and see whether the performance page is
working correctly or not.
(psycopg2.errors.DivisionByZero) division by zero
[SQL: SELECT CAST(ft_processing_time.bst_date AS TEXT) AS date, ft_processing_time.messages_total AS ft_processing_time_messages_total, ft_processing_time.messages_within_10_secs AS ft_processing_time_messages_within_10_secs, (ft_processing_time.messages_within_10_secs / CAST(ft_processing_time.messages_total AS FLOAT)) * %(param_1)s AS percentage
FROM ft_processing_time
WHERE ft_processing_time.bst_date >= %(bst_date_1)s AND ft_processing_time.bst_date <= %(bst_date_2)s ORDER BY ft_processing_time.bst_date]
[parameters: {'param_1': 100, 'bst_date_1': datetime.date(2021, 11, 12), 'bst_date_2': datetime.date(2021, 11, 19)}]
(Background on this error at: http://sqlalche.me/e/14/9h9h)
I've fixed this by falling back to 100.0% for days when we send
no messages. There's maybe an argument that it should be N/A rather than
100%, but I don't think it really matters, as this is only
going to affect preview and staging: we will never have a day
with no messages sent in production.
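One way the fallback could look in SQLAlchemy, as a hedged sketch (the model
name and exact expression are assumptions based on the SQL above):

```python
from sqlalchemy import Float, case, cast

from app.models import FactProcessingTime  # assumed model for ft_processing_time

percentage = case(
    # On days with no messages, fall back to 100.0 instead of dividing by zero.
    (FactProcessingTime.messages_total == 0, 100.0),
    else_=(
        FactProcessingTime.messages_within_10_secs
        / cast(FactProcessingTime.messages_total, Float)
        * 100
    ),
).label("percentage")
```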
If a polygon is smaller than the largest polygon in our dataset of
simplified polygons then we’re only throwing away useful detail by
simplifying it.
We should still simplify larger polygons as a fallback, to avoid sending
anything to the CBC that we’re not sure it will like.
The thresholds here are low: we can raise them as we test and experiment
more.
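As an illustrative sketch only (the real threshold values and simplification
parameters live in the utils library and won't match these placeholders,
which assume shapely polygons):

```python
# Placeholder: roughly the largest point count we already send successfully.
MAX_POINT_COUNT_BEFORE_SIMPLIFYING = 250

def smooth_if_needed(polygon):
    if len(polygon.exterior.coords) <= MAX_POINT_COUNT_BEFORE_SIMPLIFYING:
        # Already smaller than polygons we know the CBC accepts, so keep the detail.
        return polygon
    # Fallback: simplify anything bigger so we never send something the CBC
    # might reject.
    return polygon.simplify(0.001)
```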
Here’s some data about the Flood Warning Service polygons
Percentile | 80% | 90% | 95% | 98% | 99% | 99.9%
-----------|-----|-------|--------|---------|---------|---------
Point count| 226 | 401.9 | 640.45 | 1015.38 | 1389.07 | 3008.609
Percentile | 80% | 90% | 95% | 98% | 99% | 99.9%
--------------|-----|-------|--------|---------|---------|---------
Polygon count | 2 | 3 | 5 | 8 | 10 | 40.469
This new version of utils implements the transformation of our polygons
to a Cartesian plane. In other words, it converts them from being
defined in spherical degrees to metres.
For the API this means our simplification will be slightly more
accurate.
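Roughly what that transformation looks like with shapely and pyproj (the
projection, coordinates and tolerance here are illustrative; the utils
implementation may differ):

```python
from pyproj import Transformer
from shapely.geometry import Polygon
from shapely.ops import transform

# Project from WGS84 degrees onto the metre-based British National Grid.
to_metres = Transformer.from_crs("EPSG:4326", "EPSG:27700", always_xy=True).transform

polygon = Polygon([(-1.5, 53.8), (-1.4, 53.8), (-1.4, 53.9), (-1.5, 53.9)])
projected = transform(to_metres, polygon)

# The simplification tolerance is now expressed in metres rather than degrees,
# which is what makes the result slightly more accurate.
simplified = projected.simplify(25)
```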
As stated in the comment, this would have been helpful during an
incident to give further reassurance that a task had at least
started running - at the time the only evidence for this was the
Cronitor dashboard itself, which we don't often look at.
I've removed other, equivalent "starting" logs, but kept those
that provide additional information in the log message.
Note that the new base class doesn't include a bespoke feature we
had here: 'log_on_worker_shutdown'. We've agreed it's reasonable
to remove it for now as it was introduced many years ago and its
use case is unclear - we can always add it back if needed.
This seems to be an issue for several people when we install new
versions of the package. Older versions of the package seem to
be equally affected, so the fact that this is now needed is likely
related to us using a newer OS / Xcode version.
We have made it so that gov.uk/alerts shows a ‘1 planned test’ banner
for the whole of the day when there has been an operator test on that
day.
We need to remove the banner when the day is over.
The most straightforward way to do this is to republish the site at the
start of every day. The gov.uk/alerts code[1] will work out if there are
or aren’t any planned tests to show that day.
1. 5a274af6d0/app/models/alerts.py (L38-L44)
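A hedged sketch of what the daily trigger could look like as a Celery beat
entry (the task name and exact time are placeholders, not the real config):

```python
from celery.schedules import crontab

beat_schedule = {
    "trigger-govuk-alerts-publish": {
        # Placeholder task name: republish the site just after midnight so the
        # previous day's planned-test banner disappears.
        "task": "publish-govuk-alerts",
        "schedule": crontab(hour=0, minute=5),
    },
}
```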
There were two problems with the existing message.
1. There was no space between the new status and the time taken,
which made reading and searching harder.
2. The key bits of information (the before and after statuses) were
separated by the time taken (which will always be unique), meaning
you couldn't do an easy search for a message that is, say, in delivered
and being set to temporary-failure.
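For illustration, the reordered message could look something like this (the
function, names and exact wording are placeholders, not the actual log line):

```python
from flask import current_app

def log_status_change(notification_id, old_status, new_status, time_taken):
    # Keep the old and new statuses next to each other so the transition is
    # easy to read and search for, and put the always-unique timing at the end.
    current_app.logger.info(
        f"notification {notification_id} status changed from {old_status} "
        f"to {new_status} after {time_taken:.4f} seconds"
    )
```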