This deletes a big ol' chunk of code related to letters. It's not everything—there are still a few things that might be tied to SMS/email—but it's the heart of the letters functionality. SMS and email functionality should be untouched by this.
Areas affected:
- Things obviously about letters
- PDF tasks, used for precompiling letters
- Virus scanning, used for those PDFs
- FTP, used to send letters to the printer
- Postage stuff
When we cloned the repository and started making modifications, we
didn't initially keep tests in step. This commit tries to get us to a
clean test run by skipping tests that are failing and removing some
that we no longer expect to use (MMG, Firetext), with the intention that
we will come back in future and update or remove them as appropriate.
To find all tests skipped, search for `@pytest.mark.skip(reason="Needs
updating for TTS:`. There will be a brief description of the work that
needs to be done to get them passing, if known. Delete that line to make
them run in a standard test run (`make test`).
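For example, a skipped test looks like this (the test name, fixture and
reason description below are illustrative, not taken from the codebase):

    import pytest

    @pytest.mark.skip(reason="Needs updating for TTS: relies on letter-specific fixtures")
    def test_persist_notification_for_letter(sample_letter_template):
        ...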
This fixes a bug where (letter) notifications left in sending would
temporarily get excluded from billing and status calculations once
the service retention period had elapsed, and then get included once
again when they finally got marked as delivered.*
Status and billing tasks shouldn't need to have knowledge about which
table their data is in and getting this wrong is the fundamental cause
of the bug here. Adding a view across both tables abstracts this away
while keeping the query complexity the same.
Using a view also has the added benefit that we no longer need to care
when the status / billing tasks run in comparison to the deletion task,
since we will retrieve the same data irrespective (see below for a more
detailed discussion on data integrity).
*Such a scenario is rare but has happened.
A New View
==========
I've included all the columns that are shared between the two tables,
even though only a subset are actually needed. Having extra columns
has no impact and may be useful in future.
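As a rough sketch, the migration that creates the view looks something
like the following (e.g. via Alembic; the column list is abbreviated and
the underlying table names are assumed to be `notifications` and
`notification_history`):

    from alembic import op

    def upgrade():
        # Plain UNION (not UNION ALL), so a row that momentarily exists in
        # both tables is only returned once - see the data integrity notes
        # below.
        op.execute("""
            CREATE VIEW notifications_all_time_view AS
            SELECT id, service_id, template_id, notification_type, key_type,
                   notification_status, created_at, billable_units
            FROM notifications
            UNION
            SELECT id, service_id, template_id, notification_type, key_type,
                   notification_status, created_at, billable_units
            FROM notification_history
        """)

    def downgrade():
        op.execute("DROP VIEW notifications_all_time_view")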
Although the view isn't actually a table, SQLAlchemy appears to wrap
it without any issues, even though the package doesn't have any direct
support for "view models". Because we're never inserting data, we don't
need most of the kwargs when defining columns.*
*Note that the "default" kwarg doesn't affect data that's retrieved,
only data that's written (if no value is set).
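For illustration, the wrapping can look roughly like this (plain
declarative SQLAlchemy rather than the app's actual model base, and only
a handful of the columns):

    from sqlalchemy import Column, DateTime, Integer, Text
    from sqlalchemy.dialects.postgresql import UUID
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class NotificationAllTimeView(Base):
        # SQLAlchemy maps the view as if it were a table; because we only
        # ever read from it, write-oriented kwargs (default, onupdate,
        # nullable, ...) are left off the columns.
        __tablename__ = "notifications_all_time_view"

        id = Column(UUID(as_uuid=True), primary_key=True)
        service_id = Column(UUID(as_uuid=True))
        template_id = Column(UUID(as_uuid=True))
        notification_type = Column(Text)
        key_type = Column(Text)
        notification_status = Column(Text)
        created_at = Column(DateTime)
        billable_units = Column(Integer)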
Data Integrity
==============
The (new) tests cover the main scenarios.
We need to be careful with how the view interacts with the deletion /
archiving task. There are two concerns here:
- Duplicates. The deletion task inserts into the history table before
it deletes [^1], so we could end up double counting. It turns out this
isn't a problem because a Postgres UNION (unlike UNION ALL) applies an
implicit "DISTINCT" [^2] (illustrated at the end of this section). I've
also verified this manually, just to be on the safe side.
- No data. It's conceivable that the query will check the history table
just before the insertion, then check the notifications table just after
the deletion. It turns out this isn't a problem either because the whole
query sees the same DB snapshot [^3][^4].*
*I can't think of a way to test this as it's a race condition, but I'm
confident the Postgres docs are accurate.
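To make the duplicates point concrete, here's a small illustration using
SQLAlchemy's lightweight table constructs (1.4+ select style; the history
table name is an assumption):

    from sqlalchemy import column, select, table, union, union_all

    notifications = table("notifications", column("id"))
    history = table("notification_history", column("id"))

    # UNION deduplicates across the two selects, so a notification that has
    # just been copied into the history table but not yet deleted from
    # notifications is only counted once.
    deduplicated = union(select(notifications.c.id), select(history.c.id))

    # UNION ALL would keep both copies - the variant the view must avoid.
    with_duplicates = union_all(select(notifications.c.id), select(history.c.id))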
Performance
===========
I copied the relevant (non-PII) columns from Production for data going
back to 2022-04-01. I then ran several tests.
Queries using the new view still make use of indices on a per-table basis,
as the following query plan illustrates:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=1130820.02..1135353.89 rows=46502 width=97) (actual time=629.863..756.703 rows=72 loops=1)
  Group Key: notifications_all_time_view.template_id, notifications_all_time_view.sent_by, notifications_all_time_view.rate_multiplier, notifications_all_time_view.international
  -> Sort (cost=1130820.02..1131401.28 rows=232506 width=85) (actual time=629.756..708.914 rows=217563 loops=1)
        Sort Key: notifications_all_time_view.template_id, notifications_all_time_view.sent_by, notifications_all_time_view.rate_multiplier, notifications_all_time_view.international
        Sort Method: external merge Disk: 9320kB
        -> Subquery Scan on notifications_all_time_view (cost=1088506.43..1098969.20 rows=232506 width=85) (actual time=416.118..541.669 rows=217563 loops=1)
              -> Unique (cost=1088506.43..1096644.14 rows=232506 width=725) (actual time=416.115..513.065 rows=217563 loops=1)
                    -> Sort (cost=1088506.43..1089087.70 rows=232506 width=725) (actual time=416.115..451.190 rows=217563 loops=1)
                          Sort Key: notifications_no_pii.id, notifications_no_pii.job_id, notifications_no_pii.service_id, notifications_no_pii.template_id, notifications_no_pii.key_type, notifications_no_pii.billable_units, notifications_no_pii.notification_type, notifications_no_pii.created_at, notifications_no_pii.sent_by, notifications_no_pii.notification_status, notifications_no_pii.international, notifications_no_pii.rate_multiplier, notifications_no_pii.postage
                          Sort Method: external merge Disk: 23936kB
                          -> Append (cost=114.42..918374.12 rows=232506 width=725) (actual time=2.051..298.229 rows=217563 loops=1)
                                -> Bitmap Heap Scan on notifications_no_pii (cost=114.42..8557.55 rows=2042 width=113) (actual time=1.405..1.442 rows=0 loops=1)
                                      Recheck Cond: ((service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'::uuid) AND (notification_type = 'sms'::notification_type) AND (notification_status = ANY ('{sending,sent,delivered,pending,temporary-failure,permanent-failure}'::text[])) AND (created_at >= '2022-05-01 23:00:00'::timestamp without time zone) AND (created_at < '2022-05-02 23:00:00'::timestamp without time zone))
                                      Filter: ((key_type)::text = ANY ('{normal,team}'::text[]))
                                      -> Bitmap Index Scan on ix_notifications_no_piiservice_id_composite (cost=0.00..113.91 rows=2202 width=0) (actual time=1.402..1.439 rows=0 loops=1)
                                            Index Cond: ((service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'::uuid) AND (notification_type = 'sms'::notification_type) AND (notification_status = ANY ('{sending,sent,delivered,pending,temporary-failure,permanent-failure}'::text[])) AND (created_at >= '2022-05-01 23:00:00'::timestamp without time zone) AND (created_at < '2022-05-02 23:00:00'::timestamp without time zone))
                                -> Index Scan using ix_notifications_history_no_pii_service_id_composite on notifications_history_no_pii (cost=0.70..906328.97 rows=230464 width=113) (actual time=0.645..281.612 rows=217563 loops=1)
                                      Index Cond: ((service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'::uuid) AND ((key_type)::text = ANY ('{normal,team}'::text[])) AND (notification_type = 'sms'::notification_type) AND (created_at >= '2022-05-01 23:00:00'::timestamp without time zone) AND (created_at < '2022-05-02 23:00:00'::timestamp without time zone))
                                      Filter: (notification_status = ANY ('{sending,sent,delivered,pending,temporary-failure,permanent-failure}'::text[]))
Planning Time: 18.032 ms
Execution Time: 759.001 ms
(21 rows)
Queries using the new view appear to be slower than the equivalent
queries against a single table, but the differences I've seen are
minimal: the original queries already execute in seconds locally and in
Production, so it's not a big issue.
Notes: Performance
==================
I downloaded a minimal set of columns for testing:
\copy (
select
id, notification_type, key_type, created_at, service_id,
template_id, sent_by, rate_multiplier, international,
billable_units, postage, job_id, notification_status
from notifications
) to 'notifications.csv' delimiter ',' csv header;
CREATE TABLE notifications_no_pii (
id uuid NOT NULL,
notification_type public.notification_type NOT NULL,
key_type character varying(255) NOT NULL,
created_at timestamp without time zone NOT NULL,
service_id uuid,
template_id uuid,
sent_by character varying,
rate_multiplier numeric,
international boolean,
billable_units integer NOT NULL,
postage character varying,
job_id uuid,
notification_status text
);
copy notifications_no_pii from '/Users/ben.thorner/Desktop/notifications.csv' delimiter ',' csv header;
CREATE INDEX ix_notifications_no_piicreated_at ON notifications_no_pii USING btree (created_at);
CREATE INDEX ix_notifications_no_piijob_id ON notifications_no_pii USING btree (job_id);
CREATE INDEX ix_notifications_no_piinotification_type_composite ON notifications_no_pii USING btree (notification_type, notification_status, created_at);
CREATE INDEX ix_notifications_no_piiservice_created_at ON notifications_no_pii USING btree (service_id, created_at);
CREATE INDEX ix_notifications_no_piiservice_id_composite ON notifications_no_pii USING btree (service_id, notification_type, notification_status, created_at);
CREATE INDEX ix_notifications_no_piitemplate_id ON notifications_no_pii USING btree (template_id);
And similarly for the history table. I then created a separate view
across both of these temporary tables using just these columns.
To test performance I created some queries that reflect what is run
by the billing [^5] and status [^6] tasks e.g.
explain analyze select template_id, sent_by, rate_multiplier, international, sum(billable_units), count(*)
from notifications_all_time_view
where
notification_status in ('sending', 'sent', 'delivered', 'pending', 'temporary-failure', 'permanent-failure')
and key_type in ('normal', 'team')
and created_at >= '2022-05-01 23:00'
and created_at < '2022-05-02 23:00'
and notification_type = 'sms'
and service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'
group by 1,2,3,4;
explain analyze select template_id, job_id, key_type, notification_status, count(*)
from notifications_all_time_view
where created_at >= '2022-05-01 23:00'
and created_at < '2022-05-02 23:00'
and notification_type = 'sms'
and service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'
and key_type in ('normal', 'team')
group by 1,2,3,4;
Between running queries I restarted my local database and also ran
a command to purge disk caches [^7].
I tested on a few services:
- c5956607-20b1-48b4-8983-85d11404e61f on 2022-05-02 (high volume)
- 0cc696c6-b792-409d-99e9-64232f461b0f on 2022-04-06 (highest volume)
- 01135db6-7819-4121-8b97-4aa2d741e372 on 2022-04-14 (very low volume)
All execution times are of the same order of magnitude using the view
as in the worst case of either table on its own.
[^1]: 00a04ebf54/app/dao/notifications_dao.py (L389)
[^2]: https://stackoverflow.com/questions/49925/what-is-the-difference-between-union-and-union-all
[^3]: https://www.postgresql.org/docs/current/transaction-iso.html
[^4]: https://dba.stackexchange.com/questions/210485/can-sub-selects-change-in-one-single-query-in-a-read-committed-transaction
[^5]: 00a04ebf54/app/dao/fact_billing_dao.py (L471)
[^6]: 00a04ebf54/app/dao/fact_notification_status_dao.py (L58)
[^7]: https://stackoverflow.com/questions/28845524/echo-3-proc-sys-vm-drop-caches-on-mac-osx
This effectively reverts [^1], which was only a temporary change.
I suspect the performance problem will go away with [^2].
While we've clearly been managing without this change, its absence
resulted in several rows being left incorrect when letter receipts were
delayed. It makes sense for us to run this task for the same period
as we do to aggregate statuses, since status affects billing.
[^1]: e5c76ffda7
[^2]: https://github.com/alphagov/notifications-api/pull/3542
The test was querying `FactNotificationStatus`, ordering the results
by bst_date and notification_type, and then checking the rows. However,
the bst_date and notification_type for each row are the same, so this
test could fail depending on the order the results came back in. By ordering
on the notification_status instead, we can be sure of the order of the
results.
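Roughly, the assertions now look something like this (illustrative, not
the exact test code):

    # notification_status differs between the expected rows, so ordering on
    # it makes the result order - and therefore the assertions - deterministic.
    results = (
        FactNotificationStatus.query
        .order_by(FactNotificationStatus.notification_status)
        .all()
    )
    assert results[0].notification_status == "created"
    assert results[1].notification_status == "delivered"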
This takes a similar approach to the nightly deletion task so that
we only create sub-tasks when there are actually notifications to
aggregate for a given type and day [1].
We're making this change to stop the duplication errors we're getting
at the moment and ensure the task can scale to more messages and more
services. There are two parts to this:
- Each subtask should now run within the 5-minute visibility timeout.
However, they may still be duplicated if the parent task overruns [2].
- The parent task creates a minimal number of subtasks, and the query
to determine this is very fast for a normal process day (milliseconds).
Since all tasks will run quickly, there should be no more duplication.
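As a sketch of the shape of the parent task (the task, helper and queue
names here are stand-ins rather than the real implementation):

    @notify_celery.task(name="create-nightly-notification-status")
    def create_nightly_notification_status():
        # Only spawn a subtask for the (day, notification type) pairs that
        # actually have notifications to aggregate.
        for process_day, notification_type in days_and_types_with_notifications():
            create_nightly_notification_status_for_day.apply_async(
                kwargs={
                    "process_day": process_day.isoformat(),
                    "notification_type": notification_type,
                },
                queue="reporting-tasks",
            )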
In order to test this more nuanced task, I rewrote the tests:
- One test checks the subtask is called correctly.
- One test checks we create all the right subtasks.
[1]: https://github.com/alphagov/notifications-api/pull/3381
[2]: https://docs.google.com/document/d/1MaP6Nyy3nJKkuh_4lP1wuDm19X8LZITOLRd9n3Ax-xg/edit#heading=h.q3intzwqhfzl
The top-level task didn't run successfully after this was deployed
because the worker was killed by heavy disk usage. While the
more parallel version does log much more, it doesn't totally explain
the disk behaviour. Nonetheless, reverting it is sensible to give us
the time we need to investigate more.
This follows a similar approach to [1]. Recently we've seen lots
of errors from this task, which we think are a consequence of it
doing too much work and tripping Celery's visibility timeout.
While we can optimise the query [2], it's likely the errors will
return as the number of live services grows. Parallelising the
aggregation now will make it more future-proof.
[1]: https://github.com/alphagov/notifications-api/pull/3397
[2]: https://github.com/alphagov/notifications-api/pull/3417
The previous DAO tests were also confusing because they were testing
two functions at the same time, so moving the tests up to the task
level seems very reasonable, and will make it easier to change how
this code works in the next commits.
for the service.
The letter rates for crown and non-crown are the same. It would be nice
to remove the need for crown but for now this is a quick fix.
Flake8 Bugbear checks for some extra things that aren’t code style
errors, but are likely to introduce bugs or unexpected behaviour. A
good example is having mutable default function arguments, which are
shared between every call to the function, so mutating the value in one
place can unexpectedly cause it to change in another.
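For example:

    def add_tag(tag, tags=[]):  # flagged by Bugbear (B006): mutable default argument
        tags.append(tag)
        return tags

    add_tag("first")   # ["first"]
    add_tag("second")  # ["first", "second"] - the same default list is reused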
This commit enables all the extra warnings provided by Flake8 Bugbear,
except for:
- the line length one (because we already lint for that separately)
- B903 Data class should either be immutable or use `__slots__` because
this seems to false-positive on some of our custom exceptions
- B902 Invalid first argument 'cls' used for instance method because
some SQLAlchemy decorators (eg `declared_attr`) make things that
aren’t formally class methods take a class not an instance as their
first argument
It also disables:
- _B306: BaseException.message is removed in Python 3_ because I think
our exceptions have a custom structure that means the `.message`
attribute is still present
Matches the work done in other repos:
- https://github.com/alphagov/notifications-admin/pull/3172/files
New rows giving the prices of letters with a post_class of `europe` and
`rest-of-world` have been added to the `letter_rates` table. All rates
are currently the same for international letters.
the queries all return lots of columns, but each query has columns it
doesn't care about: e.g. emails don't have billable units or an
international flag, letters don't have an international flag, sms don't
have a page count, etc. additionally, the query was grouping on things
that never change, like service id and notification type.
by making all of these literals (as in `select 1 as foo`) we see times
that are over 50% quicker for the gov.uk email service.
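the shape of the change, very roughly (SQLAlchemy; the column choices are
made up and `db` / `Notification` are stand-ins for the app's objects):

    from sqlalchemy import func, literal_column

    # For the email query, columns that can never vary for emails become
    # literals (rendered as e.g. "select false as international"), and
    # constants like notification_type drop out of the GROUP BY entirely.
    query = (
        db.session.query(
            Notification.template_id,
            literal_column("false").label("international"),
            literal_column("0").label("billable_units"),
            func.count().label("notification_count"),
        )
        .group_by(Notification.template_id)
    )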
Note: One of the tests changed because previously it involved emails and
sms with statuses that they could never have (e.g. returned-letter).
sms and emails have a very predictable 72-hour lifecycle. letters, on
the other hand, have ridiculously complex lifecycles - they might not
get sent because it's a weekend, they might not get sent because they're
second class and are only processed on alternate days, they might not
get sent because a different letter in the same batch had an error that
we didn't know about. Whatever the reason, it's apparent that four days
is definitely not enough time to guarantee that letters have gone from
sending to delivered.
Extend the number of days we process for letters to 10. Keep emails
and sms down at 4 to keep run-times shorter.
We're deliberately not thinking about returned letters here at all.
it makes less sense once we introduce different start dates for letters
and emails. Also, we never use it, since we just call the day tasks
ourselves from commands.py
the nightly task won't be affected; it'll just trigger three times more
sub-tasks.
this doesn't need to be a two-part deploy because we only trigger this
overnight, so as long as the deploy completes in the daytime we don't
need to worry about Celery task signatures.
the nightly tasks need to run after the create_nightly_notification_status
task - so that test notifications are still there to record stats
for, and to stop the risk of deleting notifications part-way through
recording stats for them.
the create_nightly_notification_status task runs at 00:30 UK time.
However, this means that in summer datetime.today() will return the
wrong date, as the server (which runs on UTC) will run the task at 23:30
(populating the wrong row in the table).
fix this to use nice tz-aware functions
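A minimal sketch of the fix using the standard library (the app may use
its own timezone helpers instead):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    # datetime.today() on a UTC server gives "yesterday" when the task fires
    # at 00:30 UK time during BST. Converting an aware UTC timestamp to
    # Europe/London before taking the date gives the row we actually want.
    process_day = (
        datetime.now(timezone.utc).astimezone(ZoneInfo("Europe/London")).date()
    )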