This deletes a big ol' chunk of code related to letters. It's not everything—there are still a few things that might be tied to sms/email—but it's the the heart of letters function. SMS and email function should be untouched by this.
Areas affected:
- Things obviously about letters
- PDF tasks, used for precompiling letters
- Virus scanning, used for those PDFs
- FTP, used to send letters to the printer
- Postage stuff
This fixes a bug where (letter) notifications left in sending would
temporarily get excluded from billing and status calculations once
the service retention period had elapsed, and then get included once
again when they finally get marked as delivered.*
Status and billing tasks shouldn't need to have knowledge about which
table their data is in and getting this wrong is the fundamental cause
of the bug here. Adding a view across both tables abstracts this away
while keeping the query complexity the same.
Using a view also has the added benefit that we no longer need to care
when the status / billing tasks run in comparison to the deletion task,
since we will retrieve the same data irrespective (see below for a more
detailed discussion on data integrity).
*Such a scenario is rare but has happened.
A New View
==========
I've included all the columns that are shared between the two tables,
even though only a subset are actually needed. Having extra columns
has no impact and may be useful in future.
Although the view isn't actually a table, SQLAlchemy appears to wrap
it without any issues, noting that the package doesn't have any direct
support for "view models". Because we're never inserting data, we don't
need most of the kwargs when defining columns.*
*Note that the "default" kwarg doesn't affect data that's retrieved,
only data that's written (if no value is set).
Data Integrity
==============
The (new) tests cover the main scenarios.
We need to be careful with how the view interacts with the deletion /
archiving task. There are two concerns here:
- Duplicates. The deletion task inserts before it deletes [^1], so we
could end up double counting. It turns out this isn't a problem because
a Postgres UNION is an implicit "DISTINCT" [^2]. I've also verified this
manually, just to be on the safe side.
- No data. It's conceivable that the query will check the history table
just before the insertion, then check the notifications table just after
the deletion. It turns out this isn't a problem either because the whole
query sees the same DB snapshot [^3][^4].*
*I can't think of a way to test this as it's a race condition, but I'm
confident the Postgres docs are accurate.
Performance
===========
I copied the relevant (non-PII) columns from Production for data going
back to 2022-04-01. I then ran several tests.
Queries using the new view still make use of indices on a per-table basis,
as the following query plan illustrates:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=1130820.02..1135353.89 rows=46502 width=97) (actual time=629.863..756.703 rows=72 loops=1)
Group Key: notifications_all_time_view.template_id, notifications_all_time_view.sent_by, notifications_all_time_view.rate_multiplier, notifications_all_time_view.international
-> Sort (cost=1130820.02..1131401.28 rows=232506 width=85) (actual time=629.756..708.914 rows=217563 loops=1)
Sort Key: notifications_all_time_view.template_id, notifications_all_time_view.sent_by, notifications_all_time_view.rate_multiplier, notifications_all_time_view.international
Sort Method: external merge Disk: 9320kB
-> Subquery Scan on notifications_all_time_view (cost=1088506.43..1098969.20 rows=232506 width=85) (actual time=416.118..541.669 rows=217563 loops=1)
-> Unique (cost=1088506.43..1096644.14 rows=232506 width=725) (actual time=416.115..513.065 rows=217563 loops=1)
-> Sort (cost=1088506.43..1089087.70 rows=232506 width=725) (actual time=416.115..451.190 rows=217563 loops=1)
Sort Key: notifications_no_pii.id, notifications_no_pii.job_id, notifications_no_pii.service_id, notifications_no_pii.template_id, notifications_no_pii.key_type, notifications_no_pii.billable_units, notifications_no_pii.notification_type, notifications_no_pii.created_at, notifications_no_pii.sent_by, notifications_no_pii.notification_status, notifications_no_pii.international, notifications_no_pii.rate_multiplier, notifications_no_pii.postage
Sort Method: external merge Disk: 23936kB
-> Append (cost=114.42..918374.12 rows=232506 width=725) (actual time=2.051..298.229 rows=217563 loops=1)
-> Bitmap Heap Scan on notifications_no_pii (cost=114.42..8557.55 rows=2042 width=113) (actual time=1.405..1.442 rows=0 loops=1)
Recheck Cond: ((service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'::uuid) AND (notification_type = 'sms'::notification_type) AND (notification_status = ANY ('{sending,sent,delivered,pending,temporary-failure,permanent-failure}'::text[])) AND (created_at >= '2022-05-01 23:00:00'::timestamp without time zone) AND (created_at < '2022-05-02 23:00:00'::timestamp without time zone))
Filter: ((key_type)::text = ANY ('{normal,team}'::text[]))
-> Bitmap Index Scan on ix_notifications_no_piiservice_id_composite (cost=0.00..113.91 rows=2202 width=0) (actual time=1.402..1.439 rows=0 loops=1)
Index Cond: ((service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'::uuid) AND (notification_type = 'sms'::notification_type) AND (notification_status = ANY ('{sending,sent,delivered,pending,temporary-failure,permanent-failure}'::text[])) AND (created_at >= '2022-05-01 23:00:00'::timestamp without time zone) AND (created_at < '2022-05-02 23:00:00'::timestamp without time zone))
-> Index Scan using ix_notifications_history_no_pii_service_id_composite on notifications_history_no_pii (cost=0.70..906328.97 rows=230464 width=113) (actual time=0.645..281.612 rows=217563 loops=1)
Index Cond: ((service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'::uuid) AND ((key_type)::text = ANY ('{normal,team}'::text[])) AND (notification_type = 'sms'::notification_type) AND (created_at >= '2022-05-01 23:00:00'::timestamp without time zone) AND (created_at < '2022-05-02 23:00:00'::timestamp without time zone))
Filter: (notification_status = ANY ('{sending,sent,delivered,pending,temporary-failure,permanent-failure}'::text[]))
Planning Time: 18.032 ms
Execution Time: 759.001 ms
(21 rows)
Queries using the new view appear to be slower than without, but the
differences I've seen are minimal: the original queries execute in
seconds locally and in Production, so it's not a big issue.
Notes: Performance
==================
I downloaded a minimal set of columns for testing:
\copy (
select
id, notification_type, key_type, created_at, service_id,
template_id, sent_by, rate_multiplier, international,
billable_units, postage, job_id, notification_status
from notifications
) to 'notifications.csv' delimiter ',' csv header;
CREATE TABLE notifications_no_pii (
id uuid NOT NULL,
notification_type public.notification_type NOT NULL,
key_type character varying(255) NOT NULL,
created_at timestamp without time zone NOT NULL,
service_id uuid,
template_id uuid,
sent_by character varying,
rate_multiplier numeric,
international boolean,
billable_units integer NOT NULL,
postage character varying,
job_id uuid,
notification_status text
);
copy notifications_no_pii from '/Users/ben.thorner/Desktop/notifications.csv' delimiter ',' csv header;
CREATE INDEX ix_notifications_no_piicreated_at ON notifications_no_pii USING btree (created_at);
CREATE INDEX ix_notifications_no_piijob_id ON notifications_no_pii USING btree (job_id);
CREATE INDEX ix_notifications_no_piinotification_type_composite ON notifications_no_pii USING btree (notification_type, notification_status, created_at);
CREATE INDEX ix_notifications_no_piiservice_created_at ON notifications_no_pii USING btree (service_id, created_at);
CREATE INDEX ix_notifications_no_piiservice_id_composite ON notifications_no_pii USING btree (service_id, notification_type, notification_status, created_at);
CREATE INDEX ix_notifications_no_piitemplate_id ON notifications_no_pii USING btree (template_id);
And similarly for the history table. I then created a sepatate view
across both of these temporary tables using just these columns.
To test performance I created some queries that reflect what is run
by the billing [^5] and status [^6] tasks e.g.
explain analyze select template_id, sent_by, rate_multiplier, international, sum(billable_units), count(*)
from notifications_all_time_view
where
notification_status in ('sending', 'sent', 'delivered', 'pending', 'temporary-failure', 'permanent-failure')
and key_type in ('normal', 'team')
and created_at >= '2022-05-01 23:00'
and created_at < '2022-05-02 23:00'
and notification_type = 'sms'
and service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'
group by 1,2,3,4;
explain analyze select template_id, job_id, key_type, notification_status, count(*)
from notifications_all_time_view
where created_at >= '2022-05-01 23:00'
and created_at < '2022-05-02 23:00'
and notification_type = 'sms'
and service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'
and key_type in ('normal', 'team')
group by 1,2,3,4;
Between running queries I restarted my local database and also ran
a command to purge disk caches [^7].
I tested on a few services:
- c5956607-20b1-48b4-8983-85d11404e61f on 2022-05-02 (high volume)
- 0cc696c6-b792-409d-99e9-64232f461b0f on 2022-04-06 (highest volume)
- 01135db6-7819-4121-8b97-4aa2d741e372 on 2022-04-14 (very low volume)
All execution results are of the same magnitude using the view compared
to the worst case of either table on its own.
[^1]: 00a04ebf54/app/dao/notifications_dao.py (L389)
[^2]: https://stackoverflow.com/questions/49925/what-is-the-difference-between-union-and-union-all
[^3]: https://www.postgresql.org/docs/current/transaction-iso.html
[^4]: https://dba.stackexchange.com/questions/210485/can-sub-selects-change-in-one-single-query-in-a-read-committed-transaction
[^5]: 00a04ebf54/app/dao/fact_billing_dao.py (L471)
[^6]: 00a04ebf54/app/dao/fact_notification_status_dao.py (L58)
[^7]: https://stackoverflow.com/questions/28845524/echo-3-proc-sys-vm-drop-caches-on-mac-osx
Previously we were looping over data from the Notifications/History
table and then shovelling it into the status table, one row at a time
- plus an extra delete to clean up any existing data.
This replaces that with a batch insertion, similar to how we archive
notifications [1], but using a simple subquery (via "from_select" [2])
instead of a temporary table.
To make the select compatible with the insert, I've used "literal"
to inject the constant pieces of data, so each row has everything it
needs to go into the status table.
[1]: 9ce6d2fe92/app/dao/notifications_dao.py (L295)
[2]: https://docs.sqlalchemy.org/en/14/core/dml.html#sqlalchemy.sql.expression.Insert.from_select
This is consistent with the way we do billing updates [1] and is a
bit less clunky. Functionally it should be the same - note that the
tests already cover the "overwriting" behaviour if a row exists.
[1]: 9ce6d2fe92/app/dao/fact_billing_dao.py (L522)
The top-level task didn't run successfully after this was deployed
due to the worker being killed due to heavy disk usage. While the
more parallel version does log much more, it doesn't totally explain
the disk behaviour. Nonetheless, reverting it is sensible to give us
the time we need to investigate more.
Investigation with EXPLAIN and EXPLAIN ANALYZE for the notification
history table shows this is another instance of [1] but for the key
type column. Swapping "!=" for "IN" solves the problem.
[1]: https://github.com/alphagov/notifications-api/pull/3360
The trouble is the aggregate query to return the big blue numbers on the dashboard and /notifications/{notification_type} page is taking too long to return.
I have some ideas on how to improve the query, but should take some time to do some more research and test. In the meantime, let's just ignore "todays" total numbers for the high volume services. There are only two services that this will affect.
We no longer will send them any stats so therefore don't need the code
- the code to work out the nightly stats
- the performance platform client
- any configuration for the client
- any nightly tasks that kick off the sending off the stats
We will require a change in cronitor as we no longer will have this task
run meaning we need to delete the cronitor check.
This done so that we do not use statsd on our http endpoint.
We decided we do not need metrics that this gave us. If we
change our minds, we will add Prometheus-friendly decorators
instead in the future.
The '/service/monthly-data-by-service` endpoint which is used for the
'Monthly notification statuses for live services' Platform Admin report
did not including `pending` notifications. This updates the DAO function
that the endpoint calls to group `pending` and `sending` notifications together.
the nightly task won't be affected, it'll just trigger three times more
sub-tasks.
this doesn't need to be a two-part deploy because we only trigger this
overnight, so as long as the deploy completes in daytime we don't need
to worry about celery task signatures
previously we checked notifications table, and if the results were
zero, checked the notification history table to see if there's data
in there. When we know that data isn't in notifications, we're still
checking. These queries take half a second per service, and we're
doing at least ten for each of the five thousand services we have in
notify. Most of these services have no data in either table for any
given day, and we can reduce the amount of queries we do by only
checking one table.
Check the data retention for a service, and then if the date is older
than the retention, get from history table.
NOTE: This requires that the delete tasks haven't run yet for the day!
If your retention is three days, this will look in the Notification
table for data from three days ago - expecting that shortly after the
task finishes, we'll delete that data.
This PR updates those queries to look in either Notification or NotificationHistory. Since the data does not exist in both tables at the same time we can do with and not worry about the data retention.
The query will iterate over each service, then each notification type and query the data if no results then try the history table.
the create_nightly_notification_status task runs at 00:30am UK time,
however this means that in summer datetime.today() will return the
wrong date as the server (which runs on UTC) will run the task at
23:30 (populating the wrong row in the table).
fix this to use nice tz aware functions
The previous query was including all notifications regardless of notification_status. I don't think that's right, it shouldn't include things like technical-failure or validation-failed. Thoughts?
I also need to remove the query that's no longer being used.
Currently, admin app requests service statistics (with notification
counts grouped by status) and template statistics (with counts by
template) in order to display the service dashboard.
Service statistics are gathered from FactNotificationStatus table
(counts for the last 7 days) combined with Notification (counts for
today).
Template statistics are currently gathered from redis cache, which
contains a separate counter per template per day. It's hard for us
to maintain consistency between redis and DB counts. Currently it
doesn't update the count for cancelled letters, counter resets in
the middle of the day might produce a wrong result for the rest of
the week and cleared redis cache can't be repopulated for services
with low data retention periods).
Since FactNotificationStatus already contains separate counts for
each template_id we can use the existing logic with some additional
filters to get separate counts for each template and status combination,
which would allow us to populate the service dashboard page from one
query response.