This renames the latest migration file to match the Revision ID in the
file. When these names are different, our deployment pipeline tries to
run migrations on the notify-api-db-migration app and runs the
functional tests twice.
This fixes a bug where (letter) notifications left in sending would
temporarily get excluded from billing and status calculations once
the service retention period had elapsed, and then get included once
again when they finally get marked as delivered.*
Status and billing tasks shouldn't need to have knowledge about which
table their data is in and getting this wrong is the fundamental cause
of the bug here. Adding a view across both tables abstracts this away
while keeping the query complexity the same.
Using a view also has the added benefit that we no longer need to care
when the status / billing tasks run in comparison to the deletion task,
since we will retrieve the same data irrespective (see below for a more
detailed discussion on data integrity).
*Such a scenario is rare but has happened.
A New View
==========
I've included all the columns that are shared between the two tables,
even though only a subset are actually needed. Having extra columns
has no impact and may be useful in future.
Although the view isn't actually a table, SQLAlchemy appears to wrap
it without any issues, noting that the package doesn't have any direct
support for "view models". Because we're never inserting data, we don't
need most of the kwargs when defining columns.*
*Note that the "default" kwarg doesn't affect data that's retrieved,
only data that's written (if no value is set).
Data Integrity
==============
The (new) tests cover the main scenarios.
We need to be careful with how the view interacts with the deletion /
archiving task. There are two concerns here:
- Duplicates. The deletion task inserts before it deletes [^1], so we
could end up double counting. It turns out this isn't a problem because
a Postgres UNION is an implicit "DISTINCT" [^2]. I've also verified this
manually, just to be on the safe side.
- No data. It's conceivable that the query will check the history table
just before the insertion, then check the notifications table just after
the deletion. It turns out this isn't a problem either because the whole
query sees the same DB snapshot [^3][^4].*
*I can't think of a way to test this as it's a race condition, but I'm
confident the Postgres docs are accurate.
Performance
===========
I copied the relevant (non-PII) columns from Production for data going
back to 2022-04-01. I then ran several tests.
Queries using the new view still make use of indices on a per-table basis,
as the following query plan illustrates:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=1130820.02..1135353.89 rows=46502 width=97) (actual time=629.863..756.703 rows=72 loops=1)
Group Key: notifications_all_time_view.template_id, notifications_all_time_view.sent_by, notifications_all_time_view.rate_multiplier, notifications_all_time_view.international
-> Sort (cost=1130820.02..1131401.28 rows=232506 width=85) (actual time=629.756..708.914 rows=217563 loops=1)
Sort Key: notifications_all_time_view.template_id, notifications_all_time_view.sent_by, notifications_all_time_view.rate_multiplier, notifications_all_time_view.international
Sort Method: external merge Disk: 9320kB
-> Subquery Scan on notifications_all_time_view (cost=1088506.43..1098969.20 rows=232506 width=85) (actual time=416.118..541.669 rows=217563 loops=1)
-> Unique (cost=1088506.43..1096644.14 rows=232506 width=725) (actual time=416.115..513.065 rows=217563 loops=1)
-> Sort (cost=1088506.43..1089087.70 rows=232506 width=725) (actual time=416.115..451.190 rows=217563 loops=1)
Sort Key: notifications_no_pii.id, notifications_no_pii.job_id, notifications_no_pii.service_id, notifications_no_pii.template_id, notifications_no_pii.key_type, notifications_no_pii.billable_units, notifications_no_pii.notification_type, notifications_no_pii.created_at, notifications_no_pii.sent_by, notifications_no_pii.notification_status, notifications_no_pii.international, notifications_no_pii.rate_multiplier, notifications_no_pii.postage
Sort Method: external merge Disk: 23936kB
-> Append (cost=114.42..918374.12 rows=232506 width=725) (actual time=2.051..298.229 rows=217563 loops=1)
-> Bitmap Heap Scan on notifications_no_pii (cost=114.42..8557.55 rows=2042 width=113) (actual time=1.405..1.442 rows=0 loops=1)
Recheck Cond: ((service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'::uuid) AND (notification_type = 'sms'::notification_type) AND (notification_status = ANY ('{sending,sent,delivered,pending,temporary-failure,permanent-failure}'::text[])) AND (created_at >= '2022-05-01 23:00:00'::timestamp without time zone) AND (created_at < '2022-05-02 23:00:00'::timestamp without time zone))
Filter: ((key_type)::text = ANY ('{normal,team}'::text[]))
-> Bitmap Index Scan on ix_notifications_no_piiservice_id_composite (cost=0.00..113.91 rows=2202 width=0) (actual time=1.402..1.439 rows=0 loops=1)
Index Cond: ((service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'::uuid) AND (notification_type = 'sms'::notification_type) AND (notification_status = ANY ('{sending,sent,delivered,pending,temporary-failure,permanent-failure}'::text[])) AND (created_at >= '2022-05-01 23:00:00'::timestamp without time zone) AND (created_at < '2022-05-02 23:00:00'::timestamp without time zone))
-> Index Scan using ix_notifications_history_no_pii_service_id_composite on notifications_history_no_pii (cost=0.70..906328.97 rows=230464 width=113) (actual time=0.645..281.612 rows=217563 loops=1)
Index Cond: ((service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'::uuid) AND ((key_type)::text = ANY ('{normal,team}'::text[])) AND (notification_type = 'sms'::notification_type) AND (created_at >= '2022-05-01 23:00:00'::timestamp without time zone) AND (created_at < '2022-05-02 23:00:00'::timestamp without time zone))
Filter: (notification_status = ANY ('{sending,sent,delivered,pending,temporary-failure,permanent-failure}'::text[]))
Planning Time: 18.032 ms
Execution Time: 759.001 ms
(21 rows)
Queries using the new view appear to be slower than without, but the
differences I've seen are minimal: the original queries execute in
seconds locally and in Production, so it's not a big issue.
Notes: Performance
==================
I downloaded a minimal set of columns for testing:
\copy (
select
id, notification_type, key_type, created_at, service_id,
template_id, sent_by, rate_multiplier, international,
billable_units, postage, job_id, notification_status
from notifications
) to 'notifications.csv' delimiter ',' csv header;
CREATE TABLE notifications_no_pii (
id uuid NOT NULL,
notification_type public.notification_type NOT NULL,
key_type character varying(255) NOT NULL,
created_at timestamp without time zone NOT NULL,
service_id uuid,
template_id uuid,
sent_by character varying,
rate_multiplier numeric,
international boolean,
billable_units integer NOT NULL,
postage character varying,
job_id uuid,
notification_status text
);
copy notifications_no_pii from '/Users/ben.thorner/Desktop/notifications.csv' delimiter ',' csv header;
CREATE INDEX ix_notifications_no_piicreated_at ON notifications_no_pii USING btree (created_at);
CREATE INDEX ix_notifications_no_piijob_id ON notifications_no_pii USING btree (job_id);
CREATE INDEX ix_notifications_no_piinotification_type_composite ON notifications_no_pii USING btree (notification_type, notification_status, created_at);
CREATE INDEX ix_notifications_no_piiservice_created_at ON notifications_no_pii USING btree (service_id, created_at);
CREATE INDEX ix_notifications_no_piiservice_id_composite ON notifications_no_pii USING btree (service_id, notification_type, notification_status, created_at);
CREATE INDEX ix_notifications_no_piitemplate_id ON notifications_no_pii USING btree (template_id);
And similarly for the history table. I then created a sepatate view
across both of these temporary tables using just these columns.
To test performance I created some queries that reflect what is run
by the billing [^5] and status [^6] tasks e.g.
explain analyze select template_id, sent_by, rate_multiplier, international, sum(billable_units), count(*)
from notifications_all_time_view
where
notification_status in ('sending', 'sent', 'delivered', 'pending', 'temporary-failure', 'permanent-failure')
and key_type in ('normal', 'team')
and created_at >= '2022-05-01 23:00'
and created_at < '2022-05-02 23:00'
and notification_type = 'sms'
and service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'
group by 1,2,3,4;
explain analyze select template_id, job_id, key_type, notification_status, count(*)
from notifications_all_time_view
where created_at >= '2022-05-01 23:00'
and created_at < '2022-05-02 23:00'
and notification_type = 'sms'
and service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'
and key_type in ('normal', 'team')
group by 1,2,3,4;
Between running queries I restarted my local database and also ran
a command to purge disk caches [^7].
I tested on a few services:
- c5956607-20b1-48b4-8983-85d11404e61f on 2022-05-02 (high volume)
- 0cc696c6-b792-409d-99e9-64232f461b0f on 2022-04-06 (highest volume)
- 01135db6-7819-4121-8b97-4aa2d741e372 on 2022-04-14 (very low volume)
All execution results are of the same magnitude using the view compared
to the worst case of either table on its own.
[^1]: 00a04ebf54/app/dao/notifications_dao.py (L389)
[^2]: https://stackoverflow.com/questions/49925/what-is-the-difference-between-union-and-union-all
[^3]: https://www.postgresql.org/docs/current/transaction-iso.html
[^4]: https://dba.stackexchange.com/questions/210485/can-sub-selects-change-in-one-single-query-in-a-read-committed-transaction
[^5]: 00a04ebf54/app/dao/fact_billing_dao.py (L471)
[^6]: 00a04ebf54/app/dao/fact_notification_status_dao.py (L58)
[^7]: https://stackoverflow.com/questions/28845524/echo-3-proc-sys-vm-drop-caches-on-mac-osx
this was added five years ago but never used. if we want to bring back
variable rates per client we might as well get a fresh start since a lot
has changed since then.
This is done to make self-service branding easier to implement,
and also because NHS branding makes much more sense for services
in those orgs than GOV.UK branding.
This works in conjunction with the new SMS provider stub [^1].
Local testing:
- Run the migrations to add Reach as an inactive provider.
- Activate the Reach provider locally and deactivate the others.
update provider_details set priority = 100, active = false where notification_type = 'sms';
update provider_details set active = true where identifier = 'reach';
- Tweak your local environment to point at the SMS stub.
export REACH_URL="http://host.docker.internal:6300/reach"
- Start / restart Celery to pick up the config change.
- Send a SMS via the Admin app and see the stub log it.
- Reset your environment so you can send normal SMS.
update provider_details set active = true where notification_type = 'sms';
update provider_details set active = false where identifier = 'reach';
[^1]: https://github.com/alphagov/notifications-sms-provider-stub/pull/10
- second class postage will go up by 2 pence, plus VAT
- international postage will go up by 7 pence, plus VAT
- first class postage will go down by 6 pence, plus VAT
With the changes we will be making to email branding in admin it's
useful to know the ID of the NHS email brand. This makes it easier to
show a preview of the branding.
This means that the ID will need to be consistent across environments,
so this changes the details of the NHS brand in the dev, preview and
staging environments to match the production data.
The migration
- Removes NHS branding from existing services
- Removes NHS branding from any orgs which had it as their default
- Deletes the NHS branding row if it exists
- Inserts a new NHS branding row with details which match those in
production
This does have the effect of removing NHS branding from people's local
and preview services and orgs (NHS branding doesn't exist on staging),
but this does not affect real teams. The NHS logo with a filename of
'1ac6f483-3105-4c9e-9017-dd7fb2752c44-nhs-blue_x2.png' exists in the
logo S3 bucket for all environments.
This column has been superseded by a new column named
created_by_api_key_id.
Also create constraint checking that we know who created broadcast
Also move data so that constraint is met before instatiating it.
table. Also add created_by_api_key_id column that will supersede
api_key_id_column.
We do migration and code in separate PRs to make it easier to
downgrade if anything goes wrong.
Add check constraint that created_by_id should not be null, unless
created_by_api_key_id is not null to the migration script. It is
already in the models file.
Also remove check constraint for cancelled_by_id from models, as
this field would only be filled for broadcasts with cancelled
status.
Also add some spacing in that migration script so it is easier
to read.
We don’t store everything that comes in the CAP XML when someone creates
a broadcast via the API.
One thing we do store is `<identifier>` (in a column called `reference`)
which is a unique (to the external system) identifier for the broadcast.
We show this in the front end instead of the template name, because
broadcasts created from the API don’t use templates.
However this ID isn’t very friendly – the Environment Agency just supply
a UUID.
The Environment Agency also populate the `<event>` field with some human
readable text, for example:
> 013 Issue Severe Flood Warning EA
(013 is an area code which will be meaningful to the Flood Warning
Service team)
We should show this in the UI instead of the reference. The first step
towards this is storing it in the database and returning it in the REST
endpoints.
Later we can have the admin app prefer `cap_event` over `reference`,
where `cap_event` is present.
We can’t backfill this data because we don’t keep a copy of the original
XML.
Seems like `<event>` is a mandatory property of `<info>`, so we don’t
need to worry about the field being missing (`<info>` is optional in
CAP but we require it because it contains stuff like the areas which
we need in order to send out the broadcast`).
***
https://www.pivotaltracker.com/story/show/176927060
The broadcast user permissions are changing, so to avoid confusion and
permissions which exist in the database but don't display on the
frontend we are going to remove all existing permissions for users of
broadcast services. This also updates the permissions of invited users
who are still pending.
The exception to this is the `view_activity` permission, which we always
add for broadcast users even if they have no other permissions.
(aad017a184/app/main/forms.py (L1043))
Drop index concurrently will drop the index without locking out concurrent selects, inserts, updates, and deletes on the index's table namely on notifications.