This fixes a bug where (letter) notifications left in sending would
temporarily get excluded from billing and status calculations once
the service retention period had elapsed, and then get included once
again when they finally get marked as delivered.*
Status and billing tasks shouldn't need to have knowledge about which
table their data is in and getting this wrong is the fundamental cause
of the bug here. Adding a view across both tables abstracts this away
while keeping the query complexity the same.
Using a view also has the added benefit that we no longer need to care
when the status / billing tasks run in comparison to the deletion task,
since we will retrieve the same data irrespective (see below for a more
detailed discussion on data integrity).
*Such a scenario is rare but has happened.
A New View
==========
I've included all the columns that are shared between the two tables,
even though only a subset are actually needed. Having extra columns
has no impact and may be useful in future.
Although the view isn't actually a table, SQLAlchemy appears to wrap
it without any issues, noting that the package doesn't have any direct
support for "view models". Because we're never inserting data, we don't
need most of the kwargs when defining columns.*
*Note that the "default" kwarg doesn't affect data that's retrieved,
only data that's written (if no value is set).
Data Integrity
==============
The (new) tests cover the main scenarios.
We need to be careful with how the view interacts with the deletion /
archiving task. There are two concerns here:
- Duplicates. The deletion task inserts before it deletes [^1], so we
could end up double counting. It turns out this isn't a problem because
a Postgres UNION is an implicit "DISTINCT" [^2]. I've also verified this
manually, just to be on the safe side.
- No data. It's conceivable that the query will check the history table
just before the insertion, then check the notifications table just after
the deletion. It turns out this isn't a problem either because the whole
query sees the same DB snapshot [^3][^4].*
*I can't think of a way to test this as it's a race condition, but I'm
confident the Postgres docs are accurate.
Performance
===========
I copied the relevant (non-PII) columns from Production for data going
back to 2022-04-01. I then ran several tests.
Queries using the new view still make use of indices on a per-table basis,
as the following query plan illustrates:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=1130820.02..1135353.89 rows=46502 width=97) (actual time=629.863..756.703 rows=72 loops=1)
Group Key: notifications_all_time_view.template_id, notifications_all_time_view.sent_by, notifications_all_time_view.rate_multiplier, notifications_all_time_view.international
-> Sort (cost=1130820.02..1131401.28 rows=232506 width=85) (actual time=629.756..708.914 rows=217563 loops=1)
Sort Key: notifications_all_time_view.template_id, notifications_all_time_view.sent_by, notifications_all_time_view.rate_multiplier, notifications_all_time_view.international
Sort Method: external merge Disk: 9320kB
-> Subquery Scan on notifications_all_time_view (cost=1088506.43..1098969.20 rows=232506 width=85) (actual time=416.118..541.669 rows=217563 loops=1)
-> Unique (cost=1088506.43..1096644.14 rows=232506 width=725) (actual time=416.115..513.065 rows=217563 loops=1)
-> Sort (cost=1088506.43..1089087.70 rows=232506 width=725) (actual time=416.115..451.190 rows=217563 loops=1)
Sort Key: notifications_no_pii.id, notifications_no_pii.job_id, notifications_no_pii.service_id, notifications_no_pii.template_id, notifications_no_pii.key_type, notifications_no_pii.billable_units, notifications_no_pii.notification_type, notifications_no_pii.created_at, notifications_no_pii.sent_by, notifications_no_pii.notification_status, notifications_no_pii.international, notifications_no_pii.rate_multiplier, notifications_no_pii.postage
Sort Method: external merge Disk: 23936kB
-> Append (cost=114.42..918374.12 rows=232506 width=725) (actual time=2.051..298.229 rows=217563 loops=1)
-> Bitmap Heap Scan on notifications_no_pii (cost=114.42..8557.55 rows=2042 width=113) (actual time=1.405..1.442 rows=0 loops=1)
Recheck Cond: ((service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'::uuid) AND (notification_type = 'sms'::notification_type) AND (notification_status = ANY ('{sending,sent,delivered,pending,temporary-failure,permanent-failure}'::text[])) AND (created_at >= '2022-05-01 23:00:00'::timestamp without time zone) AND (created_at < '2022-05-02 23:00:00'::timestamp without time zone))
Filter: ((key_type)::text = ANY ('{normal,team}'::text[]))
-> Bitmap Index Scan on ix_notifications_no_piiservice_id_composite (cost=0.00..113.91 rows=2202 width=0) (actual time=1.402..1.439 rows=0 loops=1)
Index Cond: ((service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'::uuid) AND (notification_type = 'sms'::notification_type) AND (notification_status = ANY ('{sending,sent,delivered,pending,temporary-failure,permanent-failure}'::text[])) AND (created_at >= '2022-05-01 23:00:00'::timestamp without time zone) AND (created_at < '2022-05-02 23:00:00'::timestamp without time zone))
-> Index Scan using ix_notifications_history_no_pii_service_id_composite on notifications_history_no_pii (cost=0.70..906328.97 rows=230464 width=113) (actual time=0.645..281.612 rows=217563 loops=1)
Index Cond: ((service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'::uuid) AND ((key_type)::text = ANY ('{normal,team}'::text[])) AND (notification_type = 'sms'::notification_type) AND (created_at >= '2022-05-01 23:00:00'::timestamp without time zone) AND (created_at < '2022-05-02 23:00:00'::timestamp without time zone))
Filter: (notification_status = ANY ('{sending,sent,delivered,pending,temporary-failure,permanent-failure}'::text[]))
Planning Time: 18.032 ms
Execution Time: 759.001 ms
(21 rows)
Queries using the new view appear to be slower than without, but the
differences I've seen are minimal: the original queries execute in
seconds locally and in Production, so it's not a big issue.
Notes: Performance
==================
I downloaded a minimal set of columns for testing:
\copy (
select
id, notification_type, key_type, created_at, service_id,
template_id, sent_by, rate_multiplier, international,
billable_units, postage, job_id, notification_status
from notifications
) to 'notifications.csv' delimiter ',' csv header;
CREATE TABLE notifications_no_pii (
id uuid NOT NULL,
notification_type public.notification_type NOT NULL,
key_type character varying(255) NOT NULL,
created_at timestamp without time zone NOT NULL,
service_id uuid,
template_id uuid,
sent_by character varying,
rate_multiplier numeric,
international boolean,
billable_units integer NOT NULL,
postage character varying,
job_id uuid,
notification_status text
);
copy notifications_no_pii from '/Users/ben.thorner/Desktop/notifications.csv' delimiter ',' csv header;
CREATE INDEX ix_notifications_no_piicreated_at ON notifications_no_pii USING btree (created_at);
CREATE INDEX ix_notifications_no_piijob_id ON notifications_no_pii USING btree (job_id);
CREATE INDEX ix_notifications_no_piinotification_type_composite ON notifications_no_pii USING btree (notification_type, notification_status, created_at);
CREATE INDEX ix_notifications_no_piiservice_created_at ON notifications_no_pii USING btree (service_id, created_at);
CREATE INDEX ix_notifications_no_piiservice_id_composite ON notifications_no_pii USING btree (service_id, notification_type, notification_status, created_at);
CREATE INDEX ix_notifications_no_piitemplate_id ON notifications_no_pii USING btree (template_id);
And similarly for the history table. I then created a sepatate view
across both of these temporary tables using just these columns.
To test performance I created some queries that reflect what is run
by the billing [^5] and status [^6] tasks e.g.
explain analyze select template_id, sent_by, rate_multiplier, international, sum(billable_units), count(*)
from notifications_all_time_view
where
notification_status in ('sending', 'sent', 'delivered', 'pending', 'temporary-failure', 'permanent-failure')
and key_type in ('normal', 'team')
and created_at >= '2022-05-01 23:00'
and created_at < '2022-05-02 23:00'
and notification_type = 'sms'
and service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'
group by 1,2,3,4;
explain analyze select template_id, job_id, key_type, notification_status, count(*)
from notifications_all_time_view
where created_at >= '2022-05-01 23:00'
and created_at < '2022-05-02 23:00'
and notification_type = 'sms'
and service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'
and key_type in ('normal', 'team')
group by 1,2,3,4;
Between running queries I restarted my local database and also ran
a command to purge disk caches [^7].
I tested on a few services:
- c5956607-20b1-48b4-8983-85d11404e61f on 2022-05-02 (high volume)
- 0cc696c6-b792-409d-99e9-64232f461b0f on 2022-04-06 (highest volume)
- 01135db6-7819-4121-8b97-4aa2d741e372 on 2022-04-14 (very low volume)
All execution results are of the same magnitude using the view compared
to the worst case of either table on its own.
[^1]: 00a04ebf54/app/dao/notifications_dao.py (L389)
[^2]: https://stackoverflow.com/questions/49925/what-is-the-difference-between-union-and-union-all
[^3]: https://www.postgresql.org/docs/current/transaction-iso.html
[^4]: https://dba.stackexchange.com/questions/210485/can-sub-selects-change-in-one-single-query-in-a-read-committed-transaction
[^5]: 00a04ebf54/app/dao/fact_billing_dao.py (L471)
[^6]: 00a04ebf54/app/dao/fact_notification_status_dao.py (L58)
[^7]: https://stackoverflow.com/questions/28845524/echo-3-proc-sys-vm-drop-caches-on-mac-osx
This follows the same pattern as for status aggregations [^1]. We
haven't seen this problem for a long time because of [^2], but now
we're trying to re-run the aggregation for some incorrect rows it's
becoming apparent we need to fix it.
The following query currently fails in Production after the 30 min
SQLAlchemy timeout:
select template_id, rate_multiplier, international, sum(billable_units), count(*)
from notification_history
where notification_status in ('delivered', 'sending')
and key_type != 'test'
and notification_type = 'sms'
and service_id = '539d63a1-701d-400d-ab11-f3ee2319d4d4'
and created_at >= '2021-07-07 23:00'
and created_at < '2021-07-08 23:00'
group by 1,2,3,4;
Running a quick "explain analyze" with this change applied returns
near immediately, but hangs without it. This is enough evidence for
me that this change will fix the issue.
[^1]: https://github.com/alphagov/notifications-api/pull/3417
[^2]: e5c76ffda7
This can happen in the following scenario (primarily for letters):
1. A service has a mixture of "delivered" and "sending" letters,
which the status task aggregates into two rows:
sending | 123
delivered | 456
2. After the 7 day retention has passed, only the "delivered" letters
will be archived [^1].
3. The status task now looks at the history table [^2], which means
it only sees the "delivered" letters.
4. The "sending" letters are eventually "delivered" and archived (before
the 10 day aggregation cutoff).
5. But the status aggregation task doesn't run.
This commit fixes (5).
[^1]: https://github.com/alphagov/notifications-api/pull/3063
[^2]: f87ebb094d/app/dao/fact_notification_status_dao.py (L51)
We had the `only` defined in the Meta class, and this wasn't working -
any extra fields were also being loaded. This moves it to the point
where the class is instantiated, which now works.
We have a lot of cases in the schemas where we're excluding a field that
doesn't actually exist on the schema anyway. This is often because a
model has been deleted, and the schema has not been updated. These
excluded fields have no effect at the moment, but Marshmallow 3 does
raise an error if you try and exclude non-existent fields.
There should be no change to what gets (de)serialized after this change.
this was added five years ago but never used. if we want to bring back
variable rates per client we might as well get a fresh start since a lot
has changed since then.
This can be calculated from the "free_allowance_used" field and the
"chargeable_units" field, but having it included separately is more
convenient as it can be used directly in Admin [^1].
[^1]: 417e7370bb/app/templates/views/usage.html (L38-L39)
This represents the number of chargeable_units that were actually
free due to the free allowance - they won't be included in "cost".
Although the existing calculations in Admin [^1][^2] will still be
correct with a change in SMS rates - it's cost that's the problem
- it makes sense to have all the knowledge about calculating usage
consistently in these two APIs.
Note that the Integer casting is covered by the API-level tests in
test_rest.
[^1]: 474d7dfda8/app/main/views/dashboard.py (L490)
[^2]: c63660d56d/app/main/views/dashboard.py (L350)
This will replace the manual calculations in Admin [^1][^2] for SMS
and also in API [^3] for annual letter costs.
Doing the calculation here also means we correctly attribute free
allowance to the earliest rows in the billing table - Admin doesn't
know when a given rate was applied so can't do this without making
assumptions about when we change our rates.
Since the calculation now depends on annual billing, we need to
change all the tests to make sure a suitable row exists. I've also
adjusted the test data to match the assumption that there can only
be one SMS rate per bst_date.
Note about "OVER" clause
========================
Using "rows=" ("ROWS BETWEEN") makes more sense than "range=" as
we want the remainder to be incremental within each group in a
"GROUP BY" clause, as well as between groups i.e
# ROWS BETWEEN (arbitrary numbers to illustrate)
date=2021-04-03, units=3, cost=3.29
date=2021-04-03, units=2, cost=4.17
date=2021-04-04, units=2, cost=5.10
vs.
# RANGE BETWEEN
date=2021-04-03, units=3, cost=4.17
date=2021-04-03, units=2, cost=4.17
date=2021-04-04, units=2, cost=5.10
See [^4] for more details and examples.
[^1]: https://github.com/alphagov/notifications-admin/blob/master/app/templates/views/usage.html#L60
[^2]: 072c3b2079/app/billing/billing_schemas.py (L37)
[^3]: 474d7dfda8/app/templates/views/usage.html (L98)
[^4]: https://learnsql.com/blog/difference-between-rows-range-window-functions/
There is no such thing as a "billing unit". The data this field
contained was also a confusing mixture of two types:
- For emails and letters, it was just "notifications_sent".
- For SMS, it was the "chargeable_units" (billable * multiplier).
This replaces the single, ambiguous "billing_units" field with
"chargeable_units" and "notifications_sent" in both usage APIs.
Once Admin is using them we can remove the old field.
This makes it easier to extend each function with costs and free
allowances - especially for SMS.
I've chosen to duplicate the "WHERE" clause in each subquery vs.
the top-level query. This will make more sense in later commits
where we start adding free allowance calculations, which need to
be done on a yearly basis - knowledge the subqueries should have.
We want to query for service usage in the BST financial year:
2022-04-01T00:00:00+01:00 to 2023-03-31T23:59:59+01:00 =>
2022-04-01 to 2023-03-31 # bst_date
Previously we were only doing this explicitly for the monthly API
and it seemed like the yearly usage API was incorrectly querying:
2022-03-31T23:00:00+00:00 to 2023-03-30T23:00:00+00:00 =>
2022-03-31 to 2023-03-30 # "bst_date"
However, it turns out this isn't a problem for two reasons:
1. We've been lucky that none of our rates have changed since 2017,
which is long ago enough that no one would care.
2. There's a quirk somewhere in Sqlalchemy / Postgres that has been
compensating for the lack of explicit BST conversion.
To help ensure we do this consistently in future I've DRYed-up the
BST conversion into a new utility. I could have just hard-coded the
dates but it seemed strange to have the knowledge twice.
I've also adjusted the tests so they detect if we accidentally use
data from a different financial year. (2) is why none of the test
assertions actually need changing and users won't be affected.
Sqlalchemy / Postgres quirk
===========================
The following queries were run on the same data but results differ:
FactBilling.query.filter(FactBilling.bst_date >= datetime(2021,3,31,23,0), FactBilling.bst_date <= '2021-04-05').order_by(FactBilling.bst_date).first().bst_date
datetime.date(2021, 4, 1)
FactBilling.query.filter(FactBilling.bst_date >= '2021-03-31 23:00:00', FactBilling.bst_date <= '2021-04-05').order_by(FactBilling.bst_date).first().bst_date
datetime.date(2021, 3, 31)
Looking at the actual query for the first item above still suggests
the results should be the same, but for the use of "timestamp".
SELECT ...
FROM ft_billing
WHERE ft_billing.service_id = '16b60315-9dab-45d3-a609-e871fbbf5345'::uuid AND ft_billing.bst_date >= '2016-03-31T23:00:00'::timestamp AND ft_billing.bst_date <= '2017-03-31T22:59:59.999999'::timestamp AND ft_billing.notification_type IN ('email', 'letter') GROUP BY ft_billing.rate, ft_billing.notification_type UNION ALL SELECT sum(ft_billing.notifications_sent) AS notifications_sent, sum(ft_billing.billable_units * ft_billing.rate_multiplier) AS billable_units, ft_billing.rate AS ft_billing_rate, ft_billing.notification_type AS ft_billing_notification_type
FROM ft_billing
WHERE ft_billing.service_id = '16b60315-9dab-45d3-a609-e871fbbf5345'::uuid AND ft_billing.bst_date >= '2016-03-31T23:00:00'::timestamp AND ft_billing.bst_date <= '2017-03-31T22:59:59.999999'::timestamp AND ft_billing.notification_type = 'sms' GROUP BY ft_billing.rate, ft_billing.notification_type) AS anon_1 ORDER BY anon_1.notification_type, anon_1.rate
If we try some manual queries with and without '::timestamp' we get:
select distinct(bst_date) from ft_billing where bst_date >= '2022-04-20T23:00:00' order by bst_date desc;
bst_date
------------
2022-04-21
2022-04-20
select distinct(bst_date) from ft_billing where bst_date >= '2022-04-20T23:00:00'::timestamp order by bst_date desc;
bst_date
------------
2022-04-21
2022-04-20
It looks like this is happening because all client connections are
aware of the local timezone, and naive datetimes are interpreted as
being in UTC - not necessarily true, but saves us here!
The monthly API datetimes were pre-converted to dates, so none of
this was relevant for deciding exactly which date to use.
you can still use this flag locally but we have it enabled for all
environments and it doesn't need to be toggleable from credentials as it
isn't a secret value.
If we wish to turn redis off for a specific environment we can create a
PR to change the config.
The big breaking change for our code (not mentioned in the changelog) is
that the built-in validator for the `date-time` format now requires the
`rfc3339-validator` package instead of the `strict-rfc3339` package.
This updates the requirements file to use `rfc3339-validator`. Without
this change, wrong `date-time` formats would always silently pass validation.
We were using the Draft4Validator in one place, so this updates it to
the Draft7Validator instead.
The schemas were mostly using draft 4 of the JSON schema, though there
were a couple of schemas that were already of version 7. This updates
them all to version 7, which is the latest version fully supported by
the jsonschema Python package. There are some breaking changes in the
newer version of the schema, but I could not see anywhere would these
affect us. Some of these schemas were not valid in version 4, but are
now valid in version 7 because `"required": []` was not valid in earlier
versions.
This is to support a migration from Redislabs to PaaS native Redis,
allowing us to toggle between old and new using the env vars for
the instance - without needing to change the code.
Fixes this error (in Admin):
File "/home/vcap/app/app/notify_client/notification_api_client.py", line 74, in send_precompiled_letter
return self.post(url='/service/{}/send-pdf-letter'.format(service_id), data=data)
File "/home/vcap/app/app/notify_client/__init__.py", line 59, in post
return super().post(*args, **kwargs)
File "/home/vcap/deps/0/python/lib/python3.9/site-packages/notifications_python_client/base.py", line 48, in post
return self.request("POST", url, data=data)
File "/home/vcap/deps/0/python/lib/python3.9/site-packages/notifications_python_client/base.py", line 64, in request
response = self._perform_request(method, url, kwargs)
File "/home/vcap/deps/0/python/lib/python3.9/site-packages/notifications_python_client/base.py", line 118, in _perform_request
raise api_error
notifications_python_client.errors.HTTPError: 500 - Internal server error
Due to this error (in API):
File "/home/vcap/app/app/service/send_notification.py", line 178, in send_pdf_letter_notification
raise e
File "/home/vcap/app/app/service/send_notification.py", line 173, in send_pdf_letter_notification
letter = utils_s3download(current_app.config['TRANSIENT_UPLOADED_LETTERS'], file_location)
File "/home/vcap/deps/0/python/lib/python3.9/site-packages/notifications_utils/s3.py", line 53, in s3download
raise S3ObjectNotFound(error.response, error.operation_name)
notifications_utils.s3.S3ObjectNotFound: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
I checked the DB to verify the letter does actually exist i.e. it
is an instance of the problem we're fixing here.
We previously decided against this [^1] but the lambda alerts are
not a replacement for this one.
Even if both sets of alerts go off, we expect PagerDuty to aggregate
them into a single notification.
[^1]: 65c21f694c
It's worth noting this will break the Admin page to edit provider
priorities, which currently works in units of 10. We already have
work planned to fix this [^1] and it's not an immediate problem.
The automatic adjustment algorithm will continue to work properly
as it can cope with increments smaller than 10 [^2].
[^1]: https://www.pivotaltracker.com/story/show/181681739
[^2]: b145a29935/app/dao/provider_details_dao.py (L122)