TLDR: Don't return as many services, and only return their IDs and not
the whole service objects.
Context:
the delete notifications nightly task has been taking longer and longer,
and to delete all three notification types in sequence it now takes up
to 8 hours.
This is because we were retrieving all services, loading them into
memory on the worker, and then trying to delete notifications for each
service in turn.
While it does use a fair chunk of IOPS/CPU on our postgres db, we're not
anywhere close to capacity on those (20% CPU, 4k IOPS out of 30k max)[1]
The real issue appears to be that the task is CPU bound on the periodic
worker - we see the worker spike up to 100% CPU regularly across the
whole 3am-11am period.
We also noticed that for each notification type the task first processes
services with custom data retention (not many but some of the biggest
users), then deals with all other services. We can see from looking at
kibana that, for example, the task starts at 3am, and the custom data
retention service email deletions are finished by 3:12am. The rest of
the emails don't get deleted until 5am, so we knew that the problem is
with how it handles the other services.
There are currently 17000 services in the database. On a typical day,
~800 services will have notifications that are over 7 days old and need
to be deleted. By only returning these services, we reduce the amount of
data transfer and serialisation that needs to happen. It takes about two
minutes to retrieve the distinct service ids from the notifications
table for sms notifications, but that is only 5% the size of the full
list so cuts down on a lot of processing
Also, by only returning service_ids rather than the whole `Service`
model we avoid sqlalchemy needing to do lots of data serialisation, when
we were only using the `Service.id` field from that result anyway.
[1] https://admin.cloud.service.gov.uk/organisations/55b1eb7d-e4c5-4359-9466-dd3ca5b0e457/spaces/80d769ff-7b01-49a4-9fa4-f87edd5328f9/services/6093d337-6918-4b97-9709-97529114eb90/metrics
[2] https://grafana-paas.cloudapps.digital/d/_GlGBNbmk/notify-apps?orgId=2&refresh=5s&var-space=production&var-app=notify-delivery-worker-periodic&from=now-24h&to=now
[3] https://kibana.logit.io/s/9423a789-282c-4113-908d-0be3b1bc9d1d/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(message),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'%22Deleting%20email%20notifications%20for%20services%20without%20flexible%20data%20retention%22')),sort:!('@timestamp',desc))
For preview and staging environments, we often send no messages
in a single day. This is currently causing a `DivisionByZero` error
that is rendering the page with no results. This makes it impossible
to look at preview/staging and see if the performance page is
working correctly or not.
(psycopg2.errors.DivisionByZero) division by zero
[SQL: SELECT CAST(ft_processing_time.bst_date AS TEXT) AS date, ft_processing_time.messages_total AS ft_processing_time_messages_total, ft_processing_time.messages_within_10_secs AS ft_processing_time_messages_within_10_secs, (ft_processing_time.messages_within_10_secs / CAST(ft_processing_time.messages_total AS FLOAT)) * %(param_1)s AS percentage
FROM ft_processing_time
WHERE ft_processing_time.bst_date >= %(bst_date_1)s AND ft_processing_time.bst_date <= %(bst_date_2)s ORDER BY ft_processing_time.bst_date]
[parameters: {'param_1': 100, 'bst_date_1': datetime.date(2021, 11, 12), 'bst_date_2': datetime.date(2021, 11, 19)}]
(Background on this error at: http://sqlalche.me/e/14/9h9h)
I've fixed this by falling back to 100.0% for days we send
no messages. Maybe some argument that it should be N/A rather than
100% but I think it doesn't really matter as this is only
going to affect preview and staging as we will never have a day
sending no messages in production.
There were two problems with the existing message.
1. There was no space between the new status and the time taken
which made reading and searching harder
2. They key bits of information (before and after status) were
separated by the time taken (which will always be unique) meaning
you couldn't do an easy search for a message that is say in delivered
being attempted to be set to temporary-failure.
From experimenting in production we found a "!=" caused the engine
to use a sequential scan, whereas explicitly listing all the types
ensured an index scan was used.
We also found that querying for many (over 100K) items leads to
the task stalling - no logs, but no evidence of it running either -
so we also add a limit to the query.
Since the query now only returns a subset of notifications, we need
to ensure the subsequent "update" query operates on the same batch.
Also, as a temporary measure, we have a loop in the task code to
ensure it operates on the total set of notifications to "time out",
which we assume is less than 500K for the time being.
People with dyslexia and dyscalculia find it difficult to transpose
codes which have consecutive, repeated digits[1].
This commits enhances the algorithm for generating codes to not repeat
the previous digit in a code.
This reduces the key space for our codes from 100,000 possibilities to
65,610 possibilities.
1. https://twitter.com/annaecook/status/1442567679710150662
Regardless of channel.
Do not include:
- broadcasts older than 25.05.2021
- stubbed broadcasts
- broadcasts that were not transmitted. So only broadcasting,
cancelled and completed make the list;
On a regular Notify service anyone with permission can create an API
key. If this service then is given permission to send emergency alerts
it will have an API key which can create emergency alerts. This feels
dangerous.
Secondly, if a service which legitimately has an API key for sending
alerts in training mode is changed to live mode you now have an API key
which people will think isn’t going to create a real alert but actually
will. This feels really dangerous.
Neither of these scenarios are things we should be doing, but having
them possible still makes me feel uncomfortable.
This commit revokes all API keys for a service when its broadcast
settings change, same way we remove all permissions for its users.
We made a change to remove all permissions from users and invited users
when the broadcast service settings form is submitted
(https://github.com/alphagov/notifications-api/pull/3284). However, when the
form is submitted, notifications-admin always adds the `view_activity`
permission even if no permission boxes are ticked, so we don't want to
remove that one permission
(256c840b46/app/main/forms.py (L1042))
The "normal" service permissions and broadcast service permissions are
going to be different with no overlap. This means that if you were
viewing the team members page, there might be permissions in the
database that are not visible on the frontend if a service has changed
type. For example, someone could have the 'manage_api_keys' permission,
which would not show up on the team members page of a broadcast service.
To avoid people having permissions which aren't visible in admin, we now
remove all permissions from users when their service is converted to a
broadcast service.
Permisions for invited users are also removed.
It's not possible to convert a broadcast service to a normal service, so
we don't need to cover for this scenario.
Last year we had an issue with the daily limit cache and the query that was populating it. As a result we have not been checking the daily limit properly. This PR should correct all that.
The daily limit cache is not being incremented in app.notifications.process_notifications.persist_notification, this method is and should always be the only method used to create a notification.
We increment the daily limit cache is redis is enabled (and it is always enabled for production) and the key type for the notification is team or normal.
We check if the daily limit is exceed in many places:
- app.celery.tasks.process_job
- app.v2.notifications.post_notifications.post_notification
- app.v2.notifications.post_notifications.post_precompiled_letter_notification
- app.service.send_notification.send_one_off_notification
- app.service.send_notification.send_pdf_letter_notification
If the daily limits cache is not found, set the cache to 0 with an expiry of 24 hours. The daily limit cache key is service_id-yyy-mm-dd-count, so each day a new cache is created.
The best thing about this PR is that the app.service_dao.fetch_todays_total_message_count query has been removed. This query was not performant and had been wrong for ages.
The query had a group by on notification_type and notification_status, this not only slows the query down but is wrong. The query only looked at the first result, but this query would return as many rows as different notification types and status, meaning the results do not include the correct number.
Are we concerned that all status types are included. For example letters can be cancelled or have validation-failures which shouldn't be included in the daily limit check.
This adds total_letters to the data that is returned by the
`/platform-stats/data-for-billing-report` endpoint so that we can add
total letters as a column in the CSV file that can be downloaded.
The trouble is the aggregate query to return the big blue numbers on the dashboard and /notifications/{notification_type} page is taking too long to return.
I have some ideas on how to improve the query, but should take some time to do some more research and test. In the meantime, let's just ignore "todays" total numbers for the high volume services. There are only two services that this will affect.
It's always going to be in the future anyway.
After some analysis the query does perform better without it.
I'll make a note to update other queries where we get todays
notification data to remove the end date filter in a separate PR.
simplify logic by changing the dao function to require a user id and a
webauthn cred id. Note that this changes the response from a 400 to a
404 if the cred is for a different user than the supplied id.
give a minimum length to the text fields in POSTS to create/update a
credential to avoid surprising unexpected edge cases involving empty
string names etc.
added some simple validation to the delete endpoint for sanity, but
generally my assumption is that more validation will happen on the admin
side.
noteably im not checking whether the credentials are duplicated, nor is
there a uniqueness constraint in the database - I'm not sure if the
credential blob will always reliably be equivalent, and I believe the
browser should hopefully take care of dupes.
We want to replace the value `None` for
service.allowed_broadcast_provider with the value of "all". As a first
step, we need to allow both values. Once notifications-admin has been
changed to pass through "all" and all the data in the database has been
updated, we can update the code to stop supporting both values.
Second attempt [1]. This increases the threshold so:
- It's a more substantial amount of money lost (£16).
- It's 10% of the minimum free allowance for a service.
- It's greater than the threshold we have for TV numbers (500).
Having a higher threshold for this alert will help prevent wasted
effort investigating more negligible failures, and reduces the
ambiguity of whether we should take action: we should.
[1]: https://github.com/alphagov/notifications-api/pull/3221
`service_ids_to_purge` is a list of `row` object rather than a list of `UUID`.
NOTE: db.session.query(Service).filter(Service.id.notin_(services_with_data_retention)).all() would have also worked. It seems that only selecting attributes from the db.Model has caused the change.
- sqlalchemy.sql.expression.case must include an else statement.
- clearly define list of columns for inbound_sms_history insert, getting the list from InboundSmsHistory.__table__.c was causing data type errors.
- remove relationships when not needed, the foreign key relationship is established in the creation of the column. This will get rid of the warnings referenced here: http://sqlalche.me/e/14/qzyx.
- update queries now that he user relationship in ServiceUser db model has been removed.
- move the check that a template is archived to the view instead of the dao method. The check was clearing the session before the version history could be done.
Deleting notifications in the night tasks still needs to be
investigated. The raw sql is causing an error.
`service_ids_to_purge` is a list of `row` object rather than a list of `UUID`.
NOTE: db.session.query(Service).filter(Service.id.notin_(services_with_data_retention)).all() would have also worked. It seems that only selecting attributes from the db.Model has caused the change.
- sqlalchemy.sql.expression.case must include an else statement.
- clearly define list of columns for inbound_sms_history insert, getting the list from InboundSmsHistory.__table__.c was causing data type errors.
- remove relationships when not needed, the foreign key relationship is established in the creation of the column. This will get rid of the warnings referenced here: http://sqlalche.me/e/14/qzyx.
- update queries now that he user relationship in ServiceUser db model has been removed.
- move the check that a template is archived to the view instead of the dao method. The check was clearing the session before the version history could be done.
Deleting notifications in the night tasks still needs to be
investigated. The raw sql is causing an error.
Introduce a contextmanger function to handle exceptions and nested
transactions. Using the nested_transaction will start a
nested transaction with `db.session.begin_nested`, once the nested
transaction is complete the commit will happen.
`@transactional` has been updated to commit unless in a nested
transaction.
db update/insert.
Using a savepoint for the multiple transactions allows us to rollback if
there is an error when executing the second db transaction.
However, this does add a bit of complexity. Developers need to manage
the db session when calling multiple nested tranactions.
Unit tests have been added to test this functionality and some end to
end tests have been done to make sure all transactions are rollback if
there is an exception while executing the transaction.
the default free allowance for the organisation type.
The update/insert for the default free allowance is done in a separate
transaction. Updates to services need to happen in a transaction to
trigger the insert into the ServicesHistory table. For that reason the
call to set_default_free_allowance_for_service is done after the service
is updated.
I've added a try/except around the set_default_free_allowance_for_service call to ensure we still get the update to the service but get an exception log if the update to annual_billing fails. I believe it's important to preserve the update to the service in the unlikely event that the annual_billing upsert fails.
April 1 2021.
In this PR there is a command to set annual_billing for all active
services with the the new defaults.
The new method `set_default_free_allowance_for_service` will also be
called in a PR to follow that will set a services free allowance to the
default if the organisation for the service is changed.
for the service.
The letters rates for cronw and non crown are the same. It would be nice
to remove the need for crown but for now this is a quick fix.
Previously we did some unnecessary work:
- Collate task. This had one S3 request to get a summary of the object,
which was then used in another request to get the full object. We only
need the size of the object, which is included in the summary [1].
- Archive task. This had one S3 request to get a summary of the object,
which was then used to make another request to delete it. We still need
both requests, but we can remove the S3.Object in the middle.
[1]: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#objectsummary
Previously we made a call to S3 to list objects for a letter, even
though we already had the precise key of the single object to hand.
This removes the one usage of "get_s3_bucket_objects" and uses the
filename directly in the call to remove the object.
Previously, the function would just return a presumed filename. Now that
it actually checks s3, if the file doesn't exist it'll raise an
exception. By default that's a StopIteration at the end of the bucket
iterator, which isn't ideal as this will get supressed if the function
is called within a generator loop further up or anything.
There are a couple of places where we expect the file may not exist, so
we define a custom exception to rescue specifically here. I did consider
subclassing boto's ClientError, but this wasn't straightforward as the
constructor expects to know the operation that failed, which for me is a
signal that it's not an appropriate (re-)use of the class.
Previously we generated the filename we expected a letter PDF to be
stored at in S3, and used that to retrieve it. However, the generated
filename can change over the course of a notification's lifetime e.g.
if the service changes from crown ('.C.') to non-crown ('.N.').
The prefix of the filename is stable: it's based on properties of the
notification - reference and creation - that don't change. This commit
changes the way we interact with letter PDFs in S3:
- Uploading uses the original method to generate the full file name.
The method is renamed to 'generate_' to distinguish it from the new one.
- Downloading uses a new 'find_' method to get the filename using just
its prefix, which makes it agnostic to changes in the filename suffix.
Making this change helps to decouple our code from the requirements DVLA
have on the filenames. While it means more traffic to S3, we rely on S3
in any case to download the files. From experience, we know S3 is highly
reliable and performant, so don't anticipate any issues.
In the tests we favour using moto to mock S3, so that the behaviour is
realistic. There are a couple of places where we just mock the method,
since what it returns isn't important for the test.
Note that, since the new method requires a notification object, we need
to change a query in one place, the columns of which were only selected
to appease the original method to generate a filename.
We no longer will send them any stats so therefore don't need the code
- the code to work out the nightly stats
- the performance platform client
- any configuration for the client
- any nightly tasks that kick off the sending off the stats
We will require a change in cronitor as we no longer will have this task
run meaning we need to delete the cronitor check.