We're going to iterate on how we use the function with a limit, so we
shouldn't say it's "temporary" anymore. We don't need to change the
default, but having it in the function parameters makes it easier to
see that the function doesn't time out all notifications, just some.
Previously we specified the period and calculated the cutoff time
inside the function. Passing it in means we can run the method multiple
times without "new" notifications timing out in the time it takes to
process each batch.
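A minimal sketch of the change, with illustrative names and defaults
(the real function and its default period/limit may differ):

```python
from datetime import datetime, timedelta


# Before (illustrative): the function computed its own cutoff.
def timeout_notifications_old(timeout_period_seconds=14400):
    cutoff_time = datetime.utcnow() - timedelta(seconds=timeout_period_seconds)
    ...


# After (illustrative): the caller computes the cutoff once, so repeated
# batches share the same boundary, and the limit is visible up front.
def timeout_notifications(cutoff_time, limit=100_000):
    ...
```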
We have been running into the problem described in
pallets/flask-sqlalchemy#518, where
our page loads very slowly when viewing a single page of notifications
for a service in the admin app. Tracing this back with SQL
EXPLAIN ANALYZE, I can see that getting the notifications takes about
a second, but the second query, counting how many notifications there
are (to work out if there is a next page of pagination), can take up
to 100 seconds.
As suggested in that issue, we do the pagination ourselves.
Our pagination doesn't need to know exactly how many notifications
there are, just whether there are any on the next page, and that can
be determined without running the slow query to count the total
number of notifications by using `count_pages=False`.
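As a rough sketch of the underlying technique (`count_pages=False` is
our own helper's flag; the function below is illustrative), you can
fetch one extra row to detect whether a next page exists, instead of
counting everything:

```python
def paginate_without_count(query, page, per_page=50):
    # Fetch one extra row: if it comes back, there is a next page,
    # and we never run the expensive COUNT(*) over all notifications.
    items = query.limit(per_page + 1).offset((page - 1) * per_page).all()
    has_next = len(items) > per_page
    return items[:per_page], has_next
```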
This commit is analogous to
c68d1a2f23
The only difference is that in that case, the pagination links are
used to show prev and/or next links in the admin app. In this case,
the pagination links are only used to see if there is a page 2, and
if there is, to say that we are only showing the first 50 results.
The way it was done before, the remainder was incorrect in the
billing report and in the org usage query - it was the sms remainder
left at the start of the report period, not at the end of that period.
This became apparent when we tried to show sms_remainder on the org
usage report, where the start date is always the start of the
financial year. We saw that sms sent by services did not reduce their
free allowance remainder according to the report. As a result, we had
to temporarily remove the sms_remainder column from the report until
we fixed the bug - it has been fixed now, yay!
I think the bug snuck in partly because our fixtures for testing
this part of the code are quite complex, so it was
harder to see that the numbers didn't add up. I have added comments
to the tests to try and make it a bit clearer why the results are
as they are.
I also added comments to the code, and renamed some variables,
to make it easier to understand, as there are quite a few
moving parts in it - subqueries and the like.
I also renamed the fetch_sms_free_allowance_remainder method to
fetch_sms_free_allowance_remainder_until_date so it is clearer
what it does.
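A worked example of the bug, with made-up numbers - a 25,000 free
allowance, 5,000 sms sent before the report period and 10,000 sent
during it:

```python
free_allowance = 25_000
sent_before_period = 5_000
sent_during_period = 10_000

# Old behaviour: remainder as of the *start* of the report period.
remainder_at_start = max(free_allowance - sent_before_period, 0)  # 20,000

# Fixed behaviour: remainder as of the *end* of the report period,
# which is what the renamed method is meant to report.
remainder_at_end = max(
    free_allowance - sent_before_period - sent_during_period, 0
)  # 10,000
```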
In response to [1].
[1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988
It turns out the code that inspired this new alert - in the old
"timeout-sending-notifications" task - was actually redundant, as
we already have a task to "replay" notifications still in "created",
which is much better than just alerting about them.
It's possible the replayed notifications will also fail, but in
both cases we should see some kind of error as a result, so I don't
think we're losing anything by not having an alert.
It's no longer necessary to have a separate function that's now
only called once. While sometimes the separation can bring clarity,
here I think it's clearer to have all the code in one place, and
avoid the functools complexity we had before.
This will log an error when email or SMS notifications have been
stuck in 'created' for too long - normally they should be 'sending'
within seconds, noting that we have a goal of < 10s wait time for
most notifications processed on our platform.
In the next commits we'll decouple similar functionality from the
existing 'timeout-sending-notifications' task.
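A minimal sketch of the kind of check this adds - the task name,
one-minute threshold and dao helper below are assumptions, not the
real implementation:

```python
from datetime import datetime, timedelta

from flask import current_app

from app import notify_celery
from app.dao.notifications_dao import (
    dao_count_notifications_still_in_created,  # hypothetical helper
)


@notify_celery.task(name="check-for-notifications-stuck-in-created")
def check_for_notifications_stuck_in_created():
    # Anything still 'created' after a minute has missed our < 10s goal
    # by a wide margin (the exact threshold is an assumption).
    older_than = datetime.utcnow() - timedelta(minutes=1)
    stuck = dao_count_notifications_still_in_created(older_than)
    if stuck:
        current_app.logger.error(
            "%s notifications stuck in 'created' since before %s",
            stuck,
            older_than,
        )
```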
We have been running into the problem described in
https://github.com/pallets/flask-sqlalchemy/issues/518, where
our page loads very slowly when viewing a single page of notifications
for a service in the admin app. Tracing this back with SQL
EXPLAIN ANALYZE, I can see that getting the notifications takes about
a second, but the second query, counting how many notifications there
are (to work out if there is a next page of pagination), can take up
to 100 seconds.
As suggested in that issue, we do the pagination ourselves.
Our pagination doesn't need to know exactly how many notifications
there are, just whether there are any on the next page, and that can
be determined without running the slow query to count the total
number of notifications by using `count_pages=False`.
We used to do this until April 2020. Let's try doing it again.
Back then, we had problems with timing. We did two things in spring
2020:
- We moved to using an intermediary temp table [1]
- We stopped the tasks being parallelised [2]
However, it turned out the real time saving came from changing which
services we delete for [3]. The task was actually CPU-bound rather than
DB-bound, which is probably why running the tasks in parallel wasn't
helping: they were all competing for the same CPU. It's worth
trying the parallel steps again now that we're no longer CPU-bound.
Note: temporary tables are in their own postgres schema, and are only
visible to the current session (session == connection; each celery
worker process has its own db connection). We don't need to worry about
separate workers both trying to use the same table at once.
I've also added an "ON COMMIT DROP" directive to the table definition
to ensure it doesn't persist past the task even if there's an
exception. (This also drops the table on rollback.)
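A sketch of what the table definition looks like; the table and column
names and the SELECT are illustrative:

```python
from datetime import datetime, timedelta

from sqlalchemy import text

from app import db

cutoff = datetime.utcnow() - timedelta(days=7)

# Each celery worker process has its own db connection (== session), so
# this table is invisible to other workers. ON COMMIT DROP removes it at
# commit *or* rollback, so it cannot outlive the task if an exception is
# raised part-way through.
db.session.execute(
    text(
        """
        CREATE TEMP TABLE notifications_to_delete ON COMMIT DROP AS
        SELECT id FROM notifications
        WHERE notification_type = :type AND created_at < :cutoff
        """
    ),
    {"type": "sms", "cutoff": cutoff},
)
```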
Cronitor monitors the three functions separately, so we don't need to
worry that the main task now takes milliseconds where it used to take
hours - the main task itself isn't monitored.
I've also removed some redundant exception logs.
[1] https://github.com/alphagov/notifications-api/pull/2767
[2] https://github.com/alphagov/notifications-api/pull/2798
[3] https://github.com/alphagov/notifications-api/pull/3381
Add a filter to the `app.dao.notifications_dao.is_delivery_slow_for_providers`
query to improve its performance. Adding `Notification.notification_type == 'sms'`
to the query gives roughly a 500ms improvement according to our analysis,
which matters for a query that runs once a minute.
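A sketch of the change - the surrounding filter is a placeholder; the
point is the added notification_type condition:

```python
from datetime import datetime, timedelta

from app.models import Notification

created_since = datetime.utcnow() - timedelta(minutes=10)  # placeholder

slow_delivery_candidates = Notification.query.filter(
    Notification.notification_type == "sms",  # the new filter
    Notification.created_at >= created_since,  # placeholder condition
)
```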
TLDR: Don't return as many services, and only return their IDs and not
the whole service objects.
Context:
the nightly delete-notifications task has been taking longer and
longer, and deleting all three notification types in sequence now
takes up to 8 hours.
This is because we were retrieving all services, loading them into
memory on the worker, and then trying to delete notifications for each
service in turn.
While it does use a fair chunk of IOPS/CPU on our postgres db, we're
nowhere near capacity on those (20% CPU, 4k IOPS out of a 30k max) [1].
The real issue appears to be that the task is CPU bound on the periodic
worker - we see the worker spike up to 100% CPU regularly across the
whole 3am-11am period.
We also noticed that for each notification type the task first processes
services with custom data retention (not many, but some of the biggest
users), then deals with all other services. We can see from looking at
kibana that, for example, the task starts at 3am and the custom data
retention service email deletions are finished by 3:12am, but the rest
of the emails don't get deleted until 5am - so we knew the problem was
with how it handles the other services.
There are currently 17000 services in the database. On a typical day,
~800 services will have notifications that are over 7 days old and need
to be deleted. By only returning these services, we reduce the amount of
data transfer and serialisation that needs to happen. It takes about two
minutes to retrieve the distinct service ids from the notifications
table for sms notifications, but that result is only 5% of the size of
the full list, so it cuts down on a lot of processing.
Also, by only returning service_ids rather than whole `Service`
models, we avoid sqlalchemy doing lots of data serialisation, when
we were only using the `Service.id` field from that result anyway.
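A rough sketch of both changes together (the function name is
illustrative):

```python
from datetime import datetime, timedelta

from app import db
from app.models import Notification


def dao_get_service_ids_with_notifications_to_delete(notification_type):
    cutoff = datetime.utcnow() - timedelta(days=7)
    # Only ids, and only services that actually have old rows: ~800 ids
    # instead of 17000 fully serialised Service objects.
    rows = (
        db.session.query(Notification.service_id)
        .filter(
            Notification.notification_type == notification_type,
            Notification.created_at < cutoff,
        )
        .distinct()
        .all()
    )
    return [row.service_id for row in rows]
```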
[1] https://admin.cloud.service.gov.uk/organisations/55b1eb7d-e4c5-4359-9466-dd3ca5b0e457/spaces/80d769ff-7b01-49a4-9fa4-f87edd5328f9/services/6093d337-6918-4b97-9709-97529114eb90/metrics
[2] https://grafana-paas.cloudapps.digital/d/_GlGBNbmk/notify-apps?orgId=2&refresh=5s&var-space=production&var-app=notify-delivery-worker-periodic&from=now-24h&to=now
[3] https://kibana.logit.io/s/9423a789-282c-4113-908d-0be3b1bc9d1d/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(message),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'%22Deleting%20email%20notifications%20for%20services%20without%20flexible%20data%20retention%22')),sort:!('@timestamp',desc))
For preview and staging environments, we often send no messages
in a single day. This currently causes a `DivisionByZero` error
that renders the page with no results, making it impossible
to look at preview/staging and check whether the performance page is
working correctly:
(psycopg2.errors.DivisionByZero) division by zero
[SQL: SELECT CAST(ft_processing_time.bst_date AS TEXT) AS date, ft_processing_time.messages_total AS ft_processing_time_messages_total, ft_processing_time.messages_within_10_secs AS ft_processing_time_messages_within_10_secs, (ft_processing_time.messages_within_10_secs / CAST(ft_processing_time.messages_total AS FLOAT)) * %(param_1)s AS percentage
FROM ft_processing_time
WHERE ft_processing_time.bst_date >= %(bst_date_1)s AND ft_processing_time.bst_date <= %(bst_date_2)s ORDER BY ft_processing_time.bst_date]
[parameters: {'param_1': 100, 'bst_date_1': datetime.date(2021, 11, 12), 'bst_date_2': datetime.date(2021, 11, 19)}]
(Background on this error at: http://sqlalche.me/e/14/9h9h)
I've fixed this by falling back to 100.0% for days on which we send
no messages. There's maybe an argument that it should be N/A rather
than 100%, but I don't think it really matters: this only affects
preview and staging, as we will never have a day with no messages
sent in production.
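A sketch of how the fallback can be expressed in the query itself (the
model name is an assumption; the same fallback could equally be applied
in Python after the query runs):

```python
from sqlalchemy import Float, cast, func

from app.models import FactProcessingTime  # model name is an assumption

# NULLIF turns a zero total into NULL (so the division never happens),
# and COALESCE turns the resulting NULL percentage into the 100.0
# fallback for days on which we sent no messages.
percentage = func.coalesce(
    (
        FactProcessingTime.messages_within_10_secs
        / cast(func.nullif(FactProcessingTime.messages_total, 0), Float)
    )
    * 100,
    100.0,
).label("percentage")
```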
There were two problems with the existing message:
1. There was no space between the new status and the time taken,
which made reading and searching harder.
2. The key bits of information (the before and after statuses) were
separated by the time taken (which will always be unique), meaning
you couldn't easily search for, say, a message in 'delivered'
being set to 'temporary-failure'.
From experimenting in production we found a "!=" caused the engine
to use a sequential scan, whereas explicitly listing all the types
ensured an index scan was used.
We also found that querying for many (over 100K) items leads to
the task stalling - no logs, but no evidence of it running either -
so we also add a limit to the query.
Since the query now only returns a subset of notifications, we need
to ensure the subsequent "update" query operates on the same batch.
Also, as a temporary measure, we loop in the task code to ensure it
covers the total set of notifications to "time out", which we assume
is less than 500K for the time being.
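Putting the pieces together, a minimal sketch - the status list, batch
size and status transition below are illustrative:

```python
from datetime import datetime, timedelta

from app import db
from app.models import Notification


def timeout_notifications(cutoff_time, limit=100_000):
    # Explicit status list instead of "status != 'delivered'" so
    # postgres uses an index scan; LIMIT keeps the batch small enough
    # not to stall.
    notifications = (
        Notification.query.filter(
            Notification.status.in_(["created", "sending", "pending"]),
            Notification.created_at < cutoff_time,
        )
        .limit(limit)
        .all()
    )
    # Update exactly the rows we selected, not whatever happens to
    # match the filter by the time the update runs.
    Notification.query.filter(
        Notification.id.in_([n.id for n in notifications])
    ).update({"status": "temporary-failure"}, synchronize_session=False)
    db.session.commit()
    return notifications


# Temporary measure: loop until the backlog (assumed < 500K) is drained.
cutoff = datetime.utcnow() - timedelta(days=4)
while timeout_notifications(cutoff):
    pass
```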
People with dyslexia and dyscalculia find it difficult to transpose
codes which have consecutive, repeated digits [1].
This commit enhances the algorithm for generating codes so that it
never repeats the previous digit in a code.
This reduces the key space for our 5-digit codes from 100,000
possibilities (10^5) to 65,610 (10 × 9^4).
1. https://twitter.com/annaecook/status/1442567679710150662
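A minimal sketch of the enhanced algorithm, assuming 5-digit codes (the
function name is illustrative):

```python
import secrets


def create_secret_code(length=5):
    # Never repeat the previous digit: 10 choices for the first digit
    # and 9 for each of the rest, hence 10 * 9**4 = 65,610 codes.
    digits = []
    while len(digits) < length:
        digit = secrets.choice("0123456789")
        if digits and digit == digits[-1]:
            continue
        digits.append(digit)
    return "".join(digits)
```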
Regardless of channel.
Do not include:
- broadcasts older than 25.05.2021
- stubbed broadcasts
- broadcasts that were not transmitted, so only broadcasting,
cancelled and completed make the list
On a regular Notify service anyone with permission can create an API
key. If this service is then given permission to send emergency alerts,
it will have an API key which can create emergency alerts. This feels
dangerous.
Secondly, if a service which legitimately has an API key for sending
alerts in training mode is changed to live mode, you now have an API
key which people will think isn't going to create a real alert but
actually will. This feels really dangerous.
Neither of these scenarios are things we should be doing, but the fact
that they're possible still makes me feel uncomfortable.
This commit revokes all API keys for a service when its broadcast
settings change, the same way we remove all permissions for its users.
We made a change to remove all permissions from users and invited users
when the broadcast service settings form is submitted
(https://github.com/alphagov/notifications-api/pull/3284). However, when the
form is submitted, notifications-admin always adds the `view_activity`
permission even if no permission boxes are ticked, so we don't want to
remove that one permission
(256c840b46/app/main/forms.py (L1042))
The "normal" service permissions and broadcast service permissions are
going to be different with no overlap. This means that if you were
viewing the team members page, there might be permissions in the
database that are not visible on the frontend if a service has changed
type. For example, someone could have the 'manage_api_keys' permission,
which would not show up on the team members page of a broadcast service.
To avoid people having permissions which aren't visible in admin, we now
remove all permissions from users when their service is converted to a
broadcast service.
Permissions for invited users are also removed.
It's not possible to convert a broadcast service to a normal service,
so we don't need to cover that scenario.
Last year we had an issue with the daily limit cache and the query that was populating it. As a result we have not been checking the daily limit properly. This PR should correct all that.
The daily limit cache was not being incremented in
app.notifications.process_notifications.persist_notification; this
method is, and should always be, the only method used to create a
notification.
We now increment the daily limit cache if redis is enabled (it is
always enabled in production) and the key type for the notification is
team or normal.
We check whether the daily limit is exceeded in many places:
- app.celery.tasks.process_job
- app.v2.notifications.post_notifications.post_notification
- app.v2.notifications.post_notifications.post_precompiled_letter_notification
- app.service.send_notification.send_one_off_notification
- app.service.send_notification.send_pdf_letter_notification
If the daily limit cache is not found, we set the cache to 0 with an
expiry of 24 hours. The daily limit cache key is
service_id-yyyy-mm-dd-count, so each day a new cache key is created.
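A sketch of the increment logic described above (the redis client
methods and helper name are assumptions):

```python
from datetime import datetime

from app import redis_store  # client methods below are assumptions


def increment_todays_notification_count(service_id):
    today = datetime.utcnow().strftime("%Y-%m-%d")
    cache_key = f"{service_id}-{today}-count"
    if redis_store.get(cache_key) is None:
        # First notification of the day: start at 0 and expire after
        # 24 hours, so each day gets a fresh key.
        redis_store.set(cache_key, 0, ex=24 * 60 * 60)
    redis_store.incr(cache_key)
```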
The best thing about this PR is that the
app.service_dao.fetch_todays_total_message_count query has been
removed. This query was not performant and had been wrong for ages.
The query had a group by on notification_type and notification_status,
which not only slows the query down but is wrong: the code only looked
at the first row of the result, but the query returns as many rows as
there are distinct notification type/status combinations, so the
count was incorrect.
Are we concerned that all status types are included? For example,
letters can be cancelled or have validation failures, which shouldn't
be included in the daily limit check.
This adds total_letters to the data that is returned by the
`/platform-stats/data-for-billing-report` endpoint so that we can add
total letters as a column in the CSV file that can be downloaded.
The trouble is that the aggregate query to return the big blue numbers
on the dashboard and the /notifications/{notification_type} page is
taking too long to return.
I have some ideas on how to improve the query, but I should take some
time to do more research and testing. In the meantime, let's just
ignore "today's" total numbers for the high volume services. There are
only two services that this will affect.
It's always going to be in the future anyway.
After some analysis, the query performs better without it.
I'll make a note to update the other queries where we get today's
notification data, to remove the end date filter, in a separate PR.
Simplify logic by changing the dao function to require a user id and a
webauthn cred id. Note that this changes the response from a 400 to a
404 if the cred is for a different user than the supplied id.
Give a minimum length to the text fields in POSTs to create/update a
credential, to avoid surprising edge cases involving empty string
names etc.
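For example, a minLength in the JSON schema for those POSTs might look
like this (field names are illustrative):

```python
post_create_webauthn_credential_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        # minLength rules out empty-string names and blobs up front
        "name": {"type": "string", "minLength": 1},
        "credential_data": {"type": "string", "minLength": 1},
    },
    "required": ["name", "credential_data"],
}
```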
Added some simple validation to the delete endpoint for sanity, but
generally my assumption is that more validation will happen on the
admin side.
Notably, I'm not checking whether credentials are duplicated, nor is
there a uniqueness constraint in the database - I'm not sure the
credential blob will always be reliably equivalent, and I believe the
browser should hopefully take care of dupes.
We want to replace the value `None` for
service.allowed_broadcast_provider with the value of "all". As a first
step, we need to allow both values. Once notifications-admin has been
changed to pass through "all" and all the data in the database has been
updated, we can update the code to stop supporting both values.
Second attempt [1]. This increases the threshold so:
- It's a more substantial amount of money lost (£16).
- It's 10% of the minimum free allowance for a service.
- It's greater than the threshold we have for TV numbers (500).
Having a higher threshold for this alert will help prevent wasted
effort investigating more negligible failures, and reduces the
ambiguity of whether we should take action: we should.
[1]: https://github.com/alphagov/notifications-api/pull/3221
`service_ids_to_purge` was a list of `row` objects rather than a list of `UUID`s.
NOTE: db.session.query(Service).filter(Service.id.notin_(services_with_data_retention)).all() would also have worked. It seems that selecting only attributes from the db.Model caused the change.
- sqlalchemy.sql.expression.case must include an else statement.
- Clearly define the list of columns for the inbound_sms_history insert; getting the list from InboundSmsHistory.__table__.c was causing data type errors.
- Remove relationships where not needed; the foreign key relationship is established in the creation of the column. This gets rid of the warnings referenced here: http://sqlalche.me/e/14/qzyx.
- Update queries now that the user relationship in the ServiceUser db model has been removed.
- Move the check that a template is archived to the view instead of the dao method. The check was clearing the session before the version history could be written.
Deleting notifications in the nightly tasks still needs to be
investigated - the raw sql is causing an error.
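For example, a `case` expression in the SQLAlchemy 1.4 style with an
explicit `else_` (the statuses and labels here are illustrative):

```python
from sqlalchemy import case

from app.models import Notification

# Positional whens plus an explicit else_, which our upgraded queries
# now always include.
status_label = case(
    (Notification.status == "delivered", "done"),
    (Notification.status == "technical-failure", "failed"),
    else_="in progress",
)
```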