When we cloned the repository and started making modifications, we
didn't initially keep tests in step. This commit tries to get us to a
clean test run by skipping tests that are failing and removing some
that we no longer expect to use (MMG, Firetext), with the intention that
we will come back in future and update or remove them as appropriate.
To find all tests skipped, search for `@pytest.mark.skip(reason="Needs
updating for TTS:`. There will be a brief description of the work that
needs to be done to get them passing, if known. Delete that line to make
them run in a standard test run (`make test`).
This effectively reverts [^1], which was only a temporary change.
I suspect the performance problem will go away with [^2].
While we've clearly been managing without this change, it resulted
in several rows being left as incorrect when letter receipts were
delayed. It makes sense for us to run this task for the same period
as we do to aggregate statuses, as status affects billing.
[^1]: e5c76ffda7
[^2]: https://github.com/alphagov/notifications-api/pull/3542
The new alert happens earlier but is otherwise the same:
- We only create a ticket in Production.
- We only create a ticket on approval.
I took this opportunity to refactor the alert as a private function
and test this specifically in detail to avoid lots of repetitive
mocks, which are required when calling the main "update" function.
One test I haven't preserved was for when the "names" array is empty,
as this was added for a legacy data integrity scenario [^1].
[^1]: bf0bf4e31c
This is enough to update a notification in DB:
1. First create a notification in the UI and sent it.
2. Then reset its attributes to pretend it's for Reach.
update notifications set
sent_at = null,
sent_by = null,
notification_status='sending'
where id='some-uuid';
3. Change "notification_id" to "<some-uuid>" in the code.
4. Call the boilerplate endpoint for Reach callbacks.
curl -X POST localhost:6011/notifications/sms/reach
Interestingly there's no foreign key constraint on "sent_by" in the
DB, so this just works: the notification is updated.
there's not anything we know we need to do now that we resolve stuck
letters automatically. Letters couuld still get into this state, so it's
worth alerting us. However, we don't have anything concrete that we know
how to fix these letters, so we should just remove the runbook entirely.
It is currently 60 seconds but we have had two incidents in the
past week where there is a connection error talking to a service
and the request takes up to 60 seconds before failing. When this
happens, if there are a few of these callbacks then all of them
will completely hog the service callback worker and build up a big
queue of all the other service callbacks.
5 seconds has been chosen as that is still a pretty decent length
time for a simple web request that should just be giving them a
little bit of information for them to store. 5 seconds should be a
sufficient enough reduction that we dramatically reduce this problem
for the moment.
Open to this number
being changed in the future based on how we see it perform.
Since sept 2019 we've had to log on to production around once every
twenty days to restart the virus scan task for a letter. Most of the
time this is just a case of making sure the file is in the scan bucket,
and then triggering the task. If the file isn't in the scan bucket we'd
need to do some more manual investigation to find out exactly where the
file got stuck, but I can only remember times when it's been in the scan
bucket.
So if the file is in the scan bucket, we can just check that with code
and kick the task off automatically.
Previously we were looping over data from the Notifications/History
table and then shovelling it into the status table, one row at a time
- plus an extra delete to clean up any existing data.
This replaces that with a batch insertion, similar to how we archive
notifications [1], but using a simple subquery (via "from_select" [2])
instead of a temporary table.
To make the select compatible with the insert, I've used "literal"
to inject the constant pieces of data, so each row has everything it
needs to go into the status table.
[1]: 9ce6d2fe92/app/dao/notifications_dao.py (L295)
[2]: https://docs.sqlalchemy.org/en/14/core/dml.html#sqlalchemy.sql.expression.Insert.from_select
This takes a similar approach to the nightly deletion task so that
we only create sub-tasks when there are actually notifications to
aggregate for a given type and day [1].
We're making this change to stop the duplication errors we're getting
at the moment and ensure the task can scale to more messages and more
services. There are two parts to this:
- Each subtask should now run within the 5 minute visibility timeout.
However, they may still be duplicated if the parent task overruns [2].
- The parent task creates a mininal number of subtasks, and the query
to determine this is very fast for a normal process day (milliseconds).
Since all tasks will run quickly, there should be no more duplication.
In order to test this more nuanced task, I rewrote the tests:
- One test checks the subtask is called correctly.
- One test checks we create all the right subtasks.
[1]: https://github.com/alphagov/notifications-api/pull/3381
[2]: https://docs.google.com/document/d/1MaP6Nyy3nJKkuh_4lP1wuDm19X8LZITOLRd9n3Ax-xg/edit#heading=h.q3intzwqhfzl
These can be inferred elsewhere:
- Task creation is obvious from task execution. If we're concerned
about a specific service, we can check the updated times on the DB
records, since all records are recreated each time this runs.
- Task starting is already logged.
- Task completion is already logged. The number of rows updated can
also be inferred from the DB.
The log I've found useful is the one about fetching the data, and
I've also added another to time how long it takes to insert the data,
as both could be sources of poor performance.
Arguably we should use metrics for this sort of thing, but logs are
easier in practice for the metric systems we have.
Changes:
53.0.0
---
* `notifications_utils.columns.Columns` has moved to
`notifications_utils.insensitive_dict.InsensitiveDict`
* `notifications_utils.columns.Rows` has moved to
`notifications_utils.recipients.Rows`
* `notifications_utils.columns.Cell` has moved to
`notifications_utils.recipients.Cell`
52.0.0
---
* Deprecate the following unused `redis_client` functions:
- `redis_client.increment_hash_value`
- `redis_client.decrement_hash_value`
- `redis_client.get_all_from_hash`
- `redis_client.set_hash_and_expire`
- `redis_client.expire`
51.3.1
---
* Bump govuk-bank-holidays to cache holidays for next year.
we try and delete for lots of services. this includes services that
don't actually have anything to delete that day. that might be because
they had a custom data retention so we always go to check them, or
because they only sent test notifications (which we'll delete but not
include in the count in the log line). we don't really need to see log
lines saying that we didn't delete anything for that service - that's
just a long list of boring log messages that will hide the actual
interesting stuff - which services we did delete content for.
The top-level task didn't run successfully after this was deployed
due to the worker being killed due to heavy disk usage. While the
more parallel version does log much more, it doesn't totally explain
the disk behaviour. Nonetheless, reverting it is sensible to give us
the time we need to investigate more.