we really don't gain anything by running each service delete in sequence
- we get the services, and then just loop through them deleting per
service. By deleting per service in separate tasks, we can take
advantage of parallelism. the only thing we lose is some log lines but I
don't think we're that interested in them.
only set query limit at the move_notifications dao function - the task
doesn't really care about the technical implementation of how it deletes
the notifications
Previously this was limited to 500K notifications. While we don't
expect to reach this limit, it's not impossible e.g. if we had a
repeat of the incident where one of our providers stopped sending
us status updates. Although that's not great, it's worse if our
code can't cope with the unexpectedly high volume.
This reuses the technique we have elsewhere [1] to keep processing
in batches until there's nothing left. Specifying a cutoff point
means the total amount of work to do can't keep growing.
[1]: 2fb432adaf/app/dao/notifications_dao.py (L441)
Previously we specified the period and calculated the cutoff time
in the function. Passing it in means we can run the method multiple
times and avoid getting "new" notifications to time out in the time
it takes to process each batch.
In response to [1].
[1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988
It turns out the code that inspired this new alert - in the old
"timeout-sending-notifications" task - was actually redundant as
we already have a task to "replay" notifications still in "created",
which is much better than just alerting about them.
It's possible the replayed notifications will also fail, but in
both cases we should see some kind of error due to this, so I don't
think we're losing anything by not having an alert.
This removes 3 duplicate instances of the same code, which is still
tested implicitly via test_process_ses_receipt_tasks [1]. In the
next commit we'll make this test more explicit, to reflect that it's
now being reused elsewhere and shouldn't change arbitrarily.
We do lose the "print" statement from the command instance of the
code, but I think that's a very tolerable loss.
[1]: 16ec8ccb8a/tests/app/celery/test_process_ses_receipts_tasks.py (L94)
This now matches the behaviour of the test above it: mocking out
the DAO function in order to focus on the specific behaviour of
the function under test.
These scenarios are already covered by the DAO tests. It's enough
to just check the DAO function is called as expected.
While sometimes it can be better to have more end-to-end tests, the
convention across much of this app is to do unit tests.
I find it really difficult to visually parse test files unless we
have a consistent convention for how we name our test functions.
In most of our tests the name of the test function starts with the
name of the function under test.
This will log an error when email or SMS notifications have been
stuck in 'created' for too long - normally they should be 'sending'
in seconds, noting that we have a goal of < 10s wait time for most
notifications being processed our platform.
In the next commits we'll decouple similar functionality from the
existing 'timeout-sending-notifications' task.
Note that the new base class doesn't include a bespoke feature we
had here: 'log_on_worker_shutdown'. We've agreed it's reasonable
to remove it for now as it was introduced many years ago and its
use case is unclear - we can always add it back if needed.
We have made it so that gov.uk/alerts shows a ‘1 planned test’ banner
for the whole of the day when there has been an operator test on that
day.
We need to remove the banner when the day is over.
The most straightforward way to do this is to republish the site at the
start of every day. The gov.uk/alerts code[1] will work out if there are
or aren’t any planned tests to show that day.
1. 5a274af6d0/app/models/alerts.py (L38-L44)
Previously we passed along this piece of state via the kwargs for
a task, but this runs the risk of the task accidentally receiving
the extra kwarg unless we've covered all the code paths that could
invoke it directly e.g. retries don't invoke __call__.
This switches to using Celery "headers" to pass the extra state. It
turns out that a Celery has two "header" concepts, which leads to
some confusion and even a bug with the framework [1]:
- In older (pre v4.4) versions of Celery, the "headers" specified
by apply_async() would become _the_ headers in the message that
gets passed around workers, etc. These would be available later on
via "self.request.headers".
- Since Celery protocol v2, the meaning of "headers" in the message
changed to become (basically) _all_ metadata about the task [2],
with the "headers" option in apply_async() being merged [3] into
the big dict of metadata.
This makes using headers a bit confusing unfortunately, since the
data structure we put in is subtly different to what comes out in
the request context. Nonetheless, it still works. I've added some
comments to try and clarify it.
Note that one of the original tests is no longer necessary, since we
don't need to worry about argument passing styles with headers.
[1]: https://github.com/celery/celery/issues/4875
[2]: 663e4d3a0b (diff-07a65448b2db3252a9711766beec23372715cd7597c3e309bf53859eabc0107fR343)
[3]: 681a922220/celery/app/amqp.py (L495)
This adds a task which is designed to be used if we want to recreate the
PDF for a precompiled letter (either one that has been created using the
API or one that has been uploaded through the website).
The task takes the `notification_id` of the letter and passes template
preview the details it needs in order to sanitise the original file and
then replace the version in the letters-pdf bucket with the freshly
sanitised version.
Previously we sent them emails about this manually. We also tried
a Zendesk macro/trigger approach, but using a CC means:
- We can control the behaviour ourselves (Zendesk triggers can only
be edited by admins outside our team).
- We keep the DVLA notification approach consistent and in one place,
so notifications always go to the same people.
- Any further (public) updates to the ticket will also trigger a
notification to DVLA (previous trigger only notified on creation).
When a precompiled letter is sent to us, we set the `to` field as
'Provided as PDF' in
1c1023a877/app/v2/notifications/post_notifications.py (L100-L104)
This then also sets `normalised_to` as `providedaspdf`.
However, when template preview sanitises the letter, pulls out the
address and gives it to the API, we were only setting `to` to be
the new address and had forgotten to also amend `normalised_to` to
be the normalised version. This meant that for all these letters
we accidentally left `normalised_to` as `providedaspdf`. The impact
of this was that we can not then search for these letters in the
admin user interface as they rely on the `normalised_to` field
containing the recipient address.
This commit fixes that bug by also setting the `normalised_to`
field
Previously these metrics weren't very useful because they could be
skewed by long timings for failed notifications, which can take up
to 72 hours to deliver. I'm intentionally not trying to have a dual
running period (with the old and new names) because:
- We don't use the current stats for anything (checking Grafana).
- The current stats get turned into a "bucket" metric in Prometheus
[1][2], which isn't very useful because it can only tell us the mean
time to deliver, but we're actually interested in percentiles.
Switching to a new naming is an opportunity to fix the raw data and
the way it's aggregated, using the same kind of "summary" metric that
we now use for stats about our Celery tasks [3].
[1]: c330a8ac8a/paas/statsd/statsd-mapping.yml (L82)
[2]: https://prometheus.io/docs/practices/histograms/#quantiles
[3]: https://github.com/alphagov/notifications-aws/pull/890
The `auto-expire-broadcast-messages` task checks for expired broadcasts
at five minute intervals. This change now calls the
`publish-govuk-alerts` task in govuk-alerts if there are expired
broadcasts so that the site is updated.
Co-authored-by: Katie Smith <katie.smith@digital.cabinet-office.gov.uk>
When we send or cancel a broadcast message, we now trigger a task
in govuk-alerts repo that polls our API for alerts and
publishes a fresh list of alerts.
Co-authored-by: Pea Tyczynska <pea.tyczynska@digital.cabinet-office.gov.uk>
include a link to a runbook entry.
also the list of acknowledgement files can be very long, so make that
the last thing, and use new lines to space out the message.
This updates the tickets that are created when the
`check_if_letters_still_pending_virus_check` scheduled task detects
letters in the `pending-virus-check` state.
Specifically, no longer test for a p1 zendesk when sending an alert
and drop misleading "p1" from test name when cancelling an alert.
We're no longer creating a P1 from the code, but we _do_ create a
zendesk ticket when sending out an alert.
When cancelling, what we want to test is that we don't create a second
ticket when the alert is cancelled.
Broadcasts created via the API [1] and the Admin app [2] should
both now have this field set. It's also more informative to show
this, and broadcasts created via the API don't have IDs anyway.
There's a small risk that an old broadcast that gets approved won't
have this data, but it's for information only and we intend to
backfill all old broadcasts in the near future.
[1]: 023a06d5fb
[2]: 7dbe3afa19
In one case ("areas=['manchester']") the format was even invalid,
but in general the original value of the column is pretty much
irrelevant for tests that involve updating it (it's highly unlikely
the column would default to the same value as the test data).
Since the expiry is sent as part of the message payload, we don't
need to invoke the CBC proxies (and indeed there's no way to do so
for an expired alert). In future we plan to extend this task so it
triggers the regeneration of content on gov.uk/alerts.
It's worth noting that 'finishes_at' can theoretically be None, in
which case it's unclear when the alert should expire. While alerts
from the Admin app should always have an expiry [1], we have many
in the DB that don't, so it's worth checking for this scenario.
[1]: 078ac10c8d/app/models/broadcast_message.py (L255)
Unlike the other IDs which are stored in the DB, this isn't relevant
for the Celery task as it invokes a link test. Moving it into the
proxy client will also enable us to generate a second ID in the next
commits, where we start doing a link test for the failover lambda.
Previously the Celery task to trigger a link test had to know about
the special case of a sequence number for Vodafone. Since we're about
to change the client to perform multiple tests it makes sense to give
it the knowledge of how to generate number itself.
Note that we have to import the db inline to avoid a circular import,
since this module is itself imported by app/__init__.py.
Other invocations of the Vodafone client use stored sequence numbers
from the DB, which are called "message numbers" in that context. Since
the two use cases are very different (even the names are different!),
having them in two places shouldn't cause any confusion.
if we're served a 429, put the item on the retry queue and retry the
same as if the service returned a 5xx. 429 is commonly returned for rate
limit exceeding, and retrying on a delay is a typical response to that.
Last year we had an issue with the daily limit cache and the query that was populating it. As a result we have not been checking the daily limit properly. This PR should correct all that.
The daily limit cache is not being incremented in app.notifications.process_notifications.persist_notification, this method is and should always be the only method used to create a notification.
We increment the daily limit cache is redis is enabled (and it is always enabled for production) and the key type for the notification is team or normal.
We check if the daily limit is exceed in many places:
- app.celery.tasks.process_job
- app.v2.notifications.post_notifications.post_notification
- app.v2.notifications.post_notifications.post_precompiled_letter_notification
- app.service.send_notification.send_one_off_notification
- app.service.send_notification.send_pdf_letter_notification
If the daily limits cache is not found, set the cache to 0 with an expiry of 24 hours. The daily limit cache key is service_id-yyy-mm-dd-count, so each day a new cache is created.
The best thing about this PR is that the app.service_dao.fetch_todays_total_message_count query has been removed. This query was not performant and had been wrong for ages.
Many of the team members do not look at emails from zendesk, adding a current_app.logger.error message for things we care about to give developers a better chance of seeing them.
I have purposely not added an erro log for `check_for_services_with_high_failure_rates_or_sending_to_tv_numbers` because it's not something we need to look at immediately.
This will allow admin to pass through a value of "government" for the
broadcast_channel. We don't have any logic around the value of service.broadcast_channel,
so no updates are needed to the tasks etc.
We want to replace the value `None` for
service.allowed_broadcast_provider with the value of "all". As a first
step, we need to allow both values. Once notifications-admin has been
changed to pass through "all" and all the data in the database has been
updated, we can update the code to stop supporting both values.
DVLA would like to be able to identify letters sent by the Insolvency
Service, so we are changing the zipfile name. They need all zipfile
names to have the same structure, so we can't just add a marker to files
sent by that service - we have to change all filenames.
The new format is like this:
`{NOTIFY}.{DATE}.{SEQUENCE_ID}.{UNIQUE_ID}.{SERVICE_ID}.{ORG_NAME}.{EXTENSION}`