Commit Graph

885 Commits

Author SHA1 Message Date
Ben Thorner
779b8e941f Rewrite broadcast Zendesk alert at approval time
The new alert happens earlier but is otherwise the same:

- We only create a ticket in Production.
- We only create a ticket on approval.

I took this opportunity to refactor the alert as a private function
and test this specifically in detail to avoid lots of repetitive
mocks, which are required when calling the main "update" function.

One test I haven't preserved was for when the "names" array is empty,
as this was added for a legacy data integrity scenario [^1].

[^1]: bf0bf4e31c
2022-04-05 12:57:08 +01:00
Ben Thorner
b439fd0718 Add boilerplate for Reach SMS callbacks
This is enough to update a notification in DB:

1. First create a notification in the UI and sent it.

2. Then reset its attributes to pretend it's for Reach.

    update notifications set
      sent_at = null,
      sent_by = null,
      notification_status='sending'
    where id='some-uuid';

3. Change "notification_id" to "<some-uuid>" in the code.

4. Call the boilerplate endpoint for Reach callbacks.

    curl -X POST localhost:6011/notifications/sms/reach

Interestingly there's no foreign key constraint on "sent_by" in the
DB, so this just works: the notification is updated.
2022-03-24 16:56:33 +00:00
Ben Thorner
a962721915 Move SMS client response tests to match their app/
This makes them easier to find and removes ambiguity with other
files related to "processing" in the app.
2022-03-24 16:47:30 +00:00
Leo Hemsted
00259893f1 automatically retry letters stuck in pending-virus-scan
Since sept 2019 we've had to log on to production around once every
twenty days to restart the virus scan task for a letter. Most of the
time this is just a case of making sure the file is in the scan bucket,
and then triggering the task. If the file isn't in the scan bucket we'd
need to do some more manual investigation to find out exactly where the
file got stuck, but I can only remember times when it's been in the scan
bucket.

So if the file is in the scan bucket, we can just check that with code
and kick the task off automatically.
2022-03-07 18:31:46 +00:00
Katie Smith
bff4e3a709 Fix test which was flakey due to order of items returned from db
The test was querying `FactNotificationStatus` and ordering the results
by bst_date and notification_type then checking the rows. However, the
bst_date and notification_type for each row is the same, so this test
could fail based on the order that the results came back in. By ordering
on the notification_status instead, we can be sure of the order of the
results.
2022-02-28 13:03:55 +00:00
Ben Thorner
574b1d3c63 Fix not deleting dead rows in status table
To address: https://github.com/alphagov/notifications-api/pull/3454#pullrequestreview-880302729

Since we now delete all rows before inserting fresh ones, we no
longer need to worry about conflicts. I've also extended the old
test to check all three kinds of overwrite: new, changed, gone.
2022-02-16 13:40:08 +00:00
Ben Thorner
1213463b8e Only aggregate status when necessary for a service
This takes a similar approach to the nightly deletion task so that
we only create sub-tasks when there are actually notifications to
aggregate for a given type and day [1].

We're making this change to stop the duplication errors we're getting
at the moment and ensure the task can scale to more messages and more
services. There are two parts to this:

- Each subtask should now run within the 5 minute visibility timeout.
However, they may still be duplicated if the parent task overruns [2].

- The parent task creates a mininal number of subtasks, and the query
to determine this is very fast for a normal process day (milliseconds).

Since all tasks will run quickly, there should be no more duplication.

In order to test this more nuanced task, I rewrote the tests:

- One test checks the subtask is called correctly.
- One test checks we create all the right subtasks.

[1]: https://github.com/alphagov/notifications-api/pull/3381
[2]: https://docs.google.com/document/d/1MaP6Nyy3nJKkuh_4lP1wuDm19X8LZITOLRd9n3Ax-xg/edit#heading=h.q3intzwqhfzl
2022-02-09 17:39:07 +00:00
Ben Thorner
018a253b6f Revert "Revert running status aggregation in parallel"
This reverts commit 0f6dea0deb.
2022-02-09 17:39:00 +00:00
Chris Hill-Scott
0614a70764 Merge pull request #3445 from alphagov/bump-utils-53.0.0
Bump utils to 53.0.0
2022-02-08 09:56:18 +00:00
Chris Hill-Scott
7f72d3a60f Bump utils to 53.0.0
Changes:

53.0.0
---

* `notifications_utils.columns.Columns` has moved to
  `notifications_utils.insensitive_dict.InsensitiveDict`
* `notifications_utils.columns.Rows` has moved to
  `notifications_utils.recipients.Rows`
* `notifications_utils.columns.Cell` has moved to
  `notifications_utils.recipients.Cell`

52.0.0
---

* Deprecate the following unused `redis_client` functions:
  - `redis_client.increment_hash_value`
  - `redis_client.decrement_hash_value`
  - `redis_client.get_all_from_hash`
  - `redis_client.set_hash_and_expire`
  - `redis_client.expire`

51.3.1
---

* Bump govuk-bank-holidays to cache holidays for next year.
2022-02-08 09:45:10 +00:00
David McDonald
1d8fafcdf4 Remove unused functions
Can't see these being used anywhere so lets get rid of them
2022-02-07 15:58:04 +00:00
Rebecca Law
09c8fbe982 Merge pull request #3418 from alphagov/letters-too-long
Mark letters as validation-failed if the templated letter is too long.
2022-02-02 08:30:50 +00:00
Rebecca Law
c01c81326c Update log message to something a little easier to read and query for. 2022-01-24 12:25:53 +00:00
Ben Thorner
0f6dea0deb Revert running status aggregation in parallel
The top-level task didn't run successfully after this was deployed
due to the worker being killed due to heavy disk usage. While the
more parallel version does log much more, it doesn't totally explain
the disk behaviour. Nonetheless, reverting it is sensible to give us
the time we need to investigate more.
2022-01-20 12:22:33 +00:00
Rebecca Law
6cd7a23d3c If there is an invalid letter that has not been updated to validation-failed because the update-validation-failed-for-templated-letter has not been picked up off the letter-tasks queue and the collate-letter-pdfs-to-be-sent has started.
1. The number of letters that we send to DVLA will be not be correct (see 20ead82463/app/celery/letters_pdf_tasks.py (L136))
This may raise an alert with DVLA when they find we have sent them fewer letter than we have reported.
2. When we get the PDF from S3 we will get a file not found 20ead82463/app/celery/letters_pdf_tasks.py (L244)
The error will not prevent the collate task from completing but we will see an alert email for the exception and raise questions.

Although this situation is very unlikely because we have a 15 minute window between the last letter deadline date and the time we kick off the collate task we should still mitigate these issues. I updated the queries to only return letters with billable_units > 0, all valid letters should have at least 1 billable unit.
2022-01-19 08:31:19 +00:00
Rebecca Law
841a4fc22f Mark letters as validation-failed if the templated letter is too long.
It is possible that the personalisation for a templated letter can make the letter exceed 10 pages or 5 sheets. We are not validating the letters posted via the API for this validation error. It is only possible to validate the letter once we create the PDF in notifications-template-preview. This means that the letter can only get a validation-failed status after the client has received a 201 from the POST to /v2/notifications.
NOTE: we only validate the preview row of a CSV for this validation error, this change will mean that it is possible for a letter to be marked as validation-failed after a successful file upload.

A new task to update the notification to `validation-failed` has been added to the API. If we find that the letter is too long once we have created the PDF we call the `update-validation-failed-for-templated-letter` task rather than `update-billable-units-for-letter` task.

New work flow for a letter in brief:
API - receives POST /v2/notifications
:: save to db
:: put CREATE_LETTERS_PDF task on queue for template preview to consume
TEMPLATE-PREVIEW - consumes task CREATE_LETTERS_PDF
:: create PDF
:: count pages of PDF
:: IF page count exceeds 10 pages
	 put in the letters-invalid-pdf S3 bucket with metadata (similar to the precompiled letters)
	 put `update-validation-failed-for-templated-letter` task on the queue for the API to consume
   ELSE
     put PDF in the `letters-pdf` bucket
     put `update-billable-units-for-letter` task on the queue
API - consumes `update-billable-units-for-letter` OR `update-validation-failed-for-templated-letter` task
:: IF `update-billable-units-for-letter` task:
   	update billable units for notification as usual
:: ELSE `update-validation-failed-for-templated-letter`:
   	update notification_status = `validation-failed`
ADMIN - view notification page for letter
:: show validation letter for templated letter

There will be 3 PRs in order to make this change, one for the API, template-preview and the admin app.

Deployment plan

Deploy Admin first
Deploy API
Deploy template-preview

Related PRs:
alphagov/notifications-template-preview#619
alphagov/notifications-admin#4107

https://www.pivotaltracker.com/story/show/169209742
2022-01-19 08:29:48 +00:00
Ben Thorner
9182ebf4e5 Parallelise status aggregation by service and day
This follows a similar approach as [1]. Recently we've seen lots
of errors from this task, which we think are a consequence of it
doing too much work and tripping Celery's visibility timeout.

While we can optimise the query [2], it's likely the errors will
return as the number of live services grows. Parallelising the
aggregation now will make it more futureproof.

[1]: https://github.com/alphagov/notifications-api/pull/3397
[2]: https://github.com/alphagov/notifications-api/pull/3417
2022-01-12 15:47:59 +00:00
Ben Thorner
9fc8b904c6 DRY up status aggregation tests (move DAO tests up)
The previous DAO tests were also confusing because they were testing
two functions at the same time, so moving the tests up to the task
level seems very reasonable, and will make it easier to change how
this code works in the next commits.
2022-01-11 16:11:36 +00:00
Leo Hemsted
49cc1b643f split delete task up into per service
we really don't gain anything by running each service delete in sequence
- we get the services, and then just loop through them deleting per
service. By deleting per service in separate tasks, we can take
advantage of parallelism. the only thing we lose is some log lines but I
don't think we're that interested in them.

only set query limit at the move_notifications dao function - the task
doesn't really care about the technical implementation of how it deletes
the notifications
2021-12-14 15:24:34 +00:00
Ben Thorner
87cd40d00a Scale timeout task to work on arbitrary volumes
Previously this was limited to 500K notifications. While we don't
expect to reach this limit, it's not impossible e.g. if we had a
repeat of the incident where one of our providers stopped sending
us status updates. Although that's not great, it's worse if our
code can't cope with the unexpectedly high volume.

This reuses the technique we have elsewhere [1] to keep processing
in batches until there's nothing left. Specifying a cutoff point
means the total amount of work to do can't keep growing.

[1]: 2fb432adaf/app/dao/notifications_dao.py (L441)
2021-12-13 17:14:28 +00:00
Ben Thorner
76aeab24ce Rewrite DAO timeout method to take cutoff_time
Previously we specified the period and calculated the cutoff time
in the function. Passing it in means we can run the method multiple
times and avoid getting "new" notifications to time out in the time
it takes to process each batch.
2021-12-13 16:56:21 +00:00
Ben Thorner
ab4cb029df Remove alert for email / sms in created
In response to [1].

[1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988

It turns out the code that inspired this new alert - in the old
"timeout-sending-notifications" task - was actually redundant as
we already have a task to "replay" notifications still in "created",
which is much better than just alerting about them.

It's possible the replayed notifications will also fail, but in
both cases we should see some kind of error due to this, so I don't
think we're losing anything by not having an alert.
2021-12-06 14:11:42 +00:00
Ben Thorner
9bd2a9b427 Extract tests for conditionally creating callback
This will help ensure the function doesn't change arbitrarily, now
that it's used in multiple other places.
2021-12-06 14:11:41 +00:00
Ben Thorner
04da017558 DRY-up conditionally creating callback tasks
This removes 3 duplicate instances of the same code, which is still
tested implicitly via test_process_ses_receipt_tasks [1]. In the
next commit we'll make this test more explicit, to reflect that it's
now being reused elsewhere and shouldn't change arbitrarily.

We do lose the "print" statement from the command instance of the
code, but I think that's a very tolerable loss.

[1]: 16ec8ccb8a/tests/app/celery/test_process_ses_receipts_tasks.py (L94)
2021-12-06 14:11:34 +00:00
Ben Thorner
aea555fce2 Make test for timeout with callbacks consistent
This now matches the behaviour of the test above it: mocking out
the DAO function in order to focus on the specific behaviour of
the function under test.
2021-12-06 14:00:42 +00:00
Ben Thorner
5bf3fe6c0f Clarify no callbacks are sent in timeout test
This now complements the test below it, which we will refactor to
be consistent in the next commit.
2021-12-06 14:00:41 +00:00
Ben Thorner
8b7e81958d Delete duplicate 'timeout' tests for notifications
These scenarios are already covered by the DAO tests. It's enough
to just check the DAO function is called as expected.

While sometimes it can be better to have more end-to-end tests, the
convention across much of this app is to do unit tests.
2021-12-06 14:00:40 +00:00
Ben Thorner
c3e11d676f Remove unnecessary test_request_context manager
This doesn't affect how the tests run and just adds complexity.
2021-12-06 14:00:39 +00:00
Ben Thorner
05bd26d444 Fix names for a few tests in test_nightly_tasks.py
I find it really difficult to visually parse test files unless we
have a consistent convention for how we name our test functions.
In most of our tests the name of the test function starts with the
name of the function under test.
2021-12-06 14:00:38 +00:00
Ben Thorner
0318229216 Stop 'timing out' old 'created' notifications
This is being replaced with a new alert and runbook [1]. It's not
always appropriate to change the status to 'technical-failure', and
the new alert means we'll act to fix the underlying issue promptly.

We'll look at tidying up the remaining code in the next commits.

[1]: https://github.com/alphagov/notifications-manuals/wiki/Support-Runbook#deal-with-email-or-sms-still-in-created
2021-12-06 14:00:36 +00:00
Ben Thorner
f96ba5361a Add new task to alert about created email / sms
This will log an error when email or SMS notifications have been
stuck in 'created' for too long - normally they should be 'sending'
in seconds, noting that we have a goal of < 10s wait time for most
notifications being processed our platform.

In the next commits we'll decouple similar functionality from the
existing 'timeout-sending-notifications' task.
2021-12-06 14:00:31 +00:00
Ben Thorner
528223ed61 Use central NotifyCelery base class in utils
Note that the new base class doesn't include a bespoke feature we
had here: 'log_on_worker_shutdown'. We've agreed it's reasonable
to remove it for now as it was introduced many years ago and its
use case is unclear - we can always add it back if needed.
2021-11-16 13:58:12 +00:00
Chris Hill-Scott
0236318189 Republish gov.uk/alerts every night to clear down planned tests
We have made it so that gov.uk/alerts shows a ‘1 planned test’ banner
for the whole of the day when there has been an operator test on that
day.

We need to remove the banner when the day is over.

The most straightforward way to do this is to republish the site at the
start of every day. The gov.uk/alerts code[1] will work out if there are
or aren’t any planned tests to show that day.

1. 5a274af6d0/app/models/alerts.py (L38-L44)
2021-11-15 14:23:32 +00:00
Ben Thorner
d66c68d6d6 Merge pull request #3364 from alphagov/celery-headers-request-id-180213914
Move Celery task Request ID injection into headers
2021-11-12 11:10:29 +00:00
Ben Thorner
89a8dd1a03 Move Celery task Request ID injection into headers
Previously we passed along this piece of state via the kwargs for
a task, but this runs the risk of the task accidentally receiving
the extra kwarg unless we've covered all the code paths that could
invoke it directly e.g. retries don't invoke __call__.

This switches to using Celery "headers" to pass the extra state. It
turns out that a Celery has two "header" concepts, which leads to
some confusion and even a bug with the framework [1]:

- In older (pre v4.4) versions of Celery, the "headers" specified
by apply_async() would become _the_ headers in the message that
gets passed around workers, etc. These would be available later on
via "self.request.headers".

- Since Celery protocol v2, the meaning of "headers" in the message
changed to become (basically) _all_ metadata about the task [2],
with the "headers" option in apply_async() being merged [3] into
the big dict of metadata.

This makes using headers a bit confusing unfortunately, since the
data structure we put in is subtly different to what comes out in
the request context. Nonetheless, it still works. I've added some
comments to try and clarify it.

Note that one of the original tests is no longer necessary, since we
don't need to worry about argument passing styles with headers.

[1]: https://github.com/celery/celery/issues/4875
[2]: 663e4d3a0b (diff-07a65448b2db3252a9711766beec23372715cd7597c3e309bf53859eabc0107fR343)
[3]: 681a922220/celery/app/amqp.py (L495)
2021-11-10 18:03:40 +00:00
Katie Smith
3d4796c924 Add task to resanitise and replace a PDF for precompiled letter
This adds a task which is designed to be used if we want to recreate the
PDF for a precompiled letter (either one that has been created using the
API or one that has been uploaded through the website).

The task takes the `notification_id` of the letter and passes template
preview the details it needs in order to sanitise the original file and
then replace the version in the letters-pdf bucket with the freshly
sanitised version.
2021-11-10 09:51:31 +00:00
Ben Thorner
d1586a8f81 CC DVLA in tickets about outstanding letters
Previously we sent them emails about this manually. We also tried
a Zendesk macro/trigger approach, but using a CC means:

- We can control the behaviour ourselves (Zendesk triggers can only
be edited by admins outside our team).

- We keep the DVLA notification approach consistent and in one place,
so notifications always go to the same people.

- Any further (public) updates to the ticket will also trigger a
notification to DVLA (previous trigger only notified on creation).
2021-10-29 11:46:29 +01:00
David McDonald
5a51ab6131 Bug fix: update normalised_to, not just to after letter sanitise
When a precompiled letter is sent to us, we set the `to` field as
'Provided as PDF' in
1c1023a877/app/v2/notifications/post_notifications.py (L100-L104)

This then also sets `normalised_to` as `providedaspdf`.

However, when template preview sanitises the letter, pulls out the
address and gives it to the API, we were only setting `to` to be
the new address and had forgotten to also amend `normalised_to` to
be the normalised version. This meant that for all these letters
we accidentally left `normalised_to` as `providedaspdf`. The impact
of this was that we can not then search for these letters in the
admin user interface as they rely on the `normalised_to` field
containing the recipient address.

This commit fixes that bug by also setting the `normalised_to`
field
2021-10-27 11:56:25 +01:00
Ben Thorner
d703251b13 Merge pull request #3348 from alphagov/better-callback-stats-180016688
Include status in stats about delivery times
2021-10-22 11:59:24 +01:00
Ben Thorner
f974108934 Include status in stats about delivery times
Previously these metrics weren't very useful because they could be
skewed by long timings for failed notifications, which can take up
to 72 hours to deliver. I'm intentionally not trying to have a dual
running period (with the old and new names) because:

- We don't use the current stats for anything (checking Grafana).

- The current stats get turned into a "bucket" metric in Prometheus
[1][2], which isn't very useful because it can only tell us the mean
time to deliver, but we're actually interested in percentiles.

Switching to a new naming is an opportunity to fix the raw data and
the way it's aggregated, using the same kind of "summary" metric that
we now use for stats about our Celery tasks [3].

[1]: c330a8ac8a/paas/statsd/statsd-mapping.yml (L82)
[2]: https://prometheus.io/docs/practices/histograms/#quantiles
[3]: https://github.com/alphagov/notifications-aws/pull/890
2021-10-20 17:22:59 +01:00
Leo Hemsted
0b8c6ef263 Merge pull request #3339 from alphagov/letter-runbook-link
tweak zendesk message for no ack files alert
2021-10-20 15:23:33 +01:00
Pea Tyczynska
1b6f9505da Call publish-govuk-alerts task when alert expires
The `auto-expire-broadcast-messages` task checks for expired broadcasts
at five minute intervals. This change now calls the
`publish-govuk-alerts` task in govuk-alerts if there are expired
broadcasts so that the site is updated.

Co-authored-by: Katie Smith <katie.smith@digital.cabinet-office.gov.uk>
2021-10-18 08:41:25 +01:00
Katie Smith
04bfd6bfdb Trigger task to publish alerts when sending or cancelling alert
When we send or cancel a broadcast message, we now trigger a task
in govuk-alerts repo that polls our API for alerts and
publishes a fresh list of alerts.

Co-authored-by: Pea Tyczynska <pea.tyczynska@digital.cabinet-office.gov.uk>
2021-10-18 08:41:24 +01:00
Leo Hemsted
b8c4e19072 tweak zendesk message for no ack files alert
include a link to a runbook entry.

also the list of acknowledgement files can be very long, so make that
the last thing, and use new lines to space out the message.
2021-10-08 13:45:02 +01:00
Katie Smith
58597653df Update how "sending to TV numbers" Zendesk tickets are created 2021-09-29 11:26:20 +01:00
Katie Smith
0c0c7f4478 Update how "letters still created status" Zendesk tickets are created 2021-09-29 11:23:28 +01:00
Katie Smith
2f66e38fb9 Update how "missing ackfile for letters" Zendesk tickets are created 2021-09-29 11:10:50 +01:00
Katie Smith
64c0a3fb9d Update how 'letters still sending' Zendesk tickets are created
These now use the new Zendesk form.
2021-09-29 11:07:37 +01:00
Katie Smith
b114dadcae Update how pending virus check Zendesk tickets are created
This updates the tickets that are created when the
`check_if_letters_still_pending_virus_check` scheduled task detects
letters in the `pending-virus-check` state.
2021-09-29 11:03:48 +01:00
Katie Smith
9ff0ca0363 Update how live broadcast Zendesk tickets are created
These now use the Notify Form in Zendesk
2021-09-29 10:59:07 +01:00