Commit Graph

1003 Commits

Ben Thorner
779b8e941f Rewrite broadcast Zendesk alert at approval time
The new alert happens earlier but is otherwise the same:

- We only create a ticket in Production.
- We only create a ticket on approval.

I took this opportunity to refactor the alert as a private function
and test it directly and in detail, avoiding the many repetitive
mocks that are required when calling the main "update" function.
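
As a sketch of the shape this now takes (all names here are
illustrative rather than the real ones):

    def _create_zendesk_ticket_if_needed(broadcast_message):
        # illustrative sketch - called only from the approval branch
        # of the main "update" function
        if current_app.config["NOTIFY_ENVIRONMENT"] != "production":
            return  # we only create a ticket in Production
        zendesk_client.send_ticket_to_zendesk(
            _build_broadcast_ticket(broadcast_message)  # hypothetical helper
        )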

One test I haven't preserved was for when the "names" array is empty,
as this was added for a legacy data integrity scenario [^1].

[^1]: bf0bf4e31c
2022-04-05 12:57:08 +01:00
Ben Thorner
3988a6cd07 Include exception info in SMS warning log
This makes it easier to debug failures when adding a new provider.
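
Roughly, the change is just passing exc_info to the logger (a sketch;
the real call site and message differ):

    try:
        send_sms_via_provider(notification)  # hypothetical helper
    except Exception:
        # exc_info=True attaches the current traceback to the log record
        current_app.logger.warning(
            "SMS sending failed, falling back to next provider", exc_info=True
        )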
2022-03-30 13:36:56 +01:00
Ben Thorner
b439fd0718 Add boilerplate for Reach SMS callbacks
This is enough to update a notification in the DB:

1. First create a notification in the UI and send it.

2. Then reset its attributes to pretend it's for Reach.

    update notifications set
      sent_at = null,
      sent_by = null,
      notification_status='sending'
    where id='some-uuid';

3. Change "notification_id" to "<some-uuid>" in the code.

4. Call the boilerplate endpoint for Reach callbacks.

    curl -X POST localhost:6011/notifications/sms/reach

Interestingly there's no foreign key constraint on "sent_by" in the
DB, so this just works: the notification is updated.
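
The boilerplate endpoint itself is roughly this (a sketch with
hypothetical names; the hard-coded id is the one from step 3):

    from flask import Blueprint, jsonify

    reach_callback_blueprint = Blueprint("reach_callback", __name__)

    @reach_callback_blueprint.route("/notifications/sms/reach", methods=["POST"])
    def process_reach_response():
        # boilerplate: hard-code the notification until we parse real callbacks
        update_notification_status_by_id("some-uuid", "delivered")
        return jsonify(result="success"), 200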
2022-03-24 16:56:33 +00:00
Leo Hemsted
2fbe9e85ac Merge pull request #3479 from alphagov/auto-retry-stuck-av-letters
automatically retry letters stuck in pending-virus-scan
2022-03-15 11:43:42 +00:00
Leo Hemsted
9e8df8b623 remove "letters stuck pending av" runbook
there's nothing we know we need to do now that we resolve stuck
letters automatically. Letters could still get into this state, so it's
worth alerting us. However, we don't have any concrete steps we know
will fix these letters, so we should just remove the runbook entirely.
2022-03-10 14:10:01 +00:00
David McDonald
0d952b4d8c Reduce timeout for service callback attempt to 5 seconds
It is currently 60 seconds, but we have had two incidents in the
past week where there was a connection error talking to a service
and the request took up to 60 seconds before failing. When this
happens, a few of these callbacks can completely hog the service
callback worker and build up a big queue of all the other service
callbacks.

5 seconds has been chosen as that is still a pretty decent length
of time for a simple web request that should just be giving the
service a little bit of information to store. It should also be a
sufficient reduction to dramatically reduce this problem for the
moment.

Open to this number being changed in the future based on how we see
it perform.
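
The change itself is roughly just the timeout argument on the callback
request (a sketch, not the real call site):

    import requests

    response = requests.post(
        service_callback_url,
        json=callback_payload,
        headers={"Authorization": f"Bearer {bearer_token}"},
        timeout=5,  # was 60: fail fast so slow services can't hog the worker
    )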
2022-03-08 13:05:32 +00:00
Leo Hemsted
00259893f1 automatically retry letters stuck in pending-virus-scan
Since September 2019 we've had to log on to production around once every
twenty days to restart the virus scan task for a letter. Most of the
time this is just a case of making sure the file is in the scan bucket,
and then triggering the task. If the file isn't in the scan bucket we'd
need to do some more manual investigation to find out exactly where the
file got stuck, but I can only remember times when it's been in the scan
bucket.

So if the file is in the scan bucket, we can just check that with code
and kick the task off automatically.
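
A minimal sketch of that check, with hypothetical helper and task names:

    @notify_celery.task(name="retry-letters-stuck-in-pending-virus-check")
    def retry_letters_stuck_in_pending_virus_check():
        for letter in dao_letters_stuck_in_pending_virus_check():  # hypothetical dao
            filename = get_letter_pdf_filename(letter)  # hypothetical helper
            if file_exists_in_scan_bucket(filename):
                # the file made it to the bucket, so just re-trigger the scan
                notify_celery.send_task(
                    name="scan-file", kwargs={"filename": filename}, queue="antivirus-tasks"
                )
            else:
                current_app.logger.error(
                    "Letter %s is not in the scan bucket - investigate manually", letter.id
                )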
2022-03-07 18:31:46 +00:00
Katie Smith
514bd48614 Update flake8-bugbear from 20.11.1 to 22.1.11
And ignore a warning, since I did not think that in this case "Using
.strip() with multi-character strings is misleading the reader".
2022-03-02 16:51:09 +00:00
Ben Thorner
a69d1635a1 Update FactStatus table in bulk for each service
Previously we were looping over data from the Notifications/History
table and then shovelling it into the status table, one row at a time
- plus an extra delete to clean up any existing data.

This replaces that with a batch insertion, similar to how we archive
notifications [1], but using a simple subquery (via "from_select" [2])
instead of a temporary table.

To make the select compatible with the insert, I've used "literal"
to inject the constant pieces of data, so each row has everything it
needs to go into the status table.

[1]: 9ce6d2fe92/app/dao/notifications_dao.py (L295)
[2]: https://docs.sqlalchemy.org/en/14/core/dml.html#sqlalchemy.sql.expression.Insert.from_select
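
The shape of the query, sketched with illustrative column and model names:

    from sqlalchemy import func, insert, literal, select

    stmt = insert(FactNotificationStatus.__table__).from_select(
        ["bst_date", "service_id", "notification_type",
         "notification_status", "notification_count"],
        select(
            literal(process_day),        # constant columns injected per row
            Notification.service_id,
            literal(notification_type),
            Notification.status,
            func.count().label("notification_count"),
        )
        .where(Notification.created_at.between(start_of_day, end_of_day))
        .group_by(Notification.service_id, Notification.status),
    )
    db.session.execute(stmt)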
2022-02-16 13:40:05 +00:00
Ben Thorner
ef231d5de7 Fix task name and action in status task logs 2022-02-16 11:45:45 +00:00
Ben Thorner
7f4b140f97 Rename function to make it consistent
This is consistent with the new "on_date" function. It was going
off the edge of my screen before in some parts of the code.
2022-02-09 17:39:08 +00:00
Ben Thorner
1213463b8e Only aggregate status when necessary for a service
This takes a similar approach to the nightly deletion task so that
we only create sub-tasks when there are actually notifications to
aggregate for a given type and day [1].

We're making this change to stop the duplication errors we're getting
at the moment and ensure the task can scale to more messages and more
services. There are two parts to this:

- Each subtask should now run within the 5-minute visibility timeout.
However, they may still be duplicated if the parent task overruns [2].

- The parent task creates a minimal number of subtasks, and the query
to determine this is very fast for a normal process day (milliseconds).

Since all tasks will run quickly, there should be no more duplication.

In order to test this more nuanced task, I rewrote the tests:

- One test checks the subtask is called correctly.
- One test checks we create all the right subtasks.

[1]: https://github.com/alphagov/notifications-api/pull/3381
[2]: https://docs.google.com/document/d/1MaP6Nyy3nJKkuh_4lP1wuDm19X8LZITOLRd9n3Ax-xg/edit#heading=h.q3intzwqhfzl
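
A sketch of the parent/subtask split, with hypothetical names:

    from sqlalchemy import func

    @notify_celery.task(name="create-nightly-notification-status")
    def create_nightly_notification_status():
        # one cheap query finds the (service, type) combinations that have
        # data for each day, then we fan out one subtask per combination
        for process_day in days_to_aggregate():  # hypothetical helper
            rows = (
                db.session.query(Notification.service_id, Notification.notification_type)
                .filter(func.date(Notification.created_at) == process_day)
                .distinct()
                .all()
            )
            for service_id, notification_type in rows:
                create_nightly_notification_status_for_service_and_day.apply_async(
                    kwargs={
                        "service_id": str(service_id),
                        "process_day": process_day.isoformat(),
                        "notification_type": notification_type,
                    },
                    queue=QueueNames.REPORTING,
                )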
2022-02-09 17:39:07 +00:00
Ben Thorner
c8db58d0e8 Reorder loops for creation status agg sub tasks
This will make it easier to tailor the innermost loop, which is now
over services.
2022-02-09 17:39:06 +00:00
Ben Thorner
d6678b6a70 Remove unnecessary logs from status aggregation
These can be inferred elsewhere:

- Task creation is obvious from task execution. If we're concerned
about a specific service, we can check the updated times on the DB
records, since all records are recreated each time this runs.

- Task starting is already logged.

- Task completion is already logged. The number of rows updated can
also be inferred from the DB.

The log I've found useful is the one about fetching the data, and
I've also added another to time how long it takes to insert the data,
as both could be sources of poor performance.

Arguably we should use metrics for this sort of thing, but logs are
easier in practice for the metric systems we have.
2022-02-09 17:39:05 +00:00
Ben Thorner
018a253b6f Revert "Revert running status aggregation in parallel"
This reverts commit 0f6dea0deb.
2022-02-09 17:39:00 +00:00
Chris Hill-Scott
7f72d3a60f Bump utils to 53.0.0
Changes:

53.0.0
---

* `notifications_utils.columns.Columns` has moved to
  `notifications_utils.insensitive_dict.InsensitiveDict`
* `notifications_utils.columns.Rows` has moved to
  `notifications_utils.recipients.Rows`
* `notifications_utils.columns.Cell` has moved to
  `notifications_utils.recipients.Cell`

52.0.0
---

* Deprecate the following unused `redis_client` functions:
  - `redis_client.increment_hash_value`
  - `redis_client.decrement_hash_value`
  - `redis_client.get_all_from_hash`
  - `redis_client.set_hash_and_expire`
  - `redis_client.expire`

51.3.1
---

* Bump govuk-bank-holidays to cache holidays for next year.
2022-02-08 09:45:10 +00:00
Rebecca Law
09c8fbe982 Merge pull request #3418 from alphagov/letters-too-long
Mark letters as validation-failed if the templated letter is too long.
2022-02-02 08:30:50 +00:00
Rebecca Law
c01c81326c Update log message to something a little easier to read and query for. 2022-01-24 12:25:53 +00:00
Leo Hemsted
246016a894 don't log if we don't delete anything for a service
we try and delete for lots of services. this includes services that
don't actually have anything to delete that day. that might be because
they had a custom data retention so we always go to check them, or
because they only sent test notifications (which we'll delete but not
include in the count in the log line). we don't really need to see log
lines saying that we didn't delete anything for that service - that's
just a long list of boring log messages that will hide the actual
interesting stuff - which services we did delete content for.
2022-01-21 11:04:37 +00:00
Ben Thorner
0f6dea0deb Revert running status aggregation in parallel
The top-level task didn't run successfully after this was deployed
because the worker was killed due to heavy disk usage. While the
more parallel version does log much more, that doesn't totally explain
the disk behaviour. Nonetheless, reverting is sensible to give us
the time we need to investigate more.
2022-01-20 12:22:33 +00:00
Rebecca Law
6cd7a23d3c If there is an invalid letter that has not been updated to validation-failed - because the update-validation-failed-for-templated-letter task has not been picked up off the letter-tasks queue - and the collate-letter-pdfs-to-be-sent task has started:
1. The number of letters that we send to DVLA will not be correct (see 20ead82463/app/celery/letters_pdf_tasks.py (L136)).
This may raise an alert with DVLA when they find we have sent them fewer letters than we have reported.
2. When we get the PDF from S3 we will get a file-not-found error (20ead82463/app/celery/letters_pdf_tasks.py (L244)).
The error will not prevent the collate task from completing, but we will see an alert email for the exception and it will raise questions.

Although this situation is very unlikely - we have a 15-minute window between the last letter deadline and the time we kick off the collate task - we should still mitigate these issues. I updated the queries to only return letters with billable_units > 0; all valid letters should have at least 1 billable unit.
2022-01-19 08:31:19 +00:00
Rebecca Law
841a4fc22f Mark letters as validation-failed if the templated letter is too long.
It is possible for the personalisation in a templated letter to make the letter exceed 10 pages or 5 sheets. We do not validate letters posted via the API for this error: the letter can only be validated once we create the PDF in notifications-template-preview. This means the letter can only get a validation-failed status after the client has received a 201 from the POST to /v2/notifications.
NOTE: we only validate the preview row of a CSV for this validation error, so this change means it is possible for a letter to be marked as validation-failed after a successful file upload.

A new task to update the notification to `validation-failed` has been added to the API. If we find that the letter is too long once we have created the PDF we call the `update-validation-failed-for-templated-letter` task rather than `update-billable-units-for-letter` task.

New workflow for a letter, in brief:
API - receives POST /v2/notifications
:: save to db
:: put CREATE_LETTERS_PDF task on queue for template preview to consume
TEMPLATE-PREVIEW - consumes task CREATE_LETTERS_PDF
:: create PDF
:: count pages of PDF
:: IF page count exceeds 10 pages
     put PDF in the letters-invalid-pdf S3 bucket with metadata (similar to the precompiled letters)
     put `update-validation-failed-for-templated-letter` task on the queue for the API to consume
   ELSE
     put PDF in the `letters-pdf` bucket
     put `update-billable-units-for-letter` task on the queue
API - consumes `update-billable-units-for-letter` OR `update-validation-failed-for-templated-letter` task
:: IF `update-billable-units-for-letter` task:
     update billable units for notification as usual
:: ELSE `update-validation-failed-for-templated-letter`:
     update notification_status = `validation-failed`
ADMIN - view notification page for letter
:: show validation error for the templated letter
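
On the API side, the new task is roughly this (a sketch; the dao
helpers are hypothetical):

    @notify_celery.task(name="update-validation-failed-for-templated-letter")
    def update_validation_failed_for_templated_letter(notification_id, page_count):
        # mark the letter as failed once template preview reports the
        # rendered PDF is too long
        notification = dao_get_notification_by_id(notification_id)
        notification.status = NOTIFICATION_VALIDATION_FAILED
        dao_update_notification(notification)
        current_app.logger.info(
            "Letter %s marked validation-failed: %s pages", notification_id, page_count
        )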

There will be 3 PRs in order to make this change, one for the API, template-preview and the admin app.

Deployment plan

Deploy Admin first
Deploy API
Deploy template-preview

Related PRs:
alphagov/notifications-template-preview#619
alphagov/notifications-admin#4107

https://www.pivotaltracker.com/story/show/169209742
2022-01-19 08:29:48 +00:00
Ben Thorner
9686595fa8 Minor tweaks to address comments on the PR
To address:

- https://github.com/alphagov/notifications-api/pull/3425#discussion_r786867994
- https://github.com/alphagov/notifications-api/pull/3425#discussion_r786853329
- https://github.com/alphagov/notifications-api/pull/3425#discussion_r786848793
- https://github.com/alphagov/notifications-api/pull/3425#discussion_r786214794
2022-01-18 16:56:53 +00:00
Ben Thorner
9182ebf4e5 Parallelise status aggregation by service and day
This follows a similar approach to [1]. Recently we've seen lots
of errors from this task, which we think are a consequence of it
doing too much work and tripping Celery's visibility timeout.

While we can optimise the query [2], it's likely the errors will
return as the number of live services grows. Parallelising the
aggregation now will make it more futureproof.

[1]: https://github.com/alphagov/notifications-api/pull/3397
[2]: https://github.com/alphagov/notifications-api/pull/3417
2022-01-12 15:47:59 +00:00
Ben Thorner
d772ae6b46 Standardise logs for status aggregation tasks
This will make it easier to parallelise by service later on.
2022-01-12 15:47:57 +00:00
Ben Thorner
4feed950c4 DRY-up loops to kick off status aggregation tasks
This will make it easier to parallelise by service in the following
commits, since we only have one loop to change.
2022-01-12 15:47:56 +00:00
Ben Thorner
ddbf556486 Rewrite task to aggregate status by service
This is a step towards parallelising the task by service and day.
2022-01-12 15:47:53 +00:00
Ben Thorner
a7b39a930c Add comment to explain status aggregation approach
This relates to the performance optimisation work we're doing [1].
Before optimising the task, it's worth asking if we can do less -
the comment explains why it has to be this way.

Some references to back up the comment:

- We do status updates in either table [2].
- We don't allow duplicate receipts for emails [3].
- We don't allow duplicate receipts for SMS [4].
- We don't expect duplicate receipts for letters.

This is something we would need to revisit if we want to support
additional status updates - we could reject based on the age of the
notification, rather than the status.

[1]: https://github.com/alphagov/notifications-api/pull/3417
[2]: 20ead82463/app/dao/notifications_dao.py (L538)
[3]: 20ead82463/app/celery/process_ses_receipts_tasks.py (L58)
[4]: 20ead82463/app/dao/notifications_dao.py (L129-L135)
2022-01-10 18:15:54 +00:00
Leo Hemsted
228d72dc8f update log messages in delete task.
less prose, clearer output. (hopefully)
2021-12-14 15:24:35 +00:00
Leo Hemsted
49cc1b643f split delete task up into per service
we really don't gain anything by running each service delete in sequence
- we get the services, and then just loop through them deleting per
service. By deleting per service in separate tasks, we can take
advantage of parallelism. the only thing we lose is some log lines but I
don't think we're that interested in them.

the query limit is now only set in the move_notifications dao function -
the task doesn't really care about the technical implementation of how
the notifications get deleted
2021-12-14 15:24:34 +00:00
Ben Thorner
c8cf057eba Record providers we time out notifications for
This will help us monitor issues with delivery receipts and keep
track of provider performance over time.

I'm not concerned about performance here:

- The number of notifications to time out is usually small.
- This task only runs once a day.
- Calls to StatsD are quick and cheap.
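
The recording itself is just one counter per timed-out notification
(a sketch; the metric name is illustrative):

    for notification in timed_out_notifications:
        # e.g. "timeout-sending.mmg" - sent_by records the provider
        statsd_client.incr(f"timeout-sending.{notification.sent_by}")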
2021-12-14 13:04:39 +00:00
Ben Thorner
87cd40d00a Scale timeout task to work on arbitrary volumes
Previously this was limited to 500K notifications. While we don't
expect to reach this limit, it's not impossible e.g. if we had a
repeat of the incident where one of our providers stopped sending
us status updates. Although that's not great, it's worse if our
code can't cope with the unexpectedly high volume.

This reuses the technique we have elsewhere [1] to keep processing
in batches until there's nothing left. Specifying a cutoff point
means the total amount of work to do can't keep growing.

[1]: 2fb432adaf/app/dao/notifications_dao.py (L441)
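
The looping shape, sketched (batch size and period are illustrative):

    from datetime import datetime, timedelta

    BATCH_SIZE = 5000

    def timeout_notifications():
        # fix the cutoff once, so later batches can't pick up "new" timeouts
        cutoff_time = datetime.utcnow() - timedelta(days=3)
        while True:
            updated_count = dao_timeout_notifications(cutoff_time, limit=BATCH_SIZE)  # hypothetical signature
            if updated_count < BATCH_SIZE:
                break  # a non-full batch means there's nothing left to do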
2021-12-13 17:14:28 +00:00
Ben Thorner
76aeab24ce Rewrite DAO timeout method to take cutoff_time
Previously we specified the period and calculated the cutoff time
in the function. Passing it in means we can run the method multiple
times and avoid getting "new" notifications to time out in the time
it takes to process each batch.
2021-12-13 16:56:21 +00:00
Ben Thorner
ab4cb029df Remove alert for email / sms in created
In response to [1].

[1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988

It turns out the code that inspired this new alert - in the old
"timeout-sending-notifications" task - was actually redundant as
we already have a task to "replay" notifications still in "created",
which is much better than just alerting about them.

It's possible the replayed notifications will also fail, but in
both cases we should see some kind of error due to this, so I don't
think we're losing anything by not having an alert.
2021-12-06 14:11:42 +00:00
Ben Thorner
04da017558 DRY-up conditionally creating callback tasks
This removes 3 duplicate instances of the same code, which is still
tested implicitly via test_process_ses_receipt_tasks [1]. In the
next commit we'll make this test more explicit, to reflect that it's
now being reused elsewhere and shouldn't change arbitrarily.

We do lose the "print" statement from the command instance of the
code, but I think that's a very tolerable loss.

[1]: 16ec8ccb8a/tests/app/celery/test_process_ses_receipts_tasks.py (L94)
2021-12-06 14:11:34 +00:00
Ben Thorner
0318229216 Stop 'timing out' old 'created' notifications
This is being replaced with a new alert and runbook [1]. It's not
always appropriate to change the status to 'technical-failure', and
the new alert means we'll act to fix the underlying issue promptly.

We'll look at tidying up the remaining code in the next commits.

[1]: https://github.com/alphagov/notifications-manuals/wiki/Support-Runbook#deal-with-email-or-sms-still-in-created
2021-12-06 14:00:36 +00:00
Ben Thorner
f96ba5361a Add new task to alert about created email / sms
This will log an error when email or SMS notifications have been
stuck in 'created' for too long - normally they should be 'sending'
in seconds, noting that we have a goal of < 10s wait time for most
notifications processed on our platform.

In the next commits we'll decouple similar functionality from the
existing 'timeout-sending-notifications' task.
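
A minimal sketch of the new task (names and threshold are illustrative):

    from datetime import datetime, timedelta

    @notify_celery.task(name="check-for-notifications-stuck-in-created")
    def check_for_notifications_stuck_in_created():
        threshold = datetime.utcnow() - timedelta(minutes=10)  # illustrative
        stuck = dao_find_notifications_in_created(created_before=threshold)  # hypothetical dao
        if stuck:
            current_app.logger.error(
                "%s email/sms notifications stuck in created - see runbook", len(stuck)
            )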
2021-12-06 14:00:31 +00:00
Leo Hemsted
f6d210f1e6 put delete tasks on the reporting worker
they share a lot with the reporting tasks (creating ft_billing and
ft_notification_status), in that they're run nightly, take a long time,
and we see error messages if they get run multiple times (due to
visibility timeout).

The periodic app has two concurrent processes - previously there was
just one delete task, which would use one of those processes, while the
other process would pick up anything else on the queue (at that time of
night, the regular provider switch checks and scheduled job checks).
However, when we switched to running the three delete notification types
separately, we saw visibility timeout issues - three tasks would be
created, all three would be picked up by one celery instance, the two
worker processes would start on two of them, and the third would sit on
the box, wait longer than the visibility timeout to be picked up (and
acknowledged), and so SQS would assume the task was lost and replay it.

it's queues all the way down!

By putting them on the reporting worker we can take advantage of tuning
that app (for example setting the prefetch multiplier to one) which is
designed to run large tasks. We've also got more concurrent workers on
this box, so we can run all three tasks at once.
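
For reference, the relevant tuning knob is roughly this Celery setting:

    # with a prefetch multiplier of 1, each worker process reserves only the
    # task it is currently running, so a queued task isn't held unacknowledged
    # on a busy box past the visibility timeout
    worker_prefetch_multiplier = 1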
2021-12-03 13:28:16 +00:00
Leo Hemsted
6bbec9f103 make delete notification tasks parallel by notification type
we used to do this until April 2020. Let's try doing it again.
Back then, we had problems with timing. We did two things in spring
2020:

- We moved to using an intermediary temp table [1]
- We stopped the tasks being parallelised [2]

However, it turned out the real time saving came from changing which
services we delete for [3]. The task was actually CPU-bound rather than
DB-bound, which is probably why running the tasks in parallel wasn't
helping: they were all competing for the same CPU. It's worth trying
the parallel steps again now that we're no longer CPU-bound.

Note: Temporary tables are in their own postgres schema, and are only
viewable by the current session (session == connection. Each celery
worker process has its own db connection). We don't need to worry about
separate workers both trying to use the same table at once.

I've also added a "DROP ON COMMIT" directive to the table definition
just to ensure it doesn't persist past the task even if there's an
exception. (This also drops on rollback).

Cronitor looks at the three functions separately so we don't need to worry
about the main task taking milliseconds where it used to take hours as
it isn't monitored itself.

I've also removed some redundant exception logs.

[1] https://github.com/alphagov/notifications-api/pull/2767
[2] https://github.com/alphagov/notifications-api/pull/2798
[3] https://github.com/alphagov/notifications-api/pull/3381
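
In SQL terms the temp table is created roughly like this (run from
Python; the table name and filter are illustrative):

    from sqlalchemy import text

    db.session.execute(
        text("""
            CREATE TEMPORARY TABLE tmp_notifications ON COMMIT DROP AS
            SELECT id FROM notifications
            WHERE service_id = :service_id AND created_at < :delete_before
            LIMIT :qry_limit
        """),
        {"service_id": str(service_id), "delete_before": delete_before, "qry_limit": 50000},
    )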
2021-12-01 14:28:08 +00:00
Ben Thorner
bffca39223 Merge pull request #3373 from alphagov/centralise-celery-180213914
Use central NotifyCelery base class in utils
2021-11-17 12:09:28 +00:00
Ben Thorner
666ac1ab4f Log activity on all periodic Celery tasks
As stated in the comment, this would have been helpful during an
incident to give further reassurance that a task had at least
started running - at the time the only evidence for this was the
Cronitor dashboard itself, which we don't often look at.

I've removed other, equivalent "starting" logs, but kept those
that provide additional information in the log message.
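
The pattern, sketched as a base-class hook (hypothetical - the real
change may be implemented differently):

    from celery import Task
    from flask import current_app

    class LoggingTask(Task):
        def __call__(self, *args, **kwargs):
            # log as soon as a worker picks the task up, so we can tell a
            # task that never started from one that started and then hung
            current_app.logger.info("Celery task %s started", self.name)
            return super().__call__(*args, **kwargs)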
2021-11-17 09:48:03 +00:00
Ben Thorner
528223ed61 Use central NotifyCelery base class in utils
Note that the new base class doesn't include a bespoke feature we
had here: 'log_on_worker_shutdown'. We've agreed it's reasonable
to remove it for now as it was introduced many years ago and its
use case is unclear - we can always add it back if needed.
2021-11-16 13:58:12 +00:00
Chris Hill-Scott
0236318189 Republish gov.uk/alerts every night to clear down planned tests
We have made it so that gov.uk/alerts shows a ‘1 planned test’ banner
for the whole of the day when there has been an operator test on that
day.

We need to remove the banner when the day is over.

The most straightforward way to do this is to republish the site at the
start of every day. The gov.uk/alerts code[1] will work out if there are
or aren’t any planned tests to show that day.

1. 5a274af6d0/app/models/alerts.py (L38-L44)
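
Scheduling-wise this is just a new celery beat entry, roughly (task
name and timing are illustrative):

    from celery.schedules import crontab

    CELERYBEAT_SCHEDULE = {
        "trigger-govuk-alerts-publish": {
            "task": "trigger-govuk-alerts-publish",
            "schedule": crontab(hour=0, minute=0),  # the start of every day
        },
    }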
2021-11-15 14:23:32 +00:00
Ben Thorner
48e1482d90 Merge pull request #3366 from alphagov/celery-extend-request-id-180213914
Extend request tracing to cover Celery logs
2021-11-12 11:10:38 +00:00
Ben Thorner
d66c68d6d6 Merge pull request #3364 from alphagov/celery-headers-request-id-180213914
Move Celery task Request ID injection into headers
2021-11-12 11:10:29 +00:00
Ben Thorner
ac06529128 Enable request tracing on Celery success/fail logs
Previously these logs wouldn't have a Request ID attached since the
Celery hooks run after the __call__ method where we enable request
tracing for normal application logs. For the failure log especially
it will be useful to have this feature.
2021-11-10 18:04:20 +00:00
Ben Thorner
369a9f7521 Refactor queue_name and request_id into properties
This reduces the complexity of the original functions, which will
go up a bit in the next commit.
2021-11-10 18:04:19 +00:00
Ben Thorner
89a8dd1a03 Move Celery task Request ID injection into headers
Previously we passed along this piece of state via the kwargs for
a task, but this runs the risk of the task accidentally receiving
the extra kwarg unless we've covered all the code paths that could
invoke it directly e.g. retries don't invoke __call__.

This switches to using Celery "headers" to pass the extra state. It
turns out that Celery has two "header" concepts, which leads to
some confusion and even a bug in the framework [1]:

- In older (pre v4.4) versions of Celery, the "headers" specified
by apply_async() would become _the_ headers in the message that
gets passed around workers, etc. These would be available later on
via "self.request.headers".

- Since Celery protocol v2, the meaning of "headers" in the message
changed to become (basically) _all_ metadata about the task [2],
with the "headers" option in apply_async() being merged [3] into
the big dict of metadata.

This makes using headers a bit confusing unfortunately, since the
data structure we put in is subtly different to what comes out in
the request context. Nonetheless, it still works. I've added some
comments to try and clarify it.

Note that one of the original tests is no longer necessary, since we
don't need to worry about argument passing styles with headers.

[1]: https://github.com/celery/celery/issues/4875
[2]: 663e4d3a0b (diff-07a65448b2db3252a9711766beec23372715cd7597c3e309bf53859eabc0107fR343)
[3]: 681a922220/celery/app/amqp.py (L495)
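
Both halves of the change, sketched (the header key is illustrative):

    # sending side: attach the id as a Celery header
    some_task.apply_async(kwargs=task_kwargs, headers={"notify_request_id": request_id})

    # receiving side: with protocol v2 the custom header has been merged
    # into the big dict of task metadata, so it comes back via the
    # request context rather than via self.request.headers
    class NotifyTask(Task):
        @property
        def request_id(self):
            return self.request.get("notify_request_id")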
2021-11-10 18:03:40 +00:00
Katie Smith
3d4796c924 Add task to resanitise and replace a PDF for precompiled letter
This adds a task which is designed to be used if we want to recreate the
PDF for a precompiled letter (either one that has been created using the
API or one that has been uploaded through the website).

The task takes the `notification_id` of the letter and passes template
preview the details it needs in order to sanitise the original file and
then replace the version in the letters-pdf bucket with the freshly
sanitised version.
2021-11-10 09:51:31 +00:00
Ben Thorner
cdb43fbaf6 Only loop timeout task if there's more work
Previously this would repeat the task even if the current iteration of
the loop had processed a non-full batch. This could cause the task
to error incorrectly if one or two notifications breach the timeout
threshold in between iterations.
2021-11-09 15:41:14 +00:00