Commit Graph

893 Commits

Katie Smith
3bfd084a77 Simplify send_delivery_status_to_service code
Now that https://github.com/alphagov/notifications-api/pull/3184 has
been deployed for a while, the `send_delivery_status_to_service` task will
always be passed `template_id` and `template_version`. This means we no
longer need to check whether those fields are present.
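
A minimal sketch of the simplified shape (the helper name `build_callback_data` is hypothetical; only the field names come from this change):

```python
def build_callback_data(status_update: dict) -> dict:
    # Every task is now enqueued with these fields, so read them
    # directly rather than checking for their presence first.
    return {
        "template_id": status_update["template_id"],
        "template_version": status_update["template_version"],
    }
```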
2021-03-30 08:50:42 +01:00
David McDonald
6d410daae4 Remove the emergency alerts canary
See https://github.com/alphagov/notifications-broadcasts-infra/pull/197
for why we no longer need this and we get to delete some code!
2021-03-26 18:31:53 +00:00
Pea Tyczynska
52c529ab3a Use personalisation to set client_reference for letters
which were sent through the Notify interface only. This is done
to avoid a performance dip from the additional operation for
other notification types.
2021-03-24 14:55:10 +00:00
Ben Thorner
b2b14f39a3 Merge pull request #3183 from alphagov/remove-crown-letter-filename
Remove non/crown indicator in letter filenames
2021-03-24 13:06:58 +00:00
Katie Smith
27b3cece7d Send template id and version with delivery status callback
This adds the `template_id` and `template_version` fields to the data
sent to services from the `send_delivery_status_to_service` task.

We need to account for the task not being passed these fields at first
since there might be tasks retrying which don't have that data. Once all
tasks have been called with the new fields we can then update the code
to assume they are always there.

Since we only send delivery status callbacks for SMS and emails, I've
removed the tests where we call that task with letters.
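
A sketch of the transitional handling (the helper name is hypothetical): tasks queued before the deploy won't carry the new fields, so they are read defensively:

```python
def build_callback_data(status_update: dict) -> dict:
    # Old in-flight retries may predate the new fields: .get() returns
    # None for them, until every task with the old signature has drained.
    return {
        "template_id": status_update.get("template_id"),
        "template_version": status_update.get("template_version"),
    }
```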
2021-03-24 10:55:45 +00:00
Ben Thorner
8219b3c032 Remove non/crown indicator in letter filenames
DVLA does not require this and, since [1], we no longer care about
the end of letter filenames when collating them, so it is safe to
remove. Note that the name of the ZIP files of collated letters
is based on a hash of the filenames, which needed updating in tests.

Before merging this we need to do a test run in Staging, so DVLA can
check that a mixture of the old / new filenames won't cause issues.

[1]: https://github.com/alphagov/notifications-api/pull/3172
2021-03-18 13:05:12 +00:00
Katie Smith
3b78f863d5 Check for incomplete pending jobs
We have a scheduled task that was checking for jobs still in progress.
We saw a case where a scheduled job was stuck in a `pending` status as a
result of an app shutting down. This changes the `check_job_status` task
so that it also checks for scheduled jobs which are still pending after
30 minutes.
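
A sketch of the extra check, assuming jobs carry a status and a scheduled time (the 30-minute threshold is from the commit; the data shape is illustrative):

```python
from datetime import datetime, timedelta

def find_stuck_scheduled_jobs(jobs: list) -> list:
    # A scheduled job should leave 'pending' soon after its scheduled
    # time; 30 minutes later we assume the app shut down mid-flight.
    cutoff = datetime.utcnow() - timedelta(minutes=30)
    return [
        job for job in jobs
        if job["job_status"] == "pending" and job["scheduled_for"] < cutoff
    ]
```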
2021-03-18 08:24:36 +00:00
Ben Thorner
c76e789f1e Reduce extra S3 ops when working with letter PDFs
Previously we did some unnecessary work:

- Collate task. This had one S3 request to get a summary of the object,
which was then used in another request to get the full object. We only
need the size of the object, which is included in the summary [1].

- Archive task. This had one S3 request to get a summary of the object,
which was then used to make another request to delete it. We still need
both requests, but we can remove the S3.Object in the middle.

[1]: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#objectsummary
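
A sketch of the collate-task half of this with boto3 (bucket name and prefix are placeholders):

```python
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("letters-pdf")  # placeholder bucket name

for summary in bucket.objects.filter(Prefix="2021-03-16/"):
    # Before: summary.Object().get() made a second request just to read
    # the size. After: the size is already on the ObjectSummary.
    print(summary.key, summary.size)
```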
2021-03-16 12:53:13 +00:00
Leo Hemsted
6784ae62a6 Raise Exception if letter PDF not in S3
Previously, the function would just return a presumed filename. Now that
it actually checks S3, if the file doesn't exist it'll raise an
exception. By default that's a StopIteration at the end of the bucket
iterator, which isn't ideal, as this will get suppressed if the function
is called within a generator loop further up or anything.

There are a couple of places where we expect the file may not exist, so
we define a custom exception to rescue specifically here. I did consider
subclassing boto's ClientError, but this wasn't straightforward as the
constructor expects to know the operation that failed, which for me is a
signal that it's not an appropriate (re-)use of the class.
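
A sketch of the idea (the exception and function names here are illustrative, not the app's exact ones):

```python
class LetterPDFNotFound(Exception):
    """No object in the bucket matched the letter's filename prefix."""

def find_letter_pdf_key(keys, prefix: str) -> str:
    try:
        return next(key for key in keys if key.startswith(prefix))
    except StopIteration:
        # Re-raise as a domain exception: a bare StopIteration would be
        # silently swallowed if this ran inside another generator loop.
        raise LetterPDFNotFound(f"no letter PDF for prefix {prefix}")
```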
2021-03-15 17:18:11 +00:00
Ben Thorner
b43a367d5f Relax lookup of letter PDFs in S3 buckets
Previously we generated the filename we expected a letter PDF to be
stored at in S3, and used that to retrieve it. However, the generated
filename can change over the course of a notification's lifetime e.g.
if the service changes from crown ('.C.') to non-crown ('.N.').

The prefix of the filename is stable: it's based on properties of the
notification - reference and creation - that don't change. This commit
changes the way we interact with letter PDFs in S3:

- Uploading uses the original method to generate the full file name.
The method is renamed to 'generate_' to distinguish it from the new one.

- Downloading uses a new 'find_' method to get the filename using just
its prefix, which makes it agnostic to changes in the filename suffix.

Making this change helps to decouple our code from the requirements DVLA
have on the filenames. While it means more traffic to S3, we rely on S3
in any case to download the files. From experience, we know S3 is highly
reliable and performant, so don't anticipate any issues.

In the tests we favour using moto to mock S3, so that the behaviour is
realistic. There are a couple of places where we just mock the method,
since what it returns isn't important for the test.

Note that, since the new method requires a notification object, we need
to change a query in one place, the columns of which were only selected
to appease the original method to generate a filename.
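
A sketch of the generate/find split (the 'generate_'/'find_' naming is from the commit, but the exact filename format and signatures here are assumptions):

```python
import boto3

def generate_letter_pdf_filename(reference: str, created_at, crown: bool) -> str:
    # Upload path: build the full filename, including suffix parts
    # (like the crown marker) that can change during the letter's life.
    crown_marker = "C" if crown else "N"
    return f"{created_at:%Y-%m-%d}/NOTIFY.{reference}.{crown_marker}.pdf"

def find_letter_pdf_filename(bucket_name: str, reference: str, created_at) -> str:
    # Download path: match on the stable prefix only, so later renames
    # of the suffix (e.g. crown to non-crown) don't break retrieval.
    prefix = f"{created_at:%Y-%m-%d}/NOTIFY.{reference}"
    bucket = boto3.resource("s3").Bucket(bucket_name)
    return next(summary.key for summary in bucket.objects.filter(Prefix=prefix))
```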
2021-03-15 13:55:44 +00:00
David McDonald
41d95378ea Remove everything for the performance platform
We will no longer send them any stats, and therefore don't need:
- the code to work out the nightly stats
- the performance platform client
- any configuration for the client
- any nightly tasks that kick off the sending of the stats

This will require a change in cronitor: since this task will no longer
run, we need to delete the cronitor check.
2021-03-15 12:04:53 +00:00
David McDonald
8325431462 Move saving of processing time into separate task
We currently do this as part of send-daily-performance-platform-stats,
but this moves it into its own separate task. This is for two reasons:
- we will shortly get rid of the send-daily-performance-platform-stats
  task, as we will no longer need to send anything to the performance
  platform
- even if we did decide to keep the
  send-daily-performance-platform-stats task and remove the specific
  bits that relate to the performance platform, it's probably nicer to
  rewrite the new task from scratch to make sure it's all clear and easy
  to understand
2021-03-15 11:44:01 +00:00
Ben Thorner
0379d721e5 Add missing statsd timers to celery tasks
All other tasks in app/celery/*_tasks.py have timers on them. Some
of these timers will be useful to check before/after performance as
a way to reassure ourselves about the impact of [1].

[1]: https://github.com/alphagov/notifications-api/pull/3172
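
A sketch of the kind of timer these tasks get (the decorator shape is assumed; the real client reports to statsd rather than printing):

```python
import time
from functools import wraps

def statsd_timer(metric_name: str):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = (time.monotonic() - start) * 1000
                # Stand-in for statsd_client.timing(metric_name, ...)
                print(f"{metric_name}: {elapsed_ms:.1f}ms")
        return wrapper
    return decorator
```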
2021-03-12 12:32:22 +00:00
Ben Thorner
a91fde2fda Run auto-correct on app/ and tests/ 2021-03-12 11:45:45 +00:00
Rebecca Law
acfb759cb9 Change DVLA_EMAIL_ADDRESS to a list 2021-02-26 11:21:16 +00:00
David McDonald
82e5a1804b Merge pull request #3155 from alphagov/migrate-broadcast-settings
Backfill services_broadcast_settings table
2021-02-25 12:16:36 +00:00
Pea Tyczynska
4fc3af9811 Add date to personalisation for DVLA email
The personalisation was missing the date attribute. The email still got
sent tonight; it just didn't have a value for the date placeholder.
2021-02-24 10:22:22 +00:00
David McDonald
3ea86bfb48 Remove hardcoded default to use test channel
There is no need for a default now, as every broadcast service has the
broadcast channel to use set on it.
2021-02-23 17:15:07 +00:00
Pea Tyczynska
e0c73ac342 Send daily email with letter and sheet volumes to DVLA 2021-02-23 15:13:19 +00:00
Pea Tyczynska
6dab63130d Make import order alphabetical 2021-02-23 15:13:19 +00:00
Leo Hemsted
0088bcd98b only retry if the broadcast message task is in sending
previously we would retry even if the task was queued up for retry but
the status was "received-ack" or "received-err". We don't expect a task
to be retried after getting one of these statuses, but it could happen
if there are duplicate tasks. Let's plan for the worst by saying
"only process a retry if the task is currently in sending".

this way, if a duplicate task is on retry and the first task goes
through successfully, the duplicate task will give up.
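
A sketch of the guard (status values from the commit; the function shape is assumed):

```python
def should_process_retry(current_status: str) -> bool:
    # A duplicate retry can arrive after the original task already
    # finished ("received-ack"/"received-err"); only a message still
    # in "sending" is worth retrying.
    return current_status == "sending"
```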
2021-02-18 12:03:36 +00:00
Ben Thorner
474b93f183 Remove redundant (renamed) letters task
This was renamed in [1], and enough time has elapsed that instances
of the task should all have finished processing.

[1]: 5d6f2da155
2021-02-17 12:57:50 +00:00
Katie Smith
6b8ebb3421 Fix linting errors 2021-02-16 09:03:38 +00:00
Leo Hemsted
fed0d4c40e Merge pull request #3137 from alphagov/revert-revert-revert
Bring back retry logic
2021-02-15 12:21:13 +00:00
David McDonald
a1e539e785 Merge pull request #3132 from alphagov/created-letters-runbook
Improvements to our letter checking tasks
2021-02-12 16:30:42 +00:00
David McDonald
5526c89c34 Rename task and function for clarity
This doesn't just relate to precompiled letters; it's actually just
checking that there are not any letters still waiting for a virus check
that should not be. This change to the naming makes it more accurate
and therefore easier to understand.
2021-02-10 15:23:53 +00:00
David McDonald
1b9d8252ec Rename task and function for clarity
This doesn't just relate to templated letters; it's actually just
checking that there are not any letters still in created that should not
be. This change to the naming makes it more accurate and therefore
easier to understand.
2021-02-10 15:23:52 +00:00
David McDonald
3c0e609cc9 Add link to runbook for created letter alert
We've got the entry in the runbook; this will make it clear to go and
look at it.
2021-02-10 15:23:51 +00:00
Leo Hemsted
4f89be6944 Revert "Merge pull request #3125 from alphagov/revert-retry"
This reverts commit 6b9a50beff, reversing
changes made to 33f93dfea2.
2021-02-09 17:01:04 +00:00
Katie Smith
5eebcf6452 Put service callback retries on a different queue
At the moment, if a service callback fails, it will get put on the retry queue.
This causes a potential problem though:

If a service's callback server goes down, we may generate a lot of retries and
this may then put a lot of items on the retry queue. The retry queue is also
responsible for other important parts of Notify such as retrying message
delivery and we don't want a service's callback server going down to have an
impact on the rest of Notify.

Putting the retries on a different queue means that tasks get processed
faster than if they were put back on the same 'service-callbacks' queue.
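
A sketch of the routing in Celery terms (queue name and HTTP call are placeholders); retry() forwards options like queue to apply_async():

```python
from celery import Celery

app = Celery("notifications")

def post_to_service(payload):  # placeholder for the real HTTP callback
    raise ConnectionError("callback server is down")

@app.task(bind=True, max_retries=5)
def send_callback(self, payload):
    try:
        post_to_service(payload)
    except ConnectionError as exc:
        # Route retries to their own queue so a dead callback server
        # can't clog the retry queue the rest of Notify depends on.
        self.retry(exc=exc, queue="service-callbacks-retry")
```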
2021-02-09 13:31:16 +00:00
Leo Hemsted
bee0059e53 Revert "Merge pull request #3101 from alphagov/retry-broadcasts"
This reverts commit 1bd99c779d, reversing
changes made to d390eb2cac.
2021-02-08 11:02:34 +00:00
Leo Hemsted
49e6ec1ead Revert "Merge pull request #3123 from alphagov/retry-loop-fix"
This reverts commit 541a765811, reversing
changes made to 6a9ac654a6.
2021-02-08 11:01:33 +00:00
Leo Hemsted
d582e35471 don't try to send a broadcast event if it's already in technical-failure
this gives us an option to manually set a status in the database and
avoid things being stuck in a retry loop forever
2021-02-05 12:52:37 +00:00
Leo Hemsted
0ddebc63a8 reduce broadcast retry delay to 4 mins and drop prefetch.
### The facts

* Celery grabs up to 10 tasks from an SQS queue by default
* Each broadcast task takes a couple of seconds to execute, or double
  that if it has to go to the failover proxy
* Broadcast tasks delay retry exponentially, up to 300 seconds.
* Tasks are acknowledged when celery starts executing them.
* If a task is not acknowledged before its visibility timeout of 310
  seconds, SQS assumes the celery app has died, and puts it back on the
  queue.

### The situation

A task stuck in a retry loop was reaching its visibility timeout, and as
such SQS was duplicating it. We're unsure of the exact cause of reaching
its visibility timeout, but there were two contributing factors: The
celery prefetch and the delay of 300 seconds. Essentially, celery grabs
the task, keeps an eye on it locally while waiting for the delay ETA to
come round, then gives the task to a worker to do. However, that worker
might already have up to ten tasks that it's grabbed from SQS. This
means the worker only has 10 seconds to get through all those tasks and
start working on the delayed task, before SQS moves the task back into
available.

(Note that the delay of 300 seconds is translated into a timestamp based
on the time you called self.retry and put the task back on the queue,
whereas the visibility timeout starts ticking from the time that a
celery worker picked up the task.)

### The fix

#### Set the max retry delay for broadcast tasks to 240 seconds

Setting the max delay to 240 seconds means that instead of a 10 second
buffer before the visibility timeout is tripped, we've got a 70 second
buffer.

#### Set the prefetch limit to 1 for broadcast workers

This means that each worker will have up to 1 currently executing task,
and 1 task pending execution. If it has these, it won't grab any more
off the queue, so they can sit there without their visibility timeout
ticking up.

Setting the prefetch limit to 1 will result in more queries to SQS and a
lower throughput. This might be relevant in, eg, sending emails. But the
broadcast worker is not hyper-time critical.

https://docs.celeryproject.org/en/3.1/getting-started/brokers/sqs.html?highlight=acknowledge#caveats
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html?highlight=prefetch#reserve-one-task-at-a-time
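
A sketch of the two knobs in Celery terms (the option name shown is current Celery's; the 3.1-era equivalent was CELERYD_PREFETCH_MULTIPLIER, and how the app wires this is assumed):

```python
from celery import Celery

app = Celery("broadcasts")

# One executing task plus one reserved task per worker process; queued
# tasks stay on SQS, where their visibility timeout isn't ticking.
app.conf.worker_prefetch_multiplier = 1

MAX_RETRY_DELAY = 240  # seconds: a 70s buffer under the 310s visibility timeout

def retry_countdown(retry_count: int) -> int:
    # Exponential backoff (1, 2, 4, ... seconds), capped well below the
    # SQS visibility timeout so a delayed task isn't re-delivered.
    return min(2 ** retry_count, MAX_RETRY_DELAY)
```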
2021-02-05 12:49:51 +00:00
Leo Hemsted
bbae209200 check provider message status etc when sending rather than when retrying
previously we decided whether to retry at the point of retrying, which
meant future events wouldn't have context of what the task had done.
We'd run into issues with not knowing which references to include when
updating/cancelling in future events.

Instead of deciding whether to retry or not, always retry. Then, when
any event sends, regardless of whether it is a first attempt or a retry,
check the status of previous events for that broadcast message. There
are a few things that will mean we don't send:

* If the finishes_at time has already elapsed (ie: we have been trying
  to resend this message and haven't had any luck and now the data is
  obsolete)
* A previous event has no provider message (this means that we never
  picked the previous event off the queue for some reason)
* A previous event has a provider message that has anything other than
  an ack response. This includes sending (the old message is currently
  being sent), and technical-failure/returned-error (the old message is
  currently in the retry loop, having experienced issues).
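
A sketch of the pre-send check as a guard function (status strings follow the surrounding commits; the data shapes are illustrative):

```python
from datetime import datetime

def can_send_event(finishes_at: datetime, previous_message_statuses: list) -> bool:
    if finishes_at <= datetime.utcnow():
        return False  # broadcast content is obsolete, give up
    for status in previous_message_statuses:
        if status is None:
            return False  # earlier event never made it off the queue
        if status != "received-ack":
            return False  # earlier event still sending, or in a retry loop
    return True
```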
2021-02-03 18:11:52 +00:00
Leo Hemsted
96a0935d1c update broadcast provider message status on success/error
so we can distinguish erroring messages that are currently retrying
from those that sent successfully.
2021-02-03 18:03:16 +00:00
Leo Hemsted
3dcbfc3612 re-use existing provider message if task retries
previously it would crash with a unique constraint error. now, grab the
previous message.
2021-02-03 18:01:54 +00:00
Leo Hemsted
ac34fb9c05 retry sending broadcasts
Retry tasks if they fail to send a broadcast event. Note that each task
tries the regular proxy and the failover proxy for that provider. This
runs a bit differently from our other retries:

Retry with exponential backoff. Our other tasks retry with a fixed delay
of 5 minutes between tries. If we can't send a broadcast, we want to try
immediately. So instead, implement an exponential backoff (1, 2, 4, 8,
... seconds delay). We can't delay for longer than 310 seconds due to
visibility timeout settings in SQS, so cap the delay at that amount.

Normally we give up retrying after a set amount of time (often 4
hours). As broadcast content is much more important than normal
notifications, we don't ever want to give up on sending them to phones...

...UNLESS WE DO!

Sometimes we do want to give up sending a broadcast though! Broadcasts
have an expiry time, after which they stop showing up on people's
devices, so if that has passed then we don't need to send the broadcast
out.

Broadcast events can also be superseded by updates or cancels. Check
that the event is the most recent event for that broadcast message; if
not, give up, as we don't want to accidentally send out two conflicting
events for the same message.
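
A sketch of the give-up conditions (field names are illustrative):

```python
from datetime import datetime

def should_keep_retrying(event_id: str, latest_event_id: str,
                         expires_at: datetime) -> bool:
    # Give up if the broadcast has expired off people's devices...
    if expires_at <= datetime.utcnow():
        return False
    # ...or if an update/cancel event has superseded this one.
    if event_id != latest_event_id:
        return False
    return True  # otherwise keep retrying, with exponential backoff
```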
2021-02-03 16:43:01 +00:00
David McDonald
d390eb2cac Merge pull request #3112 from alphagov/channel-restriction
Set broadcast channel as a service setting
2021-02-03 11:46:04 +00:00
David McDonald
f441d5b4ce Add comment about service channels for updating 2021-02-03 11:37:02 +00:00
David McDonald
070b79c27e Downgrade exceptions to warnings to reduce emails
We already trigger a zendesk ticket for these two cases, meaning that
whenever we get this situation, we get 3 emails: one for the zendesk
ticket, one from logit raising the fact an exception occurred, and one
from cloudwatch doing the same.

We don't need all these emails; a zendesk ticket is sufficient.
Downgrading to a warning means this event will still be findable in our
logs, however.
2021-02-02 15:10:26 +00:00
David McDonald
f90b479c8d Use service setting to pick broadcast channel
For the moment this falls back to the "test" channel if they do not
have a ServiceBroadcastSetting, but we intend in future PRs to enforce
that all broadcast services have this property.
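
A sketch of the fallback (attribute names assumed from the ServiceBroadcastSetting model mentioned above):

```python
def broadcast_channel_for(service) -> str:
    # Use the per-service setting when present; fall back to 'test'
    # until every broadcast service is guaranteed to have one.
    setting = getattr(service, "service_broadcast_settings", None)
    return setting.channel if setting else "test"
```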
2021-02-01 14:10:41 +00:00
David McDonald
2aad3163e6 Allow CBC proxy client to take channel
This moves the hardcoding of the 'test' channel one step up, to where
we call `create_and_send_broadcast`.

After this, we can start deciding whether to give it the 'test' or
'severe' channel based on the service's channel setting.
2021-02-01 14:10:38 +00:00
David McDonald
86ea89cf76 Merge pull request #3098 from alphagov/downgrade-to-warning
Downgrade SMS provider request exceptions to warnings
2021-01-29 11:52:10 +00:00
David McDonald
ac6837cde5 Downgrade exception to warning for provider API call
When we send an HTTP request to our SMS providers, there is a
chance we get a 5xx status code back from them. Currently we log this as
two different exception-level logs.

If a provider has a funny few minutes, we could end up with
hundreds of exceptions thrown and PagerDuty waking someone up in the
middle of the night. These problems tend to fix themselves pretty
quickly, as we balance traffic from one SMS provider to the other
within 5 minutes.

By downgrading both exceptions to warnings in the case of a
`SmsClientResponseException`, we reduce the chance of being woken up
in the middle of the night for no reason.

If the error is not a `SmsClientResponseException`, then we will still
log at the exception level as before as this is more unexpected and we
may want to be alerted sooner.

We still want to know, though, if, say, both SMS providers go down at
the same time for an hour. We don't want our tasks to just sit there,
retrying every 5 minutes for the whole time, without us being aware (so
we can at least raise a statuspage update). Luckily we will still be
alerted, because our smoke tests will fail after 10 minutes and raise a
P1:
https://github.com/alphagov/notifications-functional-tests/blob/master/tests/functional/staging_and_prod/notify_api/test_notify_api_sms.py#L21
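
A sketch of the logging split (the exception class name is from the commit; the delivery function is a placeholder):

```python
import logging

logger = logging.getLogger(__name__)

class SmsClientResponseException(Exception):
    """An error response from an SMS provider."""

def deliver_sms(send_to_provider, notification_id):
    try:
        send_to_provider()
    except SmsClientResponseException as e:
        # Known, usually self-healing provider blips: warn, don't page.
        logger.warning("SMS request for %s failed: %s", notification_id, e)
        raise
    except Exception:
        # Anything else is unexpected, so keep the exception-level log.
        logger.exception("SMS request for %s failed unexpectedly", notification_id)
        raise
```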
2021-01-18 17:00:21 +00:00
Chris Hill-Scott
4eb4ea1772 Use cache for tasks that save notifications
These tasks need to repeatedly get the same template and service from
the database. We should be able to improve their performance by getting
the template and service from the cache instead, like we do in the REST
endpoint code.
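
A sketch of the cached lookup, assuming a simple dict cache in place of whatever the endpoint code really uses:

```python
_template_cache = {}

def get_template(template_id, fetch_from_db):
    # Save-notification tasks ask for the same template over and over:
    # serve from the cache and only hit the database on a miss.
    if template_id not in _template_cache:
        _template_cache[template_id] = fetch_from_db(template_id)
    return _template_cache[template_id]
```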
2021-01-18 10:25:24 +00:00
David McDonald
20627d96ea Put all broadcast tasks on the broadcast worker 2021-01-13 17:21:40 +00:00
David McDonald
c3ef23c771 Alert on 2nd class letters still in sending every day
In 8285ef5f89
we turned off alerting on 2nd class letters still being in sending on
certain days of the week because we were only sending letters out on
Mon, Wed, Fri.

Now we have swapped back to sending out 2nd class letters on all
workdays, so this change can be reverted. Note, I haven't reverted the
commit exactly, but rather the behaviour, whilst leaving in some tests
that explicitly test 2nd class letters for the alert in case we change
this again.
2021-01-13 11:21:27 +00:00
David McDonald
977554781f Add better logging message for tech failure
So we can easily identify which notification ID failed
2020-12-30 17:28:21 +00:00
David McDonald
2480f91667 Raise better exception on InvalidParameterValue error
There are several reasons why we might get an `InvalidParameterValue`
from the SES API. One, as correctly identified before in
https://github.com/alphagov/notifications-api/pull/713/files
is if we allow an email address on our side that SES rejects.

However, there are other types of errors that could cause an
`InvalidParameterValue`. One example is a `Header too long: 'Subject'`
error that we have seen happen in production. This shouldn't raise an
`InvalidEmailError` as that is not appropriate.

Therefore, we introduce a new exception,
`EmailClientNonRetryableException`, that represents a non-retryable
error from an email client, and which we can use whenever we get an
`InvalidParameterValue` error.

Note, I chose `EmailClientNonRetryableException` rather than
`SESClientNonRetryableException` because our code needs to catch this
exception and shouldn't be aware of which email client is being used;
it just needs to know that it came from one of the email clients (if in
time we have more than one).

In time, we may wish to extend the approach of having generic
`EmailClient` exceptions and `SMSClient` exceptions as this should be
the most extendable pattern and a good abstraction.
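
A sketch of the mapping (the new exception name is from the commit; the SES call and error-handling shape are assumptions):

```python
from botocore.exceptions import ClientError

class EmailClientNonRetryableException(Exception):
    """A permanent failure from whichever email client is in use."""

def send_email(ses_client, **send_kwargs):
    try:
        return ses_client.send_raw_email(**send_kwargs)
    except ClientError as e:
        if e.response["Error"]["Code"] == "InvalidParameterValue":
            # Covers rejected addresses, over-long Subject headers, etc.
            # Callers need not know SES sits behind this.
            raise EmailClientNonRetryableException(e.response["Error"]["Message"])
        raise
```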
2020-12-30 17:18:16 +00:00