This change will make our development environments closer to production
even if they aren't hooked up to the CBC proxy lambda functions.
Now in development, we will create the broadcast event and create tasks
for each broadcast provider event. We will still not create actual
broadcast provider message rows in the DB and talk to the CBC proxies.
This should be helpful in development to catch any issues we introduce
to do with sending broadcast messaging. In time we may wish to have some
fake CBC proxies in the AWS tools account that we can interact with to
make it even more realistic.
Previously we used a '@statsd' decorator to time and count Celery
tasks [1]. Using a decorator isn't ideal since we need to remember
to add it to every task we define. In addition, it's not possible
to use data like the task name and queue.
In order to avoid breaking existing stats, this duplicates them as
new StatsD metrics until we have sufficient data to update dashboards
using the old ones. Using the CeleryTask superclass to send metrics
avoids a future maintenance overhead, and means we can include more
useful data in the StatsD metric. Note that the new metrics will sit
in StatsD until we add a mapping for them [2].
StatsD automatically produces a 'count' stat for timing metrics, so
we don't need to increment a separate counter for successful tasks.
[1]: dea5828d0e/app/celery/tasks.py (L65)
[2]: https://github.com/alphagov/notifications-aws/blob/master/paas/statsd/statsd-mapping.yml
This is mainly so we can use it in the new metrics we send to StatsD
in the following commits, but it should also be useful in the logs.
I've taken the opportunity to make the log format consistent between
success / failure, and with our Template Preview app [1].
[1]: f456433a5a/app/celery/celery.py (L19)
config['NOTIFY_ENVIRONMENT'] is hardcoded to `'live'` in the Live config
class. The values as seen on the environment which we send real messages
from:
```
>>> json.loads(os.environ['VCAP_APPLICATION'])['space_name'] # what cloudfoundry sets
'production'
>>> os.environ['NOTIFY_ENVIRONMENT'] # we set this from cloudfoundry
'production'
>>> current_app.config['NOTIFY_ENVIRONMENT'] # hardcoded in the Live config
'live'
>>> current_app.config['NOTIFICATION_QUEUE_PREFIX'] # pulled from env var of same name
'live'
>>> current_app.config['ENV'] # this is an unrelated flask variable
'production'
```
it's important to keep tabs on when these things leave our system.
Sending a zendesk ticket that triggers a P1 is probably our simplest way
of notifying the team when this happens (it's what we do with out of
hours emergencies on the admin app too). We don't have any direct
pagerduty integrations from the api app, but we already have the zendesk
client hooked up.
After broadcasts go live, we may want to change this to a P2 (but even
then, there's arguments for keeping it P1 to start with I think).
Don't cause a P1 if it goes out on staging as that might be MNOs testing.
Now that https://github.com/alphagov/notifications-api/pull/3184 has
been deployed for a while, the `send_delivery_status_to_service` task will
always have `template_id` and `template_version` being passed in. This
means we don't need to check if those fields are there.
This adds the `template_id` and `template_version` fields to the data
sent to services from the `send_delivery_status_to_service` task.
We need to account for the task not being passed these fields at first
since there might be tasks retrying which don't have that data. Once all
tasks have been called with the new fields we can then update the code
to assume they are always there.
Since we only send delivery status callbacks for SMS and emails, I've
removed the tests where we call that task with letters.
This is not required by DVLA and since [1] we no longer care about
the end of letter filenames when collating them, so removing it is
safe to do. Note that the name of the ZIP files of collated letters
is based on a hash of the filenames, which needed updating in tests.
Before merging this we need to do a test run in Staging, so DVLA can
check that a mixture of the old / new filenames won't cause issues.
[1]: https://github.com/alphagov/notifications-api/pull/3172
We have a scheduled task that was checking for jobs still in progress.
We saw a case where a scheduled job was stuck in a `pending` status as a
result of an app shutting down. This changes the `check_job_status` task
so that it also checks for scheduled jobs which are still pending after
30 minutes.
Previously we did some unnecessary work:
- Collate task. This had one S3 request to get a summary of the object,
which was then used in another request to get the full object. We only
need the size of the object, which is included in the summary [1].
- Archive task. This had one S3 request to get a summary of the object,
which was then used to make another request to delete it. We still need
both requests, but we can remove the S3.Object in the middle.
[1]: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#objectsummary
Previously, the function would just return a presumed filename. Now that
it actually checks s3, if the file doesn't exist it'll raise an
exception. By default that's a StopIteration at the end of the bucket
iterator, which isn't ideal as this will get supressed if the function
is called within a generator loop further up or anything.
There are a couple of places where we expect the file may not exist, so
we define a custom exception to rescue specifically here. I did consider
subclassing boto's ClientError, but this wasn't straightforward as the
constructor expects to know the operation that failed, which for me is a
signal that it's not an appropriate (re-)use of the class.
Previously we generated the filename we expected a letter PDF to be
stored at in S3, and used that to retrieve it. However, the generated
filename can change over the course of a notification's lifetime e.g.
if the service changes from crown ('.C.') to non-crown ('.N.').
The prefix of the filename is stable: it's based on properties of the
notification - reference and creation - that don't change. This commit
changes the way we interact with letter PDFs in S3:
- Uploading uses the original method to generate the full file name.
The method is renamed to 'generate_' to distinguish it from the new one.
- Downloading uses a new 'find_' method to get the filename using just
its prefix, which makes it agnostic to changes in the filename suffix.
Making this change helps to decouple our code from the requirements DVLA
have on the filenames. While it means more traffic to S3, we rely on S3
in any case to download the files. From experience, we know S3 is highly
reliable and performant, so don't anticipate any issues.
In the tests we favour using moto to mock S3, so that the behaviour is
realistic. There are a couple of places where we just mock the method,
since what it returns isn't important for the test.
Note that, since the new method requires a notification object, we need
to change a query in one place, the columns of which were only selected
to appease the original method to generate a filename.
We no longer will send them any stats so therefore don't need the code
- the code to work out the nightly stats
- the performance platform client
- any configuration for the client
- any nightly tasks that kick off the sending off the stats
We will require a change in cronitor as we no longer will have this task
run meaning we need to delete the cronitor check.
We current do this as part of send-daily-performance-platform-stats but
now this moves it into its own separate task. This is for two reasons
- we will shortly get rid of the send-daily-performance-platform-stats
task as we no longer will need to send anything to performance
platform
- even if we did decide to keep the task
send-daily-performance-platform-stats and remove the specific bits
that relate to the performance platform, it's probably nicer to
rewrite the new task from scratch to make sure it's all clear and easy
to understand
All other tasks in app/celery/*_tasks.py have timers on them. Some
of these timers will be useful to check before/after performance as
a way to reassure ourselves about the impact of [1].
[1]: https://github.com/alphagov/notifications-api/pull/3172
previously we would retry if the task was queued up for retry but the
status is in "received-ack" or "received-err". We don't expect that a
task will be retried after getting this status, but if there are
duplicate tasks that could happen. Lets plan for the worst by saying
"only process a retry if the task is currently in sending".
this way, if a duplicate task is on retry and the first task goes
through succesfully, the duplicate task will give up.
This doesn't just relate to precompiled letters, it's actually just
checking that there are not any letters still waiting for a virus check
that should not be. This change to the naming makes it more accurate
and therefore easy to understand
This doesn't just relate to templated letters, it's actually just
checking that there are not any letters still in created that should not
be. This change to the naming makes it more accurate and therefore easy
to understand
At the moment, if a service callback fails, it will get put on the retry queue.
This causes a potential problem though:
If a service's callback server goes down, we may generate a lot of retries and
this may then put a lot of items on the retry queue. The retry queue is also
responsible for other important parts of Notify such as retrying message
delivery and we don't want a service's callback server going down to have an
impact on the rest of Notify.
Putting the retries on a different queue means that tasks get processed
faster than if they were put back on the same 'service-callbacks' queue.
### The facts
* Celery grabs up to 10 tasks from an SQS queue by default
* Each broadcast task takes a couple of seconds to execute, or double
that if it has to go to the failover proxy
* Broadcast tasks delay retry exponentially, up to 300 seconds.
* Tasks are acknowledged when celery starts executing them.
* If a task is not acknowledged before its visibility timeout of 310
seconds, sqs assumes the celery app has died, and puts it back on the
queue.
### The situation
A task stuck in a retry loop was reaching its visbility timeout, and as
such SQS was duplicating it. We're unsure of the exact cause of reaching
its visibility timeout, but there were two contributing factors: The
celery prefetch and the delay of 300 seconds. Essentially, celery grabs
the task, keeps an eye on it locally while waiting for the delay ETA to
come round, then gives the task to a worker to do. However, that worker
might already have up to ten tasks that it's grabbed from SQS. This
means the worker only has 10 seconds to get through all those tasks and
start working on the delayed task, before SQS moves the task back into
available.
(Note that the delay of 300 seconds is translated into a timestamp based
on the time you called self.retry and put the task back on the queue.
Whereas the visibility timeout starts ticking from the time that a
celery worker picked up the task.)
### The fix
#### Set the max retry delay for broadcast tasks to 240 seconds
Setting the max delay to 240 seconds means that instead of a 10 second
buffer before the visibility timeout is tripped, we've got a 70 second
buffer.
#### Set the prefetch limit to 1 for broadcast workers
This means that each worker will have up to 1 currently executing task,
and 1 task pending execution. If it has these, it won't grab any more
off the queue, so they can sit there without their visibility timeout
ticking up.
Setting a prefetch limit to 1 will result in more queries to SQS and a
lower throughput. This might be relevant in, eg, sending emails. But the
broadcast worker is not hyper-time critical.
https://docs.celeryproject.org/en/3.1/getting-started/brokers/sqs.html?highlight=acknowledge#caveatshttps://docs.celeryproject.org/en/3.1/userguide/optimizing.html?highlight=prefetch#reserve-one-task-at-a-time
previously if we were deciding whether to retry or not, it meant that
future events wouldn't have context of what the task is doing. We'd
run into issues with not knowing what references to include when
updating/cancelling in future events.
Instead of deciding whether to retry or not, always retry. Instead, when
any event sends, regardless of whether it is a first time or a retry,
check the status of previous events for that broadcast message. There
are a few things that will mean we don't send.
* If the finishes_at time has already elapsed (ie: we have been trying
to resend this message and haven't had any luck and now the data is
obselete)
* A previous event has no provider message (this means that we never
picked the previous event off the queue for some reason)
* A previous event has a provider message that has anything other than
an ack response. This includes sending (the old message is currently
being sent), and technical-failure/returned-error (the old message is
currently in the retry loop, having experienced issues).