### The facts
* Celery grabs up to 10 tasks from an SQS queue by default
* Each broadcast task takes a couple of seconds to execute, or double
that if it has to go to the failover proxy
* Broadcast tasks delay retry exponentially, up to 300 seconds.
* Tasks are acknowledged when Celery starts executing them.
* If a task is not acknowledged before its visibility timeout of 310
seconds, SQS assumes the Celery app has died and puts the task back on
the queue.
### The situation
A task stuck in a retry loop was reaching its visibility timeout, so SQS
was duplicating it. We're unsure of the exact cause, but there were two
contributing factors: the Celery prefetch and the retry delay of 300
seconds. Essentially, Celery grabs the task, keeps an eye on it locally
while waiting for the retry ETA to come round, then gives the task to a
worker to execute. However, that worker might already have up to ten
other tasks that it has grabbed from SQS. Since the visibility timeout
is 310 seconds and the retry ETA can be up to 300 seconds away, the
worker only has 10 seconds to get through all those tasks and start
working on the delayed task before SQS moves it back to available.
(Note that the delay of 300 seconds is translated into a timestamp based
on the time you called `self.retry` and put the task back on the queue,
whereas the visibility timeout starts ticking from the time a Celery
worker picked up the task.)
### The fix
#### Set the max retry delay for broadcast tasks to 240 seconds
Setting the max delay to 240 seconds means that instead of a 10 second
buffer before the visibility timeout is tripped, we've got a 70 second
buffer (310 - 240).
#### Set the prefetch limit to 1 for broadcast workers
This means that each worker will have at most 1 currently executing task
and 1 task pending execution. If it has these, it won't grab any more
off the queue, so the rest can sit there without their visibility
timeouts starting.
Setting the prefetch limit to 1 will result in more queries to SQS and
lower throughput. This might matter for, e.g., sending emails, but the
broadcast worker is not hyper time-critical.
https://docs.celeryproject.org/en/3.1/getting-started/brokers/sqs.html?highlight=acknowledge#caveats
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html?highlight=prefetch#reserve-one-task-at-a-time
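
In Celery configuration terms, the prefetch change is roughly this (using
the Celery 3.x-style setting names from the docs linked above; our actual
config module may look different):

```python
# Each worker process reserves only one extra task beyond the one it is
# executing, so the rest stay on SQS and their visibility timeouts don't start.
CELERYD_PREFETCH_MULTIPLIER = 1

# The SQS visibility timeout (in seconds) that the rest of this note refers to.
BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 310}
```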
Otherwise we run into issues where we don't issue the cancel, because we
say "oh look, the expiry time just passed, so we shouldn't send this
message as it's already been removed from the CBC".
Previously, if we were deciding whether to retry or not, it meant that
future events wouldn't have context of what the task is doing. We'd
run into issues with not knowing which references to include when
updating/cancelling in future events.
Instead of deciding whether to retry or not, always retry. Then, whenever
any event sends, regardless of whether it is a first attempt or a retry,
check the status of previous events for that broadcast message. There
are a few things that will mean we don't send (a rough sketch follows
the list):
* If the finishes_at time has already elapsed (i.e. we have been trying
to resend this message and haven't had any luck, and now the data is
obsolete)
* A previous event has no provider message (this means that we never
picked the previous event off the queue for some reason)
* A previous event has a provider message that has anything other than
an ack response. This includes sending (the old message is currently
being sent), and technical-failure/returned-error (the old message is
currently in the retry loop, having experienced issues).
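
A rough sketch of those checks (the model and field names here are
illustrative, not our real code):

```python
from datetime import datetime, timezone


def should_send_broadcast_event(broadcast_event, previous_events):
    """Run before every send, whether it's a first attempt or a retry."""
    # The data is obsolete once finishes_at has passed.
    if broadcast_event.finishes_at < datetime.now(timezone.utc):
        return False

    for previous_event in previous_events:
        # We never picked the previous event off the queue for some reason.
        if previous_event.provider_message is None:
            return False

        # The previous event is still sending, or is stuck in its own retry
        # loop (technical-failure / returned-error).
        if previous_event.provider_message.status != 'ack':
            return False

    return True
```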
Retry tasks if they fail to send a broadcast event. Note that each task
tries the regular proxy and the failover proxy for that provider. This
runs a bit differently than our other retries:
Retry with exponential backoff. Our other tasks retry with a fixed delay
of 5 minutes between tries. If we can't send a broadcast, we want to
start retrying almost immediately, so instead we implement an exponential
backoff (1, 2, 4, 8, ... seconds delay). We can't delay for longer than
310 seconds due to the visibility timeout settings in SQS, so we cap the
delay at that amount.
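
A minimal sketch of that backoff, assuming a bound Celery task (the names
and the proxy call are illustrative):

```python
from celery import shared_task


def send_to_cbc_proxy(broadcast_event_id):
    """Stand-in for the real send, which tries the regular proxy then the failover proxy."""
    raise NotImplementedError


@shared_task(bind=True, max_retries=None)  # we decide when to give up ourselves
def send_broadcast_event(self, broadcast_event_id):
    try:
        send_to_cbc_proxy(broadcast_event_id)
    except Exception as exc:
        # 1, 2, 4, 8, ... seconds, capped below the SQS visibility timeout.
        delay = min(2 ** self.request.retries, 310)
        raise self.retry(countdown=delay, exc=exc)
```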
Normally we give up retrying after a set number of retries (often 4
hours). As broadcast content is much more important than normal
notifications, we don't ever want to give up on sending them to phones...
...UNLESS WE DO!
Sometimes we do want to give up sending a broadcast though! Broadcasts
have an expiry time, after which they stop showing up on people's
devices, so if that has passed then we don't need to send the broadcast
out.
Broadcast events can also be superseded by updates or cancels. Check
that the event is the most recent event for that broadcast message; if
not, give up, as we don't want to accidentally send out two conflicting
events for the same message.
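
Roughly, the give-up check covers those two cases (again, names here are
illustrative rather than our real models):

```python
from datetime import datetime, timezone


def should_give_up(broadcast_event, all_events_for_message):
    """Illustrative only: the two cases where we stop retrying."""
    # The broadcast has expired, so it no longer shows on people's devices.
    if broadcast_event.finishes_at < datetime.now(timezone.utc):
        return True

    # A newer update or cancel event supersedes this one; sending it now
    # could put two conflicting events out for the same message.
    newest_event = max(all_events_for_message, key=lambda event: event.sent_at)
    if newest_event.id != broadcast_event.id:
        return True

    return False
```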
We already trigger a Zendesk ticket for these two cases, meaning that
whenever we get this situation, we get 3 emails: one for the Zendesk
ticket, one from Logit raising the fact that an exception was raised,
and one from CloudWatch doing the same.
We don't need all these emails; a Zendesk ticket is sufficient.
Downgrading to a warning means this event will still be findable in our
logs, however.
This falls back to the "test" channel if they do not have a
ServiceBroadcastSetting for the moment, but we intend in future PRs to
enforce that all broadcast services will have this property.
This moves the hardcoding to test channels one step up to where we call
`create_and_send_broadcast`
After this, we can start to decide whether to give it the 'test' or
'severe' channel based on the service's channel setting.
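
Roughly, the fallback looks like this (the attribute names are a guess,
not our real models):

```python
def channel_for_service(service):
    # Fall back to the 'test' channel until every broadcast service is
    # guaranteed to have a ServiceBroadcastSetting row.
    if service.service_broadcast_settings is not None:
        return service.service_broadcast_settings.channel  # e.g. 'test' or 'severe'
    return 'test'
```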
This will allow us to store details of which channel a service should be
sending to.
See the comment about how all broadcast services can have a row in the
table but may not at the moment. This has been done for speed as it's
the quickest way to let us set up different services to send to
different channels for some needed testing with the mobile handset
providers in the coming week.
This used to be hardcoded in the CBC proxy but now we will hardcode it
in the cbc_proxy_client.
In a future PR we can start choosing which channel a broadcast will go
to based on the channel configured for that broadcast service.
We’re going to let people pass in fairly complex polygons, but:
- we don’t want to store massive polygons
- we don’t want to pass the CBCs massive polygons
So this commit adds a step to simplify the polygons before storing them.
We think it’s best for us to do this because:
- writing code to do polygon simplification is non-trivial, and we don’t
want to make all potential integrators do it
- the simplification we've developed is domain-specific to emergency
alerting, so should throw away less information than a general-purpose
approach would
There’s a bit more detail about how we simplify polygons in
https://github.com/alphagov/notifications-admin/pull/3590/files
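
The real simplification lives in the admin PR linked above; as a very
rough illustration of the general idea (using Shapely, which is not
necessarily what that PR does):

```python
from shapely.geometry import Polygon


def simplify_polygon(coordinate_pairs, tolerance=0.001):
    """Illustrative only: cut down the number of points in a polygon before
    storing it or passing it to the CBCs."""
    simplified = Polygon(coordinate_pairs).simplify(tolerance, preserve_topology=True)
    return list(simplified.exterior.coords)
```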
This gives us some extra confidence that there aren't any problems with
the data we're getting from the other service. It doesn't address any
specific problem we've seen; rather, it seems like a sensible precaution
to take.
This commit makes the existing endpoint also accept CAP XML, should the
appropriate `Content-Type` header be set.
It uses the translation code we added in a previous commit to convert
the CAP to a dict. We can then validate that dict against the JSON
schema to ensure it's something we can work with.
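
Sketching the endpoint logic (assuming a Flask-style request, `xmltodict`
for the XML-to-dict step and `jsonschema` for validation; the
`application/cap+xml` value and the function name are assumptions, and
the real translation code is the one added in the earlier commit):

```python
import xmltodict
from flask import abort, request
from jsonschema import ValidationError, validate


def get_validated_broadcast_request(schema):
    """Illustrative: accept CAP XML or JSON on the same endpoint."""
    if request.content_type == 'application/cap+xml':
        # Translate the CAP XML into a dict (stand-in for our real translation code).
        data = xmltodict.parse(request.get_data())
    else:
        data = request.get_json()

    try:
        validate(instance=data, schema=schema)
    except ValidationError:
        abort(400)

    return data
```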
We know there is at least one system which wants to integrate with
Notify to send out emergency alerts, rather than creating them manually.
This commit adds an endpoint to the public API to let them do that.
To start with we’ll just let the system create them in a single call,
meaning they still have to be approved manually. This reduces the risk
of an attacker being able to broadcast an alert via the API, should the
other system be compromised.
We’ve worked with the owners of the other system to define which fields
we should care about initially.
We want to rename the `bt-ee-1-proxy` lambda function to `ee-1-proxy`.
This change will need to be deployed at the same time that we change
the name of the lambda function in the Terraform.
O2 use the One-2-many CBC, so we can use the O2M/CAP client.
Once the differences between CBCs have been worked out, we can consolidate the O2M clients to reduce duplication.
Signed-off-by: Richard Baker <richard.baker@digital.cabinet-office.gov.uk>
So:
'billing_contact_email_address' becomes 'billing_contact_email_addresses'
AND
'billing_contact_name' becomes 'billing_contact_names'
This is to signify that each of those fields can contain numerous
items
This has been added in for speed of development, but now that we are
getting close to integrating with production systems, we will be turning
off these helpful hacks to reduce the risk of someone sending a real
broadcast to citizens.
Note, platforms are still able to approve broadcasts when their service
is in training mode.
When we send an HTTP request to our SMS providers, there is a
chance we get a 5xx status code back from them. Currently we log this as
two different exception-level logs.
If a provider has a funny few minutes, we could end up with
hundreds of exceptions thrown and PagerDuty waking someone up in the
middle of the night. These problems tend to fix themselves pretty
quickly, as we balance traffic from one SMS provider to the other within
5 minutes.
By downgrading both exceptions to warnings in the case of a
`SmsClientResponseException`, we will reduce the chance of waking us up
in the middle of the night for no reason.
If the error is not a `SmsClientResponseException`, then we will still
log at the exception level as before, since this is more unexpected and
we may want to be alerted sooner.
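
In rough terms (the exception class name is ours; everything else here is
illustrative):

```python
from flask import current_app


class SmsClientResponseException(Exception):
    """Stand-in for the real exception raised on 5xx responses from a provider."""


def log_send_sms_error(exception, notification_id):
    if isinstance(exception, SmsClientResponseException):
        # The provider is having a funny few minutes; traffic will be balanced
        # to the other provider within 5 minutes, so don't page anyone.
        current_app.logger.warning(
            'SMS provider error for notification %s: %s', notification_id, exception
        )
    else:
        # Unexpected failure: keep the exception-level log so we get alerted sooner.
        current_app.logger.exception(
            'Unexpected error sending notification %s', notification_id
        )
```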
We do still want to know if, say, both SMS providers went down at the
same time for an hour. We don't want our tasks to just sit there,
retrying every 5 minutes for the whole time, without us being aware (so
that we can at least raise a Statuspage update). Luckily we will still
be alerted, because our smoke tests will fail after 10 minutes and raise
a P1:
https://github.com/alphagov/notifications-functional-tests/blob/master/tests/functional/staging_and_prod/notify_api/test_notify_api_sms.py#L21
These tasks need to repeatedly get the same template and service from
the database. We should be able to improve their performance by getting
the template and service from the cache instead, like we do in the REST
endpoint code.
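
As a very rough sketch of the shape of the change (our real cache is the
Redis-backed one shared with the REST endpoint code; the DAO function and
the use of `lru_cache` below are stand-ins to show the idea):

```python
from functools import lru_cache


def dao_get_template_by_id(template_id):
    """Stand-in for the real database query."""
    raise NotImplementedError


@lru_cache(maxsize=None)
def get_cached_template(template_id):
    # Repeated tasks for the same template reuse the first result instead of
    # querying the database every time, mirroring what the REST endpoints do
    # with their cache.
    return dao_get_template_by_id(template_id)
```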
In the admin app we need something to show in lieu of the template name
when a template isn't used. Let's store this in the reference field
for now.
For all FunctionErrors, and for invoke errors (status > 299), we
want to retry with the failover lambda.
We are doing this because, if there is a connection or other error
with one lambda, the failover lambda may still work and it's
worth trying.
In time we will probably have a more complex retry flow, depending
on the error and maybe even differing for each MNO (broadcast provider).
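
A minimal sketch of that flow with boto3 (function names and error
handling are illustrative, not the real proxy client):

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError

lambda_client = boto3.client('lambda')


def invoke_with_failover(primary_function_name, failover_function_name, payload):
    """Illustrative: try the primary proxy lambda; on any error, try the failover."""
    try:
        response = lambda_client.invoke(
            FunctionName=primary_function_name,
            InvocationType='RequestResponse',
            Payload=payload,
        )
        # 'FunctionError' is set when the lambda itself raised; StatusCode > 299
        # covers other unsuccessful invocations.
        if 'FunctionError' not in response and response['StatusCode'] <= 299:
            return response
    except (BotoCoreError, ClientError):
        pass  # connection or other invoke error: fall through to the failover

    return lambda_client.invoke(
        FunctionName=failover_function_name,
        InvocationType='RequestResponse',
        Payload=payload,
    )
```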