This is happening on the AWS side now as part of
alphagov/notifications-broadcasts-infra#267 - but we still want to keep
the zendesk ticket as it contains useful context _and_ provides
visibility to the team.
Broadcasts created via the API [1] and the Admin app [2] should
both now have this field set. It's also more informative to show
this, and broadcasts created via the API don't have IDs anyway.
There's a small risk that an old broadcast that gets approved won't
have this data, but it's for information only and we intend to
backfill all old broadcasts in the near future.
[1]: 023a06d5fb
[2]: 7dbe3afa19
This is necessary until:
- The Admin app is using the new "areas(_2)" format to store and
retrieve data.
- We've migrated all existing broadcast messages to use the new
format.
Note that "areas" / "ids" isn't actually used for anything except
printing out the PagerDuty message - it's not sent to the proxy [1].
[1]: 6edc6c70aa/app/celery/broadcast_message_tasks.py (L190-L193)
This modifies the previous "(_)send_link_test" method to trigger a
link test for a specific lambda. We then call the method with both
the primary and failover lambdas in a new orchestrator method.
Since the _invoke_lambda function doesn't raise exceptions if it
fails, there's no need to rescue anything in order to ensure the
second link test / invocation runs as well. I haven't added a test
for this, since it boils down to an absence of any code that could
raise an exception.
Note that, like the other parent tests, we only check the new method
works with a specific proxy client instance.
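A minimal sketch of the shape this ends up in, with illustrative
lambda names, and assuming `_invoke_lambda` signals failure via its
return value rather than raising:
```python
class CBCProxyEE:
    # Illustrative names; the real client is configured with the provider's
    # primary and failover lambda function names.
    lambda_name = 'ee-az1-proxy'
    failover_lambda_name = 'ee-az2-proxy'

    def _invoke_lambda(self, lambda_name, payload):
        # The real method calls the lambda via boto3 and returns True/False;
        # it never raises on failure.
        return True

    def _send_link_test(self, lambda_name):
        self._invoke_lambda(lambda_name, payload={'message_type': 'test'})

    def send_link_test(self):
        # No rescue needed: because _invoke_lambda signals failure via its
        # return value, the failover invocation below always runs.
        self._send_link_test(self.lambda_name)
        self._send_link_test(self.failover_lambda_name)
```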
Unlike the other IDs which are stored in the DB, this isn't relevant
for the Celery task as it invokes a link test. Moving it into the
proxy client will also enable us to generate a second ID in the next
commits, where we start doing a link test for the failover lambda.
Previously the Celery task to trigger a link test had to know about
the special case of a sequence number for Vodafone. Since we're about
to change the client to perform multiple tests, it makes sense to give
it the knowledge of how to generate the number itself.
Note that we have to import the db inline to avoid a circular import,
since this module is itself imported by app/__init__.py.
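A minimal sketch of the inline import, with a hypothetical sequence
name and proxy class (not the real schema):
```python
from sqlalchemy import text


class CBCProxyVodafone:
    def _get_sequence_number(self):
        # Imported inside the method to avoid the circular import: this
        # module is itself imported by app/__init__.py, which defines `db`.
        from app import db

        # Hypothetical sequence name, for illustration only.
        return db.session.execute(
            text("SELECT nextval('broadcast_provider_message_number_seq')")
        ).scalar()
```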
Other invocations of the Vodafone client use stored sequence numbers
from the DB, which are called "message numbers" in that context. Since
the two use cases are very different (even the names are different!),
having them in two places shouldn't cause any confusion.
Many of the team members do not look at emails from Zendesk, so this
adds a current_app.logger.error message for things we care about, to
give developers a better chance of seeing them.
I have purposely not added an error log for
`check_for_services_with_high_failure_rates_or_sending_to_tv_numbers`
because it's not something we need to look at immediately.
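For the checks we do care about, the pattern is roughly as below;
`send_ticket_to_zendesk` here is a placeholder, not the real Zendesk
client API:
```python
from flask import current_app


def send_ticket_to_zendesk(message):
    # Placeholder for the real Zendesk client call.
    pass


def alert_team(message):
    # Log an error as well as raising the ticket, so developers who don't
    # read Zendesk emails still have a chance of spotting it in the logs.
    current_app.logger.error(message)
    send_ticket_to_zendesk(message)
```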
This ensures that the log messages contain both the broadcast_event id
and the broadcast_provider_message id. It also removes the
broadcast_event reference, since this isn't particularly useful in
helping to find an event.
It wasn't clear what the ID in the message was. It's not possible to add
more details to the message - we don't create a broadcast message or
event for a link test.
While both of these are integrity errors (since we should never
reach this point in the code + data), this just means the original
method comment is still relevant to what immediately follows it.
This mirrors the check we do for jobs, which are also a high-impact
task [1]. While this shouldn't be possible, just like other checks
we're adding it here to be doubly certain.
[1]: 3d71815956/app/celery/tasks.py (L74)
We only actually use this when the data we're working with is in an
unexpected state, which is unrelated to the CBC Proxy. Using this
name also means we can re-use this exception in the next commits.
Note that we may still care if a broadcast message has expired, since
it's not expected that someone would send one in this condition.
This change will make our development environments closer to production
even if they aren't hooked up to the CBC proxy lambda functions.
Now in development, we will create the broadcast event and create tasks
for each broadcast provider event. We will still not create actual
broadcast provider message rows in the DB, or talk to the CBC proxies.
This should be helpful in development to catch any issues we introduce
to do with sending broadcast messaging. In time we may wish to have some
fake CBC proxies in the AWS tools account that we can interact with to
make it even more realistic.
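Roughly, the branch now only skips the proxy call itself rather than
skipping everything. A sketch, assuming a `CBC_PROXY_ENABLED` style
config flag; the task name is illustrative:
```python
from flask import current_app


def send_broadcast_provider_message(broadcast_event_id, provider):
    # In development we still create the broadcast event and a task per
    # provider, but stop short of writing a broadcast_provider_message row
    # or invoking the CBC proxy lambdas.
    if not current_app.config.get('CBC_PROXY_ENABLED'):
        current_app.logger.info(
            f'CBC proxy disabled, not sending event {broadcast_event_id} to {provider}'
        )
        return

    # ...create the broadcast_provider_message row and call the proxy here...
```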
config['NOTIFY_ENVIRONMENT'] is hardcoded to `'live'` in the Live config
class. These are the values as seen on the environment we send real
messages from:
```
>>> json.loads(os.environ['VCAP_APPLICATION'])['space_name'] # what cloudfoundry sets
'production'
>>> os.environ['NOTIFY_ENVIRONMENT'] # we set this from cloudfoundry
'production'
>>> current_app.config['NOTIFY_ENVIRONMENT'] # hardcoded in the Live config
'live'
>>> current_app.config['NOTIFICATION_QUEUE_PREFIX'] # pulled from env var of same name
'live'
>>> current_app.config['ENV'] # this is an unrelated flask variable
'production'
```
It's important to keep tabs on when these things leave our system.
Sending a zendesk ticket that triggers a P1 is probably our simplest way
of notifying the team when this happens (it's what we do with out of
hours emergencies on the admin app too). We don't have any direct
pagerduty integrations from the api app, but we already have the zendesk
client hooked up.
After broadcasts go live, we may want to change this to a P2 (but even
then, I think there are arguments for keeping it P1 to start with).
Don't cause a P1 if it goes out on staging as that might be MNOs testing.
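A rough sketch of that guard, assuming we key off the hardcoded
NOTIFY_ENVIRONMENT value shown above; the helper names are
placeholders, not the real Zendesk client:
```python
from flask import current_app


def create_p1_zendesk_ticket(subject, message):
    # Placeholder for the real Zendesk client call that pages via PagerDuty.
    pass


def alert_broadcast_sent(broadcast_event_id):
    # Only page from the real production environment; a broadcast going out
    # on staging may just be the MNOs testing.
    if current_app.config['NOTIFY_ENVIRONMENT'] == 'live':
        create_p1_zendesk_ticket(
            subject='Live broadcast sent',
            message=f'Broadcast event {broadcast_event_id} has left our system',
        )
```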
Previously we would retry if the task was queued up for retry but the
status was in "received-ack" or "received-err". We don't expect that a
task will be retried after getting this status, but if there are
duplicate tasks that could happen. Let's plan for the worst by saying
"only process a retry if the task is currently in sending".
This way, if a duplicate task is on retry and the first task goes
through successfully, the duplicate task will give up.
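A minimal sketch of the guard; the class and status values are
assumptions based on the statuses mentioned above:
```python
class BroadcastProviderMessageStatus:
    # Illustrative constants for the statuses mentioned above.
    SENDING = 'sending'
    ACK = 'received-ack'
    ERR = 'received-err'


def should_process_retry(provider_message):
    # Only process a retry if the message is still in "sending". If a
    # duplicate task has already moved it to received-ack or received-err,
    # the retrying task gives up rather than sending again.
    return provider_message.status == BroadcastProviderMessageStatus.SENDING
```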
### The facts
* Celery grabs up to 10 tasks from an SQS queue by default
* Each broadcast task takes a couple of seconds to execute, or double
that if it has to go to the failover proxy
* Broadcast tasks delay retry exponentially, up to 300 seconds.
* Tasks are acknowledged when Celery starts executing them.
* If a task is not acknowledged before its visibility timeout of 310
seconds, SQS assumes the Celery app has died, and puts it back on the
queue.
### The situation
A task stuck in a retry loop was reaching its visibility timeout, and as
such SQS was duplicating it. We're unsure of the exact cause of reaching
its visibility timeout, but there were two contributing factors: the
Celery prefetch and the delay of 300 seconds. Essentially, Celery grabs
the task, keeps an eye on it locally while waiting for the delay ETA to
come round, then gives the task to a worker to do. However, that worker
might already have up to ten tasks that it's grabbed from SQS. This
means the worker only has 10 seconds to get through all those tasks and
start working on the delayed task, before SQS moves the task back into
available.
(Note that the delay of 300 seconds is translated into a timestamp based
on the time you called self.retry and put the task back on the queue.
Whereas the visibility timeout starts ticking from the time that a
celery worker picked up the task.)
### The fix
#### Set the max retry delay for broadcast tasks to 240 seconds
Setting the max delay to 240 seconds means that instead of a 10 second
buffer before the visibility timeout is tripped, we've got a 70 second
buffer.
#### Set the prefetch limit to 1 for broadcast workers
This means that each worker will have up to 1 currently executing task,
and 1 task pending execution. If it has these, it won't grab any more
off the queue, so they can sit there without their visibility timeout
ticking up.
Setting a prefetch limit to 1 will result in more queries to SQS and a
lower throughput. This might be relevant in, eg, sending emails. But the
broadcast worker is not hyper-time critical.
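In config terms, the two changes look roughly like this (the setting
name follows the Celery 3.x docs linked below; treat it as a sketch
rather than the exact diff):
```python
# Broadcast worker: only reserve one task beyond the one being executed, so
# queued tasks stay in SQS and their visibility timeout doesn't start ticking.
CELERYD_PREFETCH_MULTIPLIER = 1


def broadcast_retry_countdown(retry_count):
    # Exponential backoff capped at 240 seconds, leaving ~70 seconds of
    # headroom before the 310 second SQS visibility timeout is tripped.
    return min(240, 2 ** retry_count)
```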
https://docs.celeryproject.org/en/3.1/getting-started/brokers/sqs.html?highlight=acknowledge#caveats
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html?highlight=prefetch#reserve-one-task-at-a-time
Previously, if we were deciding whether to retry or not, it meant that
future events wouldn't have context of what the task is doing. We'd
run into issues with not knowing what references to include when
updating/cancelling in future events.
Instead of deciding whether to retry or not, always retry. Instead, when
any event sends, regardless of whether it is a first time or a retry,
check the status of previous events for that broadcast message. There
are a few things that will mean we don't send (see the sketch after
this list):
* If the finishes_at time has already elapsed (ie: we have been trying
to resend this message and haven't had any luck and now the data is
obsolete)
* A previous event has no provider message (this means that we never
picked the previous event off the queue for some reason)
* A previous event has a provider message that has anything other than
an ack response. This includes sending (the old message is currently
being sent), and technical-failure/returned-error (the old message is
currently in the retry loop, having experienced issues).
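A rough sketch of those checks, with assumed attribute names
(`finishes_at`, `provider_message`, `status`):
```python
from datetime import datetime, timezone


def ok_to_send(event, previous_events):
    # The content is obsolete once finishes_at has passed.
    if event.finishes_at < datetime.now(timezone.utc):
        return False

    for previous in previous_events:
        # No provider message means the earlier event was never picked off
        # the queue, so we don't know what the provider has seen.
        if previous.provider_message is None:
            return False

        # Anything other than an ack (still sending, technical-failure,
        # returned-error) means the earlier event is still unresolved.
        if previous.provider_message.status != 'received-ack':
            return False

    return True
```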
Retry tasks if they fail to send a broadcast event. Note that each task
tries the regular proxy and the failover proxy for that provider. This
runs a bit differently than our other retries:
Retry with exponential backoff. Our other tasks retry with a fixed delay
of 5 minutes between tries. If we can't send a broadcast, we want to try
immediately. So instead, implement an exponential backoff (1, 2, 4, 8,
... seconds delay). We can't delay for longer than 310 seconds due to
visibility timeout settings in SQS, so cap the delay at that amount.
Normally we give up retrying after a set amount of retries (often 4
hours). As broadcast content is much more important than normal
notifications, we don't ever want to give up on sending them to phones...
...UNLESS WE DO!
Sometimes we do want to give up sending a broadcast though! Broadcasts
have an expiry time, when they stop showing up on people's devices, so if
that has passed then we don't need to send the broadcast out.
Broadcast events can also be superseded by updates or cancels. Check
that the event is the most recent event for that broadcast message; if
not, give up, as we don't want to accidentally send out two conflicting
events for the same message.
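Putting that together, a hedged sketch of the retry policy (the task,
helper, and attribute names are illustrative, not the real task):
```python
from datetime import datetime, timezone

from celery import shared_task
from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)


def send_to_proxies(broadcast_event):
    # Placeholder for the call that tries the primary, then failover, lambda.
    raise NotImplementedError


@shared_task(bind=True, max_retries=None)  # retry forever, not just N times
def send_broadcast_event_sketch(self, broadcast_event):
    # Give up if the broadcast has already expired...
    if broadcast_event.finishes_at < datetime.now(timezone.utc):
        logger.info('Broadcast expired, not sending')
        return

    # ...or if a newer update/cancel event has superseded this one.
    if not broadcast_event.is_most_recent_event:
        logger.info('Event superseded by an update/cancel, not sending')
        return

    try:
        send_to_proxies(broadcast_event)
    except Exception as exc:
        # Exponential backoff (1, 2, 4, 8, ... seconds), capped below the
        # 310 second SQS visibility timeout.
        self.retry(countdown=min(300, 2 ** self.request.retries), exc=exc)
```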
For the moment, this falls back to the "test" channel if they do not
have a ServiceBroadcastSetting, but we intend in future PRs to enforce
that all broadcast services will have this property.
This moves the hardcoding of the test channel one step up, to where we
call `create_and_send_broadcast`.
After this we can start to vary whether we give it the 'test' or
'severe' channel based on the service's channel setting.
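A minimal sketch of the channel lookup this is heading towards (the
attribute names are assumptions):
```python
def get_broadcast_channel(service):
    # Use the channel from the service's broadcast settings if it has one;
    # fall back to 'test' until every broadcast service has the setting.
    if service.service_broadcast_settings:
        return service.service_broadcast_settings.channel
    return 'test'
```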
Previously we made some incorrect assumptions about set-up on staging
and prod - they currently don't have any cbc_proxy AWS creds at all.
We shouldn't be attempting canaries or link tests when there's no AWS
infrastructure to connect to.
We also shouldn't bother writing a row into the database at all for the
broadcast_provider_message since we're not even attempting to send, and
we shouldn't get confused between messages that failed and messages we
never wanted to send at all.
This is a pretty big and convoluted refactor, unfortunately.
Previously:
There was one global `cbc_proxy_client` object in apps. This class has
the information about how to invoke the bt-ee lambda, and handles all
calls to lambda. This includes calls to the canary too (which is a
separate lambda).
The future:
There's one global `cbc_proxy_client`. This knows about the different
provider functions and lambdas, and you'll need to ask this client for a
proxy for your chosen provider: call `cbc_proxy_client.get_proxy('ee')`
and it'll return you a proxy that knows what ee's lambda function is,
how to transform any content in a way that is exclusive to ee, and in
future how to parse any response from ee.
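A rough sketch of that shape, with illustrative proxy classes:
```python
class CBCProxyEE:
    lambda_name = 'ee-proxy'  # illustrative


class CBCProxyVodafone:
    lambda_name = 'vodafone-proxy'  # illustrative


class CbcProxyClient:
    # More of a factory than a client: you ask it for the proxy for your
    # chosen provider, and that proxy knows the provider's lambda functions
    # and any provider-specific content handling.
    _proxies = {
        'ee': CBCProxyEE,
        'vodafone': CBCProxyVodafone,
    }

    def get_proxy(self, provider):
        return self._proxies[provider]()


cbc_proxy_client = CbcProxyClient()
proxy = cbc_proxy_client.get_proxy('ee')
```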
The present:
I also cleaned up some duplicate tests.
I'm really not sure about the names of some of these variables - in
particular `cbc_proxy_client` isn't a client - it's more of a Java-style
factory, where you call a function on it to get the client of your
choice.