Commit Graph

51 Commits

Author SHA1 Message Date
Ben Thorner
0070473f31 Check for suspension before sending a broadcast
This mirrors the check we do for jobs, which are also high-impact
tasks [1]. While this state shouldn't be possible, we're adding the
check here, as with the other checks, to be doubly certain.

[1]: 3d71815956/app/celery/tasks.py (L74)
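
A minimal sketch of the check, with assumed names (not taken from the
codebase):

```python
# Hypothetical sketch only: function and attribute names are assumed.
def check_service_not_suspended(service):
    # Mirror the job-level check: refuse to broadcast for inactive services.
    if not service.active:
        raise ValueError("service {} is suspended".format(service.id))
```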
2021-04-19 17:13:12 +01:00
Ben Thorner
b2398fcaf4 Rename CBCProxyFatalException
We only actually use this when the data we're working with is in an
unexpected state, which is unrelated to the CBC Proxy. Using this
name also means we can re-use this exception in the next commits.

Note that we may still care if a broadcast message has expired, since
it's not expected that someone would send one in this condition.
2021-04-19 17:13:05 +01:00
Ben Thorner
f85dad5acf Merge pull request #3203 from alphagov/remove-statsd-decorators
Remove redundant @statsd timing decorators
2021-04-14 10:04:04 +01:00
David McDonald
295162c81d Move CBC proxy enable check
This change will make our development environments closer to production
even if they aren't hooked up to the CBC proxy lambda functions.

Now in development, we will create the broadcast event and create tasks
for each broadcast provider event. We still won't create actual
broadcast provider message rows in the DB or talk to the CBC proxies.

This should be helpful in development to catch any issues we introduce
around sending broadcast messages. In time we may wish to have some
fake CBC proxies in the AWS tools account that we can interact with to
make it even more realistic.
2021-04-12 17:05:41 +01:00
Ben Thorner
e3e067c795 Remove redundant @statsd timing decorators
These are superseded by timing task execution generically in the
NotifyTask superclass [1]. Note that we had to wait until we'd
gathered enough data under the new metrics before removing these.

[1]: https://github.com/alphagov/notifications-api/pull/3201#pullrequestreview-633549376
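
A minimal sketch of the generic timing, assuming a simplified
superclass; the real NotifyTask and its statsd wiring will differ:

```python
import time

from celery import Task


class TimedTask(Task):
    def __call__(self, *args, **kwargs):
        start = time.monotonic()
        try:
            return super().__call__(*args, **kwargs)
        finally:
            # the real superclass would emit a statsd timing metric
            # named after the task, rather than printing
            print("task {} took {:.3f}s".format(self.name, time.monotonic() - start))
```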
2021-04-12 15:19:18 +01:00
Leo Hemsted
4a5b1c23bd only send zendesk P1 for alerts
we don't need to be re-notified when someone clicks cancel
2021-04-08 12:22:18 +01:00
Leo Hemsted
9bd8c0239c look for 'live', not 'production'
config['NOTIFY_ENVIRONMENT'] is hardcoded to `'live'` in the Live config
class. The values as seen on the environment which we send real messages
from:

```
>>> json.loads(os.environ['VCAP_APPLICATION'])['space_name']  # what cloudfoundry sets
'production'
>>> os.environ['NOTIFY_ENVIRONMENT']  # we set this from cloudfoundry
'production'
>>> current_app.config['NOTIFY_ENVIRONMENT']  # hardcoded in the Live config
'live'
>>> current_app.config['NOTIFICATION_QUEUE_PREFIX']  # pulled from env var of same name
'live'
>>> current_app.config['ENV']  # this is an unrelated flask variable
'production'
```
2021-04-08 12:17:22 +01:00
Leo Hemsted
df393e36c5 send a p1 when a broadcast goes out on production
it's important to keep tabs on when these things leave our system.
Sending a zendesk ticket that triggers a P1 is probably our simplest way
of notifying the team when this happens (it's what we do with out of
hours emergencies on the admin app too). We don't have any direct
pagerduty integrations from the api app, but we already have the zendesk
client hooked up.

After broadcasts go live, we may want to change this to a P2 (but even
then, there are arguments for keeping it P1 to start with, I think).

Don't cause a P1 if it goes out on staging as that might be MNOs testing.
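
roughly the shape of it, with illustrative names only (we already have
the zendesk client hooked up, but the exact call here is an assumption):

```python
def notify_team_of_live_broadcast(zendesk_client, broadcast_event_id, environment):
    if environment != 'live':
        return  # staging broadcasts might just be MNOs testing
    zendesk_client.create_ticket(
        subject='Broadcast event sent',
        message='Broadcast event {} has gone out'.format(broadcast_event_id),
        ticket_type='incident',  # assumption: this is what triggers the P1
    )
```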
2021-04-06 11:32:19 +01:00
Ben Thorner
a91fde2fda Run auto-correct on app/ and tests/ 2021-03-12 11:45:45 +00:00
David McDonald
3ea86bfb48 Remove hardcoded default to use test channel
There is no need for a default now, as every broadcast service has its
broadcast channel set on it.
2021-02-23 17:15:07 +00:00
Leo Hemsted
0088bcd98b only retry if the broadcast message task is in sending
previously we would retry even if the task's status was already
"received-ack" or "received-err". We don't expect that a task will be
retried after getting one of these statuses, but it could happen if
there are duplicate tasks. Let's plan for the worst by saying "only
process a retry if the task is currently in sending".

this way, if a duplicate task is on retry and the first task goes
through successfully, the duplicate task will give up.
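
a sketch of the guard (status string from this commit; the surrounding
task code is assumed):

```python
SENDING = 'sending'

def should_process_retry(provider_message):
    # a duplicate task may come round after the first task already got
    # an ack/err response; only carry on if we're still mid-send
    return provider_message.status == SENDING
```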
2021-02-18 12:03:36 +00:00
Katie Smith
6b8ebb3421 Fix linting errors 2021-02-16 09:03:38 +00:00
Leo Hemsted
4f89be6944 Revert "Merge pull request #3125 from alphagov/revert-retry"
This reverts commit 6b9a50beff, reversing
changes made to 33f93dfea2.
2021-02-09 17:01:04 +00:00
Leo Hemsted
bee0059e53 Revert "Merge pull request #3101 from alphagov/retry-broadcasts"
This reverts commit 1bd99c779d, reversing
changes made to d390eb2cac.
2021-02-08 11:02:34 +00:00
Leo Hemsted
49e6ec1ead Revert "Merge pull request #3123 from alphagov/retry-loop-fix"
This reverts commit 541a765811, reversing
changes made to 6a9ac654a6.
2021-02-08 11:01:33 +00:00
Leo Hemsted
d582e35471 don't try to send broadcast event if it's already in technical-failure
this gives us an option to manually set a status in the database and
avoid things being stuck in a retry loop forever
2021-02-05 12:52:37 +00:00
Leo Hemsted
0ddebc63a8 reduce broadcast retry delay to 4 mins and drop prefetch.
### The facts

* Celery grabs up to 10 tasks from an SQS queue by default
* Each broadcast task takes a couple of seconds to execute, or double
  that if it has to go to the failover proxy
* Broadcast tasks delay retry exponentially, up to 300 seconds.
* Tasks are acknowledged when celery starts executing them.
* If a task is not acknowledged before its visibility timeout of 310
  seconds, sqs assumes the celery app has died, and puts it back on the
  queue.

### The situation

A task stuck in a retry loop was reaching its visibility timeout, and as
such SQS was duplicating it. We're unsure of the exact cause, but there
were two contributing factors: the
celery prefetch and the delay of 300 seconds. Essentially, celery grabs
the task, keeps an eye on it locally while waiting for the delay ETA to
come round, then gives the task to a worker to do. However, that worker
might already have up to ten tasks that it's grabbed from SQS. This
means the worker only has 10 seconds to get through all those tasks and
start working on the delayed task, before SQS moves the task back into
available.

(Note that the delay of 300 seconds is translated into a timestamp based
on the time you called self.retry and put the task back on the queue.
Whereas the visibility timeout starts ticking from the time that a
celery worker picked up the task.)

### The fix

#### Set the max retry delay for broadcast tasks to 240 seconds

Setting the max delay to 240 seconds means that instead of a 10 second
buffer before the visibility timeout is tripped, we've got a 70 second
buffer.

#### Set the prefetch limit to 1 for broadcast workers

This means that each worker will have up to 1 currently executing task,
and 1 task pending execution. If it has these, it won't grab any more
off the queue, so they can sit there without their visibility timeout
ticking up.

Setting a prefetch limit to 1 will result in more queries to SQS and a
lower throughput. This might be relevant in, eg, sending emails. But the
broadcast worker is not hyper-time critical.

https://docs.celeryproject.org/en/3.1/getting-started/brokers/sqs.html?highlight=acknowledge#caveats
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html?highlight=prefetch#reserve-one-task-at-a-time
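
a sketch of the relevant settings, using celery 3.1-era config names
from the linked docs (the layout of our actual config is assumed):

```python
# broadcast workers: hold at most one prefetched task at a time
CELERYD_PREFETCH_MULTIPLIER = 1

# sqs puts unacknowledged tasks back on the queue after this long
BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 310}  # seconds

# cap the exponential retry delay below the visibility timeout,
# leaving a 70 second buffer (setting name is an assumption)
BROADCAST_MAX_RETRY_DELAY = 240  # seconds
```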
2021-02-05 12:49:51 +00:00
Leo Hemsted
bbae209200 check provider message status etc when sending rather than when retrying
previously we decided at retry time whether to retry or not, which
meant future events wouldn't have context of what the task was doing.
We'd run into issues with not knowing what references to include when
updating/cancelling in future events.

So now we always retry. Whenever an event sends, regardless of whether
it's a first attempt or a retry, we check the status of previous events
for that broadcast message. A few things will mean we don't send (see
the sketch after this list):

* If the finishes_at time has already elapsed (ie: we have been trying
  to resend this message and haven't had any luck and now the data is
  obsolete)
* A previous event has no provider message (this means that we never
  picked the previous event off the queue for some reason)
* A previous event has a provider message that has anything other than
  an ack response. This includes sending (the old message is currently
  being sent), and technical-failure/returned-error (the old message is
  currently in the retry loop, having experienced issues).
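
a sketch of those checks (model and status names are assumptions, not
the real schema):

```python
from datetime import datetime, timezone

ACK = 'returned-ack'  # assumed status string

def can_send(broadcast_event, earlier_provider_messages):
    if broadcast_event.transmitted_finishes_at < datetime.now(timezone.utc):
        return False  # the content is already obsolete
    for message in earlier_provider_messages:
        if message is None:
            return False  # a previous event never got picked off the queue
        if message.status != ACK:
            return False  # still sending, or stuck in the retry loop
    return True
```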
2021-02-03 18:11:52 +00:00
Leo Hemsted
96a0935d1c update broadcast provider message status on success/error
so we can distinguish erroring messages that are currently retrying
from those that sent successfully.
2021-02-03 18:03:16 +00:00
Leo Hemsted
3dcbfc3612 re-use existing provider message if task retries
previously it would crash with a unique constraint error. now, we grab
the previous message instead.
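
a sketch of the get-or-create (the sqlalchemy query shape is assumed):

```python
def get_or_create_provider_message(session, BroadcastProviderMessage, event, provider):
    existing = session.query(BroadcastProviderMessage).filter_by(
        broadcast_event_id=event.id,
        provider=provider,
    ).one_or_none()
    if existing is not None:
        return existing  # a retry: re-use rather than trip the unique constraint
    message = BroadcastProviderMessage(broadcast_event_id=event.id, provider=provider)
    session.add(message)
    return message
```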
2021-02-03 18:01:54 +00:00
Leo Hemsted
ac34fb9c05 retry sending broadcasts
Retry tasks if they fail to send a broadcast event. Note that each task
tries the regular proxy and the failover proxy for that provider. This
runs a bit differently than our other retries:

Retry with exponential backoff. Our other tasks retry with a fixed delay
of 5 minutes between tries. If we can't send a broadcast, we want to try
immediately. So instead, implement an exponential backoff (1, 2, 4, 8,
... seconds delay). We can't delay for longer than 310 seconds due to
visibility timeout settings in SQS, so cap the delay at that amount.

Normally we give up retrying after a set number of retries (often
4 hours' worth). As broadcast content is much more important than normal
notifications, we don't ever want to give up on sending them to phones...

...UNLESS WE DO!

Sometimes we do want to give up sending a broadcast! Broadcasts have an
expiry time, after which they stop showing up on people's devices, so
if that has passed then we don't need to send the broadcast out.

Broadcast events can also be superseded by updates or cancels. Check
that the event is the most recent event for that broadcast message; if
not, give up, as we don't want to accidentally send out two conflicting
events for the same message.
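
a sketch of the retry shape using celery's retry api (the send helper
is a placeholder):

```python
from celery import shared_task

MAX_RETRY_DELAY = 300  # seconds; can't exceed the sqs visibility timeout


def send_via_proxy_with_failover(broadcast_event_id, provider):
    ...  # placeholder: try the regular proxy, then the failover


@shared_task(bind=True, max_retries=None)  # never give up by retry count alone
def send_broadcast_provider_message(self, broadcast_event_id, provider):
    try:
        send_via_proxy_with_failover(broadcast_event_id, provider)
    except Exception as exc:
        delay = min(2 ** self.request.retries, MAX_RETRY_DELAY)  # 1, 2, 4, 8, ...
        raise self.retry(exc=exc, countdown=delay)
```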
2021-02-03 16:43:01 +00:00
David McDonald
f441d5b4ce Add comment about service channels for updating 2021-02-03 11:37:02 +00:00
David McDonald
f90b479c8d Use service setting to pick broadcast channel
For the moment this falls back to the "test" channel if they do not
have a ServiceBroadcastSetting, but we intend in future PRs to enforce
that all broadcast services have this property.
2021-02-01 14:10:41 +00:00
David McDonald
2aad3163e6 Allow CBC proxy client to take channel
This moves the hardcoding to test channels one step up to where we call
`create_and_send_broadcast`

We can then, after this, start to vary whether we give it the 'test' or
'severe' channel based on the service's channel setting.
2021-02-01 14:10:38 +00:00
David McDonald
20627d96ea Put all broadcast tasks on the broadcast worker 2021-01-13 17:21:40 +00:00
Pea Tyczynska
45b806f6db Remove unused args from cancel broadcast call in tasks 2020-12-14 11:31:05 +00:00
Pea Tyczynska
def7a16765 Establish relation between provider message and message number
this is so we can access broadcast_provider_message_number from the
BroadcastProviderMessage object
2020-12-09 11:41:22 +00:00
Pea Tyczynska
8af4b27fd6 Separate functions for cbc clients
Also move message_format to the clients.
2020-12-09 11:13:50 +00:00
Pea Tyczynska
553565bc91 Send message format to CBC
Either cap or ibag
2020-12-08 11:15:26 +00:00
Pea Tyczynska
2952b70930 Only create sequential numbers for Vodafone messages 2020-12-07 13:13:13 +00:00
Pea Tyczynska
e95dc9450e Include message number in send_broadcast_provider_message 2020-12-07 13:13:12 +00:00
Pea Tyczynska
a186d2d296 Format sequential number into an 8 char long hex
As per Vodafone spec for ibag format message number
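
The formatting itself is just zero-padded hex, e.g.:

```
>>> '{:08x}'.format(4660)
'00001234'
```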
2020-12-07 13:13:11 +00:00
Pea Tyczynska
b34bffaae6 Sends sequential number to Vodafone as link test 2020-12-07 13:13:11 +00:00
Leo Hemsted
fd335e3d8b move available provider logic to the service model
make sure it's in an accessible place so we don't end up duplicating our
work
2020-12-03 22:50:50 +00:00
Leo Hemsted
72f8a15d4f respect service broadcast provider restrictions when sending 2020-12-03 13:39:09 +00:00
Leo Hemsted
e2fa0116a0 add CBC_PROXY_ENABLED config flag to control if tasks are triggered
previously we made some incorrect assumptions about set-up on staging
and prod - they currently don't have any cbc_proxy aws creds at all.

We shouldn't be attempting canaries or link tests when there's no AWS
infrastructure to connect to.

We also shouldn't bother writing a row into the database at all for the
broadcast_provider_message since we're not even attempting to send, and
we shouldn't get confused between messages that failed and messages we
never wanted to send at all.
2020-11-26 10:16:22 +00:00
Leo Hemsted
087cc5053d separate cbc proxy into separate clients
this is a pretty big and convoluted refactor unfortunately.

Previously:

There was one global `cbc_proxy_client` object in the app. This class
has the information about how to invoke the bt-ee lambda, and handles
all calls to the lambda. This includes calls to the canary too (which
is a separate lambda).

The future:

There's one global `cbc_proxy_client`. This knows about the different
provider functions and lambdas, and you'll need to ask this client for a
proxy for your chosen provider. Call `cbc_proxy_client.get_proxy('ee')`
and it'll return you a proxy that knows what EE's lambda function is,
how to transform any content in a way that is exclusive to EE, and in
future how to parse any response from EE.

The present:

I also cleaned up some duplicate tests.
I'm really not sure about the names of some of these variables - in
particular `cbc_proxy_client` isn't a client - it's more of a Java-style
factory, where you call a function on it to get the client of your
choice.
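
a sketch of the factory shape (class and lambda names are assumed):

```python
class CBCProxyEE:
    lambda_name = 'bt-ee-proxy'  # assumed name

    def __init__(self, lambda_client):
        self._lambda_client = lambda_client


class CBCProxyClient:
    def init_app(self, lambda_client):
        self._lambda_client = lambda_client

    def get_proxy(self, provider):
        proxy_classes = {'ee': CBCProxyEE}  # one entry per provider
        return proxy_classes[provider](self._lambda_client)
```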
2020-11-19 15:50:37 +00:00
Leo Hemsted
0257774cfa add get_earlier_provider_message fn to broadcast_event
replacing get_earlier_provider_messages. The old function returned the
previous references for earlier events for a broadcast_message. However,
these depend on the message sent to a specific provider, so the function
needs to change. It now takes in a provider, and only returns
broadcast_provider_messages sent to that provider. If there are earlier
broadcast_events without a provider_message for the chosen provider, it
raises an exception - you cannot cancel a message if the previous
events have not all been created properly (as we wouldn't know what
references to cancel).
2020-11-19 15:50:37 +00:00
Leo Hemsted
f12c949ae9 create broadcast_provider_message and use id from that instead
(instead of using the id from broadcast_event)

we need every XML blob we send to have a different ID. if we're sending
different XML blobs for each provider, then each one should have a
different identifier. So, instead of taking the identifier from the
broadcast_event, take it from the broadcast_provider_message instead.

Note: We're still going to the broadcast_event for most fields, to
ensure they stay consistent between different providers. The last thing
we want is for different phone networks to get different content.
2020-11-19 15:50:37 +00:00
Leo Hemsted
7cc83e04eb move BroadcastProvider from models.py to config.py
It's not something that is tied to a database table, and was causing
circular import issues
2020-11-19 15:50:37 +00:00
Leo Hemsted
bc3512467b send messages to multiple providers
at the moment only EE is enabled. This is set in app.config, but also
only EE has a function defined for it, so even if another provider were
enabled without changing the dict in cbc_proxy.py we wouldn't trigger
anything. this commit just adds wrapper tasks that check what providers
are enabled, and invokes the send function for each provider (see the
sketch below).

The send function doesn't currently distinguish between providers, as
we only have EE set up. in the future we'll want to separate
the cbc_proxy_client into separate clients for separate providers.
Different providers have different lambda functions, and have different
requirements. For example, we know that the two different CBC software
solutions handle references to previous messages differently.
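
a sketch of the fan-out (task names and the provider lookup are
assumptions):

```python
from celery import shared_task


def get_enabled_providers():
    return ['ee']  # only EE for now; eventually read from app.config


@shared_task(name='send-broadcast-event')
def send_broadcast_event(broadcast_event_id):
    # fan out: one provider-message task per enabled provider
    for provider in get_enabled_providers():
        send_broadcast_provider_message.apply_async(args=(broadcast_event_id, provider))


@shared_task(name='send-broadcast-provider-message')
def send_broadcast_provider_message(broadcast_event_id, provider):
    ...  # placeholder: invoke the provider's lambda via the cbc proxy client
```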
2020-11-19 15:50:37 +00:00
Toby Lorne
dd012d6831 client: cbc_proxy passes through sent/expires
A BroadcastEvent knows when an event was sent and should expire

We pass through these values directly to the CBC Proxy, because
BroadcastEvent knows how they should be formatted

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
2020-10-28 11:37:06 +00:00
Toby Lorne
d22adb7813 broadcasts: do not send description to cbc_proxy
"areas" and "simple_polygons" in "transmitted_areas" do not have the
same length

as an example, choosing the area "england" results in a single item in
"areas" but many polygons in "simple_polygons"

therefore zipping these two together gives a list of areas:
* of length 1
* containing only new grimsby

which is not what we want

as the CBC does not care about the areaDesc field within CAP, we should
omit it from the function invocation and delegate the contents of
areaDesc to the CBC Proxy implementation
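
the truncation is easy to demonstrate:

```
>>> areas = ['england']
>>> simple_polygons = ['poly-1', 'poly-2', 'poly-3']  # stand-ins for real polygons
>>> list(zip(areas, simple_polygons))  # zip stops at the shorter input
[('england', 'poly-1')]
```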

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Richard <richard.baker@digital.cabinet-office.gov.uk>
Co-authored-by: David <david.mcdonald@digital.cabinet-office.gov.uk>
2020-10-26 15:16:33 +00:00
Toby Lorne
fdacb2e0d7 broadcasts: remove references to cbc stub
We are phasing out our cbc-proxy stub which displayed CAP XML messages

We are in the process of testing with real CBCs, so maintaining our own
stub is not useful

This commit
* removes the HTTP POST requests to the CBC proxy
* stubs out the update/cancel methods of the cbc_client (not implemented)

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
2020-10-26 10:23:49 +00:00
Toby Lorne
aa002afd31 clients: cbc_proxy actions accepts areas param
related:
https://github.com/alphagov/notifications-broadcasts-infra/pull/23

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
2020-10-23 17:09:00 +01:00
Toby Lorne
2cc0a65851 celery: broadcast msg create invokes cbc proxy
When we create a broadcast message, we should invoke the cbc proxy to
send a cap message

Either a function will be invoked within AWS, or a noop function call
is made, depending on the environment

We have only implemented CB message creation in the CBC Proxy, without
polygons, therefore we:
* only invoke the CBC Proxy during message creation
* only send description, identifier, and hard-coded headline

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Pea <pea.tyczynska@digital.cabinet-office.gov.uk>
Co-authored-by: Katie <katie.smith@digital.cabinet-office.gov.uk>
2020-10-20 11:57:26 +01:00
David McDonald
5b2dee5ddb Bump utils to 42.0.0
Requires unit test updating as we now expect broadcast event areas to
be a dict containing a list of areas and simple polygons.
2020-09-14 15:21:55 +01:00
David McDonald
d352c99142 Remove unused send_broadcast_message task
We only call send_broadcast_event now
2020-09-14 15:16:59 +01:00
Leo Hemsted
bdf2253298 send broadcast events rather than messages
use the new endpoint from cbc proxy. create a new task that just
serializes the event and sends it across rather than sending a template
and the broadcast message.

some changes to serialize to make it JSON-friendly etc. it also expects
sent_at and transmitted_finishes_at to always be set (we set them in the
code but don't enforce it in the DB right now), as they're required by
the utils template. not sure whether we'll update db constraints to be
more strict or the utils template to be more permissive just yet; we'll
wait until we find out more about the requirements of the CBCs we
integrate with.
2020-08-14 17:41:44 +01:00
Leo Hemsted
8b4a954f2b move schema import in to function level
solves `AttributeError: 'DummySession' object has no attribute 'query'`

if you don't do this you get really hard-to-diagnose errors in
unrelated tests, due to strange import order problems or something
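
a sketch of the pattern (module and schema names assumed):

```python
def serialize_service(service):
    # import at call time, not module import time, to dodge the
    # import-order problem described above
    from app.schemas import service_schema

    return service_schema.dump(service)
```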
2020-07-09 20:10:34 +01:00