3821 Commits

Author SHA1 Message Date
Leo Hemsted
d582e35471 don't try and send broadcast event if it's already in technical-failure
this gives us an option to manually set a status in the database and
avoid things being stuck in a retry loop forever
2021-02-05 12:52:37 +00:00
Leo Hemsted
0ddebc63a8 reduce broadcast retry delay to 4 mins and drop prefetch.
### The facts

* Celery grabs up to 10 tasks from an SQS queue by default
* Each broadcast task takes a couple of seconds to execute, or double
  that if it has to go to the failover proxy
* Broadcast tasks delay retry exponentially, up to 300 seconds.
* Tasks are acknowledged when celery starts executing them.
* If a task is not acknowledged before its visibility timeout of 310
  seconds, sqs assumes the celery app has died, and puts it back on the
  queue.

### The situation

A task stuck in a retry loop was reaching its visibility timeout, and as
such SQS was duplicating it. We're unsure of the exact cause of reaching
its visibility timeout, but there were two contributing factors: The
celery prefetch and the delay of 300 seconds. Essentially, celery grabs
the task, keeps an eye on it locally while waiting for the delay ETA to
come round, then gives the task to a worker to do. However, that worker
might already have up to ten tasks that it's grabbed from SQS. This
means the worker only has 10 seconds to get through all those tasks and
start working on the delayed task, before SQS moves the task back into
available.

(Note that the delay of 300 seconds is translated into a timestamp based
on the time you called self.retry and put the task back on the queue.
Whereas the visibility timeout starts ticking from the time that a
celery worker picked up the task.)

### The fix

#### Set the max retry delay for broadcast tasks to 240 seconds

Setting the max delay to 240 seconds means that instead of a 10 second
buffer before the visibility timeout is tripped, we've got a 70 second
buffer.

#### Set the prefetch limit to 1 for broadcast workers

This means that each worker will have up to 1 currently executing task,
and 1 task pending execution. If it has these, it won't grab any more
off the queue, so they can sit there without their visibility timeout
ticking up.

Setting a prefetch limit to 1 will result in more queries to SQS and a
lower throughput. This might be relevant in, eg, sending emails. But the
broadcast worker is not hyper-time critical.

https://docs.celeryproject.org/en/3.1/getting-started/brokers/sqs.html?highlight=acknowledge#caveats
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html?highlight=prefetch#reserve-one-task-at-a-time
2021-02-05 12:49:51 +00:00
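The timing in 0ddebc63a8 can be checked with a little arithmetic. The sketch below is illustrative only: the constant, function and dict names are hypothetical, and the settings use Celery's `worker_prefetch_multiplier` and `broker_transport_options` names (the former is spelled `CELERYD_PREFETCH_MULTIPLIER` in Celery 3.1), which may not match how the repository actually configures its workers.

```python
# Illustrative only: names are hypothetical, values come from the commit message.
VISIBILITY_TIMEOUT_SECONDS = 310  # SQS visibility timeout quoted in the commit


def visibility_buffer(max_retry_delay, visibility_timeout=VISIBILITY_TIMEOUT_SECONDS):
    """Seconds left to start a delayed task before SQS makes it visible again."""
    return visibility_timeout - max_retry_delay


old_buffer = visibility_buffer(300)  # buffer with the old 300 second cap
new_buffer = visibility_buffer(240)  # buffer with the new 240 second cap

# Assumed shape of the broadcast worker settings; the real config may differ.
broadcast_worker_settings = {
    "worker_prefetch_multiplier": 1,  # reserve one task per worker process
    "broker_transport_options": {"visibility_timeout": VISIBILITY_TIMEOUT_SECONDS},
}
```

With these numbers the buffer grows from 10 to 70 seconds, which matches the figures in the commit message.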
Leo Hemsted
eff0119f5c don't update finishes_at when cancelling broadcast
otherwise we run into issues where we don't issue the cancel as we say
"oh look the expiry time just passed, so we shouldn't send this message
as it's already been removed from the cbc".
2021-02-04 14:25:38 +00:00
Leo Hemsted
1ef3f96bd7 test sending broadcast message for all statuses of existing provider_msg
also clean up some comments
2021-02-04 11:50:06 +00:00
Leo Hemsted
bbae209200 check provider message status etc when sending rather than when retrying
previously if we were deciding whether to retry or not, it meant that
future events wouldn't have context of what the task is doing. We'd
run into issues with not knowing what references to include when
updating/cancelling in future events.

Instead of deciding whether to retry or not, always retry. Then, when
any event sends, regardless of whether it is a first time or a retry,
check the status of previous events for that broadcast message. There
are a few things that will mean we don't send.

* If the finishes_at time has already elapsed (ie: we have been trying
  to resend this message and haven't had any luck and now the data is
  obsolete)
* A previous event has no provider message (this means that we never
  picked the previous event off the queue for some reason)
* A previous event has a provider message that has anything other than
  an ack response. This includes sending (the old message is currently
  being sent), and technical-failure/returned-error (the old message is
  currently in the retry loop, having experienced issues).
2021-02-03 18:11:52 +00:00
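The three checks listed in bbae209200 could look roughly like the sketch below. It is not the repository's code: `PreviousEvent`, `should_send` and the `"returned-ack"` status string are assumptions made for illustration, and the real models will carry far more state.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

ACK = "returned-ack"  # assumed status name for an acknowledged provider message


@dataclass
class PreviousEvent:
    # None means no provider message was ever created for this event.
    provider_message_status: Optional[str]


def should_send(finishes_at, previous_events, now=None):
    """Return (ok, reason) for sending the current event (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    if finishes_at <= now:
        return False, "broadcast has already expired"
    for event in previous_events:
        if event.provider_message_status is None:
            return False, "a previous event has no provider message"
        if event.provider_message_status != ACK:
            return False, "a previous event has not been acknowledged"
    return True, None
```

Because the check runs every time an event is sent, a first attempt and a retry go through exactly the same decision.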
Leo Hemsted
96a0935d1c update broadcast provider message status on success/error
so we can distinguish erroring messages that are currently retrying
from those that sent successfully.
2021-02-03 18:03:16 +00:00
Leo Hemsted
3dcbfc3612 re-use existing provider message if task retries
previously it would crash with a unique constraint error. now, grab the
previous message.
2021-02-03 18:01:54 +00:00
Leo Hemsted
ac34fb9c05 retry sending broadcasts
Retry tasks if they fail to send a broadcast event. Note that each task
tries the regular proxy and the failover proxy for that provider. This
runs a bit differently than our other retries:

Retry with exponential backoff. Our other tasks retry with a fixed delay
of 5 minutes between tries. If we can't send a broadcast, we want to try again
immediately. So instead, implement an exponential backoff (1, 2, 4, 8,
... seconds delay). We can't delay for longer than 310 seconds due to
visibility timeout settings in SQS, so cap the delay at that amount.

Normally we give up retrying after a set amount of retries (often 4
hours). As broadcast content is much more important than normal
notifications, we don't ever want to give up on sending them to phones...

...UNLESS WE DO!

Sometimes we do want to give up sending a broadcast though! Broadcasts
have an expiry time, when they stop showing up on people's devices, so if
that has passed then we don't need to send the broadcast out.

Broadcast events can also be superseded by updates or cancels. Check
that the event is the most recent event for that broadcast message, if
not, give up, as we don't want to accidentally send out two conflicting
events for the same message.
2021-02-03 16:43:01 +00:00
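The capped exponential backoff described in ac34fb9c05 can be sketched as follows. The function and constant names are hypothetical; only the 1, 2, 4, 8, ... progression and the 310 second cap come from the commit message.

```python
MAX_DELAY_SECONDS = 310  # cap quoted in the commit message (SQS visibility timeout)


def retry_delay(retries, max_delay=MAX_DELAY_SECONDS):
    """Delay before the next attempt: 1, 2, 4, 8, ... seconds, capped at max_delay."""
    return min(2 ** retries, max_delay)


# Delays for the first ten retries (retry counts 0 to 9).
delays = [retry_delay(n) for n in range(10)]
```

In a Celery task a value like this would presumably be passed as `countdown=` to `self.retry`. The cap is first reached on the tenth attempt, since 2 ** 9 is 512.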
David McDonald
d390eb2cac Merge pull request #3112 from alphagov/channel-restriction
Set broadcast channel as a service setting
2021-02-03 11:46:04 +00:00
David McDonald
070b79c27e Downgrade exceptions to warnings to reduce emails
We already trigger a zendesk ticket for these two cases, meaning that
whenever we get this situation, we get 3 emails. One for the zendesk
ticket, one from logit raising the fact an exception was raised and one
from cloudwatch raising the fact an exception was raised.

We don't need all these emails, a zendesk ticket is sufficient.
Downgrading to a warning means this event will still be findable in our
logs however.
2021-02-02 15:10:26 +00:00
David McDonald
f90b479c8d Use service setting to pick broadcast channel
This falls back to the "test" channel if they do not have a
ServiceBroadcastSetting for the moment, but we intend in future PRs to
enforce that all broadcast services will have this property.
2021-02-01 14:10:41 +00:00
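The fallback described in f90b479c8d amounts to a one-line default. In this minimal sketch the function name and the plain-string argument are assumptions; the real code reads the channel from a `ServiceBroadcastSetting` row.

```python
from typing import Optional

DEFAULT_CHANNEL = "test"  # fallback named in the commit message


def broadcast_channel(setting_channel: Optional[str]) -> str:
    """Channel from the service's broadcast setting, or "test" if there is none."""
    return setting_channel if setting_channel is not None else DEFAULT_CHANNEL
```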
David McDonald
b2ed9efe85 Extend test cases for every MNO
Seems to be no harm in extending these for every mobile network just to
give us slightly better coverage
2021-02-01 14:10:40 +00:00
David McDonald
2aad3163e6 Allow CBC proxy client to take channel
This moves the hardcoding to test channels one step up to where we call
`create_and_send_broadcast`

After this, we can start to vary whether we give it the 'test' or
'severe' channel based on the service's channel setting.
2021-02-01 14:10:38 +00:00
David McDonald
91f5be835a Add DB table for service broadcast settings
This will allow us to store details of which channel a service should be
sending to.

See the comment about how all broadcast services can have a row in the
table but may not at the moment. This has been done for speed as it's
the quickest way to let us set up different services to send to
different channels for some needed testing with the mobile handset
providers in the coming week.
2021-02-01 14:10:37 +00:00
David McDonald
a3d966056a Merge pull request #3110 from alphagov/test-channel
Set the default broadcast channel to test
2021-02-01 10:18:35 +00:00
Katie Smith
fc9ecaba1d Bump utils to 43.8.1
This brings in the change to stop TV numbers from being treated as
international numbers.
2021-01-29 15:53:35 +00:00
David McDonald
86ea89cf76 Merge pull request #3098 from alphagov/downgrade-to-warning
Downgrade SMS provider request exceptions to warnings
2021-01-29 11:52:10 +00:00
Chris Hill-Scott
837e464081 Merge pull request #3060 from alphagov/add-broadcast-api-endpoint
Add public API endpoint to create broadcast messages
2021-01-28 12:59:41 +00:00
Pea Tyczynska
51c0ece130 Merge pull request #3108 from alphagov/stub-training-broadcasts
Stub training broadcasts
2021-01-28 11:58:47 +00:00
Katie Smith
4ed79dae48 Merge pull request #3105 from alphagov/rename-bt-ee
Rename bt-ee-proxy to ee-proxy
2021-01-27 17:05:26 +00:00
David McDonald
f9b1d3d573 Set the default broadcast channel to test
This used to be hardcoded in the CBC proxy but now we will hardcode it
in the cbc_proxy_client.

In a future PR we can start choosing which channel a broadcast will go
to based on the channel configured for that broadcast service.
2021-01-27 15:27:11 +00:00
Pea Tyczynska
d4cc250510 Don't create broadcast provider messages for stubbed broadcasts 2021-01-27 10:20:44 +00:00
Pea Tyczynska
26d6b4a958 Mark broadcast message as stubbed when sent from training account 2021-01-27 10:20:43 +00:00
Chris Hill-Scott
0398ac57f1 Use correct HTTP status code for bad content type
415 is the status code for ‘Unsupported media type’

https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/415
2021-01-26 16:24:45 +00:00
Chris Hill-Scott
b85fcafd46 Don’t allow broadcasts to be created from JSON
Until we know we’re going to have real users for this, let’s not expose
it.
2021-01-26 16:24:45 +00:00
Chris Hill-Scott
c9d55039eb Simplify polygons before storing them
We’re going to let people pass in fairly complex polygons, but:
- we don’t want to store massive polygons
- we don’t want to pass the CBCs massive polygons

So this commit adds a step to simplify the polygons before storing them.

We think it’s best for us to do this because:
- writing code to do polygon simplification is non-trivial, and we don’t
  want to make all potential integrators do it
- the simplification we’ve developed is domain-specific to emergency
  alerting, so should throw away less information than

There’s a bit more detail about how we simplify polygons in
https://github.com/alphagov/notifications-admin/pull/3590/files
2021-01-26 16:24:45 +00:00
Chris Hill-Scott
26871eeacc Validate CAP against the spec
This gives us some extra confidence that there aren’t any problems with
the data we’re getting from the other service. It doesn’t address any
specific problems we’ve seen, rather it seems like a sensible precaution
to take.
2021-01-26 16:24:45 +00:00
Chris Hill-Scott
38f07db23e Accept CAP XML
This commit makes the existing endpoint also accept CAP XML, should the
appropriate `Content-Type` header be set.

It uses the translation code we added in a previous commit to convert
the CAP to a dict. We can then validate that dict against the JSON
schema to ensure it’s something we can work with.
2021-01-26 16:24:44 +00:00
Chris Hill-Scott
7530408a21 Validate broadcast against schema
This commit adds a JSONSchema which can validate the fields in an API
call to create a broadcast. It takes the CAP XML schema as a starting
point.
2021-01-26 16:24:44 +00:00
Chris Hill-Scott
61c9e50ed9 Add public API endpoint to create emergency alerts
We know there is at least one system which wants to integrate with
Notify to send out emergency alerts, rather than creating them manually.

This commit adds an endpoint to the public API to let them do that.

To start with we’ll just let the system create them in a single call,
meaning they still have to be approved manually. This reduces the risk
of an attacker being able to broadcast an alert via the API, should the
other system be compromised.

We’ve worked with the owners of the other system to define which fields
we should care about initially.
2021-01-26 16:24:44 +00:00
Pea Tyczynska
dfbd31cef8 Merge pull request #3106 from alphagov/billing-fields-for-service
Add billing details fields to Service model and db table
2021-01-26 15:14:05 +00:00
Katie Smith
2681752f15 Rename bt-ee-proxy to ee-proxy
We want to rename the `bt-ee-1-proxy` lambda function to `ee-1-proxy`.
This change will need to be deployed at the same time that we change
the name of the lambda function in the Terraform.
2021-01-26 14:36:20 +00:00
Richard Baker
6256cdf792 Add proxy client for o2 cell broadcasting
o2 use One-2-many CBC so we can use the O2M/CAP client.

Once differences between CBCs have been worked out we can consolidate O2M clients to reduce duplication.

Signed-off-by: Richard Baker <richard.baker@digital.cabinet-office.gov.uk>
2021-01-26 11:11:44 +00:00
Pea Tyczynska
b3abdfb401 Rename billing contact email and name fields to plural
So:

'billing_contact_email_address' becomes 'billing_contact_email_addresses'
AND
'billing_contact_name' becomes 'billing_contact_names'

This is to signify that each of those fields can contain numerous
items
2021-01-25 17:53:27 +00:00
Pea Tyczynska
ffac16a2a0 Add new billing details to test_get_service_by_id 2021-01-25 17:42:18 +00:00
David McDonald
c61bd9976f Remove ability for platform admins to approve own broadcast
This has been added in for speed of development but now we are getting
close to integrating with production systems, we will be turning off
these helpful hacks to reduce the risk of someone sending a real
broadcast to citizens.

Note, platform admins are still able to approve broadcasts when their service
is in training mode.
2021-01-22 16:56:05 +00:00
Chris Hill-Scott
5c8b5e0488 Merge pull request #3095 from alphagov/allow-admin-to-create-broadcasts-without-templates
Let the admin app create broadcasts without templates
2021-01-19 16:57:33 +00:00
Chris Hill-Scott
94aea8a820 Add test for when no content or template provided
This is the missing invalid permutation of fields for creating a
broadcast.
2021-01-19 15:28:08 +00:00
Pea Tyczynska
882da84182 Merge pull request #3096 from alphagov/add-notes-to-service
Add notes column to services table
2021-01-19 14:45:42 +00:00
David McDonald
ac6837cde5 Downgrade exception to warning for provider API call
When we send an HTTP request to our SMS providers, there is a
chance we get a 5xx status code back from them. Currently we log this as
two different exception level logs.

If a provider has a funny few minutes, we could end up with
hundreds of exceptions thrown and pagerduty waking someone up in the
middle of the night. These problems tend to pretty quickly fix
themselves as we balance traffic from one SMS provider to the other
within 5 minutes.

By downgrading both exceptions to warning in the case of a
`SmsClientResponseException`, we will reduce the chance of waking us up
in the middle of the night for no reason.

If the error is not a `SmsClientResponseException`, then we will still
log at the exception level as before as this is more unexpected and we
may want to be alerted sooner.

We do still want to be alerted if, say, both SMS providers go down at
the same time for 1 hour. We don't want our tasks to just
sit there, retrying every 5 minutes for the whole time without us being
aware (so we can at least raise a statuspage update). Luckily we will
still be alerted because our smoke tests will fail after 10 minutes and
raise a p1:
https://github.com/alphagov/notifications-functional-tests/blob/master/tests/functional/staging_and_prod/notify_api/test_notify_api_sms.py#L21
2021-01-18 17:00:21 +00:00
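The logging split described in ac6837cde5 might look like the sketch below. The exception class here is only a stand-in with the same name as the real one, and `log_send_failure` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("sms_delivery_sketch")  # hypothetical logger name


class SmsClientResponseException(Exception):
    """Stand-in for the real exception class named in the commit message."""


def log_send_failure(error):
    """Log provider response errors as warnings, anything else at error level."""
    if isinstance(error, SmsClientResponseException):
        # Expected when a provider returns a 5xx: findable in logs, no page.
        logger.warning("SMS provider request failed: %s", error)
    else:
        # Unexpected failure: keep the traceback so alerting still fires.
        logger.error("Unexpected error sending SMS: %s", error, exc_info=error)
```

The design choice is that only the known, usually self-healing failure mode is quietened; everything else keeps its original severity.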
David McDonald
1cfd19fb6a Reorder tests
So all tests related to SMS sending run first and all email
sending task tests run after
2021-01-18 16:46:12 +00:00
Pea Tyczynska
22f2eb7bfe Add notes column to services table 2021-01-18 10:36:51 +00:00
Chris Hill-Scott
4eb4ea1772 Use cache for tasks that save notifications
These tasks need to repeatedly get the same template and service from
the database. We should be able to improve their performance by getting
the template and service from the cache instead, like we do in the REST
endpoint code.
2021-01-18 10:25:24 +00:00
David McDonald
9c01d8018d Merge pull request #3093 from alphagov/broadcast-tasks-onto-worker
Broadcast tasks onto worker
2021-01-15 15:51:35 +00:00
Chris Hill-Scott
e161f6e4a1 Require reference if template not provided
In the admin app we need something to show in lieu of template
name when a template isn’t used. Let’s store this in the reference field
for now.
2021-01-15 14:57:36 +00:00
Chris Hill-Scott
0510311d63 Don’t require template when content is provided
So that the admin app can create broadcasts without a template it needs
to be allowed to create broadcasts from content instead.
2021-01-15 14:57:36 +00:00
David McDonald
ff193387d1 Add proxy client for Three
Three uses the One 2 Many technology so should work in the same way as
our proxy for EE
2021-01-14 11:44:46 +00:00
David McDonald
f3ee2cdd48 Merge pull request #3086 from alphagov/lambda-errors
Failover to second lambda on error
2021-01-14 11:15:34 +00:00
Pea Tyczynska
b5a33ded98 Retry with failover lambda for FunctionError and status > 299
For all FunctionErrors, and for invoke errors (status > 299) we
want to retry with failover lambda.

We are doing this, because if there is a connection or other error
with one lambda, the failover lambda may still work and it's
worth trying.

With time, we will probably have more complex retry flow, depending
on the error and even maybe differing for each MNO (broadcast provider).
2021-01-14 10:45:29 +00:00
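The failover rule in b5a33ded98 can be sketched as below. `invoke` is a hypothetical callable that returns a dict shaped like boto3's Lambda `invoke` response (a `StatusCode` key and an optional `FunctionError` key); the function names are placeholders, not the real lambda names.

```python
def needs_failover(response):
    """True for any FunctionError, or an invoke status above 299."""
    return "FunctionError" in response or response["StatusCode"] > 299


def invoke_with_failover(invoke, primary, failover, payload):
    """Call invoke(function_name, payload); retry once on the failover lambda.

    Whatever the failover returns is handed back to the caller, which is
    then responsible for deciding whether to retry the whole task.
    """
    response = invoke(primary, payload)
    if needs_failover(response):
        response = invoke(failover, payload)
    return response
```

Passing `invoke` in as a parameter keeps the sketch free of AWS dependencies and makes the failover path easy to exercise with a fake.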
David McDonald
20627d96ea Put all broadcast tasks on the broadcast worker 2021-01-13 17:21:40 +00:00