7896 Commits

Author SHA1 Message Date
Pea Tyczynska
3037bf5fff Set broadcast message to stubbed when posting broadcast via API 2021-02-09 10:41:36 +00:00
Pea Tyczynska
e0ddb5a39e Merge pull request #3126 from alphagov/fix-cryptography-build-problem
Pin cryptography to a version < 3.4
2021-02-08 17:25:44 +00:00
Pea Tyczynska
7cc8371c7f Pin cryptography to a version < 3.4
One of our dependencies has a dependency on cryptography, which has
recently released version 3.4.

This version introduced a circular import error
(pyca/cryptography#5756) which was fixed in
3.4.1.

However, 3.4.1 has a different error where it fails because it cannot
find a rust compiler.

The suggested solutions are:

* Install a newer version of pip, which will install a pre-compiled
  cryptography wheel, OR
* Have rust installed and available on our PATH so that it can be used
  to build the package.

Since we can't change the buildpack's pip version and we cannot install
rust ourselves, the only option we're left with is to avoid upgrading to
3.4, at least until PaaS updates their python buildpacks.
2021-02-08 17:05:46 +00:00
Pea Tyczynska
f8b4c9151c Merge pull request #3122 from alphagov/add-billing-details-orgs
Add billing details for organisation
2021-02-08 16:43:08 +00:00
Leo Hemsted
6b9a50beff Merge pull request #3125 from alphagov/revert-retry
Revert cell broadcast retry logic
2021-02-08 11:16:48 +00:00
Leo Hemsted
bee0059e53 Revert "Merge pull request #3101 from alphagov/retry-broadcasts"
This reverts commit 1bd99c779d, reversing
changes made to d390eb2cac.
2021-02-08 11:02:34 +00:00
Leo Hemsted
49e6ec1ead Revert "Merge pull request #3123 from alphagov/retry-loop-fix"
This reverts commit 541a765811, reversing
changes made to 6a9ac654a6.
2021-02-08 11:01:33 +00:00
Pea Tyczynska
bbc8cffb5b No need to alter the columns in services, they are already the right type 2021-02-08 10:45:28 +00:00
Chris Hill-Scott
33f93dfea2 Merge pull request #3124 from alphagov/handle-xml-with-declaration
Handle XML files that have a declaration
2021-02-08 10:29:11 +00:00
Chris Hill-Scott
dec16a98f6 Handle XML files that have a declaration
`lxml` wants its input in bytes:

> XML is explicitly defined as a stream of bytes. It's not Unicode text.
> […] rule number one: do not decode your XML data yourself.

– https://lxml.de/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings

It will accept strings, unless the document contains a
declaration[1] with an `encoding` attribute. Then it will refuse to
parse the document and raise a `ValueError`[2].

We can fix this by passing `lxml` the bytes from the request, rather
than the decoded text.

1. > XML documents may begin with an XML declaration that describes some
   > information about themselves. An example is
   > `<?xml version="1.0" encoding="UTF-8"?>`.
   – https://en.wikipedia.org/wiki/XML#XML_declaration
2. See an example of this exception being raised in production here:
   https://kibana.logit.io/s/9423a789-282c-4113-908d-0be3b1bc9d1d/app/kibana#/doc/logstash-*/logstash-2021.02.05/syslog?id=AXdzfZVz5ZSa5DKpJiYd&_g=()
2021-02-08 08:51:14 +00:00
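
A minimal sketch of the behaviour described in commit dec16a98f6 above, assuming `lxml` is installed. `parse_xml`, `str_input_is_rejected` and the sample document are hypothetical and are not taken from this repository.

```python
try:
    from lxml import etree
except ImportError:  # lxml is third-party; the demo is skipped if it is absent
    etree = None

# Sample document with a declaration that names an encoding (illustrative only).
DOCUMENT = '<?xml version="1.0" encoding="UTF-8"?><alert><status>Actual</status></alert>'


def parse_xml(raw_body: bytes):
    """Hypothetical helper: hand lxml the undecoded request bytes."""
    return etree.fromstring(raw_body)


def str_input_is_rejected(text: str) -> bool:
    """Return True if lxml raises ValueError for an already-decoded string."""
    try:
        etree.fromstring(text)
    except ValueError:
        return True
    return False


if etree is not None:
    root = parse_xml(DOCUMENT.encode("utf-8"))
    print(root.tag)
    print(str_input_is_rejected(DOCUMENT))
```

Passing the bytes straight through leaves the encoding decision to the parser, which is what the quoted FAQ entry recommends.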
Leo Hemsted
541a765811 Merge pull request #3123 from alphagov/retry-loop-fix
broadcast event retry loop fix
2021-02-05 14:56:45 +00:00
Pea Tyczynska
aa7bc3d9b4 Serialise org notes and billing details 2021-02-05 14:44:43 +00:00
Pea Tyczynska
02bc87c096 Add billing details and notes for organisation table 2021-02-05 14:44:42 +00:00
Leo Hemsted
d582e35471 dont try and send broadcast event if it's already in technical-failure
this gives us an option to manually set a status in the database and
avoid things being stuck in a retry loop forever
2021-02-05 12:52:37 +00:00
Leo Hemsted
0ddebc63a8 reduce broadcast retry delay to 4 mins and drop prefetch.
### The facts

* Celery grabs up to 10 tasks from an SQS queue by default
* Each broadcast task takes a couple of seconds to execute, or double
  that if it has to go to the failover proxy
* Broadcast tasks delay retry exponentially, up to 300 seconds.
* Tasks are acknowledged when celery starts executing them.
* If a task is not acknowledged before its visibility timeout of 310
  seconds, sqs assumes the celery app has died, and puts it back on the
  queue.

### The situation

A task stuck in a retry loop was reaching its visibility timeout, and as
such SQS was duplicating it. We're unsure of the exact cause of reaching
its visibility timeout, but there were two contributing factors: The
celery prefetch and the delay of 300 seconds. Essentially, celery grabs
the task, keeps an eye on it locally while waiting for the delay ETA to
come round, then gives the task to a worker to do. However, that worker
might already have up to ten tasks that it's grabbed from SQS. This
means the worker only has 10 seconds to get through all those tasks and
start working on the delayed task, before SQS moves the task back into
available.

(Note that the delay of 300 seconds is translated into a timestamp based
on the time you called self.retry and put the task back on the queue.
Whereas the visibility timeout starts ticking from the time that a
celery worker picked up the task.)

### The fix

#### Set the max retry delay for broadcast tasks to 240 seconds

Setting the max delay to 240 seconds means that instead of a 10 second
buffer before the visibility timeout is tripped, we've got a 70 second
buffer.

#### Set the prefetch limit to 1 for broadcast workers

This means that each worker will have up to 1 currently executing task,
and 1 task pending execution. If it has these, it won't grab any more
off the queue, so they can sit there without their visibility timeout
ticking up.

Setting a prefetch limit to 1 will result in more queries to SQS and a
lower throughput. This might be relevant in, eg, sending emails. But the
broadcast worker is not hyper-time critical.

https://docs.celeryproject.org/en/3.1/getting-started/brokers/sqs.html?highlight=acknowledge#caveats
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html?highlight=prefetch#reserve-one-task-at-a-time
2021-02-05 12:49:51 +00:00
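
The buffer arithmetic in commit 0ddebc63a8 above can be checked with a short sketch. It assumes, as a simplification, that the retry delay and the visibility timeout start counting at the same moment; the commit itself notes that they do not quite. The function name is hypothetical.

```python
VISIBILITY_TIMEOUT_SECONDS = 310  # figure quoted in the commit message


def buffer_before_redelivery(max_retry_delay_seconds: int) -> int:
    """Seconds left to start the task before SQS makes it visible again.

    Simplifying assumption: both clocks start together.
    """
    return VISIBILITY_TIMEOUT_SECONDS - max_retry_delay_seconds


old_buffer = buffer_before_redelivery(300)  # previous maximum delay
new_buffer = buffer_before_redelivery(240)  # maximum delay after this commit
print(old_buffer, new_buffer)
```

With up to ten prefetched tasks each taking a couple of seconds, the smaller buffer is easy to exhaust, which is why the commit also lowers the prefetch limit.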
Leo Hemsted
6a9ac654a6 Merge pull request #3121 from alphagov/dont-update-finishes-at
dont update finishes_at when cancelling broadcast
2021-02-04 14:37:49 +00:00
Leo Hemsted
eff0119f5c dont update finishes at when cancelling broadcast
otherwise we run into issues where we dont issue the cancel as we say
"oh look the expiry time just passed, so we shouldnt send this message
as it's already been removed from the cbc".
2021-02-04 14:25:38 +00:00
Leo Hemsted
1bd99c779d Merge pull request #3101 from alphagov/retry-broadcasts
retry sending broadcasts
2021-02-04 12:01:06 +00:00
Leo Hemsted
1ef3f96bd7 test sending broadcast message for all statuses of existing provider_msg
also clean up some comments
2021-02-04 11:50:06 +00:00
Leo Hemsted
e9f9fe8101 be stricter on broadcast message area validation
even if there is a json struct, make sure it actually contains polygons,
or we'll send to the entire country.
2021-02-03 18:11:54 +00:00
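
As an illustration of the stricter check in commit e9f9fe8101 above, here is a sketch that rejects an area structure unless it holds at least one polygon. The `polygons` key and the shape of the structure are assumptions made for the example; the real schema is not shown in this log.

```python
from typing import Any


def has_polygons(areas: Any) -> bool:
    """Return True only if `areas` is a dict with a non-empty polygon list.

    The "polygons" key is an assumed name, not taken from the repository.
    """
    if not isinstance(areas, dict):
        return False
    polygons = areas.get("polygons")
    return isinstance(polygons, list) and len(polygons) > 0


print(has_polygons({"polygons": []}))
print(has_polygons({"polygons": [[[51.5, -0.1], [51.6, -0.1], [51.6, 0.0]]]}))
```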
Leo Hemsted
bbae209200 check provider message status etc when sending rather than when retrying
previously if we were deciding whether to retry or not, it meant that
future events wouldn't have context of what the task is doing. We'd
run into issues with not knowing what references to include when
updating/cancelling in future events.

Instead of deciding whether to retry or not, always retry. Then, when
any event sends, regardless of whether it is a first attempt or a retry,
check the status of previous events for that broadcast message. There
are a few things that will mean we don't send:

* If the finishes_at time has already elapsed (ie: we have been trying
  to resend this message and haven't had any luck and now the data is
  obsolete)
* A previous event has no provider message (this means that we never
  picked the previous event off the queue for some reason)
* A previous event has a provider message that has anything other than
  an ack response. This includes sending (the old message is currently
  being sent), and technical-failure/returned-error (the old message is
  currently in the retry loop, having experienced issues).
2021-02-03 18:11:52 +00:00
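
The three conditions listed in commit bbae209200 above could be sketched as follows. The `PreviousEvent` class, the `should_send` function and the literal `"ack"` status value are all hypothetical stand-ins; the commit only says that anything other than an ack response blocks sending.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence

ACK = "ack"  # assumed status literal


@dataclass
class PreviousEvent:
    # None stands for "no provider message was ever created for this event".
    provider_message_status: Optional[str]


def should_send(
    finishes_at: datetime,
    previous_events: Sequence[PreviousEvent],
    now: datetime,
) -> bool:
    """Decide at send time, for first attempts and retries alike."""
    if finishes_at <= now:
        return False  # the broadcast has expired, so the data is obsolete
    for event in previous_events:
        if event.provider_message_status is None:
            return False  # previous event was never picked off the queue
        if event.provider_message_status != ACK:
            return False  # previous event is still sending or is retrying
    return True


now = datetime(2021, 2, 3, 18, 0, tzinfo=timezone.utc)
print(should_send(now + timedelta(hours=1), [PreviousEvent(ACK)], now))
```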
Leo Hemsted
96a0935d1c update broadcast provider message status on success/error
so we can distinguish erroring messages that are currently retrying
from those that sent successfully.
2021-02-03 18:03:16 +00:00
Leo Hemsted
3dcbfc3612 re-use existing provider message if task retries
previously it would crash with a unique constraint error. now, grab the
previous message.
2021-02-03 18:01:54 +00:00
Leo Hemsted
ac34fb9c05 retry sending broadcasts
Retry tasks if they fail to send a broadcast event. Note that each task
tries the regular proxy and the failover proxy for that provider. This
runs a bit differently than our other retries:

Retry with exponential backoff. Our other tasks retry with a fixed delay
of 5 minutes between tries. If we can't send a broadcast, we want to try
immediately. So instead, implement an exponential backoff (1, 2, 4, 8,
... seconds delay). We can't delay for longer than 310 seconds due to
visibility timeout settings in SQS, so cap the delay at that amount.

Normally we give up retrying after a set amount of retries (often 4
hours). As broadcast content is much more important than normal
notifications, we don't ever want to give up on sending them to phones...

...UNLESS WE DO!

Sometimes we do want to give up sending a broadcast though! Broadcasts
have an expiry time, when they stop showing up on people's devices, so if
that has passed then we don't need to send the broadcast out.

Broadcast events can also be superseded by updates or cancels. Check
that the event is the most recent event for that broadcast message, if
not, give up, as we don't want to accidentally send out two conflicting
events for the same message.
2021-02-03 16:43:01 +00:00
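
A minimal sketch of the capped exponential backoff described in commit ac34fb9c05 above. The function name is hypothetical, and the cap is left as a parameter because the commit text gives 310 seconds as the limit imposed by SQS.

```python
def retry_delay_seconds(retries: int, max_delay: int = 310) -> int:
    """Delay before the next attempt: 1, 2, 4, 8, ... seconds, capped."""
    return min(2 ** retries, max_delay)


delays = [retry_delay_seconds(n) for n in range(10)]
print(delays)
```

Doubling from one second keeps the first few retries close together, while the cap stops the delay from outgrowing the queue's visibility timeout.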
David McDonald
d390eb2cac Merge pull request #3112 from alphagov/channel-restriction
Set broadcast channel as a service setting
2021-02-03 11:46:04 +00:00
David McDonald
f441d5b4ce Add comment about service channels for updating 2021-02-03 11:37:02 +00:00
David McDonald
41b54b81ee Merge pull request #3116 from alphagov/less-emails
Downgrade exceptions to warnings to reduce emails
2021-02-02 15:21:34 +00:00
David McDonald
070b79c27e Downgrade exceptions to warnings to reduce emails
We already trigger a zendesk ticket for these two cases, meaning that
whenever we get this situation, we get 3 emails. One for the zendesk
ticket, one from logit raising the fact an exception was raised and one
from cloudwatch raising the fact an exception was raised.

We don't need all these emails, a zendesk ticket is sufficient.
Downgrading to a warning means this event will still be findable in our
logs however.
2021-02-02 15:10:26 +00:00
David McDonald
f90b479c8d Use service setting to pick broadcast channel
This falls back to the "test" channel if they do not have a
ServiceBroadcastSetting for the moment, but we intend in future PRs to
enforce that all broadcast services will have this property.
2021-02-01 14:10:41 +00:00
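
The fallback described in commit f90b479c8d above might look like the sketch below. The dataclass is a stand-in for the real `ServiceBroadcastSetting` model, and `channel_for` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_CHANNEL = "test"


@dataclass
class ServiceBroadcastSetting:
    """Stand-in for the database model; only the channel is modelled here."""
    channel: str


def channel_for(setting: Optional[ServiceBroadcastSetting]) -> str:
    """Use the service's configured channel, or "test" if it has no setting."""
    return setting.channel if setting is not None else DEFAULT_CHANNEL


print(channel_for(None))
print(channel_for(ServiceBroadcastSetting(channel="severe")))
```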
David McDonald
b2ed9efe85 Extend test cases for every MNO
Seems to be no harm in extending these for every mobile network just to
give us slightly better coverage
2021-02-01 14:10:40 +00:00
David McDonald
a46b8c3bba Remove redundant comment 2021-02-01 14:10:39 +00:00
David McDonald
2aad3163e6 Allow CBC proxy client to take channel
This moves the hardcoding to test channels one step up to where we call
`create_and_send_broadcast`

After this, we can start to vary whether we give it the 'test' or
'severe' channel based on the service's channel setting.
2021-02-01 14:10:38 +00:00
David McDonald
91f5be835a Add DB table for service broadcast settings
This will allow us to store details of which channel a service should be
sending to.

See the comment about how all broadcast services can have a row in the
table but may not at the moment. This has been done for speed as it's
the quickest way to let us set up different services to send to
different channels for some needed testing with the mobile handset
providers in the coming week.
2021-02-01 14:10:37 +00:00
David McDonald
a3d966056a Merge pull request #3110 from alphagov/test-channel
Set the default broadcast channel to test
2021-02-01 10:18:35 +00:00
Katie Smith
a31cdb44a6 Merge pull request #3113 from alphagov/bump-utils-43.8.1
Bump utils to 43.8.1
2021-02-01 09:06:39 +00:00
Pea Tyczynska
df4ba22912 Merge pull request #3114 from alphagov/preview-wont-stub
Don't stub broadcasts on preview
2021-01-29 16:00:11 +00:00
Katie Smith
fc9ecaba1d Bump utils to 43.8.1
This brings in the change to stop TV numbers from being treated as
international numbers.
2021-01-29 15:53:35 +00:00
Pea Tyczynska
552e543bc2 Don't stub broadcasts on preview
So that MNOs can use training mode accounts to test end-to-end
broadcast sending. This will enable them to approve their own
broadcasts.
2021-01-29 15:49:50 +00:00
David McDonald
86ea89cf76 Merge pull request #3098 from alphagov/downgrade-to-warning
Downgrade SMS provider request exceptions to warnings
2021-01-29 11:52:10 +00:00
Katie Smith
7b60aeb14d Merge pull request #3109 from alphagov/letter-rates
Add February 2021 letter rates
2021-01-29 10:28:16 +00:00
Katie Smith
d7387869c4 Add February 2021 letter rates
All rates are changing, so we add an end date for the current rates and
insert new rates for every post_class, sheet count and crown status.
2021-01-28 14:35:21 +00:00
Chris Hill-Scott
837e464081 Merge pull request #3060 from alphagov/add-broadcast-api-endpoint
Add public API endpoint to create broadcast messages
2021-01-28 12:59:41 +00:00
Pea Tyczynska
51c0ece130 Merge pull request #3108 from alphagov/stub-training-broadcasts
Stub training broadcasts
2021-01-28 11:58:47 +00:00
Katie Smith
4ed79dae48 Merge pull request #3105 from alphagov/rename-bt-ee
Rename bt-ee-proxy to ee-proxy
2021-01-27 17:05:26 +00:00
David McDonald
f9b1d3d573 Set the default broadcast channel to test
This used to be hardcoded in the CBC proxy but now we will hardcode it
in the cbc_proxy_client.

In a future PR we can start choosing which channel a broadcast will go
to based on the channel configured for that broadcast service.
2021-01-27 15:27:11 +00:00
Pea Tyczynska
d4cc250510 Don't create broadcast provider messages for stubbed broadcasts 2021-01-27 10:20:44 +00:00
Pea Tyczynska
26d6b4a958 Mark broadcast message as stubbed when sent from training account 2021-01-27 10:20:43 +00:00
Pea Tyczynska
a93a35de8d Add 'stubbed' column to broadcast_message table
This is a boolean column. It will be set to True for broadcasts
created from training broadcast accounts.

This will help us debug, for example by excluding all the stubbed
broadcasts when we have some trouble with real broadcasts.
2021-01-27 10:20:43 +00:00
Chris Hill-Scott
ca6c46c4dd Add logging for successful broadcast message creation 2021-01-26 16:24:45 +00:00
Chris Hill-Scott
0398ac57f1 Use correct HTTP status code for bad content type
415 is the status code for ‘Unsupported media type’

https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/415
2021-01-26 16:24:45 +00:00
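
To illustrate commit 0398ac57f1 above, here is a sketch of a content type check that answers with 415. The expected media type `application/cap+xml` and the function name are assumptions for the example; the log does not say which content type the endpoint requires.

```python
from http import HTTPStatus
from typing import Optional

EXPECTED_MEDIA_TYPE = "application/cap+xml"  # assumed, not taken from the log


def content_type_error(content_type: str) -> Optional[HTTPStatus]:
    """Return 415 for an unexpected media type, or None if it is acceptable."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != EXPECTED_MEDIA_TYPE:
        return HTTPStatus.UNSUPPORTED_MEDIA_TYPE
    return None


print(content_type_error("application/json"))
print(content_type_error("application/cap+xml; charset=utf-8"))
```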