We don’t support these methods at the moment. Instead we were just
ignoring the `msgType` field, so issuing one of these commands would
cause a new alert to be broadcast 🙃
We might want to support `Cancel` in the future, but for now let’s
reject anything that isn’t `Alert` (CAP terminology for the initial
broadcast).
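A minimal sketch of the rejection described above, assuming the check happens on the parsed `msgType` value. The names `UnsupportedMessageType` and `validate_msg_type` are hypothetical, not taken from the codebase.

```python
# Hypothetical names for illustration only.
class UnsupportedMessageType(ValueError):
    """Raised when a CAP message uses a msgType we don't handle yet."""


# Only the initial broadcast is supported for now.
SUPPORTED_MSG_TYPES = {"Alert"}


def validate_msg_type(msg_type):
    """Return msg_type unchanged if supported, otherwise raise."""
    if msg_type not in SUPPORTED_MSG_TYPES:
        raise UnsupportedMessageType(
            f"msgType {msg_type!r} is not supported"
        )
    return msg_type
```

Keeping the supported types in a set means `Cancel` could be added later without changing the check itself.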
These will happen if, for example, you have issues connecting to AWS or
permission issues.
Still fail over if we get one of these exceptions, as I think it might be
possible to have a problem that only affects one of the lambdas.
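A rough sketch of the failover behaviour, assuming the invocation errors are wrapped in a single exception type. `LambdaInvokeError`, `invoke_with_failover` and the `invoke` callable are hypothetical stand-ins, not the real client code.

```python
# Hypothetical exception standing in for AWS connection/permission errors.
class LambdaInvokeError(Exception):
    pass


def invoke_with_failover(invoke, lambda_name, failover_lambda_name, payload):
    """Try the primary lambda; on an invoke error, try the failover one.

    `invoke` is any callable taking (lambda_name, payload).
    """
    try:
        return invoke(lambda_name, payload)
    except LambdaInvokeError:
        # The problem may only affect the primary lambda, so still fail over.
        return invoke(failover_lambda_name, payload)
```

If the failover lambda also raises, the exception propagates to the caller.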
By creating a new `CBCProxyOne2ManyClient` class for the three One2Many
clients to inherit from. These three clients are the same apart from the
`lambda_name` and the `failover_lambda_name`.
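The shape of this is roughly as follows. Only `CBCProxyOne2ManyClient`, `lambda_name` and `failover_lambda_name` come from this change; the subclass names, lambda names and the `lambda_names` helper are made up for illustration.

```python
class CBCProxyOne2ManyClient:
    """Shared behaviour for the One2Many clients."""

    lambda_name = None
    failover_lambda_name = None

    def lambda_names(self):
        # Hypothetical helper: primary first, then failover.
        return [self.lambda_name, self.failover_lambda_name]


# Hypothetical subclasses: each one only sets the two names.
class ProviderAClient(CBCProxyOne2ManyClient):
    lambda_name = "provider-a-1-proxy"
    failover_lambda_name = "provider-a-2-proxy"


class ProviderBClient(CBCProxyOne2ManyClient):
    lambda_name = "provider-b-1-proxy"
    failover_lambda_name = "provider-b-2-proxy"
```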
By adding `SerialisedTemplate` we can avoid a database call for the template. This is useful when sending a large number of emails or text messages for the same template and version.
Remove two extra SELECT queries that ran after the update and commit. Once a transaction is committed, SQLAlchemy will query the database again for the model if it is referenced after the commit.
This doesn't just relate to precompiled letters, it's actually just
checking that there are not any letters still waiting for a virus check
that should not be. This change to the naming makes it more accurate
and therefore easier to understand.
This doesn't just relate to templated letters, it's actually just
checking that there are not any letters still in created that should not
be. This change to the naming makes it more accurate and therefore easier
to understand.
This means we will have a much easier way of knowing what the settings
are for a broadcast service.
Note that we can just move data directly into the newer table, as
nothing on the API or admin app is putting data in the
`service_broadcast_provider_restriction` table. This was being done
manually for the few services that needed it.
At the moment, if a service callback fails, it will get put on the retry queue.
This causes a potential problem though:
If a service's callback server goes down, we may generate a lot of retries and
this may then put a lot of items on the retry queue. The retry queue is also
responsible for other important parts of Notify such as retrying message
delivery and we don't want a service's callback server going down to have an
impact on the rest of Notify.
Putting the retries on a different queue means that tasks get processed
faster than if they were put back on the same 'service-callbacks' queue.
One of our dependencies has a dependency on cryptography, which has
recently released version 3.4.
This version introduced a circular import error
(pyca/cryptography#5756) which was fixed in
3.4.1.
However, 3.4.1 has a different error where it fails because it cannot
find a Rust compiler.
The suggested solutions are:
* Install a newer version of pip, which will install a pre-compiled
cryptography wheel, OR
* Have Rust installed and available on our PATH so that it can be used
to build the package.
Since we can't change the buildpack's pip version and we cannot install
Rust ourselves, the only option we're left with is to avoid upgrading to
3.4, at least until PaaS updates their Python buildpacks.
### The facts
* Celery grabs up to 10 tasks from an SQS queue by default
* Each broadcast task takes a couple of seconds to execute, or double
that if it has to go to the failover proxy
* Broadcast tasks delay retry exponentially, up to 300 seconds.
* Tasks are acknowledged when Celery starts executing them.
* If a task is not acknowledged before its visibility timeout of 310
seconds, SQS assumes the Celery app has died and puts it back on the
queue.
### The situation
A task stuck in a retry loop was reaching its visibility timeout, and as
such SQS was duplicating it. We're unsure of the exact reason it reached
its visibility timeout, but there were two contributing factors: the
Celery prefetch and the delay of 300 seconds. Essentially, Celery grabs
the task, keeps an eye on it locally while waiting for the delay ETA to
come round, then gives the task to a worker to do. However, that worker
might already have up to ten tasks that it's grabbed from SQS. This
means the worker only has 10 seconds (the 310 second visibility timeout
minus the 300 second delay) to get through all those tasks and start
working on the delayed task before SQS moves the task back into
available.
(Note that the delay of 300 seconds is translated into a timestamp based
on the time you called `self.retry` and put the task back on the queue,
whereas the visibility timeout starts ticking from the time that a
Celery worker picked up the task.)
### The fix
#### Set the max retry delay for broadcast tasks to 240 seconds
Setting the max delay to 240 seconds means that instead of a 10 second
buffer before the visibility timeout is tripped, we've got a 70 second
buffer.
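The arithmetic behind those buffers, as a small sketch. The 310, 300 and 240 second figures come from the text above; the function names and the exact backoff formula (doubling from 1 second) are assumptions for illustration, not the real task code.

```python
VISIBILITY_TIMEOUT_SECONDS = 310


def retry_delay(retries, max_delay_seconds):
    """Exponential backoff capped at max_delay_seconds.

    Assumes the delay doubles on each retry starting from 1 second;
    the real formula may differ.
    """
    return min(2 ** retries, max_delay_seconds)


def buffer_before_redelivery(max_delay_seconds,
                             visibility_timeout=VISIBILITY_TIMEOUT_SECONDS):
    """Seconds left after the longest delay before SQS re-queues the task."""
    return visibility_timeout - max_delay_seconds


old_buffer = buffer_before_redelivery(300)
new_buffer = buffer_before_redelivery(240)
```

With a 300 second cap the worker has 10 seconds of slack; with a 240 second cap it has 70.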
#### Set the prefetch limit to 1 for broadcast workers
This means that each worker will have up to 1 currently executing task,
and 1 task pending execution. If it has these, it won't grab any more
off the queue, so they can sit there without their visibility timeout
ticking up.
Setting a prefetch limit of 1 will result in more queries to SQS and a
lower throughput. This might be relevant for, e.g., sending emails, but
the broadcast worker is not especially time-critical.
https://docs.celeryproject.org/en/3.1/getting-started/brokers/sqs.html?highlight=acknowledge#caveats
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html?highlight=prefetch#reserve-one-task-at-a-time
Otherwise we run into issues where we don't issue the cancel because we
say "oh look, the expiry time just passed, so we shouldn't send this
message as it's already been removed from the CBC".