Update `send_user_2fa_code` to send from number when recipient is international
Update `update_user_attribute` to send from number when recipient is international
Note, I haven't added anything for the `go_live_user` because it doesn't
quite make sense because here a user isn't requesting to go live. So
there should be no reason to record this.
We will in time though want to add audit events to capture every change
to the service broadcast settings, that will actually capture who has
done what.
We are in a weird situation where at the moment, we have services with
the broadcast permission that do not have a row in the
service_broadcast_settings table and therefore do not have defined
whether they should send messages on the 'test' or 'severe' channel.
We currently get around this when we send broadcast messages out as
such:
https://github.com/alphagov/notifications-api/blob/master/app/celery/broadcast_message_tasks.py#L51
We need to something equivalent for the broadcast channel that the API
says the service is on. In time, when we have added a row in the
service_broadcast_settings table for every service with the broadcast
permission then we can remove both of these two hardcodings.
Note, one option would have been to move the default of `test` on to the
`Service` model rather than having it in both the
broadcast_message_tasks file and the `ServiceSchema` class. However, I
went for the quickest thing which was to add it here.
Some of the fixtures weren't needed so have been removed.
I've also moved from using `client.post` to using `admin_request.post`
which saves a bit of code too.
Also one small assertion tidied up to make it a bit stronger regarding
permissions.
We will use this to easily identify all our broadcast services. There
could be other ways to deal with finding and seeing all broadcast
services but this is a good and easy way to start.
We think it would be a security risk to show the name of services
involved in emergency alerts as they be responsible for things such as
counter terrorism.
On top of that, showing broadcast services in the list of all services
could enable someone to use that information to try and trick an admin
into letting them access of a particular service given the fact they
know the name of it
We can use the `sample_broadcast_service` as this gives us a broadcast
service with service broadcast settings already for us to update rather
than needing to create our own settings db row
This will make our `sample_broadcast_service` look more realistic.
The one downside to this, is that there will be a short amount of time
where we have broadcast services that do not have a row in the
service_broadcast_settings table until we have backfilled this data. Our
unit tests therefore won't have coverage for this case but I think the
risk is small and acceptable for the moment as this will no longer be
the case in say a weeks time.
The maximum content count of a broadcast varies depending on its
encoding, so we can’t simply validate it against a schema. This commit
moves to using the validation from `notifications-utils`, and raising a
custom error response.
We don’t support these methods at the moment. Instead we were just
ignoring the `msgType` field, so issuing one of these commands would
cause a new alert to be broadcast 🙃
We might want to support `Cancel` in the future, but for now let’s
reject anything that isn’t `Alert` (CAP terminology for the initial
broadcast).
these will happen if, for example, you have issues connecting to AWS or
permission issues.
Still failover if we get one of these exceptions, as I think it might be
possible to have a problem only related to one of the lambdas.
By adding SerialisedTemplate we can avoid a database call for the template. This is useful when sending many many emails/sms for the same template/version.
Remove 2 extra select queries after the update and commit. Once a transaction is committed SQLAlchemy will query for the db model if referenced after a commit.
This doesn't just relate to precompiled letters, it's actually just
checking that there are not any letters still waiting for a virus check
that should not be. This change to the naming makes it more accurate
and therefore easy to understand
This doesn't just relate to templated letters, it's actually just
checking that there are not any letters still in created that should not
be. This change to the naming makes it more accurate and therefore easy
to understand
This means we will have a much easier way of knowing what the settings
are for a broadcast service.
Note, we can just move data directly into the newer table as there is
nothing on the API or admin app that is putting data in the
`service_broadcast_provider_restriction` table, this was being done
manually for the few services that needed it.
At the moment, if a service callback fails, it will get put on the retry queue.
This causes a potential problem though:
If a service's callback server goes down, we may generate a lot of retries and
this may then put a lot of items on the retry queue. The retry queue is also
responsible for other important parts of Notify such as retrying message
delivery and we don't want a service's callback server going down to have an
impact on the rest of Notify.
Putting the retries on a different queue means that tasks get processed
faster than if they were put back on the same 'service-callbacks' queue.
### The facts
* Celery grabs up to 10 tasks from an SQS queue by default
* Each broadcast task takes a couple of seconds to execute, or double
that if it has to go to the failover proxy
* Broadcast tasks delay retry exponentially, up to 300 seconds.
* Tasks are acknowledged when celery starts executing them.
* If a task is not acknowledged before its visibility timeout of 310
seconds, sqs assumes the celery app has died, and puts it back on the
queue.
### The situation
A task stuck in a retry loop was reaching its visbility timeout, and as
such SQS was duplicating it. We're unsure of the exact cause of reaching
its visibility timeout, but there were two contributing factors: The
celery prefetch and the delay of 300 seconds. Essentially, celery grabs
the task, keeps an eye on it locally while waiting for the delay ETA to
come round, then gives the task to a worker to do. However, that worker
might already have up to ten tasks that it's grabbed from SQS. This
means the worker only has 10 seconds to get through all those tasks and
start working on the delayed task, before SQS moves the task back into
available.
(Note that the delay of 300 seconds is translated into a timestamp based
on the time you called self.retry and put the task back on the queue.
Whereas the visibility timeout starts ticking from the time that a
celery worker picked up the task.)
### The fix
#### Set the max retry delay for broadcast tasks to 240 seconds
Setting the max delay to 240 seconds means that instead of a 10 second
buffer before the visibility timeout is tripped, we've got a 70 second
buffer.
#### Set the prefetch limit to 1 for broadcast workers
This means that each worker will have up to 1 currently executing task,
and 1 task pending execution. If it has these, it won't grab any more
off the queue, so they can sit there without their visibility timeout
ticking up.
Setting a prefetch limit to 1 will result in more queries to SQS and a
lower throughput. This might be relevant in, eg, sending emails. But the
broadcast worker is not hyper-time critical.
https://docs.celeryproject.org/en/3.1/getting-started/brokers/sqs.html?highlight=acknowledge#caveatshttps://docs.celeryproject.org/en/3.1/userguide/optimizing.html?highlight=prefetch#reserve-one-task-at-a-time
otherwise we run into issues where we dont issue the cancel as we say
"oh look the expiry time just passed, so we shouldnt send this message
as it's already been removed from the cbc".