It is currently 60 seconds but we have had two incidents in the
past week where there is a connection error talking to a service
and the request takes up to 60 seconds before failing. When this
happens, even a few of these callbacks can completely hog the service
callback worker and build up a big queue of all the other service
callbacks.
5 seconds has been chosen as that is still a pretty decent length of
time for a simple web request that should just be giving them a
little bit of information for them to store. 5 seconds should be a
sufficient reduction to dramatically reduce this problem for the
moment. We are open to this number being changed in the future based
on how we see it perform.
If we're served a 429, put the item on the retry queue and retry the
same as if the service returned a 5xx. 429 is commonly returned when a
rate limit is exceeded, and retrying on a delay is a typical response to that.
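
A minimal sketch of that decision, assuming a hypothetical helper named
`should_retry` (the name is illustrative and not taken from the codebase)
that receives the HTTP status code returned by the service:

```python
def should_retry(status_code: int) -> bool:
    """Return True if a failed callback should go on the retry queue.

    Hypothetical helper for illustration only.
    """
    # 429 (Too Many Requests) is treated the same as a 5xx server error.
    # Every other 4xx is considered a permanent failure and is not retried.
    return status_code == 429 or status_code >= 500


for code in (400, 404, 429, 500, 503):
    print(code, should_retry(code))
```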
Now that https://github.com/alphagov/notifications-api/pull/3184 has
been deployed for a while, the `send_delivery_status_to_service` task will
always have `template_id` and `template_version` being passed in. This
means we don't need to check if those fields are there.
This adds the `template_id` and `template_version` fields to the data
sent to services from the `send_delivery_status_to_service` task.
We need to account for the task not being passed these fields at first
since there might be tasks retrying which don't have that data. Once all
tasks have been called with the new fields we can then update the code
to assume they are always there.
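
The transitional handling could look roughly like the sketch below. The
function name `build_payload` and the `notification_id` and `status` keys
are assumptions made for illustration; only `template_id` and
`template_version` come from this change.

```python
def build_payload(task_kwargs: dict) -> dict:
    """Build the data sent to the service (hypothetical shape)."""
    payload = {
        "id": task_kwargs["notification_id"],  # assumed key, for illustration
        "status": task_kwargs["status"],       # assumed key, for illustration
    }
    # Tasks queued before this change will not carry the new fields,
    # so only copy them across when they are present.
    for field in ("template_id", "template_version"):
        if field in task_kwargs:
            payload[field] = task_kwargs[field]
    return payload


old_task = {"notification_id": "abc", "status": "delivered"}
new_task = {**old_task, "template_id": "tmpl-1", "template_version": 3}
print(build_payload(old_task))
print(build_payload(new_task))
```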
Since we only send delivery status callbacks for SMS and emails, I've
removed the tests where we call that task with letters.
At the moment, if a service callback fails, it will get put on the retry queue.
This causes a potential problem though:
If a service's callback server goes down, we may generate a lot of retries and
this may then put a lot of items on the retry queue. The retry queue is also
responsible for other important parts of Notify such as retrying message
delivery and we don't want a service's callback server going down to have an
impact on the rest of Notify.
Putting the retries on a different queue means that tasks get processed
faster than if they were put back on the same 'service-callbacks' queue.
We don't retry any callbacks that receive a 4xx status. We should
probably be aware of this happening and at the moment there is nothing
in our logs to easily identify whether the request failed and is being
retried or if it failed and is not being retried. This will enable us to
search our logs easily and figure out how much it's happening.
It's quite likely that we should in the future allow callbacks to retry
if they get a 429 http response (rate limiting) but we should do this in
a smart way (exponential backoff) and so this is a first step to being
aware of how big a problem it is in case we want to do something about
it.
This is because that error is caused by our providers and we
cannot do anything about it but it can make our logs hard to read
and actionable errors harder to spot.
- Updated notifications_dao.update_notification_status_by_id with an optional parameter to set the sent_by; this will eliminate a separate update to notifications.
- Added the callback url to the log message, that way we can see if it's the same url failing.
- Stop sending the status callbacks for PENDING status.
A PENDING notification may not have a sent_at date yet, which would cause the service callback to fail.
- Allow code to work if notification.sent_at or updated_at is None
- Update calls to send_delivery_status_to_service to send the data encrypted so that the task does not need to use the db,
which means we can remove the need to request the data from the database.
In order for the PR to be backwards compatible I have added an optional parameter "encrypted_status_update".
If this is not None then the new code is called.
The next PR will send the encrypted data to this task.
A final PR will remove the code that uses the database to get the notification and service callback api.
Flask-SQLAlchemy sets up a connection pool with 5 connections and will
create up to 10 additional connections if all the pool ones are in use.
If all connections in the pool and all overflow connections are in
use, SQLAlchemy will block new DB sessions until a connection becomes
available. If a session can't acquire a connection for a specified
time (we set it to 30s) then a TimeoutError is raised.
By default db.session is deleted with the related context object
(so when the request is finished or app context is discarded).
This effectively limits the number of concurrent requests/tasks with
multithreaded gunicorn/celery workers to the maximum DB connection pool
size. Most of the time these limits are fine since the API requests are
relatively quick and are mainly interacting with the database anyway.
Service callbacks however have to make an HTTP request to a third party.
If these requests start taking a long time and the number of threads is
larger than the number of DB connections then remaining threads will
start blocking and potentially failing if it takes more than 30s to
acquire a connection. For example, if 100 threads start running tasks
that take 20s each with a max DB connection pool size of 10 then the first 10
threads will acquire a connection right away, the next 10 tasks will block for
20 seconds before the initial connections are released and all other tasks
will raise a TimeoutError after 30 seconds.
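
The arithmetic in that example can be checked with a small simulation.
This is a simplified model, not SQLAlchemy's actual pool logic: it assumes
all threads start at the same moment, wait in order, and hold a connection
for the full length of the task. The function name `simulate` is
illustrative.

```python
import heapq


def simulate(threads: int, pool_size: int, task_seconds: int, pool_timeout: int):
    """Return (immediate, waited, timed_out) counts for the model above."""
    # Min-heap holding, for each connection, the time it next becomes free.
    free_at = [0] * pool_size
    immediate = waited = timed_out = 0
    for _ in range(threads):
        earliest = free_at[0]
        if earliest >= pool_timeout:
            # No connection frees up before the wait limit is reached.
            timed_out += 1
            continue
        # Take the connection and hold it until the task finishes.
        heapq.heapreplace(free_at, earliest + task_seconds)
        if earliest == 0:
            immediate += 1
        else:
            waited += 1
    return immediate, waited, timed_out


print(simulate(threads=100, pool_size=10, task_seconds=20, pool_timeout=30))
# → (10, 10, 80)
```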
To avoid this, we perform all database operations at the beginning of
the task and then explicitly close the DB session before sending the
HTTP request to the service callback URL. Closing the session ends
the transaction and frees up the connection, making it available for
other tasks. Making calls to the DB after calling `close` will acquire
a new connection. This means that tasks are still limited to running
at most 15 queries at the same time, but can have a lot more concurrent
HTTP requests in progress.
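
The ordering described above can be sketched as follows. `FakeSession` and
`fake_post` are stand-ins written for this example so that it runs without
a database or network; in the real task the session would be Flask-SQLAlchemy's
`db.session` and the request would go to the service callback URL.

```python
class FakeSession:
    """Stand-in for db.session that tracks whether a connection is held."""

    def __init__(self):
        self.has_connection = False

    def query(self, what):
        # The first query checks a connection out of the pool.
        self.has_connection = True
        return {"url": "https://example.com/callback"}  # placeholder row

    def close(self):
        # Closing ends the transaction and returns the connection.
        self.has_connection = False


def fake_post(url, connection_held):
    return f"POST {url} (DB connection held: {connection_held})"


def send_callback(session, post):
    callback_api = session.query("service_callback_api")  # all DB work first
    session.close()  # free the connection before the slow HTTP call
    return post(callback_api["url"], session.has_connection)


print(send_callback(FakeSession(), fake_post))
```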
- Check if service_callback_api exists before putting tasks on queue
- Call create_service_callback_api in tests before asserting that send_delivery_status_to_service has been called.