### The facts * Celery grabs up to 10 tasks from an SQS queue by default * Each broadcast task takes a couple of seconds to execute, or double that if it has to go to the failover proxy * Broadcast tasks delay retry exponentially, up to 300 seconds. * Tasks are acknowledged when celery starts executing them. * If a task is not acknowledged before its visibility timeout of 310 seconds, sqs assumes the celery app has died, and puts it back on the queue. ### The situation A task stuck in a retry loop was reaching its visbility timeout, and as such SQS was duplicating it. We're unsure of the exact cause of reaching its visibility timeout, but there were two contributing factors: The celery prefetch and the delay of 300 seconds. Essentially, celery grabs the task, keeps an eye on it locally while waiting for the delay ETA to come round, then gives the task to a worker to do. However, that worker might already have up to ten tasks that it's grabbed from SQS. This means the worker only has 10 seconds to get through all those tasks and start working on the delayed task, before SQS moves the task back into available. (Note that the delay of 300 seconds is translated into a timestamp based on the time you called self.retry and put the task back on the queue. Whereas the visibility timeout starts ticking from the time that a celery worker picked up the task.) ### The fix #### Set the max retry delay for broadcast tasks to 240 seconds Setting the max delay to 240 seconds means that instead of a 10 second buffer before the visibility timeout is tripped, we've got a 70 second buffer. #### Set the prefetch limit to 1 for broadcast workers This means that each worker will have up to 1 currently executing task, and 1 task pending execution. If it has these, it won't grab any more off the queue, so they can sit there without their visibility timeout ticking up. Setting a prefetch limit to 1 will result in more queries to SQS and a lower throughput. This might be relevant in, eg, sending emails. But the broadcast worker is not hyper-time critical. https://docs.celeryproject.org/en/3.1/getting-started/brokers/sqs.html?highlight=acknowledge#caveats https://docs.celeryproject.org/en/3.1/userguide/optimizing.html?highlight=prefetch#reserve-one-task-at-a-time
GOV.UK Notify API
Contains:
- the public-facing REST API for GOV.UK Notify, which teams can integrate with using our clients
- an internal-only REST API built using Flask to manage services, users, templates, etc (this is what the admin app talks to)
- asynchronous workers built using Celery to put things on queues and read them off to be processed, sent to providers, updated, etc
Setting Up
Python version
At the moment we run Python 3.6 in production. You will run into problems if you try to use Python 3.5 or older, or Python 3.7 or newer.
AWS credentials
To run the API you will need appropriate AWS credentials. You should receive these from whoever administrates your AWS account. Make sure you've got both an access key id and a secret access key.
Your aws credentials should be stored in a folder located at ~/.aws. Follow Amazon's instructions for storing them correctly.
Virtualenv
mkvirtualenv -p /usr/local/bin/python3 notifications-api
environment.sh
Creating the environment.sh file. Replace [unique-to-environment] with your something unique to the environment. Your AWS credentials should be set up for notify-tools (the development/CI AWS account).
Create a local environment.sh file containing the following:
echo "
export NOTIFY_ENVIRONMENT='development'
export MMG_API_KEY='MMG_API_KEY'
export FIRETEXT_API_KEY='FIRETEXT_ACTUAL_KEY'
export NOTIFICATION_QUEUE_PREFIX='YOUR_OWN_PREFIX'
export FLASK_APP=application.py
export FLASK_DEBUG=1
export WERKZEUG_DEBUG_PIN=off
"> environment.sh
NOTES:
- Replace the placeholder key and prefix values as appropriate
- The SECRET_KEY and DANGEROUS_SALT should match those in the notifications-admin app.
- The unique prefix for the queue names prevents clashing with others' queues in shared amazon environment and enables filtering by queue name in the SQS interface.
Postgres
Install Postgres.app. You will need admin on your machine to do this.
Choose the version with Additional Releases - you want 9.6. Once you run the app, open the sidebar, remove the default v11 server and create and initialise a v9.6 server.
Redis
To switch redis on you'll need to install it locally. On a OSX we've used brew for this. To use redis caching you need to switch it on by changing the config for development:
REDIS_ENABLED = True
To run the application
First, run scripts/bootstrap.sh to install dependencies and create the databases.
You need to run the api application and a local celery instance.
There are two run scripts for running all the necessary parts.
scripts/run_app.sh
scripts/run_celery.sh
Optionally you can also run this script to run the scheduled tasks:
scripts/run_celery_beat.sh
To test the application
First, ensure that scripts/bootstrap.sh has been run, as it creates the test database.
Then simply run
make test
That will run flake8 for code analysis and our unit test suite. If you wish to run our functional tests, instructions can be found in the notifications-functional-tests repository.
To update application dependencies
requirements.txt file is generated from the requirements-app.txt in order to pin
versions of all nested dependencies. If requirements-app.txt has been changed (or
we want to update the unpinned nested dependencies) requirements.txt should be
regenerated with
make freeze-requirements
requirements.txt should be committed alongside requirements-app.txt changes.
To run one off tasks
Tasks are run through the flask command - run flask --help for more information. There are two sections we need to
care about: flask db contains alembic migration commands, and flask command contains all of our custom commands. For
example, to purge all dynamically generated functional test data, do the following:
Locally
flask command purge_functional_test_data -u <functional tests user name prefix>
On the server
cf run-task notify-api "flask command purge_functional_test_data -u <functional tests user name prefix>"
All commands and command options have a --help command if you need more information.
To create a new worker app
You need to:
- Create new entries for your app in
manifest.yml.j2andscripts/paas_app_wrapper.sh(example) - Update the jenkins deployment job in the notifications-aws repo (example)
- Add the new worker's log group to the list of logs groups we get alerts about and we ship them to kibana (example)
- Optionally add it to the autoscaler (example)
Important:
Before pushing the deployment change on jenkins, read below about the first time deployment.
First time deployment of your new worker
Our deployment flow requires that the app is present in order to proceed with the deployment.
This means that the first deployment of your app must happen manually.
To do this:
- Ensure your code is backwards compatible
- From the root of this repo run
CF_APP=<APP_NAME> make <cf-space> cf-push
Once this is done, you can push your deployment changes to jenkins to have your app deployed on every deployment.