## The existing situation

To support multiple processes and eventlets recording metrics in parallel, prometheus uses files to store metrics. When you write a metric from a multiprocess app, it writes to a file. Prometheus identifies whether your app is multiprocess by looking for the existence of a `prometheus_multiproc_dir` environment variable (in either upper or lower case). Prometheus reads this variable at module level (i.e. at import time).

Assuming it will always be used within a web server, the gds_metrics library auto-sets this variable to `/tmp` on import, to ensure that prometheus will always be set up correctly.

We also have a variety of metrics set up when we create the app. These are generally sensible metrics, such as counting the number of database connections in use by measuring sqlalchemy connection events.

## The problem

Our notify-delivery-worker-reporting app has been running out of disk space. The CELERYD_MAX_TASKS_PER_CHILD flag is set on that app, which restarts each worker process every time a task runs (to avoid memory issues). However, we've recently massively decreased the size and increased the number of tasks to parallelise nightly tasks. Each time a worker process restarts it writes a new file to disk. This meant that we quickly ran out of disk space, and then the entire app instance was killed.

The big rub is that we don't log prometheus metrics from our worker apps! They don't expose an endpoint, so there's no way to scrape them, so we aren't getting any value from prometheus anyway! But because they use the same codebase, they import gds_metrics and get the file-based storage regardless.

## The solution

gds_metrics sets the multiproc env var. However, by importing prometheus FIRST we ensure that the env var is unset at that point, and thus prometheus will harmlessly store the metrics in memory. To ensure that the notify-api still runs with the env var set, so that the stats are shared across all the gunicorn processes, we put this import as the first thing in run_celery.py only.
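The import-order effect described above can be illustrated with a small simulation. This is only a sketch using the standard library: `fake_metrics_client` and `import_fake_gds_metrics` are hypothetical stand-ins, not the real prometheus or gds_metrics code. It assumes, as the description says, that the metrics client checks the env var once when its module body runs.

```python
import os
import sys
import types

ENV_VAR = "prometheus_multiproc_dir"
FAKE_MODULE = "fake_metrics_client"  # hypothetical name, not a real package

# Stand-in for a client library that picks its storage mode at import time.
FAKE_SOURCE = """
import os
# Evaluated once, when the module body runs (i.e. on first import).
MULTIPROCESS = (
    "prometheus_multiproc_dir" in os.environ
    or "PROMETHEUS_MULTIPROC_DIR" in os.environ
)
"""


def import_fake_client():
    """Simulate `import fake_metrics_client`; the body only runs the first time."""
    if FAKE_MODULE not in sys.modules:
        module = types.ModuleType(FAKE_MODULE)
        exec(FAKE_SOURCE, module.__dict__)
        sys.modules[FAKE_MODULE] = module
    return sys.modules[FAKE_MODULE]


def import_fake_gds_metrics():
    """Simulate the import side effect: set the env var to /tmp if unset."""
    os.environ.setdefault(ENV_VAR, "/tmp")


def multiprocess_mode(client_first):
    """Return the mode the fake client ends up in for a given import order."""
    # Start from a clean slate, remembering any real values so we can restore them.
    saved = {k: os.environ.pop(k, None) for k in (ENV_VAR, ENV_VAR.upper())}
    sys.modules.pop(FAKE_MODULE, None)
    try:
        if client_first:
            import_fake_client()
            import_fake_gds_metrics()
        else:
            import_fake_gds_metrics()
            import_fake_client()
        # Any later import returns the cached module, so the mode is fixed.
        return import_fake_client().MULTIPROCESS
    finally:
        os.environ.pop(ENV_VAR, None)
        for key, value in saved.items():
            if value is not None:
                os.environ[key] = value
        sys.modules.pop(FAKE_MODULE, None)


print(multiprocess_mode(client_first=True))   # False: in-memory metrics
print(multiprocess_mode(client_first=False))  # True: file-based metrics
```

Because the module body is cached after the first import, setting the variable later has no effect on the client. That is why the order of imports at the top of run_celery.py matters.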
GOV.UK Notify API
Contains:
- the public-facing REST API for GOV.UK Notify, which teams can integrate with using our clients
- an internal-only REST API built using Flask to manage services, users, templates, etc (this is what the admin app talks to)
- asynchronous workers built using Celery to put things on queues and read them off to be processed, sent to providers, updated, etc
Setting Up
Python version
We run python 3.9 both locally and in production.
pycurl
See https://github.com/alphagov/notifications-manuals/wiki/Getting-started#pycurl
AWS credentials
To run the API you will need appropriate AWS credentials. See the Wiki for more details.
environment.sh
Create and edit an environment.sh file.
echo "
export NOTIFY_ENVIRONMENT='development'
export MMG_API_KEY='MMG_API_KEY'
export FIRETEXT_API_KEY='FIRETEXT_ACTUAL_KEY'
export NOTIFICATION_QUEUE_PREFIX='YOUR_OWN_PREFIX'
export FLASK_APP=application.py
export FLASK_ENV=development
export WERKZEUG_DEBUG_PIN=off
"> environment.sh
Things to change:
- Replace YOUR_OWN_PREFIX with local_dev_<first name>.
- Run the following in the credentials repo to get the API keys.
notify-pass credentials/providers/api_keys
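As a quick sanity check after sourcing environment.sh, a small script along these lines could flag variables that are missing or still hold the placeholder values from the template above. It is a hypothetical helper, not part of the repository. The variable names and placeholder strings are taken from the template, and `check_environment` is a name introduced here for illustration.

```python
import os

# Variable names taken from the environment.sh template above.
REQUIRED_VARS = [
    "NOTIFY_ENVIRONMENT",
    "MMG_API_KEY",
    "FIRETEXT_API_KEY",
    "NOTIFICATION_QUEUE_PREFIX",
    "FLASK_APP",
]

# Template values that are meant to be replaced before running the API.
PLACEHOLDERS = {
    "MMG_API_KEY": "MMG_API_KEY",
    "FIRETEXT_API_KEY": "FIRETEXT_ACTUAL_KEY",
    "NOTIFICATION_QUEUE_PREFIX": "YOUR_OWN_PREFIX",
}


def check_environment(env=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif PLACEHOLDERS.get(name) == value:
            problems.append(f"{name} still has its placeholder value")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Passing a plain dict instead of relying on `os.environ` keeps the check easy to try out without touching your shell.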
Postgres
Install Postgres.app.
Currently the API works with PostgreSQL 11. After installation, open the Postgres app, open the sidebar, and update or replace the default server with a compatible version.
Note: you may need to add the following directory to your PATH in order to bootstrap the app.
export PATH=${PATH}:/Applications/Postgres.app/Contents/Versions/11/bin/
Redis
To use redis you'll need to install it locally. On OSX we've used brew for this. To use redis caching you also need to switch it on by changing the config for development:
REDIS_ENABLED = True
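For orientation, the setting might sit in a Flask-style config class roughly like the sketch below. The class names, the `CONFIGS` mapping and `get_config` are assumptions for illustration only and may not match the layout of the real config module. Only `REDIS_ENABLED` and the `NOTIFY_ENVIRONMENT` value `development` come from this README.

```python
import os


class Config:
    # Hypothetical base config: caching stays off unless a subclass enables it.
    REDIS_ENABLED = False


class Development(Config):
    # The local change described above: switch redis caching on.
    REDIS_ENABLED = True


# Hypothetical mapping from NOTIFY_ENVIRONMENT values to config classes.
CONFIGS = {"development": Development}


def get_config(environment=None):
    """Pick a config class by name, defaulting to NOTIFY_ENVIRONMENT."""
    if environment is None:
        environment = os.environ.get("NOTIFY_ENVIRONMENT", "development")
    return CONFIGS.get(environment, Config)


print(get_config("development").REDIS_ENABLED)
```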
To run the application
# install dependencies, etc.
make bootstrap
# run the web app
make run-flask
# run the background tasks
make run-celery
# run scheduled tasks (optional)
make run-celery-beat
To test the application
# install dependencies, etc.
make bootstrap
make test
To run one off tasks
Tasks are run through the flask command - run flask --help for more information. There are two sections we need to
care about: flask db contains alembic migration commands, and flask command contains all of our custom commands. For
example, to purge all dynamically generated functional test data, do the following:
Locally
flask command purge_functional_test_data -u <functional tests user name prefix>
On the server
cf run-task notify-api "flask command purge_functional_test_data -u <functional tests user name prefix>"
All commands and subcommands accept a --help option if you need more information.