Leo Hemsted bab659c677 reduce number of services we try and delete notifications for
TLDR: Don't return as many services, and only return their IDs and not
the whole service objects.

Context:

the delete notifications nightly task has been taking longer and longer,
and to delete all three notification types in sequence it now takes up
to 8 hours.

This is because we were retrieving all services, loading them into
memory on the worker, and then trying to delete notifications for each
service in turn.

While it does use a fair chunk of IOPS/CPU on our postgres db, we're not
anywhere close to capacity on those (20% CPU, 4k IOPS out of 30k max)[1]

The real issue appears to be that the task is CPU bound on the periodic
worker - we see the worker spike up to 100% CPU regularly across the
whole 3am-11am period.

We also noticed that for each notification type the task first processes
services with custom data retention (not many but some of the biggest
users), then deals with all other services. We can see from looking at
kibana that, for example, the task starts at 3am, and the custom data
retention service email deletions are finished by 3:12am. The rest of
the emails don't get deleted until 5am, so we knew that the problem is
with how it handles the other services.

There are currently 17000 services in the database. On a typical day,
~800 services will have notifications that are over 7 days old and need
to be deleted. By only returning these services, we reduce the amount of
data transfer and serialisation that needs to happen. It takes about two
minutes to retrieve the distinct service ids from the notifications
table for sms notifications, but that is only 5% the size of the full
list so cuts down on a lot of processing

Also, by only returning service_ids rather than the whole `Service`
model we avoid sqlalchemy needing to do lots of data serialisation, when
we were only using the `Service.id` field from that result anyway.

[1] https://admin.cloud.service.gov.uk/organisations/55b1eb7d-e4c5-4359-9466-dd3ca5b0e457/spaces/80d769ff-7b01-49a4-9fa4-f87edd5328f9/services/6093d337-6918-4b97-9709-97529114eb90/metrics
[2] https://grafana-paas.cloudapps.digital/d/_GlGBNbmk/notify-apps?orgId=2&refresh=5s&var-space=production&var-app=notify-delivery-worker-periodic&from=now-24h&to=now
[3] https://kibana.logit.io/s/9423a789-282c-4113-908d-0be3b1bc9d1d/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(message),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'%22Deleting%20email%20notifications%20for%20services%20without%20flexible%20data%20retention%22')),sort:!('@timestamp',desc))
2021-11-24 16:18:40 +00:00

GOV.UK Notify API

Contains:

  • the public-facing REST API for GOV.UK Notify, which teams can integrate with using our clients
  • an internal-only REST API built using Flask to manage services, users, templates, etc (this is what the admin app talks to)
  • asynchronous workers built using Celery to put things on queues and read them off to be processed, sent to providers, updated, etc

Setting Up

Python version

We run python 3.9 both locally and in production.

pycurl

See https://github.com/alphagov/notifications-manuals/wiki/Getting-started#pycurl

AWS credentials

To run the API you will need appropriate AWS credentials. See the Wiki for more details.

environment.sh

Creating and edit an environment.sh file.

echo "
export NOTIFY_ENVIRONMENT='development'

export MMG_API_KEY='MMG_API_KEY'
export FIRETEXT_API_KEY='FIRETEXT_ACTUAL_KEY'
export NOTIFICATION_QUEUE_PREFIX='YOUR_OWN_PREFIX'

export FLASK_APP=application.py
export FLASK_ENV=development
export WERKZEUG_DEBUG_PIN=off
"> environment.sh

Things to change:

  • Replace YOUR_OWN_PREFIX with local_dev_<first name>.
  • Run the following in the credentials repo to get the API keys.
notify-pass credentials/providers/api_keys

Postgres

Install Postgres.app.

Currently the API works with PostgreSQL 11. After installation, open the Postgres app, open the sidebar, and update or replace the default server with a compatible version.

Note: you may need to add the following directory to your PATH in order to bootstrap the app.

export PATH=${PATH}:/Applications/Postgres.app/Contents/Versions/11/bin/

Redis

To switch redis on you'll need to install it locally. On a OSX we've used brew for this. To use redis caching you need to switch it on by changing the config for development:

    REDIS_ENABLED = True

To run the application

# install dependencies, etc.
make bootstrap

# run the web app
make run-flask

# run the background tasks
make run-celery

# run scheduled tasks (optional)
make run-celery-beat

To test the application

# install dependencies, etc.
make bootstrap

make test

To update application dependencies

To update application dependencies

requirements.txt is generated from the requirements.in in order to pin versions of all nested dependencies. If requirements.in has been changed, run make freeze-requirements to regenerate it.

To run one off tasks

Tasks are run through the flask command - run flask --help for more information. There are two sections we need to care about: flask db contains alembic migration commands, and flask command contains all of our custom commands. For example, to purge all dynamically generated functional test data, do the following:

Locally

flask command purge_functional_test_data -u <functional tests user name prefix>

On the server

cf run-task notify-api "flask command purge_functional_test_data -u <functional tests user name prefix>"

All commands and command options have a --help command if you need more information.

Further documentation

Description
The API powering Notify.gov
Readme 64 MiB
Languages
Python 98.5%
HCL 0.6%
Jinja 0.5%
Shell 0.3%
Makefile 0.1%