David McDonald 2767328e79 Reduce impact of stale cache for performance page
I came across a bug where the performance page might be visited
on say the 22nd and we expect it to show data for yesterday and
the past 7 days (so the 21st and back 7 days) but instead it
only shows it for the 20th and then back 6 days.

The cause is that we have nightly tasks that run to
calculate the number of notifications sent per service (the
ft_notification_status table)
calculate the number of notifications sent within 10 seconds
(the ft_processing_time table)
Both of these tables get filled between the hours of midnight and
2:30am. The first seems to be filled by about 12:30 every night
whereas the processing stats seems to be filled about 2am every
night.

The admin app currently caches the results of the call to the API
(https://github.com/alphagov/notifications-api/blob/master/app/performance_dashboard/rest.py#L24)
to get this data with the key
performance-stats-{start_date}-to-{end_date}.

Now the issue here is that if the performance page is visited at
00:01 on the 23rd, it will call the API for data on the 22nd and
backwards 7 days. The data it gets at 00:01 will not yet have the
data for the 22nd, it will only go as far as the 21st, because
the overnight jobs to put data in the tables have not run yet. The
admin app then caches this data, meaning that for the rest of the
day the page will be missing data for the 22nd, even though it
becomes available after approx 2am. We have seen it a few times
in production where the performance page has indeed been visited
before say 2 am, leaving the page with missing data.

In terms of solutions, there are several potential options.

1. Remove the caching that we do in redis. The downside of this is
that we will hit the API more (currently once a day for the first
visitor to the performance page). We have had 255 requests
to the performance page in the last week and when there is no value
in redis, the call to the API takes approximately 1-2 seconds.
Removing the caching would against why the caching was originally
added which was to act as a basic protection against malicious
DOS attempts.

2. Make the admin app only set the cache value if it is after
2:30am. This one feels a bit gross and snowflakey.

3. The API flushes the redis cache after it has finished its nightly
tasks. We don't have that much precedent of the API interacting
with redis, it is mostly the admin app that does but this is still
a valid option.

4. Cache in redis the data for 2 days ago to 7 days ago and then
get the data for yesterday freshly every time. This would mean
some things are still cached but the smallest bit that we can't rely
on can be gathered fresh.

5. Remove the caching we do in redis but then try and optimise the
API endpoint to return results even faster and with less
interaction with the DB. My hypothesis is that the slowest part
is where the API has to calculate how many notifications Notify
has ever sent
(https://github.com/alphagov/notifications-api/blob/master/app/performance_dashboard/rest.py#L36)
and so we could potentially try and store these totals in the DB
every night and update them nightly rather than having to query all
the data in the table to figure them out every time.

6. Reduce the expiry time of the cache key so although the bug will
still exist, it will only exist for a much shorter time.

In the end, we've gone for option 6. The user impact of
this bug sticking around for an hour is minimal, in particular
because that would be in the early hours of the morning. The solution
is also quick to implement and keeps the protection against a DOS attack.

The value of 1 hour for expiry was a fairly abitrary choice, it could
have been as little as 1 minute or as much as 6 hours.
2021-12-07 12:44:36 +00:00
2021-02-22 17:19:46 +00:00
2019-11-29 15:25:37 +00:00
2021-03-08 18:48:56 +00:00
2021-03-08 18:48:56 +00:00
2020-03-06 13:25:53 +00:00
2021-12-07 12:44:18 +00:00
2021-12-07 12:44:18 +00:00

notifications-admin

GOV.UK Notify admin application - https://www.notifications.service.gov.uk/

  • Register and manage users
  • Create and manage services
  • Send batch emails and SMS by uploading a CSV
  • Show history of notifications

Setting up

Python version

At the moment we run Python 3.9 in production.

NPM packages

brew install node

NPM is Node's package management tool. n is a tool for managing different versions of Node. The following installs n and uses the long term support (LTS) version of Node.

npm install -g n
n lts

environment.sh

In the root directory of the application, run:

echo "
export NOTIFY_ENVIRONMENT='development'
export FLASK_APP=application.py
export FLASK_ENV=development
export WERKZEUG_DEBUG_PIN=off
"> environment.sh

AWS credentials

To run parts of the app, such as uploading letters, you will need appropriate AWS credentials. See the Wiki for more details.

To run the application

# install dependencies, etc.
make bootstrap

# run the web app
make run-flask

Then visit localhost:6012.

Any Python code changes you make should be picked up automatically in development. If you're developing JavaScript code, run npm run watch to achieve the same.

To test the application

# install dependencies, etc.
make bootstrap

# run all the tests
make test

# continuously run js tests
npm run test-watch

To run a specific JavaScript test, you'll need to copy the full command from package.json.

To update application dependencies

requirements.txt is generated from the requirements.in in order to pin versions of all nested dependencies. If requirements.in has been changed, run make freeze-requirements to regenerate it.

Further docs

Description
The UI of Notify.gov
Readme 545 MiB
Languages
Python 69.3%
HTML 16.6%
JavaScript 11.1%
SCSS 0.9%
Nunjucks 0.7%
Other 1.4%