If we set an environment variable, we can stub out calls to SES and send
them to our own stub app. If the environment variable is not set, things
work as normal.
To be used alongside
https://github.com/alphagov/notifications-email-provider-stub
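Roughly how the switch might look (the SES_STUB_URL variable name and
the boto3 client setup here are illustrative, not the actual code):

```python
import os

import boto3


def get_ses_client():
    # if the stub URL is set, point boto3 at our stub app instead of real SES
    stub_url = os.environ.get("SES_STUB_URL")
    if stub_url:
        return boto3.client("ses", region_name="eu-west-1", endpoint_url=stub_url)
    # otherwise things work as normal and we talk to AWS SES
    return boto3.client("ses", region_name="eu-west-1")
```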
We have seen the reporting app run out of memory multiple times when
dealing with overnight tasks. The app runs 11 worker threads and we
reduce this to 2 worker threads to put less pressure on a single
instance.
The number 2 was chosen because most of the tasks processed by the
reporting app only take a few minutes, and usually only one or two take
more than an hour. With 2 processes across our current 2 instances, a
long-running task should hopefully only wait behind a few short-running
tasks before being picked up, so we shouldn't see a large increase in
the overall time taken to run all our overnight reporting tasks.
On top of reducing the concurrency for the reporting app, we also set
CELERYD_PREFETCH_MULTIPLIER=1. We do this as suggested by the celery
docs because this app deals with long running tasks.
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit
The change in prefetch multiplier should again optimise the overall time
it takes to process our tasks by ensuring that tasks are given to
instances that have (or will soon have) spare workers to deal with them,
rather than committing to putting all the tasks on certain workers in
advance.
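For reference, a sketch of the two settings together (using the Celery
3.1-style names from the linked docs; exactly where they live in our
config may differ):

```python
# sketch of the reporting worker's Celery settings
CELERYD_CONCURRENCY = 2          # down from 11 worker processes per instance
CELERYD_PREFETCH_MULTIPLIER = 1  # each worker reserves only its current task
```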
Note, another suggestion from the optimisation docs is to start setting
`ACKS_LATE` on the long running tasks.
This setting would effectively change us from prefetching 1 task per
worker to prefetching 0 tasks per worker and further optimise how we
distribute our tasks across instances. However, we decided not to try
this setting as we weren't sure whether it would conflict with our
visibility_timeout. We decided not to spend the time investigating but
it may be worth revisiting in the future, as long as tasks are
idempotent.
Overall, this commit takes us from potentially having all 18 of our
reporting tasks get fetched onto a single instance to now having a
process that will ensure tasks are distributed more fairly across
instances based on when they have available workers to process the
tasks.
The reporting worker tasks fetch large amounts of data from the db, do
some processing, then store the results back in the database. As the
reporting worker
only processes the create nightly billing/stats table tasks, which
aren't high performance or high volume, we're fine with the performance
hit from restarting the worker between every task (which based on
limited local testing takes about a second or so).
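I believe the restart-per-task behaviour comes down to something like
the max-tasks-per-child setting (again using the 3.1-style name; treat
this as an assumption rather than the exact change):

```python
# recycle each reporting worker process after every task so memory used by
# the large nightly billing/stats queries is released back to the OS
CELERYD_MAX_TASKS_PER_CHILD = 1
```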
Restarting the worker between tasks causes some real funky shit with
the app_context (used for accessing current_app.logger). To access
flask's global state we use the standard way of importing
`from flask import current_app`. However,
processes after the first one don't have the current_app available on
shut down (they're fine during the actual task running), and are unable
to call `with current_app.app_context()` to create it. They _are_ able
to call `with app.app_context()` to create it, where `app` is the
initial app that we pass in to `NotifyCelery.init_app`.
NotifyCelery.init_app is only called once, in the master process - I
think the application state is then stored and passed to the celery
workers. It looks like the teardown might clear it, and then it never
gets set up again for the new workers? Unsure.
To fix this, store a copy of the initial flask app on the NotifyCelery
object and then use that from within the shutdown signal logging
function.
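A rough sketch of the workaround (the attribute name and signal handler
are illustrative, not the exact code):

```python
from celery import Celery
from celery.signals import worker_process_shutdown


class NotifyCelery(Celery):
    def init_app(self, app):
        # (rest of the existing init_app setup omitted)
        # keep a reference to the initial flask app so the shutdown hook can
        # use it even when current_app is no longer available
        self._flask_app = app


notify_celery = NotifyCelery()


@worker_process_shutdown.connect
def log_worker_process_shutdown(sender=None, pid=None, exitcode=None, **kwargs):
    # current_app isn't available here for worker processes after the first
    # one, so push a context from the stored app instead
    flask_app = notify_celery._flask_app
    with flask_app.app_context():
        flask_app.logger.info(
            "Celery worker process %s shutting down with exit code %s", pid, exitcode
        )
```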
Nothing's ever easy ¯\_(ツ)_/¯
We temporarily updated the provider resting points in
https://github.com/alphagov/notifications-api/pull/2804.
This puts the provider resting points back to their original value now
that both providers seem to be functioning well.
We are seeing issues with one of our providers. Move all traffic to the
other. We have done this with the provider load balancer, however the
balancer will move traffic back towards these resting points by 10
percent every hour, which we want to stop happening until we are
confident our provider has fixed their issues.
In the long run, we should add functionality to pause our load balancer
behaviour.
Moving it from 4:15am to 3:00am. This will mean we do more deleting before
the 'work day' starts and improve our DB performance.
I've looked at the last 14 days of logs for the
`create-nightly-notification-status` subtasks and the
`create-nightly-billing` subtasks. The latest they appear to finish is
2.30AM. There were some outliers but I believe these were people running
the tasks in the middle of the day as a manual process.
Obviously, this still means there is a risk that those tasks conflict
with `delete-notifications-older-than-retention`, even more so now that
we have moved this to 3am.
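For reference, the schedule change itself is just a crontab tweak along
these lines (a sketch; the real beat schedule entry may differ slightly):

```python
from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    "delete-notifications-older-than-retention": {
        "task": "delete-notifications-older-than-retention",
        # was crontab(hour=4, minute=15); moved earlier so the deletes
        # finish before the working day starts
        "schedule": crontab(hour=3, minute=0),
    },
}
```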
Based on this https://github.com/alphagov/notifications-api/pull/2788
where some concerns were raised. This should be a quicker fix to get
the deletions to run sequentially for all the notification types. Note,
email goes first as it's the most important and makes up the largest
numbers (we wouldn't want it to start with SMS, fail halfway for a
reason that only affects SMS and for that to affect the email deletion).
We hope that by running sequentially we will reduce conflicts writing
to the same index, which should reduce the total time it takes to
finish deleting all notification types older than their retention time.
There is a risk that, whilst each job is quicker, the jobs now run
sequentially rather than potentially overlapping and so will take
longer overall. We will need to monitor to see.
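In rough terms, the parent task now does something like this (the
function names are illustrative, not the real code):

```python
NOTIFICATION_TYPES_IN_DELETION_ORDER = ["email", "sms", "letter"]


def delete_notifications_for_type(notification_type):
    ...  # existing per-type delete query


def delete_notifications_older_than_retention():
    # run the deletes one notification type at a time, email first, so a
    # failure that only affects one type can't block the email deletion and
    # so we avoid concurrent writes to the same index
    for notification_type in NOTIFICATION_TYPES_IN_DELETION_ORDER:
        delete_notifications_for_type(notification_type)
```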
Once a contact list is gone from the database there’s no way to
reference it again. Any jobs that used it will have made their own copy.
So we can clean it up, meaning we’re not storing personal data longer
than we need to.
Instead of saving the email notification to the db, add it to a queue to save later.
This is an attempt to alleviate pressure on the db from the api requests.
This initial PR is to trial it and see if we see an improvement in api performance and a reduction in queue pool errors. If we are happy with this we could remove the hard coding of the service id.
In a nutshell:
- If POST /v2/notification/email is from our high volume service (hard coded for now) then create the notification and send it to a queue, to be persisted to the db later.
- create a save_api_email task to persist the notification
- return the notification
- New worker app to process the save_api_email tasks.
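Very roughly, the new request path looks like this (the helper names
and the placeholder service id are illustrative):

```python
import uuid

HIGH_VOLUME_SERVICE_ID = uuid.UUID("00000000-0000-0000-0000-000000000000")  # placeholder


def post_email_notification(service_id, notification):
    if service_id == HIGH_VOLUME_SERVICE_ID:
        # don't write to the db in the request path - put the notification
        # on a queue and let the new worker app persist it later
        queue_save_api_email_task(notification)
    else:
        # existing behaviour: persist straight to the db
        save_notification_to_db(notification)

    # either way the notification is returned to the caller immediately
    return notification


def queue_save_api_email_task(notification):
    ...  # e.g. a save_api_email celery task sent to the new queue


def save_notification_to_db(notification):
    ...  # existing persistence code
```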
We moved from sending statsd metrics to hosted graphite to sending to
one that is running on the paas. Therefore we no longer need to send
statsd metrics to a particular prefix at the statsd app as it is only
receiving statsd metrics from our apps (not other users like would have
been the case with HostedGraphite).
This should change no behaviour as the only place the environment
variable was being used was in the gunicorn config and it was an empty
string which is the default behaviour anyway as per:
https://docs.gunicorn.org/en/stable/settings.html#statsd-prefix
This will allow us to accept two different secrets and therefore allow
us to rotate the secret that the admin client is sending to the API.
Due to how the notifications-python-client throws exceptions, we run
into exactly the same issue with not being able to distinguish if a
`TokenDecodeError` is thrown because the token was encrypted with a
different secret key or because there was a different error when
decoding. I've copied the TODO from `requires_auth` as this is exactly
the same issue.
I've also added a test case for functionality that was missing for an
out of date admin token (old IAT).
And support this change across our code. Note, this is a halfway step
where it is now a list rather than a string, but still only supports a
single secret, ie one item in the list.
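The validation ends up looking roughly like this (a sketch assuming the
python client's decode_jwt_token helper and TokenDecodeError; the
surrounding function is illustrative):

```python
from notifications_python_client.authentication import decode_jwt_token
from notifications_python_client.errors import TokenDecodeError


def token_matches_any_admin_secret(token, admin_client_secrets):
    # try each configured secret in turn, so tokens signed with either the
    # old or the new secret are accepted while we rotate
    for secret in admin_client_secrets:
        try:
            decode_jwt_token(token, secret)
            return True
        except TokenDecodeError:
            # TODO (copied from requires_auth): we can't tell whether this
            # failed because the token used a different secret or because of
            # some other decode error - try the next secret regardless
            continue
    return False
```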
These alerts are sent to our postal provider, and usually arrive as they are getting ready to go home for the day or the weekend.
Which means they get missed/overlooked. They have agreed to get the alert an hour earlier; perhaps that will improve the response time.
Added the queue and task names for the new template preview task to the
config. Also added the new bucket name that template preview will use
for the sanitised letters to the config for all environments.
we generally aim to share the load between the two providers equally
(more or less). When one provider has struggled, we deprioritise them;
this commit adds a function that gradually restores balance. It checks
every five minutes: if it's been more than an hour since the providers
were last changed then it adjusts them towards a 50/50 split. Except
it's not quite 50/50 due to #reasons (we want to slightly favour MMG);
it's actually 60/40. That's defined in a new dict in config.py.
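In rough terms the new function does something like this (the dict
mirrors the new config entry; the function name and the 10-point step
are assumptions based on the description above and the earlier note
about reverting 10 percent per hour):

```python
from datetime import datetime, timedelta

# the new dict in config.py - not quite 50/50 because we want to slightly
# favour MMG
SMS_PROVIDER_RESTING_POINTS = {"mmg": 60, "firetext": 40}


def adjust_provider_priorities_towards_resting_points(priorities, last_updated_at):
    # runs every five minutes, but only acts if nothing has changed the
    # priorities for over an hour
    if datetime.utcnow() - last_updated_at < timedelta(hours=1):
        return priorities

    adjusted = {}
    for provider, resting_point in SMS_PROVIDER_RESTING_POINTS.items():
        current = priorities[provider]
        if current < resting_point:
            adjusted[provider] = min(current + 10, resting_point)
        else:
            adjusted[provider] = max(current - 10, resting_point)
    return adjusted
```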
these URLs never change, and it led to surprising issues where an
updated default MMG_URL wasn't actually respected on PaaS. These URLs
aren't private and don't need to be stored in credentials.
By not defining them in the manifest, we expect them to use the default
unless `cf set-env` has been specifically used to modify them in an app.
we don't use it since we wrote our own provider stubs for performance
tests.
this removes it from the api - it's still in the DB and will be
retrieved by queries, but is set to disabled on prod
the nightly tasks need to run after the create nightly notification
status task - so that test notifications are still there to record
stats for, and to stop the risk of deleting notifications part-way
through recording stats for them.
There's a bug in pysftp that appears to cause quadratic performance loss. See https://github.com/paramiko/paramiko/issues/1141 for more details.
As a temporary band-aid fix, lower the size of the files we're sending.
* The `_should_record_notification_in_history_table` function stopped being
used in this commit: c23ae15f32
* `NOTIFICATIONS_ALERT` stopped being used in this commit: 5aa37f09b6
we build up one personalisation dict, and then pass it in to all the
different templates - so be careful editing things. also of note, we
check if the agreement_signed_on_behalf_of is set, and send a different
template with slightly different wording to the person who clicked the
confirm button.
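Loosely, the flow is something like this (template keys and field names
are illustrative):

```python
def send_agreement_confirmation_emails(agreement, send_email):
    # one shared personalisation dict is passed to every template, so be
    # careful: adding, renaming or removing a key here affects all of them
    personalisation = {
        "organisation_name": agreement.organisation_name,
        "signed_by_name": agreement.signed_by_name,
        "on_behalf_of_name": agreement.signed_on_behalf_of,
    }

    if agreement.signed_on_behalf_of:
        # the person who clicked confirm signed on someone else's behalf, so
        # they get a template with slightly different wording
        send_email("signed-on-behalf-of-template", personalisation)
    else:
        send_email("signed-by-template", personalisation)
```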
Added a scheduled task to run once a day and check if there were any
letters from before 17.30 that still have a status of 'created'. This
logs an exception instead of trying to fix the error because the fix
will be different depending on which bucket the letter is in.
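A rough sketch of the check (the query helper and exact cutoff handling
are illustrative):

```python
from datetime import datetime, time

from flask import current_app


def check_for_letters_still_in_created():
    # letters created before 17.30 should have moved past 'created' by now;
    # if any haven't, something has gone wrong earlier in the pipeline
    cutoff = datetime.combine(datetime.utcnow().date(), time(17, 30))
    stuck_letters = get_letters_created_before_with_status(cutoff, "created")  # illustrative query

    if stuck_letters:
        # just log it - the fix differs depending on which bucket the letter
        # is in, so a person needs to investigate rather than the task
        # trying to repair anything itself
        current_app.logger.exception(
            "%s letters created before 17.30 still have status 'created'",
            len(stuck_letters),
        )
```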
Added a task which runs twice a day on weekdays and checks for letters that have
been in the state of `pending-virus-check` for over 90 minutes. This is
just logging an exception for now, not trying to fix things, since we
will need to manually check where the issue was.