This fixes the app crashing, apparently due to a lack of free
memory. Looking over the last month, we can see memory has been hovering
at ~90% utilisation. 512M seems like a good compromise between avoiding
this happening again and only paying for what we use.
The limit was last changed 5 years ago:
https://github.com/alphagov/notifications-api/pull/882
We think this setting was previously necessary to avoid a memory
leak [1], but it's unclear if this is still an issue:
- We've advanced two major versions of Celery.
- Some of the tasks are now quicker and leaner.
Restarting worker sub-processes after each task is a big problem
for performance, as we move towards parallelising our reporting.
This is something of a test to see if we can manage without this
setting. Note that we need to unset the variable manually:
cf unset-env notify-delivery-worker-reporting CELERYD_MAX_TASKS_PER_CHILD
In the worst case we can always re-run any failed tasks. To check
the worker is still behaving as expected, we can:
- Monitor CPU / memory graphs for it.
- Check `cf events` for unexpected restarts / crashes.
- Compare numbers of task completion logs to previous days.
- Check the number of new billing / status rows looks right.
[1]: ad419f7592
We saw it fail again last night when calculating how many notifications
were sent for one of our services to put in the ft_notification_status
table. It ran into the SQLAlchemy statement timeout again.
To get us through the holiday period, let's make it 2 hours, which
should surely be enough, and then we can fix this properly.
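A minimal sketch of the kind of change this implies, assuming the timeout is
applied as a Postgres `statement_timeout` passed through SQLAlchemy connect
args (the real config key and wiring in the codebase may differ):
```python
from sqlalchemy import create_engine

# Sketch only: bump the Postgres statement timeout to 2 hours (value in ms).
# The connection URL and where this is configured are assumptions.
engine = create_engine(
    "postgresql://localhost/notification_api",
    connect_args={"options": "-c statement_timeout=7200000"},  # 2 hours
)
```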
Having a pool size of 30 connections means that, with the current
configuration, under a large number of requests the API would end up
holding onto 30 connections per worker * 4 workers per instance * 35
instances = 4200 connections. With a limit of 5000 connections, this
means that we would only have 800 connections to share between the
workers or for overflow usage (note that even the overflow for the API
alone would take us above the 5000 limit: 10 overflow connections per
worker * 4 * 35 = 1400 connections, total 5600 _only_ for the API).
During our load tests this led to a deadlock situation where nothing
could retrieve connections to deal with a queue build-up.
The reduced pool size allowed for a much more graceful degradation of
the service: under significant load, response times would increase but
we would still manage to serve all the requests.
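For illustration, a sketch of how the numbers above map onto SQLAlchemy engine
settings; the values shown are the pre-change figures from the text, and the
exact config keys used in the codebase are an assumption:
```python
# Flask-SQLAlchemy style engine options (sketch, not the real config).
SQLALCHEMY_ENGINE_OPTIONS = {
    "pool_size": 30,     # connections held per gunicorn worker...
    "max_overflow": 10,  # ...plus up to 10 extra under load
}

# Worst case across the whole API:
# (30 + 10) connections * 4 workers per instance * 35 instances = 5600,
# which is over the database's 5000-connection limit on its own.
```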
When running the nightly reporting tasks we are seeing that some tasks are failing because the query is timing out. We need to revisit how to optimise the query, but this will at least let the process finish.
This is to see if the worker requires slightly more memory than it currently has access to, or to determine if there is a memory leak somewhere in the code that needs further investigation.
This comes on the heels of yesterday's issue where we could not process the CSVs users uploaded: the memory graph for this worker showed it was using almost all of its available memory, and a redeploy fixed the problem.
We want to start using Firetext for sending international SMS. They
require us to use a different API key for international SMS because it
needs a new code path to switch the sender ID to something that the
destination country will accept.
This PR does not include switching the sender of international SMS to
Firetext but sets us up to do so.
This app will replace the `notify-api-sms-callbacks` app as it is an app
that handles receipts, not callbacks.
After this and the corresponding Concourse PR are merged and deployed (at
which point we will have two apps sharing the traffic), we can then put
in PRs to remove the `notify-api-sms-callbacks` app.
There is a chance it could all be done as a single PR (or at least one
for the API and one for the concourse pipelines) but I'm playing it safe
and doing it as a very clear two step process just in case.
There is no need to give these creds to any of the other workers, and so
the fewer instances that have them the better.
You can verify this works by running
```
CF_APP=notify-api CF_SPACE=preview make generate-manifest
```
vs
```
CF_APP=notify-delivery-worker-broadcasts CF_SPACE=preview make generate-manifest
```
We will no longer send them any stats, so we no longer need:
- the code to work out the nightly stats
- the performance platform client
- any configuration for the client
- any nightly tasks that kick off the sending of the stats
We will also require a change in Cronitor: as this task will no longer
run, we need to delete the Cronitor check.
### The facts
* Celery grabs up to 10 tasks from an SQS queue by default
* Each broadcast task takes a couple of seconds to execute, or double
that if it has to go to the failover proxy
* Broadcast tasks delay retry exponentially, up to 300 seconds.
* Tasks are acknowledged when celery starts executing them.
* If a task is not acknowledged before its visibility timeout of 310
seconds, sqs assumes the celery app has died, and puts it back on the
queue.
### The situation
A task stuck in a retry loop was reaching its visibility timeout, and as
such SQS was duplicating it. We're unsure of the exact cause, but there
were two contributing factors: the celery prefetch and the retry delay
of 300 seconds. Essentially, celery grabs the task, keeps an eye on it
locally while waiting for the delay ETA to come round, then gives the
task to a worker to execute. However, that worker might already have up
to ten tasks that it has grabbed from SQS. This means the worker only
has a 10 second buffer in which to get through all those tasks and start
working on the delayed task, before SQS makes the task available again.
(Note that the delay of 300 seconds is translated into a timestamp based
on the time you called self.retry and put the task back on the queue.
Whereas the visibility timeout starts ticking from the time that a
celery worker picked up the task.)
### The fix
#### Set the max retry delay for broadcast tasks to 240 seconds
Setting the max delay to 240 seconds means that instead of a 10 second
buffer before the visibility timeout is tripped, we've got a 70 second
buffer.
#### Set the prefetch limit to 1 for broadcast workers
This means that each worker will have up to 1 currently executing task,
and 1 task pending execution. If it has these, it won't grab any more
off the queue, so they can sit there without their visibility timeout
ticking up.
Setting the prefetch limit to 1 will result in more queries to SQS and
lower throughput. That might matter for, e.g., sending emails, but the
broadcast worker is not hyper time-critical.
https://docs.celeryproject.org/en/3.1/getting-started/brokers/sqs.html?highlight=acknowledge#caveats
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html?highlight=prefetch#reserve-one-task-at-a-time
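A minimal sketch of the two settings described above, using Celery 3.x-style
names; the task body and the exact option names in the codebase are
assumptions, not the real implementation:
```python
from celery import Celery

app = Celery("broadcasts", broker="sqs://")

# Reserve at most one task per worker process beyond the one it is executing.
app.conf.CELERYD_PREFETCH_MULTIPLIER = 1


@app.task(bind=True, max_retries=None)
def send_broadcast(self, broadcast_id):
    try:
        ...  # placeholder for the call to the CBC proxy
    except Exception as exc:
        # Exponential backoff capped at 240s, keeping a 70 second buffer
        # inside the 310 second SQS visibility timeout.
        delay = min(240, 2 ** self.request.retries)
        raise self.retry(exc=exc, countdown=delay)
```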
This worker will be responsible for handling all broadcast tasks.
It is based on the internal worker which is currently handling broadcast
tasks.
A concurrency of 2 has been chosen fairly arbitrarily. Gunicorn will be
running 4 worker processes, so this gives us the ability to process 8
tasks per app instance.
On Monday, we had a build-up of emails in the email queue that weren't
getting picked up by the sender worker, causing delays.
After further investigation with Andy from the PaaS, we believe the
following happened.
We received a bunch of traffic at 8:30ish which consisted of some
very large emails in terms of their length and complexity. The amount
of memory used by the app instances got very high and a few apps
crashed due to OOM (5 crashes recorded in the cf app events). When new
app instances tried to spin up, they weren't able to, as they
potentially also ran out of memory immediately.
This left us in the position of having fewer app instances than we
needed; on top of that, they were all using a very large amount of
CPU, which may have limited how quickly an individual app instance
could process tasks. This meant that overall we were processing fewer
tasks than we needed to, and our queue of emails started to build up.
So it appears our sender workers did not have the memory they needed.
Looking at a graph of memory usage on the sender workers over the past
30 days, we see that on several days it breached 90% for long periods
of time. This, in combination with the hypothesis above about what
happened, led us to decide to give the app instances a bigger memory
quota, so it has been upped from 3GB to 4GB.
Whilst doing this, I also looked at long term memory usage graphs for
our other workers and saw that the letters worker was similarly close
to around 90% of memory used, so I have taken the opportunity to bump
that too.
We might need to deal with a potentially large volume of SMS delivery
receipts. These receipts are POSTed to our public API at two URLs. The
actual endpoint just parses the request body and puts a task on an SQS
queue; no database connections are required or anything like that.
Split up this traffic from other traffic, so that any increase in volume
of callbacks won't affect the scaling/load/etc of the main api apps.
We've hard coded instance counts to 10 on prod for now until we get an
idea of load.
We have seen the reporting app run out of memory multiple times when
dealing with overnight tasks. The app runs 11 worker threads and we
reduce this to 2 worker threads to put less pressure on a single
instance.
The number 2 was chosen as most of the tasks processed by the reporting
app only take a few minutes and only one or two usually take more than
an hour. This would mean that, with 2 processes across our current 2
instances, a long running task should hopefully only wait behind a few
short running tasks before being picked up, and therefore we shouldn't
see a large increase in the overall time taken to run all our overnight
reporting tasks.
On top of reducing the concurrency for the reporting app, we also set
CELERYD_PREFETCH_MULTIPLIER=1. We do this as suggested by the celery
docs because this app deals with long running tasks.
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit
The change in prefetch multiplier should again optimise the overall time
it takes to process our tasks by ensuring that tasks are given to
instances that have (or will soon have) spare workers to deal with them,
rather than committing to putting all the tasks on certain workers in
advance.
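For reference, a rough sketch of what these two settings look like as Celery
3.x config; the exact names and where they are set in the codebase are
assumptions:
```python
# Reporting worker settings described above (Celery 3.x-style names; sketch only).
CELERYD_CONCURRENCY = 2          # two worker processes per app instance
CELERYD_PREFETCH_MULTIPLIER = 1  # each process reserves at most one extra task

# With 2 instances this gives 4 worker processes in total, and each reserves
# at most one task beyond the one it is running, so the ~18 nightly reporting
# tasks get handed out as workers become free rather than assigned up front.
```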
Note, another optimisation suggested by the docs is to start setting
`ACKS_LATE` on the long running tasks.
This setting would effectively change us from prefetching 1 task per
worker to prefetching 0 tasks per worker and further optimise how we
distribute our tasks across instances. However, we decided not to try
this setting as we weren't sure whether it would conflict with our
visibility_timeout. We decided not to spend the time investigating but
it may be worth revisiting in the future, as long as tasks are
idempotent.
Overall, this commit takes us from potentially having all 18 of our
reporting tasks get fetched onto a single instance to now having a
process that will ensure tasks are distributed more fairly across
instances based on when they have available workers to process the
tasks.
The reporting worker tasks fetch large amounts of data from the db, do
some processing, then store the results back in the database. As the
reporting worker only processes the create nightly billing/stats table
tasks, which aren't performance-critical or high volume, we're fine with
the performance hit from restarting the worker between every task
(which, based on limited local testing, takes about a second or so).
This causes some real funky shit with the app_context (used for
accessing current_app.logger). To access flask's global state we use the
standard way of importing `from flask import current_app`. However,
processes after the first one don't have the current_app available on
shut down (they're fine during the actual task running), and are unable
to call `with current_app.app_context()` to create it. They _are_ able
to call `with app.app_context()` to create it, where `app` is the
initial app that we pass in to `NotifyCelery.init_app`.
NotifyCelery.init_app is only called once, in the master process - I
think the application state is then stored and passed to the celery
workers. It then looks like the teardown might clear it, and it never
gets set up again for the new workers? Unsure.
To fix this, store a copy of the initial flask app on the NotifyCelery
object and then use that from within the shutdown signal logging
function.
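A rough sketch of that fix, with the class layout, signal handler and config
wiring assumed rather than taken from the actual codebase:
```python
from celery import Celery
from celery.signals import worker_process_shutdown
from flask import Flask


class NotifyCelery(Celery):
    def init_app(self, app: Flask):
        # Keep a reference to the original Flask app so shutdown hooks can
        # build an app context even when flask.current_app is unavailable.
        self._flask_app = app
        self.conf.update(app.config)  # assumed: Flask config carries celery settings


notify_celery = NotifyCelery()


@worker_process_shutdown.connect
def log_worker_shutdown(sender=None, pid=None, exitcode=None, **kwargs):
    # Use the stored app rather than current_app, which isn't available on
    # shutdown in worker sub-processes after the first one.
    with notify_celery._flask_app.app_context():
        notify_celery._flask_app.logger.info(
            "Celery worker process %s shut down with exit code %s", pid, exitcode
        )
```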
Nothing's ever easy ¯\_(ツ)_/¯
Instead of saving the email notification to the db, add it to a queue to be saved later.
This is an attempt to alleviate pressure on the db from the api requests.
This initial PR is to trial it and see if we get an improvement in API performance and a reduction in queue pool errors. If we are happy with this, we could remove the hard coding of the service id.
In a nutshell:
- If POST /v2/notification/email is from our high volume service (hard coded for now) then create a notification to send to a queue to persist the notification to the db.
- create a save_api_email task to persist the notification
- return the notification
- New worker app to process the save_api_email tasks.
This should add an extra 400 connections maximum, which will not tip us over the allowable 5000 db connections. And it may help with the queue pool connection errors.
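A very rough sketch of the branching described in the list above; every name
here (the service id, helpers and queue name) is a hypothetical stand-in for
illustration, not taken from the actual code:
```python
# Hypothetical names throughout; this only illustrates the shape of the change.
HIGH_VOLUME_SERVICE_ID = "00000000-0000-0000-0000-000000000000"  # hard coded for now


def post_email_notification(service, validated_request):
    notification = build_notification(service, validated_request)  # assumed helper; not yet persisted

    if str(service.id) == HIGH_VOLUME_SERVICE_ID:
        # Defer the db write: the new worker app picks this task up and persists it.
        save_api_email.apply_async(
            kwargs={"notification": serialise(notification)},  # assumed helpers
            queue="save-api-email",
        )
    else:
        persist_notification(notification)  # existing synchronous path (assumed name)

    return notification
```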
We moved from sending statsd metrics to Hosted Graphite to sending them
to a statsd app running on the PaaS. Therefore we no longer need to send
statsd metrics under a particular prefix to the statsd app, as it is
only receiving statsd metrics from our apps (not from other users, as
would have been the case with Hosted Graphite).
This should not change any behaviour, as the only place the environment
variable was being used was in the gunicorn config, and it was an empty
string, which is the default behaviour anyway as per:
https://docs.gunicorn.org/en/stable/settings.html#statsd-prefix
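For reference, a minimal sketch of what the gunicorn config can look like once
the prefix env var is gone (the file name and the `STATSD_HOST` variable name
are assumptions):
```python
# gunicorn_config.py (sketch)
import os

statsd_host = os.environ.get("STATSD_HOST")  # assumed env var name

# Previously something like: statsd_prefix = os.environ.get("STATSD_PREFIX", "")
# The env var was always an empty string, and gunicorn's default prefix is
# already "", so the setting can simply be dropped.
```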
This was after we saw an instance of the API failing its healthcheck
even though it was still healthy enough to serve requests to users.
This follows the change we've also made to template-preview and admin of
upping the health check timeout. Unlike those, where we set it to 10
seconds, we have been less lenient here and chosen only 2.5 seconds.
This was at the suggestion of Toby from PaaS: the API should generally
have quicker response times, and letting an instance that can't serve
requests successfully stick around for 10 seconds would create more
annoyance for users.
These URLs never change, and having them in the manifest led to
surprising issues where an updated default MMG_URL wasn't actually
respected on PaaS. These URLs aren't private and don't need to be stored
in credentials.
By not defining them in the manifest, we expect the apps to use the
default unless `cf set-env` has been specifically used to modify them
for an app.
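As an illustration of the pattern (the default URL below is a placeholder,
not the real value, and the config wiring is assumed):
```python
import os

# Sketch: the app falls back to the default baked into its config unless the
# environment variable has been set explicitly with `cf set-env`.
MMG_URL = os.environ.get("MMG_URL", "https://example.invalid/mmg")  # placeholder default
```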
we don't use it since we wrote our own provider stubs for performance
tests.
this removes it from the api - it's still in the DB and will be
retrieved by queries, but is set to disabled on prod
Deploys take up to five minutes, during which notify-paas-autoscaler
can't scale the app. We saw 502s due to a large volume of traffic
coming in during that time, and we couldn't react because we were
deploying.
So scale up to 25 instances for deploys; the autoscaler won't be able to
downscale until after the deploy has finished.
- We are running the statsd-exporter on the PaaS now so we can use the
internal UDP route to talk to it
- Only update in preview and staging still so that we can get the
dashboards fully up to date before switching prod
- We are running a statsd exporter on tools to collect all our statsd
metrics for scraping by Prometheus
- Update preview to point there instead of at the local one which has
issues with redeployment and DNS changing
- We've been seeing an issue where, during traffic spikes, the http health
checks take over 1s and PaaS kills the app
- Port health checks won't care about being stuck in a queue so should
continue to work even at high loads
- We have functional tests to catch if a deployment brings up the app
(and so passes port health check) but then doesn't work
- We are running statsd exporter as an app with a public route for
Prometheus to scrape
- This updates preview to send statsd metrics over the CF internal
networking to the statsd exporter
- Removes the sidecar statsd exporters too
all apps get a route assigned when using v3-zdt-push.
> By default, the web process has a route and one instance. Other processes have zero instances by default.
([source](https://docs.cloudfoundry.org/devguide/multiple-processes.html))
When we push apps to multiple environments they need different routes
or the second push will fail, so this means that we need to define
routes ourselves for every app.
We're also manually flagging the health-check as either "http" or
"process" - http for the api, process for all others.
If not specified, the healthcheck is set to `port` by Cloud Foundry. We've
seen some issues with upgrading the deployment from v2 to v3 when using
port: it adds apps to the load balancer when they're not ready, which can
result in 404s. By setting the healthcheck to http it'll wait for the
/status endpoint to return 200, which in turn waits for flask to get
everything up and running properly.
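To illustrate what the http health check polls, a minimal sketch of a status
endpoint (the real /status endpoint in the API does more than this):
```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/status")
def status():
    # Returning 200 here is what tells Cloud Foundry's http health check that
    # the instance is fully up and ready to receive traffic.
    return jsonify(status="ok"), 200
```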