Currently "test_send_letter_notification_via_api" fails at the final
stage in create-fake-letter-response-file [^1]:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=6011): Max retries exceeded with url: /notifications/letter/dvla (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff95ffc460>: Failed to establish a new connection: [Errno 111] Connection refused'))
This only applies when running in Docker, so the default should still
be "localhost" for the Flask app itself.
[^1]: 5093064533/app/celery/research_mode_tasks.py (L57)
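A rough sketch of the fix: read the host from an environment variable, defaulting to "localhost" so nothing changes for a Flask app running directly on the host. The variable and function names here are illustrative, not the actual code in research_mode_tasks.py:

```python
import os

import requests

# Illustrative only: the env var name is an assumption; in Docker it
# would be set to the hostname of the API container.
DVLA_RESPONSE_HOST = os.environ.get("DVLA_RESPONSE_HOST", "localhost")


def send_fake_dvla_callback(reference):
    # defaults to the current behaviour outside Docker
    url = "http://{}:6011/notifications/letter/dvla".format(DVLA_RESPONSE_HOST)
    requests.post(url, json={"reference": reference})
```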
This makes a few changes:
- Make local development consistent with our other apps. It's now
faster to start Celery locally since we don't try to build the
image each time - this is usually quick, but unnecessary.
- Add support for connecting to a local Redis instance. Note that
the previous suggestion of "REDIS = True" was incorrect as this
would be turned into the literal string "True".
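To illustrate the "True" problem: environment variables are always strings, so the app has to parse them explicitly. A minimal sketch, with an assumed variable name and convention:

```python
import os

# "REDIS = True" in a Makefile reaches Python as the string "True",
# which is truthy as a string but never the boolean True. The name
# REDIS_ENABLED and the "1" convention are assumptions for illustration.
REDIS_ENABLED = os.environ.get("REDIS_ENABLED", "0") == "1"
```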
I've also co-located and extended the recipes in the Makefile to
make them a bit more visible.
This means that if the environment variable can't be set (for example,
if you don't have aws-cli installed), then there's a suitable error
message early on.
As a team we primarily develop locally. However, we've been experiencing
issues with pycurl, a subdependency of celery that is notoriously
difficult to install on Mac. On top of the existing issues, we're also
seeing it conflict with pyproj in bizarre ways (the order of imports
between pyproj and pycurl results in different configurations of
dynamically linked C libraries being loaded).
You are encouraged to attempt to install pycurl locally, following these
instructions: https://github.com/alphagov/notifications-manuals/wiki/Getting-Started#pycurl
However, if you aren't having any luck, you can instead now run celery
in a docker container.
`make run-celery-with-docker`
This will build a container, install the dependencies, and run celery
(with the default of four concurrent workers).
It will pull AWS variables from your AWS configuration as boto would
normally, and it will attempt to connect to your local database with the
user `postgres`. If your local database is configured differently (for
example, with a different user, or on a different port), then you can
set SQLALCHEMY_DATABASE_URI locally to override that.
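As a sketch, the fallback logic amounts to something like the following; the default host and database name are assumptions (a container typically can't reach the host's database as plain `localhost`), and SQLALCHEMY_DATABASE_URI always wins if set:

```python
import os

# Illustrative defaults only - not necessarily what the Makefile uses.
SQLALCHEMY_DATABASE_URI = os.environ.get(
    "SQLALCHEMY_DATABASE_URI",
    "postgresql://postgres@host.docker.internal:5432/notification_api",
)
```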
common_functions is full of AWS commands to manipulate workers running
on EC2 instances. We haven't done any of that for years, since we moved
to the PaaS.
delete_sqs_queues contains scripts to get a list of SQS queues and put
their details in a CSV, or take a details CSV and then delete all those
queues.
It's not clear what the use case was for it, but no-one's used it for
years and we can just use the admin console if we really need to.
TL;DR
After a chat with some team members we've decided to double the concurrency of the delivery-worker-reporting app from 2 to 4. Looking at the memory usage during the reporting task runs, we don't believe this to be a risk. There are some other things to look at, but this could be a quick win in the short term.
Longer read:
Every night we have 2 "reporting" tasks that run:
- create-nightly-billing starts at 00:15
  - populates data for ft_billing for the previous days:
    - 4 days for email
    - 4 days for sms
    - 10 days for letters
- create-nightly-notification-status starts at 00:30
  - populates data for ft_notification:
    - 4 days for email
    - 4 days for sms
    - 10 days for letters
These tasks are picked up by the `notify-delivery-worker-reporting` app; we run 3 instances with concurrency = 2.
This means that we have 6 worker threads that pick up the 18 tasks created at 00:15 and 00:30.
Each celery main thread picks up 10 tasks off the queue. The 2 worker threads start working on a task each and acknowledge those tasks to SQS; meanwhile the other 8 tasks wait in the internal celery queue, and no acknowledgement is sent to SQS for them. As each task is completed, a worker picks up a new task from the internal queue and acknowledges it.
If a task is kept in the Celery internal queue for longer than 5 minutes, the SQS visibility timeout expires: SQS assumes the task has not completed and puts it back on the queue, creating a duplicate task.
At some point all the tasks are completed, some of them twice.
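A sketch of the settings behind those numbers, in Celery 3.x style; the values are inferred from the description above rather than copied from our config:

```python
# per instance: 2 worker threads, each main thread prefetching extra tasks
CELERYD_CONCURRENCY = 2
CELERYD_PREFETCH_MULTIPLIER = 4  # the celery default

# tasks held per instance = concurrency * multiplier + concurrency
#                         = 2 * 4 + 2 = 10, matching the behaviour above

# SQS re-delivers any task not acknowledged within this window, which is
# what creates the duplicate tasks described above
BROKER_TRANSPORT_OPTIONS = {"visibility_timeout": 300}  # 5 minutes
```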
Any worker that had `--concurrency` > 4 is now set to 4 for consistency
with the high volume workers.
See previous commit (Reduce concurrency on high volume workers) for
details
We noticed that having high concurrency led to significant memory usage.
The hypothesis is that because of long polling, there are many
connections being held open which seems to impact the memory usage.
Initially the high concurrency was put in place as a way to get around
the lack of long polling: We were spawning multiple processes and each
one was doing many requests to SQS to check for and receive new tasks.
Now with long polling enabled and reduced concurrency, the workers are
much more efficient at their job (the tasks are being picked up so fast
that the queues are practically empty) and much lighter on resource
requirements. (This last bit will allow us to reduce the memory
requirement for heavy workers like the sender and reduce our costs)
The concurrency number was chosen semi-arbitrarily: usually this is set
to the number of CPUs available to the system. Because we're running on
PaaS, where that number is both abstracted away and may be claimed by
other processes, we went for a conservative one, which also reduces the
competition for CPU among the processes of the same worker instance.
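For reference, long polling on the SQS transport is a kombu transport option; a sketch with illustrative values:

```python
BROKER_TRANSPORT_OPTIONS = {
    # long polling: hold each receive request open for up to 20 seconds
    # rather than issuing many short requests that come back empty
    "wait_time_seconds": 20,
    "visibility_timeout": 300,
}
CELERYD_CONCURRENCY = 4  # the new, reduced concurrency
```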
This is more consistent with how we run all other tasks. Note that
the virtual env setup is not generally applicable, and developers
of this repo should follow the guidance in the README.
At the moment, if a service callback fails, it will get put on the retry queue.
This causes a potential problem though:
If a service's callback server goes down, we may generate a lot of retries and
this may then put a lot of items on the retry queue. The retry queue is also
responsible for other important parts of Notify such as retrying message
delivery and we don't want a service's callback server going down to have an
impact on the rest of Notify.
Putting the retries on a different queue means that tasks get processed
faster than if they were put back on the same 'service-callbacks' queue.
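A minimal sketch of the idea; the task body and the queue name 'service-callbacks-retry' are assumptions, not the actual callback code:

```python
import requests
from celery import Celery

celery_app = Celery("notify")


@celery_app.task(bind=True, max_retries=5, default_retry_delay=300)
def send_service_callback(self, callback_url, payload):
    try:
        requests.post(callback_url, json=payload, timeout=5).raise_for_status()
    except requests.RequestException as exc:
        # retry on a dedicated queue, so one service's dead callback
        # server can't crowd out message-delivery retries
        self.retry(exc=exc, queue="service-callbacks-retry")
```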
This worker will be responsible for handling all broadcast tasks.
It is based on the internal worker, which currently handles broadcast
tasks.
Concurrency of 2 has been chosen fairly arbitrarily. Gunicorn will be
running 4 worker processes, so we will end up with the ability to
process 8 tasks per app instance.
There seems to be some kind of complication in this script that doesn't
allow it to terminate properly.
This is being removed for now to allow deploying the rest of the fixes
in time for the holiday period.
We are using our custom logger to log to `NOTIFY_LOG_PATH`, so this
logging from celery is neither needed nor desired.
We also need to define the location of the pidfiles, because of what
appears to be a bug in celery where it uses the location of logs to
infer the location of the pidfiles if it is not defined, i.e. in this
case it was trying to find the pidfiles in `/dev/null/%N.pid`.
When we initially added a new task to persist the notifications for a high volume service, we wanted to implement it as quickly as possible, so we ignored SMS.
This change allows a high volume service to send SMS: the SMS is put on a queue, from which it is persisted and sent, similar to emails.
At this point I haven't added a new application to consume the new save-api-sms-tasks queue. We can either add a separate application, or be happy with how the existing app scales for both email and SMS.
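A sketch of the new task, mirroring the email flow; the task name, queue name and helpers are assumptions based on the description above:

```python
from celery import Celery

celery_app = Celery("notify")


def persist_notification(notification):
    ...  # stands in for the existing db insert


def deliver_sms(notification):
    ...  # stands in for the existing send path


@celery_app.task(name="save-api-sms")
def save_api_sms(notification):
    persist_notification(notification)  # db write, now off the request path
    deliver_sms(notification)           # then hand off for delivery


# the API enqueues instead of writing to the db during the request:
# save_api_sms.apply_async([notification], queue="save-api-sms-tasks")
```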
Normally we check the app's status page to see if migrations need
running. However, if the _status endpoint doesn't respond with 200, we
don't necessarily want to abort the deploy - we may be trying to deploy
a code fix for that status endpoint, for example.
We don't know whether to run the migrations or not, so we err on the
side of caution by re-running the migrations. The migration itself might
be the fix that gets the app working, after all.
We had to do a little song and dance, because sometimes the response
won't be populated before an exception is thrown.
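Roughly, the check ends up looking like this sketch; the function, the `db_version` key and the URL handling are illustrative, not the actual deploy script:

```python
import requests


def migrations_needed(status_url, expected_db_version):
    try:
        response = requests.get(status_url, timeout=10)
    except requests.RequestException as e:
        # the "song and dance": the exception may be raised before any
        # response exists, so e.response can be None
        response = getattr(e, "response", None)

    if response is None or response.status_code != 200:
        # we can't tell whether migrations are needed; err on the side
        # of caution, since re-running them is safe and might be the fix
        return True

    return response.json().get("db_version") != expected_db_version
```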
We have seen the reporting app run out of memory multiple times when
dealing with overnight tasks. The app runs 11 worker threads and we
reduce this to 2 worker threads to put less pressure on a single
instance.
The number 2 was chosen as most of the tasks processed by the reporting
app only take a few minutes, and only one or two usually take more than
an hour. This means that, with 2 processes on each of our current 2
instances, a long running task should hopefully only wait behind a few
short running tasks before being picked up, and therefore we shouldn't
see a large increase in the overall time taken to run all our overnight
reporting tasks.
On top of reducing the concurrency for the reporting app, we also set
CELERYD_PREFETCH_MULTIPLIER=1. We do this as suggested by the celery
docs because this app deals with long running tasks.
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit
The change in prefetch multiplier should again optimise the overall time
it takes to process our tasks, by ensuring that tasks are given to
instances that have (or will soon have) spare workers to deal with them,
rather than committing all the tasks to certain workers in
advance.
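In Celery 3.x config terms, the change is simply (values as described above):

```python
CELERYD_CONCURRENCY = 2          # down from 11 worker threads per instance
CELERYD_PREFETCH_MULTIPLIER = 1  # reserve only one extra task per worker
```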
Note, another optimisation suggested by the docs is to start setting
`ACKS_LATE` on the long running tasks. This setting would effectively
change us from prefetching 1 task per worker to prefetching 0 tasks per
worker, and further optimise how we distribute our tasks across
instances. However, we decided not to try this setting as we weren't
sure whether it would conflict with our visibility_timeout. We decided
not to spend the time investigating, but it may be worth revisiting in
the future, as long as tasks are idempotent.
Overall, this commit takes us from potentially having all 18 of our
reporting tasks get fetched onto a single instance to now having a
process that will ensure tasks are distributed more fairly across
instances based on when they have available workers to process the
tasks.
What this setting does is best described in
https://medium.com/@taylorhughes/three-quick-tips-from-two-years-with-celery-c05ff9d7f9eb#d7ec
This should be useful for the reporting app because tasks run by this
app are long running (many seconds). Ideally this code change will
mean that we are quicker to process the overnight reporting tasks,
so they all finish earlier in the morning (although they are not individually quicker).
This is only being set on the reporting celery app because this
change is trying to do the minimum possible to improve the reliability
and speed of our overnight reporting tasks. It may very well be useful
to set this flag on all our apps, but that should be done with some more
consideration, as some of them deal with much faster tasks (sub-0.5s)
for which it may or may not be appropriate. Proper investigation would
be needed.
Note, the celery docs on this are also worth a read:
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit.
However, the language there can confuse this setting with the prefetch
limit. The distinction is that prefetching grabs items off the queue,
whereas the -Ofair behaviour governs what happens once items have
already been prefetched: whether the master celery process gives them to
the child (worker) processes straight away or not.
Note, this behaviour is default for celery version 4 and above but we
are still on version 3.1.26 so we have to enable it ourselves.
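A sketch of opting in: the flag is `-O fair` on the worker command line; the `worker_main` call below uses the modern (Celery 5) argv form purely as an illustration:

```python
from celery import Celery

celery_app = Celery("notify")

if __name__ == "__main__":
    # equivalent to: celery -A app worker -O fair
    celery_app.worker_main(argv=["worker", "-O", "fair"])
```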
Instead of saving the email notification to the db, add it to a queue to save later.
This is an attempt to alleviate pressure on the db from the api requests.
This initial PR is to trial it and see if we see an improvement in api performance and a reduction in QueuePool errors. If we are happy with this, we could remove the hard coding of the service id.
In a nutshell:
- If POST /v2/notification/email is from our high volume service (hard coded for now), then create a notification and send it to a queue so it can be persisted to the db:
  - create a save_api_email task to persist the notification
  - return the notification
- New worker app to process the save_api_email tasks.
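A minimal sketch of that flow; HIGH_VOLUME_SERVICE_ID, the queue name and the helpers are assumptions standing in for the real code:

```python
import uuid

from celery import Celery

celery_app = Celery("notify")

HIGH_VOLUME_SERVICE_ID = "hard-coded-service-id"  # hard coding removed later


def persist_notification(notification):
    ...  # stands in for the existing db insert


@celery_app.task(name="save-api-email")
def save_api_email(notification):
    persist_notification(notification)  # the db write now happens here


def post_email_notification(form, service_id):
    notification = {"id": str(uuid.uuid4()), "to": form["email_address"]}
    if service_id == HIGH_VOLUME_SERVICE_ID:
        # defer the db write: put a task on a queue instead
        save_api_email.apply_async([notification], queue="save-api-email-tasks")
    else:
        persist_notification(notification)  # existing behaviour
    return notification
```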
By adding `exec` to the entrypoint bash script for the application, we can trap an EXIT from the script and execute our custom `on_exit` method, which checks if the application process is busy before terminating, waiting up to 10 seconds. We don't need to trap `TERM`, so that's been removed again.
Written by:
@servingupaces
@tlwr
When Cloud Foundry applications are to be rescheduled from one cell to
another, or they are stopped, they are sent a SIGTERM signal and 10
seconds later, a SIGKILL signal.
Currently the scripts trap the POSIX-defined EXIT handler, rather than
the signal directly.
In order for the signal to properly be propagated to celery, and the
celery workers, the script should call the on_exit function when
receiving a TERM signal.
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Becca <rebecca.law@digital.cabinet.office.gov.uk>
Co-authored-by: Toby <toby.lornewelch-richards@digital.cabinet.office.gov.uk>
- We are running statsd exporter as an app with a public route for
Prometheus to scrape
- This updates preview to send statsd metrics over the CF internal
networking to the statsd exporter
- Removes the sidecar statsd exporters too
This is so that the retry-tasks queue, which can have quite a lot of
load, has its own worker, and other queues are paired with queues
that flow similarly:
- letter-tasks with create-letters-pdf-tasks
- job-tasks with database-tasks
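As an illustration (worker names invented, queue names from the list above):

```python
# each entry translates to a worker started with
#   celery worker -Q <comma-separated queue names>
WORKER_QUEUES = {
    "worker-retry-tasks": ["retry-tasks"],
    "worker-letters": ["letter-tasks", "create-letters-pdf-tasks"],
    "worker-jobs": ["job-tasks", "database-tasks"],
}
```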