Commit Graph

197 Commits

Author SHA1 Message Date
Cliff Hill
1157f5639d black, isort, flake8
Signed-off-by: Cliff Hill <Clifford.hill@gsa.gov>
2023-12-08 21:43:52 -05:00
Kenneth Kehl
1ecb747c6d reformat 2023-08-29 14:54:30 -07:00
Ryan Ahearn
bec3c53128 Setup newrelic for cloud.gov environments 2023-01-18 09:20:22 -05:00
Ryan Ahearn
d0b2d58b4a Exec web startup command
This allows the app to receive process signals from Cloud Foundry
2022-12-01 11:23:49 -05:00
Ryan Ahearn
fa8707f802 Remove run_app_paas script 2022-11-29 14:19:32 -05:00
Ryan Ahearn
286400aa18 Use only stdout logging in cloud.gov 2022-11-22 12:11:11 -05:00
Ryan Ahearn
5682af3747 Run migrations on the first web instance startup 2022-10-18 12:28:00 -04:00
Ryan Ahearn
b7e2dfa7e3 Remove unused scripts files 2022-10-18 11:54:54 -04:00
Jim Moffet
59b72f4853 add devcontainer configs and docker network orchestration 2022-06-13 13:16:32 -07:00
Ben Thorner
3fab7a0ca9 Fix letter functional tests to work in Docker
Currently "test_send_letter_notification_via_api" fails at the final
stage in create-fake-letter-response-file [^1]:

        requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=6011): Max retries exceeded with url: /notifications/letter/dvla (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff95ffc460>: Failed to establish a new connection: [Errno 111] Connection refused'))

This only applies when running in Docker, so the default should still
be "localhost" for the Flask app itself.

[^1]: 5093064533/app/celery/research_mode_tasks.py (L57)
2022-03-09 11:07:50 +00:00
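A minimal sketch of the kind of config this implies, assuming a hypothetical `API_HOST_NAME` setting that docker-compose can override while keeping "localhost" as the default:

    import os

    class Config:
        # Hypothetical setting: where research-mode tasks post the fake DVLA
        # response file. Defaults to the local Flask app itself; docker-compose
        # can point it at the api container instead (e.g. http://api:6011).
        API_HOST_NAME = os.environ.get("API_HOST_NAME", "http://localhost:6011")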
Ben Thorner
038d47e702 Minor tweaks in response to PR comments 2022-02-25 17:51:53 +00:00
Ben Thorner
c9a9640a4b Iterate local development with Docker
This makes a few changes to:

- Make local development consistent with our other apps. It's now
faster to start Celery locally since we don't try to build the
image each time - this is usually quick, but unnecessary.

- Add support for connecting to a local Redis instance. Note that
the previous suggestion of "REDIS = True" was incorrect as this
would be turned into the literal string "True".

I've also co-located and extended the recipes in the Makefile to
make them a bit more visible.
2022-02-24 17:15:41 +00:00
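The "literal string" problem is the usual environment-variable pitfall; a minimal sketch of parsing a boolean setting safely (the `REDIS_ENABLED` name is illustrative, not necessarily the repo's):

    import os

    def env_bool(name, default=False):
        # Environment variables are always strings, so "True" has to be parsed:
        # relying on truthiness is wrong (bool("False") is also True).
        value = os.environ.get(name)
        if value is None:
            return default
        return value.strip().lower() in ("1", "true", "yes", "on")

    REDIS_ENABLED = env_bool("REDIS_ENABLED")  # illustrative setting name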
Leo Hemsted
39848e6df0 move environment variables to their own lines and set -eu
this means that if the environment variable can't be set (for example,
if you don't have aws-cli installed) then there's a suitable error
message early on.
2022-02-01 16:29:09 +00:00
Leo Hemsted
1f3785a7a3 add script to run celery from within docker
as a team we primarily develop locally. However, we've been experiencing
issues with pycurl, a subdependency of celery, which is notoriously
difficult to install on mac. On top of the existing issues, we're also
seeing it conflict with pyproj in bizarre ways (where the order of
imports between pyproj and pycurl results in different configurations of
dynamically linked C libraries being loaded).

You are encouraged to attempt to install pycurl locally, following these
instructions: https://github.com/alphagov/notifications-manuals/wiki/Getting-Started#pycurl

However, if you aren't having any luck, you can instead now run celery
in a docker container.

`make run-celery-with-docker`

This will build a container, install the dependencies, and run celery
(with the default of four concurrent workers).

It will pull aws variables from your aws configuration as boto would
normally, and it will attempt to connect to your local database with the
user `postgres`. If your local database is configured differently (for
example, with a different user, or on a different port), then you can
set the SQLALCHEMY_DATABASE_URI locally to override that.
2022-02-01 16:29:08 +00:00
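As a sketch, the default connection described above might look like this, with the environment variable taking precedence (the default host and database name are illustrative):

    import os

    # Default to a local postgres database with the `postgres` user; setting
    # SQLALCHEMY_DATABASE_URI in the environment overrides the user, host,
    # port or database name.
    SQLALCHEMY_DATABASE_URI = os.environ.get(
        "SQLALCHEMY_DATABASE_URI",
        "postgresql://postgres@localhost:5432/notification_api",
    )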
Leo Hemsted
d916b07e80 remove old unused scripts
common_functions is full of AWS commands to manipulate workers running
on ec2 instances. We haven't done any of that for years, since we moved
to PaaS.

delete_sqs_queues contains scripts to get a list of sqs queues and put
their details in a csv, or take a details csv and then delete all those
queues.

it's not clear what the use-case was for it but no-one's used it for
years and we can just use the admin console if we really need to.
2021-12-14 14:02:28 +00:00
Rebecca Law
e7efeec309 Increase the concurrency for the delivery-worker-reporting
TL;DR
After a chat with some team members we've decided to double the concurrency of the delivery-worker-reporting app from 2 to 4. Looking at the memory usage during the reporting task runs, we don't believe this to be a risk. There are some other things to look at, but this could be a quick win in the short term.

Longer read:
Every night we have 2 "reporting" tasks that run.
- create-nightly-billing starts at 00:15
  - populates data for ft_billing for the previous days.
  - 4 days for email
  - 4 days for sms
  - 10 days for letters
- create-nightly-notification-status starts at 00:30
  - populates data for ft_notification
  - 4 days for email
  - 4 days for sms
  - 10 days for letters

These tasks are picked up by the `notify-delivery-worker-reporting` app; we run 3 instances with a concurrency of 2.
This means we have 6 worker threads to pick up the 18 tasks created at 00:15 and 00:30.
Each Celery main process picks up 10 tasks off the queue; its 2 worker threads start working on a task each and acknowledge those tasks to SQS. Meanwhile the other 8 tasks wait in the internal Celery queue and no acknowledgement is sent to SQS for them. As each task completes, a worker picks up a new task and acknowledges it.
If a task is kept in the Celery internal queue for longer than 5 minutes, the SQS visibility timeout assumes the task has not completed and puts the message back on the queue, creating a duplicate task.
At some point all the tasks are completed, but some are completed twice.
2021-12-01 11:40:18 +00:00
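A minimal sketch of the settings at play, using the Celery 3.x-era names for an SQS broker (values illustrative; the repo may pass concurrency via a --concurrency flag instead):

    BROKER_TRANSPORT_OPTIONS = {
        # A task that sits unacknowledged in the internal queue past this limit
        # becomes visible on SQS again, producing the duplicate described above.
        "visibility_timeout": 300,  # seconds, i.e. the 5 minutes mentioned above
    }
    CELERYD_CONCURRENCY = 4  # doubled from 2; 3 instances x 4 = 12 worker threads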
David McDonald
782aef351c Merge pull request #3369 from alphagov/remove-o-fair
Remove -Ofair option from celery worker
2021-11-16 11:49:52 +00:00
Ben Thorner
82e4c3dad2 Reduce concurrency to match number of CPUs
This got missed in [1].

[1]: 9e9091e980
2021-11-15 16:45:05 +00:00
David McDonald
c646176594 Remove -Ofair option from celery worker
In version 4.0 of celery, -Ofair became the default
scheduling strategy:
https://docs.celeryproject.org/en/latest/history/whatsnew-4.0.html?highlight=fair#ofair-is-now-the-default-scheduling-strategy

This appears to still be the case:
5d68d781de/celery/concurrency/asynpool.py (L80)

Note, it took me a while to be certain of this as the documentation
for the celery CLI suggests a choice of `default` or `fair` which
isn't so useful as both of these are `fair`:
https://docs.celeryproject.org/en/latest/reference/cli.html#cmdoption-celery-worker-O
2021-11-15 11:52:57 +00:00
sakisv
9e9091e980 Reduce concurrency for other workers too for consistency
Any worker that had `--concurrency` > 4 is now set to 4, for consistency
with the high volume workers.

See previous commit (Reduce concurrency on high volume workers) for
details
2021-11-04 16:31:22 +02:00
sakisv
92086e2090 Reduce concurrency on high volume workers
We noticed that having high concurrency led to significant memory usage.

The hypothesis is that because of long polling, there are many
connections being held open which seems to impact the memory usage.

Initially the high concurrency was put in place as a way to get around
the lack of long polling: We were spawning multiple processes and each
one was doing many requests to SQS to check for and receive new tasks.

Now with long polling enabled and reduced concurrency, the workers are
much more efficient at their job (the tasks are being picked up so fast
that the queues are practically empty) and much lighter on resource
requirements. (This last bit will allow us to reduce the memory
requirement for heavy workers like the sender and reduce our costs)

The concurrency number was chosen semi-arbitrarily: usually this is set
to the number of CPUs available to the system. Because we're running on
PaaS, where that number is both abstracted and may be claimed by other
processes, we went for a conservative one to also reduce the competition
for CPU among the processes of the same worker instance.
2021-11-04 11:38:05 +02:00
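A sketch of the long-polling configuration this relies on, using the option names accepted by kombu's SQS transport (values are illustrative):

    BROKER_TRANSPORT_OPTIONS = {
        "wait_time_seconds": 20,   # long polling: hold each SQS receive call open
        "visibility_timeout": 300,
    }
    CELERYD_CONCURRENCY = 4        # conservative: roughly one process per CPU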
Ben Thorner
149976bfab Run tests directly from the Makefile
Contributes to: https://github.com/alphagov/notifications-manuals/issues/9

Precedent: https://github.com/alphagov/notifications-admin/pull/3897
2021-06-16 13:05:55 +01:00
Ben Thorner
321b4913ed Enforce consistency in imports as part of build
This copies the config we use in the admin app, with a few changes
as discussed in the PR [1]. We'll apply these to our other apps.

[1]: https://github.com/alphagov/notifications-api/pull/3175#issuecomment-795530323
2021-03-12 11:45:21 +00:00
Ben Thorner
af95ad68ea Move bootstrap tasks into the Makefile
This is more consistent with how we run all other tasks. Note that
the virtual env setup is not generally applicable, and developers
of this repo should follow the guidance in the README.
2021-02-18 09:01:32 +00:00
Ben Thorner
dc6fb1d1f2 Remove unused single test script 2021-02-18 09:01:31 +00:00
Ben Thorner
ba4d399982 Switch to 'make' for running app processes
These are simple enough that they don't need their own scripts.
2021-02-18 09:01:26 +00:00
Katie Smith
5eebcf6452 Put service callback retries on a different queue
At the moment, if a service callback fails, it will get put on the retry queue.
This causes a potential problem though:

If a service's callback server goes down, we may generate a lot of retries and
this may then put a lot of items on the retry queue. The retry queue is also
responsible for other important parts of Notify such as retrying message
delivery and we don't want a service's callback server going down to have an
impact on the rest of Notify.

Putting the retries on a different queue means that tasks get processed
faster than if they were put back on the same 'service-callbacks' queue.
2021-02-09 13:31:16 +00:00
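A minimal sketch of routing the retries to a dedicated queue (task, queue and timing values are illustrative, not the repo's actual ones):

    import requests
    from celery import shared_task

    @shared_task(bind=True, max_retries=5, name="send-delivery-status")
    def send_delivery_status_to_service(self, callback_url, payload):
        try:
            response = requests.post(callback_url, json=payload, timeout=5)
            response.raise_for_status()
        except requests.RequestException as exc:
            # Extra keyword arguments to retry() are passed on to apply_async,
            # so the retried task lands on its own queue rather than the shared
            # retry queue used for message delivery.
            self.retry(exc=exc, countdown=300, queue="service-callbacks-retry")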
David McDonald
78db0f9c2b Add broadcasts worker and queue
This worker will be responsible for handling all broadcast tasks.

It is based on the internal worker which is currently handling broadcast
tasks.

Concurrency of 2 has been chosen fairly arbitrarily. Gunicorn will be
running 4 worker processes, so we will end up with the ability to process
8 tasks per app instance.
2021-01-13 16:35:27 +00:00
sakisv
9bb9070ba0 Add disk space check for sender worker
Reused the existing `ensure_celery_is_running` function to terminate the
script
2021-01-04 14:01:19 +02:00
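The actual check lives in the worker startup shell scripts; purely as an illustration of the idea in Python (function name and threshold are made up):

    import shutil
    import sys

    def ensure_enough_disk_space(path="/", min_free_bytes=1 * 1024 ** 3):
        # Terminate rather than let the worker fill the disk completely.
        free = shutil.disk_usage(path).free
        if free < min_free_bytes:
            sys.exit("only {} bytes of disk left on {}, terminating".format(free, path))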
sakisv
1bfdac8417 Temporarily remove disk space check from multi_worker script
There seems to be some kind of complication in this script that doesn't
allow it to terminate properly.

This is being removed for now to allow deploying the rest of the fixes
in time for the holiday period.
2020-12-24 18:44:26 +02:00
sakisv
a6ecfd66b6 Terminate instance if it's running out of disk space 2020-12-23 19:40:04 +02:00
sakisv
2108498eb1 Send worker-sender celery logs to /dev/null
We are using our custom logger to log to `NOTIFY_LOG_PATH`, so this
logging from celery is neither needed nor desired.

We also need to define the location of the pidfiles, because of what
appears to be a bug in celery where it uses the location of logs to
infer the location of the pidfiles if it is not defined, i.e. in this
case it was trying to find the pidfiles in `/dev/null/%N.pid`.
2020-12-23 19:39:56 +02:00
Rebecca Law
4a9eca3dff Remove space between queue names. Doh 2020-10-29 11:14:11 +00:00
Rebecca Law
29b6f84f6c Revert "Revert "Add a task to save-api-sms for high volume services."" 2020-10-29 11:12:46 +00:00
Rebecca Law
06ff1bf596 Revert "Add a task to save-api-sms for high volume services." 2020-10-27 16:18:57 +00:00
Rebecca Law
3dee4ad310 Add a task to save-api-sms for high volume services.
When we initially added a new task to persist the notifications for a high volume service we wanted to implement it as quickly as possible, so we ignored SMS.
This change allows a high volume service to send SMS: the SMS is sent to a queue and then persisted and sent, similar to emails.

At this point I haven't added a new application to consume the new save-api-sms-tasks. We can either add a separate application or be happy with how the existing app scales for both email and sms.
2020-10-26 13:09:37 +00:00
Leo Hemsted
b01ec05aaf run migrations if app is down
normally we check the app's status page to see if migrations need
running. However, if the _status endpoint doesn't respond with 200, we
don't necessarily want to abort the deploy - we may be trying to deploy
a code fix that fixes that status endpoint for example.

We don't know whether to run the migrations or not, so err on the side
of caution by re-running the migration. The migration itself might be
the fix that gets the app working after all.

had to do a little song and dance because sometimes the response won't
be populated before an exception is thrown
2020-06-26 15:28:28 +01:00
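A sketch of that decision, with illustrative names (the healthy-200 path, which decides from the status response itself, is elided here):

    import requests

    def should_run_migrations(status_url):
        try:
            response = requests.get(status_url, timeout=10)
        except requests.RequestException:
            return True   # no response at all - the migration may be the fix
        if response.status_code != 200:
            return True   # status endpoint broken - err on the side of caution
        # Healthy app: decide from the status response (comparison of the
        # reported db version against the migration head omitted here).
        return False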
David McDonald
a237162106 Reduce concurrency and prefetch count of reporting celery app
We have seen the reporting app run out of memory multiple times when
dealing with overnight tasks. The app runs 11 worker threads and we
reduce this to 2 worker threads to put less pressure on a single
instance.

The number 2 was chosen as most of the tasks processed by the reporting
app only take a few minutes and only one or two usually take more than
an hour. This would mean with 2 processes across our current 2
instances, a long running task should hopefully only wait behind a few
short running tasks before being picked up and therefore we shouldn't
see a large increase in the overall time taken to run all our overnight
reporting tasks.

On top of reducing the concurrency for the reporting app, we also set
CELERYD_PREFETCH_MULTIPLIER=1. We do this as suggested by the celery
docs because this app deals with long running tasks.
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit

The change in prefetch multiplier should again optimise the overall time
it takes to process our tasks by ensuring that tasks are given to
instances that have (or will soon have) spare workers to deal with them,
rather than committing to putting all the tasks on certain workers in
advance.

Note, another optimisation suggested by the docs is to start setting
`ACKS_LATE` on the long running tasks.
This setting would effectively change us from prefetching 1 task per
worker to prefetching 0 tasks per worker and further optimise how we
distribute our tasks across instances. However, we decided not to try
this setting as we weren't sure whether it would conflict with our
visibility_timeout. We decided not to spend the time investigating but
it may be worth revisiting in the future, as long as tasks are
idempotent.

Overall, this commit takes us from potentially having all 18 of our
reporting tasks get fetched onto a single instance to now having a
process that will ensure tasks are distributed more fairly across
instances based on when they have available workers to process the
tasks.
2020-04-28 10:47:46 +01:00
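In Celery 3.x settings terms, the change described above amounts to something like this (these are the 3.x-era setting names; the repo may pass concurrency on the worker command line instead):

    CELERYD_CONCURRENCY = 2           # down from 11 worker processes per instance
    CELERYD_PREFETCH_MULTIPLIER = 1   # reserve only one task per worker process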
David McDonald
5d88a1dbf4 Add -Ofair setting to reporting celery app
What this setting does is best described in
https://medium.com/@taylorhughes/three-quick-tips-from-two-years-with-celery-c05ff9d7f9eb#d7ec

This should be useful for the reporting app because tasks run by this
app are long running (many seconds). Ideally this code change will
mean that we are quicker to process the overnight reporting tasks,
so they all finish earlier in the morning (although are not individually quicker).

This is only being set on the reporting celery app because this
change is trying to do the minimum possible to improve the reliability and speed
of our overnight reporting tasks. It may very well be useful to set this
flag on all our apps, but this should be done with some more
consideration as some of them deal with much faster tasks (sub
0.5s) and so it may or may not still be appropriate. Proper
investigation would be needed.

Note, the celery docs on this are also worth a read:
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit.
However, the language can make it easy to confuse this setting with the prefetch
limit. The distinction is that prefetch grabs items off the queue, whereas the
-Ofair behaviour is about what happens once items have already been prefetched:
whether the master celery process straight away gives them to the child
(worker) processes or not.

Note, this behaviour is default for celery version 4 and above but we
are still on version 3.1.26 so we have to enable it ourselves.
2020-04-28 10:42:42 +01:00
Rebecca Law
db4b4d929d - If the task runs twice and the notification already exists, ignore the primary key constraint violation.
- Remove prints
- Add some more tests
- Only allow the new method to run for emails
2020-03-25 12:39:15 +00:00
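A minimal sketch of the duplicate-row handling the first bullet describes, assuming SQLAlchemy (session and function names are illustrative):

    from sqlalchemy.exc import IntegrityError

    def persist_notification(db_session, notification):
        try:
            db_session.add(notification)
            db_session.commit()
        except IntegrityError:
            # The task ran twice and the row already exists: roll back and
            # carry on instead of failing the task.
            db_session.rollback()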
Rebecca Law
a13bcc6697 Reduce the pressure on the db for API post email requests.
Instead of saving the email notification to the db, add it to a queue to save later.
This is an attempt to alleviate pressure on the db from the api requests.
This initial PR is to trial it and see if we see an improvement in api performance and a reduction in queue pool errors. If we are happy with this we could remove the hard coding of the service id.

In a nutshell:
 - If POST /v2/notification/email is from our high volume service (hard coded for now) then create a notification to send to a queue to persist the notification to the db.
 - create a save_api_email task to persist the notification
 - return the notification
 - New worker app to process the save_api_email tasks.
2020-03-25 07:59:05 +00:00
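A sketch of the flow in the bullets above (task and helper names are hypothetical stand-ins, not the repo's actual functions):

    from celery import shared_task

    @shared_task(name="save-api-email")
    def save_api_email(encrypted_notification):
        # Runs on the new worker app, off the API request path.
        notification = decrypt(encrypted_notification)      # hypothetical helper
        persist_notification(notification)                  # hypothetical helper
        send_notification_to_queue(notification)            # hypothetical helper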
Katie Smith
3a07d1e13d Create new sms-callbacks queue
The `delivery-worker-receipts` app will listen to this new queue, which will
be used for processing the responses from Firetext and MMG.
2020-03-19 13:41:14 +00:00
Leo Hemsted
a97948574a remove unused pytest flags 2020-02-13 12:52:12 +00:00
Rebecca Law
fe18512dd2 Change how the bash script is started.
By adding `exec` to the entrypoint bash script for the application, we can trap an EXIT from the script and execute our custom `on_exit` method, which checks if the application process is busy before terminating, waiting up to 10 seconds. We don't need to trap `TERM` so that's been removed again.

Written by:
@servingupaces
@tlwr
2019-10-31 16:41:16 +00:00
Toby Lorne
4ad2e30e52 Catch the TERM signal in the run_app_*paas scripts
When Cloud Foundry applications are to be rescheduled from one cell to
another, or they are stopped, they are sent a SIGTERM signal and 10
seconds later, a SIGKILL signal.

Currently the scripts trap the POSIX defined EXIT handler, rather than
the signal directly.

In order for the signal to properly be propagated to celery, and the
celery workers, the script should call the on_exit function when
receiving a TERM signal.

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Becca <rebecca.law@digital.cabinet.office.gov.uk>
Co-authored-by: Toby <toby.lornewelch-richards@digital.cabinet.office.gov.uk>
2019-10-29 17:21:16 +00:00
Leo Hemsted
3a0bf2b23e Add reporting worker
also remove references to unused statistics queue
2019-08-15 16:42:15 +01:00
Andy Paine
655d5a4e16 AUTO-413: Use an internal app for statsd preview
- We are running statsd exporter as an app with a public route for
  Prometheus to scrape
- This updates preview to send statsd metrics over the CF internal
  networking to the statsd exporter
- Removes the sidecar statsd exporters too
2019-05-23 11:10:33 +01:00
Pea Tyczynska
b59bca0fc2 Rename workers so they are less wordy xd 2019-05-01 14:51:43 +01:00
Pea Tyczynska
6163ca8b45 Change distribution of queues among notify delivery workers
This is so that the retry-tasks queue, which can have quite a lot of
load, has its own worker, and other queues are paired with queues
that flow similarly:
- letter-tasks with create-letters-pdf-tasks
- job-tasks with database-tasks
2019-04-30 12:03:06 +01:00
Alexey Bezhan
570cbc3eab Add statsd_exporter to app PaaS startup scripts
`statsd_exporter` is only started if `STATSD_HOST` is set to `localhost`.
2019-04-24 13:50:13 +01:00