Currently "test_send_letter_notification_via_api" fails at the final
stage in create-fake-letter-response-file [^1]:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=6011): Max retries exceeded with url: /notifications/letter/dvla (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff95ffc460>: Failed to establish a new connection: [Errno 111] Connection refused'))
This only applies when running in Docker, so the default should still
be "localhost" for the Flask app itself.
[^1]: 5093064533/app/celery/research_mode_tasks.py (L57)
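A rough sketch of the fix: read the host from an environment variable, defaulting to "localhost" so nothing changes for a Flask app running directly on the host. The variable and function names here are illustrative, not the actual code in research_mode_tasks.py:

```python
import os

import requests

# Illustrative only: the env var name is an assumption; in Docker it
# would be set to the hostname of the API container.
DVLA_RESPONSE_HOST = os.environ.get("DVLA_RESPONSE_HOST", "localhost")


def send_fake_dvla_callback(reference):
    # defaults to the current behaviour outside Docker
    url = "http://{}:6011/notifications/letter/dvla".format(DVLA_RESPONSE_HOST)
    requests.post(url, json={"reference": reference})
```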
This makes a few changes:
- Make local development consistent with our other apps. It's now
faster to start Celery locally since we don't try to build the
image each time - this is usually quick, but unnecessary.
- Add support for connecting to a local Redis instance. Note that
the previous suggestion of "REDIS = True" was incorrect as this
would be turned into the literal string "True".
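To illustrate the "True" problem: environment variables are always strings, so the app has to parse them explicitly. A minimal sketch, with an assumed variable name and convention:

```python
import os

# "REDIS = True" in a Makefile reaches Python as the string "True",
# which is truthy as a string but never the boolean True. The name
# REDIS_ENABLED and the "1" convention are assumptions for illustration.
REDIS_ENABLED = os.environ.get("REDIS_ENABLED", "0") == "1"
```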
I've also co-located and extended the recipes in the Makefile to
make them a bit more visible.
This means that if the environment variable can't be set (for example,
if you don't have aws-cli installed), then there's a suitable error
message early on.
As a team we primarily develop locally. However, we've been experiencing
issues with pycurl, a subdependency of celery that is notoriously
difficult to install on Mac. On top of the existing issues, we're also
seeing it conflict with pyproj in bizarre ways (the order of imports
between pyproj and pycurl results in different configurations of
dynamically linked C libraries being loaded).
You are encouraged to attempt to install pycurl locally, following these
instructions: https://github.com/alphagov/notifications-manuals/wiki/Getting-Started#pycurl
However, if you aren't having any luck, you can instead now run celery
in a docker container.
`make run-celery-with-docker`
This will build a container, install the dependencies, and run celery
(with the default of four concurrent workers).
It will pull AWS variables from your AWS configuration as boto would
normally, and it will attempt to connect to your local database with the
user `postgres`. If your local database is configured differently (for
example, with a different user, or on a different port), then you can
set SQLALCHEMY_DATABASE_URI locally to override that.
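As a sketch, the fallback logic amounts to something like the following; the default host and database name are assumptions (a container typically can't reach the host's database as plain `localhost`), and SQLALCHEMY_DATABASE_URI always wins if set:

```python
import os

# Illustrative defaults only - not necessarily what the Makefile uses.
SQLALCHEMY_DATABASE_URI = os.environ.get(
    "SQLALCHEMY_DATABASE_URI",
    "postgresql://postgres@host.docker.internal:5432/notification_api",
)
```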
common_functions is full of AWS commands to manipulate workers running
on EC2 instances. We haven't done any of that for years, since we moved
to the PaaS.
delete_sqs_queues contains scripts to get a list of SQS queues and put
their details in a CSV, or take a details CSV and then delete all those
queues.
It's not clear what the use case was for it, but no-one's used it for
years and we can just use the admin console if we really need to.
TL;DR
After a chat with some team members we've decided to double the concurrency of the delivery-worker-reporting app from 2 to 4. Looking at the memory usage during the reporting task runs, we don't believe this to be a risk. There are some other things to look at, but this could be a quick win in the short term.
Longer read:
Every night we have 2 "reporting" tasks that run:
- create-nightly-billing starts at 00:15
  - populates data for ft_billing for the previous days:
    - 4 days for email
    - 4 days for sms
    - 10 days for letters
- create-nightly-notification-status starts at 00:30
  - populates data for ft_notification:
    - 4 days for email
    - 4 days for sms
    - 10 days for letters
These tasks are picked up by the `notify-delivery-worker-reporting` app; we run 3 instances with concurrency = 2.
This means that we have 6 worker threads that pick up the 18 tasks created at 00:15 and 00:30.
Each celery main thread picks up 10 tasks off the queue. The 2 worker threads start working on a task each and acknowledge those tasks to SQS; meanwhile the other 8 tasks wait in the internal celery queue, and no acknowledgement is sent to SQS for them. As each task is completed, a worker picks up a new task from the internal queue and acknowledges it.
If a task is kept in the Celery internal queue for longer than 5 minutes, the SQS visibility timeout expires: SQS assumes the task has not completed and puts it back on the queue, creating a duplicate task.
At some point all the tasks are completed, some of them twice.
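A sketch of the settings behind those numbers, in Celery 3.x style; the values are inferred from the description above rather than copied from our config:

```python
# per instance: 2 worker threads, each main thread prefetching extra tasks
CELERYD_CONCURRENCY = 2
CELERYD_PREFETCH_MULTIPLIER = 4  # the celery default

# tasks held per instance = concurrency * multiplier + concurrency
#                         = 2 * 4 + 2 = 10, matching the behaviour above

# SQS re-delivers any task not acknowledged within this window, which is
# what creates the duplicate tasks described above
BROKER_TRANSPORT_OPTIONS = {"visibility_timeout": 300}  # 5 minutes
```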
Any worker that had `--concurrency` > 4 is now set to 4 for consistency
with the high volume workers.
See previous commit (Reduce concurrency on high volume workers) for
details
We noticed that having high concurrency led to significant memory usage.
The hypothesis is that because of long polling, there are many
connections being held open which seems to impact the memory usage.
Initially the high concurrency was put in place as a way to get around
the lack of long polling: We were spawning multiple processes and each
one was doing many requests to SQS to check for and receive new tasks.
Now with long polling enabled and reduced concurrency, the workers are
much more efficient at their job (the tasks are being picked up so fast
that the queues are practically empty) and much lighter on resource
requirements. (This last bit will allow us to reduce the memory
requirement for heavy workers like the sender and reduce our costs)
The concurrency number was chosen semi-arbitrarily: usually this is set
to the number of CPUs available to the system. Because we're running on
PaaS, where that number is both abstracted away and may be claimed by
other processes, we went for a conservative one, which also reduces the
competition for CPU among the processes of the same worker instance.
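For reference, long polling on the SQS transport is a kombu transport option; a sketch with illustrative values:

```python
BROKER_TRANSPORT_OPTIONS = {
    # long polling: hold each receive request open for up to 20 seconds
    # rather than issuing many short requests that come back empty
    "wait_time_seconds": 20,
    "visibility_timeout": 300,
}
CELERYD_CONCURRENCY = 4  # the new, reduced concurrency
```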
This is more consistent with how we run all other tasks. Note that
the virtual env setup is not generally applicable, and developers
of this repo should follow the guidance in the README.
At the moment, if a service callback fails, it will get put on the retry queue.
This causes a potential problem though:
If a service's callback server goes down, we may generate a lot of retries and
this may then put a lot of items on the retry queue. The retry queue is also
responsible for other important parts of Notify such as retrying message
delivery and we don't want a service's callback server going down to have an
impact on the rest of Notify.
Putting the retries on a different queue means that tasks get processed
faster than if they were put back on the same 'service-callbacks' queue.
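A minimal sketch of the idea; the task body and the queue name 'service-callbacks-retry' are assumptions, not the actual callback code:

```python
import requests
from celery import Celery

celery_app = Celery("notify")


@celery_app.task(bind=True, max_retries=5, default_retry_delay=300)
def send_service_callback(self, callback_url, payload):
    try:
        requests.post(callback_url, json=payload, timeout=5).raise_for_status()
    except requests.RequestException as exc:
        # retry on a dedicated queue, so one service's dead callback
        # server can't crowd out message-delivery retries
        self.retry(exc=exc, queue="service-callbacks-retry")
```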
This worker will be responsible for handling all broadcast tasks.
It is based on the internal worker, which currently handles broadcast
tasks.
Concurrency of 2 has been chosen fairly arbitrarily. Gunicorn will be
running 4 worker processes, so we will end up with the ability to
process 8 tasks per app instance.
There seems to be some kind of complication in this script that doesn't
allow it to terminate properly.
This is being removed for now to allow deploying the rest of the fixes
in time for the holiday period.
We are using our custom logger to log to `NOTIFY_LOG_PATH`, so this
logging from celery is neither needed nor desired.
We also need to define the location of the pidfiles, because of what
appears to be a bug in celery where it uses the location of logs to
infer the location of the pidfiles if it is not defined, i.e. in this
case it was trying to find the pidfiles in `/dev/null/%N.pid`.
When we initially added a new task to persist the notifications for a high volume service, we wanted to implement it as quickly as possible, so we ignored SMS.
This change allows a high volume service to send SMS: the SMS is put on a queue, from which it is persisted and sent, similar to emails.
At this point I haven't added a new application to consume the new save-api-sms-tasks queue. We can either add a separate application, or be happy with how the existing app scales for both email and SMS.
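A sketch of the new task, mirroring the email flow; the task name, queue name and helpers are assumptions based on the description above:

```python
from celery import Celery

celery_app = Celery("notify")


def persist_notification(notification):
    ...  # stands in for the existing db insert


def deliver_sms(notification):
    ...  # stands in for the existing send path


@celery_app.task(name="save-api-sms")
def save_api_sms(notification):
    persist_notification(notification)  # db write, now off the request path
    deliver_sms(notification)           # then hand off for delivery


# the API enqueues instead of writing to the db during the request:
# save_api_sms.apply_async([notification], queue="save-api-sms-tasks")
```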
Normally we check the app's status page to see if migrations need
running. However, if the _status endpoint doesn't respond with 200, we
don't necessarily want to abort the deploy - we may be trying to deploy
a code fix for that status endpoint, for example.
We don't know whether to run the migrations or not, so we err on the
side of caution by re-running the migrations. The migration itself might
be the fix that gets the app working, after all.
We had to do a little song and dance, because sometimes the response
won't be populated before an exception is thrown.
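Roughly, the check ends up looking like this sketch; the function, the `db_version` key and the URL handling are illustrative, not the actual deploy script:

```python
import requests


def migrations_needed(status_url, expected_db_version):
    try:
        response = requests.get(status_url, timeout=10)
    except requests.RequestException as e:
        # the "song and dance": the exception may be raised before any
        # response exists, so e.response can be None
        response = getattr(e, "response", None)

    if response is None or response.status_code != 200:
        # we can't tell whether migrations are needed; err on the side
        # of caution, since re-running them is safe and might be the fix
        return True

    return response.json().get("db_version") != expected_db_version
```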
We have seen the reporting app run out of memory multiple times when
dealing with overnight tasks. The app runs 11 worker threads and we
reduce this to 2 worker threads to put less pressure on a single
instance.
The number 2 was chosen as most of the tasks processed by the reporting
app only take a few minutes, and only one or two usually take more than
an hour. This means that, with 2 processes on each of our current 2
instances, a long running task should hopefully only wait behind a few
short running tasks before being picked up, and therefore we shouldn't
see a large increase in the overall time taken to run all our overnight
reporting tasks.
On top of reducing the concurrency for the reporting app, we also set
CELERYD_PREFETCH_MULTIPLIER=1. We do this as suggested by the celery
docs because this app deals with long running tasks.
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit
The change in prefetch multiplier should again optimise the overall time
it takes to process our tasks, by ensuring that tasks are given to
instances that have (or will soon have) spare workers to deal with them,
rather than committing all the tasks to certain workers in
advance.
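In Celery 3.x config terms, the change is simply (values as described above):

```python
CELERYD_CONCURRENCY = 2          # down from 11 worker threads per instance
CELERYD_PREFETCH_MULTIPLIER = 1  # reserve only one extra task per worker
```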
Note, another optimisation suggested by the docs is to start setting
`ACKS_LATE` on the long running tasks. This setting would effectively
change us from prefetching 1 task per worker to prefetching 0 tasks per
worker, and further optimise how we distribute our tasks across
instances. However, we decided not to try this setting as we weren't
sure whether it would conflict with our visibility_timeout. We decided
not to spend the time investigating, but it may be worth revisiting in
the future, as long as tasks are idempotent.
Overall, this commit takes us from potentially having all 18 of our
reporting tasks get fetched onto a single instance to now having a
process that will ensure tasks are distributed more fairly across
instances based on when they have available workers to process the
tasks.
What this setting does is best described in
https://medium.com/@taylorhughes/three-quick-tips-from-two-years-with-celery-c05ff9d7f9eb#d7ec
This should be useful for the reporting app because tasks run by this
app are long running (many seconds). Ideally this code change will
mean that we are quicker to process the overnight reporting tasks,
so they all finish earlier in the morning (although they are not individually quicker).
This is only being set on the reporting celery app because this
change is trying to do the minimum possible to improve the reliability
and speed of our overnight reporting tasks. It may very well be useful
to set this flag on all our apps, but that should be done with some more
consideration, as some of them deal with much faster tasks (sub-0.5s)
for which it may or may not be appropriate. Proper investigation would
be needed.
Note, the celery docs on this are also worth a read:
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit.
However, the language there can confuse this setting with the prefetch
limit. The distinction is that prefetching grabs items off the queue,
whereas the -Ofair behaviour governs what happens once items have
already been prefetched: whether the master celery process gives them to
the child (worker) processes straight away or not.
Note, this behaviour is default for celery version 4 and above but we
are still on version 3.1.26 so we have to enable it ourselves.
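A sketch of opting in: the flag is `-O fair` on the worker command line; the `worker_main` call below uses the modern (Celery 5) argv form purely as an illustration:

```python
from celery import Celery

celery_app = Celery("notify")

if __name__ == "__main__":
    # equivalent to: celery -A app worker -O fair
    celery_app.worker_main(argv=["worker", "-O", "fair"])
```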
Instead of saving the email notification to the db, add it to a queue to save later.
This is an attempt to alleviate pressure on the db from the api requests.
This initial PR is to trial it and see if we see an improvement in api performance and a reduction in QueuePool errors. If we are happy with this, we could remove the hard coding of the service id.
In a nutshell:
- If POST /v2/notification/email is from our high volume service (hard coded for now), then create a notification and send it to a queue so it can be persisted to the db:
  - create a save_api_email task to persist the notification
  - return the notification
- New worker app to process the save_api_email tasks.
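A minimal sketch of that flow; HIGH_VOLUME_SERVICE_ID, the queue name and the helpers are assumptions standing in for the real code:

```python
import uuid

from celery import Celery

celery_app = Celery("notify")

HIGH_VOLUME_SERVICE_ID = "hard-coded-service-id"  # hard coding removed later


def persist_notification(notification):
    ...  # stands in for the existing db insert


@celery_app.task(name="save-api-email")
def save_api_email(notification):
    persist_notification(notification)  # the db write now happens here


def post_email_notification(form, service_id):
    notification = {"id": str(uuid.uuid4()), "to": form["email_address"]}
    if service_id == HIGH_VOLUME_SERVICE_ID:
        # defer the db write: put a task on a queue instead
        save_api_email.apply_async([notification], queue="save-api-email-tasks")
    else:
        persist_notification(notification)  # existing behaviour
    return notification
```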
By adding `exec` to the entrypoint bash script for the application, we can trap an EXIT from the script and execute our custom `on_exit` method, which checks if the application process is busy before terminating, waiting up to 10 seconds. We don't need to trap `TERM`, so that's been removed again.
Written by:
@servingupaces
@tlwr
When Cloud Foundry applications are to be rescheduled from one cell to
another, or they are stopped, they are sent a SIGTERM signal and 10
seconds later, a SIGKILL signal.
Currently the scripts trap the POSIX-defined EXIT handler, rather than
the signal directly.
In order for the signal to properly be propagated to celery, and the
celery workers, the script should call the on_exit function when
receiving a TERM signal.
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Becca <rebecca.law@digital.cabinet.office.gov.uk>
Co-authored-by: Toby <toby.lornewelch-richards@digital.cabinet.office.gov.uk>
- We are running statsd exporter as an app with a public route for
Prometheus to scrape
- This updates preview to send statsd metrics over the CF internal
networking to the statsd exporter
- Removes the sidecar statsd exporters too
This is so that the retry-tasks queue, which can have quite a lot of
load, has its own worker, and other queues are paired with queues
that flow similarly:
- letter-tasks with create-letters-pdf-tasks
- job-tasks with database-tasks
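As an illustration (worker names invented, queue names from the list above):

```python
# each entry translates to a worker started with
#   celery worker -Q <comma-separated queue names>
WORKER_QUEUES = {
    "worker-retry-tasks": ["retry-tasks"],
    "worker-letters": ["letter-tasks", "create-letters-pdf-tasks"],
    "worker-jobs": ["job-tasks", "database-tasks"],
}
```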