Commit Graph

176 Commits

Ben Thorner
149976bfab Run tests directly from the Makefile
Contributes to: https://github.com/alphagov/notifications-manuals/issues/9

Precedent: https://github.com/alphagov/notifications-admin/pull/3897
2021-06-16 13:05:55 +01:00
Ben Thorner
321b4913ed Enforce consistency in imports as part of build
This copies the config we use in the admin app, with a few changes
as discussed in the PR [1]. We'll apply these to our other apps.

[1]: https://github.com/alphagov/notifications-api/pull/3175#issuecomment-795530323
2021-03-12 11:45:21 +00:00
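A minimal sketch of what an import-consistency check in the build might look like (the tool invocation and paths are assumptions, not copied from the repo):

```
# fail the build if imports aren't consistently ordered
isort --check-only ./app ./tests
```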
Ben Thorner
af95ad68ea Move bootstrap tasks into the Makefile
This is more consistent with how we run all other tasks. Note that
the virtual env setup is not generally applicable, and developers
of this repo should follow the guidance in the README.
2021-02-18 09:01:32 +00:00
Ben Thorner
dc6fb1d1f2 Remove unused single test script 2021-02-18 09:01:31 +00:00
Ben Thorner
ba4d399982 Switch to 'make' for running app processes
These are simple enough that they don't need their own scripts.
2021-02-18 09:01:26 +00:00
Katie Smith
5eebcf6452 Put service callback retries on a different queue
At the moment, if a service callback fails, it will get put on the retry queue.
This causes a potential problem though:

If a service's callback server goes down, we may generate a lot of retries,
putting a lot of items on the retry queue. The retry queue is also
responsible for other important parts of Notify, such as retrying message
delivery, and we don't want a service's callback server going down to have an
impact on the rest of Notify.

Putting the retries on a different queue means that tasks get processed
faster than if they were put back on the same 'service-callbacks' queue.
2021-02-09 13:31:16 +00:00
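A hedged sketch of the idea: callback retries get their own queue and consumer, so the shared retry queue is left alone (queue name and app module below are assumptions, not the repo's actual values):

```
# hypothetical worker dedicated to callback retries, kept separate from the
# shared retry queue
celery -A run_celery.notify_celery worker \
  -Q service-callbacks-retry-tasks \
  --loglevel=INFO
```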
David McDonald
78db0f9c2b Add broadcasts worker and queue
This worker will be responsible for handling all broadcast tasks.

It is based on the internal worker which is currently handling broadcast
tasks.

Concurrency of 2 has been chosen fairly arbitrarily. Gunicorn will be
running 4 worker processes, so we will end up with the ability to process
8 tasks per app instance.
2021-01-13 16:35:27 +00:00
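Roughly what the new worker invocation might look like (queue name and app module are assumptions):

```
# hypothetical broadcast worker: concurrency of 2 per process; with 4
# processes per instance that gives up to 8 concurrent broadcast tasks
celery -A run_celery.notify_celery worker \
  -Q broadcast-tasks \
  -c 2
```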
sakisv
9bb9070ba0 Add disk space check for sender worker
Reused the existing `ensure_celery_is_running` function to terminate the
script
2021-01-04 14:01:19 +02:00
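A hedged sketch of a disk space check that could terminate the worker script when space runs low (threshold and path are illustrative, not the actual script):

```
# illustrative only: exit if the disk is nearly full, so the platform
# replaces the instance
check_disk_space() {
  used=$(df -P / | awk 'NR==2 {print $5}' | tr -d '%')
  if [ "${used}" -ge 95 ]; then
    echo "Disk usage at ${used}%, terminating"
    exit 1
  fi
}
```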
sakisv
1bfdac8417 Temporarily remove disk space check from multi_worker script
There seems to be some kind of complication in this script that prevents
it from terminating properly.

This is being removed for now to allow deploying the rest of the fixes
in time for the holiday period.
2020-12-24 18:44:26 +02:00
sakisv
a6ecfd66b6 Terminate instance if it's running out of disk space 2020-12-23 19:40:04 +02:00
sakisv
2108498eb1 Send worker-sender celery logs to /dev/null
We are using our custom logger to log to `NOTIFY_LOG_PATH`, so this
logging from celery is neither needed nor desired.

We also need to define the location of the pidfiles, because of what
appears to be a bug in celery where, if the pidfile location is not
defined, it infers it from the location of the logs; in this case it was
trying to find the pidfiles in `/dev/null/%N.pid`.
2020-12-23 19:39:56 +02:00
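The relevant celery multi flags, sketched from the description above (worker name, app module and pidfile path are assumptions):

```
# send celery's own logging to /dev/null (our logger already writes to
# NOTIFY_LOG_PATH) and pin the pidfile location so celery doesn't derive
# it from the logfile path
celery multi start worker-sender \
  -A run_celery.notify_celery \
  --logfile=/dev/null \
  --pidfile=/home/vcap/app/%N.pid
```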
Rebecca Law
4a9eca3dff Remove space between queue names. Doh 2020-10-29 11:14:11 +00:00
Rebecca Law
29b6f84f6c Revert "Revert "Add a task to save-api-sms for high volume services."" 2020-10-29 11:12:46 +00:00
Rebecca Law
06ff1bf596 Revert "Add a task to save-api-sms for high volume services." 2020-10-27 16:18:57 +00:00
Rebecca Law
3dee4ad310 Add a task to save-api-sms for high volume services.
When we initially added a new task to persist notifications for a high volume service, we wanted to implement it as quickly as possible, so we ignored SMS.
This change allows a high volume service to send SMS: the SMS is put on a queue and then persisted and sent, similar to emails.

At this point I haven't added a new application to consume the new save-api-sms-tasks. We can either add a separate application or be happy with how the app scales for both email and SMS.
2020-10-26 13:09:37 +00:00
Leo Hemsted
b01ec05aaf run migrations if app is down
normally we check the app's status page to see if migrations need
running. However, if the _status endpoint doesn't respond with a 200, we
don't necessarily want to abort the deploy - we may be trying to deploy
a code fix that fixes that status endpoint, for example.

We don't know whether to run the migrations or not, so we err on the side
of caution by re-running the migration. The migration itself might be
the fix that gets the app working after all.

had to do a little song and dance because sometimes the response won't
be populated before an exception is thrown
2020-06-26 15:28:28 +01:00
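A hedged sketch of the deploy-time check being described (the host variable and migration command are assumptions, not the actual deploy script):

```
# if _status can't be reached or isn't a 200, run the migrations anyway
status=$(curl --silent --output /dev/null --write-out "%{http_code}" "https://${API_HOST}/_status" || true)
if [ "${status}" != "200" ]; then
  echo "_status returned '${status}'; running migrations to be safe"
  flask db upgrade
fi
```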
David McDonald
a237162106 Reduce concurrency and prefetch count of reporting celery app
We have seen the reporting app run out of memory multiple times when
dealing with overnight tasks. The app runs 11 worker threads and we
reduce this to 2 worker threads to put less pressure on a single
instance.

The number 2 was chosen as most of the tasks processed by the reporting
app only take a few minutes and only one or two usually take more than
an hour. This would mean with 2 processes across our current 2
instances, a long running task should hopefully only wait behind a few
short running tasks before being picked up, and therefore we shouldn't
see a large increase in the overall time taken to run all our overnight
reporting tasks.

On top of reducing the concurrency for the reporting app, we also set
CELERYD_PREFETCH_MULTIPLIER=1. We do this as suggested by the celery
docs because this app deals with long running tasks.
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit

The change in prefetch multiplier should again optimise the overall time
it takes to process our tasks by ensuring that tasks are given to
instances that have (or will soon have) spare workers to deal with them,
rather than committing to putting all the tasks on certain workers in
advance.

Note, another suggestion from the docs for optimising is to start
setting `ACKS_LATE` on the long running tasks.
This setting would effectively change us from prefetching 1 task per
worker to prefetching 0 tasks per worker and further optimise how we
distribute our tasks across instances. However, we decided not to try
this setting as we weren't sure whether it would conflict with our
visibility_timeout. We decided not to spend the time investigating but
it may be worth revisiting in the future, as long as tasks are
idempotent.

Overall, this commit takes us from potentially having all 18 of our
reporting tasks get fetched onto a single instance to now having a
process that will ensure tasks are distributed more fairly across
instances based on when they have available workers to process the
tasks.
2020-04-28 10:47:46 +01:00
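In celery terms the change amounts to something like this (queue name and app module are illustrative; the prefetch setting lives in the app's celery config rather than on the command line):

```
# illustrative worker start for the reporting app: concurrency of 2
# instead of 11
celery -A run_celery.notify_celery worker -Q reporting-tasks -c 2

# and in the app's celery config (Python), as the celery 3.1 docs suggest
# for long running tasks:
#   CELERYD_PREFETCH_MULTIPLIER = 1
```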
David McDonald
5d88a1dbf4 Add -Ofair setting to reporting celery app
What this setting does is best described in
https://medium.com/@taylorhughes/three-quick-tips-from-two-years-with-celery-c05ff9d7f9eb#d7ec

This should be useful for the reporting app because tasks run by this
app are long running (many seconds). Ideally this code change will
mean that we are quicker to process the overnight reporting tasks,
so they all finish earlier in the morning (although are not individually quicker).

This is only being set on the reporting celery app because this
change is trying to do the minimum possible to improve the reliability and speed
of our overnight reporting tasks. It may very well be useful to set this
flag on all our apps, but that should be done with some more
consideration, as some of them deal with much faster tasks (sub
0.5s), so it may or may not still be appropriate. Proper
investigation would be needed.

Note, the celery docs on this are also worth a read:
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit.
However, the language there makes it easy to confuse this setting with the
prefetch limit. The distinction is that prefetching grabs items off the queue,
whereas the -Ofair behaviour is about what happens once items have already
been prefetched: whether the master celery process hands them straight to
the child (worker) processes or not.

Note, this behaviour is the default for celery version 4 and above, but we
are still on version 3.1.26 so we have to enable it ourselves.
2020-04-28 10:42:42 +01:00
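The flag itself is just an extra argument to the worker command, roughly (the rest of the invocation is an assumption):

```
# -Ofair: the master process only hands a prefetched task to a child
# process that is actually free, rather than assigning tasks in advance
celery -A run_celery.notify_celery worker -Q reporting-tasks -c 2 -Ofair
```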
Rebecca Law
db4b4d929d - If the task runs twice and the notification already exists, ignore the primary key constraint.
- Remove prints
- Add some more tests
- Only allow the new method to run for emails
2020-03-25 12:39:15 +00:00
Rebecca Law
a13bcc6697 Reduce the pressure on the db for API post email requests.
Instead of saving the email notification to the db, add it to a queue to save later.
This is an attempt to alleviate pressure on the db from the api requests.
This initial PR is to trial it and see if we see an improvement in api performance and a reduction in queue pool errors. If we are happy with this we could remove the hard coding of the service id.

In a nutshell:
 - If POST /v2/notification/email is from our high volume service (hard coded for now), create a notification and send it to a queue so it can be persisted to the db later.
 - create a save_api_email task to persist the notification
 - return the notification
 - New worker app to process the save_api_email tasks.
2020-03-25 07:59:05 +00:00
Katie Smith
3a07d1e13d Create new sms-callbacks queue
The `delivery-worker-receipts` app will listen to this new queue, which will
be used for processing the responses from Firetext and MMG.
2020-03-19 13:41:14 +00:00
Leo Hemsted
a97948574a remove unused pytest flags 2020-02-13 12:52:12 +00:00
Rebecca Law
fe18512dd2 Change how the bash script is started.
By adding `exec` to the entrypoint bash script for the application, we can trap an EXIT from the script and execute our custom `on_exit` method, which checks if the application process is busy before terminating, waiting up to 10 seconds. We don't need to trap `TERM` so that's been removed again.

Written by:
@servingupaces
@tlwr
2019-10-31 16:41:16 +00:00
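A hedged sketch of the trap pattern being described (the on_exit body is illustrative and omits how the app process itself is started):

```
# wait up to 10 seconds for the application process to finish before
# terminating
on_exit() {
  local waited=0
  while pgrep -f celery > /dev/null && [ "${waited}" -lt 10 ]; do
    sleep 1
    waited=$((waited + 1))
  done
}

trap on_exit EXIT
```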
Toby Lorne
4ad2e30e52 Catch the TERM signal in the run_app_*paas scripts
When Cloud Foundry applications are to be rescheduled from one cell to
another, or they are stopped, they are sent a SIGTERM signal and 10
seconds later, a SIGKILL signal.

Currently the scripts trap the POSIX defined EXIT handler, rather than
the signal directly.

In order for the signal to properly be propagated to celery, and the
celery workers, the script should call the on_exit function when
receiving a TERM signal.

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Becca <rebecca.law@digital.cabinet.office.gov.uk>
Co-authored-by: Toby <toby.lornewelch-richards@digital.cabinet.office.gov.uk>
2019-10-29 17:21:16 +00:00
Leo Hemsted
3a0bf2b23e Add reporting worker
also remove references to unused statistics queue
2019-08-15 16:42:15 +01:00
Andy Paine
655d5a4e16 AUTO-413: Use an internal app for statsd preview
- We are running statsd exporter as an app with a public route for
  Prometheus to scrape
- This updates preview to send statsd metrics over the CF internal
  networking to the statsd exporter
- Removes the sidecar statsd exporters too
2019-05-23 11:10:33 +01:00
Pea Tyczynska
b59bca0fc2 Rename workers so they are less wordy xd 2019-05-01 14:51:43 +01:00
Pea Tyczynska
6163ca8b45 Change distribution of queues among notify delivery workers
This is so that the retry-tasks queue, which can have quite a lot of
load, has its own worker, and other queues are paired with queues
that flow similarly:
- letter-tasks with create-letters-pdf-tasks
- job-tasks with database-tasks
2019-04-30 12:03:06 +01:00
Alexey Bezhan
570cbc3eab Add statsd_exporter to app PaaS startup scripts
`statsd_exporter` is only started if `STATSD_HOST` is set to `localhost`.
2019-04-24 13:50:13 +01:00
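The gist of the conditional start (binary path is illustrative and the exporter's own options are omitted):

```
# only run the statsd_exporter sidecar when metrics are expected on localhost
if [ "${STATSD_HOST:-}" = "localhost" ]; then
  ./statsd_exporter &
fi
```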
Leo Hemsted
95ca533df2 add delivery-celery-beat to paas_app_wrapper
Forget about beats by dre
2019-04-12 11:10:18 +01:00
Leo Hemsted
8cc5d40291 remove old manifest files and creation script 2019-04-10 15:21:30 +01:00
Leo Hemsted
7d9cd58e89 rename public-api to api
we don't use the public-api bit anywhere - even cloudwatch overwrites
based on CW_APP_NAME (which we can get rid of as this distinction is
gone)
2019-04-10 15:19:46 +01:00
Leo Hemsted
fe77d4f654 move commands from manifest to procfile
cf v3 doesn't appear to support commands in manifest files. They
say that they do, but in practice they fail on cf v3-zdt-push with an
error message "No process types returned from stager". This can be
solved by moving the command from the manifest to a Procfile.

However, the Procfile is part of the source code, and as such is the
same for each app. To get around this, make the Procfile command invoke
a new wrapper script, which checks the NOTIFY_APP_NAME env var and then
calls the correct command
2019-04-09 17:05:56 +01:00
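A hedged sketch of the wrapper the Procfile might invoke (app names and commands are illustrative, not the actual script):

```
# Procfile is the same for every app, so dispatch on NOTIFY_APP_NAME here
case "${NOTIFY_APP_NAME}" in
  api)
    exec gunicorn -c gunicorn_config.py application
    ;;
  delivery-worker-database)
    exec celery -A run_celery.notify_celery worker -Q database-tasks
    ;;
  *)
    echo "Unknown NOTIFY_APP_NAME: ${NOTIFY_APP_NAME}" >&2
    exit 1
    ;;
esac
```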
Athanasios Voutsadakis
3528aab25b Kill the other processes started by the script
We use exec to start awslogs_agent and then a tail to print logs to
stdout. CF docs[1] recommend using exec to start processes, which seems
to imply that as long as there are commands running the container will
remain up and running.

This commit ensures that if there are no celery tasks running we will
kill any other processes that we have started, so that the container will
no longer be considered healthy by cloudfoundry and will be replaced.

1: https://docs.cloudfoundry.org/devguide/deploy-apps/manifest.html#start-commands
2019-01-23 16:23:58 +00:00
Toby Lorne
afcdf1f9a1 Exit if celery processes are not running
In 4427827b2f celery monitoring was
changed from using PID files to actually looking at processes.

If celery workers get OOM killed (for instance), the container init
script would not restart them. This is because `get_celery_pids` would
not find any processes containing the string celery, which would
cause the pipe to fail (-o pipefail). APP_PIDS would not get updated but
the script would continue to run, so the script never restarted
the celery processes.

We think the correct behaviour when celery processes are killed (i.e.
there are no more celery processes running in a container) is to kill
the container. The PaaS should then schedule new ones which may
remediate the cause of the celery processes being killed.

Upon detection of no celery processes running, some diagnostic
information from the environment is sent to the logs, e.g.:

```
CF_INSTANCE_ADDR=10.0.32.4:61012
CF_INSTANCE_INTERNAL_IP=10.255.184.9
CF_INSTANCE_GUID=81c57dbc-e706-411e-6a5f-2013
CF_INSTANCE_PORT=61012
CF_INSTANCE_IP=10.0.32.4
```

Then the script (which is the container entrypoint) exits 1.

Co-author: @servingupaces @tlwr
2019-01-23 14:28:53 +00:00
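A hedged sketch of the check (the real script differs; only the shape of the behaviour is taken from the commit):

```
# no celery processes left: log some diagnostics and exit so the PaaS
# replaces the container
if ! pgrep -f celery > /dev/null; then
  env | grep '^CF_INSTANCE_' || true
  echo "No celery processes running, exiting"
  exit 1
fi
```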
Athanasios Voutsadakis
4427827b2f Handle celery PIDs more reliably
This addresses some problems that existed in the previous approach:

1. There was a race condition that could occur between checking for the
existence of the .pid files and actually reading them.

2. If for some reason the .pid file was left behind after a process had
died, the script would never know because we do:

    kill -s ${1} ${APP_PID} || true
2019-01-17 16:55:44 +00:00
Leo Hemsted
80454579ee fix multi worker exit script
previously the script would:

try and SIGTERM each celery process every second for the 9 second
timeout, and then SIGKILL every second after, with no upper bound.

This commit changes this to:
* SIGTERM each process once.
* Wait nine seconds (checking if the pid files are still present each
  second)
* SIGKILL any remaining processes once.
* exit
2019-01-02 15:15:46 +00:00
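Roughly the new shutdown sequence (pidfile paths are assumptions; the timings follow the commit):

```
# SIGTERM each worker once
for pidfile in /home/vcap/app/*.pid; do
  kill -s TERM "$(cat "${pidfile}")" || true
done

# wait up to nine seconds, checking each second whether the pid files are gone
for _ in 1 2 3 4 5 6 7 8 9; do
  ls /home/vcap/app/*.pid > /dev/null 2>&1 || break
  sleep 1
done

# SIGKILL anything still left, then exit
for pidfile in /home/vcap/app/*.pid; do
  kill -s KILL "$(cat "${pidfile}" 2> /dev/null)" || true
done
exit 0
```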
Rebecca Law
68cea04210 Fixed error message 2018-11-09 16:40:58 +00:00
Leo Hemsted
2ed50e760f Revert "Celery 4" 2018-10-09 13:27:49 +01:00
Leo Hemsted
bfc4343b0e remove pip-accel and make sure commands work if you're in a venv
remove pip-accel - it hasn't been updated in two years, and pins us to a
version of pip that is several breaking changes old.

make sure commands work if you're already in a venv - mostly by
checking for presence of $VIRTUAL_ENV, and ensuring we use the correct
pip to install packages. Also clean up the commands a bit.
2018-10-04 15:52:51 +01:00
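A minimal sketch of the venv handling (commands are illustrative, not the actual Makefile):

```
# reuse an existing virtualenv if we're already in one, otherwise create our own
if [ -z "${VIRTUAL_ENV:-}" ]; then
  python3 -m venv venv
  . venv/bin/activate
fi
pip install -r requirements.txt
```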
Leo Hemsted
640f00b0e8 install celery with sqs support
you need to `pip install celery[sqs]` to get the additional
dependencies that celery needs to use SQS queues - there are two libs -
boto3 and pycurl.

pycurl is a bunch of python handles around curl, so needs to be
installed from source so it can link to your curl/ssl libs. On paas and
in docker this works fine (needed to add `libcurl4-openssl-dev` to the
docker container), but on macos it can't find openssl. We need to pass
a couple of flags in:

* set the environment variable PYCURL_SSL_LIBRARY=openssl
* pass in the global options `build_ext` and `-I{openssl_headers_path}`.

As shown here:
https://github.com/pycurl/pycurl/issues/530#issuecomment-395403253

Env var is no biggie, but using any install-option flags disables
wheels for the whole pip install run. (See
https://github.com/pypa/pip/issues/2677 and
https://github.com/pypa/pip/issues/4118 for more context on the
install-options flags). A whole bunch of our dependencies don't
install nicely from source (but do from wheel), so this commit installs
pycurl separately as an initial step, with the requisite flags, and
then installs the rest of the requirements as before.

I've updated the makefile and bootstrap.sh files to reflect this, but
if you run `pip install -r requirements.txt` from scratch you will run
into issues.
2018-10-03 14:11:30 +01:00
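Roughly the macOS workaround described above (the openssl include path is an example, such as a Homebrew prefix):

```
# install pycurl on its own first, pointing it at openssl;
# using --global-option disables wheels for this pip run, which is why
# it is kept separate from the main requirements install
export PYCURL_SSL_LIBRARY=openssl
pip install pycurl \
  --global-option=build_ext \
  --global-option="-I/usr/local/opt/openssl/include"

# then install everything else as normal, from wheels where possible
pip install -r requirements.txt
```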
Alexey Bezhan
75940c9566 Pin all application requirements in requirements.txt
The list of top-level dependencies is moved to requirements-app.txt,
which is used by `make freeze-requirements` to generate the full
list of requirements in requirements.txt.

This is based on alphagov/digitalmarketplace-api#615, so rationale
from that PR applies here.

We had a problem with unpinned packages on new deployments leading
to failed tests (e.g. alphagov/notifications-admin#2144) which is
why we're implementing this now.

After re-evaluating pipenv again, this still seems like the least
disruptive approach:

* pyup.io has experimental support for Pipfile, but doesn't respect
  version ranges or updating hashes in the lock file
* CloudFoundry buildpack recognizes and supports Pipfiles out of the
  box, but the support is relatively new. For example until recently
  CF would install dev packages during deployment. It's also based on
  generating a requirements file from the Pipfile, which doesn't
  properly support pinning VCS dependencies (eg it doesn't set the
  #egg= version, meaning pip will not upgrade the package if it's
  already installed).
* pipenv has a strict dependency resolution algorithm, which doesn't
  appear to be well documented and can cause some unexpected failures.
  For example, pipenv doesn't seem to be able to install `awscli-cwlogs`
  package at all, believing it to have a version conflict for `botocore`
  (which it doesn't list as a direct dependency) while neither `pip` nor
  `pip-tools` highlight any issues with it.
* While trying out `pipenv install` on our list of dependencies it would
  regularly fail to install utils with a "Will try again." message.
  While the installation succeeds after a retry, this doesn't inspire
  confidence.
* The switch to Pipfile and pipenv-managed virtualenvs requires a series
  of changes to `make` targets and scripts - replacing `pip install` with
  `pipenv`, removing references to requirements files and prefixing
  commands with `pipenv run`. While it's likely to simplify the overall
  process of managing dependencies, it would require time to properly
  implement across our applications and environments (Jenkins, PaaS,
  docker containers, and dev machines).
2018-07-10 14:59:04 +01:00
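An illustrative sketch of the freeze flow, assuming the file names above (this is not the actual make target):

```
# pin everything that the top-level list in requirements-app.txt pulls in
python3 -m venv venv-freeze
. venv-freeze/bin/activate
pip install -r requirements-app.txt
pip freeze > requirements.txt
deactivate
```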
Alexey Bezhan
676e3ec39a Stream delivery worker logs to stdout when running on PaaS
Our application servers and celery workers write logs both to a
file that is shipped to CloudWatch and to stdout, which is picked
up by CloudFoundry and sent to Logit Logstash.

This works with gunicorn and single-worker celery deployments; however,
celery multi daemonizes worker processes, which detaches them from
stdout, so there's no log output in `cf logs` or Logit.

To fix this, we start a separate tail process to duplicate logs written
to a file to stdout, which should be picked up by CloudFoundry.
2018-06-29 11:49:02 +01:00
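A hedged sketch of the workaround (worker name and app module are illustrative; NOTIFY_LOG_PATH is the log file mentioned elsewhere in this history):

```
# celery multi daemonizes, so nothing reaches stdout on its own;
# follow the logfile in the background and let CloudFoundry pick that up
touch "${NOTIFY_LOG_PATH}"
tail -n 0 -f "${NOTIFY_LOG_PATH}" &

celery multi start delivery-worker -A run_celery.notify_celery
```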
Athanasios Voutsadakis
d365921093 Address PR comments
Make timeout 9s on both files
Use more descriptive variable name
Declare the logs dir as a constant
2018-03-06 16:11:42 +00:00
Athanasios Voutsadakis
3045f6233b Use continue instead of break on SIGKILL
`break`ing would keep trying to kill the same process and never move to
the next
2018-02-28 14:50:23 +00:00
Athanasios Voutsadakis
b158c66705 Set timeout to 9
This will allow us an extra second to `kill -9` any remaining processes.
2018-02-28 14:37:02 +00:00
Athanasios Voutsadakis
7d562f9e85 Get empty array when no .pid file exists 2018-02-28 14:36:40 +00:00
Athanasios Voutsadakis
1adee47230 Always get the latest PIDs to check 2018-02-28 14:36:06 +00:00
Athanasios Voutsadakis
138d4eee25 Better logging of PIDs array 2018-02-28 14:34:48 +00:00
Athanasios Voutsadakis
bef80e3414 Use eval to run the command instead of exec
`exec` replaces the current shell to run the command, which means that
script execution stops at that line.

Passing it to the background with `exec "$@" &` won't work either,
because the script will move directly to the next command where it
looks for the `.pid` files that have not yet been created because celery
takes a few seconds to spin up all the processes.

Using `sleep X` to remedy this seems just wrong given that

1. we can use `eval` that blocks until the command returns
2. there is no obvious benefit in sticking with `exec`
2018-02-28 14:16:15 +00:00
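The difference in one line, as a simplified sketch of the entrypoint:

```
# exec "$@" would replace this shell, so nothing after it would run;
# exec "$@" & returns immediately, before celery multi has written its .pid files;
# eval blocks until celery multi has daemonized and returned, so the .pid
# files exist by the time the rest of the script reads them
eval "$@"
```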