This fixes the app crashing, apparently due to a lack of free
memory. Looking over the last month, we can see memory has been hovering
at ~90% utilisation. 512M seems like a good compromise between avoiding
this happening again and only paying for what we use.
The limit was last changed 5 years ago:
https://github.com/alphagov/notifications-api/pull/882
We think this setting was previously necessary to avoid a memory
leak [1], but it's unclear if this is still an issue:
- We've advanced two major versions of Celery.
- Some of the tasks are now quicker and leaner.
Restarting worker sub-processes after each task is a big problem
for performance, as we move towards parallelising our reporting.
This is something of a test to see if we can manage without this
setting. Note that we need to unset the variable manually:
cf unset-env notify-delivery-worker-reporting CELERYD_MAX_TASKS_PER_CHILD
In the worst case we can always re-run any failed tasks. To check
the worker is still behaving as expected, we can:
- Monitor CPU / memory graphs for it.
- Check `cf events` for unexpected restarts / crashes.
- Compare numbers of task completion logs to previous days.
- Check the number of new billing / status rows looks right.
[1]: ad419f7592
We saw it fail again last night when calculating how many notifications
were sent for one of our services to put in the ft_notification_status
table. It ran into the SQLAlchemy statement timeout again.
To get us through the holiday period, let's make it 2 hours, which
should surely be enough, and then we can fix this properly.
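A minimal sketch of the kind of change this implies, assuming the timeout is
applied as a Postgres `statement_timeout` passed through SQLAlchemy connect
args (the real config key and wiring in the codebase may differ):
```python
from sqlalchemy import create_engine

# Sketch only: bump the Postgres statement timeout to 2 hours (value in ms).
# The connection URL and where this is configured are assumptions.
engine = create_engine(
    "postgresql://localhost/notification_api",
    connect_args={"options": "-c statement_timeout=7200000"},  # 2 hours
)
```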
Having a pool size of 30 connections means that, with the current
configuration, under a large number of requests the API would end up
holding onto 30 connections per worker * 4 workers per instance * 35
instances = 4200 connections. With a limit of 5000 connections, this
means that we would only have 800 connections to share between the
workers or for overflow usage (note that even the overflow for the API
alone would take us above the 5000 limit: 10 overflow connections per
worker * 4 * 35 = 1400 connections, total 5600 _only_ for the API).
During our load tests this led to a deadlock situation where nothing
could retrieve connections to deal with a queue build-up.
The reduced pool size allowed for a much more graceful degradation of
the service: under significant load, response times would increase but
we would still manage to serve all the requests.
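For illustration, a sketch of how the numbers above map onto SQLAlchemy engine
settings; the values shown are the pre-change figures from the text, and the
exact config keys used in the codebase are an assumption:
```python
# Flask-SQLAlchemy style engine options (sketch, not the real config).
SQLALCHEMY_ENGINE_OPTIONS = {
    "pool_size": 30,     # connections held per gunicorn worker...
    "max_overflow": 10,  # ...plus up to 10 extra under load
}

# Worst case across the whole API:
# (30 + 10) connections * 4 workers per instance * 35 instances = 5600,
# which is over the database's 5000-connection limit on its own.
```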
When running the nightly reporting tasks we are seeing that some tasks are failing because the query is timing out. We need to revisit how to optimise the query, but this will at least let the process finish.
This is to see if the worker requires slightly more memory than it currently has access to, or to determine if there is a memory leak somewhere in the code that needs further investigation.
This comes on the heels of yesterday's issue where we could not process the CSVs users uploaded: the memory graph for this worker showed it was using almost all of its available memory, and a redeploy fixed the problem.
We want to start using Firetext for sending international SMS. They
require us to use a different API key for international SMS because it
needs a new code path to switch the sender ID to something that the
destination country will accept.
This PR does not include switching the sender of international SMS to
Firetext but sets us up to do so.
This app will replace the `notify-api-sms-callbacks` app as it is an app
that handles receipts, not callbacks.
After this and the corresponding Concourse PR are merged and deployed (at
which point we will have two apps sharing the traffic), we can then put
in PRs to remove the `notify-api-sms-callbacks` app.
There is a chance it could all be done as a single PR (or at least one
for the API and one for the concourse pipelines) but I'm playing it safe
and doing it as a very clear two step process just in case.
There is no need to give these creds to any of the other workers, and so
the fewer instances that have them the better.
You can verify this works by running
```
CF_APP=notify-api CF_SPACE=preview make generate-manifest
```
vs
```
CF_APP=notify-delivery-worker-broadcasts CF_SPACE=preview make generate-manifest
```
We will no longer send them any stats, so we no longer need:
- the code to work out the nightly stats
- the performance platform client
- any configuration for the client
- any nightly tasks that kick off the sending of the stats
We will also require a change in Cronitor: as this task will no longer
run, we need to delete the Cronitor check.
### The facts
* Celery grabs up to 10 tasks from an SQS queue by default
* Each broadcast task takes a couple of seconds to execute, or double
that if it has to go to the failover proxy
* Broadcast tasks delay retry exponentially, up to 300 seconds.
* Tasks are acknowledged when celery starts executing them.
* If a task is not acknowledged before its visibility timeout of 310
seconds, sqs assumes the celery app has died, and puts it back on the
queue.
### The situation
A task stuck in a retry loop was reaching its visibility timeout, and as
such SQS was duplicating it. We're unsure of the exact cause, but there
were two contributing factors: the celery prefetch and the retry delay
of 300 seconds. Essentially, celery grabs the task, keeps an eye on it
locally while waiting for the delay ETA to come round, then gives the
task to a worker to execute. However, that worker might already have up
to ten tasks that it has grabbed from SQS. This means the worker only
has a 10 second buffer in which to get through all those tasks and start
working on the delayed task, before SQS makes the task available again.
(Note that the delay of 300 seconds is translated into a timestamp based
on the time you called self.retry and put the task back on the queue.
Whereas the visibility timeout starts ticking from the time that a
celery worker picked up the task.)
### The fix
#### Set the max retry delay for broadcast tasks to 240 seconds
Setting the max delay to 240 seconds means that instead of a 10 second
buffer before the visibility timeout is tripped, we've got a 70 second
buffer.
#### Set the prefetch limit to 1 for broadcast workers
This means that each worker will have up to 1 currently executing task,
and 1 task pending execution. If it has these, it won't grab any more
off the queue, so they can sit there without their visibility timeout
ticking up.
Setting the prefetch limit to 1 will result in more queries to SQS and
lower throughput. That might matter for, e.g., sending emails, but the
broadcast worker is not hyper time-critical.
https://docs.celeryproject.org/en/3.1/getting-started/brokers/sqs.html?highlight=acknowledge#caveats
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html?highlight=prefetch#reserve-one-task-at-a-time
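A minimal sketch of the two settings described above, using Celery 3.x-style
names; the task body and the exact option names in the codebase are
assumptions, not the real implementation:
```python
from celery import Celery

app = Celery("broadcasts", broker="sqs://")

# Reserve at most one task per worker process beyond the one it is executing.
app.conf.CELERYD_PREFETCH_MULTIPLIER = 1


@app.task(bind=True, max_retries=None)
def send_broadcast(self, broadcast_id):
    try:
        ...  # placeholder for the call to the CBC proxy
    except Exception as exc:
        # Exponential backoff capped at 240s, keeping a 70 second buffer
        # inside the 310 second SQS visibility timeout.
        delay = min(240, 2 ** self.request.retries)
        raise self.retry(exc=exc, countdown=delay)
```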
This worker will be responsible for handling all broadcast tasks.
It is based on the internal worker which is currently handling broadcast
tasks.
A concurrency of 2 has been chosen fairly arbitrarily. Gunicorn will be
running 4 worker processes, so this gives us the ability to process 8
tasks per app instance.
On Monday, we had a build-up of emails in the email queue that weren't
getting picked up by the sender worker, causing delays.
After further investigation with Andy from the PaaS, we believe the
following happened.
We received a bunch of traffic at 8:30ish which consisted of some
very large emails in terms of their length and complexity. The amount
of memory used by the app instances got very high and a few apps
crashed due to OOM (5 crashes recorded in the cf app events). When new
app instances tried to spin up, they weren't able to, as they
potentially also ran out of memory immediately.
This left us in the position of having fewer app instances than we
needed; on top of that, they were all using a very large amount of
CPU, which may have limited how quickly an individual app instance
could process tasks. This meant that overall we were processing fewer
tasks than we needed to, and our queue of emails started to build up.
So it appears our sender workers did not have the memory they needed.
Looking at a graph of memory usage on the sender workers over the past
30 days, we see that on several days it breached 90% for long periods
of time. This, in combination with the hypothesis above about what
happened, led us to decide to give the app instances a bigger memory
quota, so it has been upped from 3GB to 4GB.
Whilst doing this, I also looked at long term memory usage graphs for
our other workers and saw that the letters worker was similarly close
to around 90% of memory used, so I have taken the opportunity to bump
that too.
We might need to deal with a potentially large volume of SMS delivery
receipts. These receipts are POSTed to our public API at two URLs. The
actual endpoint just parses the request body and puts a task on an SQS
queue; no database connections are required or anything like that.
Split up this traffic from other traffic, so that any increase in volume
of callbacks won't affect the scaling/load/etc of the main api apps.
We've hard coded instance counts to 10 on prod for now until we get an
idea of load.
We have seen the reporting app run out of memory multiple times when
dealing with overnight tasks. The app runs 11 worker threads and we
reduce this to 2 worker threads to put less pressure on a single
instance.
The number 2 was chosen as most of the tasks processed by the reporting
app only take a few minutes and only one or two usually take more than
an hour. This would mean that, with 2 processes across our current 2
instances, a long running task should hopefully only wait behind a few
short running tasks before being picked up, and therefore we shouldn't
see a large increase in the overall time taken to run all our overnight
reporting tasks.
On top of reducing the concurrency for the reporting app, we also set
CELERYD_PREFETCH_MULTIPLIER=1. We do this as suggested by the celery
docs because this app deals with long running tasks.
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit
The change in prefetch multiplier should again optimise the overall time
it takes to process our tasks by ensuring that tasks are given to
instances that have (or will soon have) spare workers to deal with them,
rather than committing to putting all the tasks on certain workers in
advance.
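For reference, a rough sketch of what these two settings look like as Celery
3.x config; the exact names and where they are set in the codebase are
assumptions:
```python
# Reporting worker settings described above (Celery 3.x-style names; sketch only).
CELERYD_CONCURRENCY = 2          # two worker processes per app instance
CELERYD_PREFETCH_MULTIPLIER = 1  # each process reserves at most one extra task

# With 2 instances this gives 4 worker processes in total, and each reserves
# at most one task beyond the one it is running, so the ~18 nightly reporting
# tasks get handed out as workers become free rather than assigned up front.
```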
Note, another optimisation suggested by the docs is to start setting
`ACKS_LATE` on the long running tasks.
This setting would effectively change us from prefetching 1 task per
worker to prefetching 0 tasks per worker and further optimise how we
distribute our tasks across instances. However, we decided not to try
this setting as we weren't sure whether it would conflict with our
visibility_timeout. We decided not to spend the time investigating but
it may be worth revisiting in the future, as long as tasks are
idempotent.
Overall, this commit takes us from potentially having all 18 of our
reporting tasks get fetched onto a single instance to now having a
process that will ensure tasks are distributed more fairly across
instances based on when they have available workers to process the
tasks.
The reporting worker tasks fetch large amounts of data from the db, do
some processing, then store the results back in the database. As the
reporting worker only processes the create nightly billing/stats table
tasks, which aren't performance-critical or high volume, we're fine with
the performance hit from restarting the worker between every task
(which, based on limited local testing, takes about a second or so).
This causes some real funky shit with the app_context (used for
accessing current_app.logger). To access flask's global state we use the
standard way of importing `from flask import current_app`. However,
processes after the first one don't have the current_app available on
shut down (they're fine during the actual task running), and are unable
to call `with current_app.app_context()` to create it. They _are_ able
to call `with app.app_context()` to create it, where `app` is the
initial app that we pass in to `NotifyCelery.init_app`.
NotifyCelery.init_app is only called once, in the master process - I
think the application state is then stored and passed to the celery
workers. It then looks like the teardown might clear it, and it never
gets set up again for the new workers? Unsure.
To fix this, store a copy of the initial flask app on the NotifyCelery
object and then use that from within the shutdown signal logging
function.
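A rough sketch of that fix, with the class layout, signal handler and config
wiring assumed rather than taken from the actual codebase:
```python
from celery import Celery
from celery.signals import worker_process_shutdown
from flask import Flask


class NotifyCelery(Celery):
    def init_app(self, app: Flask):
        # Keep a reference to the original Flask app so shutdown hooks can
        # build an app context even when flask.current_app is unavailable.
        self._flask_app = app
        self.conf.update(app.config)  # assumed: Flask config carries celery settings


notify_celery = NotifyCelery()


@worker_process_shutdown.connect
def log_worker_shutdown(sender=None, pid=None, exitcode=None, **kwargs):
    # Use the stored app rather than current_app, which isn't available on
    # shutdown in worker sub-processes after the first one.
    with notify_celery._flask_app.app_context():
        notify_celery._flask_app.logger.info(
            "Celery worker process %s shut down with exit code %s", pid, exitcode
        )
```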
Nothing's ever easy ¯\_(ツ)_/¯
Instead of saving the email notification to the db, add it to a queue to be saved later.
This is an attempt to alleviate pressure on the db from the api requests.
This initial PR is to trial it and see if we get an improvement in API performance and a reduction in queue pool errors. If we are happy with this, we could remove the hard coding of the service id.
In a nutshell:
- If POST /v2/notification/email is from our high volume service (hard coded for now) then create a notification to send to a queue to persist the notification to the db.
- create a save_api_email task to persist the notification
- return the notification
- New worker app to process the save_api_email tasks.
This should add an extra 400 connections maximum, which will not tip us over the allowable 5000 db connections. And it may help with the queue pool connection errors.
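A very rough sketch of the branching described in the list above; every name
here (the service id, helpers and queue name) is a hypothetical stand-in for
illustration, not taken from the actual code:
```python
# Hypothetical names throughout; this only illustrates the shape of the change.
HIGH_VOLUME_SERVICE_ID = "00000000-0000-0000-0000-000000000000"  # hard coded for now


def post_email_notification(service, validated_request):
    notification = build_notification(service, validated_request)  # assumed helper; not yet persisted

    if str(service.id) == HIGH_VOLUME_SERVICE_ID:
        # Defer the db write: the new worker app picks this task up and persists it.
        save_api_email.apply_async(
            kwargs={"notification": serialise(notification)},  # assumed helpers
            queue="save-api-email",
        )
    else:
        persist_notification(notification)  # existing synchronous path (assumed name)

    return notification
```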
We moved from sending statsd metrics to Hosted Graphite to sending them
to a statsd app running on the PaaS. Therefore we no longer need to send
statsd metrics under a particular prefix to the statsd app, as it is
only receiving statsd metrics from our apps (not from other users, as
would have been the case with Hosted Graphite).
This should not change any behaviour, as the only place the environment
variable was being used was in the gunicorn config, and it was an empty
string, which is the default behaviour anyway as per:
https://docs.gunicorn.org/en/stable/settings.html#statsd-prefix
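For reference, a minimal sketch of what the gunicorn config can look like once
the prefix env var is gone (the file name and the `STATSD_HOST` variable name
are assumptions):
```python
# gunicorn_config.py (sketch)
import os

statsd_host = os.environ.get("STATSD_HOST")  # assumed env var name

# Previously something like: statsd_prefix = os.environ.get("STATSD_PREFIX", "")
# The env var was always an empty string, and gunicorn's default prefix is
# already "", so the setting can simply be dropped.
```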
This was after we saw an instance of the API failing its healthcheck
even though it was still healthy enough to serve requests to users.
This follows the change we've also made to template-preview and admin of
upping the health check timeout. Unlike those, where we set it to 10
seconds, we have been less lenient here and chosen only 2.5 seconds.
This was at the suggestion of Toby from PaaS: the API should generally
have quicker response times, and letting an instance that can't serve
requests successfully stick around for 10 seconds would create more
annoyance for users.
These URLs never change, and having them in the manifest led to
surprising issues where an updated default MMG_URL wasn't actually
respected on PaaS. These URLs aren't private and don't need to be stored
in credentials.
By not defining them in the manifest, we expect the apps to use the
default unless `cf set-env` has been specifically used to modify them
for an app.
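As an illustration of the pattern (the default URL below is a placeholder,
not the real value, and the config wiring is assumed):
```python
import os

# Sketch: the app falls back to the default baked into its config unless the
# environment variable has been set explicitly with `cf set-env`.
MMG_URL = os.environ.get("MMG_URL", "https://example.invalid/mmg")  # placeholder default
```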
we don't use it since we wrote our own provider stubs for performance
tests.
this removes it from the api - it's still in the DB and will be
retrieved by queries, but is set to disabled on prod
Deploys take up to five minutes, during which notify-paas-autoscaler
can't scale the app. We saw 502s due to a large volume of traffic
coming in during that time, and we couldn't react because we were
deploying.
So scale up to 25 instances for deploys; the autoscaler won't be able to
downscale until after the deploy has finished.
- We are running the statsd-exporter on the PaaS now so we can use the
internal UDP route to talk to it
- Only update in preview and staging still so that we can get the
dashboards fully up to date before switching prod
- We are running a statsd exporter on tools to collect all our statsd
metrics for scraping by Prometheus
- Update preview to point there instead of at the local one which has
issues with redeployment and DNS changing
- We've been seeing an issue where, during traffic spikes, the http health
checks take over 1s and PaaS kills the app
- Port health checks won't care about being stuck in a queue so should
continue to work even at high loads
- We have functional tests to catch if a deployment brings up the app
(and so passes port health check) but then doesn't work
- We are running statsd exporter as an app with a public route for
Prometheus to scrape
- This updates preview to send statsd metrics over the CF internal
networking to the statsd exporter
- Removes the sidecar statsd exporters too
all apps get a route assigned when using v3-zdt-push.
> By default, the web process has a route and one instance. Other processes have zero instances by default.
([source](https://docs.cloudfoundry.org/devguide/multiple-processes.html))
When we push apps to multiple environments they need different routes
or the second push will fail, so this means that we need to define
routes ourselves for every app.
We're also manually flagging the health-check as either "http" or
"process" - http for the api, process for all others.
If not specified, the healthcheck is set to `port` by Cloud Foundry. We've
seen some issues with upgrading the deployment from v2 to v3 when using
port: it adds apps to the load balancer when they're not ready, which can
result in 404s. By setting the healthcheck to http it'll wait for the
/status endpoint to return 200, which in turn waits for flask to get
everything up and running properly.
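To illustrate what the http health check polls, a minimal sketch of a status
endpoint (the real /status endpoint in the API does more than this):
```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/status")
def status():
    # Returning 200 here is what tells Cloud Foundry's http health check that
    # the instance is fully up and ready to receive traffic.
    return jsonify(status="ok"), 200
```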