Commit Graph

61 Commits

Ben Thorner
5206844a95 Merge pull request #3438 from alphagov/lower-query-timeout-180693991
Revert increased timeout for reporting worker
2022-02-16 13:38:29 +00:00
Ben Thorner
2d6ee2eb72 Bump memory for Celery Beat to 512M
This fixes the app crashing - apparently due to a lack of free
memory. Looking over the last month, we can see it's been hovering
at ~90% utilisation. 512M seems like a good compromise between avoiding
this happening again and only paying for what we use.

The limit was last changed 5 years ago:

https://github.com/alphagov/notifications-api/pull/882
2022-02-14 14:13:21 +00:00
Ben Thorner
0d71ee69f0 Revert increased timeout for reporting worker
This reverts commit 603acc8b1e.
This reverts commit edad1c9a21.

The cause of the slowness was fixed in [1] and since [2] we now
have data to prove it: each query to get the data is taking under
5 minutes, so it's safe to lower the timeout again.

[1]: https://github.com/alphagov/notifications-api/pull/3417
[2]: https://github.com/alphagov/notifications-api/pull/3437
2022-01-25 12:50:43 +00:00
Ben Thorner
7ad0c4103a Stop killing reporting processes after each task
We previously thought this setting was necessary to avoid a memory
leak [1], but it's unclear whether this is still an issue:

- We've advanced two major versions of Celery.
- Some of the tasks are now quicker and leaner.

Restarting worker sub-processes after each task is a big problem
for performance, as we move towards parallelising our reporting.

This is something of a test to see if we can manage without this
setting. Note that we need to unset the variable manually:

   cf unset-env notify-delivery-worker-reporting CELERYD_MAX_TASKS_PER_CHILD

In the worst case we can always re-run any failed tasks. To check
the worker is still behaving as expected, we can:

- Monitor CPU / memory graphs for it.
- Check `cf events` for unexpected restarts / crashes.
- Compare numbers of task completion logs to previous days.
- Check the number of new billing / status rows looks right.

[1]: ad419f7592
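
For reference, a minimal sketch of how a setting like this typically reaches Celery, assuming the worker reads the environment variable at startup (the config shown here is illustrative, not the actual notifications-api code):

```
import os

from celery import Celery

app = Celery("reporting-worker")  # placeholder app name

max_tasks = os.environ.get("CELERYD_MAX_TASKS_PER_CHILD")
if max_tasks:
    # When set to 1, each worker sub-process is recycled after every task,
    # which guards against slow memory leaks but costs a restart per task.
    app.conf.worker_max_tasks_per_child = int(max_tasks)
# With the variable unset (as after the `cf unset-env` above), sub-processes
# stay alive between tasks and keep their warm state.
```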
2022-01-24 12:52:52 +00:00
Richard Baker
ead9814af9 Merge pull request #3412 from alphagov/reduce-db-pool-size
Reduce pool size from 30 to 15 connections
2021-12-24 11:40:37 +00:00
David McDonald
edad1c9a21 Bump sqlalchemy statement timeout even higher for reporting worker
We saw it fail again last night to calculate how many notifications
were sent for one of our services, to put in the ft_notification_status
table. It ran into the sqlalchemy statement timeout again.
To get us through the holiday period, let's make it 2 hours, as surely
that will be enough, and then we can fix this properly.
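
For context, a statement timeout like this is typically applied via SQLAlchemy engine options; a minimal sketch, assuming psycopg2 and Flask-SQLAlchemy-style config (the exact key names in the real app may differ):

```
# 2 hours expressed in milliseconds for Postgres's statement_timeout.
SQLALCHEMY_ENGINE_OPTIONS = {
    "connect_args": {
        "options": "-c statement_timeout={}".format(2 * 60 * 60 * 1000),
    },
}
```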
2021-12-24 08:56:42 +00:00
sakisv
ad8cf3f3a6 Reduce pool size from 30 to 15 connections
Having a pool size of 30 connections means that if we receive a big
number of requests, with the current configuration, the API would end up
holding onto 30 connections per worker * 4 workers per instance * 35
instances = 4200 connections. With a limit of 5000 connections, this
means that we would only have 800 connections to share between the
workers or for overflow usage (btw, even the overflow for the API would
take us above the 5000 limit - 10 overflow connections per worker * 4 *
35 = 1400 connections, total 5600 _only_ for the API).

During our load tests this led to a deadlock situation where nothing
could retrieve connections to deal with a queue build-up.

The reduced pool size allowed for a much more graceful degradation of
the service where, under significant load, response times would increase
but we would still manage to serve all the requests.
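
A quick sketch of the arithmetic above, plus an illustrative engine configured with the new pool size (the connection URL and the direct use of create_engine are assumptions, not the real setup):

```
from sqlalchemy import create_engine

WORKERS_PER_INSTANCE = 4
INSTANCES = 35

old_total = 30 * WORKERS_PER_INSTANCE * INSTANCES  # 4200 connections
new_total = 15 * WORKERS_PER_INSTANCE * INSTANCES  # 2100 connections

engine = create_engine(
    "postgresql://notify@db.internal/notification_api",  # placeholder URL
    pool_size=15,     # connections held open per worker process
    max_overflow=10,  # temporary extra connections under burst load
)
```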
2021-12-23 19:28:17 +02:00
Rebecca Law
603acc8b1e Increase the SQL timeout for the notify-delivery-worker-reporting app.
When running the nightly reporting tasks we are seeing that some tasks fail because the query times out. We need to revisit how to optimise the query, but this will at least let the process finish.
2021-12-23 11:41:49 +00:00
Ben Thorner
e1dec3f9b8 Switch to per-app secrets from internal APIs
Relates to: [1]

[1]: https://github.com/alphagov/notifications-credentials/pull/231
2021-08-05 17:24:56 +01:00
Sakis
4d73a22d48 Set jobs worker memory to 2GB
This is to see whether the worker requires slightly more memory than it has access to, or whether there is a memory leak somewhere in the code that needs further investigation.

This comes on the heels of yesterday's issue, where we could not process the CSVs users uploaded: the memory graph for this worker showed it using almost all of its available memory, and a redeploy fixed the problem.
2021-07-30 12:49:42 +03:00
sakisv
9b34a2a9a2 Add splunk service
This will allow shipping app and router logs to splunk[1]

This is only bound on the API because we're only interested in
PaaS router logs for the time being.

1: https://github.com/alphagov/paas-csls-splunk-broker/blob/main/docs/user-guide.md
2021-05-14 11:17:26 +03:00
Rebecca Law
f3fdd3b09b Add international API key for Firetext.
We want to start using Firetext for sending international SMS. They
require us to use a different API key for international SMS because it
requires a new code path to switch the sender ID to something that the
country will accept.
This PR does not include switching the sender of international SMS to
Firetext but sets us up to do so.
2021-04-20 13:58:55 +01:00
David McDonald
dbc947e900 Remove api-sms-callbacks app
We no longer need this, as it has been replaced by the api-sms-receipts
app, which is the new app with the correct name but no change in
functionality.
2021-04-16 11:30:01 +01:00
David McDonald
39d7c347b0 Add sms receipts app manifest config
This app will replace the `notify-api-sms-callbacks` app as it is an app
that handles receipts, not callbacks.

After this and the corresponding concourse PR are merged and deployed (at
which point we will have two apps sharing the traffic), we can then put in
PRs to remove the `notify-api-sms-callbacks` app.

There is a chance it could all be done as a single PR (or at least one
for the API and one for the concourse pipelines), but I'm playing it safe
and doing it as a very clear two-step process just in case.
2021-04-13 18:03:05 +01:00
David McDonald
4437d60dd7 Only give broadcasts worker IAM creds for CBC proxy
There is no need to give these creds to any of the other workers, and the
fewer instances that have them the better.

You can verify this works by running
```
CF_APP=notify-api CF_SPACE=preview make generate-manifest
```

vs

```
CF_APP=notify-delivery-worker-broadcasts CF_SPACE=preview make generate-manifest
```
2021-04-12 17:05:42 +01:00
Katie Smith
32f499c802 Fix setting the CELERYD_PREFETCH_MULTIPLIER variable for broadcasts
This was not being set correctly in the manifest for the
notify-delivery-worker-broadcasts worker.
2021-04-09 11:54:31 +01:00
David McDonald
41d95378ea Remove everything for the performance platform
We will no longer send them any stats, so we don't need:
- the code to work out the nightly stats
- the performance platform client
- any configuration for the client
- any nightly tasks that kick off the sending of the stats

We will require a change in cronitor: as this task will no longer run,
we need to delete the cronitor check.
2021-03-15 12:04:53 +00:00
Rebecca Law
acfb759cb9 Change DVLA_EMAIL_ADDRESS to a list 2021-02-26 11:21:16 +00:00
Pea Tyczynska
8e3ef5ff05 Add DVLA_EMAIL_ADDRESS to manifest so it gets picked up from
credentials.
2021-02-24 10:32:20 +00:00
Leo Hemsted
4f89be6944 Revert "Merge pull request #3125 from alphagov/revert-retry"
This reverts commit 6b9a50beff, reversing
changes made to 33f93dfea2.
2021-02-09 17:01:04 +00:00
Leo Hemsted
49e6ec1ead Revert "Merge pull request #3123 from alphagov/retry-loop-fix"
This reverts commit 541a765811, reversing
changes made to 6a9ac654a6.
2021-02-08 11:01:33 +00:00
Leo Hemsted
0ddebc63a8 reduce broadcast retry delay to 4 mins and drop prefetch.
### The facts

* Celery grabs up to 10 tasks from an SQS queue by default
* Each broadcast task takes a couple of seconds to execute, or double
  that if it has to go to the failover proxy
* Broadcast tasks delay retry exponentially, up to 300 seconds.
* Tasks are acknowledged when celery starts executing them.
* If a task is not acknowledged before its visibility timeout of 310
  seconds, sqs assumes the celery app has died, and puts it back on the
  queue.

### The situation

A task stuck in a retry loop was reaching its visibility timeout, and as
such SQS was duplicating it. We're unsure of the exact cause of reaching
its visibility timeout, but there were two contributing factors: the
celery prefetch and the delay of 300 seconds. Essentially, celery grabs
the task, keeps an eye on it locally while waiting for the delay ETA to
come round, then gives the task to a worker to do. However, that worker
might already have up to ten tasks that it's grabbed from SQS. This
means the worker only has 10 seconds to get through all those tasks and
start working on the delayed task before SQS makes the task available
again.

(Note that the delay of 300 seconds is translated into a timestamp based
on the time you called self.retry and put the task back on the queue.
Whereas the visibility timeout starts ticking from the time that a
celery worker picked up the task.)

### The fix

#### Set the max retry delay for broadcast tasks to 240 seconds

Setting the max delay to 240 seconds means that instead of a 10 second
buffer before the visibility timeout is tripped, we've got a 70 second
buffer.

#### Set the prefetch limit to 1 for broadcast workers

This means that each worker will have up to 1 currently executing task,
and 1 task pending execution. If it has these, it won't grab any more
off the queue, so they can sit there without their visibility timeout
ticking up.

Setting the prefetch limit to 1 will result in more queries to SQS and
lower throughput. This might matter for, e.g., sending emails, but the
broadcast worker is not hyper time-critical.

https://docs.celeryproject.org/en/3.1/getting-started/brokers/sqs.html?highlight=acknowledge#caveats
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html?highlight=prefetch#reserve-one-task-at-a-time
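
A hedged sketch of the configuration this commit describes; the task name, queue wiring and backoff formula are illustrative, only the numbers (240s cap, prefetch of 1, 310s visibility timeout) come from the message above:

```
from celery import Celery

app = Celery("broadcasts-worker")  # placeholder name

app.conf.broker_transport_options = {"visibility_timeout": 310}  # seconds
app.conf.worker_prefetch_multiplier = 1  # reserve at most one task per process


@app.task(bind=True, max_retries=None)
def send_broadcast(self, broadcast_id):
    try:
        ...  # call the CBC proxy (or the failover proxy)
    except Exception as exc:
        # Exponential backoff capped at 240 seconds, so a retried task is
        # picked up again comfortably inside the 310s visibility timeout.
        delay = min(240, 2 ** self.request.retries)
        raise self.retry(exc=exc, countdown=delay)
```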
2021-02-05 12:49:51 +00:00
David McDonald
78db0f9c2b Add broadcasts worker and queue
This worker will be responsible for handling all broadcast tasks.

It is based on the internal worker which is currently handling broadcast
tasks.

Concurrency of 2 has been chosen fairly arbitrarily. Gunicorn will be
running 4 worker processes, so we will end up with the ability to process
8 tasks per app instance.
2021-01-13 16:35:27 +00:00
David McDonald
1ac3ca250c Add more memory for the sender and letter workers
On Monday, we had a build-up of emails in the email queue that weren't
getting picked up by the sender worker, causing delays.

After further investigation with Andy from the PaaS, we believe the
following happened.

We received a bunch of traffic at 8:30ish which consisted of some
very large emails in terms of their length and complexity. The amount
of memory used by the app instances got very high and a few apps
crashed due to OOM (recorded by 5 cf app event crashes). When new
app instances tried to spin up, they weren't able to as they
potentially also ran out of memory immediately.

This left us in the position of having fewer app instances than we
needed, on top of which they were all using a very large amount of
CPU, which may have limited how quickly an individual app instance
could process tasks. This meant that we were overall processing
fewer tasks than we needed to and our queue of emails started to
build up.

So it appears our sender workers did not have the memory available that
they needed. Looking at a graph of the past 30 days of memory usage
on the sender workers, we can see that on several days it breached 90%
memory usage for long periods of time. This, in combination with the
hypothesis above about what happened, led us to decide to give the app
instances a bigger memory quota, so it has been upped from 3GB to 4GB.

Whilst doing this, I also looked at long-term memory usage graphs for our
other workers and saw that the letters worker was similarly close to
around 90% memory usage, so I have taken the opportunity to bump that
too.
2020-12-24 15:03:39 +00:00
Toby Lorne
a3293d3c8c manifest: add cbc proxy env vars
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Katie <katie.smith@digital.cabinet-office.gov.uk>
2020-10-20 16:59:50 +01:00
Leo Hemsted
9d5b629dad add notify-api-sms-callbacks app
we might need to deal with a potentially large volume of SMS delivery
receipts. These receipts are POSTed to two URLs on our public api. The
actual endpoint just parses the request body and puts a task on an SQS
queue - no database connections are required or anything like that.

Split up this traffic from other traffic, so that any increase in volume
of callbacks won't affect the scaling/load/etc of the main api apps.
We've hard coded instance counts to 10 on prod for now until we get an
idea of load.
2020-09-24 18:00:25 +01:00
Pea Tyczynska
fff004da2b Turn statsd back on, but not for the main api app, only for
delivery workers.
2020-07-09 10:10:24 +01:00
David McDonald
a237162106 Reduce concurrency and prefetch count of reporting celery app
We have seen the reporting app run out of memory multiple times when
dealing with overnight tasks. The app runs 11 worker threads and we
reduce this to 2 worker threads to put less pressure on a single
instance.

The number 2 was chosen as most of the tasks processed by the reporting
app only take a few minutes and only one or two usually take more than
an hour. This would mean with 2 processes across our current 2
instances, a long running task should hopefully only wait behind a few
short running tasks before being picked up, and therefore we shouldn't
see a large increase in the overall time taken to run all our overnight
reporting tasks.

On top of reducing the concurrency for the reporting app, we also set
CELERYD_PREFETCH_MULTIPLIER=1. We do this as suggested by the celery
docs because this app deals with long running tasks.
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit

The change in prefetch multiplier should again optimise the overall time
it takes to process our tasks by ensuring that tasks are given to
instances that have (or will soon have) spare workers to deal with them,
rather than committing to putting all the tasks on certain workers in
advance.

Note: another suggestion from the docs for optimising is to start
setting `ACKS_LATE` on the long-running tasks.
This setting would effectively change us from prefetching 1 task per
worker to prefetching 0 tasks per worker and further optimise how we
distribute our tasks across instances. However, we decided not to try
this setting as we weren't sure whether it would conflict with our
visibility_timeout. We decided not to spend the time investigating but
it may be worth revisiting in the future, as long as tasks are
idempotent.

Overall, this commit takes us from potentially having all 18 of our
reporting tasks get fetched onto a single instance to now having a
process that will ensure tasks are distributed more fairly across
instances based on when they have available workers to process the
tasks.
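
A sketch of the worker settings described above, using the old-style names the commit refers to (the shape of the real config module is an assumption):

```
CELERY_SETTINGS = {
    "CELERYD_CONCURRENCY": 2,          # down from 11 worker processes
    "CELERYD_PREFETCH_MULTIPLIER": 1,  # reserve at most one task per process
    # CELERY_ACKS_LATE was considered but not enabled, because we weren't
    # sure how it interacts with our SQS visibility_timeout.
}

# With 2 processes on each of our 2 instances, at most 4 of the ~18 nightly
# reporting tasks run at once; the rest wait on the queue until a process is
# free, rather than all being prefetched onto a single instance.
```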
2020-04-28 10:47:46 +01:00
Leo Hemsted
ad419f7592 restart reporting worker after each task
The reporting worker tasks fetch large amounts of data from the db, do
some processing then store back in the database. As the reporting worker
only processes the create nightly billing/stats table tasks, which
aren't high performance or high volume, we're fine with the performance
hit from restarting the worker between every task (which based on
limited local testing takes about a second or so).

This causes some real funky shit with the app_context (used for
accessing current_app.logger). To access flask's global state we use the
standard way of importing `from flask import current_app`. However,
processes after the first one don't have the current_app available on
shut down (they're fine during the actual task running), and are unable
to call `with current_app.app_context()` to create it. They _are_ able
to call `with app.app_context()` to create it, where `app` is the
initial app that we pass in to `NotifyCelery.init_app`.

NotifyCelery.init_app is only called once, in the master process - I
think the application state is then stored and passed to the celery
workers. But then it looks like the teardown might clear it, but it
never gets set up again for the new workers? Unsure.

To fix this, store a copy of the initial flask app on the NotifyCelery
object and then use that from within the shutdown signal logging
function.

Nothing's ever easy ¯\_(ツ)_/¯
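
A minimal sketch of the fix described here, assuming a NotifyCelery subclass and a logging handler on the worker_process_shutdown signal (names beyond those in the message are illustrative):

```
from celery import Celery
from celery.signals import worker_process_shutdown
from flask import current_app


class NotifyCelery(Celery):
    def init_app(self, app):
        # Keep a reference to the initial flask app: current_app is not
        # reliably available in sub-processes during shutdown.
        self.flask_app = app


notify_celery = NotifyCelery("delivery-worker")  # placeholder name


@worker_process_shutdown.connect
def log_process_shutdown(sender=None, pid=None, exitcode=None, **kwargs):
    with notify_celery.flask_app.app_context():
        current_app.logger.info("worker process %s shut down (%s)", pid, exitcode)
```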
2020-04-24 12:28:25 +01:00
David McDonald
1ff52bbaad Add GDSMetrics package
As per instructions https://github.com/alphagov/gds_metrics_python

The celery workers don't have an HTTP endpoint so there's no point in
trying to get prometheus to scrape them.
2020-04-20 18:39:45 +01:00
Rebecca Law
1d7c3466b0 Add cred to manifest 2020-03-27 10:08:30 +00:00
Rebecca Law
a13bcc6697 Reduce the pressure on the db for API post email requests.
Instead of saving the email notification to the db, add it to a queue to save later.
This is an attempt to alleviate pressure on the db from the api requests.
This initial PR is to trial it and see if we see an improvement in api performance and a reduction in queue pool errors. If we are happy with this we could remove the hard coding of the service id.

In a nutshell (a rough sketch follows the list):
 - If POST /v2/notification/email is from our high volume service (hard coded for now) then create a notification to send to a queue to persist the notification to the db.
 - create a save_api_email task to persist the notification
 - return the notification
 - New worker app to process the save_api_email tasks.
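
Only the save_api_email task name and the hard-coded service id idea come from this commit; the helpers below are illustrative stubs of the flow:

```
HIGH_VOLUME_SERVICE_ID = "hard-coded-service-id"  # placeholder value


def enqueue_save_api_email(notification):
    """Stub: put a save_api_email task on the queue for the new worker app."""


def persist_notification(notification):
    """Stub: write the notification row to the db (the previous behaviour)."""


def post_email_notification(service_id, payload):
    notification = {"service_id": service_id, **payload}  # built, not yet saved

    if service_id == HIGH_VOLUME_SERVICE_ID:
        enqueue_save_api_email(notification)  # persisted later by the worker
    else:
        persist_notification(notification)

    return notification  # returned to the caller straight away
```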
2020-03-25 07:59:05 +00:00
Rebecca Law
b30deaa989 Increase the queue pool size to 30.
This should add an extra 400 connections maximum, which will not tip us over the allowable 5000 db connections. And it may help with the queue pool connection errors.
2020-03-16 16:46:19 +00:00
David McDonald
f56795655e Remove unused STATSD_PREFIX variable
We moved from sending statsd metrics to Hosted Graphite to sending them
to a statsd server running on the PaaS. Therefore we no longer need to
send statsd metrics with a particular prefix, as the statsd app is only
receiving metrics from our apps (not from other users, as would have
been the case with Hosted Graphite).

This should change no behaviour, as the only place the environment
variable was being used was in the gunicorn config, where it was an empty
string, which is the default behaviour anyway as per:
https://docs.gunicorn.org/en/stable/settings.html#statsd-prefix
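
gunicorn config files are plain Python; a sketch of the relevant settings (statsd_host and statsd_prefix are real gunicorn options, but the values and env var handling here are assumptions):

```
import os

# Where the statsd metrics go (host:port of the statsd server on the PaaS).
statsd_host = os.environ.get("STATSD_HOST", "localhost:8125")

# STATSD_PREFIX was already an empty string, and "" is gunicorn's default,
# so removing the variable changes nothing.
statsd_prefix = ""
```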
2020-03-05 10:41:26 +00:00
David McDonald
a13b17a7c9 Transform list to JSON so it can be read by json.loads 2020-02-21 13:41:08 +00:00
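A tiny illustration of the round trip this subject line refers to (the variable is made up):

```
import json

addresses = ["one@example.com", "two@example.com"]

serialised = json.dumps(addresses)  # '["one@example.com", "two@example.com"]'
assert json.loads(serialised) == addresses
```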
David McDonald
2dc5550159 Change variable name to make it more descriptive
Also remove an unnecessary if statement
Also add a manifest change to make sure the relevant environment variables
make it into the app
2020-02-20 15:48:15 +00:00
David McDonald
9de3a9ff43 Set healthcheck timeout as integer
It's not allowed to be a non-integer, so we have upped it to 3 rather than
going down to 2 (a fairly arbitrary choice).
2019-12-19 10:55:38 +00:00
David McDonald
5aaf109ae1 Up API health check timeout to 2.5 seconds
This was after we saw an instance of the API failing its healthcheck
even though it was still healthy enough to serve requests to users.

This follows the change we've also made to template-preview and admin of
upping the health check timeout. Unlike those, where we set it to 10
seconds, we have been less generous here and only chosen 2.5 seconds.
This was at the suggestion of Toby from the PaaS, as the api should
generally have quicker response times, and more annoyance might be
created for users if we let an instance stick around for 10 seconds
while it was unable to serve requests successfully.
2019-12-18 10:36:04 +00:00
Leo Hemsted
4701e5d9af don't define MMG_URL and FIRETEXT_URL in manifest
these URLs never change, and it led to surprising issues where an
updated default MMG_URL wasn't actually respected on PaaS. These URLs
aren't private and don't need to be stored in credentials.

By not defining them in the manifest, we expect them to use the default
unless `cf set-env` has been specifically used to modify them in an app.
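
In config terms this is the usual "default unless overridden" pattern; a sketch, with placeholder default URLs rather than the real provider endpoints:

```
import os

# Defaults apply when no manifest entry or `cf set-env` override exists.
MMG_URL = os.environ.get("MMG_URL", "https://example-mmg-provider/api/sms")
FIRETEXT_URL = os.environ.get("FIRETEXT_URL", "https://example-firetext/api/sendsms")
```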
2019-12-04 15:26:49 +00:00
Leo Hemsted
e094dd4bfd remove loadtesting from providers
we don't use it since we wrote our own provider stubs for performance
tests.

this removes it from the api - it's still in the DB and will be
retrieved by queries, but is set to disabled on prod
2019-10-23 11:45:07 +01:00
Leo Hemsted
9c2ded00c1 scale up api to 25 instances before deploy on production
deploys take up to five minutes, during which notify-paas-autoscaler
can't scale the app. We saw 502s due to a large volume of traffic
coming in during that time, and we couldn't react because we were
deploying.

scale up to 25 instances; the autoscaler won't be able to downscale
until after the deploy has finished.
2019-10-14 16:12:35 +01:00
Leo Hemsted
3a0bf2b23e Add reporting worker
also remove references to unused statistics queue
2019-08-15 16:42:15 +01:00
Athanasios Voutsadakis
cd936d2e71 Enable statsd exporter for production
Also bump the utils version to include a fix on the error handling logic
when we fail to send a metric.
2019-08-14 11:42:13 +01:00
Andy Paine
088f234185 REP-340: Use PaaS statsd exporter
- We are running the statsd-exporter on the PaaS now so we can use the
  internal UDP route to talk to it
- Only update in preview and staging still so that we can get the
  dashboards fully up to date before switching prod
2019-08-05 10:36:58 +01:00
Andy Paine
57705fd6fe AUTO: Explicitly include FIRETEXT_URL in manifest
- We are explicit about MMG_URL but not FIRETEXT_URL
- credentials has already been updated (checked by doing make
  generate-manifest for all envs)
2019-06-14 15:22:18 +01:00
Andy Paine
2d17827780 AUTO: Enable statsd exporter on staging
- We want to do some load testing so we want to use the Prometheus
  metrics for observing the system
- Roll out the statsd exporter work to staging too
2019-06-10 11:12:44 +01:00
Andy Paine
e61619f3e0 AUTO-413: Point preview statsd at tools
- We are running a statsd exporter on tools to collect all our statsd
  metrics for scraping by Prometheus
- Update preview to point there instead of at the local one which has
  issues with redeployment and DNS changing
2019-05-30 17:03:08 +01:00
Andy Paine
adf81ef689 BAU: Use port health checks for API
- We've been seeing an issue when traffic spikes where the http health
  checks take over 1s and PaaS kills the app
- Port health checks won't care about being stuck in a queue so should
  continue to work even at high loads
- We have functional tests to catch if a deployment brings up the app
  (and so passes port health check) but then doesn't work
2019-05-30 11:56:19 +01:00
Andy Paine
655d5a4e16 AUTO-413: Use an internal app for statsd preview
- We are running statsd exporter as an app with a public route for
  Prometheus to scrape
- This updates preview to send statsd metrics over the CF internal
  networking to the statsd exporter
- Removes the sidecar statsd exporters too
2019-05-23 11:10:33 +01:00
Leo Hemsted
10a6f32a09 add routes for all apps
all apps get a route assigned when using v3-zdt-push.

> By default, the web process has a route and one instance. Other processes have zero instances by default.

([source](https://docs.cloudfoundry.org/devguide/multiple-processes.html))

When we push apps to multiple environments they need different routes
or the second push will fail, so this means that we need to define
routes ourselves for every app.

We're also manually flagging the health-check as either "http" or
"process" - http for the api, process for all others.

If not specified, healthcheck is set to `port` by cloudfoundry - we've
seen some issues with upgrading the deployment from v2 to v3 when using
port - it adds apps to the load balancer when they're not ready, which can
result in 404s. By setting healthcheck to http it'll wait for the
/status endpoint to return 200, which will wait for flask to get
everything up and running properly.
2019-05-15 16:01:28 +01:00