Previously we thought this setting was necessary to avoid a memory
leak [1], but it's unclear if this is still an issue:
- We've advanced two major versions of Celery.
- Some of the tasks are now quicker and leaner.
Restarting worker sub-processes after each task is a big problem
for performance, as we move towards parallelising our reporting.
This is something of a test to see if we can manage without this
setting. Note that we need to unset the variable manually:
    cf unset-env notify-delivery-worker-reporting CELERYD_MAX_TASKS_PER_CHILD
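For context, a minimal sketch of how such a variable might feed Celery's `worker_max_tasks_per_child` setting. The helper name, and the idea that the app reads the variable straight from the environment, are assumptions and not the actual config code. With the variable unset the value is `None`, which Celery treats as "no limit", so worker processes are not recycled.

```python
def max_tasks_per_child(environ):
    """Return a value for Celery's `worker_max_tasks_per_child` setting.

    Hypothetical helper: assumes the app reads CELERYD_MAX_TASKS_PER_CHILD
    from the environment and passes the result to its Celery config.
    """
    raw = environ.get("CELERYD_MAX_TASKS_PER_CHILD")
    if raw is None or raw.strip() == "":
        # Unset (or empty): no limit, worker processes are never recycled.
        return None
    return int(raw)
```

Once `cf unset-env` has been run and the app restaged, a helper like this would return `None` instead of `1`.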
In the worst case we can always re-run any failed tasks. To check
the worker is still behaving as expected, we can:
- Monitor CPU / memory graphs for it.
- Check `cf events` for unexpected restarts / crashes.
- Compare numbers of task completion logs to previous days.
- Check the number of new billing / status rows looks right.
[1]: ad419f7592
we try and delete for lots of services. this includes services that
don't actually have anything to delete that day. that might be because
they had a custom data retention so we always go to check them, or
because they only sent test notifications (which we'll delete but not
include in the count in the log line). we don't really need to see log
lines saying that we didn't delete anything for that service - that's
just a long list of boring log messages that will hide the actual
interesting stuff - which services we did delete content for.
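A rough sketch of the idea, assuming the task ends up with a mapping of service id to number of notifications deleted. The function name and the log wording are made up for illustration.

```python
def deletion_log_lines(deleted_counts):
    """Build log lines only for services that actually had content deleted.

    `deleted_counts` maps service id -> number of notifications deleted
    (hypothetical shape; the real task may track this differently).
    """
    lines = []
    for service_id, count in deleted_counts.items():
        if count == 0:
            # Nothing deleted for this service: stay quiet.
            continue
        lines.append(f"Deleted {count} notifications for service {service_id}")
    return lines
```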
## The existing situation
To support multiple processes and eventlets recording metrics in
parallel, prometheus uses files to store metrics. When you write a
metric from a multiprocess app, it writes to a file.
Prometheus identifies whether your app is multiprocess by looking for
the existence of a `prometheus_multiproc_dir` environment var (in either
case). Prometheus reads this variable at a module level (ie: at import
time). Assuming it will always be used within a web server, the gds_metrics
library auto-sets this to `/tmp` on import, to ensure that prometheus
will always be set up correctly.
We also have a variety of metrics set up when we create the app. These
are generally sensible metrics such as counting the number of database
connections in use by measuring sqlalchemy connection events.
## The problem
We have seen our notify-delivery-worker-reporting app run out of disk
space. The CELERYD_MAX_TASKS_PER_CHILD flag is set on that app, which
restarts each worker process every time a task runs (to avoid memory
issues). However, we've recently massively decreased the size and
increased the number of tasks to parallelise nightly tasks. Each time a
worker process restarts it writes a new file to disk. This meant that we
quickly ran out of disk space, and then the entire app instance was
killed.
The big rub is that we don't log prometheus metrics from our worker
apps! They don't expose an endpoint so there's no way to scrape them so
we aren't getting any value from prometheus anyway! But because they use
the same codebase they import gds_metrics and get the file-based setup anyway.
## The solution
gds_metrics sets the multiproc env var. However, by importing prometheus
FIRST we ensure that the env var is still unset at that point, and thus
prometheus will harmlessly store the metrics in memory.
The notify-api still needs the env var set so that the stats are shared
across all the gunicorn processes, so we only put this import in
run_celery.py, as the first thing in the file.
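To illustrate why import order matters, here is a small simulation. Both classes and functions are stand-ins written for this message, not the real prometheus or gds_metrics code; they only model "read the env var once at import" and "set the env var on import".

```python
class FakePrometheus:
    """Stand-in for a library that reads the env var once, at import time."""

    def __init__(self, environ):
        # Checked in either case, as described above; never re-read later.
        self.multiprocess = (
            "prometheus_multiproc_dir" in environ
            or "PROMETHEUS_MULTIPROC_DIR" in environ
        )


def fake_gds_metrics_import(environ):
    """Stand-in for gds_metrics setting the var to /tmp when imported."""
    environ.setdefault("prometheus_multiproc_dir", "/tmp")


def worker_import_order():
    """run_celery.py: prometheus first, so it sees no env var."""
    environ = {}
    prometheus = FakePrometheus(environ)
    fake_gds_metrics_import(environ)  # too late to affect prometheus
    return prometheus.multiprocess


def api_import_order():
    """notify-api: gds_metrics first, so prometheus sees the env var."""
    environ = {}
    fake_gds_metrics_import(environ)
    prometheus = FakePrometheus(environ)
    return prometheus.multiprocess
```

In this model the worker ends up in in-memory mode and the API stays in multiprocess (file-backed) mode.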
This is because that function is used both when broadcast status
is updated via API and via admin, so it's a shared resource.
Also move and update tests for updating broadcast message status
so things are tested at source and repetition is avoided.
The top-level task didn't run successfully after this was deployed
because the worker was killed for heavy disk usage. While the more
parallel version does log much more, that doesn't fully explain the
disk behaviour. Nonetheless, reverting is sensible to give us the
time we need to investigate further.
If the reference in the cancel CAP XML we received via the API does not
match any existing broadcast, return 404.
Do the same if the service id doesn't match.
Also refactor the code that cancels a broadcast out into a separate function.
It should be a separate function that is only called by the
create_broadcast function. This prevents create_broadcast from becoming
too big and complex and doing too many things.
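A sketch of the lookup behaviour described above, using plain dicts instead of the real models. `BroadcastNotFound` and `find_broadcast_to_cancel` are hypothetical names; the real code will use its own DAO functions and error handling.

```python
class BroadcastNotFound(Exception):
    """Hypothetical error that the API layer would turn into a 404."""

    status_code = 404


def find_broadcast_to_cancel(broadcasts, service_id, reference):
    """Return the broadcast matching both service id and reference.

    Raises BroadcastNotFound if either the reference or the service id
    doesn't match any existing broadcast.
    """
    for broadcast in broadcasts:
        if (
            broadcast["service_id"] == service_id
            and broadcast["reference"] == reference
        ):
            return broadcast
    raise BroadcastNotFound(f"No broadcast with reference {reference}")
```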
This follows a similar approach to [1]. Recently we've seen lots
of errors from this task, which we think are a consequence of it
doing too much work and tripping Celery's visibility timeout.
While we can optimise the query [2], it's likely the errors will
return as the number of live services grows. Parallelising the
aggregation now will make it more futureproof.
[1]: https://github.com/alphagov/notifications-api/pull/3397
[2]: https://github.com/alphagov/notifications-api/pull/3417
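As a rough picture of what parallelising could look like: instead of one task aggregating every service, build one small unit of work per service. Splitting per service, the function name and the kwargs shape are all assumptions made for illustration.

```python
from datetime import date


def aggregation_subtasks(service_ids, process_day):
    """Return one set of task kwargs per service for a given day.

    Hypothetical helper: each dict would be handed to its own Celery
    task, so no single task has to aggregate every service.
    """
    return [
        {"service_id": service_id, "process_day": process_day.isoformat()}
        for service_id in service_ids
    ]
```

Each subtask should then finish well inside the visibility timeout, even as the number of live services grows.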
The previous DAO tests were also confusing because they were testing
two functions at the same time, so moving the tests up to the task
level seems very reasonable, and will make it easier to change how
this code works in the next commits.
This is similar to the corresponding endpoint for services. However,
it is a little simpler since we don't need to worry about always having
at least one team member for an organisation.
The new dao function added, `dao_remove_user_from_organisation`, is also
simpler than `dao_remove_user_from_service` since we don't have any
organisation permissions to deal with.
Investigation with EXPLAIN and EXPLAIN ANALYZE for the notification
history table shows this is another instance of [1] but for the key
type column. Swapping "!=" for "IN" solves the problem.
[1]: https://github.com/alphagov/notifications-api/pull/3360
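The two filters return the same rows as long as the column is non-null and only holds the listed key types. A small check using an in-memory SQLite table, assuming the key types are `normal`, `team` and `test`. SQLite won't show the query plan difference seen in Postgres; this only demonstrates that the swap doesn't change the results.

```python
import sqlite3


def compare_key_type_filters():
    """Run the "!=" and "IN" versions of the filter on the same data."""
    conn = sqlite3.connect(":memory:")
    # Simplified stand-in for the notification history table.
    conn.execute(
        "CREATE TABLE notification_history ("
        "id INTEGER PRIMARY KEY, key_type TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO notification_history (key_type) VALUES (?)",
        [("normal",), ("team",), ("test",), ("normal",)],
    )
    not_equal = conn.execute(
        "SELECT id FROM notification_history "
        "WHERE key_type != 'test' ORDER BY id"
    ).fetchall()
    in_list = conn.execute(
        "SELECT id FROM notification_history "
        "WHERE key_type IN ('normal', 'team') ORDER BY id"
    ).fetchall()
    conn.close()
    return not_equal, in_list
```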
If a service has not sent any SMS for the financial year, the free allowance was showing up as 0 rather than the number in annual billing. The query has been updated to use an outer join so that the free allowance will be returned when there is no ft_billing.
There is a potential performance enhancement to only return the data for the services of the organisation in the `fetch_sms_free_allowance_remainder_until_date` subquery. I will investigate in a subsequent PR.
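A cut-down illustration of the outer join fix using in-memory SQLite. The two tables are heavily simplified stand-ins for annual billing and ft_billing, and the allowance and usage figures are made up.

```python
import sqlite3


def free_allowance_by_service():
    """Return (service, free allowance, units used) for every service."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE annual_billing (
            service_id TEXT, free_sms_fragment_limit INTEGER);
        CREATE TABLE ft_billing (service_id TEXT, billable_units INTEGER);
        INSERT INTO annual_billing VALUES
            ('sent-sms', 250000), ('no-sms', 250000);
        INSERT INTO ft_billing VALUES ('sent-sms', 1000);
        """
    )
    rows = conn.execute(
        """
        SELECT ab.service_id,
               ab.free_sms_fragment_limit,
               COALESCE(SUM(ft.billable_units), 0)
        FROM annual_billing ab
        -- Outer join keeps services that have no ft_billing rows at all.
        LEFT OUTER JOIN ft_billing ft ON ft.service_id = ab.service_id
        GROUP BY ab.service_id, ab.free_sms_fragment_limit
        ORDER BY ab.service_id
        """
    ).fetchall()
    conn.close()
    return rows
```

With an inner join the `no-sms` service would drop out of the result entirely, which is how its allowance ended up displayed as 0.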
This covers that we only exclude test notifications and the key
type is copied over correctly. In the next commits we're going to
modify this part of the query, so it's important it's covered.
We want admin to send a POST request to this route if the data contains
a message recipient (a phone number or email address) so that this does
not show in the logs. This changes the route to accept both GET and POST
requests.
We saw it fail again last night to calculate how many notifications
were sent for one of our services to put in the ft_notification_status
table. It ran into the sqlalchemy statement timeout again.
To get us through the holiday period let's make it 2 hours, as surely
that will be enough, and then we can fix this properly.
Having a pool size of 30 connections means that if we receive a big
number of requests, with the current configuration, the API would end up
holding onto 30 connections per worker * 4 workers per instance * 35
instances = 4200 connections. With a limit of 5000 connections, this
means that we would only have 800 connections to share between the
workers or for overflow usage (btw, even the overflow for the API would
take us above the 5000 limit - 10 overflow connections per worker * 4 *
35 = 1400 connections, total 5600 _only_ for the API).
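The arithmetic above, written out so it is easy to re-run with other pool sizes. The function name is just for illustration.

```python
def api_connection_demand(pool_size, max_overflow, workers_per_instance, instances):
    """Return (pooled, overflow, total) connections the API could hold."""
    processes = workers_per_instance * instances
    pooled = pool_size * processes
    overflow = max_overflow * processes
    return pooled, overflow, pooled + overflow


# Figures from this message: 30 pooled + 10 overflow connections per
# worker, 4 workers per instance, 35 instances, 5000 connection limit.
pooled, overflow, total = api_connection_demand(30, 10, 4, 35)
headroom = 5000 - pooled
```

Here `pooled` is 4200, `overflow` is 1400, `total` is 5600 and `headroom` is 800.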
During our load tests this led to a deadlock situation where nothing
could retrieve connections to deal with a queue build-up.
The reduced pool size allowed for a much more graceful degradation of
the service where, under significant load, response times increased
but we still managed to serve all the requests.
When running the nightly reporting tasks we are seeing that some tasks are failing because the query is timing out. We need to revisit how to optimise the query, but this will at least let the process finish.