8637 Commits

Author SHA1 Message Date
Ben Thorner
0d71ee69f0 Revert increased timeout for reporting worker
This reverts commit 603acc8b1e.
This reverts commit edad1c9a21.

The cause of the slowness was fixed in [1] and since [2] we now
have data to prove it: each query to get the data is taking under
5 minutes, so it's safe to lower the timeout again.

[1]: https://github.com/alphagov/notifications-api/pull/3417
[2]: https://github.com/alphagov/notifications-api/pull/3437
2022-01-25 12:50:43 +00:00
Ben Thorner
68981a8b0d Merge pull request #3436 from alphagov/stop-restart-reporting-180693991
Stop killing reporting processes after each task
2022-01-24 13:39:13 +00:00
Ben Thorner
7ad0c4103a Stop killing reporting processes after each task
Previously we thought this setting was necessary to avoid a memory
leak [1], but it's unclear if this is still an issue:

- We've advanced two major versions of Celery.
- Some of the tasks are now quicker and leaner.

Restarting worker sub-processes after each task is a big problem
for performance, as we move towards parallelising our reporting.

This is something of a test to see if we can manage without this
setting. Note that we need to unset the variable manually:

   cf unset-env notify-delivery-worker-reporting CELERYD_MAX_TASKS_PER_CHILD

In the worst case we can always re-run any failed tasks. To check
the worker is still behaving as expected, we can:

- Monitor CPU / memory graphs for it.
- Check `cf events` for unexpected restarts / crashes.
- Compare numbers of task completion logs to previous days.
- Check the number of new billing / status rows looks right.

[1]: ad419f7592
2022-01-24 12:52:52 +00:00
Pea Tyczynska
4a90cde701 Merge pull request #3429 from alphagov/cancel_alert_via_api
Cancel broadcast via API
2022-01-21 14:04:35 +00:00
Leo Hemsted
1ad25dbc05 Merge pull request #3434 from alphagov/prom-celery
disable prometheus writing to files from celery apps
2022-01-21 11:34:07 +00:00
Leo Hemsted
24d260218f Merge pull request #3401 from alphagov/hide-log-line
don't log if we dont delete anything for a service
2022-01-21 11:33:56 +00:00
Leo Hemsted
246016a894 don't log if we dont delete anything for a service
we try and delete for lots of services. this includes services that
don't actually have anything to delete that day. that might be because
they had a custom data retention so we always go to check them, or
because they only sent test notifications (which we'll delete but not
include in the count in the log line). we don't really need to see log
lines saying that we didn't delete anything for that service - that's
just a long list of boring log messages that will hide the actual
interesting stuff - which services we did delete content for.
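
A minimal sketch of the guard described above. The helper name `log_deletions`, the logger name and the message wording are hypothetical and not taken from the codebase:

```python
import logging

logger = logging.getLogger("delete_notifications")  # hypothetical logger name


def log_deletions(deleted_by_service):
    """Log one line per service, skipping services where nothing was deleted."""
    logged = []
    for service, num_deleted in deleted_by_service.items():
        if not num_deleted:
            # Nothing counted as deleted for this service, so stay quiet
            # and keep the interesting lines easy to find.
            continue
        message = f"Deleted {num_deleted} notifications for service {service}"
        logger.info(message)
        logged.append(message)
    return logged
```

Calling `log_deletions({"service-a": 0, "service-b": 3})` would log a single line, for `service-b` only.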
2022-01-21 11:04:37 +00:00
Leo Hemsted
9a01c703fa disable prometheus writing to files from celery apps
## The existing situation

To support multiple processes and eventlets recording metrics in
parallel, prometheus uses files to store metrics. When you write a
metric from a multiprocess app, it writes to a file.

Prometheus identifies whether your app is multiprocess by looking for
the existence of a `prometheus_multiproc_dir` environment var (in upper
or lower case). Prometheus reads this variable at a module level (ie: at
import time). Assuming it will always be used within a web server, the
gds_metrics library auto-sets this to `/tmp` on import, to ensure that
prometheus will always be set up correctly.

We also have a variety of metrics set up when we create the app. These
are generally sensible metrics such as counting the number of database
connections in use by measuring sqlalchemy connection events.

## The problem

We have seen our notify-delivery-worker-reporting app run out of disk
space. The CELERYD_MAX_TASKS_PER_CHILD flag is set on that app, which
restarts each worker process every time a task runs (to avoid memory
issues). However, we've recently massively decreased the size and
increased the number of tasks to parallelise nightly tasks. Each time a
worker process restarts it will write a new file to disk. This meant
that we quickly ran out of disk space, and then the entire app instance
was killed.

The big rub is that we don't log prometheus metrics from our worker
apps! They don't expose an endpoint so there's no way to scrape them so
we aren't getting any value from prometheus anyway! But because they use
the same codebase they import gds_metrics and get that anyway.

## The solution

gds_metrics sets the multiproc env var. However, by importing prometheus
FIRST we ensure that the env var is unset at that point, and thus
prometheus will harmlessly store the metrics in memory.

To ensure that the notify-api still has the env var set when it runs, so
that the stats are shared across all the gunicorn processes, we put this
import as the first thing in run_celery.py
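
The import-order effect can be illustrated with a small stand-alone simulation. This is a sketch only: `choose_backend` and `simulate_startup` are hypothetical stand-ins for "prometheus reads the variable once at import time" and "gds_metrics sets it on import", and neither real library is imported here.

```python
ENV_VAR = "prometheus_multiproc_dir"


def choose_backend(environ):
    """Mimic a library picking its metric storage once, at import time."""
    if ENV_VAR in environ or ENV_VAR.upper() in environ:
        return "files"
    return "memory"


def simulate_startup(import_metrics_lib_first):
    """Return the backend chosen for a given (simulated) import order."""
    environ = {}  # stand-in for os.environ at process start
    if import_metrics_lib_first:
        # "import prometheus" first: the variable is not set yet...
        backend = choose_backend(environ)
        # ...and the later "import gds_metrics" sets it too late to matter.
        environ[ENV_VAR] = "/tmp"
    else:
        # "import gds_metrics" first: the variable is already set
        # when the metrics library decides how to store values.
        environ[ENV_VAR] = "/tmp"
        backend = choose_backend(environ)
    return backend
```

Under these assumptions, `simulate_startup(True)` models the celery entry point (in-memory metrics) and `simulate_startup(False)` models the web app (file-backed metrics shared across processes).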
2022-01-21 11:01:39 +00:00
Pea Tyczynska
b6dd189462 Test cancel request via API returns 404 if service id does not match
2022-01-20 18:28:10 +00:00
Pea Tyczynska
52dbdb7518 Move validate_and_update_broadcast_message_status to a utils file
This is because that function is used both when broadcast status
is updated via API and via admin, so it's a shared resource.

Also move and update tests for updating broadcast message status
so things are tested at source and repetition is avoided.
2022-01-20 18:14:41 +00:00
Pea Tyczynska
c9afb2f038 Remove unnecessary error handling
The context here should be enough for the users, so a custom error
message is not needed.
2022-01-20 18:14:40 +00:00
Pea Tyczynska
c2a389e81a Move updating user validation out of validate_and_update_broadcast_message_status
As only 1 of 2 functions calling it needs that check, it's better
to perform it inside that 1 function.
2022-01-20 18:14:39 +00:00
Ben Thorner
731ebed224 Merge pull request #3433 from alphagov/revert-parallel-status-180693991
Revert running status aggregation in parallel
2022-01-20 13:11:33 +00:00
Ben Thorner
0f6dea0deb Revert running status aggregation in parallel
The top-level task didn't run successfully after this was deployed
because the worker was killed due to heavy disk usage. While the
more parallel version does log much more, it doesn't totally explain
the disk behaviour. Nonetheless, reverting it is sensible to give us
the time we need to investigate more.
2022-01-20 12:22:33 +00:00
Pea Tyczynska
a4c20e8ba6 Return 404 if reference from cancel message does not match
If the reference from cancel CAP XML we received via API does not
match with any existing broadcast, return 404.

Do the same if service id doesn't match.

Also refactor the code to cancel a broadcast out into a separate function.

It should be a separate function that is only called by create_broadcast
function. This will prevent create_broadcast from becoming too
big and complex and doing too many things.
2022-01-19 15:42:27 +00:00
Pea Tyczynska
3b4a9d8942 Cancel broadcast via API
When a service sends us an XML CAP broadcast message with Cancel
status, and that broadcast is in broadcasting state, we cancel it.
2022-01-19 15:42:26 +00:00
Pea Tyczynska
940126abfb Reject unapproved broadcast upon cancel API request
When a service sends us a cancel broadcast XML via API, if that
broadcast was not approved yet, reject it.
2022-01-19 15:41:38 +00:00
Ben Thorner
0a88724ff5 Merge pull request #3428 from alphagov/remove-dup-column
Remove duplicate declaration for reference column
2022-01-19 13:49:44 +00:00
Ben Thorner
6be489daa7 Merge pull request #3425 from alphagov/parallelise-ft-status-180693991
Parallelise status aggregation by service and day
2022-01-19 13:49:28 +00:00
Ben Thorner
9686595fa8 Minor tweaks to address comments on the PR
To address:

- https://github.com/alphagov/notifications-api/pull/3425#discussion_r786867994
- https://github.com/alphagov/notifications-api/pull/3425#discussion_r786853329
- https://github.com/alphagov/notifications-api/pull/3425#discussion_r786848793
- https://github.com/alphagov/notifications-api/pull/3425#discussion_r786214794
2022-01-18 16:56:53 +00:00
Ben Thorner
cfa6284af7 Remove duplicate declaration for reference column
This is identical to the declaration a few lines above.
2022-01-17 12:03:14 +00:00
Katie Smith
5cd6fcbb4f Merge pull request #3423 from alphagov/org-user-delte
Add endpoint to allow org team members to be removed
2022-01-13 08:39:32 +00:00
Ben Thorner
086f0f50a6 Remove unnecessary extra method in status DAO
This makes it easier to see what is being queried.
2022-01-12 15:48:00 +00:00
Ben Thorner
9182ebf4e5 Parallelise status aggregation by service and day
This follows a similar approach as [1]. Recently we've seen lots
of errors from this task, which we think are a consequence of it
doing too much work and tripping Celery's visibility timeout.

While we can optimise the query [2], it's likely the errors will
return as the number of live services grows. Parallelising the
aggregation now will make it more futureproof.

[1]: https://github.com/alphagov/notifications-api/pull/3397
[2]: https://github.com/alphagov/notifications-api/pull/3417
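
A rough sketch of the fan-out idea, assuming hypothetical names (`build_subtasks`, `service_id`, `process_day`). In a real worker each dictionary would become the arguments of a separate Celery task; this sketch only builds the list of work items:

```python
from datetime import date, timedelta
from itertools import product


def build_subtasks(service_ids, process_day, num_days):
    """Return one small unit of work per (day, service) pair."""
    days = [process_day - timedelta(days=i) for i in range(num_days)]
    return [
        {"service_id": service_id, "process_day": day.isoformat()}
        for day, service_id in product(days, service_ids)
    ]
```

Each unit of work covers one service for one day, so the amount of work per task stays bounded even as the number of live services grows.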
2022-01-12 15:47:59 +00:00
Ben Thorner
c3da139e9c Remove redundant migration tasks (esp. for status)
These were added long ago [1][2] and aren't referenced in runbooks,
so it should be safe to delete them.

[1]: 13f3662051
[2]: b9953dd005
2022-01-12 15:47:58 +00:00
Ben Thorner
d772ae6b46 Standardise logs for status aggregation tasks
This will make it easier to parallelise by service later on.
2022-01-12 15:47:57 +00:00
Ben Thorner
4feed950c4 DRY-up loops to kick off status aggregation tasks
This will make it easier to parallelise by service in the following
commits, since we only have one loop to change.
2022-01-12 15:47:56 +00:00
Ben Thorner
ddbf556486 Rewrite task to aggregate status by service
This is a step towards parallelising the task by service and day.
2022-01-12 15:47:53 +00:00
Ben Thorner
9fc8b904c6 DRY up status aggregation tests (move DAO tests up)
The previous DAO tests were also confusing because they were testing
two functions at the same time, so moving the tests up to the task
level seems very reasonable, and will make it easier to change how
this code works in the next commits.
2022-01-11 16:11:36 +00:00
Katie Smith
ed725c1513 Add endpoint to allow org team members to be removed
This is similar to the corresponding endpoint for services. However,
it is a little simpler since we don't need to worry about always having
at least one team member for an organisation.

The new dao function added, `dao_remove_user_from_organisation`, is also
simpler than `dao_remove_user_from_service` since we don't have any
organisation permissions to deal with.
2022-01-11 15:20:48 +00:00
Ben Thorner
081e0cab88 Merge pull request #3417 from alphagov/optimise-status-query-180693991
Optimise query to populate notification statuses
2022-01-11 14:18:36 +00:00
Ben Thorner
63b5204fb0 Optimise query to populate notification statuses
Investigation with EXPLAIN and EXPLAIN ANALYZE for the notification
history table shows this is another instance of [1] but for the key
type column. Swapping "!=" for "IN" solves the problem.

[1]: https://github.com/alphagov/notifications-api/pull/3360
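
A small sketch of the rewrite, assuming the key types are `normal`, `team` and `test` and using a simplified two-column table (both are assumptions for illustration). It uses an in-memory SQLite database, so it only shows that the two predicates select the same rows; the query-plan difference itself would need EXPLAIN on the real PostgreSQL table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notification_history (id INTEGER, key_type TEXT)")
conn.executemany(
    "INSERT INTO notification_history VALUES (?, ?)",
    [(1, "normal"), (2, "team"), (3, "test"), (4, "normal")],
)

# Original style: exclude test notifications with "!=".
not_equal = conn.execute(
    "SELECT id FROM notification_history WHERE key_type != 'test' ORDER BY id"
).fetchall()

# Rewritten style: list the key types we do want with "IN".
in_list = conn.execute(
    "SELECT id FROM notification_history "
    "WHERE key_type IN ('normal', 'team') ORDER BY id"
).fetchall()
```

The equivalence only holds while the listed key types are the complete set of non-test values.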
2022-01-11 13:22:04 +00:00
Ben Thorner
e4dcea5396 Merge pull request #3421 from alphagov/explain-status-task-180693991
Add comment to explain status aggregation approach
2022-01-11 12:33:38 +00:00
Rebecca Law
ff7ee2cb63 Merge pull request #3422 from alphagov/fix-organisation-billing-query
Fix bug in organisation report for its services and usages.
2022-01-11 11:43:01 +00:00
Rebecca Law
2257cae398 Fix bug in organisation report for its services and usages.
If a service has not sent any SMS for the financial year, the free allowance was showing up as 0 rather than the number in annual billing. The query has been updated to use an outer join so that the free allowance will be returned when there is no ft_billing.

There is a potential performance enhancement to only return the data for the services of the organisation in the `fetch_sms_free_allowance_remainder_until_date` subquery. I will investigate in a subsequent PR.
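
A simplified illustration of why the outer join matters, using an in-memory SQLite database. The table layouts and column names here are assumptions for the sketch, not the real schema or the real query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE annual_billing (service_id TEXT, free_sms_fragment_limit INTEGER);
    CREATE TABLE ft_billing (service_id TEXT, billable_units INTEGER);
    INSERT INTO annual_billing VALUES ('active', 250000), ('quiet', 250000);
    INSERT INTO ft_billing VALUES ('active', 1000);
    """
)

QUERY = """
    SELECT a.service_id,
           a.free_sms_fragment_limit,
           COALESCE(SUM(f.billable_units), 0)
    FROM annual_billing a
    {join} ft_billing f ON f.service_id = a.service_id
    GROUP BY a.service_id, a.free_sms_fragment_limit
    ORDER BY a.service_id
"""

# Inner join: the service with no ft_billing rows loses its allowance row.
inner = conn.execute(QUERY.format(join="JOIN")).fetchall()

# Outer join: the allowance is still returned, with zero usage.
outer = conn.execute(QUERY.format(join="LEFT OUTER JOIN")).fetchall()
```

In this sketch the `quiet` service appears only in the outer-join result, with its full allowance and zero billable units.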
2022-01-11 10:04:36 +00:00
Ben Thorner
a7b39a930c Add comment to explain status aggregation approach
This relates to the performance optimisation work we're doing [1].
Before optimising the task, it's worth asking if we can do less -
the comment explains why it has to be this way.

Some references to back up the comment:

- We do status updates in either table [2].
- We don't allow duplicate receipts for emails [3].
- We don't allow duplicate receipts for SMS [4].
- We don't expect duplicate receipts for letters.

This is something we would need to revisit if we want to support
additional status updates - we could reject based on the age of the
notification, rather than the status.

[1]: https://github.com/alphagov/notifications-api/pull/3417
[2]: 20ead82463/app/dao/notifications_dao.py (L538)
[3]: 20ead82463/app/celery/process_ses_receipts_tasks.py (L58)
[4]: 20ead82463/app/dao/notifications_dao.py (L129-L135)
2022-01-10 18:15:54 +00:00
Ben Thorner
394bf9abd9 Extend test for updating fact statuses
This covers that we only exclude test notifications and the key
type is copied over correctly. In the next commits we're going to
modify this part of the query, so it's important it's covered.
2022-01-05 16:49:30 +00:00
Katie Smith
20ead82463 Merge pull request #3403 from alphagov/get-notis-post
Allow `get_all_notifications_for_service` to accept POST requests
2022-01-04 14:26:52 +00:00
Katie Smith
13b6d1e490 Remove unused test function
`set_up_get_all_from_hash` stopped being used in 52831813d8
2022-01-04 14:04:03 +00:00
Katie Smith
3530d26ba3 Use client fixture everywhere
There were a few tests which weren't using the `client` fixture but were
using the code it contains. This simplifies them to use the fixture.
2022-01-04 14:04:03 +00:00
Katie Smith
0b7410818e Allow get_all_notifications_for_service to accept POST requests
We want admin to send a POST request to this route if the data contains
a message recipient (a phone number or email address) so that this does
not show in the logs. This changes the route to accept both GET and POST
requests.
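
A framework-free sketch of the idea, with a hypothetical helper name and argument shapes: for GET the filters come from the query string, while for POST they come from the request body, so a recipient never has to appear in the URL.

```python
def get_request_filters(method, query_args, json_body):
    """Return search filters from the query string (GET) or JSON body (POST)."""
    if method == "POST":
        # The recipient travels in the body, which keeps it out of the URL.
        return dict(json_body or {})
    if method == "GET":
        return dict(query_args)
    raise ValueError(f"Unsupported method: {method}")
```

For example, `get_request_filters("POST", {}, {"to": "07700900000"})` returns the filters from the body and ignores the (empty) query string.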
2022-01-04 14:04:03 +00:00
Ben Thorner
494a01ba57 Merge pull request #3415 from alphagov/standard-freeze-180760212
Centralise documentation for updating dependencies
2021-12-29 16:01:08 +00:00
Ben Thorner
c03647fb4b Centralise documentation for updating dependencies
This follows the convention established in [1].

[1]: https://github.com/alphagov/notifications-antivirus/pull/83
2021-12-29 14:59:38 +00:00
Richard Baker
ead9814af9 Merge pull request #3412 from alphagov/reduce-db-pool-size
Reduce pool size from 30 to 15 connections
2021-12-24 11:40:37 +00:00
Richard Baker
10c09338c3 Merge pull request #3413 from alphagov/increase-timeout-more
Bump sqlalchemy statement timeout even higher for reporting worker
2021-12-24 09:46:33 +00:00
David McDonald
edad1c9a21 Bump sqlalchemy statement timeout even higher for reporting worker
We saw it fail again last night to calculate how many notifications
were sent for one of our services to put in the ft_notification_status
table. It ran into the sqlalchemy statement timeout again.
To get us through the holiday period let's make it 2 hours, as surely
that will be enough, and then we can fix this properly.
2021-12-24 08:56:42 +00:00
sakisv
ad8cf3f3a6 Reduce pool size from 30 to 15 connections
Having a pool size of 30 connections means that if we receive a big
number of requests, with the current configuration, the API would end up
holding onto 30 connections per worker * 4 workers per instance * 35
instances = 4200 connections. With a limit of 5000 connections, this
means that we would only have 800 connections to share between the
workers or for overflow usage (btw, even the overflow for the API would
take us above the 5000 limit - 10 overflow connections per worker * 4 *
35 = 1400 connections, total 5600 _only_ for the API).
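
The arithmetic above can be checked with a short sketch. The constants come from this message; the assumption that the overflow stays at 10 connections per worker after the change is mine and is not stated here.

```python
WORKERS_PER_INSTANCE = 4
INSTANCES = 35
CONNECTION_LIMIT = 5000


def api_connections(per_worker):
    """Total connections the API can hold for a given per-worker figure."""
    return per_worker * WORKERS_PER_INSTANCE * INSTANCES


old_pool = api_connections(30)   # pool size before this change
new_pool = api_connections(15)   # pool size after this change
overflow = api_connections(10)   # assumed unchanged overflow per worker

old_headroom = CONNECTION_LIMIT - old_pool
old_worst_case = old_pool + overflow
new_worst_case = new_pool + overflow
```

With the smaller pool, the API's worst case (pool plus overflow) fits under the connection limit, which it did not before.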

During our load tests this led to a deadlock situation where nothing
could retrieve connections to deal with a queue build-up.

The reduced pool size allowed for a much more graceful degradation of
the service where, under significant load, response times would increase
but we would still manage to serve all the requests.
2021-12-23 19:28:17 +02:00
Rebecca Law
77084533fb Merge pull request #3411 from alphagov/increase-timeout-for-reporting-worker
Increase the SQL timeout for the `notify-delivery-worker-reporting` app.
2021-12-23 11:49:28 +00:00
Rebecca Law
603acc8b1e Increase the SQL timeout for the notify-delivery-worker-reporting app.
When running the nightly reporting tasks we are seeing that some tasks are failing because the query is timing out. We need to revisit how to optimise the query, but this will at least let the process finish.
2021-12-23 11:41:49 +00:00
David McDonald
3a214da379 Merge pull request #3408 from alphagov/db-connection-close
Close DB connection whilst making HTTP to SMS providers
2021-12-22 11:02:13 +00:00