notifications-api

mirror of https://github.com/GSA/notifications-api.git synced 2026-07-10 19:33:56 -04:00

Author	SHA1	Message	Date
Ben Thorner	0d71ee69f0	Revert increased timeout for reporting worker This reverts commits `603acc8b1e` and `edad1c9a21`. The cause of the slowness was fixed in [1] and since [2] we now have data to prove it: each query to get the data is taking under 5 minutes, so it's safe to lower the timeout again. [1]: https://github.com/alphagov/notifications-api/pull/3417 [2]: https://github.com/alphagov/notifications-api/pull/3437	2022-01-25 12:50:43 +00:00
Ben Thorner	68981a8b0d	Merge pull request #3436 from alphagov/stop-restart-reporting-180693991 Stop killing reporting processes after each task	2022-01-24 13:39:13 +00:00
Ben Thorner	7ad0c4103a	Stop killing reporting processes after each task Previously we thought this setting was necessary to avoid a memory leak [1], but it's unclear if this is still an issue: - We've advanced two major versions of Celery. - Some of the tasks are now quicker and leaner. Restarting worker sub-processes after each task is a big problem for performance, as we move towards parallelising our reporting. This is something of a test to see if we can manage without this setting. Note that we need to unset the variable manually: cf unset-env notify-delivery-worker-reporting CELERYD_MAX_TASKS_PER_CHILD In the worst case we can always re-run any failed tasks. To check the worker is still behaving as expected, we can: - Monitor CPU / memory graphs for it. - Check `cf events` for unexpected restarts / crashes. - Compare numbers of task completion logs to previous days. - Check the number of new billing / status rows looks right. [1]: `ad419f7592`	2022-01-24 12:52:52 +00:00
Pea Tyczynska	4a90cde701	Merge pull request #3429 from alphagov/cancel_alert_via_api Cancel broadcast via API	2022-01-21 14:04:35 +00:00
Leo Hemsted	1ad25dbc05	Merge pull request #3434 from alphagov/prom-celery disable prometheus writing to files from celery apps	2022-01-21 11:34:07 +00:00
Leo Hemsted	24d260218f	Merge pull request #3401 from alphagov/hide-log-line don't log if we don't delete anything for a service	2022-01-21 11:33:56 +00:00
Leo Hemsted	246016a894	don't log if we don't delete anything for a service we try and delete for lots of services. this includes services that don't actually have anything to delete that day. that might be because they had a custom data retention so we always go to check them, or because they only sent test notifications (which we'll delete but not include in the count in the log line). we don't really need to see log lines saying that we didn't delete anything for that service - that's just a long list of boring log messages that will hide the actual interesting stuff - which services we did delete content for.	2022-01-21 11:04:37 +00:00
Leo Hemsted	9a01c703fa	disable prometheus writing to files from celery apps ## The existing situation To support multiple processes and eventlets recording metrics in parallel, prometheus uses files to store metrics. When you write a metric from a multiprocess app, it writes to a file. Prometheus identifies whether your app is multiprocess by looking for the existence of a `prometheus_multiproc_dir` environment var (in either case). Prometheus reads this variable at a module level (ie: at import time). Assuming it will always be used within a web server, the gds_metrics library auto-sets this to `/tmp` on import, to ensure that prometheus will always be set up correctly. We also have a variety of metrics set up when we create the app. These are generally sensible metrics such as counting the number of database connections in use by measuring sqlalchemy connection events. ## The problem We have seen problems with our notify-delivery-worker-reporting app running out of space. The CELERYD_MAX_TASKS_PER_CHILD flag is set on that app which restarts each worker process every time a task runs (to avoid memory issues), however we've recently massively decreased the size and increased the number of tasks to parallelise nightly tasks. Each time a worker process restarts it will write a new file to disk. This meant that we quickly ran out of disk space, and then the entire app instance was killed. The big rub is that we don't log prometheus metrics from our worker apps! They don't expose an endpoint so there's no way to scrape them so we aren't getting any value from prometheus anyway! But because they use the same codebase they import gds_metrics and get that anyway. ## The solution gds_metrics sets the multiproc env var, however, by importing prometheus FIRST we ensure that the env var is unset at that point, and thus prometheus will harmlessly store the metrics in memory. To ensure that when we run the notify-api that still has the env var set so the stats are shared across all the gunicorn processes, we put this import as the first thing in run_celery.py	2022-01-21 11:01:39 +00:00
Pea Tyczynska	b6dd189462	Test cancel request via API returns 404 if service id does not match	2022-01-20 18:28:10 +00:00
Pea Tyczynska	52dbdb7518	Move validate_and_update_broadcast_message_status to a utils file This is because that function is used both when broadcast status is updated via API and via admin, so it's a shared resource. Also move and update tests for updating broadcast message status so things are tested at source and repetition is avoided.	2022-01-20 18:14:41 +00:00
Pea Tyczynska	c9afb2f038	Remove unnecessary error handling The context here should be enough for the users, custom error message is not needed.	2022-01-20 18:14:40 +00:00
Pea Tyczynska	c2a389e81a	Move updating user validation out of validate_and_update_broadcast_message_status As only 1 of 2 functions calling it needs that check, it's better to perform it inside that 1 function.	2022-01-20 18:14:39 +00:00
Ben Thorner	731ebed224	Merge pull request #3433 from alphagov/revert-parallel-status-180693991 Revert running status aggregation in parallel	2022-01-20 13:11:33 +00:00
Ben Thorner	0f6dea0deb	Revert running status aggregation in parallel The top-level task didn't run successfully after this was deployed because the worker was killed due to heavy disk usage. While the more parallel version does log much more, it doesn't totally explain the disk behaviour. Nonetheless, reverting it is sensible to give us the time we need to investigate more.	2022-01-20 12:22:33 +00:00
Pea Tyczynska	a4c20e8ba6	Return 404 if reference from cancel message does not match If the reference from cancel CAP XML we received via API does not match with any existing broadcast, return 404. Do the same if service id doesn't match. Also refactor code to cancel broadcast out into separate function It should be a separate function that is only called by create_broadcast function. This will prevent create_broadcast from becoming too big and complex and doing too many things.	2022-01-19 15:42:27 +00:00
Pea Tyczynska	3b4a9d8942	Cancel broadcast via API When a service sends us an XML CAP broadcast message with Cancel status, and that broadcast is in broadcasting state, we cancel it.	2022-01-19 15:42:26 +00:00
Pea Tyczynska	940126abfb	Reject unapproved broadcast upon cancel API request When a service sends us a cancel broadcast XML via API, if that broadcast was not approved yet, reject it.	2022-01-19 15:41:38 +00:00
Ben Thorner	0a88724ff5	Merge pull request #3428 from alphagov/remove-dup-column Remove duplicate declaration for reference column	2022-01-19 13:49:44 +00:00
Ben Thorner	6be489daa7	Merge pull request #3425 from alphagov/parallelise-ft-status-180693991 Parallelise status aggregation by service and day	2022-01-19 13:49:28 +00:00
Ben Thorner	9686595fa8	Minor tweaks to address comments on the PR To address: - https://github.com/alphagov/notifications-api/pull/3425#discussion_r786867994 - https://github.com/alphagov/notifications-api/pull/3425#discussion_r786853329 - https://github.com/alphagov/notifications-api/pull/3425#discussion_r786848793 - https://github.com/alphagov/notifications-api/pull/3425#discussion_r786214794	2022-01-18 16:56:53 +00:00
Ben Thorner	cfa6284af7	Remove duplicate declaration for reference column This is identical to the declaration a few lines above.	2022-01-17 12:03:14 +00:00
Katie Smith	5cd6fcbb4f	Merge pull request #3423 from alphagov/org-user-delte Add endpoint to allow org team members to be removed	2022-01-13 08:39:32 +00:00
Ben Thorner	086f0f50a6	Remove unnecessary extra method in status DAO This makes it easier to see what is being queried.	2022-01-12 15:48:00 +00:00
Ben Thorner	9182ebf4e5	Parallelise status aggregation by service and day This follows a similar approach as [1]. Recently we've seen lots of errors from this task, which we think are a consequence of it doing too much work and tripping Celery's visibility timeout. While we can optimise the query [2], it's likely the errors will return as the number of live services grows. Parallelising the aggregation now will make it more futureproof. [1]: https://github.com/alphagov/notifications-api/pull/3397 [2]: https://github.com/alphagov/notifications-api/pull/3417	2022-01-12 15:47:59 +00:00
Ben Thorner	c3da139e9c	Remove redundant migration tasks (esp. for status) These were added long ago [1][2] and aren't referenced in runbooks, so it should be safe to delete them. [1]: `13f3662051` [2]: `b9953dd005`	2022-01-12 15:47:58 +00:00
Ben Thorner	d772ae6b46	Standardise logs for status aggregation tasks This will make it easier to parallelise by service later on.	2022-01-12 15:47:57 +00:00
Ben Thorner	4feed950c4	DRY-up loops to kick off status aggregation tasks This will make it easier to parallelise by service in the following commits, since we only have one loop to change.	2022-01-12 15:47:56 +00:00
Ben Thorner	ddbf556486	Rewrite task to aggregate status by service This is a step towards parallelising the task by service and day.	2022-01-12 15:47:53 +00:00
Ben Thorner	9fc8b904c6	DRY up status aggregation tests (move DAO tests up) The previous DAO tests were also confusing because they were testing two functions at the same time, so moving the tests up to the task level seems very reasonable, and will make it easier to change how this code works in the next commits.	2022-01-11 16:11:36 +00:00
Katie Smith	ed725c1513	Add endpoint to allow org team members to be removed This is similar to the corresponding endpoint for services. However, it is a little simpler since we don't need to worry about always having at least one team member for an organisation. The new dao function added, `dao_remove_user_from_organisation`, is also simpler than `dao_remove_user_from_service` since we don't have any organisation permissions to deal with.	2022-01-11 15:20:48 +00:00
Ben Thorner	081e0cab88	Merge pull request #3417 from alphagov/optimise-status-query-180693991 Optimise query to populate notification statuses	2022-01-11 14:18:36 +00:00
Ben Thorner	63b5204fb0	Optimise query to populate notification statuses Investigation with EXPLAIN and EXPLAIN ANALYZE for the notification history table shows this is another instance of [1] but for the key type column. Swapping "!=" for "IN" solves the problem. [1]: https://github.com/alphagov/notifications-api/pull/3360	2022-01-11 13:22:04 +00:00
Ben Thorner	e4dcea5396	Merge pull request #3421 from alphagov/explain-status-task-180693991 Add comment to explain status aggregation approach	2022-01-11 12:33:38 +00:00
Rebecca Law	ff7ee2cb63	Merge pull request #3422 from alphagov/fix-organisation-billing-query Fix bug in organisation report for its services and usages.	2022-01-11 11:43:01 +00:00
Rebecca Law	2257cae398	Fix bug in organisation report for its services and usages. If a service has not sent any SMS for the financial year the free allowance was showing up as 0 rather than the number in annual billing. The query has been updated to use an outer join so that the free allowance will be returned when there is no ft_billing. There is a potential performance enhancement to only return the data for the services of the organisation in the `fetch_sms_free_allowance_remainder_until_date` subquery. I will investigate in a subsequent PR.	2022-01-11 10:04:36 +00:00
Ben Thorner	a7b39a930c	Add comment to explain status aggregation approach This relates to the performance optimisation work we're doing [1]. Before optimising the task, it's worth asking if we can do less - the comment explains why it has to be this way. Some references to back up the comment: - We do status updates in either table [2]. - We don't allow duplicate receipts for emails [3]. - We don't allow duplicate receipts for SMS [4]. - We don't expect duplicate receipts for letters. This is something we would need to revisit if we want to support additional status updates - we could reject based on the age of the notification, rather than the status. [1]: https://github.com/alphagov/notifications-api/pull/3417 [2]: `20ead82463/app/dao/notifications_dao.py (L538)` [3]: `20ead82463/app/celery/process_ses_receipts_tasks.py (L58)` [4]: `20ead82463/app/dao/notifications_dao.py (L129-L135)`	2022-01-10 18:15:54 +00:00
Ben Thorner	394bf9abd9	Extend test for updating fact statuses This covers that we only exclude test notifications and the key type is copied over correctly. In the next commits we're going to modify this part of the query, so it's important it's covered.	2022-01-05 16:49:30 +00:00
Katie Smith	20ead82463	Merge pull request #3403 from alphagov/get-notis-post Allow `get_all_notifications_for_service` to accept POST requests	2022-01-04 14:26:52 +00:00
Katie Smith	13b6d1e490	Remove unused test function `set_up_get_all_from_hash` stopped being used in `52831813d8`	2022-01-04 14:04:03 +00:00
Katie Smith	3530d26ba3	Use `client` fixture everywhere There were a few tests which weren't using the `client` fixture but were using the code it contains. This simplifies them to use the fixture.	2022-01-04 14:04:03 +00:00
Katie Smith	0b7410818e	Allow `get_all_notifications_for_service` to accept POST requests We want admin to send a POST request to this route if the data contains a message recipient (a phone number or email address) so that this does not show in the logs. This changes the route to accept both GET and POST requests.	2022-01-04 14:04:03 +00:00
Ben Thorner	494a01ba57	Merge pull request #3415 from alphagov/standard-freeze-180760212 Centralise documentation for updating dependencies	2021-12-29 16:01:08 +00:00
Ben Thorner	c03647fb4b	Centralise documentation for updating dependencies This follows the convention established in [1]. [1]: https://github.com/alphagov/notifications-antivirus/pull/83	2021-12-29 14:59:38 +00:00
Richard Baker	ead9814af9	Merge pull request #3412 from alphagov/reduce-db-pool-size Reduce pool size from 30 to 15 connections	2021-12-24 11:40:37 +00:00
Richard Baker	10c09338c3	Merge pull request #3413 from alphagov/increase-timeout-more Bump sqlalchemy statement timeout even higher for reporting worker	2021-12-24 09:46:33 +00:00
David McDonald	edad1c9a21	Bump sqlalchemy statement timeout even higher for reporting worker We saw it fail again last night to calculate how many notifications were sent for one of our services to put in the ft_notification_status table. It ran into the sqlalchemy statement timeout again. To get us through the holiday period let's make it 2 hours, as surely that will be enough, and then we can fix this properly.	2021-12-24 08:56:42 +00:00
sakisv	ad8cf3f3a6	Reduce pool size from 30 to 15 connections Having a pool size of 30 connections means that if we receive a big number of requests, with the current configuration, the API would end up holding onto 30 connections per worker * 4 workers per instance * 35 instances = 4200 connections. With a limit of 5000 connections, this means that we would only have 800 connections to share between the workers or for overflow usage (btw, even the overflow for the API would take us above the 5000 limit - 10 overflow connections per worker * 4 * 35 = 1400 connections, total 5600 _only_ for the API). During our load tests this led to a deadlock situation where nothing could retrieve connections to deal with a queue build-up. The reduced pool size allowed for a much more graceful degradation of the service where, after significant load we would increase the response times but still manage to serve all the requests.	2021-12-23 19:28:17 +02:00
Rebecca Law	77084533fb	Merge pull request #3411 from alphagov/increase-timeout-for-reporting-worker Increase the SQL timeout for the `notify-delivery-worker-reporting` app.	2021-12-23 11:49:28 +00:00
Rebecca Law	603acc8b1e	Increase the SQL timeout for the `notify-delivery-worker-reporting` app. When running the night reporting tasks we are seeing that some tasks are failing because the query is timing out. We need to revisit how to optimise the query but this will at least let the process finish.	2021-12-23 11:41:49 +00:00
David McDonald	3a214da379	Merge pull request #3408 from alphagov/db-connection-close Close DB connection whilst making HTTP to SMS providers	2021-12-22 11:02:13 +00:00
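
The connection arithmetic in commit ad8cf3f3a6 ("Reduce pool size from 30 to 15 connections") can be checked with a short worked sketch. The figures are taken from that commit message; the constant and function names are hypothetical and do not come from notifications-api. The second line of output also assumes that the overflow setting and the instance count stay the same after the change, which the commit message does not state.

```python
# Worked check of the figures quoted in commit ad8cf3f3a6.
# All names here are illustrative only; they are not part of notifications-api.

WORKERS_PER_INSTANCE = 4      # from the commit message
INSTANCES = 35                # from the commit message
DB_CONNECTION_LIMIT = 5000    # from the commit message
OVERFLOW_PER_WORKER = 10      # from the commit message


def api_connections(per_worker: int) -> int:
    """Total connections across all API workers for a per-worker count."""
    return per_worker * WORKERS_PER_INSTANCE * INSTANCES


old_pool = api_connections(30)                   # pool size before the change
new_pool = api_connections(15)                   # pool size after the change
overflow = api_connections(OVERFLOW_PER_WORKER)  # assumed unchanged afterwards

# Columns: pooled connections, headroom under the limit, pool plus overflow.
print(old_pool, DB_CONNECTION_LIMIT - old_pool, old_pool + overflow)
print(new_pool, DB_CONNECTION_LIMIT - new_pool, new_pool + overflow)
```

With a pool of 30 the API alone can exceed the 5000 limit once overflow is counted (5600), which matches the commit message. Under the stated assumptions, a pool of 15 keeps pool plus overflow at 3500.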


8637 Commits