notifications-api

mirror of https://github.com/GSA/notifications-api.git synced 2026-01-30 06:21:50 -05:00

Author	SHA1	Message	Date
Leo Hemsted	1ad25dbc05	Merge pull request #3434 from alphagov/prom-celery disable prometheus writing to files from celery apps	2022-01-21 11:34:07 +00:00
Leo Hemsted	24d260218f	Merge pull request #3401 from alphagov/hide-log-line don't log if we don't delete anything for a service	2022-01-21 11:33:56 +00:00
Leo Hemsted	246016a894	don't log if we don't delete anything for a service we try and delete for lots of services. this includes services that don't actually have anything to delete that day. that might be because they had a custom data retention so we always go to check them, or because they only sent test notifications (which we'll delete but not include in the count in the log line). we don't really need to see log lines saying that we didn't delete anything for that service - that's just a long list of boring log messages that will hide the actual interesting stuff - which services we did delete content for.	2022-01-21 11:04:37 +00:00
Leo Hemsted	9a01c703fa	disable prometheus writing to files from celery apps ## The existing situation To support multiple processes and eventlets recording metrics in parallel, prometheus uses files to store metrics. When you write a metric from a multiprocess app, it writes to a file. Prometheus identifies whether your app is multiprocess by looking for the existence of a `prometheus_multiproc_dir` environment var (in either case). Prometheus reads this variable at a module level (ie: at import time). Assuming it will always be used within a web server, the gds_metrics library auto-sets this to `/tmp` on import, to ensure that prometheus will always be set up correctly. We also have a variety of metrics set up when we create the app. These are generally sensible metrics such as counting the number of database connections in use by measuring sqlalchemy connection events. ## The problem We have seen problems with our notify-delivery-worker-reporting app running out of space. The CELERYD_MAX_TASKS_PER_CHILD flag is set on that app which restarts each worker process every time a task runs (to avoid memory issues), however we've recently massively decreased the size and increased the number of tasks to parallelise nightly tasks. Each time a worker process restarts it will write a new file to disk. This meant that we quickly ran out of disk space, and then the entire app instance was killed. The big rub is that we don't log prometheus metrics from our worker apps! They don't expose an endpoint so there's no way to scrape them so we aren't getting any value from prometheus anyway! But because they use the same codebase they import gds_metrics and get that anyway. ## The solution gds_metrics sets the multiproc env var, however, by importing prometheus FIRST we ensure that the env var is unset at that point, and thus prometheus will harmlessly store the metrics in memory. To ensure that the notify-api still has the env var set when it runs, so the stats are shared across all the gunicorn processes, we put this import as the first thing in run_celery.py	2022-01-21 11:01:39 +00:00
Ben Thorner	731ebed224	Merge pull request #3433 from alphagov/revert-parallel-status-180693991 Revert running status aggregation in parallel	2022-01-20 13:11:33 +00:00
Ben Thorner	0f6dea0deb	Revert running status aggregation in parallel The top-level task didn't run successfully after this was deployed due to the worker being killed due to heavy disk usage. While the more parallel version does log much more, it doesn't totally explain the disk behaviour. Nonetheless, reverting it is sensible to give us the time we need to investigate more.	2022-01-20 12:22:33 +00:00
Ben Thorner	0a88724ff5	Merge pull request #3428 from alphagov/remove-dup-column Remove duplicate declaration for reference column	2022-01-19 13:49:44 +00:00
Ben Thorner	6be489daa7	Merge pull request #3425 from alphagov/parallelise-ft-status-180693991 Parallelise status aggregation by service and day	2022-01-19 13:49:28 +00:00
Ben Thorner	9686595fa8	Minor tweaks to address comments on the PR To address: - https://github.com/alphagov/notifications-api/pull/3425#discussion_r786867994 - https://github.com/alphagov/notifications-api/pull/3425#discussion_r786853329 - https://github.com/alphagov/notifications-api/pull/3425#discussion_r786848793 - https://github.com/alphagov/notifications-api/pull/3425#discussion_r786214794	2022-01-18 16:56:53 +00:00
Ben Thorner	cfa6284af7	Remove duplicate declaration for reference column This is identical to the declaration a few lines above.	2022-01-17 12:03:14 +00:00
Katie Smith	5cd6fcbb4f	Merge pull request #3423 from alphagov/org-user-delte Add endpoint to allow org team members to be removed	2022-01-13 08:39:32 +00:00
Ben Thorner	086f0f50a6	Remove unnecessary extra method in status DAO This makes it easier to see what is being queried.	2022-01-12 15:48:00 +00:00
Ben Thorner	9182ebf4e5	Parallelise status aggregation by service and day This follows a similar approach as [1]. Recently we've seen lots of errors from this task, which we think are a consequence of it doing too much work and tripping Celery's visibility timeout. While we can optimise the query [2], it's likely the errors will return as the number of live services grows. Parallelising the aggregation now will make it more futureproof. [1]: https://github.com/alphagov/notifications-api/pull/3397 [2]: https://github.com/alphagov/notifications-api/pull/3417	2022-01-12 15:47:59 +00:00
Ben Thorner	c3da139e9c	Remove redundant migration tasks (esp. for status) These were added long ago [1][2] and aren't referenced in runbooks, so it should be safe to delete them. [1]: `13f3662051` [2]: `b9953dd005`	2022-01-12 15:47:58 +00:00
Ben Thorner	d772ae6b46	Standardise logs for status aggregation tasks This will make it easier to parallelise by service later on.	2022-01-12 15:47:57 +00:00
Ben Thorner	4feed950c4	DRY-up loops to kick off status aggregation tasks This will make it easier to parallelise by service in the following commits, since we only have one loop to change.	2022-01-12 15:47:56 +00:00
Ben Thorner	ddbf556486	Rewrite task to aggregate status by service This is a step towards parallelising the task by service and day.	2022-01-12 15:47:53 +00:00
Ben Thorner	9fc8b904c6	DRY up status aggregation tests (move DAO tests up) The previous DAO tests were also confusing because they were testing two functions at the same time, so moving the tests up to the task level seems very reasonable, and will make it easier to change how this code works in the next commits.	2022-01-11 16:11:36 +00:00
Katie Smith	ed725c1513	Add endpoint to allow org team members to be removed This is similar to the corresponding endpoint for services. However, it is a little simpler since we don't need to worry about always having at least one team member for an organisation. The new dao function added, `dao_remove_user_from_organisation`, is also simpler than `dao_remove_user_from_service` since we don't have any organisation permissions to deal with.	2022-01-11 15:20:48 +00:00
Ben Thorner	081e0cab88	Merge pull request #3417 from alphagov/optimise-status-query-180693991 Optimise query to populate notification statuses	2022-01-11 14:18:36 +00:00
Ben Thorner	63b5204fb0	Optimise query to populate notification statuses Investigation with EXPLAIN and EXPLAIN ANALYZE for the notification history table shows this is another instance of [1] but for the key type column. Swapping "!=" for "IN" solves the problem. [1]: https://github.com/alphagov/notifications-api/pull/3360	2022-01-11 13:22:04 +00:00
Ben Thorner	e4dcea5396	Merge pull request #3421 from alphagov/explain-status-task-180693991 Add comment to explain status aggregation approach	2022-01-11 12:33:38 +00:00
Rebecca Law	ff7ee2cb63	Merge pull request #3422 from alphagov/fix-organisation-billing-query Fix bug in organisation report for its services and usages.	2022-01-11 11:43:01 +00:00
Rebecca Law	2257cae398	Fix bug in organisation report for its services and usages. If a service has not sent any SMS for the financial year the free allowance was showing up as 0 rather than the number in annual billing. The query has been updated to use an outer join so that the free allowance will be returned when there is no ft_billing. There is a potential performance enhancement to only return the data for the services of the organisation in the `fetch_sms_free_allowance_remainder_until_date` subquery. I will investigate in a subsequent PR.	2022-01-11 10:04:36 +00:00
Ben Thorner	a7b39a930c	Add comment to explain status aggregation approach This relates to the performance optimisation work we're doing [1]. Before optimising the task, it's worth asking if we can do less - the comment explains why it has to be this way. Some references to back up the comment: - We do status updates in either table [2]. - We don't allow duplicate receipts for emails [3]. - We don't allow duplicate receipts for SMS [4]. - We don't expect duplicate receipts for letters. This is something we would need to revisit if we want to support additional status updates - we could reject based on the age of the notification, rather than the status. [1]: https://github.com/alphagov/notifications-api/pull/3417 [2]: `20ead82463/app/dao/notifications_dao.py (L538)` [3]: `20ead82463/app/celery/process_ses_receipts_tasks.py (L58)` [4]: `20ead82463/app/dao/notifications_dao.py (L129-L135)`	2022-01-10 18:15:54 +00:00
Ben Thorner	394bf9abd9	Extend test for updating fact statuses This covers that we only exclude test notifications and the key type is copied over correctly. In the next commits we're going to modify this part of the query, so it's important it's covered.	2022-01-05 16:49:30 +00:00
Katie Smith	20ead82463	Merge pull request #3403 from alphagov/get-notis-post Allow `get_all_notifications_for_service` to accept POST requests	2022-01-04 14:26:52 +00:00
Katie Smith	13b6d1e490	Remove unused test function `set_up_get_all_from_hash` stopped being used in `52831813d8`	2022-01-04 14:04:03 +00:00
Katie Smith	3530d26ba3	Use `client` fixture everywhere There were a few tests which weren't using the `client` fixture but were using the code it contains. This simplifies them to use the fixture.	2022-01-04 14:04:03 +00:00
Katie Smith	0b7410818e	Allow `get_all_notifications_for_service` to accept POST requests We want admin to send a POST request to this route if the data contains a message recipient (a phone number or email address) so that this does not show in the logs. This changes the route to accept both GET and POST requests.	2022-01-04 14:04:03 +00:00
Ben Thorner	494a01ba57	Merge pull request #3415 from alphagov/standard-freeze-180760212 Centralise documentation for updating dependencies	2021-12-29 16:01:08 +00:00
Ben Thorner	c03647fb4b	Centralise documentation for updating dependencies This follows the convention established in [1]. [1]: https://github.com/alphagov/notifications-antivirus/pull/83	2021-12-29 14:59:38 +00:00
Richard Baker	ead9814af9	Merge pull request #3412 from alphagov/reduce-db-pool-size Reduce pool size from 30 to 15 connections	2021-12-24 11:40:37 +00:00
Richard Baker	10c09338c3	Merge pull request #3413 from alphagov/increase-timeout-more Bump sqlalchemy statement timeout even higher for reporting worker	2021-12-24 09:46:33 +00:00
David McDonald	edad1c9a21	Bump sqlalchemy statement timeout even higher for reporting worker We saw it fail again last night to calculate how many notifications were sent for one of our services to put in the ft_notification_status table. It ran into the sqlalchemy statement timeout again. To get us through the holiday period let's make it 2 hours as surely that will be enough and then we can fix this properly	2021-12-24 08:56:42 +00:00
sakisv	ad8cf3f3a6	Reduce pool size from 30 to 15 connections Having a pool size of 30 connections means that if we receive a big number of requests, with the current configuration, the API would end up holding onto 30 connections per worker * 4 workers per instance * 35 instances = 4200 connections. With a limit of 5000 connections, this means that we would only have 800 connections to share between the workers or for overflow usage (btw, even the overflow for the API would take us above the 5000 limit - 10 overflow connections per worker * 4 * 35 = 1400 connections, total 5600 _only_ for the API). During our load tests this led to a deadlock situation where nothing could retrieve connections to deal with a queue build-up. The reduced pool size allowed for a much more graceful degradation of the service where, after significant load we would increase the response times but still manage to serve all the requests.	2021-12-23 19:28:17 +02:00
Rebecca Law	77084533fb	Merge pull request #3411 from alphagov/increase-timeout-for-reporting-worker Increase the SQL timeout for the `notify-delivery-worker-reporting` app.	2021-12-23 11:49:28 +00:00
Rebecca Law	603acc8b1e	Increase the SQL timeout for the `notify-delivery-worker-reporting` app. When running the night reporting tasks we are seeing that some tasks are failing because the query is timing out. We need to revisit how to optimise the query but this will at least let the process finish.	2021-12-23 11:41:49 +00:00
David McDonald	3a214da379	Merge pull request #3408 from alphagov/db-connection-close Close DB connection whilst making HTTP to SMS providers	2021-12-22 11:02:13 +00:00
David McDonald	2584946823	Close DB connection whilst making HTTP to SMS providers At the moment, when we are processing and sending an SMS we open a DB connection at the start of the celery task and then close it at the end of the celery task. Nice and simple. However, during that celery task we make an HTTP call out to our SMS providers. If our SMS providers have problems or response times start to slow then it means we have an open DB connection sat waiting for our SMS providers to respond which could take seconds. If our SMS providers grind to a halt, this would cause all of the celery tasks to hold on to their connections and we would run out of DB connections and Notify would fall over. We think we can solve this by closing the DB session which releases the DB connection back to the pool. Note, we've seen this happen in staging during load testing if our SMS provider stub has fallen over. We've never seen it in production and it may be less likely to happen as we are balancing traffic across two providers and they generally have very good uptime. One downside to be aware of is there could be a slight increase in time spent to send an SMS as we will now spend a bit of extra time closing the DB session and then reopening it again after the HTTP request is done. Note, there is no reason this approach couldn't be copied for our email provider too if it appears successful.	2021-12-21 17:45:53 +00:00
Pea Tyczynska	32cd7a0eb6	Merge pull request #3395 from alphagov/fix_org_usage_report Fix calculating remaining free allowance for SMS	2021-12-21 15:02:54 +00:00
Pea Tyczynska	d334e405c5	Refactor tests for sms remainder to make them easier to read	2021-12-21 14:43:56 +00:00
Ben Thorner	e55b654a0b	Merge pull request #3407 from alphagov/downgrade-inbound-log Downgrade log about orphaned inbound SMS	2021-12-21 13:36:10 +00:00
Ben Thorner	f65fb519c7	Merge pull request #3404 from alphagov/remove-redundant-conditional-180477467 Remove redundant conditional for letter branding	2021-12-21 13:35:59 +00:00
Ben Thorner	3d30965193	Downgrade log about orphaned inbound SMS We can't control who might be sending messages on inbound numbers that we own i.e. this log isn't an actionable error. Looks like it used to represent something that _was_ an error [1], but that's not the case anymore, so it seems reasonable to downgrade it. [1]: `d99ab329eb (diff-80d123d9abb40f80a221979940657a2751cc7cb33f255aa8f352a8324023e022L125)`	2021-12-21 12:49:00 +00:00
Ben Thorner	c52cb4a8a8	Merge pull request #3406 from alphagov/bump-utils-51-3-0-180693991 Bump utils to 51.3.0	2021-12-20 16:59:43 +00:00
Ben Thorner	491b7ce9ee	Bump utils to 51.3.0 This brings in new logging for the NotifyCelery base class [1]. [1]: https://github.com/alphagov/notifications-utils/pull/938	2021-12-20 16:45:47 +00:00
Ben Thorner	f4d967c0f1	Merge pull request #3405 from alphagov/downgrade-delete-letter-log-180692253 Downgrade log for letter deletion exceptions	2021-12-20 13:39:24 +00:00
Ben Thorner	de9ae08ecc	Downgrade log for letter deletion exceptions If the S3 object is missing [1], then that's what we want, so we don't need such a severe log for it, but we still want to know as it's not expected. This is separate to more general "ClientError" exceptions, which could mean anything. There weren't any tests to cover missing S3 objects, so I've added one. I don't think we need a test for ClientErrors: - If there was no handler, the task would fail and we'd learn about it that way. - The scope of the calling task is now much smaller, so it matters less than it used to [2]. [1]: `81a79e56ce/app/letters/utils.py (L52)` [2]: `f965322f25`	2021-12-20 12:45:48 +00:00
Ben Thorner	76da31c32a	Remove redundant conditional for letter branding This is no longer used when creating a service [1]. It was likely added at a migration point when Admin _did_ specify branding. [1]: `50c3c3e10c/app/main/views/add_service.py (L15-L22)`	2021-12-16 17:54:33 +00:00
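
Note on commit ad8cf3f3a6 (pool size): the connection arithmetic in that message can be reproduced with a short sketch. The worker, instance, overflow and limit figures are taken from the commit message; the function name and the results for a pool size of 15 are derived here for illustration and do not appear in the commit.

```python
# Figures quoted in commit ad8cf3f3a6.
WORKERS_PER_INSTANCE = 4
INSTANCES = 35
OVERFLOW_PER_WORKER = 10
CONNECTION_LIMIT = 5000


def api_connections(pool_size, overflow=0):
    """Worst-case connections held by the API (hypothetical helper)."""
    return (pool_size + overflow) * WORKERS_PER_INSTANCE * INSTANCES


old_pool = api_connections(30)                        # pool only, size 30
old_total = api_connections(30, OVERFLOW_PER_WORKER)  # pool plus overflow
new_pool = api_connections(15)                        # pool only, size 15
new_total = api_connections(15, OVERFLOW_PER_WORKER)  # pool plus overflow

print(old_pool, old_total, old_total > CONNECTION_LIMIT)
print(new_pool, new_total, new_total > CONNECTION_LIMIT)
```

Under these figures the old configuration could exceed the 5000 limit from the API alone, while a pool size of 15 stays below it even with full overflow.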

8626 Commits
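
Note on commit 9a01c703fa (import order): the fix relies on a library deciding its storage mode once, at import time, based on an environment variable that another library sets as an import side effect. The sketch below only simulates that interaction with plain functions and a dictionary standing in for the environment. None of the function names are real prometheus or gds_metrics APIs; they are hypothetical stand-ins for the behaviour the commit message describes.

```python
ENV_VAR = "prometheus_multiproc_dir"


def import_metrics_backend(environ):
    """Stand-in for a module-level check: the mode is fixed at import time."""
    if ENV_VAR in environ or ENV_VAR.upper() in environ:
        return "files"
    return "memory"


def import_env_setter(environ):
    """Stand-in for a library that sets the variable when it is imported."""
    environ.setdefault(ENV_VAR, "/tmp")


def worker_import_order():
    """Backend imported first, as the commit describes for run_celery.py."""
    environ = {}
    mode = import_metrics_backend(environ)  # variable is still unset here
    import_env_setter(environ)              # too late to change the mode
    return mode


def web_import_order():
    """Env-setting library imported first, as for the web app."""
    environ = {}
    import_env_setter(environ)
    return import_metrics_backend(environ)


print(worker_import_order(), web_import_order())
```

In this simulation the worker ordering leaves the backend in memory mode even though the variable is set afterwards, which is the effect the commit message aims for.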