notifications-api

mirror of https://github.com/GSA/notifications-api.git synced 2026-01-16 23:51:25 -05:00

Author	SHA1	Message	Date
Rebecca Law	c01c81326c	Update log message to something a little easier to read and query for.	2022-01-24 12:25:53 +00:00
Rebecca Law	6cd7a23d3c	If there is an invalid letter that has not been updated to `validation-failed` because the `update-validation-failed-for-templated-letter` has not been picked up off the letter-tasks queue and the `collate-letter-pdfs-to-be-sent` has started: 1. The number of letters that we send to DVLA will not be correct (see `20ead82463/app/celery/letters_pdf_tasks.py (L136)`). This may raise an alert with DVLA when they find we have sent them fewer letters than we have reported. 2. When we get the PDF from S3 we will get a file not found `20ead82463/app/celery/letters_pdf_tasks.py (L244)` The error will not prevent the collate task from completing but we will see an alert email for the exception and raise questions. Although this situation is very unlikely because we have a 15 minute window between the last letter deadline date and the time we kick off the collate task we should still mitigate these issues. I updated the queries to only return letters with billable_units > 0, all valid letters should have at least 1 billable unit.	2022-01-19 08:31:19 +00:00
Rebecca Law	841a4fc22f	Mark letters as validation-failed if the templated letter is too long. It is possible that the personalisation for a templated letter can make the letter exceed 10 pages or 5 sheets. We are not validating the letters posted via the API for this validation error. It is only possible to validate the letter once we create the PDF in notifications-template-preview. This means that the letter can only get a validation-failed status after the client has received a 201 from the POST to /v2/notifications. NOTE: we only validate the preview row of a CSV for this validation error, this change will mean that it is possible for a letter to be marked as validation-failed after a successful file upload. A new task to update the notification to `validation-failed` has been added to the API. If we find that the letter is too long once we have created the PDF we call the `update-validation-failed-for-templated-letter` task rather than `update-billable-units-for-letter` task. New work flow for a letter in brief: API - receives POST /v2/notifications :: save to db :: put CREATE_LETTERS_PDF task on queue for template preview to consume TEMPLATE-PREVIEW - consumes task CREATE_LETTERS_PDF :: create PDF :: count pages of PDF :: IF page count exceeds 10 pages put in the letters-invalid-pdf S3 bucket with metadata (similar to the precompiled letters) put `update-validation-failed-for-templated-letter` task on the queue for the API to consume ELSE put PDF in the `letters-pdf` bucket put `update-billable-units-for-letter` task on the queue API - consumes `update-billable-units-for-letter` OR `update-validation-failed-for-templated-letter` task :: IF `update-billable-units-for-letter` task: update billable units for notification as usual :: ELSE `update-validation-failed-for-templated-letter`: update notification_status = `validation-failed` ADMIN - view notification page for letter :: show validation letter for templated letter There will be 3 PRs in order to make this change, one for the API, template-preview and the admin app. Deployment plan Deploy Admin first Deploy API Deploy template-preview Related PRs: alphagov/notifications-template-preview#619 alphagov/notifications-admin#4107 https://www.pivotaltracker.com/story/show/169209742	2022-01-19 08:29:48 +00:00
Katie Smith	5cd6fcbb4f	Merge pull request #3423 from alphagov/org-user-delte Add endpoint to allow org team members to be removed	2022-01-13 08:39:32 +00:00
Katie Smith	ed725c1513	Add endpoint to allow org team members to be removed This is similar to the corresponding endpoint for services. However, it is a little simpler since we don't need to worry about always having at least one team member for an organisation. The new dao function added, `dao_remove_user_from_organisation`, is also simpler than `dao_remove_user_from_service` since we don't have any organisation permissions to deal with.	2022-01-11 15:20:48 +00:00
Ben Thorner	081e0cab88	Merge pull request #3417 from alphagov/optimise-status-query-180693991 Optimise query to populate notification statuses	2022-01-11 14:18:36 +00:00
Ben Thorner	63b5204fb0	Optimise query to populate notification statuses Investigation with EXPLAIN and EXPLAIN ANALYZE for the notification history table shows this is another instance of [1] but for the key type column. Swapping "!=" for "IN" solves the problem. [1]: https://github.com/alphagov/notifications-api/pull/3360	2022-01-11 13:22:04 +00:00
Ben Thorner	e4dcea5396	Merge pull request #3421 from alphagov/explain-status-task-180693991 Add comment to explain status aggregation approach	2022-01-11 12:33:38 +00:00
Rebecca Law	ff7ee2cb63	Merge pull request #3422 from alphagov/fix-organisation-billing-query Fix bug in organisation report for its services and usages.	2022-01-11 11:43:01 +00:00
Rebecca Law	2257cae398	Fix bug in organisation report for its services and usages. If a service has not sent any SMS for the financial year the free allowance was showing up as 0 rather than the number in annual billing. The query has been updated to use an outer join so that the free allowance will be returned when there is no ft_billing. There is a potential performance enhancement to only return the data for the services of the organisation in the `fetch_sms_free_allowance_remainder_until_date` subquery. I will investigate in a subsequent PR.	2022-01-11 10:04:36 +00:00
Ben Thorner	a7b39a930c	Add comment to explain status aggregation approach This relates to the performance optimisation work we're doing [1]. Before optimising the task, it's worth asking if we can do less - the comment explains why it has to be this way. Some references to back up the comment: - We do status updates in either table [2]. - We don't allow duplicate receipts for emails [3]. - We don't allow duplicate receipts for SMS [4]. - We don't expect duplicate receipts for letters. This is something we would need to revisit if we want to support additional status updates - we could reject based on the age of the notification, rather than the status. [1]: https://github.com/alphagov/notifications-api/pull/3417 [2]: `20ead82463/app/dao/notifications_dao.py (L538)` [3]: `20ead82463/app/celery/process_ses_receipts_tasks.py (L58)` [4]: `20ead82463/app/dao/notifications_dao.py (L129-L135)`	2022-01-10 18:15:54 +00:00
Ben Thorner	394bf9abd9	Extend test for updating fact statuses This covers that we only exclude test notifications and the key type is copied over correctly. In the next commits we're going to modify this part of the query, so it's important it's covered.	2022-01-05 16:49:30 +00:00
Katie Smith	20ead82463	Merge pull request #3403 from alphagov/get-notis-post Allow `get_all_notifications_for_service` to accept POST requests	2022-01-04 14:26:52 +00:00
Katie Smith	13b6d1e490	Remove unused test function `set_up_get_all_from_hash` stopped being used in `52831813d8`	2022-01-04 14:04:03 +00:00
Katie Smith	3530d26ba3	Use `client` fixture everywhere There were a few tests which weren't using the `client` fixture but were using the code it contains. This simplifies them to use the fixture.	2022-01-04 14:04:03 +00:00
Katie Smith	0b7410818e	Allow `get_all_notifications_for_service` to accept POST requests We want admin to send a POST request to this route if the data contains a message recipient (a phone number or email address) so that this does not show in the logs. This changes the route to accept both GET and POST requests.	2022-01-04 14:04:03 +00:00
Ben Thorner	494a01ba57	Merge pull request #3415 from alphagov/standard-freeze-180760212 Centralise documentation for updating dependencies	2021-12-29 16:01:08 +00:00
Ben Thorner	c03647fb4b	Centralise documentation for updating dependencies This follows the convention established in [1]. [1]: https://github.com/alphagov/notifications-antivirus/pull/83	2021-12-29 14:59:38 +00:00
Richard Baker	ead9814af9	Merge pull request #3412 from alphagov/reduce-db-pool-size Reduce pool size from 30 to 15 connections	2021-12-24 11:40:37 +00:00
Richard Baker	10c09338c3	Merge pull request #3413 from alphagov/increase-timeout-more Bump sqlalchemy statement timeout even higher for reporting worker	2021-12-24 09:46:33 +00:00
David McDonald	edad1c9a21	Bump sqlalchemy statement timeout even higher for reporting worker We saw it fail again last night to calculate how many notifications were sent for one of our services to put in the ft_notification_status table. It ran into the sqlalchemy statement timeout again. To get us through the holiday period let's make it 2 hours as surely that will be enough and then we can fix this properly	2021-12-24 08:56:42 +00:00
sakisv	ad8cf3f3a6	Reduce pool size from 30 to 15 connections Having a pool size of 30 connections means that if we receive a big number of requests, with the current configuration, the API would end up holding onto 30 connections per worker * 4 workers per instance * 35 instances = 4200 connections. With a limit of 5000 connections, this means that we would only have 800 connections to share between the workers or for overflow usage (btw, even the overflow for the API would take us above the 5000 limit - 10 overflow connections per worker * 4 * 35 = 1400 connections, total 5600 _only_ for the API). During our load tests this led to a deadlock situation where nothing could retrieve connections to deal with a queue build-up. The reduced pool size allowed for a much more graceful degradation of the service where, after significant load we would increase the response times but still manage to serve all the requests.	2021-12-23 19:28:17 +02:00
Rebecca Law	77084533fb	Merge pull request #3411 from alphagov/increase-timeout-for-reporting-worker Increase the SQL timeout for the `notify-delivery-worker-reporting` app.	2021-12-23 11:49:28 +00:00
Rebecca Law	603acc8b1e	Increase the SQL timeout for the `notify-delivery-worker-reporting` app. When running the night reporting tasks we are seeing that some tasks are failing because the query is timing out. We need to revisit how to optimise the query but this will at least let the process finish.	2021-12-23 11:41:49 +00:00
David McDonald	3a214da379	Merge pull request #3408 from alphagov/db-connection-close Close DB connection whilst making HTTP to SMS providers	2021-12-22 11:02:13 +00:00
David McDonald	2584946823	Close DB connection whilst making HTTP to SMS providers At the moment, when we are processing and sending an SMS we open a DB connection at the start of the celery task and then close it at the end of the celery task. Nice and simple. However, during that celery task we make an HTTP call out to our SMS providers. If our SMS providers have problems or response times start to slow then it means we have an open DB connection sat waiting for our SMS providers to respond which could take seconds. If our SMS providers grind to a halt, this would cause all of the celery tasks to hold on to their connections and we would run out of DB connections and Notify would fall over. We think we can solve this by closing the DB session which releases the DB connection back to the pool. Note, we've seen this happen in staging during load testing if our SMS provider stub has fallen over. We've never seen it in production and it may be less likely to happen as we are balancing traffic across two providers and they generally have very good uptime. One downside to be aware of is there could be a slight increase in time spent to send an SMS as we will now spend a bit of extra time closing the DB session and then reopening it again after the HTTP request is done. Note, there is no reason this approach couldn't be copied for our email provider too if it appears successful.	2021-12-21 17:45:53 +00:00
Pea Tyczynska	32cd7a0eb6	Merge pull request #3395 from alphagov/fix_org_usage_report Fix calculating remaining free allowance for SMS	2021-12-21 15:02:54 +00:00
Pea Tyczynska	d334e405c5	Refactor tests for sms remainder to make them easier to read	2021-12-21 14:43:56 +00:00
Ben Thorner	e55b654a0b	Merge pull request #3407 from alphagov/downgrade-inbound-log Downgrade log about orphaned inbound SMS	2021-12-21 13:36:10 +00:00
Ben Thorner	f65fb519c7	Merge pull request #3404 from alphagov/remove-redundant-conditional-180477467 Remove redundant conditional for letter branding	2021-12-21 13:35:59 +00:00
Ben Thorner	3d30965193	Downgrade log about orphaned inbound SMS We can't control who might be sending messages on inbound numbers that we own i.e. this log isn't an actionable error. Looks like it used to represent something that _was_ an error [1], but that's not the case anymore, so it seems reasonable to downgrade it. [1]: `d99ab329eb (diff-80d123d9abb40f80a221979940657a2751cc7cb33f255aa8f352a8324023e022L125)`	2021-12-21 12:49:00 +00:00
Ben Thorner	c52cb4a8a8	Merge pull request #3406 from alphagov/bump-utils-51-3-0-180693991 Bump utils to 51.3.0	2021-12-20 16:59:43 +00:00
Ben Thorner	491b7ce9ee	Bump utils to 51.3.0 This brings in new logging for the NotifyCelery base class [1]. [1]: https://github.com/alphagov/notifications-utils/pull/938	2021-12-20 16:45:47 +00:00
Ben Thorner	f4d967c0f1	Merge pull request #3405 from alphagov/downgrade-delete-letter-log-180692253 Downgrade log for letter deletion exceptions	2021-12-20 13:39:24 +00:00
Ben Thorner	de9ae08ecc	Downgrade log for letter deletion exceptions If the S3 object is missing [1], then that's what we want, so we don't need such a severe log for it, but we still want to know as it's not expected. This is separate to more general "ClientError" exceptions, which could mean anything. There weren't any tests to cover missing S3 objects, so I've added one. I don't think we need a test for ClientErrors: - If there was no handler, the task would fail and we'd learn about it that way. - The scope of the calling task is now much smaller, so it matters less than it used to [2]. [1]: `81a79e56ce/app/letters/utils.py (L52)` [2]: `f965322f25`	2021-12-20 12:45:48 +00:00
Ben Thorner	76da31c32a	Remove redundant conditional for letter branding This is no longer used when creating a service [1]. It was likely added at a migration point when Admin _did_ specify branding. [1]: `50c3c3e10c/app/main/views/add_service.py (L15-L22)`	2021-12-16 17:54:33 +00:00
Pea Tyczynska	6c04deaec2	Get rid of unnecessary coalesce	2021-12-14 17:36:03 +00:00
Leo Hemsted	81a79e56ce	Merge pull request #3397 from alphagov/delete-per-service run the notification delete task per service	2021-12-14 15:37:22 +00:00
Leo Hemsted	228d72dc8f	update log messages in delete task. less prose, clearer output. (hopefully)	2021-12-14 15:24:35 +00:00
Leo Hemsted	0dc0e184b9	clean up and rewrite notification_dao_delete_notifications a bunch of these tests are now covered in the task test, so got rid of some. Now that the "how long ago to delete" question is asked in the task rather than in the dao, and only one service is looked at at a time, we don't need to worry about data retention, etc. Hopefully made the tests simpler - there may still be some duplicates or overlaps between the various cases.	2021-12-14 15:24:35 +00:00
Leo Hemsted	49cc1b643f	split delete task up into per service we really don't gain anything by running each service delete in sequence - we get the services, and then just loop through them deleting per service. By deleting per service in separate tasks, we can take advantage of parallelism. the only thing we lose is some log lines but I don't think we're that interested in them. only set query limit at the move_notifications dao function - the task doesn't really care about the technical implementation of how it deletes the notifications	2021-12-14 15:24:34 +00:00
Leo Hemsted	bbc68293bb	Merge pull request #3400 from alphagov/lxml-bump bump lxml to fix security warning	2021-12-14 14:53:32 +00:00
Ben Thorner	7dd3e1fa87	Merge pull request #3399 from alphagov/timeout-stats-180344294 Add new metrics for slow / unknown delivery	2021-12-14 14:02:34 +00:00
Leo Hemsted	d916b07e80	remove old unused scripts common_functions is full of AWS commands to manipulate workers running on ec2 instances. We haven't done any of that for years since we moved to AWS delete_sqs_queues contains scripts to get a list of sqs queues and put their details in a csv, or take a details csv and then delete all those queues. it's not clear what the use-case was for it but no-one's used it for years and we can just use the admin console if we really need to.	2021-12-14 14:02:28 +00:00
Leo Hemsted	b7c1fcb66d	bump lxml to fix security warning two vulnerabilities in <4.6.5 (GHSL-2021-1037 and GHSL-2021-1038) https://github.com/lxml/lxml/blob/master/CHANGES.txt also removes docopt as we don't use it except for a dev script (which we might not need anyway)	2021-12-14 13:47:38 +00:00
Ben Thorner	c8cf057eba	Record providers we time out notifications for This will help us monitor issues with delivery receipts and keep track of provider performance over time. I'm not concerned about performance here: - The number of notifications to time out is usually small. - This task only runs once a day. - Calls to StatsD are quick and cheap.	2021-12-14 13:04:39 +00:00
Ben Thorner	11278c47f5	Replace log with StatsD gauge for slow delivery A gauge is more useful as we can visualise it and combine it with other stats - we already have other stats for the total number of notifications sent by provider, and we can extrapolate the number of slow notifications using this, if needed. We also still have logs to say the task is running, as well as a log in the calling code when we actually make a switch [1], so we're not losing anything by removing the log here. [1]: `a9306c4557/app/celery/scheduled_tasks.py (L117)`	2021-12-14 13:03:43 +00:00
Ben Thorner	a9306c4557	Merge pull request #3398 from alphagov/infinity-timeout-180344153 Scale timeout task to work on arbitrary volumes	2021-12-14 11:21:26 +00:00
Ben Thorner	c1f0c24d82	Trim down tests for DAO timeout function a bit The first test is enough to cover that "created" and "delivered" notifications aren't affected by this function.	2021-12-13 17:17:41 +00:00
Ben Thorner	87cd40d00a	Scale timeout task to work on arbitrary volumes Previously this was limited to 500K notifications. While we don't expect to reach this limit, it's not impossible e.g. if we had a repeat of the incident where one of our providers stopped sending us status updates. Although that's not great, it's worse if our code can't cope with the unexpectedly high volume. This reuses the technique we have elsewhere [1] to keep processing in batches until there's nothing left. Specifying a cutoff point means the total amount of work to do can't keep growing. [1]: `2fb432adaf/app/dao/notifications_dao.py (L441)`	2021-12-13 17:14:28 +00:00
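
The connection arithmetic in the message for `ad8cf3f3a6` can be checked with a short sketch. The figures (30 and 15 pooled connections, 10 overflow connections, 4 workers, 35 instances, a limit of 5000) come from that commit message; the function and constant names are hypothetical, and the final figure assumes the overflow setting stays at 10 after the change, which the message does not state.

```python
# Figures are taken from the commit message for ad8cf3f3a6. The function and
# constant names are hypothetical and exist only for this illustration.


def max_connections(per_worker: int, workers_per_instance: int, instances: int) -> int:
    """Upper bound on connections held across every API instance."""
    return per_worker * workers_per_instance * instances


WORKERS_PER_INSTANCE = 4
INSTANCES = 35
DB_CONNECTION_LIMIT = 5000

old_pool = max_connections(30, WORKERS_PER_INSTANCE, INSTANCES)
overflow = max_connections(10, WORKERS_PER_INSTANCE, INSTANCES)
new_pool = max_connections(15, WORKERS_PER_INSTANCE, INSTANCES)

# Old configuration: pool plus overflow exceeds the limit for the API alone.
print(old_pool, overflow, old_pool + overflow)  # 4200 1400 5600
print(old_pool + overflow > DB_CONNECTION_LIMIT)  # True

# New configuration, assuming overflow is unchanged at 10 per worker.
print(new_pool + overflow)  # 3500
```

Under these assumptions the API's worst case falls from 5600 to 3500 connections, which leaves room under the limit for the other workers.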


8612 Commits
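
The batching approach described in the message for `87cd40d00a` (keep processing in batches until nothing is left, with a fixed cutoff so the amount of work cannot keep growing) might look roughly like the sketch below. It is an in-memory illustration, not the repository's code: the row shape, the `fetch_batch` and `time_out_in_batches` names, the `"timed-out"` status, and the three-day cutoff are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# In-memory illustration only. The row shape, function names, the "timed-out"
# status and the three-day cutoff are hypothetical, not taken from the repo.


def fetch_batch(rows, cutoff, limit):
    """Return up to `limit` rows still in 'sending' that were created before the cutoff."""
    matching = [r for r in rows if r["status"] == "sending" and r["created_at"] < cutoff]
    return matching[:limit]


def time_out_in_batches(rows, cutoff, batch_size):
    """Process batches until a fetch comes back empty; return how many rows changed."""
    total = 0
    while True:
        batch = fetch_batch(rows, cutoff, batch_size)
        if not batch:
            return total
        for row in batch:
            row["status"] = "timed-out"  # updated rows no longer match the next fetch
        total += len(batch)


now = datetime(2021, 12, 13, 17, 0)
cutoff = now - timedelta(days=3)  # fixed once, so newly created rows never add work
rows = [
    {"id": i, "status": "sending", "created_at": now - timedelta(days=i)}
    for i in range(10)
]

timed_out = time_out_in_batches(rows, cutoff, batch_size=2)
still_sending = [r["id"] for r in rows if r["status"] == "sending"]
print(timed_out, still_sending)  # 6 [0, 1, 2, 3]
```

Because each batch changes the status of the rows it touches, the next fetch returns only unprocessed rows, and the loop ends when a fetch is empty rather than at a fixed row count.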