notifications-api

mirror of https://github.com/GSA/notifications-api.git synced 2025-12-20 15:31:15 -05:00

Author	SHA1	Message	Date
Ben Thorner	11278c47f5	Replace log with StatsD gauge for slow delivery A gauge is more useful as we can visualise it and combine it with other stats - we already have other stats for the total number of notifications sent by provider, and we can extrapolate the number of slow notifications using this, if needed. We also still have logs to say the task is running, as well as a log in the calling code when we actually make a switch [1], so we're not losing anything by removing the log here. [1]: `a9306c4557/app/celery/scheduled_tasks.py (L117)`	2021-12-14 13:03:43 +00:00
Ben Thorner	2adaaac3ae	Remove redundant conditions for update query Filtering by ID is enough, noting the other conditions were the same between both queries.	2021-12-13 17:03:07 +00:00
Ben Thorner	c8ebb365d4	Make limit of DAO timeout function more obvious We're going to iterate how we use the function with a limit, so we shouldn't say it's "temporary" anymore. We don't need to change the default, but having it in the function parameters makes it easier to see the function doesn't time out all notifications, just some.	2021-12-13 17:01:41 +00:00
Ben Thorner	76aeab24ce	Rewrite DAO timeout method to take cutoff_time Previously we specified the period and calculated the cutoff time in the function. Passing it in means we can run the method multiple times and avoid getting "new" notifications to time out in the time it takes to process each batch.	2021-12-13 16:56:21 +00:00
Ben Thorner	3bcaf8330e	Simplify comment for DAO timeout function	2021-12-13 16:39:55 +00:00
Ben Thorner	2fb432adaf	Merge pull request #3383 from alphagov/email-sms-created-alert-180344153 Add new log / alert for 'created' email / SMS	2021-12-13 12:56:05 +00:00
David McDonald	7d8eed8228	Optimise queries run for creating pagination links We have been running into the problem in pallets/flask-sqlalchemy#518 where our page loads very slowly when viewing a single page of notifications for a service in the admin app. Tracing this back and using SQL explain analyze I can see that getting the notifications takes about a second but the second query to count how many notifications there are (to work out if there is a next page of pagination) can take up to 100 seconds. As suggested in that issue, we do the pagination ourselves. Our pagination doesn't need us to know exactly how many notifications there are, just whether there are any on the next page and that can be done without running the slow query to count how many notifications in total by using `count_pages=False`. This commit is analogous to `c68d1a2f23`. The only difference is that in that case, the pagination links are used to show prev and/or next links in the admin app. In this case, the pagination links are only used to see if there is a page 2, and if there is, say that we are only showing the first 50 results.	2021-12-10 17:47:27 +00:00
David McDonald	1973994516	Merge pull request #3391 from alphagov/pagination-approach-change Pagination approach change for `get_notifications_for_service`	2021-12-09 10:43:14 +00:00
Ben Thorner	ab4cb029df	Remove alert for email / sms in created In response to [1]. [1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988 It turns out the code that inspired this new alert - in the old "timeout-sending-notifications" task - was actually redundant as we already have a task to "replay" notifications still in "created", which is much better than just alerting about them. It's possible the replayed notifications will also fail, but in both cases we should see some kind of error due to this, so I don't think we're losing anything by not having an alert.	2021-12-06 14:11:42 +00:00
Ben Thorner	97b58ed4c3	Remove unnecessary _timeout partial function It's no longer necessary to have a separate function that's now only called once. While sometimes the separation can bring clarity, here I think it's clearer to have all the code in one place, and avoid the functools complexity we had before.	2021-12-06 14:00:37 +00:00
Ben Thorner	0318229216	Stop 'timing out' old 'created' notifications This is being replaced with a new alert and runbook [1]. It's not always appropriate to change the status to 'technical-failure', and the new alert means we'll act to fix the underlying issue promptly. We'll look at tidying up the remaining code in the next commits. [1]: https://github.com/alphagov/notifications-manuals/wiki/Support-Runbook#deal-with-email-or-sms-still-in-created	2021-12-06 14:00:36 +00:00
Ben Thorner	f96ba5361a	Add new task to alert about created email / sms This will log an error when email or SMS notifications have been stuck in 'created' for too long - normally they should be 'sending' in seconds, noting that we have a goal of < 10s wait time for most notifications being processed by our platform. In the next commits we'll decouple similar functionality from the existing 'timeout-sending-notifications' task.	2021-12-06 14:00:31 +00:00
David McDonald	c68d1a2f23	Optimise queries run for creating pagination links We have been running into the problem in https://github.com/pallets/flask-sqlalchemy/issues/518 where our page loads very slowly when viewing a single page of notifications for a service in the admin app. Tracing this back and using SQL explain analyze I can see that getting the notifications takes about a second but the second query to count how many notifications there are (to work out if there is a next page of pagination) can take up to 100 seconds. As suggested in that issue, we do the pagination ourselves. Our pagination doesn't need us to know exactly how many notifications there are, just whether there are any on the next page and that can be done without running the slow query to count how many notifications in total by using `count_pages=False`.	2021-12-03 17:32:39 +00:00
Leo Hemsted	6bbec9f103	make delete notification tasks parallel by notification type we used to do this until apr 2020. Let's try doing it again. Back then, we had problems with timing. We did two things in spring 2020: We moved to using an intermediary temp table [1] We stopped the tasks being parallelised [2] However, it turned out the real time saving was from changing what services we delete for [3]. The task was actually CPU-bound rather than DB-bound, so that's probably why having the tasks in parallel wasn't helping, since they were all competing for the same CPU. It's worth trying the parallel steps again now that we're no longer CPU bound. Note: Temporary tables are in their own postgres schema, and are only viewable by the current session (session == connection. Each celery worker process has its own db connection). We don't need to worry about separate workers both trying to use the same table at once. I've also added a "DROP ON COMMIT" directive to the table definition just to ensure it doesn't persist past the task even if there's an exception. (This also drops on rollback). Cronitor looks at the three functions separately so we don't need to worry about the main task taking milliseconds where it used to take hours as it isn't monitored itself. I've also removed some unnecessary redundant exception logs. [1] https://github.com/alphagov/notifications-api/pull/2767 [2] https://github.com/alphagov/notifications-api/pull/2798 [3] https://github.com/alphagov/notifications-api/pull/3381	2021-12-01 14:28:08 +00:00
Rebecca Law	101498ec84	Improve query performance Add a filter to the `app.dao.notifications_dao.is_delivery_slow_for_providers` query to improve its performance. Adding `Notifications.notification_type = 'sms'` to the query improves performance: some analysis shows a 500ms improvement, which is a good thing, especially when the query is run once a minute.	2021-11-30 16:42:32 +00:00
Leo Hemsted	bab659c677	reduce number of services we try and delete notifications for TLDR: Don't return as many services, and only return their IDs and not the whole service objects. Context: the delete notifications nightly task has been taking longer and longer, and to delete all three notification types in sequence it now takes up to 8 hours. This is because we were retrieving all services, loading them into memory on the worker, and then trying to delete notifications for each service in turn. While it does use a fair chunk of IOPS/CPU on our postgres db, we're not anywhere close to capacity on those (20% CPU, 4k IOPS out of 30k max)[1] The real issue appears to be that the task is CPU bound on the periodic worker - we see the worker spike up to 100% CPU regularly across the whole 3am-11am period. We also noticed that for each notification type the task first processes services with custom data retention (not many but some of the biggest users), then deals with all other services. We can see from looking at kibana that, for example, the task starts at 3am, and the custom data retention service email deletions are finished by 3:12am. The rest of the emails don't get deleted until 5am, so we knew that the problem is with how it handles the other services. There are currently 17000 services in the database. On a typical day, ~800 services will have notifications that are over 7 days old and need to be deleted. By only returning these services, we reduce the amount of data transfer and serialisation that needs to happen. It takes about two minutes to retrieve the distinct service ids from the notifications table for sms notifications, but that is only 5% the size of the full list so cuts down on a lot of processing Also, by only returning service_ids rather than the whole `Service` model we avoid sqlalchemy needing to do lots of data serialisation, when we were only using the `Service.id` field from that result anyway. 
[1] https://admin.cloud.service.gov.uk/organisations/55b1eb7d-e4c5-4359-9466-dd3ca5b0e457/spaces/80d769ff-7b01-49a4-9fa4-f87edd5328f9/services/6093d337-6918-4b97-9709-97529114eb90/metrics [2] https://grafana-paas.cloudapps.digital/d/_GlGBNbmk/notify-apps?orgId=2&refresh=5s&var-space=production&var-app=notify-delivery-worker-periodic&from=now-24h&to=now [3] https://kibana.logit.io/s/9423a789-282c-4113-908d-0be3b1bc9d1d/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(message),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'%22Deleting%20email%20notifications%20for%20services%20without%20flexible%20data%20retention%22')),sort:!('@timestamp',desc))	2021-11-24 16:18:40 +00:00
Rebecca Law	30a5852685	Update the query to only return the count from the table since that is all we care about. https://www.pivotaltracker.com/story/show/180262357	2021-11-17 14:46:52 +00:00
David McDonald	c98996a461	Improve log message searchability for duplicate receipts There were two problems with the existing message. 1. There was no space between the new status and the time taken, which made reading and searching harder. 2. The key bits of information (before and after status) were separated by the time taken (which will always be unique), meaning you couldn't do an easy search for a message that is, say, in delivered being attempted to be set to temporary-failure.	2021-11-12 14:06:38 +00:00
Ben Thorner	77c8c0a501	Optimise query to get notifications to "time out" From experimenting in production we found a "!=" caused the engine to use a sequential scan, whereas explicitly listing all the types ensured an index scan was used. We also found that querying for many (over 100K) items leads to the task stalling - no logs, but no evidence of it running either - so we also add a limit to the query. Since the query now only returns a subset of notifications, we need to ensure the subsequent "update" query operates on the same batch. Also, as a temporary measure, we have a loop in the task code to ensure it operates on the total set of notifications to "time out", which we assume is less than 500K for the time being.	2021-11-09 13:50:32 +00:00
Katie Smith	8a34dccda0	Remove redundant join This was left over from when we needed to tell if a notification was sent by a crown or non-crown service.	2021-05-06 09:34:46 +01:00
Rebecca Law	4f196316aa	Change the query to get the services to purge to use query on the db.Model rather than db.session.query. `service_ids_to_purge` is a list of `row` object rather than a list of `UUID`. NOTE: db.session.query(Service).filter(Service.id.notin_(services_with_data_retention)).all() would have also worked. It seems that only selecting attributes from the db.Model has caused the change.	2021-04-29 13:32:36 +01:00
Rebecca Law	85895a9e8b	Revert "Scheduled weekly dependency update for week 16"	2021-04-28 10:17:16 +01:00
Rebecca Law	f941768d8c	Change the query to get the services to purge to use query on the db.Model rather than db.session.query. `service_ids_to_purge` is a list of `row` object rather than a list of `UUID`. NOTE: db.session.query(Service).filter(Service.id.notin_(services_with_data_retention)).all() would have also worked. It seems that only selecting attributes from the db.Model has caused the change.	2021-04-27 08:36:34 +01:00
Rebecca Law	d4009ffc52	Rename database management functions. Rename @transactional to @autocommit. Rename nested_transaction to transaction.	2021-04-19 10:56:00 +01:00
Ben Thorner	c76e789f1e	Reduce extra S3 ops when working with letter PDFs Previously we did some unnecessary work: - Collate task. This had one S3 request to get a summary of the object, which was then used in another request to get the full object. We only need the size of the object, which is included in the summary [1]. - Archive task. This had one S3 request to get a summary of the object, which was then used to make another request to delete it. We still need both requests, but we can remove the S3.Object in the middle. [1]: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#objectsummary	2021-03-16 12:53:13 +00:00
Ben Thorner	ff7eebc90a	Simplify deleting old letters Previously we made a call to S3 to list objects for a letter, even though we already had the precise key of the single object to hand. This removes the one usage of "get_s3_bucket_objects" and uses the filename directly in the call to remove the object.	2021-03-15 17:18:20 +00:00
Leo Hemsted	6784ae62a6	Raise Exception if letter PDF not in S3 Previously, the function would just return a presumed filename. Now that it actually checks s3, if the file doesn't exist it'll raise an exception. By default that's a StopIteration at the end of the bucket iterator, which isn't ideal as this will get suppressed if the function is called within a generator loop further up or anything. There are a couple of places where we expect the file may not exist, so we define a custom exception to rescue specifically here. I did consider subclassing boto's ClientError, but this wasn't straightforward as the constructor expects to know the operation that failed, which for me is a signal that it's not an appropriate (re-)use of the class.	2021-03-15 17:18:11 +00:00
Ben Thorner	b43a367d5f	Relax lookup of letter PDFs in S3 buckets Previously we generated the filename we expected a letter PDF to be stored at in S3, and used that to retrieve it. However, the generated filename can change over the course of a notification's lifetime e.g. if the service changes from crown ('.C.') to non-crown ('.N.'). The prefix of the filename is stable: it's based on properties of the notification - reference and creation - that don't change. This commit changes the way we interact with letter PDFs in S3: - Uploading uses the original method to generate the full file name. The method is renamed to 'generate_' to distinguish it from the new one. - Downloading uses a new 'find_' method to get the filename using just its prefix, which makes it agnostic to changes in the filename suffix. Making this change helps to decouple our code from the requirements DVLA have on the filenames. While it means more traffic to S3, we rely on S3 in any case to download the files. From experience, we know S3 is highly reliable and performant, so don't anticipate any issues. In the tests we favour using moto to mock S3, so that the behaviour is realistic. There are a couple of places where we just mock the method, since what it returns isn't important for the test. Note that, since the new method requires a notification object, we need to change a query in one place, the columns of which were only selected to appease the original method to generate a filename.	2021-03-15 13:55:44 +00:00
David McDonald	41d95378ea	Remove everything for the performance platform We no longer will send them any stats so therefore don't need the code - the code to work out the nightly stats - the performance platform client - any configuration for the client - any nightly tasks that kick off the sending of the stats. We will require a change in cronitor as we no longer will have this task run, meaning we need to delete the cronitor check.	2021-03-15 12:04:53 +00:00
Ben Thorner	a91fde2fda	Run auto-correct on app/ and tests/	2021-03-12 11:45:45 +00:00
Rebecca Law	19f7a6ce38	Refactor method for deciding the failure type	2021-03-10 14:39:55 +00:00
Rebecca Law	21edf7bfdd	Persist the processing time statistics to the database. The performance platform is going away soon. The only stat that we do not have in our database is the processing time. Let me clarify the only statistic we don't have in our database that we can query efficiently is the processing time. Any queries on notification_history are too inefficient to use on a web page. Processing time = the total number of normal/team emails and text messages plus the number of messages that have gone from created to sending within 10 seconds per whole day. We can then easily calculate the percentage of messages that were marked as sending under 10 seconds.	2021-02-26 07:49:49 +00:00
Pea Tyczynska	e0c73ac342	Send daily email with letter and sheet volumes to DVLA	2021-02-23 15:13:19 +00:00
Pea Tyczynska	c8ffebcce8	Query to get letter and sheet volumes So we can send daily email with these volumes to DVLA.	2021-02-23 15:13:18 +00:00
David McDonald	e35ea57ba2	Do not delete letters if not in final state A few weeks ago, we deleted some pdf letters that had reached their retention period. However, these letters were in the 'created' state so it's very arguable that we should not have deleted them because we were expecting to resend them and were unable to. Part of the reason for this is that we marked the letters back to `created` as the status but we did not nullify the `sent_at` timestamp, meaning the check on `ebb43082d5/app/dao/notifications_dao.py (L346)` did not catch it. Regardless of that check, which controls whether the files were removed from S3, they were also archived into the `notification_history` table by default. This commit changes our code such that letters that are not in their final state do not go through our retention process. This could mean they violate their retention policy but that is likely the lesser of two evils (the other being we delete them and are unable to resend them). Note, `sending` letters have been included in those not to be removed because there is a risk that we give the letter to DVLA and put it in `sending` but then they come back to us later telling us they've had problems and require us to resend.	2020-12-16 10:50:11 +00:00
David McDonald	47e146f010	Move variable out of loop that didn't need to be	2020-12-16 10:40:30 +00:00
David McDonald	1bf9b29905	Document behaviour of s3 letter deleting The behaviour was a bit opaque so I have added tests around it so it's clear what it is doing and why. No functionality has changed.	2020-12-16 10:39:31 +00:00
Leo Hemsted	3bc3ed88b3	use yield_per instead of limit limit means we only return 50k letters, if there are more than that for a service we'll skip them and they won't be picked up until the next day. If you remove the limit, sqlalchemy prefetches query results so it can build up ORM results, for example collapsing joined rows into single objects with children. SQLAlchemy streams the data into a buffer, and normally will still prefetch the entire resultset so it can ensure integrity of the session, (so that if you modify one result that is duplicated further down in the results, both rows are updated in the session for example). However, we don't care about that, but we do care about preventing the result set taking up too much memory. We can use `yield_per` to yield from sqlalchemy to the iterator (in this case the `for letter in letters_awaiting_sending` loop in letters_pdf_tasks.py) - this means every time we hit 10000 rows, we go back to the database to get the next 10k. This way, we only ever need 10k rows in memory at a time. This has some caveats, mostly around how we handle the data the query returns. They're a bit hard to parse but I'm pretty sure the notable limitations are: * It's dangerous to modify ORM objects returned by yield_per queries * It's dangerous to join in a yield_per query if you think there will be more than one row per item (for example, if you join from notification to service, there'll be multiple result rows containing the same service, and if these are split over different yield chunks, then we may experience undefined behaviour). These two limitations are focused around there being no guarantee of having one unique row per item. For more reading: https://docs.sqlalchemy.org/en/13/orm/query.html?highlight=yield_per#sqlalchemy.orm.query.Query.yield_per https://www.mail-archive.com/sqlalchemy@googlegroups.com/msg12443.html	2020-10-26 13:01:34 +00:00
Leo Hemsted	ed182c2a22	return just the columns we need for collating letters previously we were returning the entire ORM object. Returning columns has a couple of benefits: * Means we can join on to services there and then, avoiding second queries to get the crown status of the service later in the collate flow. * Massively reduces the amount of data we return - particularly free text fields like personalisation that could be potentially quite big. 5 columns rather than 26 columns. * Minor thing, but will skip some CPU cycles as sqlalchemy will no longer construct an ORM object and try and keep track of changes. We know this function doesn't change any of the values to persist them back, so this is an unnecessary step from sqlalchemy. Disadvantages are: * The dao_get_letters_to_be_printed return interface is now much more tightly coupled to the get_key_and_size_of_letters_to_be_sent_to_print function that calls it.	2020-10-23 20:01:18 +01:00
Leo Hemsted	f8fbaba77d	limit collate task to 50,000 messages we've seen issues where tasks mysteriously hang and do not process large volumes of letters - in this case >150k letters in created state. to try and get at least some letters out of the door, limit the query to only return 50k letters per postage type. We may need to run the task multiple times, or letters may get delayed until the next day when they'd be picked up (provided there's enough capacity then). The task should only be re-run AFTER the ftp tasks have all finished, and updated the letters to sending, or we run the risk of sending the same letters twice. For context, the largest ever letter day we've sent is ~65k in march of this year.	2020-10-23 15:28:34 +01:00
Leo Hemsted	4b61060d32	stream notifications when collating zip files we had issues where we had 150k 2nd class notifications, and the collate task never ran properly, presumably because the volume of data being returned was too big. to try and help with this, we can switch to streaming rather than using `.all` and building up lists of data. This should help, though the initial query may be a problem still.	2020-10-23 12:20:26 +01:00
Pea Tyczynska	9ac65ee95c	Start sending letters from insolvency service again	2020-10-21 16:56:18 +01:00
Pea Tyczynska	30bd311eb1	Temporarily do not send letters from Insolvency Service to DVLA. This is a temporary measure over the weekend so that DVLA can catch up with all other letters. We should revert this on Monday 19.10.2020	2020-10-16 16:13:32 +01:00
Rebecca Law	b2ff4277c9	Adding service_id to the sort order for the letters being sent to print. We have had a few instances where letters have caused problems. Particularly for precompiled letters, often the issue comes from the same service. The hope is that by adding a sort order this will help the print provider narrow down the problem. There is a small degradation of the performance of the query, but it's not enough to concern me.	2020-10-15 09:39:07 +01:00
Leo Hemsted	1e928a926a	rename sending_date to created_at we don't name letters based on the day we send them on, rather, the day we create them on. If we process a letter for a second time for whatever reason, even if it's a couple of days later, it'll still go in a folder based on the created_at timestamp. There's still a slight confusion, however - if the timestamp is after 5:30pm, the folder will be for the day after. However, still the day after creation, so I think created_at still makes the most sense. Remove the term `sending_date` to try and make this relationship more apparent.	2020-09-21 14:40:22 +01:00
Leo Hemsted	bb33927b3d	rename letter get_folder_name args `_now`? why would we ever use a different _now? instead say created_at, because that's what it'll always be set to, even if we're replaying old letters. We always set the folder name to when the letter was created_at, or we might not know where to look to find it. `dont_use_sending_date` doesn't really tell us what might happen if we don't use it - the answer is we return an empty string. we ignore the folder entirely. so let's call it that. Also, remove use of freeze_gun in the tests, to prove that we don't use the current time in any calculations. Also add an assert to a mock in the get_pdf_for_templated_letter test, because we were mocking but not asserting before, so the tests didn't fail when the function signature changed.	2020-09-21 14:32:57 +01:00
Chris Hill-Scott	0e1f6f31e4	Use constant for notification type Co-authored-by: Katie Smith <klssmith@users.noreply.github.com>	2020-09-09 11:12:06 +01:00
Chris Hill-Scott	cfda289746	Allow international letters to be cancelled Our code was assuming that any notifications with `international` set to `True` were text messages. It was then trying to look up delivery information for a notification which wasn’t sent to a phone number, causing an exception.	2020-09-09 10:55:55 +01:00
Pea M. Tyczynska	fbdfa6416f	Merge pull request #2921 from alphagov/remove-statsd-http-api-decorators Remove statsd http api decorators and turn statsd back on for celery apps	2020-07-14 10:16:44 +01:00
Pea M. Tyczynska	9186083ea7	Merge pull request #2796 from alphagov/split-letters-into-zips-based-on-postage Split letters into zips based on postage	2020-07-08 11:49:21 +01:00
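
Commits 7d8eed8228 and c68d1a2f23 above avoid a slow count query by only checking whether a next page exists. One common way to do that is to fetch one row more than the page size. The sketch below shows the idea with the standard-library sqlite3 module. It is not the repository's code: the table layout, the `fetch_page` helper and its parameters are assumptions made for illustration, and the commits only say that the real change uses `count_pages=False`.

```python
import sqlite3


def fetch_page(conn, page, page_size):
    """Return (rows, has_next) for a 1-indexed page without running COUNT(*)."""
    offset = (page - 1) * page_size
    # Ask for one extra row: if it comes back, a next page exists.
    rows = conn.execute(
        "SELECT id FROM notifications ORDER BY id LIMIT ? OFFSET ?",
        (page_size + 1, offset),
    ).fetchall()
    has_next = len(rows) > page_size
    return rows[:page_size], has_next


# Hypothetical in-memory table with 120 rows, purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notifications (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO notifications (id) VALUES (?)",
    [(i,) for i in range(1, 121)],
)

first_rows, first_has_next = fetch_page(conn, page=1, page_size=50)
last_rows, last_has_next = fetch_page(conn, page=3, page_size=50)
```

The cost of each page is bounded by the page size rather than by the size of the table, which is the property the commit messages are after.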

449 Commits
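
Commits 76aeab24ce and 77c8c0a501 describe timing out notifications in limited batches against a cutoff time that is passed in, so that notifications which only become old enough while earlier batches are processed are not picked up. A minimal sketch of that loop, using plain dictionaries instead of the database, could look like this. The function names, the `timed-out` status and the data layout are all hypothetical and are not taken from the repository.

```python
from datetime import datetime, timedelta


def timeout_batch(notifications, cutoff_time, limit):
    """Mark up to `limit` 'sending' notifications created before cutoff_time."""
    batch = [
        n for n in notifications
        if n["status"] == "sending" and n["created_at"] < cutoff_time
    ][:limit]
    for n in batch:
        n["status"] = "timed-out"  # hypothetical status name
    return batch


def timeout_all(notifications, now, period, limit):
    """Process batches until one comes back short; return how many were marked."""
    # The cutoff is computed once, so every batch uses the same boundary.
    cutoff_time = now - period
    total = 0
    while True:
        batch = timeout_batch(notifications, cutoff_time, limit)
        total += len(batch)
        if len(batch) < limit:
            return total


now = datetime(2021, 12, 13, 17, 0, 0)
old = [{"status": "sending", "created_at": now - timedelta(days=4)} for _ in range(5)]
recent = [{"status": "sending", "created_at": now - timedelta(hours=1)} for _ in range(2)]
notifications = old + recent

timed_out = timeout_all(notifications, now, timedelta(days=3), limit=2)
```

Because each batch changes the status of the rows it selects, the next call naturally moves on to the remaining rows without needing an offset.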