notifications-api

mirror of https://github.com/GSA/notifications-api.git synced 2026-01-28 21:41:45 -05:00

Author	SHA1	Message	Date
Ben Thorner	de9ae08ecc	Downgrade log for letter deletion exceptions If the S3 object is missing [1], then that's what we want, so we don't need such a severe log for it, but we still want to know as it's not expected. This is separate to more general "ClientError" exceptions, which could mean anything. There weren't any tests to cover missing S3 objects, so I've added one. I don't think we need a test for ClientErrors: - If there was no handler, the task would fail and we'd learn about it that way. - The scope of the calling task is now much smaller, so it matters less than it used to [2]. [1]: `81a79e56ce/app/letters/utils.py (L52)` [2]: `f965322f25`	2021-12-20 12:45:48 +00:00
Leo Hemsted	0dc0e184b9	clean up and rewrite notification_dao_delete_notifications a bunch of these tests are now covered in the task test, so got rid of some. Now that the "how long ago to delete" questions is asked in the task rather than in the dao, and only one service is looked at at a time, we don't need to worry about data retention, etc. Hopefully made the tests simpler - there may still be some duplicates or overlaps between the various cases.	2021-12-14 15:24:35 +00:00
Ben Thorner	c1f0c24d82	Trim down tests for DAO timeout function a bit The first test is enough to cover that "created" and "delivered" notifications aren't affected by this function.	2021-12-13 17:17:41 +00:00
Ben Thorner	76aeab24ce	Rewrite DAO timeout method to take cutoff_time Previously we specified the period and calculated the cutoff time in the function. Passing it in means we can run the method multiple times and avoid getting "new" notifications to time out in the time it takes to process each batch.	2021-12-13 16:56:21 +00:00
Ben Thorner	b81a66da50	Fix assertions in tests for timeout DAO function Previously most of the assertions were being run before we had actually called the function. There was also a redundant block of assertions that just asserted the initial state of the test data.	2021-12-13 16:48:30 +00:00
Ben Thorner	2fb432adaf	Merge pull request #3383 from alphagov/email-sms-created-alert-180344153 Add new log / alert for 'created' email / SMS	2021-12-13 12:56:05 +00:00
David McDonald	7d8eed8228	Optimise queries run for creating pagination links We have been running in to the problem in pallets/flask-sqlalchemy#518 where our page loads very slow when viewing a single page of notifications for a service in the admin app. Tracing this back and using SQL explain analyze I can see that getting the notifications takes about a second but the second query to count how many notifications there are (to work out if there is a next page of pagination) can take up to 100 seconds. As suggested in that issue, we do the pagination ourselves. Our pagination doesn't need us to know exactly how many notifications there are, just whether there are any on the next page and that can be done without running the slow query to count how many notifications in total by using `count_pages=False`. This commit is analagous to `c68d1a2f23` The only difference is that in that case, the pagination links are used to show prev and/or next links in the admin app. In this case, the pagination links are only used to see if there is a page 2, and if there is, say that we are only showing the first 50 results.	2021-12-10 17:47:27 +00:00
Pea Tyczynska	a74d1b026f	Fix calculating remaining free allowance for SMS The way it was done before, the remainder was incorrect in the billing report and in the org usage query - it was the sms remainder left at the start of the report period, not at the end of that period. This became apparent when we tried to show sms_remainder on the org usage report, where start date is always the start of the financial year. We saw that sms sent by services did not reduce their free allowance remainder according to the report. As a result of this, we had to temporarily remove of sms_remainder column from the report, until we fix the bug - it has been fixed now, yay! I think the bug has snuck in partially because our fixtures for testing this part of the code are quite complex, so it was harder to see that numbers don't add up. I have added comments to the tests to try and make it a bit clearer why the results are as they are. I also added comments to the code, and renamed some variables, to make it easier to understand, as there are quite a few moving parts in it - subqueries and the like. I also renamed the fetch_sms_free_allowance_remainder method to fetch_sms_free_allowance_remainder_until_date so it is clearer what it does.	2021-12-09 18:58:10 +00:00
Ben Thorner	ab4cb029df	Remove alert for email / sms in created In response to [1]. [1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988 It turns out the code that inspired this new alert - in the old "timeout-sending-notifications" task - was actually redundant as we already have a task to "replay" notifications still in "created", which is much better than just alerting about them. It's possible the replayed notifications will also fail, but in both cases we should see some kind of error due to this, so I don't think we're losing anything by not having an alert.	2021-12-06 14:11:42 +00:00
Ben Thorner	0318229216	Stop 'timing out' old 'created' notifications This is being replaced with a new alert and runbook [1]. It's not always appropriate to change the status to 'technical-failure', and the new alert means we'll act to fix the underlying issue promptly. We'll look at tidying up the remaining code in the next commits. [1]: https://github.com/alphagov/notifications-manuals/wiki/Support-Runbook#deal-with-email-or-sms-still-in-created	2021-12-06 14:00:36 +00:00
Ben Thorner	f96ba5361a	Add new task to alert about created email / sms This will log an error when email or SMS notifications have been stuck in 'created' for too long - normally they should be 'sending' in seconds, noting that we have a goal of < 10s wait time for most notifications being processed our platform. In the next commits we'll decouple similar functionality from the existing 'timeout-sending-notifications' task.	2021-12-06 14:00:31 +00:00
Leo Hemsted	bab659c677	reduce number of services we try and delete notifications for TLDR: Don't return as many services, and only return their IDs and not the whole service objects. Context: the delete notifications nightly task has been taking longer and longer, and to delete all three notification types in sequence it now takes up to 8 hours. This is because we were retrieving all services, loading them into memory on the worker, and then trying to delete notifications for each service in turn. While it does use a fair chunk of IOPS/CPU on our postgres db, we're not anywhere close to capacity on those (20% CPU, 4k IOPS out of 30k max)[1] The real issue appears to be that the task is CPU bound on the periodic worker - we see the worker spike up to 100% CPU regularly across the whole 3am-11am period. We also noticed that for each notification type the task first processes services with custom data retention (not many but some of the biggest users), then deals with all other services. We can see from looking at kibana that, for example, the task starts at 3am, and the custom data retention service email deletions are finished by 3:12am. The rest of the emails don't get deleted until 5am, so we knew that the problem is with how it handles the other services. There are currently 17000 services in the database. On a typical day, ~800 services will have notifications that are over 7 days old and need to be deleted. By only returning these services, we reduce the amount of data transfer and serialisation that needs to happen. It takes about two minutes to retrieve the distinct service ids from the notifications table for sms notifications, but that is only 5% the size of the full list so cuts down on a lot of processing Also, by only returning service_ids rather than the whole `Service` model we avoid sqlalchemy needing to do lots of data serialisation, when we were only using the `Service.id` field from that result anyway. [1] https://admin.cloud.service.gov.uk/organisations/55b1eb7d-e4c5-4359-9466-dd3ca5b0e457/spaces/80d769ff-7b01-49a4-9fa4-f87edd5328f9/services/6093d337-6918-4b97-9709-97529114eb90/metrics [2] https://grafana-paas.cloudapps.digital/d/_GlGBNbmk/notify-apps?orgId=2&refresh=5s&var-space=production&var-app=notify-delivery-worker-periodic&from=now-24h&to=now [3] https://kibana.logit.io/s/9423a789-282c-4113-908d-0be3b1bc9d1d/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(message),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'%22Deleting%20email%20notifications%20for%20services%20without%20flexible%20data%20retention%22')),sort:!('@timestamp',desc))	2021-11-24 16:18:40 +00:00
David McDonald	106187ba04	Fix division by zero error on performance page For preview and staging environments, we often send no messages in a single day. This is currently causing a `DivisionByZero` error that is rendering the page with no results. This makes it impossible to look at preview/staging and see if the performance page is working correctly or not. (psycopg2.errors.DivisionByZero) division by zero [SQL: SELECT CAST(ft_processing_time.bst_date AS TEXT) AS date, ft_processing_time.messages_total AS ft_processing_time_messages_total, ft_processing_time.messages_within_10_secs AS ft_processing_time_messages_within_10_secs, (ft_processing_time.messages_within_10_secs / CAST(ft_processing_time.messages_total AS FLOAT)) * %(param_1)s AS percentage FROM ft_processing_time WHERE ft_processing_time.bst_date >= %(bst_date_1)s AND ft_processing_time.bst_date <= %(bst_date_2)s ORDER BY ft_processing_time.bst_date] [parameters: {'param_1': 100, 'bst_date_1': datetime.date(2021, 11, 12), 'bst_date_2': datetime.date(2021, 11, 19)}] (Background on this error at: http://sqlalche.me/e/14/9h9h) I've fixed this by falling back to 100.0% for days we send no messages. Maybe some argument that it should be N/A rather than 100% but I think it doesn't really matter as this is only going to affect preview and staging as we will never have a day sending no messages in production.	2021-11-22 11:11:52 +00:00
Chris Hill-Scott	19ad11e383	Don’t repeat digits in security codes People with dyslexia and dyscalculia find it difficult to transpose codes which have consecutive, repeated digits[1]. This commits enhances the algorithm for generating codes to not repeat the previous digit in a code. This reduces the key space for our codes from 100,000 possibilities to 65,610 possibilities. 1. https://twitter.com/annaecook/status/1442567679710150662	2021-09-30 10:24:17 +01:00
Ben Thorner	be7272b44f	Merge pull request #3314 from alphagov/next-stage-areas-migration-178986763 Switch "areas" field to "areas_2" format	2021-09-01 16:16:30 +01:00
Chris Hill-Scott	2c7e4657ce	Don’t update `email_access_validated_at` on password reset As of https://github.com/alphagov/notifications-admin/pull/4000/files the admin app is doing this, so we don’t need to do it here as well.	2021-09-01 09:54:54 +01:00
Ben Thorner	ec1171f85c	Switch "areas" field to "areas_2" format The Admin app is now temporarily using the "areas_2" field while we migrate "areas" to the new format [1]. [1]: https://github.com/alphagov/notifications-admin/pull/4004	2021-08-27 14:22:11 +01:00
Ben Thorner	8f39d476bd	Start dual running with "areas" and (area) "ids" This is necessary until: - The Admin app is using the new "areas(_2)" format to store and retrieve data. - We've migrated all existing broadcast messages to use the new format. Note that "areas" / "ids" isn't actually used for anything except printing out the PagerDuty message - it's not sent to the proxy [1]. [1]: `6edc6c70aa/app/celery/broadcast_message_tasks.py (L190-L193)`	2021-08-26 15:34:24 +01:00
Pea Tyczynska	9d2f8347b2	get_broadcasts returns a list of broadcasts for gov.uk/alerts	2021-08-12 14:03:33 +01:00
Pea Tyczynska	0f7f219a55	dao_get_all_broadcast_messages returns just fields govuk alerts need	2021-08-11 14:43:27 +01:00
Pea Tyczynska	74c9ca2bf6	Fetch all broadcast messages that are or were transmitted Regardless of channel. Do not include: - broadcasts older than 25.05.2021 - stubbed broadcasts - broadcasts that were not transmitted. So only broadcasting, cancelled and completed make the list;	2021-08-11 14:43:27 +01:00
Pea Tyczynska	c28e9451d4	Bump moto version to try solve dependencies version conflict Also update mock import statements in some test files as they stopped working with this dependency update.	2021-07-08 15:37:19 +01:00
Rebecca Law	35b20ba363	Correct the daily limits cache. Last year we had an issue with the daily limit cache and the query that was populating it. As a result we have not been checking the daily limit properly. This PR should correct all that. The daily limit cache is not being incremented in app.notifications.process_notifications.persist_notification, this method is and should always be the only method used to create a notification. We increment the daily limit cache is redis is enabled (and it is always enabled for production) and the key type for the notification is team or normal. We check if the daily limit is exceed in many places: - app.celery.tasks.process_job - app.v2.notifications.post_notifications.post_notification - app.v2.notifications.post_notifications.post_precompiled_letter_notification - app.service.send_notification.send_one_off_notification - app.service.send_notification.send_pdf_letter_notification If the daily limits cache is not found, set the cache to 0 with an expiry of 24 hours. The daily limit cache key is service_id-yyy-mm-dd-count, so each day a new cache is created. The best thing about this PR is that the app.service_dao.fetch_todays_total_message_count query has been removed. This query was not performant and had been wrong for ages.	2021-06-22 16:15:36 +01:00
Rebecca Law	d4a42471cb	Merge pull request #3267 from alphagov/fix-daily-totals-query Improve the query to get today's totals for a service.	2021-06-16 07:34:01 +01:00
Rebecca Law	08bb5c657f	Fix the query to get todays totals for a service. The query had a group by on notification_type and notification_status, this not only slows the query down but is wrong. The query only looked at the first result, but this query would return as many rows as different notification types and status, meaning the results do not include the correct number. Are we concerned that all status types are included. For example letters can be cancelled or have validation-failures which shouldn't be included in the daily limit check.	2021-06-14 15:29:21 +01:00
Katie Smith	0148b3dba6	Add new total_letters field to the billing report data This adds total_letters to the data that is returned by the `/platform-stats/data-for-billing-report` endpoint so that we can add total letters as a column in the CSV file that can be downloaded.	2021-06-11 11:31:22 +01:00
Rebecca Law	684a882cf3	Revert "Do not include today's totals"	2021-06-02 16:06:33 +01:00
Rebecca Law	c668bed9d3	Merge pull request #3256 from alphagov/no-totals-for-high-volume-services Do not include today's totals	2021-06-02 15:08:45 +01:00
Rebecca Law	a341536de0	- Add comment to test and new if statement - Update assert in test	2021-06-02 14:13:31 +01:00
Rebecca Law	b170b5ed80	This change is a temporary fix to allow users for high volume services to use the admin app. The trouble is the aggregate query to return the big blue numbers on the dashboard and /notifications/{notification_type} page is taking too long to return. I have some ideas on how to improve the query, but should take some time to do some more research and test. In the meantime, let's just ignore "todays" total numbers for the high volume services. There are only two services that this will affect.	2021-06-02 10:31:38 +01:00
Rebecca Law	d4b5373ee5	Update tests for BST to be on the day BST starts. Add a test for just after midnight of the date the stats are collected.	2021-06-01 16:07:37 +01:00
David McDonald	04e23ca6a9	Revert "Bump utils version for new invalid address character"	2021-06-01 10:53:28 +01:00
Rebecca Law	dfacba5053	Add a test for a bst date	2021-05-26 11:50:05 +01:00
Rebecca Law	f66f0a2e2d	Bump moto dependency to resolve dependency conflicts	2021-05-25 14:20:25 +01:00
Rebecca Law	68d28aa83b	The update of SQLAlchemy 1.4.10 has caused some conflicts in our code. This PR fixes most of those conflicts. - sqlalchemy.sql.expression.case must include an else statement. - clearly define list of columns for inbound_sms_history insert, getting the list from InboundSmsHistory.__table__.c was causing data type errors. - remove relationships when not needed, the foreign key relationship is established in the creation of the column. This will get rid of the warnings referenced here: http://sqlalche.me/e/14/qzyx. - update queries now that he user relationship in ServiceUser db model has been removed. - move the check that a template is archived to the view instead of the dao method. The check was clearing the session before the version history could be done. Deleting notifications in the night tasks still needs to be investigated. The raw sql is causing an error.	2021-04-29 13:32:36 +01:00
Pea Tyczynska	df19a91b7f	Update error messages after SQLAlchemy version update	2021-04-29 13:32:36 +01:00
Rebecca Law	85895a9e8b	Revert "Scheduled weekly dependency update for week 16"	2021-04-28 10:17:16 +01:00
Rebecca Law	1b070d69a1	The update of SQLAlchemy 1.4.10 has caused some conflicts in our code. This PR fixes most of those conflicts. - sqlalchemy.sql.expression.case must include an else statement. - clearly define list of columns for inbound_sms_history insert, getting the list from InboundSmsHistory.__table__.c was causing data type errors. - remove relationships when not needed, the foreign key relationship is established in the creation of the column. This will get rid of the warnings referenced here: http://sqlalche.me/e/14/qzyx. - update queries now that he user relationship in ServiceUser db model has been removed. - move the check that a template is archived to the view instead of the dao method. The check was clearing the session before the version history could be done. Deleting notifications in the night tasks still needs to be investigated. The raw sql is causing an error.	2021-04-26 11:50:30 +01:00
Pea Tyczynska	e1a626855d	Update error messages after SQLAlchemy version update	2021-04-23 16:48:36 +01:00
Rebecca Law	93908bacda	New strategy for transaction management. Introduce a contextmanger function to handle exceptions and nested transactions. Using the nested_transaction will start a nested transaction with `db.session.begin_nested`, once the nested transaction is complete the commit will happen. `@transactional` has been updated to commit unless in a nested transaction.	2021-04-14 07:04:17 +01:00
Rebecca Law	cf35135605	Adding @nested_transactional for transactions that require more than one db update/insert. Using a savepoint for the multiple transactions allows us to rollback if there is an error when executing the second db transaction. However, this does add a bit of complexity. Developers need to manage the db session when calling multiple nested tranactions. Unit tests have been added to test this functionality and some end to end tests have been done to make sure all transactions are rollback if there is an exception while executing the transaction.	2021-04-14 07:03:57 +01:00
Rebecca Law	69e5ddae4f	When a service is associated with a organisation set the free allowance to the default free allowance for the organisation type. The update/insert for the default free allowance is done in a separate transaction. Updates to services need to happen in a transaction to trigger the insert into the ServicesHistory table. For that reason the call to set_default_free_allowance_for_service is done after the service is updated. I've added a try/except around the set_default_free_allowance_for_service call to ensure we still get the update to the service but get an exception log if the update to annual_billing fails. I believe it's important to preserve the update to the service in the unlikely event that the annual_billing upsert fails.	2021-04-14 07:03:57 +01:00
Rebecca Law	da8a7a8db1	Avoid key errors by setting the year_start with 2020 or 2021 Remove db.create_service_with_organisation method Update comment in command	2021-03-30 09:08:04 +01:00
Rebecca Law	7da5abc17b	The free sms allowances are changing for the financial year starting April 1 2021. In this PR there is a command to set annual_billing for all active services with the the new defaults. The new method `set_default_free_allowance_for_service` will also be called in a PR to follow that will set a services free allowance to the default if the organisation for the service is changed.	2021-03-29 13:32:00 +01:00
Ben Thorner	b2b14f39a3	Merge pull request #3183 from alphagov/remove-crown-letter-filename Remove non/crown indicator in letter filenames	2021-03-24 13:06:58 +00:00
Pea Tyczynska	0dbe4b27c8	Rearrange fixture for readability	2021-03-19 16:50:01 +00:00
Pea Tyczynska	100d47f4e8	Refactor test and fixture for getting billing report data Names of services and orgs were confusing, and variable setting was done in a way that made it easy to introduce errors. Now hopefully it is more readable and more error-proof.	2021-03-19 16:50:00 +00:00
Ben Thorner	8219b3c032	Remove non/crown indicator in letter filenames This is not required by DVLA and since [1] we no longer care about the end of letter filenames when collating them, so removing it is safe to do. Note that the name of the ZIP files of collated letters is based on a hash of the filenames, which needed updating in tests. Before merging this we need to do a test run in Staging, so DVLA can check that a mixture of the old / new filenames won't cause issues. [1]: https://github.com/alphagov/notifications-api/pull/3172	2021-03-18 13:05:12 +00:00
Ben Thorner	c76e789f1e	Reduce extra S3 ops when working with letter PDFs Previously we did some unnecessary work: - Collate task. This had one S3 request to get a summary of the object, which was then used in another request to get the full object. We only need the size of the object, which is included in the summary [1]. - Archive task. This had one S3 request to get a summary of the object, which was then used to make another request to delete it. We still need both requests, but we can remove the S3.Object in the middle. [1]: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#objectsummary	2021-03-16 12:53:13 +00:00
Ben Thorner	ff7eebc90a	Simplify deleting old letters Previously we made a call to S3 to list objects for a letter, even though we already had the precise key of the single object to hand. This removes the one usage of "get_s3_bucket_objects" and uses the filename directly in the call to remove the object.	2021-03-15 17:18:20 +00:00

... 2 3 4 5 6 ...

1325 Commits