notifications-api

mirror of https://github.com/GSA/notifications-api.git synced 2025-12-20 15:31:15 -05:00

Author	SHA1	Message	Date
Pea Tyczynska	a4c20e8ba6	Return 404 if reference from cancel message does not match If the reference from cancel CAP XML we received via API does not match with any existing broadcast, return 404. Do the same if service id doesn't match. Also refactor code to cancel broadcast out into separate function It should be a separate function that is only called by create_broadcast function. This will prevent create_broadcast from becoming too big and complex and doing too many things.	2022-01-19 15:42:27 +00:00
Pea Tyczynska	940126abfb	Reject unapproved broadcast upon cancel API request When a service sends us a cancel broadcast XML via API, if that broadcast was not approved yet, reject it.	2022-01-19 15:41:38 +00:00
Katie Smith	ed725c1513	Add endpoint to allow org team members to be removed This is similar to the corresponding endpoint for services. However, it is a little simpler since we don't need to worry about always having at least one team member for an organisation. The new dao function added, `dao_remove_user_from_organisation`, is also simpler than `dao_remove_user_from_service` since we don't have any organisation permissions to deal with.	2022-01-11 15:20:48 +00:00
Ben Thorner	081e0cab88	Merge pull request #3417 from alphagov/optimise-status-query-180693991 Optimise query to populate notification statuses	2022-01-11 14:18:36 +00:00
Ben Thorner	63b5204fb0	Optimise query to populate notification statuses Investigation with EXPLAIN and EXPLAIN ANALYZE for the notification history table shows this is another instance of [1] but for the key type column. Swapping "!=" for "IN" solves the problem. [1]: https://github.com/alphagov/notifications-api/pull/3360	2022-01-11 13:22:04 +00:00
Rebecca Law	2257cae398	Fix bug in organisation report for its services and usages. If a service has not sent any SMS for the financial year the free allowance was showing up as 0 rather than the number in annual billing. The query has been updated to use an outer join so that the free allowance will be returned when there is no ft_billing. There is a potential performance enhancement to only return the data for the services of the organisation in the `fetch_sms_free_allowance_remainder_until_date` subquery. I will investigate in a subsequent PR.	2022-01-11 10:04:36 +00:00
Pea Tyczynska	32cd7a0eb6	Merge pull request #3395 from alphagov/fix_org_usage_report Fix calculating remaining free allowance for SMS	2021-12-21 15:02:54 +00:00
Ben Thorner	f65fb519c7	Merge pull request #3404 from alphagov/remove-redundant-conditional-180477467 Remove redundant conditional for letter branding	2021-12-21 13:35:59 +00:00
Ben Thorner	de9ae08ecc	Downgrade log for letter deletion exceptions If the S3 object is missing [1], then that's what we want, so we don't need such a severe log for it, but we still want to know as it's not expected. This is separate to more general "ClientError" exceptions, which could mean anything. There weren't any tests to cover missing S3 objects, so I've added one. I don't think we need a test for ClientErrors: - If there was no handler, the task would fail and we'd learn about it that way. - The scope of the calling task is now much smaller, so it matters less than it used to [2]. [1]: `81a79e56ce/app/letters/utils.py (L52)` [2]: `f965322f25`	2021-12-20 12:45:48 +00:00
Ben Thorner	76da31c32a	Remove redundant conditional for letter branding This is no longer used when creating a service [1]. It was likely added at a migration point when Admin _did_ specify branding. [1]: `50c3c3e10c/app/main/views/add_service.py (L15-L22)`	2021-12-16 17:54:33 +00:00
Pea Tyczynska	6c04deaec2	Get rid of unnecessary coalesce	2021-12-14 17:36:03 +00:00
Leo Hemsted	49cc1b643f	split delete task up into per service we really don't gain anything by running each service delete in sequence - we get the services, and then just loop through them deleting per service. By deleting per service in separate tasks, we can take advantage of parallelism. the only thing we lose is some log lines but I don't think we're that interested in them. only set query limit at the move_notifications dao function - the task doesn't really care about the technical implementation of how it deletes the notifications	2021-12-14 15:24:34 +00:00
Ben Thorner	11278c47f5	Replace log with StatsD gauge for slow delivery A gauge is more useful as we can visualise it and combine it with other stats - we already have other stats for the total number of notifications sent by provider, and we can extrapolate the number of slow notifications using this, if needed. We also still have logs to say the task is running, as well as a log in the calling code when we actually make a switch [1], so we're not losing anything by removing the log here. [1]: `a9306c4557/app/celery/scheduled_tasks.py (L117)`	2021-12-14 13:03:43 +00:00
Ben Thorner	2adaaac3ae	Remove redundant conditions for update query Filtering by ID is enough, noting the other conditions were the same between both queries.	2021-12-13 17:03:07 +00:00
Ben Thorner	c8ebb365d4	Make limit of DAO timeout function more obvious We're going to iterate how we use the function with a limit, so we shouldn't say it's "temporary" anymore. We don't need to change the default, but having it in the function parameters makes it easier to see the function doesn't time out all notifications, just some.	2021-12-13 17:01:41 +00:00
Ben Thorner	76aeab24ce	Rewrite DAO timeout method to take cutoff_time Previously we specified the period and calculated the cutoff time in the function. Passing it in means we can run the method multiple times and avoid getting "new" notifications to time out in the time it takes to process each batch.	2021-12-13 16:56:21 +00:00
Ben Thorner	3bcaf8330e	Simplify comment for DAO timeout function	2021-12-13 16:39:55 +00:00
Ben Thorner	2fb432adaf	Merge pull request #3383 from alphagov/email-sms-created-alert-180344153 Add new log / alert for 'created' email / SMS	2021-12-13 12:56:05 +00:00
David McDonald	7d8eed8228	Optimise queries run for creating pagination links We have been running in to the problem in pallets/flask-sqlalchemy#518 where our page loads very slow when viewing a single page of notifications for a service in the admin app. Tracing this back and using SQL explain analyze I can see that getting the notifications takes about a second but the second query to count how many notifications there are (to work out if there is a next page of pagination) can take up to 100 seconds. As suggested in that issue, we do the pagination ourselves. Our pagination doesn't need us to know exactly how many notifications there are, just whether there are any on the next page and that can be done without running the slow query to count how many notifications in total by using `count_pages=False`. This commit is analogous to `c68d1a2f23`. The only difference is that in that case, the pagination links are used to show prev and/or next links in the admin app. In this case, the pagination links are only used to see if there is a page 2, and if there is, say that we are only showing the first 50 results.	2021-12-10 17:47:27 +00:00
Pea Tyczynska	a74d1b026f	Fix calculating remaining free allowance for SMS The way it was done before, the remainder was incorrect in the billing report and in the org usage query - it was the sms remainder left at the start of the report period, not at the end of that period. This became apparent when we tried to show sms_remainder on the org usage report, where start date is always the start of the financial year. We saw that sms sent by services did not reduce their free allowance remainder according to the report. As a result of this, we had to temporarily remove the sms_remainder column from the report, until we fix the bug - it has been fixed now, yay! I think the bug has snuck in partially because our fixtures for testing this part of the code are quite complex, so it was harder to see that numbers don't add up. I have added comments to the tests to try and make it a bit clearer why the results are as they are. I also added comments to the code, and renamed some variables, to make it easier to understand, as there are quite a few moving parts in it - subqueries and the like. I also renamed the fetch_sms_free_allowance_remainder method to fetch_sms_free_allowance_remainder_until_date so it is clearer what it does.	2021-12-09 18:58:10 +00:00
David McDonald	1973994516	Merge pull request #3391 from alphagov/pagination-approach-change Pagination approach change for `get_notifications_for_service`	2021-12-09 10:43:14 +00:00
Ben Thorner	ab4cb029df	Remove alert for email / sms in created In response to [1]. [1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988 It turns out the code that inspired this new alert - in the old "timeout-sending-notifications" task - was actually redundant as we already have a task to "replay" notifications still in "created", which is much better than just alerting about them. It's possible the replayed notifications will also fail, but in both cases we should see some kind of error due to this, so I don't think we're losing anything by not having an alert.	2021-12-06 14:11:42 +00:00
Ben Thorner	97b58ed4c3	Remove unnecessary _timeout partial function It's no longer necessary to have a separate function that's now only called once. While sometimes the separation can bring clarity, here I think it's clearer to have all the code in one place, and avoid the functools complexity we had before.	2021-12-06 14:00:37 +00:00
Ben Thorner	0318229216	Stop 'timing out' old 'created' notifications This is being replaced with a new alert and runbook [1]. It's not always appropriate to change the status to 'technical-failure', and the new alert means we'll act to fix the underlying issue promptly. We'll look at tidying up the remaining code in the next commits. [1]: https://github.com/alphagov/notifications-manuals/wiki/Support-Runbook#deal-with-email-or-sms-still-in-created	2021-12-06 14:00:36 +00:00
Ben Thorner	f96ba5361a	Add new task to alert about created email / sms This will log an error when email or SMS notifications have been stuck in 'created' for too long - normally they should be 'sending' in seconds, noting that we have a goal of < 10s wait time for most notifications being processed by our platform. In the next commits we'll decouple similar functionality from the existing 'timeout-sending-notifications' task.	2021-12-06 14:00:31 +00:00
David McDonald	c68d1a2f23	Optimise queries run for creating pagination links We have been running in to the problem in https://github.com/pallets/flask-sqlalchemy/issues/518 where our page loads very slow when viewing a single page of notifications for a service in the admin app. Tracing this back and using SQL explain analyze I can see that getting the notifications takes about a second but the second query to count how many notifications there are (to work out if there is a next page of pagination) can take up to 100 seconds. As suggested in that issue, we do the pagination ourselves. Our pagination doesn't need us to know exactly how many notifications there are, just whether there are any on the next page and that can be done without running the slow query to count how many notifications in total by using `count_pages=False`.	2021-12-03 17:32:39 +00:00
Leo Hemsted	6bbec9f103	make delete notification tasks parallel by notification type we used to do this until apr 2020. Let's try doing it again. Back then, we had problems with timing. We did two things in spring 2020: We moved to using an intermediary temp table [1] We stopped the tasks being parallelised [2] However, it turned out the real time saving was from changing what services we delete for [3]. The task was actually CPU-bound rather than DB-bound, so that's probably why having the tasks in parallel wasn't helping, since they were all competing for the same CPU. It's worth trying the parallel steps again now that we're no longer CPU bound. Note: Temporary tables are in their own postgres schema, and are only viewable by the current session (session == connection. Each celery worker process has its own db connection). We don't need to worry about separate workers both trying to use the same table at once. I've also added a "DROP ON COMMIT" directive to the table definition just to ensure it doesn't persist past the task even if there's an exception. (This also drops on rollback). Cronitor looks at the three functions separately so we don't need to worry about the main task taking milliseconds where it used to take hours as it isn't monitored itself. I've also removed some unnecessary redundant exception logs. [1] https://github.com/alphagov/notifications-api/pull/2767 [2] https://github.com/alphagov/notifications-api/pull/2798 [3] https://github.com/alphagov/notifications-api/pull/3381	2021-12-01 14:28:08 +00:00
Rebecca Law	101498ec84	Improve query performance Add a filter to the `app.dao.notifications_dao.is_delivery_slow_for_providers` query to improve performance. Adding Notifications.notification_type = 'sms' to the query improves performance: some analysis shows a 500ms improvement, which is a good thing, especially when the query is run once a minute.	2021-11-30 16:42:32 +00:00
Leo Hemsted	bab659c677	reduce number of services we try and delete notifications for TLDR: Don't return as many services, and only return their IDs and not the whole service objects. Context: the delete notifications nightly task has been taking longer and longer, and to delete all three notification types in sequence it now takes up to 8 hours. This is because we were retrieving all services, loading them into memory on the worker, and then trying to delete notifications for each service in turn. While it does use a fair chunk of IOPS/CPU on our postgres db, we're not anywhere close to capacity on those (20% CPU, 4k IOPS out of 30k max)[1] The real issue appears to be that the task is CPU bound on the periodic worker - we see the worker spike up to 100% CPU regularly across the whole 3am-11am period. We also noticed that for each notification type the task first processes services with custom data retention (not many but some of the biggest users), then deals with all other services. We can see from looking at kibana that, for example, the task starts at 3am, and the custom data retention service email deletions are finished by 3:12am. The rest of the emails don't get deleted until 5am, so we knew that the problem is with how it handles the other services. There are currently 17000 services in the database. On a typical day, ~800 services will have notifications that are over 7 days old and need to be deleted. By only returning these services, we reduce the amount of data transfer and serialisation that needs to happen. It takes about two minutes to retrieve the distinct service ids from the notifications table for sms notifications, but that is only 5% the size of the full list so cuts down on a lot of processing Also, by only returning service_ids rather than the whole `Service` model we avoid sqlalchemy needing to do lots of data serialisation, when we were only using the `Service.id` field from that result anyway. 
[1] https://admin.cloud.service.gov.uk/organisations/55b1eb7d-e4c5-4359-9466-dd3ca5b0e457/spaces/80d769ff-7b01-49a4-9fa4-f87edd5328f9/services/6093d337-6918-4b97-9709-97529114eb90/metrics [2] https://grafana-paas.cloudapps.digital/d/_GlGBNbmk/notify-apps?orgId=2&refresh=5s&var-space=production&var-app=notify-delivery-worker-periodic&from=now-24h&to=now [3] https://kibana.logit.io/s/9423a789-282c-4113-908d-0be3b1bc9d1d/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(message),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'%22Deleting%20email%20notifications%20for%20services%20without%20flexible%20data%20retention%22')),sort:!('@timestamp',desc))	2021-11-24 16:18:40 +00:00
David McDonald	18776e4160	Merge pull request #3377 from alphagov/zero-case-performance-page Fix division by zero error on performance page	2021-11-22 13:44:32 +00:00
David McDonald	106187ba04	Fix division by zero error on performance page For preview and staging environments, we often send no messages in a single day. This is currently causing a `DivisionByZero` error that is rendering the page with no results. This makes it impossible to look at preview/staging and see if the performance page is working correctly or not. (psycopg2.errors.DivisionByZero) division by zero [SQL: SELECT CAST(ft_processing_time.bst_date AS TEXT) AS date, ft_processing_time.messages_total AS ft_processing_time_messages_total, ft_processing_time.messages_within_10_secs AS ft_processing_time_messages_within_10_secs, (ft_processing_time.messages_within_10_secs / CAST(ft_processing_time.messages_total AS FLOAT)) * %(param_1)s AS percentage FROM ft_processing_time WHERE ft_processing_time.bst_date >= %(bst_date_1)s AND ft_processing_time.bst_date <= %(bst_date_2)s ORDER BY ft_processing_time.bst_date] [parameters: {'param_1': 100, 'bst_date_1': datetime.date(2021, 11, 12), 'bst_date_2': datetime.date(2021, 11, 19)}] (Background on this error at: http://sqlalche.me/e/14/9h9h) I've fixed this by falling back to 100.0% for days we send no messages. Maybe some argument that it should be N/A rather than 100% but I think it doesn't really matter as this is only going to affect preview and staging as we will never have a day sending no messages in production.	2021-11-22 11:11:52 +00:00
Rebecca Law	30a5852685	Update the query to only return the count from the table since that is all we care about. https://www.pivotaltracker.com/story/show/180262357	2021-11-17 14:46:52 +00:00
David McDonald	c98996a461	Improve log message searchability for duplicate receipts There were two problems with the existing message. 1. There was no space between the new status and the time taken which made reading and searching harder 2. The key bits of information (before and after status) were separated by the time taken (which will always be unique) meaning you couldn't do an easy search for a message that is say in delivered being attempted to be set to temporary-failure.	2021-11-12 14:06:38 +00:00
Ben Thorner	77c8c0a501	Optimise query to get notifications to "time out" From experimenting in production we found a "!=" caused the engine to use a sequential scan, whereas explicitly listing all the types ensured an index scan was used. We also found that querying for many (over 100K) items leads to the task stalling - no logs, but no evidence of it running either - so we also add a limit to the query. Since the query now only returns a subset of notifications, we need to ensure the subsequent "update" query operates on the same batch. Also, as a temporary measure, we have a loop in the task code to ensure it operates on the total set of notifications to "time out", which we assume is less than 500K for the time being.	2021-11-09 13:50:32 +00:00
Chris Hill-Scott	19ad11e383	Don’t repeat digits in security codes People with dyslexia and dyscalculia find it difficult to transpose codes which have consecutive, repeated digits[1]. This commit enhances the algorithm for generating codes to not repeat the previous digit in a code. This reduces the key space for our codes from 100,000 possibilities to 65,610 possibilities. 1. https://twitter.com/annaecook/status/1442567679710150662	2021-09-30 10:24:17 +01:00
Chris Hill-Scott	2c7e4657ce	Don’t update `email_access_validated_at` on password reset As of https://github.com/alphagov/notifications-admin/pull/4000/files the admin app is doing this, so we don’t need to do it here as well.	2021-09-01 09:54:54 +01:00
Pea Tyczynska	9d2f8347b2	get_broadcasts returns a list of broadcasts for gov.uk/alerts	2021-08-12 14:03:33 +01:00
Pea Tyczynska	0f7f219a55	dao_get_all_broadcast_messages returns just fields govuk alerts need	2021-08-11 14:43:27 +01:00
Pea Tyczynska	74c9ca2bf6	Fetch all broadcast messages that are or were transmitted Regardless of channel. Do not include: - broadcasts older than 25.05.2021 - stubbed broadcasts - broadcasts that were not transmitted. So only broadcasting, cancelled and completed make the list;	2021-08-11 14:43:27 +01:00
Chris Hill-Scott	132411be24	Don’t re-expire old keys If a key has already been expired we don’t want to lose the information about when that happened by giving it a new expiry date.	2021-07-30 11:56:51 +01:00
Chris Hill-Scott	43bcb56ff4	Revoke API keys when changing broadcast settings On a regular Notify service anyone with permission can create an API key. If this service then is given permission to send emergency alerts it will have an API key which can create emergency alerts. This feels dangerous. Secondly, if a service which legitimately has an API key for sending alerts in training mode is changed to live mode you now have an API key which people will think isn’t going to create a real alert but actually will. This feels really dangerous. Neither of these scenarios are things we should be doing, but having them possible still makes me feel uncomfortable. This commit revokes all API keys for a service when its broadcast settings change, same way we remove all permissions for its users.	2021-07-29 10:11:38 +01:00
Katie Smith	0c7982fd84	Always keep `view_activity` permissions for broadcast users We made a change to remove all permissions from users and invited users when the broadcast service settings form is submitted (https://github.com/alphagov/notifications-api/pull/3284). However, when the form is submitted, notifications-admin always adds the `view_activity` permission even if no permission boxes are ticked, so we don't want to remove that one permission (`256c840b46/app/main/forms.py (L1042)`)	2021-07-14 16:39:38 +01:00
Katie Smith	fc0b9736eb	Remove user permissions if service becomes a broadcast service The "normal" service permissions and broadcast service permissions are going to be different with no overlap. This means that if you were viewing the team members page, there might be permissions in the database that are not visible on the frontend if a service has changed type. For example, someone could have the 'manage_api_keys' permission, which would not show up on the team members page of a broadcast service. To avoid people having permissions which aren't visible in admin, we now remove all permissions from users when their service is converted to a broadcast service. Permissions for invited users are also removed. It's not possible to convert a broadcast service to a normal service, so we don't need to cover for this scenario.	2021-07-07 16:13:35 +01:00
Rebecca Law	35b20ba363	Correct the daily limits cache. Last year we had an issue with the daily limit cache and the query that was populating it. As a result we have not been checking the daily limit properly. This PR should correct all that. The daily limit cache is now being incremented in app.notifications.process_notifications.persist_notification; this method is and should always be the only method used to create a notification. We increment the daily limit cache if redis is enabled (and it is always enabled for production) and the key type for the notification is team or normal. We check if the daily limit is exceeded in many places: - app.celery.tasks.process_job - app.v2.notifications.post_notifications.post_notification - app.v2.notifications.post_notifications.post_precompiled_letter_notification - app.service.send_notification.send_one_off_notification - app.service.send_notification.send_pdf_letter_notification If the daily limits cache is not found, set the cache to 0 with an expiry of 24 hours. The daily limit cache key is service_id-yyy-mm-dd-count, so each day a new cache is created. The best thing about this PR is that the app.service_dao.fetch_todays_total_message_count query has been removed. This query was not performant and had been wrong for ages.	2021-06-22 16:15:36 +01:00
Rebecca Law	d4a42471cb	Merge pull request #3267 from alphagov/fix-daily-totals-query Improve the query to get today's totals for a service.	2021-06-16 07:34:01 +01:00
Rebecca Law	08bb5c657f	Fix the query to get today's totals for a service. The query had a group by on notification_type and notification_status; this not only slows the query down but is wrong. The query only looked at the first result, but this query would return as many rows as different notification types and status, meaning the results do not include the correct number. Are we concerned that all status types are included? For example letters can be cancelled or have validation-failures which shouldn't be included in the daily limit check.	2021-06-14 15:29:21 +01:00
Katie Smith	0148b3dba6	Add new total_letters field to the billing report data This adds total_letters to the data that is returned by the `/platform-stats/data-for-billing-report` endpoint so that we can add total letters as a column in the CSV file that can be downloaded.	2021-06-11 11:31:22 +01:00
Rebecca Law	684a882cf3	Revert "Do not include today's totals"	2021-06-02 16:06:33 +01:00
Rebecca Law	c668bed9d3	Merge pull request #3256 from alphagov/no-totals-for-high-volume-services Do not include today's totals	2021-06-02 15:08:45 +01:00
Rebecca Law	a341536de0	- Add comment to test and new if statement - Update assert in test	2021-06-02 14:13:31 +01:00

1356 Commits