notifications-api

mirror of https://github.com/GSA/notifications-api.git synced 2025-12-24 01:11:38 -05:00

Author	SHA1	Message	Date
David McDonald	1973994516	Merge pull request #3391 from alphagov/pagination-approach-change Pagination approach change for `get_notifications_for_service`	2021-12-09 10:43:14 +00:00
Chris Hill-Scott	0481b14803	Merge pull request #3392 from alphagov/bump-util-51 Bump notifications-utils to 51.0.0	2021-12-06 16:51:38 +00:00
Chris Hill-Scott	f011254667	Bump notifications-utils to 51.0.0 Just so other people don’t have to merge these changes. The breaking changes don’t affect this repo because the API doesn’t: - check the service guestlist before sending a message - do any visual preview of emergency alert messages > 51.0.0 > - Initial argument to RecipientCSV renamed from whitelist to guestlist, in other words consuming code should call RecipientCSV(guestlist=['test@example.com']) > - RecipientCSV.whitelist property renamed to RecipientCSV.guestlist > > 50.0.0 > - Make icon in broadcast_preview_template.jinja2 an inline SVG (requires changes to the CSS of consumer code) > > 49.1.0 > Add ttl_in_seconds argument to RequestCache.set to let users specify a custom TTL This commit also changes the format of the line in the requirements file, copying https://github.com/alphagov/notifications-admin/pull/4074/files	2021-12-06 09:34:15 +00:00
David McDonald	e8dd136678	Document area that may be doing pagination links when not needed	2021-12-03 17:32:40 +00:00
David McDonald	c68d1a2f23	Optimise queries run for creating pagination links We have been running into the problem in https://github.com/pallets/flask-sqlalchemy/issues/518 where our page loads very slowly when viewing a single page of notifications for a service in the admin app. Tracing this back and using SQL explain analyze I can see that getting the notifications takes about a second but the second query to count how many notifications there are (to work out if there is a next page of pagination) can take up to 100 seconds. As suggested in that issue, we do the pagination ourselves. Our pagination doesn't need us to know exactly how many notifications there are, just whether there are any on the next page and that can be done without running the slow query to count how many notifications in total by using `count_pages=False`.	2021-12-03 17:32:39 +00:00
David McDonald	989ef9c21a	Remove `last` and `total` keys from pagination links These don't appear to be used anywhere in the admin app and this route is only used by the admin app. Therefore it is safe to remove them. We remove them because calculating the total number of notifications or the final page number of results can be particularly slow for services with many, many notifications, for example 100 seconds for a service with 500k notifications sent in the past 7 days. Given neither are being used, this will give us the potential in the next commit to reduce the number of slow queries and improve page load times. Note, I've kept the scope small by only introducing the new pagination function for this one endpoint but there could be scope in future to get all pagination using the next function if appropriate.	2021-12-03 17:26:49 +00:00
David McDonald	a62e63fcef	Add tests for existing pagination behaviour No functionality change, just documenting what already exists	2021-12-03 17:21:14 +00:00
Leo Hemsted	a8cad79def	Merge pull request #3390 from alphagov/reporting-q put delete tasks on the reporting worker	2021-12-03 16:08:42 +00:00
Leo Hemsted	f6d210f1e6	put delete tasks on the reporting worker they share a lot with the reporting tasks (creating ft_billing and ft_notification_status), in that they're run nightly, take a long time, and we see error messages if they get run multiple times (due to visibility timeout). The periodic app has two concurrent processes - previously there was just one delete task, which would use one of those processes, while the other process would pick up anything else on the queue (at that time of night, the regular provider switch checks and scheduled job checks). However, when we switched to running the three delete notification types separately, we saw visibility timeout issues - three tasks would be created, all three would be picked up by one celery instance, the two worker processes would start on two of them, and the third would sit on the box, wait longer than the visibility timeout to be picked up (and acknowledged), and so SQS would assume the task was lost and replay it. it's queues all the way down! By putting them on the reporting worker we can take advantage of tuning that app (for example setting the prefetch multiplier to one) which is designed to run large tasks. We've also got more concurrent workers on this box, so we can run all three tasks at once.	2021-12-03 13:28:16 +00:00
Leo Hemsted	595ff134d7	Merge pull request #3388 from alphagov/parallel-deletes make delete notification tasks parallel by notification type	2021-12-02 10:33:53 +00:00
Leo Hemsted	6bbec9f103	make delete notification tasks parallel by notification type we used to do this until apr 2020. Let's try doing it again. Back then, we had problems with timing. We did two things in spring 2020: we moved to using an intermediary temp table [1], and we stopped the tasks being parallelised [2]. However, it turned out the real time saving was from changing what services we delete for [3]. The task was actually CPU-bound rather than DB-bound, so that's probably why having the tasks in parallel wasn't helping, since they were all competing for the same CPU. It's worth trying the parallel steps again now that we're no longer CPU bound. Note: Temporary tables are in their own postgres schema, and are only viewable by the current session (session == connection. Each celery worker process has its own db connection). We don't need to worry about separate workers both trying to use the same table at once. I've also added an "ON COMMIT DROP" directive to the table definition just to ensure it doesn't persist past the task even if there's an exception. (This also drops on rollback). Cronitor looks at the three functions separately so we don't need to worry about the main task taking milliseconds where it used to take hours as it isn't monitored itself. I've also removed some redundant exception logs. [1] https://github.com/alphagov/notifications-api/pull/2767 [2] https://github.com/alphagov/notifications-api/pull/2798 [3] https://github.com/alphagov/notifications-api/pull/3381	2021-12-01 14:28:08 +00:00
Ben Thorner	6435b57cd1	Merge pull request #3384 from alphagov/fix-cronitor-test-180344153 Fix flakey Cronitor test using caplog fixture	2021-12-01 12:44:27 +00:00
Rebecca Law	c78e6c8571	Merge pull request #3386 from alphagov/increase-concurrency-for-reporting-app Increase the concurrency for the delivery-worker-reporting	2021-12-01 11:51:50 +00:00
Rebecca Law	ddc03a9f5c	Merge pull request #3385 from alphagov/improve-is_provider_slow-query Improve query performance	2021-12-01 11:41:51 +00:00
Rebecca Law	e7efeec309	Increase the concurrency for the delivery-worker-reporting TL;DR After a chat with some team members we've decided to double the concurrency of the delivery-worker-reporting app to 4 from 2. Looking at the memory usage during the reporting task runs we don't believe this to be a risk. There are some other things to look at, but this could be a quick win in the short term. Longer read: Every night we have 2 "reporting" tasks that run. - create-nightly-billing starts at 00:15 - populates data for ft_billing for the previous days. - 4 days for email - 4 days for sms - 10 days for letters - create-nightly-notification-status starts at 00:30 - populates data for ft_notification - 4 days for email - 4 days for sms - 10 days for letters These tasks are picked up by the `notify-delivery-worker-reporting` app; we run 3 instances with a concurrency = 2. This means that we have 6 worker threads that pick up the 18 tasks created at 00:15 and 00:30. Each celery main thread picks up 10 tasks off the queue, and the 2 worker threads each start working on a task and acknowledge it to SQS. Meanwhile the other 8 tasks wait in the internal celery queue and no acknowledgement is sent to SQS. As each task completes, a worker thread picks up a new task and acknowledges it. If a task is kept in the Celery internal queue for longer than 5 minutes, the SQS visibility timeout expires, so SQS assumes the task has not completed and puts the task back on the queue, therefore creating a duplicate task. At some point all the tasks are completed; some are completed twice.	2021-12-01 11:40:18 +00:00
Rebecca Law	101498ec84	Improve query performance Adding a filter to the `app.dao.notifications_dao.is_delivery_slow_for_providers` query to improve the performance. Adding Notifications.notification_type = 'sms' to the query improves the performance: some analysis shows a 500ms improvement, which is a good thing, especially when the query is run once a minute.	2021-11-30 16:42:32 +00:00
Katie Smith	ad313065bf	Merge pull request #3368 from alphagov/org-agreement-details Add command to populate organisation table with agreement details	2021-11-30 14:30:56 +00:00
Katie Smith	250ce38cf2	Remove unnecessary list-routes command Since this was added, Flask now comes with a built-in command to list the routes, `flask routes`, so this is not needed.	2021-11-30 11:11:49 +00:00
Katie Smith	6d9f2c27d9	Add command to populate organisation table with agreement details When we first started recording the details of the agreements that were signed by organisations, we stored a copy of the signed agreement in Google drive. Later, we switched to storing the details in the database instead. This adds a command which is designed to be run once and which updates the database for the organisations which had the details of who accepted the agreement and when stored in Google drive.	2021-11-30 11:11:49 +00:00
Ben Thorner	aea5d601f2	Fix flakey Cronitor test using caplog fixture This appears to not be thread safe: it started failing when run in parallel with other tests in this PR [1]. We don't get much out of using caplog over patching - it just proves our logging config isn't swallowing the error logs, which we shouldn't need to test here. [1]: https://github.com/alphagov/notifications-api/pull/3383	2021-11-26 17:17:45 +00:00
David McDonald	16ec8ccb8a	Merge pull request #3382 from alphagov/2dp Report processing stats to 2dp rather than 1dp	2021-11-26 09:48:05 +00:00
David McDonald	648490bf62	Report processing stats as floats rather than 1dp We are starting to see lots of 100.0%s in the current table and we think this looks suspiciously too good so think it is beneficial to change it to be 2dp such that we get a few more non 100.0% values. For the admin app to be able to show things to 2dp, we need to give at least 2dp of accuracy otherwise we are losing 1dp of granularity. The approach is to just give all the granularity available by returning the exact result from the DB and then the admin can choose how many dps to use.	2021-11-25 17:14:34 +00:00
Leo Hemsted	ad263f6172	Merge pull request #3381 from alphagov/delete-notification-optimisation reduce number of services we try and delete notifications for	2021-11-24 16:34:20 +00:00
Leo Hemsted	bab659c677	reduce number of services we try and delete notifications for TLDR: Don't return as many services, and only return their IDs and not the whole service objects. Context: the delete notifications nightly task has been taking longer and longer, and to delete all three notification types in sequence it now takes up to 8 hours. This is because we were retrieving all services, loading them into memory on the worker, and then trying to delete notifications for each service in turn. While it does use a fair chunk of IOPS/CPU on our postgres db, we're not anywhere close to capacity on those (20% CPU, 4k IOPS out of 30k max)[1] The real issue appears to be that the task is CPU bound on the periodic worker - we see the worker spike up to 100% CPU regularly across the whole 3am-11am period. We also noticed that for each notification type the task first processes services with custom data retention (not many but some of the biggest users), then deals with all other services. We can see from looking at kibana that, for example, the task starts at 3am, and the custom data retention service email deletions are finished by 3:12am. The rest of the emails don't get deleted until 5am, so we knew that the problem is with how it handles the other services. There are currently 17000 services in the database. On a typical day, ~800 services will have notifications that are over 7 days old and need to be deleted. By only returning these services, we reduce the amount of data transfer and serialisation that needs to happen. It takes about two minutes to retrieve the distinct service ids from the notifications table for sms notifications, but that is only 5% the size of the full list so cuts down on a lot of processing Also, by only returning service_ids rather than the whole `Service` model we avoid sqlalchemy needing to do lots of data serialisation, when we were only using the `Service.id` field from that result anyway. 
[1] https://admin.cloud.service.gov.uk/organisations/55b1eb7d-e4c5-4359-9466-dd3ca5b0e457/spaces/80d769ff-7b01-49a4-9fa4-f87edd5328f9/services/6093d337-6918-4b97-9709-97529114eb90/metrics [2] https://grafana-paas.cloudapps.digital/d/_GlGBNbmk/notify-apps?orgId=2&refresh=5s&var-space=production&var-app=notify-delivery-worker-periodic&from=now-24h&to=now [3] https://kibana.logit.io/s/9423a789-282c-4113-908d-0be3b1bc9d1d/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(message),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'%22Deleting%20email%20notifications%20for%20services%20without%20flexible%20data%20retention%22')),sort:!('@timestamp',desc))	2021-11-24 16:18:40 +00:00
David McDonald	18776e4160	Merge pull request #3377 from alphagov/zero-case-performance-page Fix division by zero error on performance page	2021-11-22 13:44:32 +00:00
David McDonald	106187ba04	Fix division by zero error on performance page For preview and staging environments, we often send no messages in a single day. This is currently causing a `DivisionByZero` error that is rendering the page with no results. This makes it impossible to look at preview/staging and see if the performance page is working correctly or not. (psycopg2.errors.DivisionByZero) division by zero [SQL: SELECT CAST(ft_processing_time.bst_date AS TEXT) AS date, ft_processing_time.messages_total AS ft_processing_time_messages_total, ft_processing_time.messages_within_10_secs AS ft_processing_time_messages_within_10_secs, (ft_processing_time.messages_within_10_secs / CAST(ft_processing_time.messages_total AS FLOAT)) * %(param_1)s AS percentage FROM ft_processing_time WHERE ft_processing_time.bst_date >= %(bst_date_1)s AND ft_processing_time.bst_date <= %(bst_date_2)s ORDER BY ft_processing_time.bst_date] [parameters: {'param_1': 100, 'bst_date_1': datetime.date(2021, 11, 12), 'bst_date_2': datetime.date(2021, 11, 19)}] (Background on this error at: http://sqlalche.me/e/14/9h9h) I've fixed this by falling back to 100.0% for days we send no messages. Maybe some argument that it should be N/A rather than 100% but I think it doesn't really matter as this is only going to affect preview and staging as we will never have a day sending no messages in production.	2021-11-22 11:11:52 +00:00
Chris Hill-Scott	2b6a550cdc	Merge pull request #3372 from alphagov/update-utils-coordinate-transformation-2 Update utils to bring in coordinate transformation (attempt 2)	2021-11-18 16:10:45 +00:00
Chris Hill-Scott	c0742fe83d	Pass polygons through if they’re small already If a polygon is smaller than the largest polygon in our dataset of simplified polygons then we’re only throwing away useful detail by simplifying it. We should still simplify larger polygons as a fallback, to avoid sending anything to the CBC that we’re not sure it will like. The thresholds here are low: we can raise them as we test and experiment more. Here’s some data about the Flood Warning Service polygons. Point count by percentile: 80%: 226; 90%: 401.9; 95%: 640.45; 98%: 1015.38; 99%: 1389.07; 99.9%: 3008.609. Polygon count by percentile: 80%: 2; 90%: 3; 95%: 5; 98%: 8; 99%: 10; 99.9%: 40.469.	2021-11-18 15:48:45 +00:00
Chris Hill-Scott	4feb3fdc10	Bump utils This new version of utils implements the transformation of our polygons to a Cartesian plane. In other words, it converts them from being defined in spherical degrees to metres. For the API this means our simplification will be slightly more accurate.	2021-11-18 15:43:39 +00:00
Rebecca Law	443f197fee	Merge pull request #3376 from alphagov/update-query-for_insert_notification_history_delete_notifications Small update to query to reduce load on the task.	2021-11-18 11:28:44 +00:00
Rebecca Law	30a5852685	Update the query to only return the count from the table since that is all we care about. https://www.pivotaltracker.com/story/show/180262357	2021-11-17 14:46:52 +00:00
Ben Thorner	bffca39223	Merge pull request #3373 from alphagov/centralise-celery-180213914 Use central NotifyCelery base class in utils	2021-11-17 12:09:28 +00:00
Ben Thorner	e6b91f67d6	Merge pull request #3374 from alphagov/log-periodic-180330449 Log activity on all periodic Celery tasks	2021-11-17 12:09:19 +00:00
Ben Thorner	666ac1ab4f	Log activity on all periodic Celery tasks As stated in the comment, this would have been helpful during an incident to give further reassurance that a task had at least started running - at the time the only evidence for this was the Cronitor dashboard itself, which we don't often look at. I've removed other, equivalent "starting" logs, but kept those that provide additional information in the log message.	2021-11-17 09:48:03 +00:00
Ben Thorner	528223ed61	Use central NotifyCelery base class in utils Note that the new base class doesn't include a bespoke feature we had here: 'log_on_worker_shutdown'. We've agreed it's reasonable to remove it for now as it was introduced many years ago and its use case is unclear - we can always add it back if needed.	2021-11-16 13:58:12 +00:00
David McDonald	782aef351c	Merge pull request #3369 from alphagov/remove-o-fair Remove -Ofair option from celery worker	2021-11-16 11:49:52 +00:00
Ben Thorner	4e7b5e0104	Merge pull request #3371 from alphagov/reduce-concurrency-180116935 Reduce concurrency to match number of CPUs	2021-11-16 10:22:21 +00:00
Ben Thorner	fd2d411085	Merge pull request #3370 from alphagov/mention-pycurl Link to guidance about installing pycurl	2021-11-16 10:18:33 +00:00
Ben Thorner	82e4c3dad2	Reduce concurrency to match number of CPUs This got missed in [1]. [1]: `9e9091e980`	2021-11-15 16:45:05 +00:00
Chris Hill-Scott	4b44e3e223	Merge pull request #3358 from alphagov/remove-yesterdays-planned-tests-on-govuk-alerts Republish gov.uk/alerts every night to clear down planned tests	2021-11-15 15:34:21 +00:00
Ben Thorner	0fbca71545	Link to guidance about installing pycurl This seems to be an issue for several people when we install new versions of the package. Older versions of the package seem to be equally affected, so the new need for this is likely related to us using a newer OS / XCode version.	2021-11-15 15:21:09 +00:00
Chris Hill-Scott	0236318189	Republish gov.uk/alerts every night to clear down planned tests We have made it so that gov.uk/alerts shows a ‘1 planned test’ banner for the whole of the day when there has been an operator test on that day. We need to remove the banner when the day is over. The most straightforward way to do this is to republish the site at the start of every day. The gov.uk/alerts code[1] will work out if there are or aren’t any planned tests to show that day. 1. `5a274af6d0/app/models/alerts.py (L38-L44)`	2021-11-15 14:23:32 +00:00
Chris Hill-Scott	2f3c6112ba	Merge pull request #3361 from alphagov/celery-5.2.0 Bump Celery to latest version	2021-11-15 14:23:05 +00:00
David McDonald	c646176594	Remove -Ofair option from celery worker In version 4.0 of celery, -Ofair became the default scheduling strategy: https://docs.celeryproject.org/en/latest/history/whatsnew-4.0.html?highlight=fair#ofair-is-now-the-default-scheduling-strategy This appears to still be the case: `5d68d781de/celery/concurrency/asynpool.py (L80)` Note, it took me a while to be certain of this as the documentation for the celery CLI suggests a choice of `default` or `fair` which isn't so useful as both of these are `fair`: https://docs.celeryproject.org/en/latest/reference/cli.html#cmdoption-celery-worker-O	2021-11-15 11:52:57 +00:00
Chris Hill-Scott	0aa7cf1aaf	Tell Pyup to ignore outdated Eventlet version We already do this in the admin app: https://github.com/alphagov/notifications-admin/pull/3876/files Upgrading Eventlet is blocked until this change in Gunicorn is released: https://github.com/benoitc/gunicorn/pull/2581/files	2021-11-15 11:14:34 +00:00
Chris Hill-Scott	6c0bda0388	Bump Celery to latest version This brings in the version 5.2.1 of Kombu, which fixes a security vulnerability: > Celery 5.2.0 includes 'kombu' v5.2.1, which includes dependencies > updates that resolve security issues. — https://pyup.io/repos/github/alphagov/notifications-api/commits/?page=1#b654c27699a5164cbbe50e042d5d34141f560255 This is the commit from Kombu: `f3b04558fa` I believe the dependency of Kombu which has issues is urllib3, which has two open advisories for versions less than 1.26.5: - https://github.com/urllib3/urllib3/security/advisories/GHSA-q2q7-5pp4-w6pg - https://github.com/urllib3/urllib3/security/advisories/GHSA-5phf-pp7p-vc2r	2021-11-15 11:12:33 +00:00
David McDonald	608ef12573	Merge pull request #3367 from alphagov/better-log-message Improve log message searchability for duplicate receipts	2021-11-15 09:38:30 +00:00
David McDonald	c98996a461	Improve log message searchability for duplicate receipts There were two problems with the existing message. 1. There was no space between the new status and the time taken, which made reading and searching harder 2. The key bits of information (before and after status) were separated by the time taken (which will always be unique), meaning you couldn't do an easy search for a message that is, say, in delivered being attempted to be set to temporary-failure.	2021-11-12 14:06:38 +00:00
Ben Thorner	48e1482d90	Merge pull request #3366 from alphagov/celery-extend-request-id-180213914 Extend request tracing to cover Celery logs	2021-11-12 11:10:38 +00:00
Ben Thorner	d66c68d6d6	Merge pull request #3364 from alphagov/celery-headers-request-id-180213914 Move Celery task Request ID injection into headers	2021-11-12 11:10:29 +00:00
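
The pagination change in commits 989ef9c21a and c68d1a2f23 rests on one idea: to know whether a next page exists, there is no need to count every row. The sketch below shows one common way to get that answer, by asking for one row more than the page size. It is an illustration only, not the code from the repository: the `notifications` table, the `get_page` helper and the use of an in-memory SQLite database are all assumptions made for the example.

```python
import sqlite3


def get_page(conn, page, page_size):
    """Return (ids, has_next) for a 1-indexed page, without a COUNT(*) query.

    Hypothetical helper: the table and column names are illustrative only.
    """
    offset = (page - 1) * page_size
    # Ask for one extra row. If it comes back, a next page exists.
    rows = conn.execute(
        "SELECT id FROM notifications ORDER BY id LIMIT ? OFFSET ?",
        (page_size + 1, offset),
    ).fetchall()
    has_next = len(rows) > page_size
    # Drop the extra row before returning the page contents.
    return [row[0] for row in rows[:page_size]], has_next


# Small in-memory dataset with five notifications (ids 1 to 5).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notifications (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO notifications (id) VALUES (?)", [(i,) for i in range(1, 6)]
)

first_page, first_has_next = get_page(conn, page=1, page_size=2)
last_page, last_has_next = get_page(conn, page=3, page_size=2)
```

The trade-off matches the one described in the commit messages: the `last` and `total` links can no longer be produced, because the total number of rows is never computed.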

8533 Commits
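
Commit 106187ba04 describes falling back to 100.0% on days when no messages are sent, so that the percentage calculation never divides by zero. The actual fix lives in the SQL query shown in that commit message; the plain-Python sketch below only illustrates the same rule, and the function name `percentage_within_10_secs` is an assumption made for the example.

```python
def percentage_within_10_secs(messages_within_10_secs, messages_total):
    """Return the share of messages processed within 10 seconds, as a percentage.

    Hypothetical helper mirroring the rule in the commit message:
    a day with no messages is reported as 100.0 rather than raising an error.
    """
    if messages_total == 0:
        # Nothing was sent, so nothing was slow: fall back to 100.0%.
        return 100.0
    return messages_within_10_secs / messages_total * 100


quiet_day = percentage_within_10_secs(0, 0)
busy_day = percentage_within_10_secs(1, 4)
```

Returning the unrounded float also fits commit 648490bf62, which leaves the choice of decimal places to the admin app.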