Commit Graph

4085 Commits

Ben Thorner
2acc4ee67d Repurpose command to replay notification callbacks
This is so we can use it to address issues highlighted by the new
alert, if it's not possible to actually send the notifications e.g.
if they are somehow 'invalid'.

Previously this was added for a one-off use case [1]. This rewrites
the task to operate on arbitrary notification IDs instead of client
refs, which aren't always present for notifications we may want to
send / replay callbacks for. Since the task may now need to work on
notifications from more than one service, I had to restructure it to cope
with multiple callback APIs.

Note that, in the test, I've chosen to do a chain of invocations and
assertions, rather than duplicate a load of boilerplate or introduce
funky parametrize flags for a service with/without a callback API. We'll
refactor this in a later commit.

[1]: e95740a6b5
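
A sketch of the reworked shape (assumed names throughout: `Notification`,
`get_callback_api_for_service`, `send_delivery_status_to_service` and the
queue are illustrative stand-ins, not the repo's real API):

    import click
    from flask.cli import with_appcontext

    @click.command("replay-service-callbacks")
    @click.option("-n", "--notification-id", "notification_ids", multiple=True)
    @with_appcontext
    def replay_service_callbacks(notification_ids):
        for notification_id in notification_ids:
            notification = Notification.query.get(notification_id)
            # Notifications may belong to different services, so resolve the
            # callback API per notification instead of assuming one service.
            callback_api = get_callback_api_for_service(notification.service_id)
            if callback_api is None:
                click.echo(f"No callback API for service {notification.service_id}")
                continue
            send_delivery_status_to_service.apply_async(
                [str(notification_id)], queue="service-callbacks"
            )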
2021-12-06 14:00:35 +00:00
Ben Thorner
f96ba5361a Add new task to alert about created email / sms
This will log an error when email or SMS notifications have been
stuck in 'created' for too long - normally they should be 'sending'
in seconds, noting that we have a goal of < 10s wait time for most
notifications processed on our platform.

In the next commits we'll decouple similar functionality from the
existing 'timeout-sending-notifications' task.
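
A rough sketch of the new task (the threshold, task name, `notify_celery`
and `Notification` usage here are assumptions, not the actual
implementation):

    from datetime import datetime, timedelta

    from flask import current_app

    @notify_celery.task(name="check-for-stuck-created-notifications")
    def check_for_stuck_created_notifications():
        # Illustrative threshold: anything still 'created' after this long
        # has missed the < 10s processing goal by a wide margin.
        threshold = datetime.utcnow() - timedelta(minutes=15)
        stuck = Notification.query.filter(
            Notification.notification_type.in_(["email", "sms"]),
            Notification.status == "created",
            Notification.created_at < threshold,
        ).count()
        if stuck:
            current_app.logger.error(
                "%s notifications stuck in 'created' for over 15 minutes", stuck
            )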
2021-12-06 14:00:31 +00:00
Ben Thorner
6435b57cd1 Merge pull request #3384 from alphagov/fix-cronitor-test-180344153
Fix flakey Cronitor test using caplog fixture
2021-12-01 12:44:27 +00:00
Ben Thorner
aea5d601f2 Fix flakey Cronitor test using caplog fixture
This appears not to be thread-safe: it started failing when run in
parallel with other tests in this PR [1]. We don't get much out of
using caplog over patching - it just proves our logging config isn't
swallowing the error logs, which we shouldn't need to test here.

[1]: https://github.com/alphagov/notifications-api/pull/3383
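
For illustration, the patched version of the test looks something like this
(a sketch: it assumes the `notify_api` fixture is the Flask app, and
`ping_cronitor` is an assumed name for the function under test):

    from unittest import mock

    def test_cronitor_logs_error_when_ping_fails(notify_api):
        # Patch the logger method directly rather than capturing propagated
        # records with caplog, which proved flaky under parallel test runs.
        with notify_api.app_context(), mock.patch.object(
            notify_api.logger, "error"
        ) as mock_error:
            ping_cronitor("start")
        mock_error.assert_called_once()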
2021-11-26 17:17:45 +00:00
David McDonald
16ec8ccb8a Merge pull request #3382 from alphagov/2dp
Report processing stats to 2dp rather than 1dp
2021-11-26 09:48:05 +00:00
David McDonald
648490bf62 Report processing stats as floats rather than 1dp
We are starting to see lots of 100.0% values in the current table,
and this looks suspiciously good, so we think it is beneficial to
change it to 2dp so that we get a few more non-100.0% values.

For the admin app to be able to show things to 2dp, we need to
give it at least 2dp of accuracy, otherwise we are losing a decimal
place of granularity.

The approach is to just give all the granularity available by
returning the exact result from the DB, and then the admin app can
choose how many decimal places to use.
2021-11-25 17:14:34 +00:00
Leo Hemsted
bab659c677 reduce number of services we try and delete notifications for
TLDR: Don't return as many services, and only return their IDs and not
the whole service objects.

Context:

the delete notifications nightly task has been taking longer and longer,
and to delete all three notification types in sequence it now takes up
to 8 hours.

This is because we were retrieving all services, loading them into
memory on the worker, and then trying to delete notifications for each
service in turn.

While it does use a fair chunk of IOPS/CPU on our postgres db, we're not
anywhere close to capacity on those (20% CPU, 4k IOPS out of 30k max) [1].

The real issue appears to be that the task is CPU bound on the periodic
worker - we see the worker spike up to 100% CPU regularly across the
whole 3am-11am period.

We also noticed that for each notification type the task first processes
services with custom data retention (not many but some of the biggest
users), then deals with all other services. We can see from looking at
kibana that, for example, the task starts at 3am, and the custom data
retention service email deletions are finished by 3:12am. The rest of
the emails don't get deleted until 5am, so we knew that the problem
was with how it handles the other services.

There are currently 17000 services in the database. On a typical day,
~800 services will have notifications that are over 7 days old and need
to be deleted. By only returning these services, we reduce the amount of
data transfer and serialisation that needs to happen. It takes about two
minutes to retrieve the distinct service ids from the notifications
table for sms notifications, but that is only 5% of the size of the
full list, so it cuts down on a lot of processing.

Also, by only returning service_ids rather than the whole `Service`
model we avoid sqlalchemy needing to do lots of data serialisation, when
we were only using the `Service.id` field from that result anyway.

[1] https://admin.cloud.service.gov.uk/organisations/55b1eb7d-e4c5-4359-9466-dd3ca5b0e457/spaces/80d769ff-7b01-49a4-9fa4-f87edd5328f9/services/6093d337-6918-4b97-9709-97529114eb90/metrics
[2] https://grafana-paas.cloudapps.digital/d/_GlGBNbmk/notify-apps?orgId=2&refresh=5s&var-space=production&var-app=notify-delivery-worker-periodic&from=now-24h&to=now
[3] https://kibana.logit.io/s/9423a789-282c-4113-908d-0be3b1bc9d1d/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(message),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'%22Deleting%20email%20notifications%20for%20services%20without%20flexible%20data%20retention%22')),sort:!('@timestamp',desc))
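
The query change amounts to something like this (a sketch; the helper name
and exact filters are assumed, with `db` and `Notification` from the app):

    from sqlalchemy import distinct

    def get_service_ids_with_old_notifications(notification_type, older_than):
        # Only ~800 of ~17000 services have week-old notifications on a
        # typical day, and downstream code only ever used Service.id, so
        # return bare ids instead of fully serialised Service models.
        rows = (
            db.session.query(distinct(Notification.service_id))
            .filter(
                Notification.notification_type == notification_type,
                Notification.created_at < older_than,
            )
            .all()
        )
        return [service_id for service_id, in rows]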
2021-11-24 16:18:40 +00:00
David McDonald
18776e4160 Merge pull request #3377 from alphagov/zero-case-performance-page
Fix division by zero error on performance page
2021-11-22 13:44:32 +00:00
David McDonald
106187ba04 Fix division by zero error on performance page
For preview and staging environments, we often send no messages
in a single day. This is currently causing a `DivisionByZero` error
that is rendering the page with no results. This makes it impossible
to look at preview/staging and see if the performance page is
working correctly or not.

(psycopg2.errors.DivisionByZero) division by zero

[SQL: SELECT CAST(ft_processing_time.bst_date AS TEXT) AS date, ft_processing_time.messages_total AS ft_processing_time_messages_total, ft_processing_time.messages_within_10_secs AS ft_processing_time_messages_within_10_secs, (ft_processing_time.messages_within_10_secs / CAST(ft_processing_time.messages_total AS FLOAT)) * %(param_1)s AS percentage
FROM ft_processing_time
WHERE ft_processing_time.bst_date >= %(bst_date_1)s AND ft_processing_time.bst_date <= %(bst_date_2)s ORDER BY ft_processing_time.bst_date]
[parameters: {'param_1': 100, 'bst_date_1': datetime.date(2021, 11, 12), 'bst_date_2': datetime.date(2021, 11, 19)}]
(Background on this error at: http://sqlalche.me/e/14/9h9h)

I've fixed this by falling back to 100.0% for days when we send
no messages. There's maybe an argument that it should be N/A rather
than 100%, but I think it doesn't really matter, as this is only
going to affect preview and staging: we will never have a day
sending no messages in production.
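
The fallback can live in the query itself; a sketch in SQLAlchemy 1.4's
legacy `case()` style, with `FactProcessingTime` as an assumed model name
for the ft_processing_time table:

    from sqlalchemy import Float, case, cast

    # When a day has no messages, short-circuit to 100.0 instead of
    # dividing by messages_total (which would be zero).
    percentage = case(
        [(FactProcessingTime.messages_total == 0, 100.0)],
        else_=cast(FactProcessingTime.messages_within_10_secs, Float)
        / cast(FactProcessingTime.messages_total, Float)
        * 100,
    ).label("percentage")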
2021-11-22 11:11:52 +00:00
Chris Hill-Scott
c0742fe83d Pass polygons through if they’re small already
If a polygon is smaller than the largest polygon in our dataset of
simplified polygons then we’re only throwing away useful detail by
simplifying it.

We should still simplify larger polygons as a fallback, to avoid sending
anything to the CBC that we’re not sure it will like.

The thresholds here are low: we can raise them as we test and experiment
more.

Here’s some data about the Flood Warning Service polygons:

Percentile | 80% | 90%   | 95%    | 98%     | 99%     | 99.9%
-----------|-----|-------|--------|---------|---------|---------
Point count| 226 | 401.9 | 640.45 | 1015.38 | 1389.07 | 3008.609

Percentile    | 80% | 90%   | 95%    | 98%     | 99%     | 99.9%
--------------|-----|-------|--------|---------|---------|---------
Polygon count | 2   | 3     | 5      | 8       | 10      | 40.469
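
The guard itself is simple; a sketch with assumed names (`point_count`,
`simplify`, and an illustrative threshold):

    MAX_PASSTHROUGH_POINT_COUNT = 250  # illustrative; real thresholds are low

    def simplify_if_needed(polygon):
        # Already smaller than the largest polygon in our pre-simplified
        # dataset? Then simplifying would only throw away useful detail.
        if point_count(polygon) <= MAX_PASSTHROUGH_POINT_COUNT:
            return polygon
        # Still simplify larger polygons as a fallback, so we never send
        # the CBC anything we're not sure it will like.
        return simplify(polygon)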
2021-11-18 15:48:45 +00:00
Chris Hill-Scott
4feb3fdc10 Bump utils
This new version of utils implements the transformation of our polygons
to a Cartesian plane. In other words, it converts them from being
defined in spherical degrees to metres.

For the API this means our simplification will be slightly more
accurate.
2021-11-18 15:43:39 +00:00
Ben Thorner
bffca39223 Merge pull request #3373 from alphagov/centralise-celery-180213914
Use central NotifyCelery base class in utils
2021-11-17 12:09:28 +00:00
Ben Thorner
666ac1ab4f Log activity on all periodic Celery tasks
As stated in the comment, this would have been helpful during an
incident to give further reassurance that a task had at least
started running - at the time the only evidence for this was the
Cronitor dashboard itself, which we don't often look at.

I've removed other, equivalent "starting" logs, but kept those
that provide additional information in the log message.
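
The shape of the change, sketched as a decorator (an assumed mechanism;
the commit doesn't dictate this exact shape):

    import functools

    from flask import current_app

    def log_task_start(task_func):
        # Log on entry, so during an incident the app logs alone can confirm
        # that a scheduled task at least started running.
        @functools.wraps(task_func)
        def wrapper(*args, **kwargs):
            current_app.logger.info("Starting periodic task: %s", task_func.__name__)
            return task_func(*args, **kwargs)
        return wrapper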
2021-11-17 09:48:03 +00:00
Ben Thorner
528223ed61 Use central NotifyCelery base class in utils
Note that the new base class doesn't include a bespoke feature we
had here: 'log_on_worker_shutdown'. We've agreed it's reasonable
to remove it for now as it was introduced many years ago and its
use case is unclear - we can always add it back if needed.
2021-11-16 13:58:12 +00:00
Chris Hill-Scott
0236318189 Republish gov.uk/alerts every night to clear down planned tests
We have made it so that gov.uk/alerts shows a ‘1 planned test’ banner
for the whole of the day when there has been an operator test on that
day.

We need to remove the banner when the day is over.

The most straightforward way to do this is to republish the site at the
start of every day. The gov.uk/alerts code[1] will work out if there are
or aren’t any planned tests to show that day.

1. 5a274af6d0/app/models/alerts.py (L38-L44)
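
In scheduling terms this is one more beat entry, firing at the start of
every day; a sketch with assumed task and entry names:

    from celery.schedules import crontab

    CELERYBEAT_SCHEDULE = {
        "trigger-govuk-alerts-republish": {
            # Republish shortly after midnight so the previous day's
            # planned-test banner gets cleared down.
            "task": "publish-govuk-alerts",
            "schedule": crontab(hour=0, minute=10),
        },
    }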
2021-11-15 14:23:32 +00:00
Ben Thorner
d66c68d6d6 Merge pull request #3364 from alphagov/celery-headers-request-id-180213914
Move Celery task Request ID injection into headers
2021-11-12 11:10:29 +00:00
Ben Thorner
4a577eca62 Merge pull request #3359 from alphagov/improve-clarify-botocore-exception-180017131
Improve and clarify large task error handling
2021-11-12 11:10:17 +00:00
Ben Thorner
1872854a4e Improve and clarify large task error handling
Previously we were catching one type of exception if something went
wrong adding a notification to the queue for high volume services.
In reality there are two types of exception so this adds a second
handler to cover both.

For context, this is code we changed experimentally as part of the
upgrade to Celery 5 [1]. At the time we didn't check how the new
exception compared to the old one. It turns out they behaved the
same and we were always vulnerable to the scenario now covered by
the second exception, where the behaviour has changed in Celery 5 -
testing with a large task invocation gives...

Before (Celery 3, large-ish task):

    'process_job.apply_async(["a" * 200000])'...

    boto.exception.SQSError: SQSError: 400 Bad Request
    <?xml version="1.0"?><ErrorResponse xmlns="http://queue.amazonaws.com/doc/2012-11-05/"><Error><Type>Sender</Type><Code>InvalidParameterValue</Code><Message>One or more parameters are invalid. Reason: Message must be shorter than 262144 bytes.</Message><Detail/></Error><RequestId>96162552-cd96-5a14-b3a5-7f503300a662</RequestId></ErrorResponse>

Before (Celery 3, very large task):

    <hangs forever>

After (Celery 5, large-ish task):

    botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the SendMessage operation: One or more parameters are invalid. Reason: Message must be shorter than 262144 bytes.

After (Celery 5, very large task):

    botocore.parsers.ResponseParserError: Unable to parse response (syntax error: line 1, column 0), invalid XML received. Further retries may succeed:
    b'HTTP content length exceeded 1662976 bytes.'

[1]: 29c92a9e54
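
So the handler now needs to catch both; a sketch (the fallback helper and
surrounding names are assumptions):

    from botocore.exceptions import ClientError
    from botocore.parsers import ResponseParserError
    from flask import current_app

    try:
        process_job.apply_async([signed_notification], queue=queue_name)
    except (ClientError, ResponseParserError):
        # ClientError: SQS rejects messages over 262144 bytes outright.
        # ResponseParserError: very large payloads make SQS return a
        # plain-text error body that botocore can't parse as XML.
        current_app.logger.exception("Failed to queue notification task")
        save_notification_without_queueing(signed_notification)  # assumed fallback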
2021-11-11 17:37:50 +00:00
Ben Thorner
89a8dd1a03 Move Celery task Request ID injection into headers
Previously we passed along this piece of state via the kwargs for
a task, but this runs the risk of the task accidentally receiving
the extra kwarg unless we've covered all the code paths that could
invoke it directly e.g. retries don't invoke __call__.

This switches to using Celery "headers" to pass the extra state. It
turns out that Celery has two "header" concepts, which leads to
some confusion and even a bug with the framework [1]:

- In older (pre v4.4) versions of Celery, the "headers" specified
by apply_async() would become _the_ headers in the message that
gets passed around workers, etc. These would be available later on
via "self.request.headers".

- Since Celery protocol v2, the meaning of "headers" in the message
changed to become (basically) _all_ metadata about the task [2],
with the "headers" option in apply_async() being merged [3] into
the big dict of metadata.

This makes using headers a bit confusing unfortunately, since the
data structure we put in is subtly different to what comes out in
the request context. Nonetheless, it still works. I've added some
comments to try and clarify it.

Note that one of the original tests is no longer necessary, since we
don't need to worry about argument passing styles with headers.

[1]: https://github.com/celery/celery/issues/4875
[2]: 663e4d3a0b (diff-07a65448b2db3252a9711766beec23372715cd7597c3e309bf53859eabc0107fR343)
[3]: 681a922220/celery/app/amqp.py (L495)
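
Sketched, the two halves of the change look like this (assumed names; the
real base class lives in notifications-utils):

    from celery import Task

    class NotifyTask(Task):
        def apply_async(self, args=None, kwargs=None, **options):
            # State rides in Celery "headers", so a retry or direct call
            # can never accidentally receive it as an extra kwarg.
            options["headers"] = {
                **(options.get("headers") or {}),
                "notify_request_id": get_current_request_id(),  # assumed helper
            }
            return super().apply_async(args, kwargs, **options)

        def __call__(self, *args, **kwargs):
            # Under protocol v2 the custom header is merged into the task's
            # metadata, so it comes back as an attribute of the request
            # context rather than under self.request.headers.
            request_id = getattr(self.request, "notify_request_id", None)
            bind_request_id_to_logs(request_id)  # assumed helper
            return super().__call__(*args, **kwargs)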
2021-11-10 18:03:40 +00:00
Katie Smith
3d4796c924 Add task to resanitise and replace a PDF for precompiled letter
This adds a task which is designed to be used if we want to recreate the
PDF for a precompiled letter (either one that has been created using the
API or one that has been uploaded through the website).

The task takes the `notification_id` of the letter and passes template
preview the details it needs in order to sanitise the original file and
then replace the version in the letters-pdf bucket with the freshly
sanitised version.
2021-11-10 09:51:31 +00:00
Ben Thorner
29c92a9e54 Try removing boto package again 2021-11-01 09:54:10 +00:00
Ben Thorner
d1586a8f81 CC DVLA in tickets about outstanding letters
Previously we sent them emails about this manually. We also tried
a Zendesk macro/trigger approach, but using a CC means:

- We can control the behaviour ourselves (Zendesk triggers can only
be edited by admins outside our team).

- We keep the DVLA notification approach consistent and in one place,
so notifications always go to the same people.

- Any further (public) updates to the ticket will also trigger a
notification to DVLA (previous trigger only notified on creation).
2021-10-29 11:46:29 +01:00
David McDonald
5a51ab6131 Bug fix: update normalised_to, not just to after letter sanitise
When a precompiled letter is sent to us, we set the `to` field as
'Provided as PDF' in
1c1023a877/app/v2/notifications/post_notifications.py (L100-L104)

This then also sets `normalised_to` as `providedaspdf`.

However, when template preview sanitises the letter, pulls out the
address and gives it to the API, we were only setting `to` to be
the new address and had forgotten to also amend `normalised_to` to
be the normalised version. This meant that for all these letters
we accidentally left `normalised_to` as `providedaspdf`. The impact
of this was that we could not then search for these letters in the
admin user interface, as it relies on the `normalised_to` field
containing the recipient address.

This commit fixes that bug by also setting the `normalised_to`
field.
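
The fix is essentially one extra assignment next to the existing one; a
sketch with assumed helper names (`normalise_address`, `db` from the app):

    def update_letter_address(notification, extracted_address):
        notification.to = extracted_address
        # The missing line: without it, normalised_to kept the placeholder
        # 'providedaspdf', so the admin's recipient search (which matches
        # on normalised_to) could never find these letters.
        notification.normalised_to = normalise_address(extracted_address)
        db.session.commit()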
2021-10-27 11:56:25 +01:00
Chris Hill-Scott
54bcf618da Store the event field from CAP XML broadcasts
We don’t store everything that comes in the CAP XML when someone creates
a broadcast via the API.

One thing we do store is `<identifier>` (in a column called `reference`)
which is a unique (to the external system) identifier for the broadcast.
We show this in the front end instead of the template name, because
broadcasts created from the API don’t use templates.

However this ID isn’t very friendly – the Environment Agency just supply
a UUID.

The Environment Agency also populate the `<event>` field with some human
readable text, for example:
> 013 Issue Severe Flood Warning EA

(013 is an area code which will be meaningful to the Flood Warning
Service team)

We should show this in the UI instead of the reference. The first step
towards this is storing it in the database and returning it in the REST
endpoints.

Later we can have the admin app prefer `cap_event` over `reference`,
where `cap_event` is present.

We can’t backfill this data because we don’t keep a copy of the original
XML.

Seems like `<event>` is a mandatory property of `<info>`, so we don’t
need to worry about the field being missing (`<info>` is optional in
CAP but we require it because it contains stuff like the areas which
we need in order to send out the broadcast).
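
The storage half is just a nullable column; a sketch of the migration
(table name and details assumed):

    import sqlalchemy as sa
    from alembic import op

    def upgrade():
        # Nullable because we can't backfill: the original XML isn't kept.
        op.add_column(
            "broadcast_message", sa.Column("cap_event", sa.String, nullable=True)
        )

    def downgrade():
        op.drop_column("broadcast_message", "cap_event")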

***

https://www.pivotaltracker.com/story/show/176927060
2021-10-26 11:12:27 +01:00
Ben Thorner
d703251b13 Merge pull request #3348 from alphagov/better-callback-stats-180016688
Include status in stats about delivery times
2021-10-22 11:59:24 +01:00
Ben Thorner
f974108934 Include status in stats about delivery times
Previously these metrics weren't very useful because they could be
skewed by long timings for failed notifications, which can take up
to 72 hours to reach a final state. I'm intentionally not trying to
have a dual running period (with the old and new names) because:

- We don't use the current stats for anything (checking Grafana).

- The current stats get turned into a "bucket" metric in Prometheus
[1][2], which isn't very useful because it can only tell us the mean
time to deliver, but we're actually interested in percentiles.

Switching to a new naming is an opportunity to fix the raw data and
the way it's aggregated, using the same kind of "summary" metric that
we now use for stats about our Celery tasks [3].

[1]: c330a8ac8a/paas/statsd/statsd-mapping.yml (L82)
[2]: https://prometheus.io/docs/practices/histograms/#quantiles
[3]: https://github.com/alphagov/notifications-aws/pull/890
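
Sketch of the new stat shape (client call and naming are assumptions):
putting the status in the metric name gives a separate summary series per
outcome, so a slow failure can't skew the delivered timings.

    statsd_client.timing(
        f"delivery.total-time.{notification.notification_type}.{notification.status}",
        delivery_time_in_seconds,
    )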
2021-10-20 17:22:59 +01:00
Leo Hemsted
0b8c6ef263 Merge pull request #3339 from alphagov/letter-runbook-link
tweak zendesk message for no ack files alert
2021-10-20 15:23:33 +01:00
Pea Tyczynska
1b6f9505da Call publish-govuk-alerts task when alert expires
The `auto-expire-broadcast-messages` task checks for expired broadcasts
at five minute intervals. This change now calls the
`publish-govuk-alerts` task in govuk-alerts if there are expired
broadcasts so that the site is updated.

Co-authored-by: Katie Smith <katie.smith@digital.cabinet-office.gov.uk>
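
Since the API can't import govuk-alerts code, the trigger goes by task
name; a sketch (task and queue names are assumptions):

    # Dispatch onto the queue the govuk-alerts worker consumes from.
    notify_celery.send_task(name="publish-govuk-alerts", queue="govuk-alerts")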
2021-10-18 08:41:25 +01:00
Katie Smith
04bfd6bfdb Trigger task to publish alerts when sending or cancelling alert
When we send or cancel a broadcast message, we now trigger a task
in govuk-alerts repo that polls our API for alerts and
publishes a fresh list of alerts.

Co-authored-by: Pea Tyczynska <pea.tyczynska@digital.cabinet-office.gov.uk>
2021-10-18 08:41:24 +01:00
Leo Hemsted
b8c4e19072 tweak zendesk message for no ack files alert
include a link to a runbook entry.

also the list of acknowledgement files can be very long, so make that
the last thing, and use new lines to space out the message.
2021-10-08 13:45:02 +01:00
Chris Hill-Scott
544bfbf569 Add separate config item for failed login count
It’s confusing that changing `MAX_VERIFY_CODE_COUNT` also limits the
number of failed login attempts that a user of text message 2FA can
make.

This makes the parameters independent, and adds a test to make sure any
future changes which affect the limit of failed login attempts are
covered.
2021-10-04 10:45:07 +01:00
Chris Hill-Scott
786893d920 Reduce max concurrent 2 factor codes
I was doing some analysis and saw that in the last 24 hours the most
codes that anyone had in a 15 minute window was 3.

So I think we can safely reduce this to 5 to get a bit more security,
with enough headroom to not have any negative impact on the user.
2021-10-04 10:45:06 +01:00
Chris Hill-Scott
19ad11e383 Don’t repeat digits in security codes
People with dyslexia and dyscalculia find it difficult to transcribe
codes which have consecutive, repeated digits[1].

This commit enhances the algorithm for generating codes to not repeat
the previous digit in a code.

This reduces the key space for our codes from 100,000 possibilities
(10^5) to 65,610 (10 × 9^4: ten choices for the first digit, then
nine for each subsequent digit).

1. https://twitter.com/annaecook/status/1442567679710150662
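
A sketch of the enhanced generator (the 5-digit length is an assumption):

    import secrets

    def generate_security_code(length=5):
        # Never repeat the previous digit: 10 choices for the first digit,
        # then 9 for each later one, i.e. 10 * 9**4 = 65,610 codes at
        # length 5, down from 10**5 = 100,000.
        digits = [secrets.choice("0123456789")]
        while len(digits) < length:
            candidate = secrets.choice("0123456789")
            if candidate != digits[-1]:
                digits.append(candidate)
        return "".join(digits)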
2021-09-30 10:24:17 +01:00
Katie Smith
58597653df Update how "sending to TV numbers" Zendesk tickets are created 2021-09-29 11:26:20 +01:00
Katie Smith
0c0c7f4478 Update how "letters still created status" Zendesk tickets are created 2021-09-29 11:23:28 +01:00
Katie Smith
2f66e38fb9 Update how "missing ackfile for letters" Zendesk tickets are created 2021-09-29 11:10:50 +01:00
Katie Smith
64c0a3fb9d Update how 'letters still sending' Zendesk tickets are created
These now use the new Zendesk form.
2021-09-29 11:07:37 +01:00
Katie Smith
b114dadcae Update how pending virus check Zendesk tickets are created
This updates the tickets that are created when the
`check_if_letters_still_pending_virus_check` scheduled task detects
letters in the `pending-virus-check` state.
2021-09-29 11:03:48 +01:00
Katie Smith
9ff0ca0363 Update how live broadcast Zendesk tickets are created
These now use the Notify Form in Zendesk
2021-09-29 10:59:07 +01:00
Ben Thorner
68eeb1defa Merge pull request #3325 from alphagov/prevent-empty-areas-178986763
Add validation to prevent blank area names
2021-09-17 15:20:15 +01:00
Ben Thorner
d8a0967ec0 Add validation to prevent blank area names
Now that these are used for display on gov.uk/alerts we need to
make sure the data is being set properly. We've already found an
example where it wasn't [1]. We validate external broadcasts in
two stages: with the official CAP XML schema [2] and then again
with our own, more specific schema for the converted JSON. Since
this validation is a custom requirement I've made it part of the
JSON schema. Note that jsonschema recommends avoiding metachars
like "\w" since they're not supported by all implementations [3].

I've tested the new validation manually and it works as expected
by disallowing e.g. "  " but still allowing "foo" and "foo bar".

[1]: https://www.notifications.service.gov.uk/services/120107d0-d99a-4c42-8b70-f37d2f28879b/rejected-alerts/d6e0c70e-60f6-4422-8589-2a2d159c63f2
[2]: 81a25ff1ef/app/xml_schemas/CAP-v1.2.xsd
[3]: http://json-schema.org/understanding-json-schema/reference/regular_expressions.html
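
At the schema level the check is just a pattern; a sketch (property layout
assumed, and the exact pattern in the repo may differ):

    # Require at least one non-whitespace character, written as an explicit
    # character class rather than shorthand like \w or \S, which [3] warns
    # isn't supported by every jsonschema implementation.
    area_name = {
        "type": "string",
        "pattern": "[^ \\t\\r\\n]",
    }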
2021-09-17 13:33:52 +01:00
Ben Thorner
6a53871455 Restructure govuk-alerts endpoint to be internal
In response to: https://github.com/alphagov/notifications-api/pull/3305#pullrequestreview-726672421

Previously this was added among the public /v2 endpoints, but it's
only meant for internal use. While only the govuk-alerts app would
be able to access it, the location and /v2 URL suggested otherwise.
This restructures the endpoint so it resembles other internal ones.
2021-09-15 15:36:17 +01:00
Ben Thorner
12640e5380 Fix misleading test name for error scenario 2021-09-15 11:02:51 +01:00
Leo Hemsted
33ca817e17 return contact list created_at in UTC, not BST
make sure timestamps returned from the api are always consistent.

The only place in models where we're serializing a BST timestamp is on
the Notification.serialize_for_csv method now, which at least is a bit
different as this is user-facing (it also returns a formatted
human-readable notification_status for example).
2021-09-14 12:41:52 +01:00
Ben Thorner
81a25ff1ef Merge pull request #3327 from alphagov/bump-utils-46-0-0
Bump utils to 46.0.0
2021-09-14 11:22:10 +01:00
Ben Thorner
c53ee6de94 Merge pull request #3323 from alphagov/add-permission-local-dev
Make it easy to develop with broadcast services
2021-09-14 11:22:00 +01:00
Ben Thorner
4d05b064fd Remove redundant DB fixtures in conftest.py
These were only there to ensure a DB session existed for the test
and are now included implicitly as one of the dependencies of the
"sample_user" fixture.
2021-09-14 09:37:32 +01:00
Ben Thorner
c9ea85938a Refactor "create_user" to actually create a user
This switches a number of fixtures to use "sample_user", which is
equivalent to calling the previous "create_user" function when it
used to default the email to "notify@...".
2021-09-14 09:29:28 +01:00
Ben Thorner
922fd2f333 Support testing commands and add first test
We have a lot of commands and it's important we test the ones that
are meant to be used in the future to ensure they work when they're
needed. Testing Flask commands is usually easy, as described in their
docs [1], but I had to make some changes to the way we decorate the
command functions so they can work with test DB objects - I couldn't
find any example of someone else encountering the same problem.

[1]: https://flask.palletsprojects.com/en/2.0.x/testing/#testing-cli-commands
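
The docs recipe itself is short; a sketch with assumed fixture and command
names:

    def test_replay_service_callbacks_command(notify_api, sample_notification):
        # Flask's built-in CLI runner invokes the command in-process, so it
        # shares the test's app and DB session (once the decorators allow it).
        runner = notify_api.test_cli_runner()
        result = runner.invoke(
            args=["replay-service-callbacks", "-n", str(sample_notification.id)]
        )
        assert result.exit_code == 0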
2021-09-14 09:29:23 +01:00
sakisv
9faa3d34e1 Fix tests
Specifically, no longer test for a p1 zendesk when sending an alert
and drop misleading "p1" from test name when cancelling an alert.

We're no longer creating a P1 from the code, but we _do_ create a
zendesk ticket when sending out an alert.

When cancelling, what we want to test is that we don't create a second
ticket when the alert is cancelled.
2021-09-10 10:14:28 +03:00