### The facts
* Celery grabs up to 10 tasks from an SQS queue by default
* Each broadcast task takes a couple of seconds to execute, or double
that if it has to go to the failover proxy
* Broadcast tasks delay retry exponentially, up to 300 seconds.
* Tasks are acknowledged when Celery starts executing them.
* If a task is not acknowledged before its visibility timeout of 310
seconds, SQS assumes the Celery app has died and puts the task back on
the queue.
### The situation
A task stuck in a retry loop was reaching its visibility timeout, so SQS
was duplicating it. We're unsure of the exact cause, but there were two
contributing factors: the Celery prefetch and the retry delay of 300
seconds. Essentially, Celery grabs the task, keeps an eye on it locally
while waiting for the retry ETA to come round, then gives the task to a
worker to execute. However, that worker might already have up to ten
other tasks that it has grabbed from SQS. Since the visibility timeout
is 310 seconds and the retry ETA can be up to 300 seconds away, the
worker only has 10 seconds to get through all those tasks and start
working on the delayed task before SQS moves it back to available.
(Note that the delay of 300 seconds is translated into a timestamp based
on the time you called `self.retry` and put the task back on the queue,
whereas the visibility timeout starts ticking from the time a Celery
worker picked up the task.)
### The fix
#### Set the max retry delay for broadcast tasks to 240 seconds
Setting the max delay to 240 seconds means that instead of a 10 second
buffer before the visibility timeout is tripped, we've got a 70 second
buffer (310 - 240).
#### Set the prefetch limit to 1 for broadcast workers
This means that each worker will have at most 1 currently executing task
and 1 task pending execution. If it has these, it won't grab any more
off the queue, so the rest can sit there without their visibility
timeouts starting.
Setting the prefetch limit to 1 will result in more queries to SQS and
lower throughput. This might matter for, e.g., sending emails, but the
broadcast worker is not hyper time-critical.
https://docs.celeryproject.org/en/3.1/getting-started/brokers/sqs.html?highlight=acknowledge#caveats
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html?highlight=prefetch#reserve-one-task-at-a-time
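
In Celery configuration terms, the prefetch change is roughly this (using
the Celery 3.x-style setting names from the docs linked above; our actual
config module may look different):

```python
# Each worker process reserves only one extra task beyond the one it is
# executing, so the rest stay on SQS and their visibility timeouts don't start.
CELERYD_PREFETCH_MULTIPLIER = 1

# The SQS visibility timeout (in seconds) that the rest of this note refers to.
BROKER_TRANSPORT_OPTIONS = {'visibility_timeout': 310}
```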
Otherwise we run into issues where we don't issue the cancel, because we
say "oh look, the expiry time just passed, so we shouldn't send this
message as it's already been removed from the CBC".
Previously, if we were deciding whether to retry or not, it meant that
future events wouldn't have context of what the task is doing. We'd
run into issues with not knowing which references to include when
updating/cancelling in future events.
Instead of deciding whether to retry or not, always retry. Then, whenever
any event sends, regardless of whether it is a first attempt or a retry,
check the status of previous events for that broadcast message. There
are a few things that will mean we don't send (a rough sketch follows
the list):
* If the finishes_at time has already elapsed (i.e. we have been trying
to resend this message and haven't had any luck, and now the data is
obsolete)
* A previous event has no provider message (this means that we never
picked the previous event off the queue for some reason)
* A previous event has a provider message that has anything other than
an ack response. This includes sending (the old message is currently
being sent), and technical-failure/returned-error (the old message is
currently in the retry loop, having experienced issues).
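
A rough sketch of those checks (the model and field names here are
illustrative, not our real code):

```python
from datetime import datetime, timezone


def should_send_broadcast_event(broadcast_event, previous_events):
    """Run before every send, whether it's a first attempt or a retry."""
    # The data is obsolete once finishes_at has passed.
    if broadcast_event.finishes_at < datetime.now(timezone.utc):
        return False

    for previous_event in previous_events:
        # We never picked the previous event off the queue for some reason.
        if previous_event.provider_message is None:
            return False

        # The previous event is still sending, or is stuck in its own retry
        # loop (technical-failure / returned-error).
        if previous_event.provider_message.status != 'ack':
            return False

    return True
```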
Retry tasks if they fail to send a broadcast event. Note that each task
tries the regular proxy and the failover proxy for that provider. This
runs a bit differently than our other retries:
Retry with exponential backoff. Our other tasks retry with a fixed delay
of 5 minutes between tries. If we can't send a broadcast, we want to
start retrying almost immediately, so instead we implement an exponential
backoff (1, 2, 4, 8, ... seconds delay). We can't delay for longer than
310 seconds due to the visibility timeout settings in SQS, so we cap the
delay at that amount.
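
A minimal sketch of that backoff, assuming a bound Celery task (the names
and the proxy call are illustrative):

```python
from celery import shared_task


def send_to_cbc_proxy(broadcast_event_id):
    """Stand-in for the real send, which tries the regular proxy then the failover proxy."""
    raise NotImplementedError


@shared_task(bind=True, max_retries=None)  # we decide when to give up ourselves
def send_broadcast_event(self, broadcast_event_id):
    try:
        send_to_cbc_proxy(broadcast_event_id)
    except Exception as exc:
        # 1, 2, 4, 8, ... seconds, capped below the SQS visibility timeout.
        delay = min(2 ** self.request.retries, 310)
        raise self.retry(countdown=delay, exc=exc)
```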
Normally we give up retrying after a set number of retries (often 4
hours). As broadcast content is much more important than normal
notifications, we don't ever want to give up on sending them to phones...
...UNLESS WE DO!
Sometimes we do want to give up sending a broadcast though! Broadcasts
have an expiry time, after which they stop showing up on people's
devices, so if that has passed then we don't need to send the broadcast
out.
Broadcast events can also be superseded by updates or cancels. Check
that the event is the most recent event for that broadcast message; if
not, give up, as we don't want to accidentally send out two conflicting
events for the same message.
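
Roughly, the give-up check covers those two cases (again, names here are
illustrative rather than our real models):

```python
from datetime import datetime, timezone


def should_give_up(broadcast_event, all_events_for_message):
    """Illustrative only: the two cases where we stop retrying."""
    # The broadcast has expired, so it no longer shows on people's devices.
    if broadcast_event.finishes_at < datetime.now(timezone.utc):
        return True

    # A newer update or cancel event supersedes this one; sending it now
    # could put two conflicting events out for the same message.
    newest_event = max(all_events_for_message, key=lambda event: event.sent_at)
    if newest_event.id != broadcast_event.id:
        return True

    return False
```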
We already trigger a Zendesk ticket for these two cases, meaning that
whenever we get this situation, we get 3 emails: one for the Zendesk
ticket, one from Logit raising the fact that an exception was raised,
and one from CloudWatch doing the same.
We don't need all these emails; a Zendesk ticket is sufficient.
Downgrading to a warning means this event will still be findable in our
logs, however.
This falls back to the "test" channel if they do not have a
ServiceBroadcastSetting for the moment, but we intend in future PRs to
enforce that all broadcast services will have this property.
This moves the hardcoding to test channels one step up to where we call
`create_and_send_broadcast`
After this, we can start to decide whether to give it the 'test' or
'severe' channel based on the service's channel setting.
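
Roughly, the fallback looks like this (the attribute names are a guess,
not our real models):

```python
def channel_for_service(service):
    # Fall back to the 'test' channel until every broadcast service is
    # guaranteed to have a ServiceBroadcastSetting row.
    if service.service_broadcast_settings is not None:
        return service.service_broadcast_settings.channel  # e.g. 'test' or 'severe'
    return 'test'
```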
This will allow us to store details of which channel a service should be
sending to.
See the comment about how all broadcast services can have a row in the
table but may not at the moment. This has been done for speed as it's
the quickest way to let us set up different services to send to
different channels for some needed testing with the mobile handset
providers in the coming week.
This used to be hardcoded in the CBC proxy but now we will hardcode it
in the cbc_proxy_client.
In a future PR we can start choosing which channel a broadcast will go
to based on the channel configured for that broadcast service.
We’re going to let people pass in fairly complex polygons, but:
- we don’t want to store massive polygons
- we don’t want to pass the CBCs massive polygons
So this commit adds a step to simplify the polygons before storing them.
We think it’s best for us to do this because:
- writing code to do polygon simplification is non-trivial, and we don’t
want to make all potential integrators do it
- the simplification we've developed is domain-specific to emergency
alerting, so should throw away less information than a general-purpose
approach would
There’s a bit more detail about how we simplify polygons in
https://github.com/alphagov/notifications-admin/pull/3590/files
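
The real simplification lives in the admin PR linked above; as a very
rough illustration of the general idea (using Shapely, which is not
necessarily what that PR does):

```python
from shapely.geometry import Polygon


def simplify_polygon(coordinate_pairs, tolerance=0.001):
    """Illustrative only: cut down the number of points in a polygon before
    storing it or passing it to the CBCs."""
    simplified = Polygon(coordinate_pairs).simplify(tolerance, preserve_topology=True)
    return list(simplified.exterior.coords)
```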
This gives us some extra confidence that there aren't any problems with
the data we're getting from the other service. It doesn't address any
specific problem we've seen; rather, it seems like a sensible precaution
to take.
This commit makes the existing endpoint also accept CAP XML, should the
appropriate `Content-Type` header be set.
It uses the translation code we added in a previous commit to convert
the CAP to a dict. We can then validate that dict against the JSON
schema to ensure it's something we can work with.
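
Sketching the endpoint logic (assuming a Flask-style request, `xmltodict`
for the XML-to-dict step and `jsonschema` for validation; the
`application/cap+xml` value and the function name are assumptions, and
the real translation code is the one added in the earlier commit):

```python
import xmltodict
from flask import abort, request
from jsonschema import ValidationError, validate


def get_validated_broadcast_request(schema):
    """Illustrative: accept CAP XML or JSON on the same endpoint."""
    if request.content_type == 'application/cap+xml':
        # Translate the CAP XML into a dict (stand-in for our real translation code).
        data = xmltodict.parse(request.get_data())
    else:
        data = request.get_json()

    try:
        validate(instance=data, schema=schema)
    except ValidationError:
        abort(400)

    return data
```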
We know there is at least one system which wants to integrate with
Notify to send out emergency alerts, rather than creating them manually.
This commit adds an endpoint to the public API to let them do that.
To start with we’ll just let the system create them in a single call,
meaning they still have to be approved manually. This reduces the risk
of an attacker being able to broadcast an alert via the API, should the
other system be compromised.
We’ve worked with the owners of the other system to define which fields
we should care about initially.
We want to rename the `bt-ee-1-proxy` lambda function to `ee-1-proxy`.
This change will need to be deployed at the same time that we change
the name of the lambda function in the Terraform.
O2 use the One-2-many CBC, so we can use the O2M/CAP client.
Once the differences between CBCs have been worked out, we can consolidate the O2M clients to reduce duplication.
Signed-off-by: Richard Baker <richard.baker@digital.cabinet-office.gov.uk>
So:
'billing_contact_email_address' becomes 'billing_contact_email_addresses'
AND
'billing_contact_name' becomes 'billing_contact_names'
This is to signify that each of those fields can contain numerous
items
This has been added in for speed of development, but now that we are
getting close to integrating with production systems, we will be turning
off these helpful hacks to reduce the risk of someone sending a real
broadcast to citizens.
Note, platforms are still able to approve broadcasts when their service
is in training mode.
When we send an HTTP request to our SMS providers, there is a
chance we get a 5xx status code back from them. Currently we log this as
two different exception-level logs.
If a provider has a funny few minutes, we could end up with
hundreds of exceptions thrown and PagerDuty waking someone up in the
middle of the night. These problems tend to fix themselves pretty
quickly, as we balance traffic from one SMS provider to the other within
5 minutes.
By downgrading both exceptions to warnings in the case of a
`SmsClientResponseException`, we will reduce the chance of waking us up
in the middle of the night for no reason.
If the error is not a `SmsClientResponseException`, then we will still
log at the exception level as before, since this is more unexpected and
we may want to be alerted sooner.
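
In rough terms (the exception class name is ours; everything else here is
illustrative):

```python
from flask import current_app


class SmsClientResponseException(Exception):
    """Stand-in for the real exception raised on 5xx responses from a provider."""


def log_send_sms_error(exception, notification_id):
    if isinstance(exception, SmsClientResponseException):
        # The provider is having a funny few minutes; traffic will be balanced
        # to the other provider within 5 minutes, so don't page anyone.
        current_app.logger.warning(
            'SMS provider error for notification %s: %s', notification_id, exception
        )
    else:
        # Unexpected failure: keep the exception-level log so we get alerted sooner.
        current_app.logger.exception(
            'Unexpected error sending notification %s', notification_id
        )
```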
We do still want to know if, say, both SMS providers went down at the
same time for an hour. We don't want our tasks to just sit there,
retrying every 5 minutes for the whole time, without us being aware (so
that we can at least raise a Statuspage update). Luckily we will still
be alerted, because our smoke tests will fail after 10 minutes and raise
a P1:
https://github.com/alphagov/notifications-functional-tests/blob/master/tests/functional/staging_and_prod/notify_api/test_notify_api_sms.py#L21
These tasks need to repeatedly get the same template and service from
the database. We should be able to improve their performance by getting
the template and service from the cache instead, like we do in the REST
endpoint code.
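
As a very rough sketch of the shape of the change (our real cache is the
Redis-backed one shared with the REST endpoint code; the DAO function and
the use of `lru_cache` below are stand-ins to show the idea):

```python
from functools import lru_cache


def dao_get_template_by_id(template_id):
    """Stand-in for the real database query."""
    raise NotImplementedError


@lru_cache(maxsize=None)
def get_cached_template(template_id):
    # Repeated tasks for the same template reuse the first result instead of
    # querying the database every time, mirroring what the REST endpoints do
    # with their cache.
    return dao_get_template_by_id(template_id)
```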
In the admin app we need something to show in lieu of the template name
when a template isn't used. Let's store this in the reference field
for now.
For all FunctionErrors, and for invoke errors (status > 299), we
want to retry with the failover lambda.
We are doing this because, if there is a connection or other error
with one lambda, the failover lambda may still work and it's
worth trying.
In time we will probably have a more complex retry flow, depending
on the error and maybe even differing for each MNO (broadcast provider).
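
A minimal sketch of that flow with boto3 (function names and error
handling are illustrative, not the real proxy client):

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError

lambda_client = boto3.client('lambda')


def invoke_with_failover(primary_function_name, failover_function_name, payload):
    """Illustrative: try the primary proxy lambda; on any error, try the failover."""
    try:
        response = lambda_client.invoke(
            FunctionName=primary_function_name,
            InvocationType='RequestResponse',
            Payload=payload,
        )
        # 'FunctionError' is set when the lambda itself raised; StatusCode > 299
        # covers other unsuccessful invocations.
        if 'FunctionError' not in response and response['StatusCode'] <= 299:
            return response
    except (BotoCoreError, ClientError):
        pass  # connection or other invoke error: fall through to the failover

    return lambda_client.invoke(
        FunctionName=failover_function_name,
        InvocationType='RequestResponse',
        Payload=payload,
    )
```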