This is happening on the AWS side now as part of
alphagov/notifications-broadcasts-infra#267 - but we still want to keep
the zendesk ticket as it contains useful context _and_ provides
visibility to the team.
Broadcasts created via the API [1] and the Admin app [2] should
both now have this field set. It's also more informative to show
this, and broadcasts created via the API don't have IDs anyway.
There's a small risk that an old broadcast that gets approved won't
have this data, but it's for information only and we intend to
backfill all old broadcasts in the near future.
[1]: 023a06d5fb
[2]: 7dbe3afa19
This is necessary until:
- The Admin app is using the new "areas(_2)" format to store and
retrieve data.
- We've migrated all existing broadcast messages to use the new
format.
Note that "areas" / "ids" isn't actually used for anything except
printing out the PagerDuty message - it's not sent to the proxy [1].
[1]: 6edc6c70aa/app/celery/broadcast_message_tasks.py (L190-L193)
This modifies the previous "(_)send_link_test" method to trigger a
link test for a specific lambda. We then call the method with both
the primary and failover lambdas in a new orchestrator method.
Since the _invoke_lambda function doesn't raise exceptions if it
fails, there's no need to rescue anything in order to ensure the
second link test / invocation runs as well. I haven't added a test
for this, since it boils down to an absence of any code that could
raise an exception.
Note that, like the other parent tests, we only check the new method
works with a specific proxy client instance.
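A minimal sketch of the shape this ends up in, with illustrative
lambda names, and assuming `_invoke_lambda` signals failure via its
return value rather than raising:
```python
class CBCProxyEE:
    # Illustrative names; the real client is configured with the provider's
    # primary and failover lambda function names.
    lambda_name = 'ee-az1-proxy'
    failover_lambda_name = 'ee-az2-proxy'

    def _invoke_lambda(self, lambda_name, payload):
        # The real method calls the lambda via boto3 and returns True/False;
        # it never raises on failure.
        return True

    def _send_link_test(self, lambda_name):
        self._invoke_lambda(lambda_name, payload={'message_type': 'test'})

    def send_link_test(self):
        # No rescue needed: because _invoke_lambda signals failure via its
        # return value, the failover invocation below always runs.
        self._send_link_test(self.lambda_name)
        self._send_link_test(self.failover_lambda_name)
```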
Unlike the other IDs which are stored in the DB, this isn't relevant
for the Celery task as it invokes a link test. Moving it into the
proxy client will also enable us to generate a second ID in the next
commits, where we start doing a link test for the failover lambda.
Previously the Celery task to trigger a link test had to know about
the special case of a sequence number for Vodafone. Since we're about
to change the client to perform multiple tests, it makes sense to give
it the knowledge of how to generate the number itself.
Note that we have to import the db inline to avoid a circular import,
since this module is itself imported by app/__init__.py.
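A minimal sketch of the inline import, with a hypothetical sequence
name and proxy class (not the real schema):
```python
from sqlalchemy import text


class CBCProxyVodafone:
    def _get_sequence_number(self):
        # Imported inside the method to avoid the circular import: this
        # module is itself imported by app/__init__.py, which defines `db`.
        from app import db

        # Hypothetical sequence name, for illustration only.
        return db.session.execute(
            text("SELECT nextval('broadcast_provider_message_number_seq')")
        ).scalar()
```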
Other invocations of the Vodafone client use stored sequence numbers
from the DB, which are called "message numbers" in that context. Since
the two use cases are very different (even the names are different!),
having them in two places shouldn't cause any confusion.
Many of the team members do not look at emails from Zendesk, so this
adds a current_app.logger.error message for things we care about, to
give developers a better chance of seeing them.
I have purposely not added an error log for
`check_for_services_with_high_failure_rates_or_sending_to_tv_numbers`
because it's not something we need to look at immediately.
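For the checks we do care about, the pattern is roughly as below;
`send_ticket_to_zendesk` here is a placeholder, not the real Zendesk
client API:
```python
from flask import current_app


def send_ticket_to_zendesk(message):
    # Placeholder for the real Zendesk client call.
    pass


def alert_team(message):
    # Log an error as well as raising the ticket, so developers who don't
    # read Zendesk emails still have a chance of spotting it in the logs.
    current_app.logger.error(message)
    send_ticket_to_zendesk(message)
```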
This ensures that the log messages contain both the broadcast_event id
and the broadcast_provider_message id. It also removes the
broadcast_event reference, since this isn't particularly useful in
helping to find an event.
It wasn't clear what the ID in the message was. It's not possible to add
more details to the message - we don't create a broadcast message or
event for a link test.
While both of these are integrity errors (since we should never
reach this point in the code + data), this just means the original
method comment is still relevant to what immediately follows it.
This mirrors the check we do for jobs, which are also a high-impact
task [1]. While this shouldn't be possible, just like other checks
we're adding it here to be doubly certain.
[1]: 3d71815956/app/celery/tasks.py (L74)
We only actually use this when the data we're working with is in an
unexpected state, which is unrelated to the CBC Proxy. Using this
name also means we can re-use this exception in the next commits.
Note that we may still care if a broadcast message has expired, since
it's not expected that someone would send one in this condition.
This change will make our development environments closer to production
even if they aren't hooked up to the CBC proxy lambda functions.
Now in development, we will create the broadcast event and create tasks
for each broadcast provider event. We will still not create actual
broadcast provider message rows in the DB, or talk to the CBC proxies.
This should be helpful in development to catch any issues we introduce
to do with sending broadcast messaging. In time we may wish to have some
fake CBC proxies in the AWS tools account that we can interact with to
make it even more realistic.
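Roughly, the branch now only skips the proxy call itself rather than
skipping everything. A sketch, assuming a `CBC_PROXY_ENABLED` style
config flag; the task name is illustrative:
```python
from flask import current_app


def send_broadcast_provider_message(broadcast_event_id, provider):
    # In development we still create the broadcast event and a task per
    # provider, but stop short of writing a broadcast_provider_message row
    # or invoking the CBC proxy lambdas.
    if not current_app.config.get('CBC_PROXY_ENABLED'):
        current_app.logger.info(
            f'CBC proxy disabled, not sending event {broadcast_event_id} to {provider}'
        )
        return

    # ...create the broadcast_provider_message row and call the proxy here...
```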
config['NOTIFY_ENVIRONMENT'] is hardcoded to `'live'` in the Live config
class. These are the values as seen on the environment we send real
messages from:
```
>>> json.loads(os.environ['VCAP_APPLICATION'])['space_name'] # what cloudfoundry sets
'production'
>>> os.environ['NOTIFY_ENVIRONMENT'] # we set this from cloudfoundry
'production'
>>> current_app.config['NOTIFY_ENVIRONMENT'] # hardcoded in the Live config
'live'
>>> current_app.config['NOTIFICATION_QUEUE_PREFIX'] # pulled from env var of same name
'live'
>>> current_app.config['ENV'] # this is an unrelated flask variable
'production'
```
It's important to keep tabs on when these things leave our system.
Sending a zendesk ticket that triggers a P1 is probably our simplest way
of notifying the team when this happens (it's what we do with out of
hours emergencies on the admin app too). We don't have any direct
pagerduty integrations from the api app, but we already have the zendesk
client hooked up.
After broadcasts go live, we may want to change this to a P2 (but even
then, I think there are arguments for keeping it P1 to start with).
Don't cause a P1 if it goes out on staging as that might be MNOs testing.
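A rough sketch of that guard, assuming we key off the hardcoded
NOTIFY_ENVIRONMENT value shown above; the helper names are
placeholders, not the real Zendesk client:
```python
from flask import current_app


def create_p1_zendesk_ticket(subject, message):
    # Placeholder for the real Zendesk client call that pages via PagerDuty.
    pass


def alert_broadcast_sent(broadcast_event_id):
    # Only page from the real production environment; a broadcast going out
    # on staging may just be the MNOs testing.
    if current_app.config['NOTIFY_ENVIRONMENT'] == 'live':
        create_p1_zendesk_ticket(
            subject='Live broadcast sent',
            message=f'Broadcast event {broadcast_event_id} has left our system',
        )
```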
Previously we would retry if the task was queued up for retry but the
status was in "received-ack" or "received-err". We don't expect that a
task will be retried after getting this status, but if there are
duplicate tasks that could happen. Let's plan for the worst by saying
"only process a retry if the task is currently in sending".
This way, if a duplicate task is on retry and the first task goes
through successfully, the duplicate task will give up.
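A minimal sketch of the guard; the class and status values are
assumptions based on the statuses mentioned above:
```python
class BroadcastProviderMessageStatus:
    # Illustrative constants for the statuses mentioned above.
    SENDING = 'sending'
    ACK = 'received-ack'
    ERR = 'received-err'


def should_process_retry(provider_message):
    # Only process a retry if the message is still in "sending". If a
    # duplicate task has already moved it to received-ack or received-err,
    # the retrying task gives up rather than sending again.
    return provider_message.status == BroadcastProviderMessageStatus.SENDING
```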
### The facts
* Celery grabs up to 10 tasks from an SQS queue by default
* Each broadcast task takes a couple of seconds to execute, or double
that if it has to go to the failover proxy
* Broadcast tasks delay retry exponentially, up to 300 seconds.
* Tasks are acknowledged when Celery starts executing them.
* If a task is not acknowledged before its visibility timeout of 310
seconds, SQS assumes the Celery app has died, and puts it back on the
queue.
### The situation
A task stuck in a retry loop was reaching its visibility timeout, and as
such SQS was duplicating it. We're unsure of the exact cause of reaching
its visibility timeout, but there were two contributing factors: the
Celery prefetch and the delay of 300 seconds. Essentially, Celery grabs
the task, keeps an eye on it locally while waiting for the delay ETA to
come round, then gives the task to a worker to do. However, that worker
might already have up to ten tasks that it's grabbed from SQS. This
means the worker only has 10 seconds to get through all those tasks and
start working on the delayed task, before SQS moves the task back into
available.
(Note that the delay of 300 seconds is translated into a timestamp based
on the time you called self.retry and put the task back on the queue.
Whereas the visibility timeout starts ticking from the time that a
celery worker picked up the task.)
### The fix
#### Set the max retry delay for broadcast tasks to 240 seconds
Setting the max delay to 240 seconds means that instead of a 10 second
buffer before the visibility timeout is tripped, we've got a 70 second
buffer.
#### Set the prefetch limit to 1 for broadcast workers
This means that each worker will have up to 1 currently executing task,
and 1 task pending execution. If it has these, it won't grab any more
off the queue, so they can sit there without their visibility timeout
ticking up.
Setting a prefetch limit to 1 will result in more queries to SQS and a
lower throughput. This might be relevant in, eg, sending emails. But the
broadcast worker is not hyper-time critical.
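In config terms, the two changes look roughly like this (the setting
name follows the Celery 3.x docs linked below; treat it as a sketch
rather than the exact diff):
```python
# Broadcast worker: only reserve one task beyond the one being executed, so
# queued tasks stay in SQS and their visibility timeout doesn't start ticking.
CELERYD_PREFETCH_MULTIPLIER = 1


def broadcast_retry_countdown(retry_count):
    # Exponential backoff capped at 240 seconds, leaving ~70 seconds of
    # headroom before the 310 second SQS visibility timeout is tripped.
    return min(240, 2 ** retry_count)
```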
https://docs.celeryproject.org/en/3.1/getting-started/brokers/sqs.html?highlight=acknowledge#caveats
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html?highlight=prefetch#reserve-one-task-at-a-time
Previously, if we were deciding whether to retry or not, it meant that
future events wouldn't have context of what the task is doing. We'd
run into issues with not knowing what references to include when
updating/cancelling in future events.
Instead of deciding whether to retry or not, always retry. Instead, when
any event sends, regardless of whether it is a first time or a retry,
check the status of previous events for that broadcast message. There
are a few things that will mean we don't send (see the sketch after
this list):
* If the finishes_at time has already elapsed (ie: we have been trying
to resend this message and haven't had any luck and now the data is
obsolete)
* A previous event has no provider message (this means that we never
picked the previous event off the queue for some reason)
* A previous event has a provider message that has anything other than
an ack response. This includes sending (the old message is currently
being sent), and technical-failure/returned-error (the old message is
currently in the retry loop, having experienced issues).
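A rough sketch of those checks, with assumed attribute names
(`finishes_at`, `provider_message`, `status`):
```python
from datetime import datetime, timezone


def ok_to_send(event, previous_events):
    # The content is obsolete once finishes_at has passed.
    if event.finishes_at < datetime.now(timezone.utc):
        return False

    for previous in previous_events:
        # No provider message means the earlier event was never picked off
        # the queue, so we don't know what the provider has seen.
        if previous.provider_message is None:
            return False

        # Anything other than an ack (still sending, technical-failure,
        # returned-error) means the earlier event is still unresolved.
        if previous.provider_message.status != 'received-ack':
            return False

    return True
```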
Retry tasks if they fail to send a broadcast event. Note that each task
tries the regular proxy and the failover proxy for that provider. This
runs a bit differently than our other retries:
Retry with exponential backoff. Our other tasks retry with a fixed delay
of 5 minutes between tries. If we can't send a broadcast, we want to try
immediately. So instead, implement an exponential backoff (1, 2, 4, 8,
... seconds delay). We can't delay for longer than 310 seconds due to
visibility timeout settings in SQS, so cap the delay at that amount.
Normally we give up retrying after a set amount of retries (often 4
hours). As broadcast content is much more important than normal
notifications, we don't ever want to give up on sending them to phones...
...UNLESS WE DO!
Sometimes we do want to give up sending a broadcast though! Broadcasts
have an expiry time, when they stop showing up on people's devices, so if
that has passed then we don't need to send the broadcast out.
Broadcast events can also be superseded by updates or cancels. Check
that the event is the most recent event for that broadcast message; if
not, give up, as we don't want to accidentally send out two conflicting
events for the same message.
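Putting that together, a hedged sketch of the retry policy (the task,
helper, and attribute names are illustrative, not the real task):
```python
from datetime import datetime, timezone

from celery import shared_task
from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)


def send_to_proxies(broadcast_event):
    # Placeholder for the call that tries the primary, then failover, lambda.
    raise NotImplementedError


@shared_task(bind=True, max_retries=None)  # retry forever, not just N times
def send_broadcast_event_sketch(self, broadcast_event):
    # Give up if the broadcast has already expired...
    if broadcast_event.finishes_at < datetime.now(timezone.utc):
        logger.info('Broadcast expired, not sending')
        return

    # ...or if a newer update/cancel event has superseded this one.
    if not broadcast_event.is_most_recent_event:
        logger.info('Event superseded by an update/cancel, not sending')
        return

    try:
        send_to_proxies(broadcast_event)
    except Exception as exc:
        # Exponential backoff (1, 2, 4, 8, ... seconds), capped below the
        # 310 second SQS visibility timeout.
        self.retry(countdown=min(300, 2 ** self.request.retries), exc=exc)
```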
For the moment, this falls back to the "test" channel if they do not
have a ServiceBroadcastSetting, but we intend in future PRs to enforce
that all broadcast services will have this property.
This moves the hardcoding of the test channel one step up, to where we
call `create_and_send_broadcast`.
After this we can start to vary whether we give it the 'test' or
'severe' channel based on the service's channel setting.
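A minimal sketch of the channel lookup this is heading towards (the
attribute names are assumptions):
```python
def get_broadcast_channel(service):
    # Use the channel from the service's broadcast settings if it has one;
    # fall back to 'test' until every broadcast service has the setting.
    if service.service_broadcast_settings:
        return service.service_broadcast_settings.channel
    return 'test'
```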
Previously we made some incorrect assumptions about set-up on staging
and prod - they currently don't have any cbc_proxy AWS creds at all.
We shouldn't be attempting canaries or link tests when there's no AWS
infrastructure to connect to.
We also shouldn't bother writing a row into the database at all for the
broadcast_provider_message since we're not even attempting to send, and
we shouldn't get confused between messages that failed and messages we
never wanted to send at all.
This is a pretty big and convoluted refactor, unfortunately.
Previously:
There was one global `cbc_proxy_client` object in apps. This class has
the information about how to invoke the bt-ee lambda, and handles all
calls to lambda. This includes calls to the canary too (which is a
separate lambda).
The future:
There's one global `cbc_proxy_client`. This knows about the different
provider functions and lambdas, and you'll need to ask this client for a
proxy for your chosen provider: call `cbc_proxy_client.get_proxy('ee')`
and it'll return you a proxy that knows what ee's lambda function is,
how to transform any content in a way that is exclusive to ee, and in
future how to parse any response from ee.
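A rough sketch of that shape, with illustrative proxy classes:
```python
class CBCProxyEE:
    lambda_name = 'ee-proxy'  # illustrative


class CBCProxyVodafone:
    lambda_name = 'vodafone-proxy'  # illustrative


class CbcProxyClient:
    # More of a factory than a client: you ask it for the proxy for your
    # chosen provider, and that proxy knows the provider's lambda functions
    # and any provider-specific content handling.
    _proxies = {
        'ee': CBCProxyEE,
        'vodafone': CBCProxyVodafone,
    }

    def get_proxy(self, provider):
        return self._proxies[provider]()


cbc_proxy_client = CbcProxyClient()
proxy = cbc_proxy_client.get_proxy('ee')
```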
The present:
I also cleaned up some duplicate tests.
I'm really not sure about the names of some of these variables - in
particular `cbc_proxy_client` isn't a client - it's more of a Java-style
factory, where you call a function on it to get the client of your
choice.