notifications-api

mirror of https://github.com/GSA/notifications-api.git synced 2025-12-14 09:12:06 -05:00

Author	SHA1	Message	Date
Leo Hemsted	246016a894	don't log if we don't delete anything for a service. we try and delete for lots of services. this includes services that don't actually have anything to delete that day. that might be because they had a custom data retention so we always go to check them, or because they only sent test notifications (which we'll delete but not include in the count in the log line). we don't really need to see log lines saying that we didn't delete anything for that service - that's just a long list of boring log messages that will hide the actual interesting stuff - which services we did delete content for.	2022-01-21 11:04:37 +00:00
Leo Hemsted	228d72dc8f	update log messages in delete task. less prose, clearer output. (hopefully)	2021-12-14 15:24:35 +00:00
Leo Hemsted	49cc1b643f	split delete task up into per service. we really don't gain anything by running each service delete in sequence - we get the services, and then just loop through them deleting per service. By deleting per service in separate tasks, we can take advantage of parallelism. the only thing we lose is some log lines but I don't think we're that interested in them. only set query limit at the move_notifications dao function - the task doesn't really care about the technical implementation of how it deletes the notifications	2021-12-14 15:24:34 +00:00
Ben Thorner	c8cf057eba	Record providers we time out notifications for This will help us monitor issues with delivery receipts and keep track of provider performance over time. I'm not concerned about performance here: - The number of notifications to time out is usually small. - This task only runs once a day. - Calls to StatsD are quick and cheap.	2021-12-14 13:04:39 +00:00
Ben Thorner	87cd40d00a	Scale timeout task to work on arbitrary volumes Previously this was limited to 500K notifications. While we don't expect to reach this limit, it's not impossible e.g. if we had a repeat of the incident where one of our providers stopped sending us status updates. Although that's not great, it's worse if our code can't cope with the unexpectedly high volume. This reuses the technique we have elsewhere [1] to keep processing in batches until there's nothing left. Specifying a cutoff point means the total amount of work to do can't keep growing. [1]: `2fb432adaf/app/dao/notifications_dao.py (L441)`	2021-12-13 17:14:28 +00:00
Ben Thorner	76aeab24ce	Rewrite DAO timeout method to take cutoff_time Previously we specified the period and calculated the cutoff time in the function. Passing it in means we can run the method multiple times and avoid getting "new" notifications to time out in the time it takes to process each batch.	2021-12-13 16:56:21 +00:00
Ben Thorner	04da017558	DRY-up conditionally creating callback tasks This removes 3 duplicate instances of the same code, which is still tested implicitly via test_process_ses_receipt_tasks [1]. In the next commit we'll make this test more explicit, to reflect that it's now being reused elsewhere and shouldn't change arbitrarily. We do lose the "print" statement from the command instance of the code, but I think that's a very tolerable loss. [1]: `16ec8ccb8a/tests/app/celery/test_process_ses_receipts_tasks.py (L94)`	2021-12-06 14:11:34 +00:00
Ben Thorner	0318229216	Stop 'timing out' old 'created' notifications This is being replaced with a new alert and runbook [1]. It's not always appropriate to change the status to 'technical-failure', and the new alert means we'll act to fix the underlying issue promptly. We'll look at tidying up the remaining code in the next commits. [1]: https://github.com/alphagov/notifications-manuals/wiki/Support-Runbook#deal-with-email-or-sms-still-in-created	2021-12-06 14:00:36 +00:00
Leo Hemsted	f6d210f1e6	put delete tasks on the reporting worker. they share a lot with the reporting tasks (creating ft_billing and ft_notification_status), in that they're run nightly, take a long time, and we see error messages if they get run multiple times (due to visibility timeout). The periodic app has two concurrent processes - previously there was just one delete task, which would use one of those processes, while the other process would pick up anything else on the queue (at that time of night, the regular provider switch checks and scheduled job checks). However, when we switched to running the three delete notification types separately, we saw visibility timeout issues - three tasks would be created, all three would be picked up by one celery instance, the two worker processes would start on two of them, and the third would sit on the box, wait longer than the visibility timeout to be picked up (and acknowledged), and so SQS would assume the task was lost and replay it. it's queues all the way down! By putting them on the reporting worker we can take advantage of tuning that app (for example setting the prefetch multiplier to one) which is designed to run large tasks. We've also got more concurrent workers on this box, so we can run all three tasks at once.	2021-12-03 13:28:16 +00:00
Leo Hemsted	6bbec9f103	make delete notification tasks parallel by notification type. we used to do this until apr 2020. Let's try doing it again. Back then, we had problems with timing. We did two things in spring 2020: We moved to using an intermediary temp table [1]. We stopped the tasks being parallelised [2]. However, it turned out the real time saving was from changing what services we delete for [3]. The task was actually CPU-bound rather than DB-bound, so that's probably why having the tasks in parallel wasn't helping, since they were all competing for the same CPU. It's worth trying the parallel steps again now that we're no longer CPU bound. Note: Temporary tables are in their own postgres schema, and are only viewable by the current session (session == connection. Each celery worker process has its own db connection). We don't need to worry about separate workers both trying to use the same table at once. I've also added an "ON COMMIT DROP" directive to the table definition just to ensure it doesn't persist past the task even if there's an exception. (This also drops on rollback). Cronitor looks at the three functions separately so we don't need to worry about the main task taking milliseconds where it used to take hours as it isn't monitored itself. I've also removed some redundant exception logs. [1] https://github.com/alphagov/notifications-api/pull/2767 [2] https://github.com/alphagov/notifications-api/pull/2798 [3] https://github.com/alphagov/notifications-api/pull/3381	2021-12-01 14:28:08 +00:00
Ben Thorner	cdb43fbaf6	Only loop timeout task if there's more work Previously this would repeat the task even if the current iteration of the loop had processed a non-full batch. This could cause the task to error incorrectly if one or two notifications breach the timeout threshold in between iterations.	2021-11-09 15:41:14 +00:00
Ben Thorner	77c8c0a501	Optimise query to get notifications to "time out" From experimenting in production we found a "!=" caused the engine to use a sequential scan, whereas explicitly listing all the types ensured an index scan was used. We also found that querying for many (over 100K) items leads to the task stalling - no logs, but no evidence of it running either - so we also add a limit to the query. Since the query now only returns a subset of notifications, we need to ensure the subsequent "update" query operates on the same batch. Also, as a temporary measure, we have a loop in the task code to ensure it operates on the total set of notifications to "time out", which we assume is less than 500K for the time being.	2021-11-09 13:50:32 +00:00
Ben Thorner	d1586a8f81	CC DVLA in tickets about outstanding letters Previously we sent them emails about this manually. We also tried a Zendesk macro/trigger approach, but using a CC means: - We can control the behaviour ourselves (Zendesk triggers can only be edited by admins outside our team). - We keep the DVLA notification approach consistent and in one place, so notifications always go to the same people. - Any further (public) updates to the ticket will also trigger a notification to DVLA (previous trigger only notified on creation).	2021-10-29 11:46:29 +01:00
Leo Hemsted	b8c4e19072	tweak zendesk message for no ack files alert. include a link to a runbook entry. also the list of acknowledgement files can be very long, so make that the last thing, and use new lines to space out the message.	2021-10-08 13:45:02 +01:00
Katie Smith	2f66e38fb9	Update how "missing ackfile for letters" Zendesk tickets are created	2021-09-29 11:10:50 +01:00
Katie Smith	64c0a3fb9d	Update how 'letters still sending' Zendesk tickets are created These now use the new Zendesk form.	2021-09-29 11:07:37 +01:00
Ben Thorner	e3e067c795	Remove redundant @statsd timing decorators These are superseded by timing task execution generically in the NotifyTask superclass [1]. Note that we need to wait until we've gathered enough data under the new metrics before removing these. [1]: https://github.com/alphagov/notifications-api/pull/3201#pullrequestreview-633549376	2021-04-12 15:19:18 +01:00
David McDonald	41d95378ea	Remove everything for the performance platform We no longer will send them any stats so therefore don't need the code - the code to work out the nightly stats - the performance platform client - any configuration for the client - any nightly tasks that kick off the sending of the stats We will require a change in cronitor as we no longer will have this task run meaning we need to delete the cronitor check.	2021-03-15 12:04:53 +00:00
David McDonald	8325431462	Move saving of processing time into separate task We currently do this as part of send-daily-performance-platform-stats but now this moves it into its own separate task. This is for two reasons - we will shortly get rid of the send-daily-performance-platform-stats task as we no longer will need to send anything to performance platform - even if we did decide to keep the task send-daily-performance-platform-stats and remove the specific bits that relate to the performance platform, it's probably nicer to rewrite the new task from scratch to make sure it's all clear and easy to understand	2021-03-15 11:44:01 +00:00
Ben Thorner	a91fde2fda	Run auto-correct on app/ and tests/	2021-03-12 11:45:45 +00:00
David McDonald	c3ef23c771	Alert on 2nd class letters still in sending everyday In `8285ef5f89` we turned off alerting on 2nd class letters still being in sending on certain days of the week because we were only sending letters out on Mon, Wed, Fri. Now we have swapped back to sending out 2nd class letters on all workdays so this change can be reverted. Note, I haven't reverted the commit exactly but more so the behaviour, whilst leaving in some tests to explicitly test 2nd class letters for the alert in case we change this again.	2021-01-13 11:21:27 +00:00
David McDonald	085e3a8435	Make deleting of notifications to be sequential Based on this https://github.com/alphagov/notifications-api/pull/2788 where some concerns were raised. This should be a quicker fix to get the deletions to run sequentially for all the notification types. Note, email is first as most important as makes up the larger numbers (we wouldn't want it to start with SMS, fail half way for a reason that only affects SMS and for that to affect the email deletion). We hope that by running sequentially we will reduce conflicts writing to the same index and this will speed up the total time it takes to finish deleting all notification types older than their retention time. There is a risk that whilst quicker per job, as they now run sequentially rather than potentially overlapping, they will take longer overall. We will need to monitor to see.	2020-04-07 17:03:17 +01:00
Katie Smith	62b11bc61e	Delete delete_dvla_response_files_older_than_seven_days task This was not being used.	2020-04-02 14:49:47 +01:00
David McDonald	98d8388805	Add runbook link to ticket	2020-04-02 09:42:46 +01:00
David McDonald	b28fffc330	Refactor `raise_alert_if_letter_notifications_still_sending` Move finding of letter logic into a separate method so it can be unit tested rather than having to test it by checking the contents of a zendesk ticket api call. This will enable us to change the zendesk api ticket call message without needing to edit lots of tests.	2020-04-02 09:36:56 +01:00
David McDonald	a14d5f0225	Remove task that no longer runs We no longer put files in these s3 buckets (and have in fact deleted the buckets) therefore this task is redundant and can be removed.	2020-02-06 10:57:43 +00:00
Leo Hemsted	8285ef5f89	only check for dvla response files on mon/weds/fri. dvla don't process 2nd class files on tues and thurs	2019-10-08 18:16:45 +01:00
Leo Hemsted	5045590d75	allow you to pass in date to send perf stats. make it easier to replay sending data for a day if it failed the first time round	2019-06-11 13:57:17 +01:00
Rebecca Law	4ce2b9eaba	The rstrip was not working for all file names so this changes it to a replace.	2019-04-08 12:04:14 +01:00
Rebecca Law	dc8159104e	Update letter_raise_alert_if_no_ack_file_for_zip for new DVLA file format When we send a zip file of letters to DVLA we expect them to send back an acknowledgement of those files. Previously they named the files like NOTIFY.20180202091254.ACK.TXT and the contents would contain the name of the zip file we sent with a date of when they got it. They have updated this format to mirror the format of the zip file because there was an instance where they sent 2 files of the same name so the later overwrote the first. Since the name matches our name, there is no need to get the file from S3 but just compare file names.	2019-04-03 11:03:42 +01:00
Leo Hemsted	3739d9055d	clean up usage of dates/datetimes in performance platform tasks * call variables unambiguous things like `start_time` or `bst_date` to reduce risk of passing in the wrong thing * simplify the count_dict object - remove nested dict and start_date fields as superfluous * use static datetime objects in tests rather than calculating them each time	2019-04-02 11:49:20 +01:00
Rebecca Law	1456aa7789	Fix for performance platform updates. Changed the query to get the performance platform stats from ft_notification_status. But the date used for the query needed to be a date, not datetime so the equality worked.	2019-04-01 12:03:57 +01:00
Leo Hemsted	38f0ea6cca	remove functions to not talk about 7 days. remind us that data retention is flexible	2019-02-26 17:57:35 +00:00
Leo Hemsted	f5198bf71d	remove unnecessary job_types arg from remove_csv_files celery tasks	2019-01-22 10:31:37 +00:00
Leo Hemsted	754c65a6a2	create cronitor decorator that alerts if tasks fail. make a decorator that pings cronitor before and after each task run. Designed for use with nightly tasks, so we have visibility if they fail. We have a bunch of cronitor monitors set up - 5 character keys that go into a URL that we then make a GET to with a self-explanatory url path (run/fail/complete). the cronitor URLs are defined in the credentials repo as a dictionary of celery task names to URL slugs. If the name passed in to the decorator isn't in that dict, it won't run. to use it, all you need to do is call `@cronitor(my_task_name)` instead of `@notify_celery.task`, and make sure that the task name and the matching slug are included in the credentials repo (or locally, json dumped and stored in the CRONITOR_KEYS environment variable)	2019-01-18 15:36:53 +00:00
Leo Hemsted	d3d56a3224	separate nightly tasks and other scheduled tasks. other tasks is anything that is run on a different frequency than nightly	2019-01-18 15:36:53 +00:00
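
The batching approach described across commits 77c8c0a501, cdb43fbaf6, 76aeab24ce and 87cd40d00a (limit each query, fix the cutoff time up front, and only loop again after a full batch) can be sketched roughly as below. This is an illustration only, not the repository's code: an in-memory list stands in for the notifications table, and `timeout_batch`, `timeout_all`, `BATCH_SIZE` and the `"timed-out"` status label are all hypothetical names.

```python
from datetime import datetime, timedelta

BATCH_SIZE = 3  # hypothetical; a real query limit would be far larger


def timeout_batch(notifications, cutoff_time, limit):
    """Time out at most `limit` 'sending' rows created before cutoff_time.

    Returns the number of rows updated. The select and the update work on
    the same batch, mirroring the point made in 77c8c0a501.
    """
    batch = [
        n for n in notifications
        if n["status"] == "sending" and n["created_at"] < cutoff_time
    ][:limit]
    for n in batch:
        n["status"] = "timed-out"  # hypothetical status label
    return len(batch)


def timeout_all(notifications, cutoff_time, limit=BATCH_SIZE):
    """Process batches until a non-full batch shows there is no more work.

    cutoff_time is computed once by the caller, so rows that become old
    while earlier batches are processed are not picked up by this run.
    """
    total = 0
    while True:
        updated = timeout_batch(notifications, cutoff_time, limit)
        total += updated
        if updated < limit:  # non-full batch: nothing left to do
            break
    return total
```

Because the cutoff is fixed before the loop starts, the total amount of work is bounded, which is what lets the loop run without an upper limit on iterations.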

36 Commits
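
For readers unfamiliar with the pattern in commit 754c65a6a2, a decorator that reports run/fail/complete around a task might look something like the sketch below. It is not the repository's implementation: the real decorator is described as taking only the task name, whereas here the key dictionary (`keys`) and the sender (`ping`) are passed in as hypothetical parameters so the example runs without network access. It also assumes that an unknown task name means the task runs unmonitored, which is one reading of the commit message.

```python
from functools import wraps


def cronitor(task_name, keys, ping):
    """Wrap a task so it reports 'run', then 'complete' or 'fail'.

    keys: dict mapping task names to monitor slugs (hypothetical shape).
    ping: callable(slug, command) that would perform the GET request.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            slug = keys.get(task_name)
            if slug is None:
                # Assumption: tasks without a configured slug run unmonitored.
                return func(*args, **kwargs)
            ping(slug, "run")
            try:
                result = func(*args, **kwargs)
            except Exception:
                ping(slug, "fail")
                raise  # the failure still propagates to the caller
            ping(slug, "complete")
            return result
        return wrapper
    return decorator
```

Injecting `ping` keeps the wrapper easy to test; a production version would presumably build the monitor URL from the slug and the command.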