* Two separate jobs, one for sms & email and another for letter
* Change the celery task for delete to accept a template type filter
* General refactor of tests to make them more readable
- previously this was unbounded, so it got all jobs older than 7 days. In excess of 75,000 🔥
- this meant that the job took (a) a long time, (b) a lot of memory, and (c) was redoing the same work every day
These changes give the job a 2 day eligible window, minimising the number of eligible jobs in a run whilst still retaining some leeway in the event that it fails one night.
In principle the job runs early in the morning on a given day. The previous 7 days are left alone, and then the previous 2 days' worth of files are deleted:
So, if the job runs on the 31st:
- 30, 29, 28, 27, 26, 25, 24 are ignored
- 23, 22: jobs from these days have their files deleted
- 21 and earlier are ignored
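A minimal sketch of the window calculation, assuming the task filters on the job's created date (the function name and exact cutoffs here are illustrative, not the real query):

```python
from datetime import date, timedelta


def eligible_window(run_date):
    # the most recent 7 days are left alone; the 2 days before that are eligible
    newest = run_date - timedelta(days=8)
    oldest = run_date - timedelta(days=9)
    return oldest, newest


print(eligible_window(date(2016, 10, 31)))
# (datetime.date(2016, 10, 22), datetime.date(2016, 10, 23))
```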
when given multiple parameters, the python logging functions assume the first
param is a format string and the rest are arguments to interpolate into it - we
were passing the exception object to `logger.exception`, but the whole point of
.exception is that it attaches the active exception itself, so we didn't need
to pass it in
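A minimal illustration (the message text is made up):

```python
import logging

logging.basicConfig()
logger = logging.getLogger(__name__)

try:
    1 / 0
except ZeroDivisionError as e:
    # redundant: the extra positional arg is treated as a %-format argument,
    # and .exception() already records the active exception and its traceback
    logger.exception("processing failed %s", e)

    # sufficient: the exception info is attached automatically
    logger.exception("processing failed")
```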
this helps manage the transaction by keeping it inside one function in the dao,
so after the function completes you know that the transaction has been released
and concurrent processing can resume
we were running into issues where multiple beat processes queued up the
run_scheduled_jobs task at the same time, and concurrency issues when selecting
scheduled jobs caused both tasks to trigger processing of the same job.
Use with_for_update, which calls through to the postgres SELECT ... FOR UPDATE.
This blocks other SELECT ... FOR UPDATE queries (i.e. other threads running the
same code) until the rows are set to pending and the transaction completes - so
the second thread will not find any rows.
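A sketch of the pattern with SQLAlchemy; the model, column names and statuses are illustrative, not the real schema:

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Job(Base):
    __tablename__ = 'jobs'
    id = Column(Integer, primary_key=True)
    job_status = Column(String, default='scheduled')
    scheduled_for = Column(DateTime)


def dao_set_scheduled_jobs_to_pending(session: Session):
    # SELECT ... FOR UPDATE: a second transaction running the same query blocks
    # here until this one commits, by which time the rows are already 'pending'
    # and no longer match the filter - so it finds nothing
    jobs = (
        session.query(Job)
        .filter(Job.job_status == 'scheduled', Job.scheduled_for < datetime.utcnow())
        .order_by(Job.scheduled_for)
        .with_for_update()
        .all()
    )
    for job in jobs:
        job.job_status = 'pending'
    session.commit()  # releases the row locks
    return jobs
```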
We found that notifications in the created or pending state were not purged from the notifications table.
- New bulk update method (sketched below) to set all notifications with:
  - a status of created|sending|pending to temporary-failure
  - and that are older than today minus SENDING_NOTIFICATIONS_TIMEOUT_PERIOD (in seconds)
- the scheduled task to timeout notifications uses the new bulk update query
- the task will be more efficient
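A sketch of the kind of bulk update described above; the model, column names and the dao function name are assumptions:

```python
from datetime import datetime, timedelta

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Notification(Base):
    __tablename__ = 'notifications'
    id = Column(Integer, primary_key=True)
    status = Column(String)
    created_at = Column(DateTime)


def dao_timeout_notifications(session: Session, timeout_period_in_seconds: int):
    cutoff = datetime.utcnow() - timedelta(seconds=timeout_period_in_seconds)
    # one UPDATE statement rather than loading and saving each row
    updated_count = (
        session.query(Notification)
        .filter(
            Notification.status.in_(['created', 'sending', 'pending']),
            Notification.created_at < cutoff,
        )
        .update({'status': 'temporary-failure'}, synchronize_session=False)
    )
    session.commit()
    return updated_count
```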
- As before this is now driven from the notifications history table
- Removed from updates and create
- Signature changes to remove unused params touch many files
- Also a potential issue around rate limiting - we used to get the number sent per day from the stats table, which was a single row lookup; now we have to count it. This applies to EVERY API CALL. Probably not a good thing and should be addressed urgently.
- runs 1 min past the hour, every hour (see the beat sketch after this list)
- looks up all scheduled jobs that have a scheduled date in the past and adds them to the normal process job queue
- these are then processed as normal
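A sketch of the kind of beat entry involved; the task name and config key are assumptions:

```python
from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    'run-scheduled-jobs': {
        'task': 'run-scheduled-jobs',
        # one minute past every hour
        'schedule': crontab(minute=1),
    },
}
```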
Removed all existing statsd logging and replaced with:
- statsd decorator. Infers the stat name from the decorated function. Delegates the statsd call to the statsd client. Calls incr and timing for each decorated method. This is applied to all tasks and all dao methods that touch the notifications/notification_history tables (a sketch is included at the end of this section)
- statsd client changed to prefix all stats with "notification.api."
- Relies on https://github.com/alphagov/notifications-utils/pull/61 for request logging. Once integrated, we pass the statsd client to the logger, allowing us to record statsd metrics for all API calls. This passes the start time and the method to be called (NOT the url) onto the global flask object. We then construct statsd counters and timers in the following way (see the request-timing sketch after the list below):
notifications.api.POST.notifications.send_notification.200
This should allow us to aggregate to the level of
- API or ADMIN
- POST or GET etc
- modules
- methods
- status codes
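A sketch of how the per-request stat could be assembled from the method, endpoint and status code; the Flask hooks and stub client here are illustrative, not the actual notifications-utils code:

```python
from time import monotonic

from flask import Flask, g, request


class StatsdClientStub:
    # stand-in for the real statsd client, which also applies the stats prefix
    def incr(self, stat):
        print('incr', stat)

    def timing(self, stat, ms):
        print('timing', stat, round(ms, 1))


statsd_client = StatsdClientStub()
app = Flask(__name__)


@app.before_request
def record_start_time():
    g.start = monotonic()


@app.after_request
def record_request_stats(response):
    elapsed_ms = (monotonic() - g.start) * 1000
    # e.g. POST.notifications.send_notification.200, prefixed by the client
    stat = '{}.{}.{}'.format(request.method, request.endpoint, response.status_code)
    statsd_client.incr(stat)
    statsd_client.timing(stat, elapsed_ms)
    return response
```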
Finally, we count the callbacks received from 3rd parties, keyed by the status they map to.
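A minimal sketch of the decorator idea; the namespace argument, stub client and example dao function are assumptions about the shape, not the actual implementation:

```python
import functools
import time


class StatsdClientStub:
    # stand-in for the real statsd client, which also applies the stats prefix
    def incr(self, stat):
        print('incr', stat)

    def timing(self, stat, ms):
        print('timing', stat, round(ms, 1))


statsd_client = StatsdClientStub()


def statsd(namespace):
    # records a counter and a timer named after the decorated function
    def decorator(func):
        stat_name = '{}.{}'.format(namespace, func.__name__)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                statsd_client.incr(stat_name)
                statsd_client.timing(stat_name, (time.monotonic() - start) * 1000)

        return wrapper

    return decorator


@statsd('dao')
def get_notification_by_id(notification_id):
    return notification_id
```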
This happened because more than one beat process was creating the timeout task, resulting in multiple workers running the same queries at the same time.