Previously there were four queues for sending messages.
This was based on the fact that each notification has two actions - persist in the database and send to the provider.
Two queues supported the CSV upload, covering the first of these tasks:
- bulk-email
- build-sms
There were two more queues for the tasks that make the third-party client calls:
- sms
- email
API calls just used the latter two queues for both tasks.
Added four new queues:
- db-email
- db-sms
- send-sms
- send-email
So an API call puts a notification onto the db-[type] queue first; that task then puts the notification onto the send-[type] queue.
Build queues stay as before.
This will allow us to target these tasks with separate workers and manage them differently.
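A minimal sketch of how the routing could look with old-style Celery settings; the task names here are illustrative, not the real ones:

```python
from kombu import Queue

# Illustrative Celery queue/routing config; task names are assumptions.
CELERY_QUEUES = [
    Queue('db-sms'), Queue('db-email'),      # persist the notification
    Queue('send-sms'), Queue('send-email'),  # make the third-party provider call
]

CELERY_ROUTES = {
    'save-sms': {'queue': 'db-sms'},
    'save-email': {'queue': 'db-email'},
    'deliver-sms': {'queue': 'send-sms'},
    'deliver-email': {'queue': 'send-email'},
}
```

Separate workers can then consume specific queues (e.g. `celery worker -Q db-sms,db-email`), so persistence and provider calls can be scaled independently.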
- runs 1 minute past the hour, every hour
- looks up all scheduled jobs with a scheduled date in the past and adds them to the normal process-job queue
- these are then processed as normal
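A rough sketch of the schedule and task; `dao_get_jobs_scheduled_before` and `process_job` are stand-ins for existing pieces of the app, not necessarily the real names:

```python
from datetime import datetime
from celery import Celery
from celery.schedules import crontab

notify_celery = Celery('notifications')  # stand-in for the app's Celery instance

notify_celery.conf.CELERYBEAT_SCHEDULE = {
    'run-scheduled-jobs': {
        'task': 'run-scheduled-jobs',
        'schedule': crontab(minute=1),  # 1 minute past the hour, every hour
    },
}

@notify_celery.task(name='run-scheduled-jobs')
def run_scheduled_jobs():
    # any job whose scheduled time is now in the past goes onto the normal
    # process-job queue and is handled like an ordinary job from then on
    for job in dao_get_jobs_scheduled_before(datetime.utcnow()):  # assumed dao helper
        process_job.apply_async([str(job.id)], queue='process-job')  # assumed task
```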
This allows us to prefix metrics with the environment, so stats from staging and live can go to the same statsd instance, and lets us filter in the dashboard by environment.
Removed all existing statsd logging and replaced with:
- statsd decorator. Infers the stat name from the decorated function's name and delegates the statsd call to the statsd client, calling incr and timing for each decorated method. This is applied to all tasks and all dao methods that touch the notifications/notification_history tables (see the sketch after this list).
- statsd client changed to prefix all stats with "notification.api."
- Relies on https://github.com/alphagov/notifications-utils/pull/61 for request logging. Once integrated we pass the statsd client to the logger, allowing us to record statsd metrics for all API calls. This passes the start time and the method to be called (NOT the url) onto the global Flask object. We then construct statsd counters and timers in the following way:
notifications.api.POST.notifications.send_notification.200
This should allow us to aggregate at the level of:
- API or ADMIN
- POST or GET etc
- modules
- methods
- status codes
Finally, we count the callbacks received from third parties, mapped to notification status.
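A rough sketch of the decorator idea described above; the namespace argument and the client interface are assumptions, not the exact implementation:

```python
import functools
import time

def statsd(namespace):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stat name is inferred from the decorated function, e.g. "dao.dao_create_notification"
            stat = '{}.{}'.format(namespace, func.__name__)
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                statsd_client.incr(stat)  # assumed app-wide statsd client wrapper
                statsd_client.timing(stat, (time.monotonic() - start) * 1000)
        return wrapper
    return decorator

@statsd(namespace='dao')
def dao_create_notification(notification):
    ...
```

The statsd client then adds the common prefix, so the decorator only has to build the namespace-plus-function part of the stat name.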
Retry delays: 10 seconds, 1 minute, 5 minutes, 1 hour and 4 hours.
Total elapsed wait is at most 5 hours, 6 minutes and 10 seconds.
Changed the SQS visibility window to 4 hours 10 seconds, longer than the maximum retry interval.
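A small sketch of what that schedule implies; the intervals are from the text, how they are wired into the Celery retries is assumed:

```python
RETRY_DELAYS = [10, 60, 5 * 60, 60 * 60, 4 * 60 * 60]  # seconds between attempts

# 10s + 1m + 5m + 1h + 4h = 5h 6m 10s maximum elapsed wait
assert sum(RETRY_DELAYS) == 5 * 3600 + 6 * 60 + 10

# SQS visibility timeout of 4 hours 10 seconds: just longer than the
# largest single retry interval
SQS_VISIBILITY_TIMEOUT = 4 * 60 * 60 + 10

def next_retry_delay(retry_count):
    # delay for this attempt, or None once the schedule is exhausted
    if retry_count >= len(RETRY_DELAYS):
        return None
    return RETRY_DELAYS[retry_count]
```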
Including sms code, email verification, invitation emails, and password reset emails
Added the service id and template ids to config.
Small change to dao_create_service to use id if set.
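A sketch of the dao change, assuming the id is passed in when creating the Notify service from config (names and signature may differ from the real code):

```python
import uuid

def dao_create_service(service, user, service_id=None):
    # use the id from config when supplied (e.g. for the Notify service itself),
    # otherwise generate one as before
    service.id = service_id if service_id else uuid.uuid4()
    service.users.append(user)
    db.session.add(service)  # db is the app's SQLAlchemy instance
    db.session.commit()
```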
- if a service is in research mode then don't send the notifications via the providers (MMG/SES/etc)
- instead set up a task to mimic those providers' callbacks - this completes the loop, and shows stats, delivery receipts and so on
- use the "to" field to choose the response, which allows users to create successful and errored notifications:
- SMS temporary failure: "07833333333"
- SMS permanent failure: "07822222222"
- SMS success: "07811111111" (or anything else)
- email success: "delivered@simulator.notify"
- email permanent failure: "perm-fail@simulator.notify"
- email temporary failure: "temp-fail@simulator.notify"
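A sketch of how the simulated callback could map the "to" field to a status; the numbers and addresses are the ones listed above, while the status strings and function name are assumptions:

```python
TEMP_FAIL_SMS = '07833333333'
PERM_FAIL_SMS = '07822222222'
TEMP_FAIL_EMAIL = 'temp-fail@simulator.notify'
PERM_FAIL_EMAIL = 'perm-fail@simulator.notify'

def research_mode_status(to):
    # decide what the mimicked provider callback should report
    if to in (TEMP_FAIL_SMS, TEMP_FAIL_EMAIL):
        return 'temporary-failure'
    if to in (PERM_FAIL_SMS, PERM_FAIL_EMAIL):
        return 'permanent-failure'
    return 'delivered'  # "07811111111", "delivered@simulator.notify" or anything else
```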
- new client for statsd, following the conventions used elsewhere for configuration
- client wraps the underlying library so we can use a config property to turn sending statsd on or off (see the sketch below)
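A minimal sketch of that wrapper, assuming the standard `statsd` library and illustrative config key names:

```python
from statsd import StatsClient

class NotifyStatsdClient:
    def init_app(self, app):
        self.active = app.config.get('STATSD_ENABLED', False)
        self.client = StatsClient(
            app.config.get('STATSD_HOST', 'localhost'),
            app.config.get('STATSD_PORT', 8125),
            prefix=app.config.get('STATSD_PREFIX'),  # e.g. includes the environment
        )

    def incr(self, stat, count=1, rate=1):
        if self.active:  # config property decides whether anything is sent
            self.client.incr(stat, count, rate)

    def timing(self, stat, delta, rate=1):
        if self.active:
            self.client.timing(stat, delta, rate)
```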
Added statsd metrics for:
- count of successful API calls for SMS/Email
- count of successful task executions for SMS/Email
- count of errors from client libraries
- timing of API calls to third-party clients
- timing of how long messages spend on the SQS queue
This means we will retain notifications for a full week, and will no
longer delete records simply because they are 7 x 24 hours older than
the time the deletion task runs.
Also the task only needs to run once a day now, so I have changed
the celery config for the deletion tasks.
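A sketch of the cutoff this implies; the exact query is assumed, but the idea is to delete on whole days rather than on a rolling 7 x 24 hour window:

```python
from datetime import datetime, timedelta

def delete_notifications_cutoff(now=None):
    # everything created before midnight seven whole days ago is deleted,
    # so a notification is always kept for at least a full week regardless
    # of what time the once-a-day task runs
    now = now or datetime.utcnow()
    midnight_today = datetime.combine(now.date(), datetime.min.time())
    return midnight_today - timedelta(days=7)
```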
Removed the validation from the schema - it was adding complexity; let the unique constraint on the db throw the exception. This should only ever happen on a race condition, which seems unlikely (two people changing a service to the same name at the same time).
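A sketch of what handling that race then looks like at the endpoint, with assumed helper and error-shape names:

```python
from flask import jsonify
from sqlalchemy.exc import IntegrityError

def update_service_name(service, new_name):
    service.name = new_name
    try:
        dao_update_service(service)  # assumed dao helper; the commit happens here
    except IntegrityError:
        db.session.rollback()
        # only the unique constraint on the service name is expected to trip here
        return jsonify(result='error', message={'name': ['Duplicate service name']}), 400
    return jsonify(data=service.serialize()), 200
```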
Do not set debug=True in the test config. If debug=True it changes the behaviour of the error handlers, raising the exception rather than returning a 500.
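For illustration only, with generic class and setting names:

```python
class Config(object):
    DEBUG = False

class TestConfig(Config):
    # deliberately no DEBUG = True here: with debug on, the error handlers
    # behave differently and the exception is raised instead of a 500 being returned
    pass
```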