Before we had a long back off, now we have more, but shorter backoffs.
- PREVIOUS
When we had an error talking to a provider we retried quickly and if we still got errors we backed off more and more. Maximum attempts was 5, max delay 4hours. This was to allow us time to ship a build if that was required.
- NOW
Backing off 48 times of 5 minutes each. This gives us the same total backoff, but many more tries in that period.
- WHY
Having the long back off meant messages could be delayed 4 hours. This was happening more and more, as PaaS deploys can place things into the "inflight" state in SQS. The inflight state MUST have an expiry time LONGER than the maximum retry back off. This meant that messages would be delayed 4 hours, even when there was no app error.
By doing this we can reduce this delay to 5 minutes. Whilst still giving us time to fix issues.
the first argument to ANY logger.____ function is ALWAYS cast to a
string and used as a format argument for ALL remaining arguments
using %s formatting. even `logger.exception`, which just logs as
normal and then appends the stack trace.
so we shouldn't be passing `e` into logger.exception - just
`logger.exception('something went wrong!')`
also de-duplicated a test
- Thows a NoResultFound sqlalchemy exception
- Which causes a retry. This means we give it a few goes (5, max 5 hours) for the notification to appear.
- Should never happen, only if we get some task overlaps that are unusual that leads to tasks executed in an overlapping nature.
- Driven by the fact we won't know the type in the API call
- hence we need to load notification earlier , so pass it not the id through to the send task to avoid loading it twice.
- Aim to move the code that contacts providers into it's own module.
- Celery tasks now call this module to send to provider
- No exceptions caught in the new module. Celery tasks now use any exception to trigger a retry.
- tests moved about - new test directory for the new class, all tests from celery test module moved, excepting the retry logic.
- As before this is now driven from the notifications history table
- Removed from updates and create
- Signatures changes to removed unused params hits many files
- Also potential issue around rate limiting - we used to get the number sent per day from the stats table - which was a single row lookup, now we have to count this. This applies to EVERY API CALL. Probably not a good thing and should be addressed urgently.
- again these new come from the notifications history table
- We update this when we sent a notification, so removed from celery tasks
- tests removed also
- "RETRY" prefixes the messages
In event of the retry attempts completing without successfully completing the task identify message as such
- "RETRY FAILED" prefixes the messages
Applies to the send_sms|send_email and send_sms_to_provider|send_email_to_provider tasks
These are there to try and ensure we can alert on these events so that we know if we have started retrying messages
Retry messages also contain notification ids to aid debugging.
Removed all existing statsd logging and replaced with:
- statsd decorator. Infers the stat name from the decorated function call. Delegates statsd call to statsd client. Calls incr and timing for each decorated method. This is applied to all tasks and all dao methods that touch the notifications/notification_history tables
- statsd client changed to prefix all stats with "notification.api."
- Relies on https://github.com/alphagov/notifications-utils/pull/61 for request logging. Once integrated we pass the statsd client to the logger, allowing us to statsd all API calls. This passes in the start time and the method to be called (NOT the url) onto the global flask object. We then construct statsd counters and timers in the following way
notifications.api.POST.notifications.send_notification.200
This should allow us to aggregate to the level of
- API or ADMIN
- POST or GET etc
- modules
- methods
- status codes
Finally we count the callbacks received from 3rd parties to mapped status.
dont send reply_to_addresses around from process_job and send_email -
take it from the service in send_email_to_provider. also clean up
the kwarg in aws_ses.send_email to more accurately reflect what we
might pass in
If the notification ends up in the retry queue and the delivery app is restarted the notification will get sent twice.
This is because when the app is restarted another message will be in the retry queue as message available which is a
duplicate of the one in the queue that is a message in flight.
This should complete https://www.pivotaltracker.com/story/show/121844427
10 seconds, 1 minute, 5 minutes, 1 hour and 4 hours.
Total elapsed wait is max 5 hours 6 minutes and 10 seconds.
Changed visibility window of SQS to be 4 hours 10 seconds, longer the max retry period.