When we send an HTTP request to our SMS providers, there is a
chance we get a 5xx status code back from them. Currently we log this as
two different exception level logs.
If a provider has a funny few minutes, we could end up with
hundreds of exceptions thrown and pagerduty waking someone up in the
middle of the night. These problems tend to pretty quickly fix
themselves as we balance traffic from one SMS to the other SMS provider
within 5 minutes.
By downgrading both exceptions to warning in the case of a
`SmsClientResponseException`, we will reduce the change of waking us up
in the middle of the night for no reason.
If the error is not a `SmsClientResponseException`, then we will still
log at the exception level as before as this is more unexpected and we
may want to be alerted sooner.
What we still want to happen though is that let's say both SMS providers
went down at the same time for 1 hour. We don't want our tasks to just
sit there, retrying every 5 minutes for the whole time without us being
aware (so we can at least raise a statuspage update). Luckily we will
still be alerted because our smoke tests will fail after 10 minutes and
raise a p1:
https://github.com/alphagov/notifications-functional-tests/blob/master/tests/functional/staging_and_prod/notify_api/test_notify_api_sms.py#L21
There are several reasons why we might get an `InvalidParameterValue`
from the SES API. One, as correctly identified before in
https://github.com/alphagov/notifications-api/pull/713/files
is if we allow an email address on our side that SES rejects.
However, there are other types of errors that could cause an
`InvalidParameterValue`. One example is a `Header too long: 'Subject'`
error that we have seen happen in production. This shouldn't raise an
`InvalidEmailError` as that is not appropriate.
Therefore, we introduce a new exception
`EmailClientNonRetryableException`, that represents any exception back
from an email client that we can use whenever we get a
`InvalidParameterValue` error.
Note, I chose `EmailClientNonRetryableException` rather than
`SESClientNonRetryableException` as our code needs to catch this
exception and it shouldn't be aware of what email client is being used,
it just needs to know that it came from one of the email clients (if in
time we have more than one).
In time, we may wish to extend the approach of having generic
`EmailClient` exceptions and `SMSClient` exceptions as this should be
the most extendable pattern and a good abstraction.
We have hit throttling limits from SES approximately once a week during
a spike of traffic from GOV.UK. The rate limiting usually only lasts a
couple of minutes but generates enough exceptions to cause a p1 but with
no potential action for the responder.
Therefore we downgrade the warning for this case to a warning and assume
traffic will level back out such that the problem resolves itself.
Note, we will still get exceptions if we go over our daily limit, rather
than our per minute sending limit, which does require immediate action
by someone responding.
If we were to continually go over our per second sending rate for a long
continous period of time, then there is a chance we may not be aware but
given the risk of this happening is low I think it's an acceptable risk
for the moment.
If SES raised an `InvalidParameterValue` error (because an email address
was wrong) we were logging an exception and setting the email status to
`technical-failure`. We now set it to `permanent-failure` instead and
change the log level to `info` - setting it to `permanent-failure` means
that people will know not to retry the message.
If the `deliver_sms` catches an exception when trying to send an SMS, we
want the first retry to happen immediately (because we will have
switched providers), then every retry after that to happen at the
standard intervals.
Before we had a long back off, now we have more, but shorter backoffs.
- PREVIOUS
When we had an error talking to a provider we retried quickly and if we still got errors we backed off more and more. Maximum attempts was 5, max delay 4hours. This was to allow us time to ship a build if that was required.
- NOW
Backing off 48 times of 5 minutes each. This gives us the same total backoff, but many more tries in that period.
- WHY
Having the long back off meant messages could be delayed 4 hours. This was happening more and more, as PaaS deploys can place things into the "inflight" state in SQS. The inflight state MUST have an expiry time LONGER than the maximum retry back off. This meant that messages would be delayed 4 hours, even when there was no app error.
By doing this we can reduce this delay to 5 minutes. Whilst still giving us time to fix issues.
the first argument to ANY logger.____ function is ALWAYS cast to a
string and used as a format argument for ALL remaining arguments
using %s formatting. even `logger.exception`, which just logs as
normal and then appends the stack trace.
so we shouldn't be passing `e` into logger.exception - just
`logger.exception('something went wrong!')`
also de-duplicated a test
- Thows a NoResultFound sqlalchemy exception
- Which causes a retry. This means we give it a few goes (5, max 5 hours) for the notification to appear.
- Should never happen, only if we get some task overlaps that are unusual that leads to tasks executed in an overlapping nature.
- Driven by the fact we won't know the type in the API call
- hence we need to load notification earlier , so pass it not the id through to the send task to avoid loading it twice.