When we send an HTTP request to our SMS providers, there is a
chance we get a 5xx status code back from them. Currently we log this as
two different exception level logs.
If a provider has a funny few minutes, we could end up with
hundreds of exceptions thrown and pagerduty waking someone up in the
middle of the night. These problems tend to pretty quickly fix
themselves as we balance traffic from one SMS to the other SMS provider
within 5 minutes.
By downgrading both exceptions to warning in the case of a
`SmsClientResponseException`, we will reduce the change of waking us up
in the middle of the night for no reason.
If the error is not a `SmsClientResponseException`, then we will still
log at the exception level as before as this is more unexpected and we
may want to be alerted sooner.
What we still want to happen though is that let's say both SMS providers
went down at the same time for 1 hour. We don't want our tasks to just
sit there, retrying every 5 minutes for the whole time without us being
aware (so we can at least raise a statuspage update). Luckily we will
still be alerted because our smoke tests will fail after 10 minutes and
raise a p1:
https://github.com/alphagov/notifications-functional-tests/blob/master/tests/functional/staging_and_prod/notify_api/test_notify_api_sms.py#L21
Also log detailed delivery status for firetext in the same place in addition
to it being logged from notifications_dao.
Logging detailed delivery statuses will help us see why messages
fail to deliver. In the future we could persist detailed delivery
status in the database.
we're using statsd to monitor how long provider requests are taking.
However, there's lots of busy work that happens inside our statsd
metrics timing window. Things like json dumping and loading, building
headers, exception handling, etc.
for firetext/mmg, the response object from requests has an elapsed
property [1], which captures from sending raw data to parsing the
response headers. for ses, it's a bit trickier, but boto3 exposes a few
event hooks [2]. it's hard to find them without stepping through the
code, but the interesting ones are before-call, after-call,
after-call-error, request-created, and response-received. The
before-call and after-call involve some marshalling, built-in retrying,
etc, while request-created and response-received are much lower level.
They might be called more than once per ses request, if boto3 itself
retries the request on 5xx, 429 and low level socket errors [3].
Add these as new `raw-request-time` metrics rather than overwriting to
avoid changing the meaning of an existing metric, and to let us compare
the metrics to see if there's a noticeable difference at all
[1] https://requests.readthedocs.io/en/master/api/#requests.Response.elapsed
[2] https://boto3.amazonaws.com/v1/documentation/api/latest/guide/events.html
[3] https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html#legacy-retry-mode
Similar to MMG, there's a new env variable FIRETEXT_URL that can be
used to override the Firetext api URL.
This will be used to stub out both providers during the load test or
can be used to run a local API against a fake provider endpoint.
> On Python 3.3 or newer, monotonic will be an alias of time.monotonic
> from the standard library. On older versions, it will fall back to an
> equivalent implementation.
– https://pypi.org/project/monotonic/
- If the SMS client sends a status code that we do not recognize raise a ClientException and set the notification status to technical-failure
- Simplified the code in process_client_response, using a simple map.
- This is a CONNECT and a READ timeout.
- Gets wrapped in the standard client exception, with a status code of 504, message Gateway timeout.
- Is quiet noisy in logs to allow us to see it
- Ensures we flick across the provider.
To test change the timeout to 0 and it will timeout.
In order to set a message as temporary-failure, we check if it is in pending status first.
Otherwise a delivery receipt for failure is set to permanent failure.
- new client for statsd, follows conventions used elsewhere for configuration
- client wraps underlying library so we can use a config property to send/not send statsd
Added statsd metrics for:
- count of API successful calls SMS/Email
- count of successful task execution for SMS/Email
- count of errors from Client libraries
- timing of API calls to third party clients
- timing of how long messages live on the SQS queue
Refactor process_firetext_responses
Removed the abstract ClientResponses for firetext and mmg. There is a map for each response to handle the status codes sent by each client.
Since MMG has about 20 different status code, none of which seem to be a pending state (unlike firetext that has 3 status one for pending - network delay).
For MMG status codes, look for 00 as successful, everything else is assumed to be a failure.
- when a provider callback occurs and we update the status of the notification, also update the statistics table
Adds:
- Mapping object to the clients to handle mapping to various states from the response codes, this replaces the map.
- query lookup in the DAO to get the query based on response type / template type
Tests around rest class and dao to check correct updating of stats
Missing:
- multiple client callbacks will keep incrementing the counts of success/failure. This edge case needs to be handle in a future story.
- client updated to raise errors with fire text error codes/messages
New endpoint
- /notifications/sms/firetext
For delivery notifications to be sent to.