In response to: [^1].
The stacktrace conveys the same and more information. We don't do
anything different for each exception class, so there's no value
in having three of them over one exception.
I did think about DRYing-up the duplicate exception behaviour into
the base class one. This isn't ideal because the base class would
be making assumptions about how inheriting classes make requests,
which might change with future providers. Although it might be nice
to have more info in the top-level message, we'll still get it in
the stacktrace e.g.
ValueError: Expected 'code' to be '0'
During handling of the above exception, another exception occurred:
app.clients.sms.SmsClientResponseException: SMS client error (Invalid response JSON)
requests.exceptions.ReadTimeout
During handling of the above exception, another exception occurred:
app.clients.sms.SmsClientResponseException: SMS client error (Request failed)
[^1]: https://github.com/alphagov/notifications-api/pull/3493#discussion_r837363717
This works in conjunction with the new SMS provider stub [^1].
Local testing:
- Run the migrations to add Reach as an inactive provider.
- Activate the Reach provider locally and deactivate the others.
update provider_details set priority = 100, active = false where notification_type = 'sms';
update provider_details set active = true where identifier = 'reach';
- Tweak your local environment to point at the SMS stub.
export REACH_URL="http://host.docker.internal:6300/reach"
- Start / restart Celery to pick up the config change.
- Send a SMS via the Admin app and see the stub log it.
- Reset your environment so you can send normal SMS.
update provider_details set active = true where notification_type = 'sms';
update provider_details set active = false where identifier = 'reach';
[^1]: https://github.com/alphagov/notifications-sms-provider-stub/pull/10
This avoids duplicating it as we add a new provider and means we
can test it all in one place (although it wasn't tested before).
I'm not sure why the previous code did "super(..)__init__" in a
non-init function - it's a bit late! - so I've just replaced it
with a call to the new "init_app" function in the parent class.
This reduces the code to copy when we add a new provider. I don't
think we need to log the URL or status code each time:
- The URL is always the same.
- A "200" status code is implicit in "success".
- Other status codes will be reported as exceptions.
Removing these specific elements means "record_outcome" is generic
and can be de-duplicated in the base class.
This is never overridden and can't be used in practie because all
SMS clients have to use the same interface. Removing it will make
it possible to DRY-up some of the code in this method.
Previously we used a combination of "provider.name" and "get_name()"
which was confusing. Using a non-property function also gave me the
impression that the name was more dynamic than it actually is.
This is enough to update a notification in DB:
1. First create a notification in the UI and sent it.
2. Then reset its attributes to pretend it's for Reach.
update notifications set
sent_at = null,
sent_by = null,
notification_status='sending'
where id='some-uuid';
3. Change "notification_id" to "<some-uuid>" in the code.
4. Call the boilerplate endpoint for Reach callbacks.
curl -X POST localhost:6011/notifications/sms/reach
Interestingly there's no foreign key constraint on "sent_by" in the
DB, so this just works: the notification is updated.
This reverts commit f2f2509c9b.
Raw request stats were added to investigate a hunch about a
performance issue we were seeing [1], but turned out not to
be relevant. We don't use them anymore so we can tidy up.
[1]: https://github.com/alphagov/notifications-api/pull/2858
This avoids any issues due to large payloads (e.g. with a lot of
polygons in the 'areas' field). While we may miss part of the log
in such cases, this is more than we get already anyway.
This modifies the previous "(_)send_link_test" method to trigger a
link test for a specific lambda. We then call the method with both
the primary and failover lambda in new orchestrator method.
Since the _invoke_lambda function doesn't raise exceptions if it
fails, there's no need to rescue anything in order to ensure the
second link test / invocation runs as well. It doesn't testing for
this, since it boils to an absence of code to raise any exception.
Note that, like the other parent tests, we only check the new method
works with a specific proxy client instance.
Unlike the other IDs which are stored in the DB, this isn't relevant
for the Celery task as it invokes a link test. Moving it into the
proxy client will also enable us to generate a second ID in the next
commits, where we start doing a link test for the failover lambda.
We want to get the point where we're running link tests for each
lambda independently. The tests weren't checking for the failover
mechanism for link tests, so we can just remove it.
Previously the Celery task to trigger a link test had to know about
the special case of a sequence number for Vodafone. Since we're about
to change the client to perform multiple tests it makes sense to give
it the knowledge of how to generate number itself.
Note that we have to import the db inline to avoid a circular import,
since this module is itself imported by app/__init__.py.
Other invocations of the Vodafone client use stored sequence numbers
from the DB, which are called "message numbers" in that context. Since
the two use cases are very different (even the names are different!),
having them in two places shouldn't cause any confusion.
We want to start using Firetext for sending international SMS. They
require us to use a different API key for international SMS because it
requires a new code path to switch the sender ID to something that the
country will accept.
This PR does not include switching the sender of international SMS to
Firetext but sets us up to do so.
We only actually use this when the data we're working with is in an
unexpected state, which is unrelated to the CBC Proxy. Using this
name also means we can re-use this exception in the next commits.
Note that we may still care if a broadcast message has expired, since
it's not expected that someone would send one in this condition.
We no longer will send them any stats so therefore don't need the code
- the code to work out the nightly stats
- the performance platform client
- any configuration for the client
- any nightly tasks that kick off the sending off the stats
We will require a change in cronitor as we no longer will have this task
run meaning we need to delete the cronitor check.
boto returns a `StreamingBody`[1] response rather than a json struct.
We're currently just logging things like "Error calling lambda
o2-1-proxy with function error <botocore.response.StreamingBody object
at 0x7f74cd6e02e8>" which is obviously less than ideal. Also make the
tests properly reflect this - annoyingly it appears like we can't use
moto to reliably test this interface as the moto `mock_lambda` decorator
needs you to be running inside a docker container??
[1] https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody
these will happen if, for example, you have issues connecting to AWS or
permission issues.
Still failover if we get one of these exceptions, as I think it might be
possible to have a problem only related to one of the lambdas.
By creating a new `CBCProxyOne2ManyClient` class for the three One2Many
clients to inherit from. These three clients are the same apart from the
`lambda_name` and the `failover_lambda_name`.
Retry tasks if they fail to send a broadcast event. Note that each task
tries the regular proxy and the failover proxy for that provider. This
runs a bit differently than our other retries:
Retry with exponential backoff. Our other tasks retry with a fixed delay
of 5 minutes between tries. If we can't send a broadcast, we want to try
immediately. So instead, implement an exponential backoff (1, 2, 4, 8,
... seconds delay). We can't delay for longer than 310 seconds due to
visibility timeout settings in SQS, so cap the delay at that amount.
Normally we give up retrying after a set amount of retries (often 4
hours). As broadcast content is much more important than normal
notifications, we don't ever want to give up on sending them to phones...
...UNLESS WE DO!
Sometimes we do want to give up sending a broadcast though! Broadcasts
have an expiry time, when they stop showing up on peoples devices, so if
that has passed then we don't need to send the broadcast out.
Broadcast events can also be superceded by updates or cancels. Check
that the event is the most recent event for that broadcast message, if
not, give up, as we don't want to accidentally send out two conflicting
events for the same message.
This moves the hardcoding to test channels one step up to where we call
`create_and_send_broadcast`
We can then after this, start to differ whether we give it the 'test' or
'severe' channel based on the services channel setting.
This used to be hardcoded in the CBC proxy but now we will hardcode it
in the cbc_proxy_client.
In a future PR we can start choosing which channel a broadcast will go
to based on the channel configured for that broadcast service.
We want to rename the `bt-ee-1-proxy` lambda function to `ee-1-proxy`.
This change will need to be deployed at the same time that we change
the name of the lambda function in the Terraform.
o2 use One-2-many CBC so we can use the O2M/CAP client.
Once differences between CBCs have been worked out we can consolidate O2M clients to reduce duplication.
Signed-off-by: Richard Baker <richard.baker@digital.cabinet-office.gov.uk>
When we send an HTTP request to our SMS providers, there is a
chance we get a 5xx status code back from them. Currently we log this as
two different exception level logs.
If a provider has a funny few minutes, we could end up with
hundreds of exceptions thrown and pagerduty waking someone up in the
middle of the night. These problems tend to pretty quickly fix
themselves as we balance traffic from one SMS to the other SMS provider
within 5 minutes.
By downgrading both exceptions to warning in the case of a
`SmsClientResponseException`, we will reduce the change of waking us up
in the middle of the night for no reason.
If the error is not a `SmsClientResponseException`, then we will still
log at the exception level as before as this is more unexpected and we
may want to be alerted sooner.
What we still want to happen though is that let's say both SMS providers
went down at the same time for 1 hour. We don't want our tasks to just
sit there, retrying every 5 minutes for the whole time without us being
aware (so we can at least raise a statuspage update). Luckily we will
still be alerted because our smoke tests will fail after 10 minutes and
raise a p1:
https://github.com/alphagov/notifications-functional-tests/blob/master/tests/functional/staging_and_prod/notify_api/test_notify_api_sms.py#L21
No point trying, it's the same lambda. As `_invoke_lambda` currently
takes a bytes payload, rather than a json payload, it meant the decision
between encoding the payload in the canary or moving the encoding into
the `_invoke_lambda` function. We decided to go for the former as the
lesser of two evils. We may end up doing the encoding twice in the case
of a failover but this avoids us having to put the encoding in our code
in several places (for example the canary and also soon to be the link
tests).