boto returns a `StreamingBody`[1] response rather than a JSON struct.
We're currently just logging things like "Error calling lambda
o2-1-proxy with function error <botocore.response.StreamingBody object
at 0x7f74cd6e02e8>" which is obviously less than ideal. Also make the
tests properly reflect this - annoyingly, it appears that we can't use
moto to reliably test this interface, as the moto `mock_lambda` decorator
needs you to be running inside a Docker container?
[1] https://botocore.amazonaws.com/v1/documentation/api/latest/reference/response.html#botocore.response.StreamingBody
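A minimal sketch of reading that payload, assuming boto3; the function name and payload shape here are illustrative, not the real config:

```python
import json

import boto3

lambda_client = boto3.client('lambda', region_name='eu-west-2')

response = lambda_client.invoke(
    FunctionName='o2-1-proxy',  # illustrative
    Payload=json.dumps({'message_format': 'cap'}).encode('utf-8'),
)

# response['Payload'] is a botocore StreamingBody, not a dict, so
# logging it directly only prints the object repr. Read and decode it
# first so the log line contains the actual error detail.
error_payload = json.loads(response['Payload'].read())

if response.get('FunctionError'):
    print(f'Error calling lambda with function error: {error_payload}')
```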
These will happen if, for example, you have problems connecting to AWS or
permission issues.
Still fail over if we get one of these exceptions, as I think it might be
possible to have a problem only related to one of the lambdas.
Retry tasks if they fail to send a broadcast event. Note that each task
tries the regular proxy and the failover proxy for that provider. This
runs a bit differently from our other retries:
Retry with exponential backoff. Our other tasks retry with a fixed delay
of 5 minutes between tries. If we can't send a broadcast, we want to try
immediately. So instead, implement an exponential backoff (1, 2, 4, 8,
... seconds delay). We can't delay for longer than 310 seconds due to
visibility timeout settings in SQS, so cap the delay at that amount.
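A hedged sketch of that backoff, assuming a Celery task; the task name and body are illustrative:

```python
from celery import Celery

celery_app = Celery('notifications')

MAX_RETRY_DELAY = 310  # capped by the SQS visibility timeout


@celery_app.task(bind=True, max_retries=None)  # max_retries=None: never stop retrying
def send_broadcast_event(self, broadcast_event_id):
    try:
        ...  # try the regular proxy, then the failover proxy
    except Exception as e:
        # 1, 2, 4, 8, ... seconds delay, doubling on each retry up to the cap
        delay = min(2 ** self.request.retries, MAX_RETRY_DELAY)
        self.retry(countdown=delay, exc=e)
```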
Normally we give up retrying after a set number of retries (often 4
hours). As broadcast content is much more important than normal
notifications, we don't ever want to give up on sending them to phones...
...UNLESS WE DO!
Sometimes we do want to give up sending a broadcast though! Broadcasts
have an expiry time, after which they stop showing up on people's devices,
so if that has passed then we don't need to send the broadcast out.
Broadcast events can also be superseded by updates or cancels. Check
that the event is the most recent event for that broadcast message; if
it isn't, give up, as we don't want to accidentally send out two
conflicting events for the same message.
This moves the hardcoded use of test channels one step up, to where we
call `create_and_send_broadcast`.
After this, we can then start to decide whether to give it the 'test' or
'severe' channel based on the service's channel setting.
This used to be hardcoded in the CBC proxy but now we will hardcode it
in the cbc_proxy_client.
In a future PR we can start choosing which channel a broadcast will go
to based on the channel configured for that broadcast service.
We want to rename the `bt-ee-1-proxy` lambda function to `ee-1-proxy`.
This change will need to be deployed at the same time that we change
the name of the lambda function in the Terraform.
O2 uses One-2-many CBC, so we can use the O2M/CAP client.
Once differences between CBCs have been worked out we can consolidate O2M clients to reduce duplication.
Signed-off-by: Richard Baker <richard.baker@digital.cabinet-office.gov.uk>
For all FunctionErrors, and for invoke errors (status > 299), we
want to retry with the failover lambda.
We are doing this because, if there is a connection or other error
with one lambda, the failover lambda may still work and is
worth trying.
In time, we will probably have a more complex retry flow, depending
on the error, and maybe even differing for each MNO (broadcast provider).
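A hedged sketch of that flow, assuming boto3; the helper name is illustrative:

```python
def invoke_with_failover(lambda_client, primary_name, failover_name, payload):
    for function_name in (primary_name, failover_name):
        response = lambda_client.invoke(
            FunctionName=function_name,
            Payload=payload,
        )
        # FunctionErrors and invoke errors (status > 299) both fall
        # through to the failover lambda
        if response.get('FunctionError') or response['StatusCode'] > 299:
            continue
        return response
    raise RuntimeError('both primary and failover lambda invocations failed')
```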
There are several reasons why we might get an `InvalidParameterValue`
from the SES API. One, as correctly identified before in
https://github.com/alphagov/notifications-api/pull/713/files,
is that we allow an email address on our side that SES rejects.
However, there are other types of errors that could cause an
`InvalidParameterValue`. One example is a `Header too long: 'Subject'`
error that we have seen happen in production. This shouldn't raise an
`InvalidEmailError` as that is not appropriate.
Therefore, we introduce a new exception,
`EmailClientNonRetryableException`, which represents any non-retryable
error from an email client, and which we can use whenever we get an
`InvalidParameterValue` error.
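A hedged sketch of raising it, assuming botocore's ClientError; the surrounding send function is illustrative:

```python
from botocore.exceptions import ClientError


class EmailClientNonRetryableException(Exception):
    '''An email-client error that retrying will not fix.'''


def send_email(ses_client, **kwargs):
    try:
        return ses_client.send_raw_email(**kwargs)
    except ClientError as e:
        if e.response['Error']['Code'] == 'InvalidParameterValue':
            # covers rejected addresses as well as errors like
            # "Header too long: 'Subject'"
            raise EmailClientNonRetryableException(e.response['Error']['Message']) from e
        raise
```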
Note, I chose `EmailClientNonRetryableException` rather than
`SESClientNonRetryableException` as our code needs to catch this
exception and shouldn't be aware of what email client is being used;
it just needs to know that it came from one of the email clients (if in
time we have more than one).
In time, we may wish to extend the approach of having generic
`EmailClient` exceptions and `SMSClient` exceptions, as this should be
the most extensible pattern and a good abstraction.
We shouldn't be logging PII, so we should not log email addresses. We
remove the email address and just log the normal exception message.
Note, before this change you could see the email address and so more
easily track down the notification ID in the database. Now, instead, you
will need to search the DB for notifications that went into
technical failure at the time of the log message (as we still don't
log the notification ID alongside the failure).
If we’re sending non-GSM characters, we need to mark the language in the
XML as Welsh (`cy-GB` in CAP, `Welsh` in IBAG).
Currently, the CBC proxy checks the content we’re sending, and then uses
an approximation based on ASCII to determine whether we’re sending any
non-GSM characters, and if so, sets the language appropriately.
Instead, we can use functionality from the notifications-utils repo
to determine the language. If any non-GSM characters are used, then
we can set the language to Welsh.
We’ll need to update the proxy to look at this new language flag.
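A minimal sketch of the check; the `non_gsm_characters` helper and its import path from notifications-utils are assumptions, as is defaulting to English:

```python
from notifications_utils.template import non_gsm_characters  # assumed helper


def cap_language(content):
    # CAP uses IETF language tags; IBAG uses plain 'Welsh'/'English'
    return 'cy-GB' if non_gsm_characters(content) else 'en-GB'
```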
Previously we'd see an error message in the logs:
`AttributeError: 'NoneType' object has no attribute 'status_code'`
because we were assuming the requests exception would always have a
response - it won't have a response if it wasn't able to create a
connection at all.
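A hedged sketch of the guard, assuming the requests library; the function name and logging call are illustrative:

```python
import logging

import requests

logger = logging.getLogger(__name__)


def post_to_cbc(url, body):
    try:
        response = requests.post(url, json=body)
        response.raise_for_status()
        return response
    except requests.RequestException as e:
        # e.response is None if no connection could be made at all,
        # so guard before touching status_code
        status_code = e.response.status_code if e.response is not None else None
        logger.warning('request failed with status code %s', status_code)
        raise
```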
In the IBAG format for broadcasts, we need to give the sequential number
of the previous message, and it needs to be formatted as hex, padded
with zeroes to be 8 characters long.
This commit adds the necessary formatting.
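A minimal sketch of that formatting; the function name is illustrative:

```python
def format_sequential_number(sequential_number):
    # e.g. 42 -> '0000002a': hexadecimal, zero-padded to 8 characters
    return format(sequential_number, '08x')
```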
Previously we made some incorrect assumptions about the set-up on staging
and prod - they currently don't have any cbc_proxy AWS creds at all.
We shouldn't be attempting canaries or link tests when there's no AWS
infrastructure to connect to.
We also shouldn't bother writing a row into the database at all for the
broadcast_provider_message since we're not even attempting to send, and
we shouldn't get confused between messages that failed and messages we
never wanted to send at all.
This is a pretty big and convoluted refactor, unfortunately.
Previously:
There was one global `cbc_proxy_client` object in the app. This class had
the information about how to invoke the bt-ee lambda, and handled all
calls to lambda. This included calls to the canary too (which is a
separate lambda).
The future:
There's one global `cbc_proxy_client`. This knows about the different
provider functions and lambdas, and you'll need to ask this client for a
proxy for your chosen provider: call `cbc_proxy_client.get_proxy('ee')`
and it'll return you a proxy that knows what ee's lambda function is,
how to transform any content in a way that is exclusive to ee, and, in
future, how to parse any response from ee.
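A hedged sketch of that shape; the class names, attributes, and lambda function names here are illustrative, not the real implementation:

```python
class CBCProxyEE:
    lambda_name = 'ee-1-proxy'

    def __init__(self, lambda_client):
        self._lambda_client = lambda_client


class CBCProxyO2:
    lambda_name = 'o2-1-proxy'

    def __init__(self, lambda_client):
        self._lambda_client = lambda_client


class CBCProxyClient:
    # despite the name, this behaves like a factory: you ask it for
    # the proxy of your chosen provider
    def init_app(self, lambda_client):
        self._lambda_client = lambda_client

    def get_proxy(self, provider):
        return {'ee': CBCProxyEE, 'o2': CBCProxyO2}[provider](self._lambda_client)
```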
The present:
I also cleaned up some duplicate tests.
I'm really not sure about the names of some of these variables - in
particular `cbc_proxy_client` isn't a client - it's more of a Java-style
factory, where you call a function on it to get the client of your
choice.
Moved the lambda invocation to a separate function to keep things DRY.
Asserts on exception types need to be outside of `with` blocks, or they
won't trip (as the exception will stop execution of the inner `with`
block). The asserts were also the wrong way round, so fixed that.
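A minimal pytest sketch of the pattern:

```python
import pytest


def test_invoke_raises_on_function_error():
    with pytest.raises(ValueError) as exc_info:  # exception type is illustrative
        raise ValueError('lambda failed')
    # this assert sits outside the with block, so it actually runs
    assert str(exc_info.value) == 'lambda failed'
```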
A BroadcastEvent knows when an event was sent and when it should expire.
We pass these values through directly to the CBC Proxy, because
BroadcastEvent knows how they should be formatted.
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
When we ask the CBC Proxy to send a message, we should specify whether
we want to send a real message.
We will do this by specifying a message_type, which can have 4 values, 3
of which represent a real message:
| Name | Effect |
| ------ | ------------------------ |
| alert | Create an alert |
| update | Update an existing alert |
| cancel | Cancel an existing alert |
| test | Send a link test |
We will use `message_type` to represent the table above.
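A hedged sketch of those values as an enum; the values come from the table above, the enum itself is illustrative:

```python
from enum import Enum


class MessageType(str, Enum):
    ALERT = 'alert'    # create an alert
    UPDATE = 'update'  # update an existing alert
    CANCEL = 'cancel'  # cancel an existing alert
    TEST = 'test'      # send a link test
```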
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Richard <richard.baker@digital.cabinet-office.gov.uk>
Co-authored-by: Pea <pea.tyczynska@digital.cabinet-office.gov.uk>
The CBC Proxy is essentially a lambda function which we invoke with
various arguments.
A way in which this can fail is that the notifications-api app may no
longer be able to invoke the function.
This could be caused by, for example:
* an egress restriction preventing access to eu-west-2.lambda.amazonaws.com
* a network partition preventing access to eu-west-2.lambda.amazonaws.com
* the app's credentials have been rotated or revoked
If we invoke a simple "canary" lambda function which the app should
have access to invoke, and check it for failures, we will know quickly
if something is likely to be broken.
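A hedged sketch of such a check, assuming boto3; the canary function name and error handling are illustrative:

```python
def check_canary(lambda_client):
    response = lambda_client.invoke(FunctionName='cbc-proxy-canary')
    if response['StatusCode'] > 299 or response.get('FunctionError'):
        raise RuntimeError('canary lambda invocation failed')
```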
This is especially important for cell broadcasts compared to email/SMS
because we always have a baseline of traffic for email/SMS, and so any
failure is observed almost immediately. This is not true for CB where we
may expect to only see one CB message every week/month/quarter/year, as
opposed to every minute or second for email/SMS.
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Pea <pea.tyczynska@digital.cabinet-office.gov.uk>
We have hit throttling limits from SES approximately once a week during
a spike of traffic from GOV.UK. The rate limiting usually only lasts a
couple of minutes, but generates enough exceptions to cause a P1 with
no potential action for the responder.
Therefore we downgrade the exception for this case to a warning and assume
traffic will level back out such that the problem resolves itself.
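A hedged sketch of the downgrade, assuming botocore's ClientError; the error-message match and log call are illustrative:

```python
import logging

from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)


def send_via_ses(ses_client, **kwargs):
    try:
        return ses_client.send_raw_email(**kwargs)
    except ClientError as e:
        error = e.response['Error']
        if error['Code'] == 'Throttling' and 'sending rate' in error['Message']:
            # transient per-second rate limiting: warn, don't page anyone
            logger.warning('SES send rate throttled, will retry')
        raise
```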
Note, we will still get exceptions if we go over our daily limit, rather
than our per minute sending limit, which does require immediate action
by someone responding.
If we were to continually go over our per-second sending rate for a long
continuous period of time, then there is a chance we may not be aware,
but given that the risk of this happening is low, I think it's an
acceptable risk for the moment.
Also log the detailed delivery status for firetext in the same place, in
addition to it being logged from notifications_dao.
Logging detailed delivery statuses will help us see why messages
fail to deliver. In the future we could persist detailed delivery
status in the database.
If doc download returns a 403, that's a screw-up on our side. It's not
helpful to a Notify user for that to be passed on. The only thing they
should care about is if it's a 400, because they uploaded a filetype we
don't allow.
Everything else should return a 500 Internal Server Error.
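A hedged sketch of that mapping, assuming a Flask-style abort; the handler name is illustrative:

```python
from flask import abort


def handle_document_download_response(response):
    if response.status_code == 400:
        # the user uploaded a filetype we don't allow: pass it on
        abort(400, response.json())
    else:
        # a 403 (or anything else unexpected) is our screw-up, not
        # the user's, so return a generic 500
        abort(500)
```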
Since pytest 5, `ExceptionInfo` objects (returned by `pytest.raises`)
have the same `str` representation as `repr`. This means that `str(e)`
now needs to be changed to `str(e.value)`.
https://github.com/pytest-dev/pytest/issues/5412
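A minimal sketch of the change:

```python
import pytest

with pytest.raises(ValueError) as e:
    raise ValueError('bad value')

str(e)        # pytest 5+: same as repr(e), includes file/line info
str(e.value)  # the exception message itself: 'bad value'
```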