Previously we were catching one type of exception if something went
wrong adding a notification to the queue for high volume services.
In reality there are two types of exception so this adds a second
handler to cover both.
For context, this is code we changed experimentally as part of the
upgrade to Celery 5 [1]. At the time we didn't check how the new
exception compared to the old one. It turns out they behaved the
same and we were always vulnerable to the scenario now covered by
the second exception, where the behaviour has changed in Celery 5 -
testing with a large task invocation gives...
Before (Celery 3, large-ish task):
'process_job.apply_async(["a" * 200000])'...
boto.exception.SQSError: SQSError: 400 Bad Request
<?xml version="1.0"?><ErrorResponse xmlns="http://queue.amazonaws.com/doc/2012-11-05/"><Error><Type>Sender</Type><Code>InvalidParameterValue</Code><Message>One or more parameters are invalid. Reason: Message must be shorter than 262144 bytes.</Message><Detail/></Error><RequestId>96162552-cd96-5a14-b3a5-7f503300a662</RequestId></ErrorResponse>
Before (Celery 3, very large task):
<hangs forever>
After (Celery 5, large-ish task):
botocore.exceptions.ClientError: An error occurred (InvalidParameterValue) when calling the SendMessage operation: One or more parameters are invalid. Reason: Message must be shorter than 262144 bytes.
After (Celery 5, very large task):
botocore.parsers.ResponseParserError: Unable to parse response (syntax error: line 1, column 0), invalid XML received. Further retries may succeed:
b'HTTP content length exceeded 1662976 bytes.'
[1]: 29c92a9e54
These tasks need to repeatedly get the same template and service from
the database. We should be able to improve their performance by getting
the template and service from the cache instead, like we do in the REST
endpoint code.
Flake8 Bugbear checks for some extra things that aren’t code style
errors, but are likely to introduce bugs or unexpected behaviour. A
good example is having mutable default function arguments, which get
shared between every call to the function and therefore mutating a value
in one place can unexpectedly cause it to change in another.
This commit enables all the extra warnings provided by Flake8 Bugbear,
except for:
- the line length one (because we already lint for that separately)
- B903 Data class should either be immutable or use `__slots__` because
this seems to false-positive on some of our custom exceptions
- B902 Invalid first argument 'cls' used for instance method because
some SQLAlchemy decorators (eg `declared_attr`) make things that
aren’t formally class methods take a class not an instance as their
first argument
It disables:
- _B306: BaseException.message is removed in Python 3_ because I think
our exceptions have a custom structure that means the `.message`
attribute is still present
Matches the work done in other repos:
- https://github.com/alphagov/notifications-admin/pull/3172/files
This code is already restricted to SMS and email, so this is only to make it obvious if there is a code refactor down the line. Perhaps this is overkill and we back out this commit.
When we initially added a new task to persist the notifications for a high volume service we wanted to implement it as quickly as possible, so ignored SMS.
This will allow a high volume service to send SMS, the SMS will be sent to a queue to then persist and send the SMS, similar to emails.
At this point I haven't added a new application to consume the new save-api-sms-tasks. But we can add a separate application or be happy with how the app scales for both email and sms.
Years ago we started to implement a way to schedule a notification. We hit a problem but we never came up with a good solution and the feature never made it back to the top of the priority list.
This PR removes the code for scheduled_for. There will be another PR to drop the scheduled_notifications table and remove the schedule_notifications service permission
Unfortunately, I don't think we can remove the `scheduled_for` attribute from the notification.serialized method because out clients might fail if something is missing. For now I have left it in but defaulted the value to None.
Same as we’re doing for templates.
This means avoiding a database call, even for services that don’t hit
our API so often.
They’ll still need to go to the database for the API keys, because we’re
not comfortable putting the API key secrets in Redis.
But once a service has got its keys from the database we commit the
transaction, so the connection can be freed up until we need it again to
insert the notification.
Same as we’ve done for templates.
For high volume services this should mean avoiding calls to external
services, either the database or Redis.
TTL is set to 2 seconds, so that’s the maximum time it will take for
revoking an API key or renaming a service to propagate.
Some of the tests created services with the same service ID. This
caused intermittent failures because the cache relies on unique service
IDs (like we have in the real world) to key itself.
By serialising these straight away we can:
- not go back to the database later, potentially closing the connection
sooner
- potentially cache the serialised data, meaning we don’t touch the
database at all
For some reason our V1 get template response wraps the whole template in
a dictionary with one key, `'data'`:
0d99033889/app/template/rest.py (L166)
That means when the admin app caches the response it also caches it in
this format.
The API needs to do the same, otherwise it will be cacheing data with a
schema that the admin app isn’t expecting, and vice-versa.
After the commit we issue two calls to the db to get service and get notification. This is because after the commit the ORM wants to ensure that the data model objects are the latest.
So far this is just a proof of concept, but the letter flow needs to be updated and we should be able to get rid of research mode. And it needs some tidy up.
SQS fails if the message body is over 256kb. Normally our messages are
quite small, but if we're using the new save-api-email task with an
email that has a large body, we can get over that limit. If so, handle
the exception and fall back to the existing code path (saving to the
database and sending a deliver-email task, which only has a notification
id.
Instead of saving the email notification to the db add it to a queue to save later.
This is an attempt to alleviate pressure on the db from the api requests.
This initial PR is to trial it see if we see improvement in the api performance an a reduction in queue pool errors. If we are happy with this we could remove the hard coding of the service id.
In a nutshell:
- If POST /v2/notification/email is from our high volume service (hard coded for now) then create a notification to send to a queue to persist the notification to the db.
- create a save_api_email task to persist the notification
- return the notification
- New worker app to process the save_api_email tasks.
By adding `force=True` to request.get_json() the mime type is ignore. If the data is not valid json the method will return a `BadRequestError` we catch that and throw our own error with a clear error message "Invalid JSON supplied in POST data".
If the json is valid return the json data or an empty dict if None is passed in.
This PR improves the error messages if the json is invalid, previously, the error message was "None object type" message which is not very helpful.
Updated the DAO methods which return a single SMS sender and all SMS senders
to only return the non-archived senders. Changed the error raised in the Admin
interface from a SQLAlchemyError to a BadRequestError.
Updated the DAO methods which return a single email reply_to address and
all reply_to addresses to only return the non-archived addresses.
Changed the type of error that gets raised when using the Admin
interface to be BadRequestError instead of a SQLAlchemyError.