This means that anyone adding a new test to this file doesn't have to
remember to clear the cache in their test, and can't forget to and end
up with a hard-to-debug test failure.
Using `setup_function` means we don't have to convert this module to
class-based tests.
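For example (a minimal, self-contained illustration of the pattern; the
cache name here is made up):

```python
# test_provider_cache.py
provider_cache = {}  # stand-in for the real module-level cache

def setup_function(function):
    # pytest calls this before every test function in the module, so
    # each test starts with an empty cache without having to clear it
    provider_cache.clear()

def test_cache_starts_empty():
    assert provider_cache == {}
```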
There are several reasons why we might get an `InvalidParameterValue`
from the SES API. One, as correctly identified before in
https://github.com/alphagov/notifications-api/pull/713/files,
is if we allow an email address on our side that SES rejects.
However, other types of error can also cause an
`InvalidParameterValue`. One example is a `Header too long: 'Subject'`
error that we have seen happen in production. This shouldn't raise an
`InvalidEmailError`, as in that case there is nothing wrong with the
email address.
Therefore, we introduce a new exception,
`EmailClientNonRetryableException`, which represents a non-retryable
error back from an email client, and which we can raise whenever we get
an `InvalidParameterValue` error.
Note, I chose `EmailClientNonRetryableException` rather than
`SESClientNonRetryableException` because the code that catches this
exception shouldn't be aware of which email client is being used; it
just needs to know that the error came from one of the email clients
(if in time we have more than one).
In time, we may wish to extend this approach of having generic
`EmailClient` exceptions and `SMSClient` exceptions, as this should be
the most extensible pattern and a good abstraction.
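A rough sketch of the shape of this change (the `send_email` function
here is illustrative, not the exact SES client code):

```python
import botocore.exceptions

class EmailClientException(Exception):
    """Base class for errors from any email client."""

class EmailClientNonRetryableException(EmailClientException):
    """The email client rejected the request; retrying will not help."""

def send_email(ses_client, **kwargs):
    try:
        return ses_client.send_raw_email(**kwargs)
    except botocore.exceptions.ClientError as e:
        if e.response["Error"]["Code"] == "InvalidParameterValue":
            # covers bad email addresses as well as things like
            # "Header too long: 'Subject'"
            raise EmailClientNonRetryableException(
                e.response["Error"]["Message"]
            ) from e
        raise
```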
We shouldn't be logging PII, so we should not log email addresses. We
remove the email address and just log the normal exception message.
Note: previously you could see the email address, which made it easier
to track down the notification ID in the database. Now you will need to
search the DB for notifications that went into technical failure around
the time of the log message (as we still don't log the notification ID
alongside the failure).
There seems to be some kind of complication in this script that
prevents it from terminating properly.
This is being removed for now to allow deploying the rest of the fixes
in time for the holiday period.
This will make it impossible to create a new client without at least
having to define these properties, which should get someone thinking
about language support…
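As a sketch of how that enforcement works (the class and property names
here are illustrative, not the real proxy client code):

```python
from abc import ABC, abstractmethod

class CBCProxyClientBase(ABC):
    @property
    @abstractmethod
    def LANGUAGE_ENGLISH(self):
        """e.g. 'en-GB' in CAP, 'English' in IBAG"""

    @property
    @abstractmethod
    def LANGUAGE_WELSH(self):
        """e.g. 'cy-GB' in CAP, 'Welsh' in IBAG"""

# Instantiating a subclass that doesn't define both properties raises
# a TypeError straight away, rather than failing when we try to send.
```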
If we’re sending non-GSM characters, we need to mark the language in the
XML as Welsh (`cy-GB` in CAP, `Welsh` in IBAG).
Currently, the CBC proxy checks the content we’re sending, and then uses
an approximation based on ASCII to determine whether we’re sending any
non-GSM characters, and if so, sets the language appropriately.
Instead, we can use functionality from the notifications-utils repo to
determine the language: if any non-GSM characters are used, then we can
set the language to Welsh.
We’ll need to update the proxy to look at this new language flag.
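A sketch of the check (illustrative only: the character set and helper
should really come from notifications-utils rather than being
hand-rolled like this):

```python
# GSM 03.38 basic character set (approximate, for illustration)
GSM_CHARACTERS = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ¿abcdefghijklmnopqrstuvwxyzäöñüà"
)

def language_for_content(content):
    # any character outside the GSM set means the content is (probably)
    # Welsh, so mark the language accordingly
    if set(content) - GSM_CHARACTERS:
        return "cy-GB"  # "Welsh" in IBAG
    return "en-GB"
```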
On Monday, we had a build-up of emails in the email queue that weren't
getting picked up by the sender worker, causing delays.
After further investigation with Andy from the PaaS, we believe the
following happened.
We received a bunch of traffic at around 8:30 which consisted of some
very large emails in terms of their length and complexity. The amount
of memory used by the app instances got very high and a few apps
crashed due to OOM (5 crashes recorded in the cf app events). When new
app instances tried to spin up, they weren't able to, as they
potentially also ran out of memory immediately.
This left us in the position of having fewer app instances than we
needed, on top of which they were all using a very large amount of
CPU, which may have limited how quickly an individual app instance
could process tasks. This meant that overall we were processing fewer
tasks than we needed to, and our queue of emails started to build up.
So it appears our sender workers did not have the memory available that
they needed. Looking at a graph of memory usage on the sender workers
over the past 30 days, we can see that on several days it breached 90%
memory usage for long periods of time. This, in combination with the
hypothesis above about what happened, led us to decide to give the app
instances a bigger memory quota, so it has been upped from 3GB to 4GB.
Whilst doing this, I also looked at long-term memory usage graphs for
our other workers and saw that the letters worker was similarly close
to around 90% of memory used, so I have taken the opportunity to bump
that too.
The test was failing when run after 5:30pm, as this would cause the
letters to be in a different subfolder (for one day later). Solved by
freezing the time in the test.
Example build that failed: https://cd.gds-reliability.engineering/builds/1876957
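For reference, a minimal sketch of the fix using `freezegun` (the test
name and timestamp are made up):

```python
from datetime import datetime
from freezegun import freeze_time

@freeze_time("2020-02-17 16:00:00")  # safely before the 5:30pm cutoff
def test_letters_use_todays_subfolder():
    # the subfolder date no longer depends on when the suite runs
    assert datetime.utcnow().strftime("%Y-%m-%d") == "2020-02-17"
```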
We are using our custom logger to log to `NOTIFY_LOG_PATH`, so this
logging from celery is neither needed nor desired.
We also need to define the location of the pidfiles, because of what
appears to be a bug in celery where, if the pidfile location is not
defined, it uses the location of the logs to infer it, i.e. in this
case it was trying to find the pidfiles in `/dev/null/%N.pid`.
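So the worker start command ends up looking something like this (the
app module and paths here are illustrative):

```
celery multi start 3 -A run_celery.notify_celery \
    --logfile=/dev/null \
    --pidfile=/tmp/celery/%N.pid
```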
For every email or text message we send we have to work out which
provider to send it with. Every time we do this we go and load the list
of providers from the database.
For emails, the result will always be the same.
For text messages the result is randomly chosen to balance the load
between the providers.
For international text messages the result is always the same (we only
have one international text message provider).
This commit adds an in-memory cache with a 2 second TTL so that we’re
not fetching the providers from the database every time, which should
speed things up a bit.
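A minimal sketch of the idea (the real code differs;
`load_sms_providers` stands in for the existing database query):

```python
import time

class TTLCache:
    """In-memory cache whose entries expire `ttl` seconds after being
    set."""

    def __init__(self, ttl=2):
        self.ttl = ttl
        self._store = {}

    def get(self, key, compute):
        entry = self._store.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]  # still fresh, skip the database
        value = compute()
        self._store[key] = (value, time.monotonic() + self.ttl)
        return value

provider_cache = TTLCache(ttl=2)

# usage: providers = provider_cache.get("sms", load_sms_providers)
```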
This does mean that, for text messages, the random choice will ‘stick’
for two seconds on each instance, before being re-chosen. I think this
is OK because it will even out to the same distribution over time.
I really don’t like having to clear the cache in the tests, so would
welcome suggestions on a better way of doing this…
We already serialise it in the templates response. We should make sure
the field is also present in the history response, if we want to use
cached template versions when processing notifications.
At the moment the two are duplicative, and it’s not clear which takes
precedence.
We need to keep the `exclude` line, otherwise running the `flake8`
command takes ~3 times longer and may lint 3rd party files that we
don't control.
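For reference, the relevant config looks something like this (the
excluded paths are illustrative):

```
[flake8]
exclude = .git,venv*,node_modules,migrations
```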
Previously we'd see an error message in the logs:
`AttributeError: 'NoneType' object has no attribute 'status_code'`
because we were assuming the requests exception would always have a
response; it won't have a response if it wasn't able to create a
connection at all.
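A minimal sketch of the safer pattern (the function and names here are
illustrative, not the real client code):

```python
import logging
import requests

logger = logging.getLogger(__name__)

def post_to_provider(url, payload):
    try:
        response = requests.post(url, json=payload, timeout=30)
        response.raise_for_status()
        return response
    except requests.RequestException as e:
        # e.response is None when no connection could be made at all,
        # so guard before reading status_code
        status = e.response.status_code if e.response is not None else None
        logger.warning("POST to %s failed with status %s", url, status)
        raise
```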
We could add a request_id for tasks that are not spawned by an HTTP
request, for example scheduled or nightly tasks. That would mean you
could match up all the tasks spawned by a single parent task; for
example, create-nightly-billing spawns 4 tasks, and those would all
have the same ID. Not sure if that is helpful or not. Also, it might be
confusing to have a request_id for logs that were not started from a
request, so I have left it out.
Flake8 Bugbear checks for some extra things that aren’t code style
errors, but are likely to introduce bugs or unexpected behaviour. A
good example is having mutable default function arguments, which get
shared between every call to the function and therefore mutating a value
in one place can unexpectedly cause it to change in another.
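To make that concrete, a minimal example of the mutable-default
pitfall:

```python
def add_recipient(recipient, recipients=[]):  # flagged by Bugbear (B006)
    recipients.append(recipient)
    return recipients

add_recipient("a@example.com")  # ["a@example.com"]
add_recipient("b@example.com")  # ["a@example.com", "b@example.com"] (!)

# The fix: default to None and build a fresh list on each call
def add_recipient_fixed(recipient, recipients=None):
    if recipients is None:
        recipients = []
    recipients.append(recipient)
    return recipients
```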
This commit enables all the extra warnings provided by Flake8 Bugbear,
except for:
- the line length one (because we already lint for that separately)
- B903 Data class should either be immutable or use `__slots__` because
this seems to false-positive on some of our custom exceptions
- B902 Invalid first argument 'cls' used for instance method because
some SQLAlchemy decorators (eg `declared_attr`) make things that
aren’t formally class methods take a class not an instance as their
first argument
It disables:
- B306 `BaseException.message` is removed in Python 3, because I think
  our exceptions have a custom structure that means the `.message`
  attribute is still present
Matches the work done in other repos:
- https://github.com/alphagov/notifications-admin/pull/3172/files
The request_id is set if the task is created from an HTTP request; if
that task then calls through to another task, this will set the
request_id from the global context. We should then be able to follow
the creation of a notification all the way from the original HTTP
request to the sending task.
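As a hedged sketch of how the propagation might work (the base class
and helper here are hypothetical, not the real implementation):

```python
import uuid
from celery import Task

def current_request_id():
    # stand-in for however the app exposes the id of the current
    # request or task from the global context
    return str(uuid.uuid4())

class NotifyTask(Task):
    """Hypothetical base task: copies the caller's request_id into the
    headers of any task it spawns, so downstream logs share the id."""

    def apply_async(self, args=None, kwargs=None, **options):
        headers = options.get("headers") or {}
        headers.setdefault("request_id", current_request_id())
        options["headers"] = headers
        return super().apply_async(args, kwargs, **options)
```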
We have been getting log lines like the following:
`API POST request on
https://api.notifications.service.gov.uk/notifications/sms/mmg failed
with None`
It's not clear what error caused the request to fail because the value
of `api_error.response` is always `None`.
There appears to be something wrong with this logging.
`raise_for_status` will already raise an `HTTPError`, so there should
be no reason to pass that error into another `HTTPError` (which is what
is causing the response to be lost).
We can instead simply catch the `HTTPError` and log its status code.
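Something like this (illustrative, not the exact client code):

```python
import logging
import requests

logger = logging.getLogger(__name__)

def send_sms_to_provider(url, payload):
    response = requests.post(url, json=payload, timeout=30)
    try:
        response.raise_for_status()
    except requests.HTTPError as e:
        # the HTTPError already carries the response; re-wrapping it in
        # a new HTTPError is what was losing the status code
        logger.warning(
            "API POST request on %s failed with status %s",
            url,
            e.response.status_code,
        )
        raise
```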
This might not be perfect, but it's definitely an improvement and should
give us some more context about why these requests occasionally fail.