* notify_db fixture creates the database connection and ensures the test
db exists, has migrations applied, etc. It runs once per session
(i.e. once per test run).
* notify_db_session fixture runs after your test finishes and deletes
all non-static data (i.e. everything except static data such as type
tables).
In unit tests that hit the database (i.e. most of them), 99% of the time
we will need to use notify_db_session to ensure everything is reset. The
only time we don't need it is when we're testing things such as
"ensure get X works when the database is empty". These are such a small
percentage of tests that it's easier for us to just use
notify_db_session every time, making all our tests run more
consistently at the cost of a small bit of performance when running
tests.
We used to use notify_db to access the session object for manually
adding, committing, etc. To dissuade usage of that fixture, I've moved
that functionality onto `notify_db_session`, and have removed all uses
of notify_db that I could find in the codebase.
As a note, if you're writing a test that uses a `sample_x` fixture, all
of those fixtures rely on notify_db_session, so you'll get the teardown
functionality for free. If you're just calling e.g. the `create_x`
functions from db.py, then you'll need to make sure you add the
notify_db_session fixture to your test, even if you aren't manually
accessing the session (see the sketch below).
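For illustration, a minimal sketch of both cases; `create_thing` and
`sample_thing` are hypothetical stand-ins for the real `create_x`
helpers and `sample_x` fixtures:

```python
from tests.app.db import create_thing  # hypothetical create_x helper


def test_create_thing_persists(notify_db_session):
    # notify_db_session isn't referenced in the test body, but requesting
    # it means its teardown deletes all non-static data after this test.
    thing = create_thing(name="foo")

    assert thing.id is not None


def test_thing_with_sample_fixture(sample_thing):
    # sample_thing depends on notify_db_session itself, so the teardown
    # comes for free and we don't need to request it explicitly.
    assert sample_thing.id is not None
```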
Previously these metrics weren't very useful because they could be
skewed by the long timings of failed notifications, which can take up
to 72 hours to reach their final state. I'm intentionally not having a
dual-running period (with both the old and new names) because:
- We don't use the current stats for anything (based on checking
Grafana).
- The current stats get turned into a "bucket" metric in Prometheus
[1][2], which isn't very useful because it can only tell us the mean
time to deliver, but we're actually interested in percentiles.
Switching to a new naming is an opportunity to fix the raw data and
the way it's aggregated, using the same kind of "summary" metric that
we now use for stats about our Celery tasks [3].
[1]: c330a8ac8a/paas/statsd/statsd-mapping.yml (L82)
[2]: https://prometheus.io/docs/practices/histograms/#quantiles
[3]: https://github.com/alphagov/notifications-aws/pull/890
At the moment we log every time we get a bounce from SES, but we don't
link it to a particular notification, so it's hard to tell from the
logs which sub-reason caused a notification not to deliver.
This commit changes this by looking up the bounce reason after we have
found the notification ID and logging them together. So if you now
search for a notification ID in Kibana, you will see full logs for why
it failed to deliver.
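A rough sketch of the shape of the change; the dao helper name is an
assumption, and the message fields follow SES's bounce notification
format:

```python
from flask import current_app

from app.dao.notifications_dao import dao_get_notification_by_reference  # name assumed


def handle_ses_bounce(ses_message):
    reference = ses_message["mail"]["messageId"]
    bounce = ses_message["bounce"]

    notification = dao_get_notification_by_reference(reference)

    # Log the bounce type and sub-type together with the notification ID,
    # so searching Kibana for the notification ID surfaces why delivery
    # failed.
    current_app.logger.info(
        "SES bounce for notification %s: bounceType=%s, bounceSubType=%s",
        notification.id,
        bounce.get("bounceType"),
        bounce.get("bounceSubType"),
    )
```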
It is possible a service has a data retention period that is shorter
than the time it takes to get a delivery receipt.
This PR refactors process_ses_receipt to update NotificationHistory if
the Notification has already been purged.
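A minimal sketch of the fallback, assuming SQLAlchemy models named
Notification and NotificationHistory and a status-update helper along
these lines:

```python
from app import db
from app.models import Notification, NotificationHistory


def update_notification_status_by_reference(reference, status):
    # Look in the live table first; if the service's retention period
    # has already purged the row, fall back to NotificationHistory.
    notification = Notification.query.filter_by(reference=reference).first()

    if notification is None:
        notification = NotificationHistory.query.filter_by(reference=reference).first()

    if notification is None:
        return None

    notification.status = status
    db.session.commit()
    return notification
```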
The timestamps available in the SES receipt don't always correspond
to the time the notification was sent. We've seen callbacks with
a current timestamp in both the 'mail' and 'bounce' objects that
referenced a notification sent a week ago, which means we can't rely
on them to skip archived notifications.
One possible approach would be to look up the notification reference in
the notification_history table, but this goes against our plans to stop
relying on it in the future.
This changes the SES receipts logic to retry missing notifications once
(if the callback timestamp is within the last 5 minutes, the task will
retry after a 5 minute delay) to capture callbacks arriving before the
notification reference has been persisted to the DB. Otherwise, we log
the missing notification as a warning instead of an error.
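Roughly, the retry logic looks like this (a sketch; the celery app
object, task name, dao helper and timestamp format are assumptions):

```python
from datetime import datetime, timedelta

from flask import current_app

from app import notify_celery  # celery app instance, name assumed
from app.dao.notifications_dao import dao_get_notification_by_reference  # name assumed


@notify_celery.task(bind=True, name="process-ses-result", max_retries=1)
def process_ses_results(self, response):
    reference = response["mail"]["messageId"]
    notification = dao_get_notification_by_reference(reference)

    if notification is None:
        callback_time = datetime.strptime(
            response["mail"]["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ"
        )
        if datetime.utcnow() - callback_time < timedelta(minutes=5):
            # The callback may have raced the sender worker persisting
            # the reference; retry once after a 5 minute delay.
            self.retry(countdown=300)
        else:
            current_app.logger.warning(
                "SES callback for unknown reference %s", reference
            )
        return

    # ... continue processing the receipt for the found notification ...
```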
We've seen errors caused by what we suspect is a race condition when
SES callback processing tries to look up the notification before the
sender worker has saved the notification reference from the SES POST
response to the database.
This adds a retry for the SES callback task if the notification was not
found and the message is less than 10 minutes old, and removes the
error log message for notifications older than 3 days (since they
might no longer exist in the notifications table and would have been
marked as failed by then either way).
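What to do with a missing notification then depends only on the age of
the message; a sketch of the branching (helper and argument names are
hypothetical):

```python
from datetime import datetime, timedelta

from flask import current_app


def handle_missing_notification(task, reference, message_time):
    # `task` is the bound celery task, so we can call retry on it.
    age = datetime.utcnow() - message_time

    if age < timedelta(minutes=10):
        # Probably a race with the sender worker saving the reference.
        task.retry()
    elif age <= timedelta(days=3):
        current_app.logger.error(
            "notification not found for reference %s", reference
        )
    # Older than 3 days: the notification may have been purged from the
    # notifications table and marked as failed anyway, so stay silent.
```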
In order to be able to call retry and silence the error log based on
notification time, this change inlines the `process_ses_response` and
`update_notification_by_reference` functions into the celery task.
It also removes a lot of defensive error handling that doesn't appear
to have been triggered in the last few months (for things like missing
keys in SES callback data).
The JSON we were getting from SES was not quite as expected; the test
data now reflects what we actually receive. A new test has been added,
and a test that was passing regardless of the code under test has been
fixed.
When handling a complaint we don't want to throw an exception if the
message is missing fields. We only log an exception if we are unable
to tie the complaint to a notification (sketched below).
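A sketch of that behaviour; the dao helper name is an assumption:

```python
from flask import current_app

from app.dao.notifications_dao import dao_get_notification_history_by_reference  # name assumed


def handle_complaint(ses_message):
    # Use .get() throughout so a complaint with missing fields doesn't
    # raise; absent fields just come through as None.
    complaint = ses_message.get("complaint", {})
    reference = ses_message.get("mail", {}).get("messageId")

    notification = dao_get_notification_history_by_reference(reference)

    if notification is None:
        # The one case that warrants an exception-level log: we can't
        # tie the complaint to any notification.
        current_app.logger.exception(
            "Complaint received for unknown reference %s", reference
        )
        return

    current_app.logger.info(
        "Complaint %s received for notification %s",
        complaint.get("feedbackId"),
        notification.id,
    )
```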