insert/update, and then delete notifications in hourly batches. This
means that if the task gets interrupted part-way through, we'll have at
least something to show for it. Previously we would insert and update
into the history table but might not delete from the notification table
properly.
Keeping the offsets and limits for confidence around reliability and
queries timing out.
Keeping the join to notification_history to ensure we don't delete
anything prematurely while our DB is in a bit of a weird state with lots
of these tasks failing over the last week.
Previously, when running the `collate_letter_pdfs_for_day` task, we
would only send letters that were created between 5:30pm yesterday and
5:30 today.
Now we send letters that were created before 5:30pm today and that are
still waiting to be sent. This will help us automatically attempt to
send letters that may have fallen through the gaps and not been sent the
previous day when they should have been.
Previously we solved the problem of letters that had fallen the gap by
having to run the task with a date parameter for example
`collate_letter_pdfs_for_day('2020-02-18'). We no longer need this date
parameter as we will always look back across previous days too for
letters that still need sending.
Note, we have to change from using the pagination `list_objects_v2` to
instead getting each individual notification from s3. We reduce load by
using `HEAD` rather than `GET` but this will still greatly increase the
number of API calls. We acknowledge there will be a small cost to this,
say 50p for 5000 letters and think this is tolerable. Boto3 also handles
retries itself so if when making one of the many HEAD requests, there is
a networking blip then it should be retried automatically for us.
We use boto3 for our interaction with s3. Therefore if an expection is
thrown it will be thrown from the botocore library (which boto3 is built
on top of).
I have copied
app/aws/s3.py::file_exists for an example of this exception catching.
The date in the notifications table should always be the most recent date for the template.
Removed the template_type param for the query as well.
Simplified the tests.
The existing endpoint returned a whole notification for the last time the template was used. But this only takes into account data in the last week. This new methods allows us to be specific about when the template was last used if ever but looking into the ft_notification_status table as well.
Before the search term was either:
- an email address (or partial email address)
- a phone number (or partial phone number)
Now it can also be:
- a reference (or partial reference)
We can take a pretty good guess, by looking at the search term, whether
the thing the user is searching by email address or phone number. This
helps us:
- only show relevant notifications
- normalise the search term to give the best chance of matching what we
store in the `normalised_to` field
However we can’t look at a search term and guess whether it’s a
reference, because a reference could take any format. Therefore if the
user hasn’t told us what kind of thing their search term is, we should
stop trying to guess.
We have a team who want to find emails that might have been sent to an
incorrect address. Therefore they can’t search by the correct address,
because it won’t match.
What they do have is the reference number of the user’s application,
which is also stored in the `client_reference` field on the
notification.
So when a user is searching we should also look at the client reference,
as well as the recipient, allowing the user to enter either in the
search box.
The assumption was that S3 would throw an exception if the object was uploaded twice. That's not the case the default behaviour is that if a file already exists it will be overwritten. So it is completely safe to run the task from the alert.
It can also mean that we don't need to wait 4hours 15 minutes. Shall I decease the amount of time before restarting the task?
The nightly job to delete email notifications was failing because it was
timing out (`psycopg2.errors.QueryCanceled: canceling statement due to statement timeout`).
This adds a query limit to the query which inserts or updates
notification history so that it only updates a maximum of 10000 rows at
a time.
All our endpoint should perform a check that the params are valid - this is an easy whay to check that and is standard for our endpoints.
I reverted the query to just filter by job id.
Now we consistently use the created_at date, so we can always get the right file location and name.
The previous updates to this code were trying to solve the problem if a pdf being created at 17:29, but not ready to upload until 17:31 after the antivirus and validation check.
But in those cases we would have trouble finding the file.
When we cancel a job, we need to check if all notifications are
already in the database. So far, we were querying for all
notification objects in the database and counting them in
admin app, which runs into pagination problems for large jobs,
and could time out for very large jobs.
Now looking at the updated_at date, we are getting the alert if the notification was created_at:17:29 updated to created status at 17:30, so the letter is in the next days bucket.
Not sure if I want to make this change, there isn't an index on updated_at, so the query might be slow.
* The `_should_record_notification_in_history_table` function stopped being
used in this commit: c23ae15f32
* `NOTIFICATIONS_ALERT` stopped being used in this commit: 5aa37f09b6
Added a scheduled task to run once a day and check if there were any
letters from before 17.30 that still have a status of 'created'. This
logs an exception instead of trying to fix the error because the fix
will be different depending on which bucket the letter is in.
Added a task which runs twice a day on weekdays and checks for letters that have
been in the state of `pending-virus-check` for over 90 minutes. This is
just logging an exception for now, not trying to fix things, since we
will need to manually check where the issue was.
Update subquery to run again but for test keys. Test data is never inserted in Notifications so they need to be deleted separately now given the join to NotificationHistory.
This PR adds a function to upsert (insert or update if exists) NotificationHistory all the rows from Notification that we are about to delete in the nightly task. This will happen just before the delete function. Since it is a upsert query the function can be called more than once.
This should allow us remove all the insert/updates to NotificationHistory.
However, there is a consern that this will double the length of time the tasks take. So do we do these upserts in a separate task or in the same one?
provider switching is a process that can happen as often as we like
without disrupting the flow of the system - however, there are some
reasons why we might not want to switch. One problem we've seen is
when a provider is having an issue, we might switch away from them
manually only for the app to automatically switch back to them again
and again.
Long term we'd like to have a system better suited for sharing the load
equally between our two sms providers, but short term, by increasing
the threshold for switching from 10% (of messages sent are slow) to
20%, we hope to make switching happen less often.
A notification is considered slow if it was sent in the last ten
minutes, on the current provider, and is either
* still in sending or pending after 4 minutes
* in delivered, but took at least 4 minutes to send
Currently we switch if:
* status = delivered and updated_at - sent_at > threshold
* status = sending and now - sent_at > threshold
firetext can leave notifications in the pending state, which is
equivalent to sending in terms of how we should handle it, so this
commit changes the second case to allow pending as well as sending.
When we get a callback from SES, we identify the notification by the
SES reference that we set on the notification after sending. When we
wrote the log message, we assumed that we'd always have a notification
for every callback, so if one couldn't be found we would raise an error
log. This isn't the case for a few reasons:
* We might receive a callback before the sender worker has persisted
the reference to the database.
* We might have deleted the notification, especially if the service has
a short data retention period
* We sometimes receive callbacks for references that we have no record
of whatsoever (this is quite alarming but we have no way of knowing
why this happens)
The error logs were happening pretty frequently, and we don't have a
real way to solve them at the moment, so lets cut down on noise and
downgrade them to info level for now.
Flask-SQLAlchemy paginate function issues a separate query to get
the total count of rows for a given filter. This query (with
filters used by the API integration Message log page) is slow for
services with large number of notifications.
Since Message log page doesn't actually allow users to paginate
through the response (it only shows the last 50 messages) we can
use limit instead of paginate, which requires passing in another
flag from admin to the dao method.
`count` flag has been added to `paginate` in March 2018, however
there was no release of flask-sqlalchemy since then, so we need
to pull the dev version of the package from Github.
Previously, we logged a warning containing the notification reference
and new status. However it wasn't a great message - this new one
includes the notification id, the old status, the time difference and
more.
This separates out logs for callbacks for notifications we don't know
(error level) and duplicates (info level).