Now looking at the updated_at date, we are getting the alert if the notification was created_at:17:29 updated to created status at 17:30, so the letter is in the next days bucket.
Not sure if I want to make this change, there isn't an index on updated_at, so the query might be slow.
* The `_should_record_notification_in_history_table` function stopped being
used in this commit: c23ae15f32
* `NOTIFICATIONS_ALERT` stopped being used in this commit: 5aa37f09b6
Added a scheduled task to run once a day and check if there were any
letters from before 17.30 that still have a status of 'created'. This
logs an exception instead of trying to fix the error because the fix
will be different depending on which bucket the letter is in.
Added a task which runs twice a day on weekdays and checks for letters that have
been in the state of `pending-virus-check` for over 90 minutes. This is
just logging an exception for now, not trying to fix things, since we
will need to manually check where the issue was.
Update subquery to run again but for test keys. Test data is never inserted in Notifications so they need to be deleted separately now given the join to NotificationHistory.
This PR adds a function to upsert (insert or update if exists) NotificationHistory all the rows from Notification that we are about to delete in the nightly task. This will happen just before the delete function. Since it is a upsert query the function can be called more than once.
This should allow us remove all the insert/updates to NotificationHistory.
However, there is a consern that this will double the length of time the tasks take. So do we do these upserts in a separate task or in the same one?
provider switching is a process that can happen as often as we like
without disrupting the flow of the system - however, there are some
reasons why we might not want to switch. One problem we've seen is
when a provider is having an issue, we might switch away from them
manually only for the app to automatically switch back to them again
and again.
Long term we'd like to have a system better suited for sharing the load
equally between our two sms providers, but short term, by increasing
the threshold for switching from 10% (of messages sent are slow) to
20%, we hope to make switching happen less often.
A notification is considered slow if it was sent in the last ten
minutes, on the current provider, and is either
* still in sending or pending after 4 minutes
* in delivered, but took at least 4 minutes to send
Currently we switch if:
* status = delivered and updated_at - sent_at > threshold
* status = sending and now - sent_at > threshold
firetext can leave notifications in the pending state, which is
equivalent to sending in terms of how we should handle it, so this
commit changes the second case to allow pending as well as sending.
When we get a callback from SES, we identify the notification by the
SES reference that we set on the notification after sending. When we
wrote the log message, we assumed that we'd always have a notification
for every callback, so if one couldn't be found we would raise an error
log. This isn't the case for a few reasons:
* We might receive a callback before the sender worker has persisted
the reference to the database.
* We might have deleted the notification, especially if the service has
a short data retention period
* We sometimes receive callbacks for references that we have no record
of whatsoever (this is quite alarming but we have no way of knowing
why this happens)
The error logs were happening pretty frequently, and we don't have a
real way to solve them at the moment, so lets cut down on noise and
downgrade them to info level for now.
Flask-SQLAlchemy paginate function issues a separate query to get
the total count of rows for a given filter. This query (with
filters used by the API integration Message log page) is slow for
services with large number of notifications.
Since Message log page doesn't actually allow users to paginate
through the response (it only shows the last 50 messages) we can
use limit instead of paginate, which requires passing in another
flag from admin to the dao method.
`count` flag has been added to `paginate` in March 2018, however
there was no release of flask-sqlalchemy since then, so we need
to pull the dev version of the package from Github.
Previously, we logged a warning containing the notification reference
and new status. However it wasn't a great message - this new one
includes the notification id, the old status, the time difference and
more.
This separates out logs for callbacks for notifications we don't know
(error level) and duplicates (info level).
We are adding an index to Notifications to optimize the get_notifications_for_service. We need to build the index concurrently which can not be run inside a transaction block so the index will need to be run on the db directly.
CREATE INDEX CONCURRENTLY ix_notifications_service_created_at ON notifications (service_id, created_at);
DROP INDEX CONCURRENTLY ix_notifications_service_created_at
The delivery for provider is slow if more than threshold (currently
we pass in threshold 10%) either took x (for now 4) minutes to deliver,
or are still sending after that time. We look at all notifications
for current provider which are delivered or sending, and are not under
test key, for the last 10 minutes.
We are using created_at to establish if notifications are from last
10 minutes because we have an index on it, so the query is faster.
Also write tests for new is_delivery_slow_for_provider query
It can be useful to get a notification by id while checking that the
notification belongs to a given service. This changes the
get_notification_by_id DAO function to optionally also filter by
service_id so that we can check this.
Bumped notifications-utils to 3.7.0. Version 3.7.0 includes the
`convert_utc_to_bst` and `convert_bst_to_utc` functions and the
`LETTER_PROCESSING_DEADLINE` constant, so these have been removed from
this repo and anywhere using these has now been updated to get these
from `notifications-utils`.
Also bumped pytest by a patch version to bring in a bug fix.
- Updated notifications_dao.update_notification_status_by_id with an optional parameter to set the sent_by, this will eliminate a separate update to notifcaitons.
- Added the callback url to the log message, that way we can see if it's the same url failing.
- Stop sending the status callbacks for PENDING status.
There was a datetime bug in the query which resulted in files not being sent to the postal provider.
The trigger-letter-pdfs-for-day task is no longer needed, so rather than fix the query just call collate_letter_pdfs_for_day directly.
Less code is always better.
Deployment considerations: I realized this is strictly not backwards compatible if the scheduled job is in progress and a task is on the queue that no longer exists. This is ok since we will deploy this well before 17:50.
For returned letter updates most notifications won't exist in the
notifications table, so in order to find out whether the reference
matches any known letters we need to check the count of updated
history records.
If there are no files to delete we won't get an excpetion.
Wrap the delete file in a try/except to avoid stopping the entire task.
Fix the missing slash for the file name.