The assumption was that S3 would throw an exception if the object was uploaded twice. That's not the case: the default behaviour is that if a file already exists it will be overwritten. So it is completely safe to run the task from the alert.
It also means that we don't need to wait 4 hours 15 minutes. Shall I decrease the amount of time before restarting the task?
The nightly job to delete email notifications was failing because it was
timing out (`psycopg2.errors.QueryCanceled: canceling statement due to statement timeout`).
This adds a limit to the query which inserts or updates
notification history, so that it only updates a maximum of 10,000 rows
at a time.
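A minimal sketch of the batching pattern, assuming SQLAlchemy models named Notification and NotificationHistory and a hypothetical `upsert_notification_history` helper; the real query also filters by service and data retention settings:

```python
from app import db  # assumed application exports
from app.models import Notification

BATCH_SIZE = 10000

def move_notifications_to_history_in_batches(notification_type, older_than):
    while True:
        # select at most BATCH_SIZE ids per pass so each statement
        # stays well under the statement timeout
        id_rows = (
            db.session.query(Notification.id)
            .filter(
                Notification.notification_type == notification_type,
                Notification.created_at < older_than,
            )
            .limit(BATCH_SIZE)
            .all()
        )
        if not id_rows:
            break
        id_list = [row.id for row in id_rows]
        upsert_notification_history(id_list)  # hypothetical helper doing the insert-or-update
        db.session.query(Notification).filter(
            Notification.id.in_(id_list)
        ).delete(synchronize_session=False)
        db.session.commit()
```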
All our endpoints should perform a check that the params are valid - this is an easy way to do that and is standard for our endpoints.
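A minimal sketch of the kind of check meant here, using jsonschema; the endpoint, schema, and field names are illustrative, not this repo's actual validation code:

```python
from flask import Blueprint, jsonify, request
from jsonschema import Draft7Validator

blueprint = Blueprint("example", __name__)

# illustrative schema - the real endpoints define their own
post_example_schema = {
    "type": "object",
    "properties": {
        "template_id": {"type": "string"},
        "to": {"type": "string"},
    },
    "required": ["template_id", "to"],
    "additionalProperties": False,
}

@blueprint.route("/example", methods=["POST"])
def post_example():
    data = request.get_json(force=True)
    # collect every schema violation rather than failing on the first
    errors = [e.message for e in Draft7Validator(post_example_schema).iter_errors(data)]
    if errors:
        return jsonify(result="error", message=errors), 400
    # ... handle the validated request ...
    return jsonify(result="success"), 201
```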
I reverted the query to just filter by job id.
Now we consistently use the created_at date, so we can always get the right file location and name.
The previous updates to this code were trying to solve the problem of a PDF being created at 17:29, but not ready to upload until 17:31, after the antivirus and validation checks.
But in those cases we would have trouble finding the file.
When we cancel a job, we need to check if all notifications are
already in the database. Previously, we were querying for all the
notification objects in the database and counting them in the
admin app, which runs into pagination problems for large jobs
and could time out for very large jobs.
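A minimal sketch of counting in the database instead, assuming a Notification model with a job_id column; the actual dao function name may differ:

```python
from sqlalchemy import func
from app import db  # assumed application exports
from app.models import Notification

def dao_get_notification_count_for_job(service_id, job_id):
    # one COUNT(*) in the database instead of fetching every row and
    # counting (and paginating) in the admin app
    return (
        db.session.query(func.count(Notification.id))
        .filter(
            Notification.service_id == service_id,
            Notification.job_id == job_id,
        )
        .scalar()
    )
```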
Now the idea is to look at the updated_at date instead: we get the alert if a notification was created at 17:29 but only updated to the created status at 17:30, so the letter is in the next day's bucket.
Not sure if I want to make this change: there isn't an index on updated_at, so the query might be slow.
* The `_should_record_notification_in_history_table` function stopped being
used in this commit: c23ae15f32
* `NOTIFICATIONS_ALERT` stopped being used in this commit: 5aa37f09b6
Added a scheduled task to run once a day and check if there were any
letters from before 17:30 that still have a status of 'created'. This
logs an exception instead of trying to fix the error because the fix
will be different depending on which bucket the letter is in.
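A minimal sketch of the task, assuming a celery app called notify_celery; the deadline handling is simplified here - the real code uses the LETTER_PROCESSING_DEADLINE constant and BST conversion mentioned further down:

```python
from datetime import datetime, timedelta
from flask import current_app
from app import notify_celery  # assumed celery app
from app.models import Notification

@notify_celery.task(name="check-for-letters-still-in-created")
def check_for_letters_still_in_created():
    # letters from before yesterday's 17:30 deadline that never moved
    # on from the 'created' status
    deadline = datetime.utcnow().replace(hour=17, minute=30) - timedelta(days=1)
    count = Notification.query.filter(
        Notification.notification_type == "letter",
        Notification.status == "created",
        Notification.created_at <= deadline,
    ).count()
    if count:
        # log only - the fix differs depending on which bucket the
        # letter's PDF ended up in
        current_app.logger.exception(
            "{} letters created before 17:30 still have status 'created'".format(count)
        )
```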
Added a task which runs twice a day on weekdays and checks for letters that have
been in the state of `pending-virus-check` for over 90 minutes. This is
just logging an exception for now, not trying to fix things, since we
will need to manually check where the issue was.
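A minimal sketch of a twice-a-day weekday schedule with celery beat; the hours chosen here are an assumption, not the repo's actual config:

```python
from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    "check-for-letters-pending-virus-check": {
        "task": "check-for-letters-pending-virus-check",
        # 09:00 and 15:00, Monday to Friday
        "schedule": crontab(hour="9,15", minute=0, day_of_week="1-5"),
    },
}
```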
Update the subquery to run again, but for test keys. Test data is never inserted into NotificationHistory, so it now needs to be deleted separately, given the join to NotificationHistory.
This PR adds a function to upsert (insert, or update if it already exists) into NotificationHistory all the rows from Notification that we are about to delete in the nightly task. This will happen just before the delete function. Since it is an upsert query, the function can be called more than once.
This should allow us to remove all the individual inserts/updates to NotificationHistory.
However, there is a concern that this will double the length of time the task takes. So do we do these upserts in a separate task or in the same one?
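A minimal sketch of the upsert using Postgres INSERT ... ON CONFLICT via SQLAlchemy, assuming the two models share column names; the real set_ clause would cover every shared column, not just the two shown:

```python
from sqlalchemy.dialects.postgresql import insert
from app import db
from app.models import Notification, NotificationHistory

def upsert_notification_history(ids):
    history = NotificationHistory.__table__
    shared_columns = [c.name for c in history.c]
    # select the Notification rows we are about to delete
    select_rows = (
        db.session.query(*[getattr(Notification, name) for name in shared_columns])
        .filter(Notification.id.in_(ids))
        .statement
    )
    stmt = insert(history).from_select(shared_columns, select_rows)
    # on a repeat run the row already exists, so update it instead -
    # this is what makes the function safe to call more than once
    stmt = stmt.on_conflict_do_update(
        index_elements=["id"],
        set_={"status": stmt.excluded.status, "updated_at": stmt.excluded.updated_at},
    )
    db.session.execute(stmt)
    db.session.commit()
```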
provider switching is a process that can happen as often as we like
without disrupting the flow of the system - however, there are some
reasons why we might not want to switch. One problem we've seen is
when a provider is having an issue, we might switch away from them
manually only for the app to automatically switch back to them again
and again.
Long term we'd like to have a system better suited for sharing the load
equally between our two sms providers, but short term, by increasing
the threshold for switching from 10% (of messages sent are slow) to
20%, we hope to make switching happen less often.
A notification is considered slow if it was sent in the last ten
minutes, on the current provider, and is either
* still in sending or pending after 4 minutes
* in delivered, but took at least 4 minutes to send
Currently we switch if:
* status = delivered and updated_at - sent_at > threshold
* status = sending and now - sent_at > threshold
firetext can leave notifications in the pending state, which is
equivalent to sending in terms of how we should handle it, so this
commit changes the second case to allow pending as well as sending.
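A minimal sketch of those conditions as a SQLAlchemy filter; column and status names follow the Notification model, the 4-minute threshold and ten-minute window are the values described above, and created_at is used for the window because it is indexed (as a later message in this section notes):

```python
from datetime import datetime, timedelta
from sqlalchemy import and_, or_
from app.models import Notification

def slow_notification_filter(provider, threshold=timedelta(minutes=4)):
    now = datetime.utcnow()
    return and_(
        Notification.sent_by == provider,
        # created_at has an index, so use it for the ten-minute window
        Notification.created_at >= now - timedelta(minutes=10),
        Notification.key_type != "test",
        or_(
            # delivered, but took at least the threshold to send
            and_(
                Notification.status == "delivered",
                Notification.updated_at - Notification.sent_at >= threshold,
            ),
            # still sending (or pending, which firetext can leave us in)
            and_(
                Notification.status.in_(["sending", "pending"]),
                Notification.sent_at <= now - threshold,
            ),
        ),
    )
```

The switching decision then compares the count of rows matching this filter against the total sent in the window, and switches once the proportion exceeds the threshold.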
When we get a callback from SES, we identify the notification by the
SES reference that we set on the notification after sending. When we
wrote the log message, we assumed that we'd always have a notification
for every callback, so if one couldn't be found we would log an error.
This isn't the case, for a few reasons:
* We might receive a callback before the sender worker has persisted
the reference to the database.
* We might have deleted the notification, especially if the service has
a short data retention period
* We sometimes receive callbacks for references that we have no record
of whatsoever (this is quite alarming but we have no way of knowing
why this happens)
The error logs were happening pretty frequently, and we don't have a
real way to solve them at the moment, so let's cut down on the noise
and downgrade them to info level for now.
Flask-SQLAlchemy's paginate function issues a separate query to get
the total count of rows for a given filter. This query (with the
filters used by the API integration Message log page) is slow for
services with a large number of notifications.
Since the Message log page doesn't actually allow users to paginate
through the response (it only shows the last 50 messages), we can
use limit instead of paginate, which requires passing another
flag from admin to the dao method.
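A minimal sketch of the difference, with a hypothetical `count_pages` flag standing in for whatever the dao actually receives from admin:

```python
from app.models import Notification

def get_notifications_for_service(service_id, page=1, page_size=50, count_pages=True):
    query = Notification.query.filter_by(service_id=service_id).order_by(
        Notification.created_at.desc()
    )
    if count_pages:
        # paginate issues an extra SELECT count(*) with the same
        # filters, which is slow for services with many notifications
        return query.paginate(page=page, per_page=page_size)
    # the Message log only shows the latest 50 messages, so a plain
    # LIMIT avoids the count query entirely (the caller handles the
    # two different return shapes)
    return query.limit(page_size).all()
```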
A `count` flag was added to `paginate` in March 2018; however,
there has been no release of flask-sqlalchemy since then, so we
need to pull the dev version of the package from GitHub.
Previously, we logged a warning containing the notification reference
and the new status. However, it wasn't a great message - this new one
includes the notification id, the old status, the time difference and
more.
This separates out the logs: callbacks for notifications we don't
know about are logged at error level, and duplicates at info level.
We are adding an index to Notifications to optimize the get_notifications_for_service query. We need to build the index concurrently, which cannot be run inside a transaction block, so the index will need to be created on the database directly.
```sql
CREATE INDEX CONCURRENTLY ix_notifications_service_created_at ON notifications (service_id, created_at);
DROP INDEX CONCURRENTLY ix_notifications_service_created_at;
```
Delivery for a provider is slow if more than a threshold proportion
of notifications (currently we pass in a threshold of 10%) either
took x minutes (for now, 4) to deliver, or are still sending after
that time. We look at all notifications for the current provider
which are delivered or sending, are not under a test key, and are
from the last 10 minutes.
We are using created_at to establish whether notifications are from
the last 10 minutes, because we have an index on it, so the query is
faster.
Also write tests for the new is_delivery_slow_for_provider query
It can be useful to get a notification by id while checking that the
notification belongs to a given service. This changes the
get_notification_by_id DAO function to optionally also filter by
service_id so that we can check this.
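A minimal sketch of the optional filter; the real dao function's signature may differ:

```python
from app.models import Notification

def get_notification_by_id(notification_id, service_id=None):
    filters = [Notification.id == notification_id]
    if service_id:
        # also assert the notification belongs to the given service
        filters.append(Notification.service_id == service_id)
    return Notification.query.filter(*filters).first()
```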
Bumped notifications-utils to 3.7.0. Version 3.7.0 includes the
`convert_utc_to_bst` and `convert_bst_to_utc` functions and the
`LETTER_PROCESSING_DEADLINE` constant, so these have been removed from
this repo and anywhere using these has now been updated to get these
from `notifications-utils`.
Also bumped pytest by a patch version to bring in a bug fix.