This is because that error is caused by our providers and we
cannot do anything about it but it can make our logs hard to read
and actionable errors harder to spot
Added a scheduled task to run once a day and check if there were any
letters from before 17.30 that still have a status of 'created'. This
logs an exception instead of trying to fix the error because the fix
will be different depending on which bucket the letter is in.
Added a task which runs twice a day on weekdays and checks for letters that have
been in the state of `pending-virus-check` for over 90 minutes. This is
just logging an exception for now, not trying to fix things, since we
will need to manually check where the issue was.
The `process_virus_scan_passed` task now catches S3 errors - if these
occur, it logs an exception and puts the letter in a `technical-failure`
state. We don't retry the task, because the most common reason for
failure would be the letter not being in the expected S3 bucket, in
which case retrying would make no difference.
the celery decorator should always be on the outside so that all other
decorators will be captured within the celery task. We had problems
with cronitor not reporting, and only for this task.
Bumped utils to version 31.2.5, which changes when the rows of a
RecipientCSV get created. Switched to using `.get_rows()` from
RecipientCSV (a generator) instead of the `.rows` property (which builds
a list of the rows in memory).
This celery task was not decorated with the cronitor decorator so never
checked in with cronitor.
Adding the decorator will ensure this task is monitored.
The requisite cronitor key is in the credentials repository already.
Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
When we send a zip file of letters to DVLA we expect them to send back an acknowledgement of those files.
Previously they named the files like NOTIFY.20180202091254.ACK.TXT and the contents would contain the name of the zip file we sent with a date of when they got it.
They have updated this format to mirror the format of the zip file because there was an instance where they sent 2 files of the same name so the later overwrote the first.
Since the name matches our name, there is no need to get the file from S3 but just compare file names.
the create_nightly_notification_status task runs at 00:30am UK time,
however this means that in summer datetime.today() will return the
wrong date as the server (which runs on UTC) will run the task at
23:30 (populating the wrong row in the table).
fix this to use nice tz aware functions
* call variables unambiguous things like `start_time` or `bst_date` to
reduce risk of passing in the wrong thing
* simplify the count_dict object - remove nested dict and start_date
fields as superfluous
* use static datetime objects in tests rather than calculating them
each time
Changed the query to get the performance platform stats from ft_notification_status. But the date used for the query needed to be a date, not datetime so the equality worked.
We need to back fill the daily_sorted_count tables, so we need to iterate through all the files. No need to update the notification status. So this task has been separated out.
if we partially retry a day, we would create new zip files, containing
different letters (if some were processed succesfully). We need these
files to have different filenames to earlier zip files so that we can
avoid overwriting log data in zips_sent.
Hashing the filename means that we'll only overwrite if it was the same
file containing the same content.
DVLA don't care about the naming conventions of zip files, other than
it must start with `NOTIFY.` and end with `.ZIP`. So lets format the
date in a more readable way, and separate it from the batch number
previously ftp would name the files itself by giving them a timestamp
when uploading. we ran into issues with tasks being picked up multiple
times and as such, uploading duplicate files. By naming the file before
creating the task, we can avoid this issue.
Files are now named `NOTIFY.YYYYMMDD######.ZIP` where the number is a
counter that increments with each task we've issued in that run of
collate-letter-pdfs-for-day
The timestamps available in the SES receipt don't always correspond
to the time the notification has been sent. We've seen callbacks with
a current timestamp in both 'mail' and 'bounce' objects that referenced
a notification sent a week ago, which means we can't rely on it to skip
archived notifications.
One possible approach would be to look up the notification reference in
the notification_history table, but this goes against our plans to stop
relying on it in the future.
This changes the SES receipts logic to retry missing notifications once
(if the callback timestamp is within the last 5 minutes the task will
retry after a 5 minute delay) to capture callbacks arriving before the
notification reference has been persisted to the DB. Otherwise, we log
the missing notification as a warning instead of error.
antivirus is sometimes tough to get running locally - now in dev
antivirus is skipped unless `ANTIVIRUS_ENABLED=1` is set on the command
line. on all other environments it is always enabled.
provider switching is a process that can happen as often as we like
without disrupting the flow of the system - however, there are some
reasons why we might not want to switch. One problem we've seen is
when a provider is having an issue, we might switch away from them
manually only for the app to automatically switch back to them again
and again.
Long term we'd like to have a system better suited for sharing the load
equally between our two sms providers, but short term, by increasing
the threshold for switching from 10% (of messages sent are slow) to
20%, we hope to make switching happen less often.
A notification is considered slow if it was sent in the last ten
minutes, on the current provider, and is either
* still in sending or pending after 4 minutes
* in delivered, but took at least 4 minutes to send
Celery `self.retry` raises an exception to communicate that the task
needs to be retried. Since our ses task is wrapped in a catch-all
except block it logs that exception as an error before retrying.
Handling Retry class separately allows us to raise it without logging
the traceback.
We've seen errors caused by what we suspect is a race condition when
SES callback processing tries to look up the notification before the
sender worker has saved notification reference from the SES POST
response to the database.
This adds a retry for SES callback task if the notification was not
found and the message is less than 10 minutes old and removes the
error log message for notifications older than 3 days (since they
might no longer exist in the notifications table and would've been
marked as failure by then either way).
In order to be able to call retry and silence the error log based on
notification time this change inlines `process_ses_response` and
`update_notification_by_reference` functions into the celery task.
It also removes a lot of defensive error-handling that doesn't appear
to have been triggered in the last few months (for things like missing
keys in SES callback data).
Currently we switch if:
* status = delivered and updated_at - sent_at > threshold
* status = sending and now - sent_at > threshold
firetext can leave notifications in the pending state, which is
equivalent to sending in terms of how we should handle it, so this
commit changes the second case to allow pending as well as sending.
The template preview app now accepts a null value for the `filename`
parameter. If a service doesn't have a letter branding option set,
previously we defaulted to their dvla_organisation (probably HM
Government). Now, we pass through None, so that we generate letters
without any logo or branding.
make a decorator that pings cronitor before and after each task run.
Designed for use with nightly tasks, so we have visibility if they
fail. We have a bunch of cronitor monitors set up - 5 character keys
that go into a URL that we then make a GET to with a self-explanatory
url path (run/fail/complete).
the cronitor URLs are defined in the credentials repo as a dictionary
of celery task names to URL slugs. If the name passed in to the
decorator isn't in that dict, it won't run.
to use it, all you need to do is call `@cronitor(my_task_name)`
instead of `@notify_celery.task`, and make sure that the task name and
the matching slug are included in the credentials repo (or locally,
json dumped and stored in the CRONITOR_KEYS environment variable)
If a precompiled letter can't be opened (e.g. because it isn't a valid
PDF) we were setting its billable units to 0, but not moving it to the
invalid PDF bucket. If a precompiled letter failed sanitisation, we were
moving it to the invalid PDF bucket but not setting its billable units
to 0.
This commit makes sure that we always set the billable units to 0
and move the PDF to the right bucket if it fails sanitisation or can't be
opened.
We want to send two new headers, ServiceId and NotificationId to the
template preview /precompiled/sanitise endpoint. This is to allow us to log
errors from this endpoint in template preview with all the information needed,
instead of needing to pass the information back to notifications-api and
to log it there.