As it turns out, the SNS topic is triggered more than once when a file is placed in S3. This is caused by a bug in the s3ftp software used to mount the S3 bucket to the FTP server.
The s3ftp software performs the create file operation more than once, which results in duplicate counts in DailySortedLetter (the counts of how many letters are marked as sorted or unsorted, which affect how much the provider will charge Notify for each letter).
This PR adds a new column to DailySortedLetter called file_name, along with a new unique constraint on billing_day + file_name. Each time we write a row to the table the counts are overwritten rather than aggregated.
I am aware that this PR is not backwards compatible. However, since the code is typically triggered once a day around 13:00, it is very unlikely that an exception will occur during the deploy. A complete migration of the data, based on all our response files on S3, will also be performed soon, meaning the data will be corrected. And if an exception does occur, it will be after the updates to notification status have already occurred.
We need to deal with this. It's ok when updating a notification status from delivered to delivered, but the DailySortedLetter counts are being doubled.
Adding file_name to the table as part of a unique key gets around this issue. It will mean we have multiple rows for each billing_day, but that's ok because we can aggregate them.
This will also give us a way to see which file created which count.
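The overwrite-then-aggregate behaviour can be sketched with SQLite. This is only an illustration under assumptions: the table layout, the unsorted_count and sorted_count column names, the file names and the record_counts / totals_for_day helpers are all hypothetical, not the actual Notify schema or code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE daily_sorted_letter (
        billing_day TEXT NOT NULL,
        file_name TEXT NOT NULL,
        unsorted_count INTEGER NOT NULL,
        sorted_count INTEGER NOT NULL,
        UNIQUE (billing_day, file_name)
    )
    """
)


def record_counts(billing_day, file_name, unsorted_count, sorted_count):
    # INSERT OR REPLACE overwrites the existing row for the same
    # (billing_day, file_name), so a repeated trigger cannot double the counts.
    conn.execute(
        "INSERT OR REPLACE INTO daily_sorted_letter "
        "(billing_day, file_name, unsorted_count, sorted_count) "
        "VALUES (?, ?, ?, ?)",
        (billing_day, file_name, unsorted_count, sorted_count),
    )


def totals_for_day(billing_day):
    # Several files can arrive on one billing day, so sum across the rows.
    return conn.execute(
        "SELECT SUM(unsorted_count), SUM(sorted_count) "
        "FROM daily_sorted_letter WHERE billing_day = ?",
        (billing_day,),
    ).fetchone()


# The same file recorded twice (a duplicate trigger), plus a second file.
record_counts("2018-03-01", "file-a.txt", 10, 5)
record_counts("2018-03-01", "file-a.txt", 10, 5)
record_counts("2018-03-01", "file-b.txt", 1, 2)
```

Because each row is keyed on the file, totals_for_day counts the duplicated file once, and the file_name column shows which file produced which count.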
SQLAlchemy handles escaping anything that could allow a SQL injection
attack. But it doesn’t escape the characters used for wildcard
searching. This is the reason we’re able to do `.like('%example%')`
at all.
But we shouldn’t be letting our users search with wildcard characters,
so we need to escape them. Which is what this commit does.
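A minimal sketch of the escaping idea, using sqlite3 so that it runs on its own. The escape_like_wildcards and search helpers and the templates table are hypothetical names for illustration, and backslash is assumed as the escape character. With SQLAlchemy, the escaped term would be passed to `.like(pattern, escape='\\')` in the same way.

```python
import sqlite3


def escape_like_wildcards(term, escape_char="\\"):
    # Escape the escape character first, otherwise the backslashes added
    # for % and _ would themselves be escaped again.
    for character in (escape_char, "%", "_"):
        term = term.replace(character, escape_char + character)
    return term


def search(conn, term):
    # We add the surrounding wildcards ourselves; anything the user typed
    # is matched literally.
    pattern = "%" + escape_like_wildcards(term) + "%"
    rows = conn.execute(
        "SELECT name FROM templates WHERE name LIKE ? ESCAPE '\\' ORDER BY name",
        (pattern,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE templates (name TEXT)")
conn.executemany(
    "INSERT INTO templates (name) VALUES (?)",
    [("100% free",), ("100 percent",), ("a_b",), ("axb",)],
)
```

Without the escaping step, a search for `%` or `_` would match every row rather than only the names that contain those characters.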
We were previously expecting the letter response files to be in the
format of 'NOTIFY.<datetime>.RSP.TXT' but the response files we receive
use '-' in the filenames instead of '.' which was causing an error when
we tried to get the date from the filename.
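As a rough sketch of a parser that tolerates either separator: the 14-digit YYYYMMDDHHMMSS timestamp layout, the example file name and the get_datetime_from_filename helper are assumptions for illustration, since the exact datetime format is not stated here.

```python
import re
from datetime import datetime

# Accept '-' or '.' between the parts; the timestamp layout is an assumption.
FILENAME_PATTERN = re.compile(r"^NOTIFY[.-](\d{14})[.-]RSP\.TXT$", re.IGNORECASE)


def get_datetime_from_filename(filename):
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError("Unexpected response file name: {}".format(filename))
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
```

Raising on an unexpected name keeps a malformed file from being silently assigned the wrong date.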
In the 'update-letter-notifications-statuses' task we want to ensure
that temporary failures are always handled, regardless of whether the
response file we receive contains unknown Sorted statuses or not.
Added the following new notification statuses:
* pending_virus_check
* virus_scan_failed
If we decide to remove these statuses in future, we will need to replace
them with a different status in the notifications and
Notification_history tables where they are referenced, so
pending-virus-check will be replaced with sending, and virus-scan-failed
will be replaced with permanent-failure.
It was overly optimistic to assume that 2G would be enough to handle 4 worker
processes, as they are already exhausting the 2G limit.
Depending on performance we may need to tweak the memory instead, or as well.
When a notification is timed out by the scheduled task and the service is expecting a status update, that update to the service would fail.
A test has been added.
* Removed extra log messages so there are not two log messages being
generated per exception, as InvalidRequest also logs
* Updated the InvalidRequest log message to include the exception type
and exception information
* Added extra asserts to ensure the exception messages are printed
* Deleted the statistics DAO
(this was used for the job statistics tasks)
* Deleted the functions in the jobs DAO which are no longer used
(the functions that were used for the job-stats endpoints)
These tasks are no longer being used, so they can be deleted safely:
* record_initial_job_statistics
* record_outcome_job_statistics
* timeout-job-statistics
The test file for the statistics tasks was deleted in a previous commit.
process_incomplete_jobs loops through jobs, processing them in a single
task. This means that if the job statuses are all 'error', and then the
process_incomplete_jobs task fails, the later jobs in the list that
never got picked up won't have their status set back to in progress, or
their processing_started time updated - so they will be stuck in 'error'
forever.
Instead, we set the job statuses to in progress and the start time to
now before we process any - so if the incomplete_jobs task fails later,
the jobs will be picked up (again) by the check_job_statuses task in
half an hour's time.
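The ordering described here can be sketched as follows. The Job dataclass, the status constants and the process_one callback are hypothetical stand-ins for illustration, not the real models or task code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

JOB_STATUS_ERROR = "error"
JOB_STATUS_IN_PROGRESS = "in progress"


@dataclass
class Job:
    id: str
    job_status: str = JOB_STATUS_ERROR
    processing_started: Optional[datetime] = None


def process_incomplete_jobs(jobs, process_one):
    # Reset every job up front, so a failure part-way through the loop
    # cannot leave the later jobs stuck in 'error'.
    now = datetime.now(timezone.utc)
    for job in jobs:
        job.job_status = JOB_STATUS_IN_PROGRESS
        job.processing_started = now

    # Only then do the actual work, one job at a time.
    for job in jobs:
        process_one(job)
```

If process_one raises on the second job, the third job has still been set to in progress with a fresh start time, so a later scheduled check can find it again.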
The process_incomplete_jobs task runs through all incomplete jobs in
a loop, so it might not get a chance to update the processing_started
time of the last job before check_job_status runs again (every minute).
So before we even trigger the process_incomplete_jobs task, let's set
the status of the jobs to error, so that we don't identify them for
re-processing again.
We might stop processing jobs mid-way through if, for example, a
deploy or downscale kills the box working on it. We have a scheduled
task that identifies any job that we started processing more than half
an hour ago that is still processing.
However, we encountered a bug where we triggered the
process_incomplete_job multiple times, because the processing_started
of the job was still set to half an hour ago. If we reset the
processing_started to the current time, then it won't get picked up by
future runs of the check_job_status scheduled task.
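A small sketch of why resetting processing_started stops the repeated triggering. Jobs are plain dicts here, and find_stale_jobs and mark_for_reprocessing are hypothetical helpers rather than the real DAO functions.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=30)


def find_stale_jobs(jobs, now):
    # A job is stale if it is still in progress and processing started
    # more than half an hour before `now`.
    return [
        job for job in jobs
        if job["job_status"] == "in progress"
        and job["processing_started"] < now - STALE_AFTER
    ]


def mark_for_reprocessing(job, now):
    # Resetting the start time moves the job out of the stale window,
    # so the next scheduled check will not select it again.
    job["processing_started"] = now


now = datetime(2018, 3, 1, 12, 0, tzinfo=timezone.utc)
jobs = [
    {"id": "old", "job_status": "in progress",
     "processing_started": now - timedelta(minutes=45)},
    {"id": "recent", "job_status": "in progress",
     "processing_started": now - timedelta(minutes=5)},
]
```

Calling find_stale_jobs a minute after mark_for_reprocessing no longer returns the "old" job, which is the behaviour the fix relies on.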
before sending it to template preview. This stops the whole pdf file
being sent to template preview for each page which is really inefficient
on network traffic and memory usage.
* Added logic to the endpoint to extract the specific page requested
* Updated tests to add a mock for the new call to utils
* Added a new test case for exceptions in the PDF extraction process
The problem has been resolved, but we need to replay the messages that are missing. We have been sent a file containing client_references for all the notifications that the service needs updates for.