Since we’ve doubled the number of rows in a job, jobs can take twice as
long to insert all the notifications. We don’t check for missing rows
until we’re pretty confident that the original tasks have finished
processing, which means we need to double the time we wait to be just
as confident.
The task was raising a JobIncompleteError, but it's not an error: the task is performing its job correctly and calling the appropriate task to restart the job.
Also used apply_async to create the task instead of send_task.
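For context, a minimal sketch of the difference between the two calls (task and queue names here are made up):

```python
from celery import Celery

app = Celery("notifications")  # broker config omitted

@app.task(name="restart-job")
def restart_job(job_id):
    ...  # re-queue the unfinished job

job_id = "some-job-id"  # placeholder

# send_task dispatches purely by name: a typo in the string only
# surfaces at runtime, on the worker.
app.send_task("restart-job", args=(job_id,), queue="job-tasks")

# apply_async goes through the registered task object, so the task
# must actually exist at the call site.
restart_job.apply_async(args=(job_id,), queue="job-tasks")
```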
- create-letters-pdf-tasks is for contacting template preview.
- letters-tasks is for doing other things around letters (including putting things on the template preview queue).
We were just ignoring the errors and our users were not fixing things.
Given that 500 texts cost approx £8, it's not the end of the world.
In the long run we may decide to stop letting people try to send
messages to TV numbers, but this is a quick fix to stop emails coming in
which we then ignore.
We don't need to log this as an exception. It's not an exception: it's
behaviour that is not ideal but still expected, so I've changed it to a
warning. This also removes the email we get for the exception, which is
not needed as we get the Zendesk ticket instead.
I've also fixed the multiline string, meaning the link to the runbook is
included in the Zendesk ticket.
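Roughly the shape of both fixes (logger and message text are illustrative, not the real ones):

```python
import logging

logger = logging.getLogger(__name__)
letter_id = "some-letter-id"  # placeholder

# Before: logger.exception(...), which also triggered the alert email.
# After: a warning, since this is expected (if undesirable) behaviour
# and the Zendesk ticket already covers it.
logger.warning("letter %s is in an unexpected state", letter_id)

# The multiline-string fix: with implicit concatenation, a literal
# that falls outside the parentheses is silently dropped from the
# message. The runbook link now sits inside them.
message = (
    "Letters are stuck, see the runbook for next steps: "
    "https://example.test/runbook"
)
```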
Use .format instead of concatenation to avoid type issues.
Trying to concatenate a uuid onto a string was throwing an error.
Also it seems it is not possible to use uuid in parametrize statements,
as it interferes with running tests on multiple threads (presumably
because each test runner generates different parametrize ids at
collection time).
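The underlying issue, sketched:

```python
import uuid

notification_id = uuid.uuid4()

# Raises: TypeError: can only concatenate str (not "UUID") to str
# url = "/notifications/" + notification_id

# .format (or str()) stringifies the UUID for us
url = "/notifications/{}".format(notification_id)
```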
it's not acceptable for a constantly failing provider to take 50 minutes
to drain (5x reducing priority by 10). But similarly, we need _some_
delay, or a handful of concurrent failures will completely turn off a
provider, rendering the whole exercise kinda pointless. Setting the
delay before it tries to reduce priority again to one minute is nice:
if one request times out and returns a 502, any other requests in
flight at that time will time out before the minute is up and won't
trigger another switch, while requests made after the switch that take
sixty seconds to time out still will.
it now looks at both providers and works out whether to deprioritise
one, rather than binary switching from one to the other. If anything
has altered the priorities in the last ten minutes it won't take any
action. If both providers are slow it also won't take any action.
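A minimal sketch of that decision logic (names, the slowness check and the priority semantics are assumptions based on the description above):

```python
from datetime import datetime, timedelta

PRIORITY_STEP = 10  # "reducing priority by 10"
COOLDOWN = timedelta(minutes=10)

def maybe_deprioritise(providers, now=None):
    """providers: objects with .priority, .is_slow and .updated_at."""
    now = now or datetime.utcnow()

    # If anything has altered the priorities in the last ten minutes,
    # take no action.
    if any(now - p.updated_at < COOLDOWN for p in providers):
        return

    slow = [p for p in providers if p.is_slow]

    # If both providers are slow (or neither is), also take no action.
    if len(slow) != 1:
        return

    # Deprioritise the slow provider instead of binary-switching;
    # assume here that a lower number means a lower priority.
    slow[0].priority -= PRIORITY_STEP
    slow[0].updated_at = now
```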
the function no longer makes sense now that we send through both
providers at the same time. It's mostly just used in old tests that
we'll end up rewriting shortly anyway.
There is a chance that there is an outstanding retry task that has yet to run, but the tasks that are replayed here protect against the task running twice. So this just means it might get sent sooner rather than later.
The assumption was that S3 would throw an exception if the object was uploaded twice. That's not the case: the default behaviour is that if a file already exists it will be overwritten. So it is completely safe to run the task from the alert.
It also means that we don't need to wait 4 hours 15 minutes. Shall I decrease the amount of time before restarting the task?
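To illustrate the S3 behaviour (bucket and key are made up):

```python
import boto3

s3 = boto3.client("s3")

# put_object does not fail when the key already exists: the second
# call simply overwrites the first object, which is what makes
# re-running the upload task from the alert idempotent.
s3.put_object(Bucket="letters-pdf", Key="2019-01-01/ref.pdf", Body=b"v1")
s3.put_object(Bucket="letters-pdf", Key="2019-01-01/ref.pdf", Body=b"v2")
```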
When we upload a CSV for a job, we add the sender_id as metadata to the file that is uploaded on S3.
There is more than one place where we process rows from that CSV.
- process_job
- scheduled_job
- check_for_missing_rows_in_completed_jobs
- check_job_status
All of these places need to use the sender_id; now the sender_id is always read from the file metadata.
In a subsequent PR we can remove the optional sender_id parameter from process_job task.
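Roughly how the metadata round-trip works (bucket name, key layout and helpers are illustrative):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "job-csv-uploads"  # made-up name

def upload_job_csv(service_id, job_id, csv_bytes, sender_id):
    # The sender_id travels with the file as S3 object metadata.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"service-{service_id}/{job_id}.csv",
        Body=csv_bytes,
        Metadata={"sender_id": sender_id},
    )

def get_job_sender_id(service_id, job_id):
    # Every code path that processes rows reads the sender_id back
    # from the metadata instead of taking it as a task argument.
    head = s3.head_object(
        Bucket=BUCKET, Key=f"service-{service_id}/{job_id}.csv"
    )
    return head["Metadata"].get("sender_id")
```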
Sometimes a job finishes but has missed a row in the middle. It is a mystery why this is happening; it could be that the task to save the notifications has been dropped.
So until we solve the mystery, let's find the missing rows and process them.
A new scheduled task has been added to find any "finished" jobs that do not have enough notifications created. If there are missing notifications, the task processes those rows for the job.
Adding the new task to beat schedule will be done in the next commit.
A unique key constraint has been added to Notifications to ensure that the row is not added twice. Any index or constraint can affect performance, but this unique constraint should not affect it enough for us to notice.
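Sketched as an Alembic migration (constraint and column names are assumptions):

```python
from alembic import op

def upgrade():
    # One notification per (job, row): replaying a row that was
    # already saved now fails instead of inserting a duplicate.
    op.create_unique_constraint(
        "uq_notifications_job_id_job_row_number",
        "notifications",
        ["job_id", "job_row_number"],
    )

def downgrade():
    op.drop_constraint(
        "uq_notifications_job_id_job_row_number",
        "notifications",
        type_="unique",
    )
```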
Now looking at the updated_at date: we were getting the alert if the notification was created at 17:29 but only updated to the created status at 17:30, so the letter is in the next day's bucket.
I'm not sure I want to make this change; there isn't an index on updated_at, so the query might be slow.
Added a scheduled task to run once a day and check whether any
letters from before 17:30 still have a status of 'created'. This
logs an exception instead of trying to fix the error, because the fix
will be different depending on which bucket the letter is in.
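The check is roughly this shape (model and helper names are illustrative, whether to filter on created_at or updated_at is the open question above, and the cut-off handling is simplified):

```python
import logging
from datetime import datetime, time, timedelta

logger = logging.getLogger(__name__)

def check_letters_still_in_created(session, Notification):
    # Letters from before yesterday's 17:30 cut-off should have moved
    # on from 'created' by now.
    cutoff = datetime.combine(
        datetime.utcnow().date() - timedelta(days=1), time(17, 30)
    )
    count = (
        session.query(Notification)
        .filter(
            Notification.notification_type == "letter",
            Notification.status == "created",
            Notification.created_at < cutoff,
        )
        .count()
    )
    if count:
        # Log only: the right fix depends on which bucket the letter
        # ended up in, so a human investigates.
        logger.exception("%s letters still in 'created' status", count)
```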
Added a task which runs twice a day on weekdays and checks for letters that have
been in the state of `pending-virus-check` for over 90 minutes. This is
just logging an exception for now, not trying to fix things, since we
will need to manually check where the issue was.
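The "twice a day on weekdays" schedule, as a Celery beat sketch (the task name and the times are made up):

```python
from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    "check-pending-virus-check-letters": {
        "task": "check-pending-virus-check-letters",
        # 09:00 and 15:00, Monday to Friday
        "schedule": crontab(hour="9,15", minute=0, day_of_week="mon-fri"),
    },
}
```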
We don't use FUNCTIONAL_TEST_PROVIDER_SERVICE_ID or
FUNCTIONAL_TEST_PROVIDER_SMS_TEMPLATE_ID anymore, so we can safely
delete them from config and tests.
Also test deleting jobs with flexible data retention.
Also update tests for default data retention following a logic
change: dao_get_jobs_older_than_data_retention now counts today
from the start of the day, not from the time the function runs,
and the tests have been updated to reflect that.
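The gist of the logic change (a sketch; the real function also takes per-service retention into account):

```python
from datetime import datetime, time, timedelta

def data_retention_cutoff(retention_days):
    # Count "today" from midnight rather than from the moment the
    # function runs, so every call on the same day agrees on the
    # cut-off that the jobs query filters against.
    start_of_today = datetime.combine(datetime.utcnow().date(), time.min)
    return start_of_today - timedelta(days=retention_days)
```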
When we first built letters you could only send them via a CSV upload, so initially we needed a way to send those files to DVLA per job.
We've since stopped using this page. So let's delete it!
There was a datetime bug in the query which resulted in files not being sent to the postal provider.
The trigger-letter-pdfs-for-day task is no longer needed, so rather than fix the query just call collate_letter_pdfs_for_day directly.
Less code is always better.
Deployment considerations: I realised this is not strictly backwards compatible if the scheduled job is in progress and a task that no longer exists is on the queue. This is ok since we will deploy this well before 17:50.