process_incomplete_jobs loops through jobs, processing them all in a
single task. This means that if the job statuses are all 'error' and
the process_incomplete_jobs task then fails part-way through, the
later jobs in the list that never got picked up won't have their
status set back to 'in progress' or their processing_started time
updated - so they will be stuck in 'error' forever.
Instead, we set the job statuses to 'in progress' and the start time
to now before we process any - so if the process_incomplete_jobs task
fails later, the jobs will be picked up (again) by the
check_job_statuses task in half an hour's time.
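A minimal sketch of the idea, assuming a SQLAlchemy `Job` model with
`job_status` and `processing_started` columns and the app's `db`
session (the real model, status values and session handling may
differ):

```python
from datetime import datetime, timezone

from app import db          # assumed application session
from app.models import Job  # assumed model


def process_incomplete_jobs(job_ids):
    jobs = Job.query.filter(Job.id.in_(job_ids)).all()

    # Mark every job as in progress up front, so that a failure
    # part-way through the loop below leaves the remaining jobs in a
    # state that check_job_statuses will pick up again in half an hour.
    for job in jobs:
        job.job_status = "in progress"
        job.processing_started = datetime.now(timezone.utc)
    db.session.commit()

    for job in jobs:
        process_incomplete_job(job.id)
```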
The process_incomplete_jobs task runs through all incomplete jobs in
a loop, so it might not get a chance to update the processing_started
time of the last job before check_job_status runs again (every
minute). So before we even trigger the process_incomplete_jobs task,
let's set the status of the jobs to 'error', so that we don't identify
them for re-processing again.
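A sketch of that ordering from the check_job_status side, with the
same assumed names (the queue name is also an assumption):

```python
from datetime import datetime, timedelta, timezone

from app import db          # assumed application session
from app.models import Job  # assumed model


def check_job_status():
    thirty_minutes_ago = datetime.now(timezone.utc) - timedelta(minutes=30)

    stuck_jobs = Job.query.filter(
        Job.job_status == "in progress",
        Job.processing_started < thirty_minutes_ago,
    ).all()

    # Flip the jobs to 'error' *before* triggering the task, so the
    # next run of this every-minute task doesn't select them again.
    for job in stuck_jobs:
        job.job_status = "error"
    db.session.commit()

    if stuck_jobs:
        process_incomplete_jobs.apply_async(
            args=[[str(job.id) for job in stuck_jobs]], queue="jobs"
        )
```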
We might stop processing jobs mid-way through if, for example, a
deploy or downscale kills the box working on them. We have a scheduled
task that identifies any job that we started processing more than half
an hour ago and that is still processing.
However, we encountered a bug where we triggered the
process_incomplete_job task multiple times, because the job's
processing_started was still set to half an hour ago. If we reset
processing_started to the current time, the job won't get picked up by
future runs of the check_job_status scheduled task.
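A sketch of the fix, with assumed names as before - bump
processing_started as soon as we pick the job up:

```python
from datetime import datetime, timezone


def process_incomplete_job(job_id):
    job = Job.query.get(job_id)

    # Reset the start time immediately, so future runs of the
    # check_job_status scheduled task no longer see this job as stuck.
    job.processing_started = datetime.now(timezone.utc)
    db.session.commit()

    # ... carry on re-processing the job's remaining rows ...
```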
The problem has been resolved, but we need to replay the messages
that are missing. We have been sent a file containing
client_references for all the notifications that the service needs
updates for.
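A minimal sketch of one way to replay them, assuming a plain-text file
with one client_reference per line and an existing task for re-sending
the status update (the file format, task and queue names are all
assumptions):

```python
from app.models import Notification  # assumed model


def replay_missing_callbacks(filename):
    with open(filename) as f:
        client_references = [line.strip() for line in f if line.strip()]

    for reference in client_references:
        notification = Notification.query.filter_by(
            client_reference=reference
        ).one()
        send_delivery_status_to_service.apply_async(
            args=[str(notification.id)], queue="service-callbacks"
        )
```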
It is possible to search for a phone number from the email
notification page and get an SMS message in return. Restricting the
search to the notification type of the page fixes this, and also helps
to optimise the query.
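A sketch of the kind of filter involved, with assumed model and column
names:

```python
from app.models import Notification  # assumed model


def search_notifications(service_id, search_term, notification_type):
    query = Notification.query.filter_by(service_id=service_id)

    # Only return notifications of the type of the page being viewed,
    # so a phone number searched from the email page can't match SMS.
    # Filtering on the type column also narrows the rows the
    # (expensive) text search has to scan.
    if notification_type is not None:
        query = query.filter_by(notification_type=notification_type)

    return query.filter(Notification.to.ilike(f"%{search_term}%")).all()
```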
Phone numbers sometimes contain things we normalise out. This works
perfectly if we have a full phone number, because we can normalise the
search term in the same way as the stored number.
With partial search terms we can’t do this completely, because we can’t
work out if ‘123’ is part of a UK number, an international number, the
start of the phone number, the last 3 digits, etc.
What we can do is remove some things that we know will cause partial
search terms not to match (sketched below):
- leading pluses
- leading `0`s
- any brackets
- any spaces
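A minimal sketch of that normalisation (the function name and exact
order of operations are illustrative):

```python
import re


def normalise_partial_search_term(term):
    # Remove any brackets and spaces first, so that e.g. '(0)20 7...'
    # becomes '0207...' before we strip the leading characters.
    term = re.sub(r"[()\s]", "", term)

    # Then drop any leading pluses and leading 0s.
    return term.lstrip("+0")
```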
Users expect the search to work on partial email addresses ‘similar to
Gov.Pay’. We can’t have a situation where Pay are doing something better
than us 😜
Sending the data to the task directly means we can remove the need to
request it from the database.
In order for the PR to be backwards compatible, I have added an
optional parameter, "encrypted_status_update": if this is not None,
the new code path is used.
The next PR will send the encrypted data to this task.
A final PR will remove the code that uses the database to get the
notification and the service callback API.
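A sketch of the backwards-compatible shape (the task, helper and
encryption names are assumptions based on the description above):

```python
from app import encryption, notify_celery  # assumed app objects


@notify_celery.task(name="send-delivery-status")  # assumed task name
def send_delivery_status_to_service(
    notification_id, encrypted_status_update=None
):
    if encrypted_status_update is not None:
        # New path: everything needed is in the task payload, so no
        # database queries are required.
        status_update = encryption.decrypt(encrypted_status_update)
        _post_status_update_to_callback_url(status_update)
    else:
        # Old path, kept so tasks already on the queue still work:
        # look the notification and callback URL up in the database.
        _send_using_database_lookup(notification_id)
```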
We have seen problems with the service callback workers due to the db
connection pool being exhausted. When the worker picks up the task, it
makes a db query to get the notification, a query to get the callback
url, and then closes the session before it makes the 3rd party request.
However, even closing the session before the (potentially lengthy)
web request wasn't enough - we've still seen significant numbers of
`sqlalchemy.exc.TimeoutError`s.
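For reference, a sketch of the pattern described above (helper and
model names assumed): two queries, an early session close, then the
slow third-party POST - which still wasn't enough under load:

```python
import requests

from app import db
from app.models import Notification  # assumed model


def send_callback(notification_id):
    notification = Notification.query.get(notification_id)  # query 1
    callback_api = get_service_callback_api(  # query 2, assumed helper
        notification.service_id
    )
    db.session.close()  # return the connection before the slow bit

    # The (potentially lengthy) third-party request - with enough
    # concurrent workers, the pool still ran out of connections.
    requests.post(
        callback_api.url,
        json=build_payload(notification),  # assumed helper
        timeout=60,
    )
```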
This reverts commit 2dfbd93c7e
The JobStatistics table is going to be deleted. There are currently
3 tasks which use the JobStatistics model via the Statistics DAO, so we
need to make sure these tasks are no longer being called before they
are deleted in a separate PR.
This commit deletes:
* The `create_initial_notification_statistic_tasks` function, which is
used to call the `record_initial_job_statistics` task.
* The `create_outcome_notification_statistic_tasks` function, which is
used to call the `record_outcome_job_statistics` task.
* The scheduling of the `timeout-job-statistics` scheduled task.