Before we had a long back off, now we have more, but shorter backoffs.
- PREVIOUS
When we had an error talking to a provider we retried quickly and if we still got errors we backed off more and more. Maximum attempts was 5, max delay 4hours. This was to allow us time to ship a build if that was required.
- NOW
Backing off 48 times of 5 minutes each. This gives us the same total backoff, but many more tries in that period.
- WHY
Having the long back off meant messages could be delayed 4 hours. This was happening more and more, as PaaS deploys can place things into the "inflight" state in SQS. The inflight state MUST have an expiry time LONGER than the maximum retry back off. This meant that messages would be delayed 4 hours, even when there was no app error.
By doing this we can reduce this delay to 5 minutes. Whilst still giving us time to fix issues.
- uses the reference field on the notifications table to store a 16char random string used to cross reference DVLA letters back to the notification
- used as letter barcode does not have space for a UUID notification id
Depends on https://github.com/alphagov/notifications-utils/pull/149
Renamed the numeric_id to notification_reference in utils and changed validation rules to match this
Note also the persist_notification method set "reference" to be "client_reference" which is confusing and they are different things, so fixed this too.
Problem: we were sending the first line of the address in the
`TO_NAME_2` field. This meant that they couldn’t do any PAF lookups with
it, because it wasn’t where they were expecting.
The first line of the address is the second line of what our users give
us. We need to give this to DVLA as the _third_ line of the address
output, which they call `TO_ADDRESS_LINE_1`.
- previously this was unbounded, so it got all jobs older then 7 days. In excess of 75,000 🔥
- this meant that the job took (a) a long time and (b) a lot memory and (c) doing the same thing every day
These changes mean that the job has a 2 day eligible window for jobs, minimising the number of eligible jobs in a run, whilst still retaining some leeway in event if it failing one night.
In principle the job runs early morning on a given day. The previous 7 days are left along, and then the previous 2 days worth of files are deleted:
so:
runs on
31st
30,29,28,27,26,25,24 are ignored
23,22 jobs here have files deleted
21 and earlier are ignored.
Whatever a user has entered for their service’s contact block should
appear in the right place in the file we give to DVLA.
The work to output in the right fields in the DVLA file has already been
done. We just weren’t passing it through. This commit passes it through.
Extends the test to make sure that the thing that builds each line of
the file is getting called with the right template, personalisation and
numeric ID. Will be helpful the more complicated the call to the
template gets.
when we made the change to async persist notifications, we forgot to
pass through api_key_id and key_type. in send_sms/email, for legacy
reasons, they default to None/KEY_TYPE_NORMAL, so regardless of what
your api key was set up as, we would send real messages!
TODO: Once the PaaS transition is complete and the task changes are
reverted, remove the api_key_id and key_type params from the send_*
tasks entirely, as those are only called from the csv job flow, and
don't need them