Currently "test_send_letter_notification_via_api" fails at the final
stage in create-fake-letter-response-file [^1]:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=6011): Max retries exceeded with url: /notifications/letter/dvla (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff95ffc460>: Failed to establish a new connection: [Errno 111] Connection refused'))
This only applies when running in Docker, so the default should still
be "localhost" for the Flask app itself.
[^1]: 5093064533/app/celery/research_mode_tasks.py (L57)
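A minimal sketch of the fix, assuming the hostname is made configurable via an environment variable (the variable name here is illustrative, not the real one):

```python
import os

# Default to "localhost" so the Flask app keeps working outside Docker;
# the Docker setup can override the host via an environment variable.
# DVLA_RESPONSE_HOST is an assumed name for illustration.
DVLA_RESPONSE_HOST = os.environ.get("DVLA_RESPONSE_HOST", "localhost")
DVLA_RESPONSE_URL = f"http://{DVLA_RESPONSE_HOST}:6011/notifications/letter/dvla"
```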
The callback timeout is currently 60 seconds, but we have had two
incidents in the past week where a connection error talking to a
service meant the request took up to 60 seconds to fail. When this
happens, a handful of these callbacks can completely hog the service
callback worker and build up a big queue of all the other service
callbacks.
5 seconds has been chosen because it is still a reasonable length of
time for a simple web request that should just be handing over a
little information for the service to store. It should be a
sufficient reduction to dramatically reduce this problem for the
moment. We're open to changing this number in the future based on how
it performs.
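A sketch of the change, assuming the callback is made with `requests` (the function and constant names are illustrative):

```python
import requests

CALLBACK_TIMEOUT_SECONDS = 5  # previously 60

def send_delivery_status_callback(url, payload):
    # A slow or unreachable endpoint now fails after 5 seconds instead
    # of tying up the callback worker for a full minute per request.
    response = requests.post(url, json=payload, timeout=CALLBACK_TIMEOUT_SECONDS)
    response.raise_for_status()
    return response
```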
Since September 2019 we've had to log on to production around once every
twenty days to restart the virus scan task for a letter. Most of the
time this is just a case of making sure the file is in the scan bucket,
and then triggering the task. If the file isn't in the scan bucket we'd
need to do some more manual investigation to find out exactly where the
file got stuck, but I can only remember times when it's been in the scan
bucket.
So if the file is in the scan bucket, we can just check that with code
and kick the task off automatically.
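The automation could look roughly like this (all names are illustrative; the real code would check S3 and trigger the actual Celery task):

```python
def maybe_restart_virus_scan(file_in_scan_bucket, trigger_scan_task, filename):
    # If the letter PDF is still in the scan bucket we can safely
    # re-trigger the virus scan task; otherwise a human still needs
    # to investigate where the file got stuck.
    if file_in_scan_bucket(filename):
        trigger_scan_task(filename)
        return True
    return False

# Demo with stand-ins for the S3 check and the Celery task:
triggered = []
restarted = maybe_restart_virus_scan(
    lambda name: name == "NOTIFY.stuck-letter.pdf",
    triggered.append,
    "NOTIFY.stuck-letter.pdf",
)
```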
This follows the pattern for invite emails where the admin app tells the
API which domain to use when generating the link.
This will start working once the admin change is merged:
- [ ] TBC
It won’t break anything if it’s merged before the admin change.
Daily volumes report: total volumes across the platform, aggregated by whole business day (bst_date).
Volumes by service report: total volumes per service, aggregated over the given date range.
NB: start and end dates are inclusive.
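The inclusivity of the date range can be sketched as a simple predicate:

```python
from datetime import date

def in_report_range(bst_date, start_date, end_date):
    # NB: both the start and end dates are inclusive.
    return start_date <= bst_date <= end_date
```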
If we have too many returned letters, we'll exceed SQS's maximum task
size of 256 KB. Cap it at 5000; this is probably a bit conservative,
but it follows the initial values we used when implementing this for
the collate-letters task[^1]. Also follow the pattern of compressing
the SQS payload to reduce it a little more.
[^1]: https://github.com/alphagov/notifications-api/pull/1536
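The compression pattern could be sketched like this (helper names are assumed; zlib plus base64 keeps the payload as plain text that is safe to put in an SQS message body):

```python
import base64
import json
import zlib

MAX_RETURNED_LETTERS_PER_TASK = 5000  # stays comfortably under SQS's 256 KB cap

def compress_payload(references):
    # Serialise, compress, then base64-encode so the result is plain
    # text that can travel inside an SQS message.
    raw = json.dumps(references).encode("utf-8")
    return base64.b64encode(zlib.compress(raw)).decode("ascii")

def decompress_payload(blob):
    return json.loads(zlib.decompress(base64.b64decode(blob)))
```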
We had an inbound number in the database with a value of ''. This
could happen if there are blank lines in the inbound numbers file
we use for the `insert-inbound-numbers` command. To avoid this
happening again, the command now calls `.strip()` on each line of the
file and only inserts a row if the result is truthy (i.e. not '').
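Sketched as a small helper (the real command reads the file differently, but the filtering is the same idea):

```python
def parse_inbound_numbers(file_contents):
    # Strip whitespace from each line and drop anything falsy, so a
    # blank line can no longer become an inbound number of ''.
    return [line.strip() for line in file_contents.splitlines() if line.strip()]
```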
- second class postage will go up by 2 pence, plus VAT
- international postage will go up by 7 pence, plus VAT
- first class postage will go down by 6 pence, plus VAT
The test was querying `FactNotificationStatus`, ordering the results
by bst_date and notification_type, and then checking the rows. However,
the bst_date and notification_type for each row are the same, so the
test could fail depending on the order the results came back in. By
ordering on notification_status instead, we can be sure of the order of
the results.
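The flakiness can be sketched in miniature: rows that tie on every sort key can come back in any order, so the test has to sort on a column whose values actually differ.

```python
rows = [
    {"bst_date": "2020-01-01", "notification_type": "sms", "notification_status": "delivered"},
    {"bst_date": "2020-01-01", "notification_type": "sms", "notification_status": "created"},
]

# Ordering on bst_date/notification_type leaves the order undefined;
# ordering on notification_status makes it deterministic.
ordered = sorted(rows, key=lambda row: row["notification_status"])
```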
This changes the scheduled task that raises an alert if letters are
still sending: the check now runs at 17:00 instead of 15:30. DVLA have
reported that our "monitoring is executing just before we actually mark
them as ‘despatched’ and send you the feedback files." and asked us to
make the check a little later.
We don't actually contact DVLA until the morning after the alert anyway,
so this won't affect the process of getting in touch with them.
This change will require Cronitor to be updated for the new time.
Currently we alert if a service wastes £16 of SMS. It may cost us
around that amount just to deal with the alert, especially if the
service refuses to clean up their data.
This bumps the threshold to something more alarming, which should
make it more reasonable to suspend the service if we can show that
they've already wasted public money. £160 seems like a reasonable
compromise between money they have already wasted and money they
could still waste.
Note: we previously compromised on 1000 [1] down from 63K [2]. I
think we can afford to go a little bit higher.
[1]: https://github.com/alphagov/notifications-api/pull/3234
[2]: https://github.com/alphagov/notifications-api/pull/3221
This makes a few changes to:
- Make local development consistent with our other apps. It's now
faster to start Celery locally since we don't try to build the
image each time - this is usually quick, but unnecessary.
- Add support for connecting to a local Redis instance. Note that
the previous suggestion of "REDIS = True" was incorrect as this
would be turned into the literal string "True".
I've also co-located and extended the recipes in the Makefile to
make them a bit more visible.
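The `REDIS = True` gotcha, sketched below: environment variables are always strings, so any non-empty value is truthy regardless of what it says. Comparing against an explicit sentinel value avoids the trap (the variable name here is illustrative):

```python
import os

# Any non-empty string is truthy, so a value of "False" would silently
# behave the same as "True" if used as a boolean directly.
assert bool("True") is True
assert bool("False") is True  # the trap

# Compare against an explicit value instead (name assumed):
os.environ["REDIS_ENABLED"] = "1"
redis_enabled = os.environ.get("REDIS_ENABLED") == "1"
```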
gunicorn doesn't pin eventlet, but functionally gunicorn==20.1.0
depends on eventlet<=0.30.2 due to a change in eventlet. Gunicorn have
fixed this compatibility issue but haven't released the fix yet. By
pinning to a git commit, we're able to bump eventlet up to 0.33,
resolving a security advisory. (Note that the advisory didn't actually
impact us, as it only affects websockets, but it was noisy and
distracting.)
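In requirements terms the pin looks roughly like this (the commit SHA is a placeholder, not the real one):

```
# requirements.txt sketch
gunicorn @ git+https://github.com/benoitc/gunicorn.git@<commit-sha>
eventlet==0.33.0
```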
Note - pip may have cached the old version of gunicorn. You may need to
run `pip install -r requirements.txt --no-cache-dir` to get the updated
version of gunicorn locally.
Before we implemented ‘cancel’, any updates to a broadcast went through
the admin app. This meant the admin app could clear the cache whenever
a broadcast was updated by a user performing an action.
Now that a broadcast can be updated without the admin app being involved
we have another place we need to clear the cache from.
If we don’t do this then the broadcast can look like it’s still going
even though it’s successfully been cancelled.
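A sketch of the extra invalidation (the cache key format and function names are assumptions for illustration):

```python
def cancel_broadcast(broadcast_id, delete_cache_key):
    # ... update the broadcast's status to cancelled ...
    # Now that the API can cancel a broadcast without the admin app,
    # it has to clear the cached copy itself, or the broadcast can
    # still look live after a successful cancellation.
    delete_cache_key(f"broadcast-message-{broadcast_id}")

deleted_keys = []
cancel_broadcast("1234", deleted_keys.append)
```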
With the changes we will be making to email branding in admin it's
useful to know the ID of the NHS email brand. This makes it easier to
show a preview of the branding.
This means that the ID will need to be consistent across environments,
so this changes the details of the NHS brand in the dev, preview and
staging environments to match the production data.
The migration
- Removes NHS branding from existing services
- Removes NHS branding from any orgs which had it as their default
- Deletes the NHS branding row if it exists
- Inserts a new NHS branding row with details which match those in
production
This does have the effect of removing NHS branding from people's local
and preview services and orgs (NHS branding doesn't exist on staging),
but this does not affect real teams. The NHS logo with a filename of
'1ac6f483-3105-4c9e-9017-dd7fb2752c44-nhs-blue_x2.png' exists in the
logo S3 bucket for all environments.
Because the `<reference>` field of a `cancel` message can contain an
arbitrary number of items, it’s possible for it to reference more than
one current alert.
In this case it is ambiguous which alert should be cancelled, so we
should raise a custom error.
This will help people know that they have to manually go into Notify and
figure out which alert(s) to cancel there.
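A sketch of the ambiguity check (the error class name and data shapes are assumptions):

```python
class MultipleCurrentAlertsError(Exception):
    """Raised when a cancel message's references match more than one current alert."""

def find_alert_to_cancel(references, current_alerts):
    # current_alerts maps reference -> alert id (shape assumed).
    matched = sorted({current_alerts[ref] for ref in references if ref in current_alerts})
    if len(matched) > 1:
        # Ambiguous: someone has to go into Notify and pick manually.
        raise MultipleCurrentAlertsError(matched)
    return matched[0] if matched else None

# One unambiguous match is fine; two distinct matches should raise.
single = find_alert_to_cancel(["ref-1"], {"ref-1": "alert-a", "ref-2": "alert-b"})
try:
    find_alert_to_cancel(["ref-1", "ref-2"], {"ref-1": "alert-a", "ref-2": "alert-b"})
    ambiguous_raised = False
except MultipleCurrentAlertsError:
    ambiguous_raised = True
```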
It is possible that, among the references Environment Agency give us for
which broadcast to cancel, there could be references for older, already
expired broadcasts.
This would be the case if someone cancelled a broadcast in Notify, then
issued another broadcast to the same area and later tried to cancel
that one. The Flood Warning Service has no way of knowing that the
first broadcast has already been cancelled in Notify, so it would
include its reference in the list of things to be cancelled.
We can stop this from happening by filtering out already-cancelled and
expired broadcasts before looking up which one should be cancelled.
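The filtering step might look like this (status values are assumed for illustration):

```python
def current_broadcasts(broadcasts):
    # Ignore broadcasts that are already cancelled or expired before
    # working out which one a cancel message refers to.
    return [b for b in broadcasts if b["status"] not in ("cancelled", "expired")]

candidates = current_broadcasts([
    {"reference": "ref-1", "status": "cancelled"},
    {"reference": "ref-2", "status": "broadcasting"},
    {"reference": "ref-3", "status": "expired"},
])
```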