This is to support a migration from Redislabs to PaaS native Redis,
allowing us to toggle between old and new using the env vars for
the instance - without needing to change the code.
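As a rough sketch of the idea (the env var name here is illustrative, not
necessarily the real config key), the app just reads whichever Redis URL the
environment provides, so swapping instances is purely an env var change:

    import os

    import redis

    # Hypothetical variable name: whichever instance this points at is the
    # one the app uses, so moving from Redislabs to PaaS Redis is just a
    # matter of re-pointing the env var - no code change needed.
    redis_url = os.environ.get("REDIS_URL", "redis://localhost:6379/0")
    redis_client = redis.from_url(redis_url)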
Fixes this error (in Admin):
File "/home/vcap/app/app/notify_client/notification_api_client.py", line 74, in send_precompiled_letter
return self.post(url='/service/{}/send-pdf-letter'.format(service_id), data=data)
File "/home/vcap/app/app/notify_client/__init__.py", line 59, in post
return super().post(*args, **kwargs)
File "/home/vcap/deps/0/python/lib/python3.9/site-packages/notifications_python_client/base.py", line 48, in post
return self.request("POST", url, data=data)
File "/home/vcap/deps/0/python/lib/python3.9/site-packages/notifications_python_client/base.py", line 64, in request
response = self._perform_request(method, url, kwargs)
File "/home/vcap/deps/0/python/lib/python3.9/site-packages/notifications_python_client/base.py", line 118, in _perform_request
raise api_error
notifications_python_client.errors.HTTPError: 500 - Internal server error
Due to this error (in API):
File "/home/vcap/app/app/service/send_notification.py", line 178, in send_pdf_letter_notification
raise e
File "/home/vcap/app/app/service/send_notification.py", line 173, in send_pdf_letter_notification
letter = utils_s3download(current_app.config['TRANSIENT_UPLOADED_LETTERS'], file_location)
File "/home/vcap/deps/0/python/lib/python3.9/site-packages/notifications_utils/s3.py", line 53, in s3download
raise S3ObjectNotFound(error.response, error.operation_name)
notifications_utils.s3.S3ObjectNotFound: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
I checked the DB to verify the letter does actually exist, i.e. this is
an instance of the problem we're fixing here.
We previously decided against this [^1] but the lambda alerts are
not a replacement for this one.
Even if both sets of alerts go off, we expect PagerDuty to aggregate
them into a single notification.
[^1]: 65c21f694c
It's worth noting this will break the Admin page for editing provider
priorities, which currently works in units of 10. We already have
work planned to fix this [^1] and it's not an immediate problem.
The automatic adjustment algorithm will continue to work properly
as it can cope with increments smaller than 10 [^2].
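For illustration only (a simplified sketch, not the actual dao logic), the
adjustment stays correct because it clamps to the 0-100 range rather than
assuming priorities are multiples of 10:

    # Simplified sketch, not the real provider_details_dao code.
    def adjust_priority(current_priority, adjustment):
        return max(0, min(100, current_priority + adjustment))

    assert adjust_priority(95, 10) == 100  # partial headroom is fine
    assert adjust_priority(3, -10) == 0    # so are values that aren't multiples of 10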
[^1]: https://www.pivotaltracker.com/story/show/181681739
[^2]: b145a29935/app/dao/provider_details_dao.py (L122)
The new alert happens earlier but is otherwise the same:
- We only create a ticket in Production.
- We only create a ticket on approval.
I took this opportunity to refactor the alert into a private function
and test it directly in detail, avoiding the many repetitive mocks
required when calling the main "update" function.
One test I haven't preserved was for when the "names" array is empty,
as this was added for a legacy data integrity scenario [^1].
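Roughly the shape this takes (all names here are illustrative): the alert
condition lives in a small private function the tests can call directly,
instead of going through the main "update" flow and mocking everything it
touches.

    # Illustrative sketch only - the real function and arguments will differ.
    def _alert_if_needed(environment, approved, create_ticket):
        # We only create a ticket in Production, and only on approval.
        if environment == "production" and approved:
            create_ticket()


    def test_alert_only_fires_in_production_on_approval():
        tickets = []
        _alert_if_needed("production", True, lambda: tickets.append("ticket"))
        _alert_if_needed("staging", True, lambda: tickets.append("ticket"))
        _alert_if_needed("production", False, lambda: tickets.append("ticket"))
        assert tickets == ["ticket"]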
[^1]: bf0bf4e31c
In response to: [^1].
The stacktrace conveys the same information and more. We don't do
anything different for each exception class, so there's no value
in having three of them over one exception.
I did think about DRYing-up the duplicate exception behaviour into
the base class. This isn't ideal because the base class would
be making assumptions about how inheriting classes make requests,
which might change with future providers. Although it might be nice
to have more info in the top-level message, we'll still get it in
the stacktrace e.g.
ValueError: Expected 'code' to be '0'
During handling of the above exception, another exception occurred:
app.clients.sms.SmsClientResponseException: SMS client error (Invalid response JSON)
requests.exceptions.ReadTimeout
During handling of the above exception, another exception occurred:
app.clients.sms.SmsClientResponseException: SMS client error (Request failed)
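A minimal sketch of the single-exception approach (the URL handling and
messages are illustrative): because the new exception is raised while
handling the original one, the underlying error still shows up in the
chained traceback, as in the examples above.

    import json

    import requests


    class SmsClientResponseException(Exception):
        pass


    def send_sms(url, payload):
        # Sketch only: one exception class for all failure modes; the
        # original error is preserved by Python's exception chaining.
        try:
            response = requests.post(url, json=payload, timeout=30)
            body = json.loads(response.text)
            if body.get("code") != "0":
                raise ValueError("Expected 'code' to be '0'")
        except requests.RequestException:
            raise SmsClientResponseException("SMS client error (Request failed)")
        except ValueError:  # also covers json.JSONDecodeError
            raise SmsClientResponseException("SMS client error (Invalid response JSON)")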
[^1]: https://github.com/alphagov/notifications-api/pull/3493#discussion_r837363717
This works in conjunction with the new SMS provider stub [^1].
Local testing:
- Run the migrations to add Reach as an inactive provider.
- Activate the Reach provider locally and deactivate the others.
update provider_details set priority = 100, active = false where notification_type = 'sms';
update provider_details set active = true where identifier = 'reach';
- Tweak your local environment to point at the SMS stub.
export REACH_URL="http://host.docker.internal:6300/reach"
- Start / restart Celery to pick up the config change.
- Send an SMS via the Admin app and see the stub log it.
- Reset your environment so you can send normal SMS.
update provider_details set active = true where notification_type = 'sms';
update provider_details set active = false where identifier = 'reach';
[^1]: https://github.com/alphagov/notifications-sms-provider-stub/pull/10
This avoids duplicating it as we add a new provider and means we
can test it all in one place (although it wasn't tested before).
I'm not sure why the previous code did "super(..).__init__" in a
non-init function - it's a bit late! - so I've just replaced it
with a call to the new "init_app" function in the parent class.
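A minimal sketch of the pattern, assuming names along these lines (not
verbatim from the codebase):

    # Sketch only: the parent class owns init_app, and a subclass extends it
    # instead of re-running __init__ after construction.
    class SmsClient:
        def init_app(self, current_app, statsd_client):
            self.current_app = current_app
            self.statsd_client = statsd_client


    class ReachClient(SmsClient):
        def init_app(self, current_app, statsd_client):
            super().init_app(current_app, statsd_client)
            self.url = current_app.config.get("REACH_URL")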
This reduces the code to copy when we add a new provider. I don't
think we need to log the URL or status code each time:
- The URL is always the same.
- A "200" status code is implicit in "success".
- Other status codes will be reported as exceptions.
Removing these specific elements means "record_outcome" is generic
and can be de-duplicated in the base class.
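Sketch of what the de-duplicated version can look like once those elements
are gone (illustrative, not the actual code):

    import logging

    logger = logging.getLogger(__name__)


    # Illustrative only: with no URL or raw status code, nothing here is
    # provider-specific, so it can live once in the base SmsClient.
    class SmsClient:
        name = "sms client"

        def record_outcome(self, success):
            if success:
                logger.info("Provider request for %s succeeded", self.name)
            else:
                logger.warning("Provider request for %s failed", self.name)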
This is never overridden and can't be used in practice because all
SMS clients have to use the same interface. Removing it will make
it possible to DRY-up some of the code in this method.
Previously we used a combination of "provider.name" and "get_name()"
which was confusing. Using a non-property function also gave me the
impression that the name was more dynamic than it actually is.
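For example (a hypothetical snippet), exposing the name as a property makes
it read as the fixed value it really is:

    # Hypothetical sketch: "client.name" instead of "client.get_name()".
    class ReachClient:
        @property
        def name(self):
            return "reach"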
This is enough to update a notification in the DB:
1. First create a notification in the UI and send it.
2. Then reset its attributes to pretend it's for Reach.
update notifications set
sent_at = null,
sent_by = null,
notification_status='sending'
where id='some-uuid';
3. Change "notification_id" to "<some-uuid>" in the code.
4. Call the boilerplate endpoint for Reach callbacks.
curl -X POST localhost:6011/notifications/sms/reach
Interestingly there's no foreign key constraint on "sent_by" in the
DB, so this just works: the notification is updated.
Fixes:
> reduced_provider = providers[identifier]
E KeyError: 'firetext'
Note that the mock return value in the other test was wrong [^1].
[^1]: bff97f0bbe/app/dao/provider_details_dao.py (L73)
There isn't anything we know we need to do now that we resolve stuck
letters automatically. Letters could still get into this state, so it's
worth alerting us. However, we don't have anything concrete that tells us
how to fix these letters, so we should just remove the runbook entirely.
Currently "test_send_letter_notification_via_api" fails at the final
stage in create-fake-letter-response-file [^1]:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=6011): Max retries exceeded with url: /notifications/letter/dvla (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xffff95ffc460>: Failed to establish a new connection: [Errno 111] Connection refused'))
This only applies when running in Docker, so the default should still
be "localhost" for the Flask app itself.
[^1]: 5093064533/app/celery/research_mode_tasks.py (L57)
It is currently 60 seconds, but we have had two incidents in the
past week where there was a connection error talking to a service
and the request took up to 60 seconds before failing. When this
happens, if there are a few of these callbacks then they completely
hog the service callback worker and build up a big queue of all the
other service callbacks.
5 seconds has been chosen because that is still a pretty decent length
of time for a simple web request that should just be giving them a
little bit of information to store. It should also be a sufficient
reduction to dramatically reduce this problem for the moment. Open to
this number being changed in the future based on how we see it perform.
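For example (the real call site passes more than this), the change is just
the shorter timeout on the callback request:

    import requests


    def send_service_callback(callback_url, payload):
        # Sketch only: a 5 second timeout (previously 60) means a slow or
        # unreachable service can't hog a callback worker for a full minute
        # per request.
        return requests.post(callback_url, json=payload, timeout=5)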
Since September 2019 we've had to log on to production around once every
twenty days to restart the virus scan task for a letter. Most of the
time this is just a case of making sure the file is in the scan bucket,
and then triggering the task. If the file isn't in the scan bucket we'd
need to do some more manual investigation to find out exactly where the
file got stuck, but I can only remember times when it's been in the scan
bucket.
So if the file is in the scan bucket, we can just check that with code
and kick the task off automatically.
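A rough sketch of that automation (the bucket name, key format and task
trigger are all assumptions): check the scan bucket, and only fall back to
manual investigation if the file isn't there.

    import boto3
    from botocore.exceptions import ClientError

    # Hypothetical bucket name.
    SCAN_BUCKET = "letters-scan"


    def restart_virus_scan_if_possible(filename, trigger_scan_task):
        # trigger_scan_task stands in for however the scan task is re-queued.
        s3 = boto3.client("s3")
        try:
            s3.head_object(Bucket=SCAN_BUCKET, Key=filename)
        except ClientError:
            # Not in the scan bucket: this case still needs manual investigation.
            return False

        trigger_scan_task(filename)
        return True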