This prevents a race condition where we get a delivery receipt before
updating the notification to sending: the sending status would then
supersede the delivered status, and the notification would time out
as temporary-failure after three days.
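Roughly the shape of the guard this implies (the function and the status set below are illustrative, not the exact code):

```python
from dataclasses import dataclass


@dataclass
class Notification:
    status: str


# Statuses a delivery receipt may already have put the notification into.
FINAL_STATUSES = {"delivered", "permanent-failure", "temporary-failure"}


def mark_notification_as_sending(notification):
    # If a receipt has already arrived, keep its status rather than
    # overwriting it with 'sending' and letting the notification time out.
    if notification.status in FINAL_STATUSES:
        return notification
    notification.status = "sending"
    return notification


print(mark_notification_as_sending(Notification(status="delivered")).status)  # 'delivered'
```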
As we gradually move from statsd to prometheus, we change the metric to
be a prometheus metric rather than a statsd one.
The change worth pointing out is that we have dropped the 'successful'
and 'failed' statuses from the metrics. I don't think it's useful to
have these statuses. It's very rare for an inbound message to fail when
we receive it, and when it does we raise an error and see it in our
logs. We aren't going to be looking at a graph of it as it's a rare
event, not typical behaviour that we want to monitor with a graph.
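As an illustration of the simpler metric (the name and the plain prometheus_client usage here are mine, not necessarily what is in the code):

```python
from prometheus_client import Counter

# One plain counter, with no 'successful'/'failed' label.
INBOUND_SMS_RECEIVED = Counter(
    "inbound_sms_received",
    "Number of inbound SMS received",
)


def receive_inbound_sms(message):
    INBOUND_SMS_RECEIVED.inc()
    # ...persist the message; the rare failure raises and shows up in the logs
```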
This was broken because sometimes `service.permissions` is a list of
strings (for when we’re caching the service object) and sometimes it’s a
list of permission objects (when we’re dealing with ORM objects).
Because the validator code is shared, the least-messy way to fix it is
to make sure it can handle both types.
It can’t just take a list of permissions as an argument, because it uses
other fields on the service.
It would be messy to rewrite the endpoint to use a serialised service
because the tests all expect to be dealing with database objects, so it
would be a faff to change what they’re mocking.
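Something like this normalisation, assuming the ORM objects expose the permission name on a `permission` attribute (the helper name is illustrative):

```python
def permission_names(service):
    # service.permissions is sometimes a list of strings (cached, serialised
    # service) and sometimes a list of ORM permission objects.
    return [
        p if isinstance(p, str) else p.permission
        for p in service.permissions
    ]
```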
This commit turns off StatsD metrics for the following
- the `dao_create_notification` function
- the `user-agent` metric
- the response times and response codes per flask endpoint
This has been done so that creating a text message or email no longer
calls out to StatsD during the request process. These are the three
metrics currently called while processing one of those requests, and so
they have been removed from the API.
The reason for removing the calls out to StatsD when processing a
request to send a notification is that we have seen two recent incidents
affected by DNS resolution for StatsD (either a slowdown in resolution
time or a failure to resolve). These POST requests are our most critical
code path and we don't want them to be affected by any unforeseen
trouble with StatsD DNS resolution.
We are not going to miss the removal of these metrics.
- the `dao_create_notification` metric is rarely/never looked at
- the `user-agent` metric is rarely/never looked at and can be got from
our logs if we want it
- the response times and response codes per flask endpoint are already
exposed using the gds metrics python library
I did not remove the statsd metrics from any other parts of the API
because
- As the POST notification endpoints are the main source of web traffic,
we will already have removed most calls to StatsD, which should
greatly reduce the chance of there being any further issues with
DNS resolution
- Some of the other metrics still provide value, so there's no point
deleting them if we don't need to
- The metrics on celery tasks will not affect any HTTP requests from
users as the tasks are async. We also do not currently have the
infrastructure set up to replace them with a prometheus HTTP endpoint
that can be scraped, so removing them would require more work
Because the IDs of our callback and inbound SMS APIs were stored in
lists instead of directly on the serialised model, they weren’t getting
cast to a string before trying to JSONify them. And JSON doesn’t know
what to do with a UUID object.
For some reason this was only affecting the endpoint for fetching
inbound SMS.
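For example (the key name here is illustrative):

```python
import json
import uuid

callback_api_id = uuid.uuid4()

# This raises TypeError: Object of type UUID is not JSON serializable:
# json.dumps({"service_callback_api": [callback_api_id]})

# The IDs inside the list need casting to strings first:
json.dumps({"service_callback_api": [str(callback_api_id)]})
```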
The API needs these to check whether a service can send a notification.
This commit also updates all the tests in `test_validators.py` to take
a serialised service, not a database object.
Years ago we started to implement a way to schedule a notification. We hit a problem but we never came up with a good solution and the feature never made it back to the top of the priority list.
This PR removes the code for `scheduled_for`. There will be another PR to drop the scheduled_notifications table and remove the schedule_notifications service permission.
Unfortunately, I don't think we can remove the `scheduled_for` attribute from the notification.serialized method because our clients might fail if something is missing. For now I have left it in but defaulted the value to None.
Same as we’re doing for templates.
This means avoiding a database call, even for services that don’t hit
our API so often.
They’ll still need to go to the database for the API keys, because we’re
not comfortable putting the API key secrets in Redis.
But once a service has got its keys from the database we commit the
transaction, so the connection can be freed up until we need it again to
insert the notification.
Same as we’ve done for templates.
For high volume services this should mean avoiding calls to external
services, either the database or Redis.
TTL is set to 2 seconds, so that’s the maximum time it will take for
revoking an API key or renaming a service to propagate.
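Roughly this shape of read-through cache (the names below are illustrative, not the exact code):

```python
import json

SERVICE_CACHE_TTL = 2  # seconds: maximum staleness after revoking a key or renaming a service


def get_serialised_service(service_id, redis_client, fetch_service_from_db):
    # Serve from Redis when we can; otherwise hit the database and cache
    # the serialised service for a couple of seconds.
    cache_key = f"service-{service_id}"
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    serialised = fetch_service_from_db(service_id)
    redis_client.set(cache_key, json.dumps(serialised), ex=SERVICE_CACHE_TTL)
    return serialised
```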
Some of the tests created services with the same service ID. This
caused intermittent failures because the cache relies on unique service
IDs (like we have in the real world) to key itself.
By serialising these straight away we can:
- not go back to the database later, potentially closing the connection
sooner
- potentially cache the serialised data, meaning we don’t touch the
database at all
For some reason our V1 get template response wraps the whole template in
a dictionary with one key, `'data'`:
0d99033889/app/template/rest.py (L166)
That means when the admin app caches the response it also caches it in
this format.
The API needs to do the same, otherwise it will be caching data with a
schema that the admin app isn’t expecting, and vice-versa.
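So the value both apps put in Redis looks something like this (field values illustrative):

```python
import json

template = {"id": "6ce4...", "name": "Reminder", "content": "Hello ((name))"}

# The V1 response shape, wrapped in a single 'data' key, is also the cached shape.
cache_value = json.dumps({"data": template})
```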
Content and subject are user-submitted so are effectively unbounded in
size. And we’re serialising them for every template when sending the
list of templates to the admin app.
For the service with the most templates this results in a 1.3Mb blob of
JSON going over the wire, and being cached in Redis.
And then the admin app completely ignores these fields, because it doesn’t
show template content until you’ve clicked into a single template.
This commit adds a new query parameter, `detailed`, that the admin app
can set to `False`. When it does only the fields needed to render the
`/templates` page are returned.
This is done with a new parameter so as not to break the V1 API.
Although I looked in Kibana and it doesn’t seem like anyone external is
using this endpoint we’ve come this far without breaking the API so…
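A rough sketch of how the new parameter behaves (the route, field names and DAO call below are illustrative, not the actual code):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/service/<uuid:service_id>/template", methods=["GET"])
def get_templates_for_service(service_id):
    # 'detailed' defaults to True so existing V1 clients see no change; the
    # admin app passes ?detailed=False to drop content and subject.
    detailed = request.args.get("detailed", "True") == "True"
    templates = get_templates_from_db(service_id)
    if not detailed:
        keys = ("id", "name", "template_type", "folder", "updated_at")
        templates = [{k: t.get(k) for k in keys} for t in templates]
    return jsonify(data=templates)


def get_templates_from_db(service_id):
    # Placeholder so the sketch is self-contained; the real endpoint queries the database.
    return []
```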
This commit changes the code in the post notification endpoint to handle
a serialised template (ie a `dict`) rather than a database object.
This is the first step towards being able to cache the template and not
hit the database on every request.
There should be no functional changes here, it’s just refactoring.
There are some changes to the tests where the signature of functions
has changed.
Importing of the template schema has to be done at a function level,
otherwise Marshmallow gets weird.
This commit also copies the `JSONModel` class from the admin app, which
turns serialised data (a dict made from JSON) into an object on which
certain predefined properties are allowed.
This means we can still do the caching of serialised data, without
having to change too much of the code in the app, or make it ugly by
sprinkling dict lookups everywhere.
We’re not copying all of JSONModel from the admin app, just the bits we
need. We don’t need to compare or hash these objects, they’re just used
for lookups. And redefining `__getattribute__` scares Leo.
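The idea, in a minimal sketch (the property names are illustrative, and the real class has more to it):

```python
class JSONModel:
    # Exposes a fixed set of keys from a dict of serialised data as attributes.
    ALLOWED_PROPERTIES = set()

    def __init__(self, _dict):
        self._dict = _dict or {}
        for prop in self.ALLOWED_PROPERTIES:
            setattr(self, prop, self._dict.get(prop))


class SerialisedTemplate(JSONModel):
    ALLOWED_PROPERTIES = {"id", "content", "subject", "template_type", "version"}


template = SerialisedTemplate({"id": "abc", "template_type": "sms", "content": "hi"})
print(template.template_type)  # 'sms', without dict lookups sprinkled through the calling code
```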
We were checking this separately in two places in the code. Now
we will have this logic in one place, in validators.
Also pull in the utils version that recognises crown dependency numbers
as international.
The constraints on notifications and notification_history have already
been dropped in production, but still exist in staging and in dev
environments.
The constraints on templates and templates_history exist in all
environments.
We did not have a JSON schema for updating a template. Since we will
remove the postage constraint from the templates table, this adds a JSON
schema for updating a template so that we can use it to check that the
postage is one of the allowed values.
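Something along these lines (the allowed postage values here are an assumption, and the real schema covers the other template fields too):

```python
from jsonschema import validate

post_update_template_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "description": "Schema for updating a template",
    "type": "object",
    "properties": {
        "postage": {"enum": ["first", "second"]},
    },
}

# Passes for an allowed value, raises ValidationError for anything else.
validate({"postage": "first"}, post_update_template_schema)
```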
In the previous PR I removed the `update_notification` method to reduce the need for another update query. However, that meant the notification was marked as delivered without an updated_at timestamp.
It is weird to set the updated_at when we create the notification. So is this a better fix? Or do I put the update back now?
I recommend we push this fix now.
By default Marshmallow includes unknown properties. This means every time
a new property is added to the service model it gets included in the
JSON-serialised response sent to the admin app.
This is particularly bad because it means the response includes the
ID of every returned letter. So the JSON stored in Redis for the
Check Your State Pension service is 86kb.
Similarly the JSON stored in Redis for a big user of inbound text
messaging is 458kb(!!!) because it has the ID of every received text
message. That’s ~8,500 UUIDs.
Luckily the admin app tells us exactly which keys it’s using here:
5952d9c26d/app/models/service.py (L31-L52)
- `active`
- `contact_link`
- `email_branding`
- `email_from`
- `id`
- `inbound_api`
- `letter_branding`
- `letter_contact_block`
- `message_limit`
- `name`
- `prefix_sms`
- `research_mode`
- `service_callback_api`
- `volume_email`
- `volume_sms`
- `volume_letter`
- `consent_to_research`
- `count_as_live`
- `go_live_user`
- `go_live_at`
Plus these which it does not get automatically:
- `email_branding`
- `letter_branding`
- `organisation`
- `organisation_type`
- `permissions`
- `restricted`
The API is returning all of these:
- `active`
- `all_template_folders`
- `annual_billing`
- `consent_to_research`
- `contact_link`
- `contact_list`
- `count_as_live`
- `created_by`
- `crown`
- `email_branding`
- `email_from`
- `go_live_at`
- `go_live_user`
- `id`
- `inbound_api`
- `inbound_number`
- `inbound_sms`
- `letter_branding`
- `letter_contact_block`
- `letter_logo_filename`
- `message_limit`
- `name`
- `organisation`
- `organisation_type`
- `permissions`
- `prefix_sms`
- `rate_limit`
- `research_mode`
- `restricted`
- `returned_letters`
- `service_callback_api`
- `users`
- `version`
- `volume_email`
- `volume_letter`
- `volume_sms`
- `whitelist`
So the ones that the admin is getting but not expecting are:
- `all_template_folders`
- `annual_billing`
- `contact_list`
- `created_by`
- `crown`
- `inbound_number`
- `inbound_sms`
- `letter_logo_filename`
- `rate_limit`
- `returned_letters`
- `users`
- `version`
- `whitelist`
Which is what this PR adds to the exclude list, except for `created_by`,
which it keeps because it’s needed to validate the JSON provided when
creating a service.
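As a simplified sketch of the approach (a handful of stand-in fields; the real schema gets its fields from the service model):

```python
from marshmallow import Schema, fields


class ServiceSchema(Schema):
    id = fields.String()
    name = fields.String()
    returned_letters = fields.List(fields.String())
    users = fields.List(fields.String())

    class Meta:
        # Anything excluded never makes it into the serialised (and cached) JSON.
        exclude = ("returned_letters", "users")


print(ServiceSchema().dump({"id": "1", "name": "Cake Service", "returned_letters": ["a"], "users": ["b"]}))
# {'id': '1', 'name': 'Cake Service'}
```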
After the commit we issue two calls to the db, to get the service and to get the notification. This is because after the commit the ORM wants to ensure that the data model objects are up to date.
So far this is just a proof of concept, but the letter flow needs to be updated and we should be able to get rid of research mode. And it needs some tidying up.
We have one global metrics variable, `metrics = GDSMetrics()`, and we
then call `metrics.init_app` from within the flask application set-up.
The v2/test_errors.py app_for_test fixture calls create_app, which would
call metrics.init_app multiple times for the same metrics instance. This
causes errors, so change the fixture to session scope so it only calls
create_app once per test run.
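Roughly (create_app below is a stand-in for however the fixture builds the app):

```python
import pytest
from flask import Flask


def create_app():
    # Stand-in for the real factory, which also calls metrics.init_app on the
    # shared GDSMetrics instance.
    return Flask("test")


@pytest.fixture(scope="session")
def app_for_test():
    # Session scope means create_app (and therefore metrics.init_app) runs
    # once per test run, not once per test.
    return create_app()
```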
us not recognising a code or provider not having sent the detailed
status.
It seems like Firetext is sometimes sending us permanent-failure
without detailed status. It could be due to:
- them really not sending any detailed status
- them sending a status code we don't recognise
- them sending the 000 code that means 'no errors', which we ignore
To see which one it is, and to debug such issues quicker in the
future, this PR adds status and detailed status codes to the logs.
Also log the detailed delivery status for Firetext in the same place, in
addition to it being logged from notifications_dao.
Logging detailed delivery statuses will help us see why messages
fail to deliver. In the future we could persist detailed delivery
status in the database.
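The kind of log line this adds, roughly (the function and mapping names are illustrative; '000' meaning 'no errors' is from above):

```python
import logging

logger = logging.getLogger(__name__)

FIRETEXT_CODES_WE_RECOGNISE = {"000": "No errors"}  # the real mapping has more codes


def handle_firetext_callback(reference, status, detailed_status_code):
    # Logging both codes lets us tell apart "no detailed status was sent" from
    # "a detailed status was sent but we don't recognise the code".
    logger.info(
        "Firetext callback for reference %s: status %s, detailed status code %s (%s)",
        reference,
        status,
        detailed_status_code,
        FIRETEXT_CODES_WE_RECOGNISE.get(detailed_status_code, "unrecognised or missing"),
    )
```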
This copies what we do to a user's email address when archiving the user
by prefixing it with `_archived_{date}`. We already prefixed the
service name and email_reply_to with `_archived`, but this didn't allow
a service with the same name to be archived more than once.
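For illustration (the helper name is made up; the point is the date in the prefix):

```python
from datetime import date


def archived_value(original_value):
    # Including the date means a service with the same name can be archived
    # again later without the two archived rows clashing.
    return f"_archived_{date.today().isoformat()}_{original_value}"


# e.g. '_archived_2021-06-01_Cake Service'
```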