It’s a UUID column, but by default Marshmallow wants to select the id
from the users table, not from the templates table, because the two
are foreign-keyed.
Adding the property explicity like this forces it to select from the
`created_by_id` column, but still serialises it to the `created_by`
field to avoid any breaking change.
Content and subject are user-submitted so are effectively unbounded in
size. And we’re serialising them for every template when sending the
list of templates to the admin app.
For the service with the most templates this results in a 1.3Mb blob of
JSON going over the wire, and being cached in Redis.
And then the admin app completely ignores these fields, because it does
show template content until you’ve clicked into a single template.
This commit adds a new query parameter, `detailed`, that the admin app
can set to `False`. When it does only the fields needed to render the
`/templates` page are returned.
This is done with a new parameter so as not to break the V1 API.
Although I looked in Kibana and it doesn’t seem like anyone external is
using this endpoint we’ve come this far without breaking the API so…
This commit changes the code in post notification endpoint to handle a
serialised template (ie a `dict`) rather than a database object.
This is the first step towards being able to cache the template and not
hit the database on every request.
There should be no functional changes here, it’s just refactoring.
There are some changes to the tests where the signature of functions
has changed.
Importing of the template schema has to be done at a function level,
otherwise Marshmallow gets weird.
This commit also copies the `JSONModel` class from the admin app, which
turns serialised data (a dict made from JSON) into an object on which
certain predefined properties are allowed.
This means we can still do the caching of serialised data, without
having to change too much of the code in the app, or make it ugly by
sprinkling dict lookups everywhere.
We’re not copying all of JSONModel from the admin app, just the bits we
need. We don’t need to compare or hash these objects, they’re just used
for lookups. And redefining `__getattribute__` scares Leo.
We were checking this separately in two places in the code. Now
we will have this logic in one place, in validators.
Also pull in utils version that recognises crown depenency numbers
as international.
We did not have a JSON schema for updating a template. Since we will
remove the postage constraint from the templates table, this adds a JSON
schema for updating a template so that we can use it to check that the
postage is one of the allowed values.
see https://github.com/alphagov/notifications-utils/pull/752 for
implementation details. Cache DNS for 15 seconds. Note: This cache is
per eventlet, so concurrent requests will handle their requests
separately, but the cache does persist between sequential requests.
re-enable statsd for all environments.
In the previous PR I removed the `update_notification` method to reduce the need for another update query. However, that meant the notification was marked as delivered without an updated_at timestamp.
It is weird to set the updated_at when we create the notification. So is this a better fix? Or do I put the update back now?
I recommend we push this fix now.
We have seen bad latency across all apps in the last week. After running
a canary with statsd turned off we have seen these issues dissapear on
that canary instance. We therefore will turn off statsd for all
instances of the API. We will still need to investigate tomorrow what
exactly changed or is causing the issue with statsd as we hadn't made
any changes to it ourself.
Note, this keeps statsd on for all the other apps but this will be a
first step. We will also need to check performance of the other apps
after releasing this.
By default Marshallow includes unknown properties. This means every time
a new property is added to the service model it gets included in the
JSON-serialised response sent to the admin app.
This is particuarly bad because it means that for returned letters the
ID of every returned letter. So the JSON stored in Redis for the
Check Your State Pension service is 86kb.
Similarly the JSON stored in Redis for a big user of inbound text
messaging is 458kb(!!!) because it has the ID of every received text
message. That’s ~8,500 UUIDs.
Luckily the admin app tells us exactly which keys it’s using here:
5952d9c26d/app/models/service.py (L31-L52)
```python
- `active`
- `contact_link`
- `email_branding`
- `email_from`
- `id`
- `inbound_api`
- `letter_branding`
- `letter_contact_block`
- `message_limit`
- `name`
- `prefix_sms`
- `research_mode`
- `service_callback_api`
- `volume_email`
- `volume_sms`
- `volume_letter`
- `consent_to_research`
- `count_as_live`
- `go_live_user`
- `go_live_at`
}
```
Plus these which it does not get automatically:
- `email_branding`
- `letter_branding`
- `organisation`
- `organisation_type`
- `permissions`
- `restricted`
The API is returning all of these:
- `active`
- `all_template_folders`
- `annual_billing`
- `consent_to_research`
- `contact_link`
- `contact_list`
- `count_as_live`
- `created_by`
- `crown`
- `email_branding`
- `email_from`
- `go_live_at`
- `go_live_user`
- `id`
- `inbound_api`
- `inbound_number`
- `inbound_sms`
- `letter_branding`
- `letter_contact_block`
- `letter_logo_filename`
- `message_limit`
- `name`
- `organisation`
- `organisation_type`
- `permissions`
- `prefix_sms`
- `rate_limit`
- `research_mode`
- `restricted`
- `returned_letters`
- `service_callback_api`
- `users`
- `version`
- `volume_email`
- `volume_letter`
- `volume_sms`
- `whitelist`
So the ones that the admin is getting but not expecting are:
- `all_template_folders`
- `annual_billing`
- `contact_list`
- `created_by`
- `crown`
- `inbound_number`
- `inbound_sms`
- `letter_logo_filename`
- `rate_limit`
- `returned_letters`
- `users`
- `version`
- `whitelist`
Which is what this PR adds to the exclude list, except for `created_by`
which is keeps because it’s needed to validate the JSON provided when
creating a service.
After the commit we issue two calls to the db to get service and get notification. This is because after the commit the ORM wants to ensure that the data model objects are the latest.
So far this is just a proof of concept, but the letter flow needs to be updated and we should be able to get rid of research mode. And it needs some tidy up.
Two new metrics:
auth_db_connection_duration_seconds (histogram)
wraps the first DB call of post notifications. This includes waiting
to get a connection from the pool, and also making the actual request
to the db to retrieve the service and api keys. (i'm not sure there's
an easy way to separate these two things)
post_notification_json_parse_duration_seconds
wraps parsing the v2 post notifications json parsing and schema
validation. Shouldn't include any async code
we have one global metrics variable `metrics = GDSMetrics()`, and we
then call `metrics.init_app` from within the flask application set up.
The v2/test_errors.py app_for_test fixture calls create_app, would call
metrics.init_app multiple times for the same metrics instance. This
causes errors, so change the fixture to session level so it only calls
once per test run.
grab the worker app name and task name rather than the web host and
endpoint. also add a fallback for if we're not in a web request or a
celery task. I think that'll probably happen when we use alembic, or if
we do things from within flask shell
add the following prometheus metrics to keep track of general app
instance health.
# sqs_apply_async_duration
how long does the actual SQS call (a standard web request to AWS) take.
a histogram with default bucket sizes, split up per task that was
created.
# concurrent_web_request_count
how many web requests is this app currently serving. this is split up
per process, so we'd expect multiple responses per app instance
# db_connection_total_connected
how many connections does this app (process) have open to the database.
They might be idle.
# db_connection_total_checked_out
how many connections does this app (process) have open that are
currently in use by a web worker
# db_connection_open_duration_seconds
a histogram per endpoint of how long the db connection was taken from
the pool for. won't have any data if a connection was never opened.
To check if it was our congestion point for the soak test.
(Email stub is a bit slower to respond than Amazon SES, and
the difference could lead to us having to many connections to db open
as we wait for response from the stub)
If we set an environment variable, we can stub out calls to SES and send
them to our own stub app. If the environment variable is not set, things
work as normal.
To be used alongside
https://github.com/alphagov/notifications-email-provider-stub
us not recognising a code or provider not having sent the detailed
status.
It seems like Firetext is sometimes sending us permanent-failure
without detailed status. It could be due to:
- them really not sending any detailed status
- them sending a status code we don't recognise
- them sending 000 code that means 'no errors', which we ignore
To see which one it is, and to debug such issues quicker in the
future, this PR adds status and detailed status codes to the logs.
Also log detailed delivery status for firetext in the same place in addition
to it being logged from notifications_dao.
Logging detailed delivery statuses will help us see why messages
fail to deliver. In the future we could persist detailed delivery
status in the database.
we're using statsd to monitor how long provider requests are taking.
However, there's lots of busy work that happens inside our statsd
metrics timing window. Things like json dumping and loading, building
headers, exception handling, etc.
for firetext/mmg, the response object from requests has an elapsed
property [1], which captures from sending raw data to parsing the
response headers. for ses, it's a bit trickier, but boto3 exposes a few
event hooks [2]. it's hard to find them without stepping through the
code, but the interesting ones are before-call, after-call,
after-call-error, request-created, and response-received. The
before-call and after-call involve some marshalling, built-in retrying,
etc, while request-created and response-received are much lower level.
They might be called more than once per ses request, if boto3 itself
retries the request on 5xx, 429 and low level socket errors [3].
Add these as new `raw-request-time` metrics rather than overwriting to
avoid changing the meaning of an existing metric, and to let us compare
the metrics to see if there's a noticeable difference at all
[1] https://requests.readthedocs.io/en/master/api/#requests.Response.elapsed
[2] https://boto3.amazonaws.com/v1/documentation/api/latest/guide/events.html
[3] https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html#legacy-retry-mode