By default Marshallow includes unknown properties. This means every time
a new property is added to the service model it gets included in the
JSON-serialised response sent to the admin app.
This is particuarly bad because it means that for returned letters the
ID of every returned letter. So the JSON stored in Redis for the
Check Your State Pension service is 86kb.
Similarly the JSON stored in Redis for a big user of inbound text
messaging is 458kb(!!!) because it has the ID of every received text
message. That’s ~8,500 UUIDs.
Luckily the admin app tells us exactly which keys it’s using here:
5952d9c26d/app/models/service.py (L31-L52)
```python
- `active`
- `contact_link`
- `email_branding`
- `email_from`
- `id`
- `inbound_api`
- `letter_branding`
- `letter_contact_block`
- `message_limit`
- `name`
- `prefix_sms`
- `research_mode`
- `service_callback_api`
- `volume_email`
- `volume_sms`
- `volume_letter`
- `consent_to_research`
- `count_as_live`
- `go_live_user`
- `go_live_at`
}
```
Plus these which it does not get automatically:
- `email_branding`
- `letter_branding`
- `organisation`
- `organisation_type`
- `permissions`
- `restricted`
The API is returning all of these:
- `active`
- `all_template_folders`
- `annual_billing`
- `consent_to_research`
- `contact_link`
- `contact_list`
- `count_as_live`
- `created_by`
- `crown`
- `email_branding`
- `email_from`
- `go_live_at`
- `go_live_user`
- `id`
- `inbound_api`
- `inbound_number`
- `inbound_sms`
- `letter_branding`
- `letter_contact_block`
- `letter_logo_filename`
- `message_limit`
- `name`
- `organisation`
- `organisation_type`
- `permissions`
- `prefix_sms`
- `rate_limit`
- `research_mode`
- `restricted`
- `returned_letters`
- `service_callback_api`
- `users`
- `version`
- `volume_email`
- `volume_letter`
- `volume_sms`
- `whitelist`
So the ones that the admin is getting but not expecting are:
- `all_template_folders`
- `annual_billing`
- `contact_list`
- `created_by`
- `crown`
- `inbound_number`
- `inbound_sms`
- `letter_logo_filename`
- `rate_limit`
- `returned_letters`
- `users`
- `version`
- `whitelist`
Which is what this PR adds to the exclude list, except for `created_by`
which is keeps because it’s needed to validate the JSON provided when
creating a service.
Two new metrics:
auth_db_connection_duration_seconds (histogram)
wraps the first DB call of post notifications. This includes waiting
to get a connection from the pool, and also making the actual request
to the db to retrieve the service and api keys. (i'm not sure there's
an easy way to separate these two things)
post_notification_json_parse_duration_seconds
wraps parsing the v2 post notifications json parsing and schema
validation. Shouldn't include any async code
we have one global metrics variable `metrics = GDSMetrics()`, and we
then call `metrics.init_app` from within the flask application set up.
The v2/test_errors.py app_for_test fixture calls create_app, would call
metrics.init_app multiple times for the same metrics instance. This
causes errors, so change the fixture to session level so it only calls
once per test run.
grab the worker app name and task name rather than the web host and
endpoint. also add a fallback for if we're not in a web request or a
celery task. I think that'll probably happen when we use alembic, or if
we do things from within flask shell
add the following prometheus metrics to keep track of general app
instance health.
# sqs_apply_async_duration
how long does the actual SQS call (a standard web request to AWS) take.
a histogram with default bucket sizes, split up per task that was
created.
# concurrent_web_request_count
how many web requests is this app currently serving. this is split up
per process, so we'd expect multiple responses per app instance
# db_connection_total_connected
how many connections does this app (process) have open to the database.
They might be idle.
# db_connection_total_checked_out
how many connections does this app (process) have open that are
currently in use by a web worker
# db_connection_open_duration_seconds
a histogram per endpoint of how long the db connection was taken from
the pool for. won't have any data if a connection was never opened.
To check if it was our congestion point for the soak test.
(Email stub is a bit slower to respond than Amazon SES, and
the difference could lead to us having to many connections to db open
as we wait for response from the stub)
If we set an environment variable, we can stub out calls to SES and send
them to our own stub app. If the environment variable is not set, things
work as normal.
To be used alongside
https://github.com/alphagov/notifications-email-provider-stub
us not recognising a code or provider not having sent the detailed
status.
It seems like Firetext is sometimes sending us permanent-failure
without detailed status. It could be due to:
- them really not sending any detailed status
- them sending a status code we don't recognise
- them sending 000 code that means 'no errors', which we ignore
To see which one it is, and to debug such issues quicker in the
future, this PR adds status and detailed status codes to the logs.
Also log detailed delivery status for firetext in the same place in addition
to it being logged from notifications_dao.
Logging detailed delivery statuses will help us see why messages
fail to deliver. In the future we could persist detailed delivery
status in the database.
we're using statsd to monitor how long provider requests are taking.
However, there's lots of busy work that happens inside our statsd
metrics timing window. Things like json dumping and loading, building
headers, exception handling, etc.
for firetext/mmg, the response object from requests has an elapsed
property [1], which captures from sending raw data to parsing the
response headers. for ses, it's a bit trickier, but boto3 exposes a few
event hooks [2]. it's hard to find them without stepping through the
code, but the interesting ones are before-call, after-call,
after-call-error, request-created, and response-received. The
before-call and after-call involve some marshalling, built-in retrying,
etc, while request-created and response-received are much lower level.
They might be called more than once per ses request, if boto3 itself
retries the request on 5xx, 429 and low level socket errors [3].
Add these as new `raw-request-time` metrics rather than overwriting to
avoid changing the meaning of an existing metric, and to let us compare
the metrics to see if there's a noticeable difference at all
[1] https://requests.readthedocs.io/en/master/api/#requests.Response.elapsed
[2] https://boto3.amazonaws.com/v1/documentation/api/latest/guide/events.html
[3] https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html#legacy-retry-mode
This copies what we do to a user's email address when archiving the user
by prefixing it with `_archived_{date}`. We already prefixed the
service name and email_reply_to with `_archived`, but this didn't allow
a service with the same name to be archived more than once.
The `notifications`, `notification_history`, `templates` and `templates_history`
tables all had a check constraint on the postage column which specified
that the postage had to be `first` or `second` if the notification or
template was a letter. We now have two more options for postage -
`europe` and `rest-of-world`.
It's not possible to alter a check constraint, so the constraints have
to be dropped then recreated. We are not recreating the constraint on
the `notification_history` table since values here are always copied
from the `notifications` table.
The constraints get added as `NOT VALID` at first - this stage will lock
the tables, so updating the `notification` table and `templates` and
`templates_history` are done in separate migrations so that we don't lock
all tables at the same time. In a third migration we then run
`VALIDATE CONSTRAINT` for all tables - this will lock a row at a time,
not the whole table.
When running the purge command I found about 4 users who could not be
deleted because their user id was still referenced in the services table
as they had created the service yet they were not a member of that
service anymore.
I have fixed this by checking that if they are not a member but created
the service then we also delete the service for them.
Note, I've followed the previous convention of no tests for this
function. I've run it locally and executed the code path so there should
be no major flaws in the code. There is a small chance I wasn't able to
exactly replicate the state that existed in preview on my local but
hopefully it was close enough to be accurate.
This PR tries to parse the date, if that throws an error return now as the datereceived. This will at least allow the message to be persisted. Typically the DateReceived, provider_date, and the created_at date in the inbound_sms table are within a second of each other.
If you’ve sent a bunch of jobs from the same contact list then a handy
way to differentiate between them will be date sent, but also template
name (in effect the message you sent).
This commit extends the job response to include template name, using the
same pattern as for template type.
Because we’ll be grouping jobs under their parent contact lists it will
be useful to have a way of showing how many times a contact list has
been used. This will give the right information scent to indicate that
clicking into a contact list is where you go to see its jobs. This means
that the API needs to return a count of jobs for each contact list.
Putting this code feels very non-idiomatic for our API. So suggestions
about how to better architect it welcome…
Rather than showing all jobs that have been ‘copied’ from a contact list
I think it makes more sense to group them under the contact list. This
way it’s easier to see what messages have been sent to a given group of
contacts over time.
Part of this work means the API needs to return only jobs that have been
created from a given contact list, when asked.
We want to add another argument here, and doing so would make the line
length too long with all the arguments on one line.
Also uses the * operator to enforce keyword-only arguments.
Because we won’t be showing uploaded letters individually on the uploads
page any more we need a way of listing them. This should be by printing
day, to match how we’re grouping them on the uploads page.
The response matches a normal `get_notifications` response so we can
reuse the same code in the admin app.
Some teams have started uploading quite a lot of letters (in the
hundreds per week). They’re also uploading CSVs of emails. This means
the uploads page ends up quite jumbled.
This is because:
- there’s just a lot of items to scan through
- conceptually it’s a bit odd to have batches of things displayed
alongside individual things on the same page
So instead this commit starts grouping together uploaded letters. It
does this by the date on which we ‘start’ printing them, or in other
words the time at which they can no longer be cancelled.
This feels like a natural grouping, and it matches what we know about
people’s mental models of ‘batches’ and ‘runs’ when talking about
printing.
The code for this is a bit gnarly because:
- timezones
- the print cutoff doesn’t align with the end of a day
- we have to do this in SQL because it wouldn’t be efficient to query
thousands of letters and then do the timezone calculations on them in
Python
The standard way that we indicate that there are more results than can
be returned is by paginating. So even though we don’t intend to paginate
the search results in the admin app, it can still use the presence or
absence of a ‘next’ link to determine whether or not to show a message
about only showing the first 50 results.