1. The number of letters that we send to DVLA will be not be correct (see 20ead82463/app/celery/letters_pdf_tasks.py (L136))
This may raise an alert with DVLA when they find we have sent them fewer letter than we have reported.
2. When we get the PDF from S3 we will get a file not found 20ead82463/app/celery/letters_pdf_tasks.py (L244)
The error will not prevent the collate task from completing but we will see an alert email for the exception and raise questions.
Although this situation is very unlikely because we have a 15 minute window between the last letter deadline date and the time we kick off the collate task we should still mitigate these issues. I updated the queries to only return letters with billable_units > 0, all valid letters should have at least 1 billable unit.
It is possible that the personalisation for a templated letter can make the letter exceed 10 pages or 5 sheets. We are not validating the letters posted via the API for this validation error. It is only possible to validate the letter once we create the PDF in notifications-template-preview. This means that the letter can only get a validation-failed status after the client has received a 201 from the POST to /v2/notifications.
NOTE: we only validate the preview row of a CSV for this validation error, this change will mean that it is possible for a letter to be marked as validation-failed after a successful file upload.
A new task to update the notification to `validation-failed` has been added to the API. If we find that the letter is too long once we have created the PDF we call the `update-validation-failed-for-templated-letter` task rather than `update-billable-units-for-letter` task.
New work flow for a letter in brief:
API - receives POST /v2/notifications
:: save to db
:: put CREATE_LETTERS_PDF task on queue for template preview to consume
TEMPLATE-PREVIEW - consumes task CREATE_LETTERS_PDF
:: create PDF
:: count pages of PDF
:: IF page count exceeds 10 pages
put in the letters-invalid-pdf S3 bucket with metadata (similar to the precompiled letters)
put `update-validation-failed-for-templated-letter` task on the queue for the API to consume
ELSE
put PDF in the `letters-pdf` bucket
put `update-billable-units-for-letter` task on the queue
API - consumes `update-billable-units-for-letter` OR `update-validation-failed-for-templated-letter` task
:: IF `update-billable-units-for-letter` task:
update billable units for notification as usual
:: ELSE `update-validation-failed-for-templated-letter`:
update notification_status = `validation-failed`
ADMIN - view notification page for letter
:: show validation letter for templated letter
There will be 3 PRs in order to make this change, one for the API, template-preview and the admin app.
Deployment plan
Deploy Admin first
Deploy API
Deploy template-preview
Related PRs:
alphagov/notifications-template-preview#619alphagov/notifications-admin#4107https://www.pivotaltracker.com/story/show/169209742
This field caused some confusion and lots of unnecessary work
to our colleague because of unclear name.
The field was named sms_fragments, where in fact the value of
the field is: those sms fragments that go above free allowance
multiplied by the rate multiplier.
The new name was chosen through consultation with colleagues
who use billing report the most.
This follows a similar approach as [1]. Recently we've seen lots
of errors from this task, which we think are a consequence of it
doing too much work and tripping Celery's visibility timeout.
While we can optimise the query [2], it's likely the errors will
return as the number of live services grows. Parallelising the
aggregation now will make it more futureproof.
[1]: https://github.com/alphagov/notifications-api/pull/3397
[2]: https://github.com/alphagov/notifications-api/pull/3417
This is similar to the corresponding endpoint for services. However,
it is a little simpler since we don't need to worry about always having
at least one team member for an organisation.
The new dao function added, `dao_remove_user_from_organisation`, is also
simpler than `dao_remove_user_from_service` since we don't have any
organisation permissions to deal with.
Investigation with EXPLAIN and EXPLAIN ANALYZE for the notification
history table shows this is another instance of [1] but for the key
type column. Swapping "!=" for "IN" solves the problem.
[1]: https://github.com/alphagov/notifications-api/pull/3360
If a service has not sent any SMS for the financial year the free allowance was showing up as 0 rather than the number in annual billing. The query has been updated to use an outer join so that the free allow will be returned when there is no ft_billing.
There is a potential performance enhancement to only return the data for the services of the organisation in the `fetch_sms_free_allowance_remainder_until_date` subquery. I will investigate in a subsequent PR.
We want admin to send a POST request to this route if the data contains
a message recipient (a phone number or email address) so that this does
not show in the logs. This changes the route to accept both GET and POST
requests.
When running the night reporting tasks we are seeing that some tasks are failing because the query is timing out. We need to revisit how to optimise the query but this will at least let the process finish.
At the moment, when we are processing and sending an SMS we open
a DB connection at the start of the celery task and then close it
at the end of the celery task. Nice and simple.
However, during that celery task we make an HTTP call out to our
SMS providers. If our SMS providers have problems or response times
start to slow then it means we have an open DB connection sat waiting
for our SMS providers to respond which could take seconds. If our
SMS providers grind to a halt, this would cause all of the
celery tasks to hold on to their connections and we would run out
of DB connections and Notify would fall over.
We think we can solve this by closing the DB session which releases
the DB connection back to the pool.
Note, we've seen this happen in staging during load testing if our
SMS provider stub has fallen over. We've never seen it in production
and it may be less unlikely to happen as we are balancing traffic
across two providers and they generally have very good uptime.
One downside to be aware of is there could be a slight increase in
time spent to send an SMS as we will now spend a bit of extra time
closing the DB session and then reopening it again after the HTTP
request is done.
Note, there is no reason this approach couldn't be copied for our
email provider too if it appears successful.
We can't control who might be sending messages on inbound numbers
that we own i.e. this log isn't an actionable error. Looks like it
used to represent something that _was_ an error [1], but that's not
the case anymore, so it seems reasonable to downgrade it.
[1]: d99ab329eb (diff-80d123d9abb40f80a221979940657a2751cc7cb33f255aa8f352a8324023e022L125)
If the S3 object is missing [1], then that's what we want, so we
don't need such a severe log for it, but we still want to know as
it's not expected. This is separate to more general "ClientError"
exceptions, which could mean anything.
There weren't any tests to cover missing S3 objects, so I've added
one. I don't think we need a test for ClientErrors:
- If there was no handler, the task would fail and we'd learn about
it that way.
- The scope of the calling task is now much smaller, so it matters
less than it used to [2].
[1]: 81a79e56ce/app/letters/utils.py (L52)
[2]: f965322f25
we really don't gain anything by running each service delete in sequence
- we get the services, and then just loop through them deleting per
service. By deleting per service in separate tasks, we can take
advantage of parallelism. the only thing we lose is some log lines but I
don't think we're that interested in them.
only set query limit at the move_notifications dao function - the task
doesn't really care about the technical implementation of how it deletes
the notifications
This will help us monitor issues with delivery receipts and keep
track of provider performance over time.
I'm not concerned about performance here:
- The number of notifications to time out is usually small.
- This task only runs once a day.
- Calls to StatsD are quick and cheap.
A gauge is more useful as we can visualise it and combine it with
other stats - we already have other stats for the total number of
notifications sent by provider, and we can extrapolate the number
of slow notifications using this, if needed.
We also still have logs to say the task is running, as well as a
log in the calling code when we actually make a switch [1], so
we're not losing anything by removing the log here.
[1]: a9306c4557/app/celery/scheduled_tasks.py (L117)
Previously this was limited to 500K notifications. While we don't
expect to reach this limit, it's not impossible e.g. if we had a
repeat of the incident where one of our providers stopped sending
us status updates. Although that's not great, it's worse if our
code can't cope with the unexpectedly high volume.
This reuses the technique we have elsewhere [1] to keep processing
in batches until there's nothing left. Specifying a cutoff point
means the total amount of work to do can't keep growing.
[1]: 2fb432adaf/app/dao/notifications_dao.py (L441)
We're going to iterate how we use the function with a limit, so we
shouldn't say it's "temporary" anymore. We don't need to change the
default, but having it in the function parameters makes it easier to
see the funtion doesn't time out all notifications, just some.
Previously we specified the period and calculated the cutoff time
in the function. Passing it in means we can run the method multiple
times and avoid getting "new" notifications to time out in the time
it takes to process each batch.
We have been running in to the problem in
pallets/flask-sqlalchemy#518 where
our page loads very slow when viewing a single page of notifications
for a service in the admin app. Tracing this back and using SQL
explain analyze I can see that getting the notifications takes about
a second but the second query to count how many notifications there
are (to work out if there is a next page of pagination) can take up
to 100 seconds.
As suggested in that issue, we do the pagination ourselves.
Our pagination doesn't need us to know exactly how many notifications
there are, just whether there are any on the next page and that can
be done without running the slow query to count how many
notifications in total by using `count_pages=False`.
This commit is analagous to
c68d1a2f23
The only difference is that in that case, the pagination links are
used to show prev and/or next links in the admin app. In this case,
the pagination links are only used to see if there is a page 2, and
if there is, say that we are only showing the first 50 results.
The way it was done before, the remainder was incorrect in the
billing report and in the org usage query - it was the sms remainder
left at the start of the report period, not at the end of that period.
This became apparent when we tried to show sms_remainder on the org
usage report, where start date is always the start of the financial year.
We saw that sms sent by services did not reduce their free allowance
remainder according to the report. As a result of this, we had to
temporarily remove of sms_remainder column from the report, until
we fix the bug - it has been fixed now, yay!
I think the bug has snuck in partially because our fixtures for testing
this part of the code are quite complex, so it was
harder to see that numbers don't add up. I have added comments
to the tests to try and make it a bit clearer why the results are
as they are.
I also added comments to the code, and renamed some variables,
to make it easier to understand, as there are quite a few
moving parts in it - subqueries and the like.
I also renamed the fetch_sms_free_allowance_remainder method to
fetch_sms_free_allowance_remainder_until_date so it is clearer
what it does.