The XML for an alerts requires a `<description>` field. The XML for
a `<cancel>` may have a `<description>` field populated (although we
ignore the contents) but it may also be empty.
This commit updates the schema to leave the all the validation to the
view layer, which can decide when or when not to validate the content of
the `<description>` field.
Previously we were looping over data from the Notifications/History
table and then shovelling it into the status table, one row at a time
- plus an extra delete to clean up any existing data.
This replaces that with a batch insertion, similar to how we archive
notifications [1], but using a simple subquery (via "from_select" [2])
instead of a temporary table.
To make the select compatible with the insert, I've used "literal"
to inject the constant pieces of data, so each row has everything it
needs to go into the status table.
[1]: 9ce6d2fe92/app/dao/notifications_dao.py (L295)
[2]: https://docs.sqlalchemy.org/en/14/core/dml.html#sqlalchemy.sql.expression.Insert.from_select
This is consistent with the way we do billing updates [1] and is a
bit less clunky. Functionally it should be the same - note that the
tests already cover the "overwriting" behaviour if a row exists.
[1]: 9ce6d2fe92/app/dao/fact_billing_dao.py (L522)
If someone tries to cancel a broadcast but the references don’t match
and existing broadcast we correctly return a 404.
If they don’t provide any references then we get an exception. This
commit catches the missing references and returns a 400. I think this
is more appropriate because it’s malformed request, rather than a
well-formed request that doesn’t match our data. It also lets us write a
more specific and helpful error message.
Addresses [1].
Previously the query would always use UTC midnight, even after we
had switched to BST (+1h). We store timestamps as naive UTC in our
DB - without a timezone - but we want the query to work in terms
of GMT / BST so we adjust for that - BST midnight is 11PM in UTC.
[1]: https://github.com/alphagov/notifications-api/pull/3437#discussion_r791998690
This takes a similar approach to the nightly deletion task so that
we only create sub-tasks when there are actually notifications to
aggregate for a given type and day [1].
We're making this change to stop the duplication errors we're getting
at the moment and ensure the task can scale to more messages and more
services. There are two parts to this:
- Each subtask should now run within the 5 minute visibility timeout.
However, they may still be duplicated if the parent task overruns [2].
- The parent task creates a mininal number of subtasks, and the query
to determine this is very fast for a normal process day (milliseconds).
Since all tasks will run quickly, there should be no more duplication.
In order to test this more nuanced task, I rewrote the tests:
- One test checks the subtask is called correctly.
- One test checks we create all the right subtasks.
[1]: https://github.com/alphagov/notifications-api/pull/3381
[2]: https://docs.google.com/document/d/1MaP6Nyy3nJKkuh_4lP1wuDm19X8LZITOLRd9n3Ax-xg/edit#heading=h.q3intzwqhfzl
These can be inferred elsewhere:
- Task creation is obvious from task execution. If we're concerned
about a specific service, we can check the updated times on the DB
records, since all records are recreated each time this runs.
- Task starting is already logged.
- Task completion is already logged. The number of rows updated can
also be inferred from the DB.
The log I've found useful is the one about fetching the data, and
I've also added another to time how long it takes to insert the data,
as both could be sources of poor performance.
Arguably we should use metrics for this sort of thing, but logs are
easier in practice for the metric systems we have.
Changes:
53.0.0
---
* `notifications_utils.columns.Columns` has moved to
`notifications_utils.insensitive_dict.InsensitiveDict`
* `notifications_utils.columns.Rows` has moved to
`notifications_utils.recipients.Rows`
* `notifications_utils.columns.Cell` has moved to
`notifications_utils.recipients.Cell`
52.0.0
---
* Deprecate the following unused `redis_client` functions:
- `redis_client.increment_hash_value`
- `redis_client.decrement_hash_value`
- `redis_client.get_all_from_hash`
- `redis_client.set_hash_and_expire`
- `redis_client.expire`
51.3.1
---
* Bump govuk-bank-holidays to cache holidays for next year.
Add check constraint that created_by_id should not be null, unless
created_by_api_key_id is not null to the migration script. It is
already in the models file.
Also remove check constraint for cancelled_by_id from models, as
this field would only be filled for broadcasts with cancelled
status.
Also add some spacing in that migration script so it is easier
to read.
This follows the pattern for invite emails where the admin app tells the
API which domain to use when generating the link.
This will starting working once this admin change is merged:
- [ ] https://github.com/alphagov/notifications-admin/pull/4150/files
It won’t break anything if it’s merged before the admin change.
as a team we primarily develop locally. However, we've been experiencing
issues with pycurl, a subdependency of celery, that is notoriously
difficult to install on mac. On top of the existing issues, we're also
seeing it conflict with pyproj in bizarre ways (where the order of
imports between pyproj and pycurl result in different configurations of
dynamically linked C libraries being loaded.
You are encouraged to attempt to install pycurl locally, following these
instructions: https://github.com/alphagov/notifications-manuals/wiki/Getting-Started#pycurl
However, if you aren't having any luck, you can instead now run celery
in a docker container.
`make run-celery-with-docker`
This will build a container, install the dependencies, and run celery
(with the default of four concurrent workers).
It will pull aws variables from your aws configuration as boto would
normally, and it will attempt to connect to your local database with the
user `postgres`. If your local database is configured differently (for
example, with a different user, or on a different port), then you can
set the SQLALCHEMY_DATABASE_URI locally to override that.
This caught me out the other day when I was testing. I thought antivirus
was enable, seeing this message didn't help with that assumption. Plus
there is no point in showing the log if we don't actually call the task.
Previously we think this setting was necessary to avoid a memory
leak [1], but it's unclear if this is still an issue:
- We've advanced two major versions of Celery.
- Some of the tasks are now quicker and leaner.
Restarting worker sub-processes after each task is a big problem
for performance, as we move towards parallelising our reporting.
This is something of a test to see if we can manage without this
setting. Note that we need to unset the variable manually:
cf unset-env notify-delivery-worker-reporting CELERYD_MAX_TASKS_PER_CHILD
In the worst case we can always re-run any failed tasks. To check
the worker is still behaving as expected, we can:
- Monitor CPU / memory graphs for it.
- Check `cf events` for unexpected restarts / crashes.
- Compare numbers of task completion logs to previous days.
- Check the number of new billing / status rows looks right.
[1]: ad419f7592
we try and delete for lots of services. this includes services that
don't actually have anything to delete that day. that might be because
they had a custom data retention so we always go to check them, or
because they only sent test notifications (which we'll delete but not
include in the count in the log line). we don't really need to see log
lines saying that we didn't delete anything for that service - that's
just a long list of boring log messages that will hide the actual
interesting stuff - which services we did delete content for.
This is because that function is used both when broadcast status
is updated via API and via admin, so it's a shared resource.
Also move and update tests for updating broadcast message status
so things are tested at source and repetition is avoided.
The top-level task didn't run successfully after this was deployed
due to the worker being killed due to heavy disk usage. While the
more parallel version does log much more, it doesn't totally explain
the disk behaviour. Nonetheless, reverting it is sensible to give us
the time we need to investigate more.
If the reference from cancel CAP XML we received via API does not
match with any existing broadcast, return 404.
Do the same if service id doesn't match.
Also refactor code to cancel broadcast out into separate function
It should be a separate function that is only called by create_broadcast
function. This will prevent create_broadcast from becoming too
big and complex and doing too many things.
1. The number of letters that we send to DVLA will be not be correct (see 20ead82463/app/celery/letters_pdf_tasks.py (L136))
This may raise an alert with DVLA when they find we have sent them fewer letter than we have reported.
2. When we get the PDF from S3 we will get a file not found 20ead82463/app/celery/letters_pdf_tasks.py (L244)
The error will not prevent the collate task from completing but we will see an alert email for the exception and raise questions.
Although this situation is very unlikely because we have a 15 minute window between the last letter deadline date and the time we kick off the collate task we should still mitigate these issues. I updated the queries to only return letters with billable_units > 0, all valid letters should have at least 1 billable unit.
It is possible that the personalisation for a templated letter can make the letter exceed 10 pages or 5 sheets. We are not validating the letters posted via the API for this validation error. It is only possible to validate the letter once we create the PDF in notifications-template-preview. This means that the letter can only get a validation-failed status after the client has received a 201 from the POST to /v2/notifications.
NOTE: we only validate the preview row of a CSV for this validation error, this change will mean that it is possible for a letter to be marked as validation-failed after a successful file upload.
A new task to update the notification to `validation-failed` has been added to the API. If we find that the letter is too long once we have created the PDF we call the `update-validation-failed-for-templated-letter` task rather than `update-billable-units-for-letter` task.
New work flow for a letter in brief:
API - receives POST /v2/notifications
:: save to db
:: put CREATE_LETTERS_PDF task on queue for template preview to consume
TEMPLATE-PREVIEW - consumes task CREATE_LETTERS_PDF
:: create PDF
:: count pages of PDF
:: IF page count exceeds 10 pages
put in the letters-invalid-pdf S3 bucket with metadata (similar to the precompiled letters)
put `update-validation-failed-for-templated-letter` task on the queue for the API to consume
ELSE
put PDF in the `letters-pdf` bucket
put `update-billable-units-for-letter` task on the queue
API - consumes `update-billable-units-for-letter` OR `update-validation-failed-for-templated-letter` task
:: IF `update-billable-units-for-letter` task:
update billable units for notification as usual
:: ELSE `update-validation-failed-for-templated-letter`:
update notification_status = `validation-failed`
ADMIN - view notification page for letter
:: show validation letter for templated letter
There will be 3 PRs in order to make this change, one for the API, template-preview and the admin app.
Deployment plan
Deploy Admin first
Deploy API
Deploy template-preview
Related PRs:
alphagov/notifications-template-preview#619alphagov/notifications-admin#4107https://www.pivotaltracker.com/story/show/169209742
This field caused some confusion and lots of unnecessary work
to our colleague because of unclear name.
The field was named sms_fragments, where in fact the value of
the field is: those sms fragments that go above free allowance
multiplied by the rate multiplier.
The new name was chosen through consultation with colleagues
who use billing report the most.