We've run into issues with Redis expiring keys while we try to write
to them - short-lived Redis TTLs aren't really sustainable for keys
where we mutate the state. Template usage is a hash stored in Redis
where we increment a count keyed by template_id each time a message is
sent for that template. But if the key expires, HINCRBY (the Redis
command for incrementing a value in a hash) will re-create it as an
empty hash. This is no good, as we need the hash to be populated with
the last seven days' worth of data, which we then increment further.
We can't tell whether the HINCRBY created the key, so a different
approach entirely was needed:
* New Redis key: <service_id>-template-usage-<YYYY-MM-DD>. Note: this
YYYY-MM-DD is in BST so it lines up nicely with the ft_billing table
* Incremented from process_notification - if the key doesn't exist
yet, it'll be created then
* Expiry set to 8 days every time it's incremented
Then, at read time, we'll just read the last eight days of keys from
Redis, and sum them up. This works because we're only ever incrementing
from that one place - never setting wholesale, never recreating the
data from scratch. So we know that if the data is in redis, then it is
good and accurate data.
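A minimal sketch of both paths, assuming a redis-py client and the key
scheme above (production code would use the BST date rather than
date.today(), and the function names here are illustrative):

```python
from datetime import date, timedelta

import redis

client = redis.Redis(decode_responses=True)

EIGHT_DAYS_IN_SECONDS = 8 * 24 * 60 * 60

def daily_usage_key(service_id, day):
    return f"{service_id}-template-usage-{day.isoformat()}"

def increment_template_usage(service_id, template_id):
    # Called from process_notification: HINCRBY creates the key if it's
    # absent, and we refresh the 8-day expiry on every write.
    key = daily_usage_key(service_id, date.today())
    client.hincrby(key, template_id, 1)
    client.expire(key, EIGHT_DAYS_IN_SECONDS)

def get_template_usage(service_id):
    # Read the last eight days of keys and sum the per-template counts;
    # expired or never-created keys just come back as empty hashes.
    totals = {}
    for offset in range(8):
        day = date.today() - timedelta(days=offset)
        for template_id, count in client.hgetall(daily_usage_key(service_id, day)).items():
            totals[template_id] = totals.get(template_id, 0) + int(count)
    return totals
```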
One thing we *don't* know and *cannot* reason about is what the
absence of a key in Redis means. It could be either of:
* This is the first message that the service has sent today.
* The key was deleted from redis for some reason.
Since we set such a long TTL, we'll never be writing to a key that
previously expired. But if there is a redis (or operator) error and the
key is deleted, then we'll have bad data - after any data loss we'll
have to rebuild the data.
This PR is a proposal to reduce the average number of log messages we see for a single notification from about 7 to 2.
Messaging would change from something like this:
February 2nd 2018, 15:39:05.885 Full delivery response from Firetext for notification: 8eda51d5-cd82-4569-bfc9-d5570cdf2126
{'status': ['0'], 'reference': ['8eda51d5-cd82-4569-bfc9-d5570cdf2126'], 'time': ['2018-02-02 15:39:01'], 'code': ['000']}
February 2nd 2018, 15:39:05.885 Firetext callback return status of 0 for reference: 8eda51d5-cd82-4569-bfc9-d5570cdf2126
February 2nd 2018, 15:38:57.727 SMS 8eda51d5-cd82-4569-bfc9-d5570cdf2126 sent to provider firetext at 2018-02-02 15:38:56.716814
February 2nd 2018, 15:38:56.727 Starting sending SMS 8eda51d5-cd82-4569-bfc9-d5570cdf2126 to provider at 2018-02-02 15:38:56.408181
February 2nd 2018, 15:38:56.727 Firetext request for 8eda51d5-cd82-4569-bfc9-d5570cdf2126 finished in 0.30376038211397827
February 2nd 2018, 15:38:49.449 sms 8eda51d5-cd82-4569-bfc9-d5570cdf2126 created at 2018-02-02 15:38:48.439113
February 2nd 2018, 15:38:49.449 sms 8eda51d5-cd82-4569-bfc9-d5570cdf2126 sent to the priority-tasks queue for delivery
To something like this:
February 2nd 2018, 15:39:05.885 Firetext callback return status of 0 for reference: 8eda51d5-cd82-4569-bfc9-d5570cdf2126
February 2nd 2018, 15:38:49.449 sms 8eda51d5-cd82-4569-bfc9-d5570cdf2126 created at 2018-02-02 15:38:48.439113
These changes were needed for monitoring the performance of the v2
endpoints. They were put in as a temporary measure whilst sustained
performance testing was taking place.
- Disable Redis, as there is currently a connection limit of 256,
which could slow down requests if all the connections are in use
- Added statsd instrumentation to methods in the POST to help spot any
bottlenecks (a sketch follows below)
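For illustration, a hedged sketch of that kind of instrumentation
using the statsd client's timer decorator (the metric name, prefix and
wrapped function are made up for the example):

```python
from statsd import StatsClient

statsd_client = StatsClient(host="localhost", port=8125, prefix="notify-api")

@statsd_client.timer("v2.post_notification.persist")
def persist_notification(payload):
    # Each call reports its wall-clock duration as a statsd timer, so
    # slow steps inside the POST show up as separate metrics.
    ...
```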
In other places the text we use for this error message is "Missing personalisation: name, date, thing". See:
- 72b108b694/app/template/rest.py (L125)
- 717c0510a3/app/notifications/rest.py (L206)
- 05a179c6ef/app/v2/template/post_template.py (L38)
For some reason this part of the codebase says "Template missing personalisation: …". This is inconsistent, and also confusing because it’s the API call that’s missing the personalisation, not the template itself.
This commit changes the error message to be consistent with the majority of the codebase, which uses the less confusing wording.
we now no longer create a job. At the end of the POST there is no
action, as we don't have any tasks to queue immediately - if it's a
real notification it'll get picked up by the evening scheduled task.
If it's a test notification, we create it with an initial status of
sending so that we can be sure it'll never get picked up - and then we
trigger the update-letter-notifications-to-sent-to-dvla task to set
the sent-at/by.
there are three steps to this:
1. Create a job
* Starts in status 'ready to send'
* Created by None, since it's from the API
* original file name 'letter submitted via api'
2. Create a single notification for that job
* job_row_number 0
* client reference if provided
* address line 1 as recipient
3. Trigger the build_dvla_file task
we know that all the notifications have been created for this job
(since we just created them ourselves synchronously), so this will
just create the dvla-format file for the job, and upload it to s3.
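Roughly, as a hedged sketch (the model fields, db session, and queue
name are assumptions based on the description above, not the PR's
actual code, and assume the app's Job/Notification models and the
build_dvla_file task are importable):

```python
def create_letter_job_from_api(service, template, address_line_1, client_reference=None):
    # 1. Create a job: starts 'ready to send', created_by is None
    #    because it came in via the API.
    job = Job(
        service=service,
        template=template,
        job_status="ready to send",
        created_by=None,
        original_file_name="letter submitted via api",
    )
    db.session.add(job)

    # 2. Create the single notification for that job.
    notification = Notification(
        job=job,
        job_row_number=0,
        client_reference=client_reference,
        to=address_line_1,
    )
    db.session.add(notification)
    db.session.commit()

    # 3. Every notification for the job already exists (we just created
    #    it synchronously), so the task only has to build the
    #    DVLA-format file for the job and upload it to S3.
    build_dvla_file.apply_async([str(job.id)], queue="process-job")
    return job
```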
when functions get as big as that, it's confusing to try to work out
which things are what. By including a * as the first arg, we require
that anyone calling the function uses kwargs to reference the
parameters.
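For example (an illustrative signature, not one of the PR's actual
functions):

```python
def send_notification(*, template_id, recipient, personalisation=None):
    # The bare * makes every parameter after it keyword-only, so call
    # sites are self-documenting.
    return template_id, recipient, personalisation

# send_notification("t-1", "07700900123")  # TypeError: takes 0 positional arguments
send_notification(template_id="t-1", recipient="07700900123")
```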
* Alter config so an error will be raised if you forget to mock out a
celery call in one of your tests (one way to do this is sketched
after this list)
* Remove an unneeded exception type that was masking errors
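One possible way to get that celery behaviour in tests, sketched here
as an autouse pytest fixture (this is an assumption about the
mechanism, not necessarily what the config change actually does):

```python
import pytest
from celery.app.task import Task

@pytest.fixture(autouse=True)
def fail_on_unmocked_celery(monkeypatch):
    # Any task a test forgot to mock fails loudly instead of silently
    # trying to reach the broker. .delay() goes through apply_async too.
    def _refuse(self, *args, **kwargs):
        raise AssertionError(f"Unmocked celery task called: {self.name}")

    monkeypatch.setattr(Task, "apply_async", _refuse)
```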
- uses new utils methods to validate phone numbers
- defaults to international=True on validation, which ensures the
validator works on all numbers
- then checks whether the user can send this message to the number
internationally, if needed (sketched below)
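A rough sketch of that order of checks, written against the
phonenumbers library directly (the real code goes through
notifications-utils, whose method names and signatures differ, and
service.can_send_international_sms is a hypothetical flag):

```python
import phonenumbers

def validate_recipient(service, raw_number):
    # Parse with international support first, so any valid number gets
    # through the initial validation step.
    region = None if raw_number.startswith("+") else "GB"
    number = phonenumbers.parse(raw_number, region)
    if not phonenumbers.is_valid_number(number):
        raise ValueError("Not a valid phone number")

    # Only then check whether the service may send internationally.
    if number.country_code != 44 and not service.can_send_international_sms:
        raise ValueError("Cannot send to international mobile numbers")

    return phonenumbers.format_number(number, phonenumbers.PhoneNumberFormat.E164)
```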
- uses the reference field on the notifications table to store a
16-character random string, used to cross-reference DVLA letters back
to the notification
- used because the letter barcode does not have space for a UUID
notification id
Depends on https://github.com/alphagov/notifications-utils/pull/149
Renamed the numeric_id to notification_reference in utils and changed
the validation rules to match.
Note also that the persist_notification method set "reference" to
"client_reference", which was confusing as they are different things,
so this is fixed too.
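For illustration, a 16-character reference could be generated like
this (the PR's actual helper name and alphabet may differ):

```python
import secrets
import string

def create_random_reference(length=16):
    # Uppercase letters and digits keep the reference barcode-friendly,
    # and 16 characters fit where a 36-character UUID would not.
    alphabet = string.ascii_uppercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))
```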
This is being done for the PaaS migration to allow us to keep traffic coming in whilst we migrate the database.
uses the same tasks as the CSV-uploaded notifications. Simple changes to not persist the notification, and to call into a different task.
> If a user makes an API request with additional personalisation fields,
> we should simply discard any fields that the template doesn't have.
>
> This gives a couple of related advantages:
>
> - modifying template parameters no longer requires downtime for
> clients - as they can pass in extra new parameters before a template
> change, or continue passing in old unused parameters after removing
> them from a template
>
> - services can pass in large user objects, for example, and then play
> around with templates adding and removing fields at will
>
> we should make sure we still return an error if a user doesn't pass in
> a required parameter.
– https://www.pivotaltracker.com/story/show/140774195
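A minimal sketch of that behaviour, assuming we can get the set of
placeholder names a template actually uses:

```python
def clean_personalisation(personalisation, template_placeholders):
    # Silently drop any fields the template doesn't use...
    cleaned = {k: v for k, v in personalisation.items() if k in template_placeholders}

    # ...but still reject requests that miss a required parameter.
    missing = set(template_placeholders) - set(cleaned)
    if missing:
        raise ValueError(f"Missing personalisation: {', '.join(sorted(missing))}")

    return cleaned
```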
The cache expires every 10 minutes, but it will still help with the query that runs every 2 seconds, especially when a job is running.
There is still some clean-up and QA to do for this.