and return data for one more day.
We're not really limiting to 7 days - we're returning 7 entire days, plus whatever time has elapsed since midnight today. I felt it would be best to rename the variable to `whole_days` to imply that it's not "limit this data set to seven days" but "give me at least seven days". The endpoint is backwards compatible, so we can rename the variable on the front-end later.
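A minimal sketch of that behaviour, assuming a hypothetical `get_stats_date_range` helper and that admin passes `whole_days=7`:

```python
from datetime import datetime, timedelta

def get_stats_date_range(whole_days=7, now=None):
    # Hypothetical helper: the window covered by the stats query is
    # `whole_days` complete days plus whatever has elapsed today.
    now = now or datetime.utcnow()
    midnight_today = datetime.combine(now.date(), datetime.min.time())
    start = midnight_today - timedelta(days=whole_days)
    # The end is "now", so the partial current day always comes on top of
    # the seven whole days - "give me at least seven days".
    return start, now
```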
Adding the service id to the query improved performance greatly - it went from 6200ms to 0.04ms.
This should stop the 500s when a template is deleted.
Getting the data for a day can be reasonably slow (a few hundred milliseconds), and if someone's viewing a service with no activity we don't want to do that query seven times every two seconds. So if there is no data in redis, then when we get the data out of the database we should put it in redis so we can just grab it from there next time. This'll happen in two cases:
* the redis data was deleted
* the service sent no messages that day
Additionally, make sure that we convert nicely from redis' return values (ascii strings) to unicode keys and integer counts.
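Roughly what that fallback could look like - `redis_client`, `dao_fetch_stats_for_day`, the key format and the TTL are all assumed names/values, not the real ones:

```python
def get_template_counts_for_day(redis_client, dao_fetch_stats_for_day, service_id, day):
    key = "{}-template-counts-{}".format(service_id, day.isoformat())  # hypothetical key format
    raw = redis_client.hgetall(key)
    if raw:
        # redis returns ascii/byte strings for both fields and values;
        # convert them to unicode template ids and integer counts.
        return {k.decode("utf-8"): int(v) for k, v in raw.items()}

    # Nothing in redis - either the key was deleted or the service sent no
    # messages that day - so get the counts from the database instead...
    counts = dao_fetch_stats_for_day(service_id, day)
    if counts:
        # ...and write them back so the next poll can come straight from redis.
        redis_client.hmset(key, {k: str(v) for k, v in counts.items()})
        redis_client.expire(key, 10 * 60)  # assumed TTL
    return counts
```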
New redis keys are partitioned per service per day. The new process is as follows (sketched after the list):
* require a count of days to filter by. Currently admin always gives 7.
* for each day, check and see if there's anything in redis. There won't
be if either a) redis is/was down or b) the service didn't send any
notifications that day
- if there isn't, go to the database and get a count out.
* combine all these stats together
* get the names/template types etc out of the DB at the end.
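A sketch of that read path, with `get_day_counts` standing in for the redis-with-database-fallback lookup above and `dao_get_template_details` standing in for the query that returns names/template types (both names are assumptions):

```python
from datetime import date, timedelta

def get_template_statistics(service_id, whole_days, get_day_counts, dao_get_template_details):
    totals = {}
    today = date.today()
    # whole days plus the partial current day
    for offset in range(whole_days + 1):
        day = today - timedelta(days=offset)
        for template_id, count in get_day_counts(service_id, day).items():
            totals[template_id] = totals.get(template_id, 0) + count

    # only hit the database for template names/types once, at the end
    details = dao_get_template_details(totals.keys())
    return [
        {"template_id": template_id, "count": count, **details.get(template_id, {})}
        for template_id, count in totals.items()
    ]
```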
We should be very careful about when we get data from NotificationHistory - this should probably only happen from scheduled tasks. `dao_get_template_usage` is only called from the template statistics rest endpoint, so it shouldn't ever hit template history.
Also, moved the tests out to a new file to break up the 2k-line test file a bit.
We've run into issues with redis expiring keys while we try to write to them - short-lived redis TTLs aren't really sustainable for keys where we mutate the state. Template usage is a hash in redis where we increment a count keyed by template_id each time a message is sent for that template. But if the key expires, hincrby (the redis command for incrementing a value in a hash) will re-create an empty hash. This is no good, as we need the hash to be populated with the last seven days' worth of data, which we then increment further. We can't tell whether the hincrby created the key, so a different approach entirely was needed (sketched after the list):
* New redis key: <service_id>-template-usage-<YYYY-MM-DD>. Note: this YYYY-MM-DD is BST time so it lines up nicely with the ft_billing table.
* Incremented from process_notification - if the key doesn't exist yet, it'll be created then.
* Expiry set to 8 days every time it's incremented.
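The write side is small - a sketch, assuming a redis-py style client and that `process_notification` knows today's date in the same timezone as the key:

```python
def increment_template_usage(redis_client, service_id, template_id, today):
    # Key is partitioned per service per day,
    # e.g. "<service_id>-template-usage-2018-01-31".
    key = "{}-template-usage-{}".format(service_id, today.strftime("%Y-%m-%d"))
    # hincrby creates the hash if it doesn't exist yet, which is fine here
    # because each day's key only ever accumulates that day's counts.
    redis_client.hincrby(key, str(template_id), 1)
    # Refresh the expiry on every increment so a day's counts live for
    # 8 days - one more than the 7 whole days we read back.
    redis_client.expire(key, 8 * 24 * 60 * 60)
```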
Then, at read time, we'll just read the last eight days of keys from
Redis, and sum them up. This works because we're only ever incrementing
from that one place - never setting wholesale, never recreating the
data from scratch. So we know that if the data is in redis, then it is
good and accurate data.
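And the read side, sketched with the same assumed names - just sum the per-day hashes for the last eight days' keys:

```python
from datetime import date, timedelta

def get_template_usage_last_week(redis_client, service_id, today=None):
    today = today or date.today()
    totals = {}
    for offset in range(8):
        day = today - timedelta(days=offset)
        key = "{}-template-usage-{}".format(service_id, day.strftime("%Y-%m-%d"))
        # hgetall on a missing key just returns an empty dict
        for template_id, count in redis_client.hgetall(key).items():
            template_id = template_id.decode("utf-8")
            totals[template_id] = totals.get(template_id, 0) + int(count)
    return totals
```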
One thing we *don't* know and *cannot* reason about is what no key in
redis means. It could be either of:
* This is the first message that the service has sent today.
* The key was deleted from redis for some reason.
Since we set the TTL so long, we'll never be writing to a key that previously expired. But if there is a redis (or operator) error and the key is deleted, then we'll have bad data - after any data loss we'll have to rebuild the data.
This PR optimizes the query to use a more efficient index.
- Add notification_type to dao_get_last_template_usage to optimize the query.
- Tested and analyzed the query on the production database, with very significant results.
Before:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..1711.35 rows=1 width=935) (actual time=21186.053..21186.053 rows=0 loops=1)
-> Index Scan Backward using ix_notifications_created_at on notifications (cost=0.43..4607493.80 rows=2693 width=935) (actual time=21186.052..21186.052 rows=0 loops=1)
Filter: (((key_type)::text <> 'test'::text) AND (template_id = 'xxxxxx'::uuid))
Rows Removed by Filter: 8244071
Planning time: 0.112 ms
Execution time: 21186.082 ms
After:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Limit (cost=5323.10..5323.10 rows=1 width=935)
-> Sort (cost=5323.10..5323.74 rows=258 width=935)
Sort Key: created_at DESC
-> Index Scan using ix_notifications_template_id on notifications (cost=0.56..5321.81 rows=258 width=935)
Index Cond: (template_id = 'xxxxx'::uuid)
Filter: (((key_type)::text <> 'test'::text) AND (notification_type = 'sms'::notification_type))
Planning time: 1.102 ms
Execution time: 0.584 ms
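For reference, the change is roughly the sketch below - the model, session and column names are assumptions, passed in here only to keep the example standalone:

```python
from sqlalchemy import desc

def dao_get_last_template_usage(session, Notification, template_id, notification_type):
    # Filtering on notification_type (alongside template_id) lets Postgres use
    # ix_notifications_template_id plus a small sort, instead of walking
    # ix_notifications_created_at backwards through millions of rows.
    return (
        session.query(Notification)
        .filter(
            Notification.template_id == template_id,
            Notification.notification_type == notification_type,
            Notification.key_type != 'test',
        )
        .order_by(desc(Notification.created_at))
        .first()
    )
```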
The cache expires every 10 minutes, but will help with the every-two-seconds query, especially when a job is running. There is still some clean-up and QA to do for this.
or param errors to raise an invalid data exception. That will cause those responses to be handled by errors.py, which will log the errors.
Set most of the schemas to strict mode so that marshmallow will raise an exception rather than us checking for errors in the return tuple from load. Added a handler to errors.py for marshmallow validation errors.
Schemas left in non-strict mode do not raise exceptions.
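Roughly what strict mode plus the handler looks like with marshmallow 2.x - the schema fields and handler body here are illustrative, not the real ones:

```python
from flask import jsonify
from marshmallow import Schema, fields, ValidationError

class NotificationSchema(Schema):  # illustrative schema
    class Meta:
        strict = True  # marshmallow 2.x: load() raises instead of returning errors

    to = fields.String(required=True)
    template_id = fields.UUID(required=True)

def register_errors(app):
    @app.errorhandler(ValidationError)
    def handle_marshmallow_validation(error):
        # log, then return json - the rest methods no longer have to
        # inspect the (data, errors) tuple themselves
        app.logger.info("Validation failed: %s", error.messages)
        return jsonify(result="error", message=error.messages), 400
```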
Introduced a simple exception that contains error messages and a status code, which can be used rather than returning json + a status code from rest methods directly. The handler in errors.py for this exception can then log the error before returning json.
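A sketch of that exception and its handler - the name `InvalidRequest` and the response shape are assumptions:

```python
from flask import jsonify

class InvalidRequest(Exception):
    # Carries the error message(s) and the status code to respond with, so
    # rest methods can `raise InvalidRequest(...)` instead of building
    # json + status code responses themselves.
    def __init__(self, message, status_code):
        super().__init__(message)
        self.message = message
        self.status_code = status_code

    def to_dict(self):
        return {"result": "error", "message": self.message}

def register_invalid_request_handler(app):
    @app.errorhandler(InvalidRequest)
    def handle_invalid_request(error):
        # log first, then return json
        app.logger.info(error.message)
        return jsonify(error.to_dict()), error.status_code
```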