Commit Graph

172 Commits

Author SHA1 Message Date
Ben Thorner
966c4db8c6 Fix getting service IDs for status aggregation
Addresses [1].

Previously the query would always use UTC midnight, even after we
had switched to BST (+1h). We store timestamps as naive UTC in our
DB - without a timezone - but we want the query to work in terms
of GMT / BST so we adjust for that - BST midnight is 11PM in UTC.

[1]: https://github.com/alphagov/notifications-api/pull/3437#discussion_r791998690
2022-02-10 10:51:45 +00:00
David McDonald
1d8fafcdf4 Remove unused functions
Can't see these being used anywhere so lets get rid of them
2022-02-07 15:58:04 +00:00
Rebecca Law
6cd7a23d3c If there is an invalid letter that has not been updated to validation-failed because the update-validation-failed-for-templated-letter has not been picked up off the letter-tasks queue and the collate-letter-pdfs-to-be-sent has started.
1. The number of letters that we send to DVLA will be not be correct (see 20ead82463/app/celery/letters_pdf_tasks.py (L136))
This may raise an alert with DVLA when they find we have sent them fewer letter than we have reported.
2. When we get the PDF from S3 we will get a file not found 20ead82463/app/celery/letters_pdf_tasks.py (L244)
The error will not prevent the collate task from completing but we will see an alert email for the exception and raise questions.

Although this situation is very unlikely because we have a 15 minute window between the last letter deadline date and the time we kick off the collate task we should still mitigate these issues. I updated the queries to only return letters with billable_units > 0, all valid letters should have at least 1 billable unit.
2022-01-19 08:31:19 +00:00
Ben Thorner
de9ae08ecc Downgrade log for letter deletion exceptions
If the S3 object is missing [1], then that's what we want, so we
don't need such a severe log for it, but we still want to know as
it's not expected. This is separate to more general "ClientError"
exceptions, which could mean anything.

There weren't any tests to cover missing S3 objects, so I've added
one. I don't think we need a test for ClientErrors:

- If there was no handler, the task would fail and we'd learn about
it that way.

- The scope of the calling task is now much smaller, so it matters
less than it used to [2].

[1]: 81a79e56ce/app/letters/utils.py (L52)
[2]: f965322f25
2021-12-20 12:45:48 +00:00
Leo Hemsted
0dc0e184b9 clean up and rewrite notification_dao_delete_notifications
a bunch of these tests are now covered in the task test, so got rid of
some. Now that the "how long ago to delete" questions is asked in the
task rather than in the dao, and only one service is looked at at a
time, we don't need to worry about data retention, etc. Hopefully made
the tests simpler - there may still be some duplicates or overlaps
between the various cases.
2021-12-14 15:24:35 +00:00
Ben Thorner
c1f0c24d82 Trim down tests for DAO timeout function a bit
The first test is enough to cover that "created" and "delivered"
notifications aren't affected by this function.
2021-12-13 17:17:41 +00:00
Ben Thorner
76aeab24ce Rewrite DAO timeout method to take cutoff_time
Previously we specified the period and calculated the cutoff time
in the function. Passing it in means we can run the method multiple
times and avoid getting "new" notifications to time out in the time
it takes to process each batch.
2021-12-13 16:56:21 +00:00
Ben Thorner
b81a66da50 Fix assertions in tests for timeout DAO function
Previously most of the assertions were being run *before* we had
actually called the function. There was also a redundant block of
assertions that just asserted the initial state of the test data.
2021-12-13 16:48:30 +00:00
Ben Thorner
2fb432adaf Merge pull request #3383 from alphagov/email-sms-created-alert-180344153
Add new log / alert for 'created' email / SMS
2021-12-13 12:56:05 +00:00
David McDonald
7d8eed8228 Optimise queries run for creating pagination links
We have been running in to the problem in
pallets/flask-sqlalchemy#518 where
our page loads very slow when viewing a single page of notifications
for a service in the admin app. Tracing this back and using SQL
explain analyze I can see that getting the notifications takes about
a second but the second query to count how many notifications there
are (to work out if there is a next page of pagination) can take up
to 100 seconds.

As suggested in that issue, we do the pagination ourselves.
Our pagination doesn't need us to know exactly how many notifications
there are, just whether there are any on the next page and that can
be done without running the slow query to count how many
notifications in total by using `count_pages=False`.

This commit is analagous to
c68d1a2f23

The only difference is that in that case, the pagination links are
used to show prev and/or next links in the admin app. In this case,
the pagination links are only used to see if there is a page 2, and
if there is, say that we are only showing the first 50 results.
2021-12-10 17:47:27 +00:00
Ben Thorner
ab4cb029df Remove alert for email / sms in created
In response to [1].

[1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988

It turns out the code that inspired this new alert - in the old
"timeout-sending-notifications" task - was actually redundant as
we already have a task to "replay" notifications still in "created",
which is much better than just alerting about them.

It's possible the replayed notifications will also fail, but in
both cases we should see some kind of error due to this, so I don't
think we're losing anything by not having an alert.
2021-12-06 14:11:42 +00:00
Ben Thorner
0318229216 Stop 'timing out' old 'created' notifications
This is being replaced with a new alert and runbook [1]. It's not
always appropriate to change the status to 'technical-failure', and
the new alert means we'll act to fix the underlying issue promptly.

We'll look at tidying up the remaining code in the next commits.

[1]: https://github.com/alphagov/notifications-manuals/wiki/Support-Runbook#deal-with-email-or-sms-still-in-created
2021-12-06 14:00:36 +00:00
Ben Thorner
f96ba5361a Add new task to alert about created email / sms
This will log an error when email or SMS notifications have been
stuck in 'created' for too long - normally they should be 'sending'
in seconds, noting that we have a goal of < 10s wait time for most
notifications being processed our platform.

In the next commits we'll decouple similar functionality from the
existing 'timeout-sending-notifications' task.
2021-12-06 14:00:31 +00:00
Leo Hemsted
bab659c677 reduce number of services we try and delete notifications for
TLDR: Don't return as many services, and only return their IDs and not
the whole service objects.

Context:

the delete notifications nightly task has been taking longer and longer,
and to delete all three notification types in sequence it now takes up
to 8 hours.

This is because we were retrieving all services, loading them into
memory on the worker, and then trying to delete notifications for each
service in turn.

While it does use a fair chunk of IOPS/CPU on our postgres db, we're not
anywhere close to capacity on those (20% CPU, 4k IOPS out of 30k max)[1]

The real issue appears to be that the task is CPU bound on the periodic
worker - we see the worker spike up to 100% CPU regularly across the
whole 3am-11am period.

We also noticed that for each notification type the task first processes
services with custom data retention (not many but some of the biggest
users), then deals with all other services. We can see from looking at
kibana that, for example, the task starts at 3am, and the custom data
retention service email deletions are finished by 3:12am. The rest of
the emails don't get deleted until 5am, so we knew that the problem is
with how it handles the other services.

There are currently 17000 services in the database. On a typical day,
~800 services will have notifications that are over 7 days old and need
to be deleted. By only returning these services, we reduce the amount of
data transfer and serialisation that needs to happen. It takes about two
minutes to retrieve the distinct service ids from the notifications
table for sms notifications, but that is only 5% the size of the full
list so cuts down on a lot of processing

Also, by only returning service_ids rather than the whole `Service`
model we avoid sqlalchemy needing to do lots of data serialisation, when
we were only using the `Service.id` field from that result anyway.

[1] https://admin.cloud.service.gov.uk/organisations/55b1eb7d-e4c5-4359-9466-dd3ca5b0e457/spaces/80d769ff-7b01-49a4-9fa4-f87edd5328f9/services/6093d337-6918-4b97-9709-97529114eb90/metrics
[2] https://grafana-paas.cloudapps.digital/d/_GlGBNbmk/notify-apps?orgId=2&refresh=5s&var-space=production&var-app=notify-delivery-worker-periodic&from=now-24h&to=now
[3] https://kibana.logit.io/s/9423a789-282c-4113-908d-0be3b1bc9d1d/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(message),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'%22Deleting%20email%20notifications%20for%20services%20without%20flexible%20data%20retention%22')),sort:!('@timestamp',desc))
2021-11-24 16:18:40 +00:00
Ben Thorner
8219b3c032 Remove non/crown indicator in letter filenames
This is not required by DVLA and since [1] we no longer care about
the end of letter filenames when collating them, so removing it is
safe to do. Note that the name of the ZIP files of collated letters
is based on a hash of the filenames, which needed updating in tests.

Before merging this we need to do a test run in Staging, so DVLA can
check that a mixture of the old / new filenames won't cause issues.

[1]: https://github.com/alphagov/notifications-api/pull/3172
2021-03-18 13:05:12 +00:00
Ben Thorner
c76e789f1e Reduce extra S3 ops when working with letter PDFs
Previously we did some unnecessary work:

- Collate task. This had one S3 request to get a summary of the object,
which was then used in another request to get the full object. We only
need the size of the object, which is included in the summary [1].

- Archive task. This had one S3 request to get a summary of the object,
which was then used to make another request to delete it. We still need
both requests, but we can remove the S3.Object in the middle.

[1]: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#objectsummary
2021-03-16 12:53:13 +00:00
Ben Thorner
ff7eebc90a Simplify deleting old letters
Previously we made a call to S3 to list objects for a letter, even
though we already had the precise key of the single object to hand.
This removes the one usage of "get_s3_bucket_objects" and uses the
filename directly in the call to remove the object.
2021-03-15 17:18:20 +00:00
Ben Thorner
b43a367d5f Relax lookup of letter PDFs in S3 buckets
Previously we generated the filename we expected a letter PDF to be
stored at in S3, and used that to retrieve it. However, the generated
filename can change over the course of a notification's lifetime e.g.
if the service changes from crown ('.C.') to non-crown ('.N.').

The prefix of the filename is stable: it's based on properties of the
notification - reference and creation - that don't change. This commit
changes the way we interact with letter PDFs in S3:

- Uploading uses the original method to generate the full file name.
The method is renamed to 'generate_' to distinguish it from the new one.

- Downloading uses a new 'find_' method to get the filename using just
its prefix, which makes it agnostic to changes in the filename suffix.

Making this change helps to decouple our code from the requirements DVLA
have on the filenames. While it means more traffic to S3, we rely on S3
in any case to download the files. From experience, we know S3 is highly
reliable and performant, so don't anticipate any issues.

In the tests we favour using moto to mock S3, so that the behaviour is
realistic. There are a couple of places where we just mock the method,
since what it returns isn't important for the test.

Note that, since the new method requires a notification object, we need
to change a query in one place, the columns of which were only selected
to appease the original method to generate a filename.
2021-03-15 13:55:44 +00:00
David McDonald
41d95378ea Remove everything for the performance platform
We no longer will send them any stats so therefore don't need the code
- the code to work out the nightly stats
- the performance platform client
- any configuration for the client
- any nightly tasks that kick off the sending off the stats

We will require a change in cronitor as we no longer will have this task
run meaning we need to delete the cronitor check.
2021-03-15 12:04:53 +00:00
Ben Thorner
a91fde2fda Run auto-correct on app/ and tests/ 2021-03-12 11:45:45 +00:00
Rebecca Law
19f7a6ce38 Refactor method for deciding the failure type 2021-03-10 14:39:55 +00:00
Pea Tyczynska
e0c73ac342 Send daily email with letter and sheet volumes to DVLA 2021-02-23 15:13:19 +00:00
Pea Tyczynska
c8ffebcce8 Query to get letter and sheet volumes
So we can send daily email with these volumes to DVLA.
2021-02-23 15:13:18 +00:00
David McDonald
9aba3d758b Fix test that fails after 5:30pm
Was failing when ran after 5:30pm as this would cause the letters to be
in a different subfolder (for one day later). Solved by freezetiming it

Example build that failed: https://cd.gds-reliability.engineering/builds/1876957
2020-12-24 09:57:52 +00:00
Chris Hill-Scott
3b0b96834d Do extra code style checks with flake8-bugbear
Flake8 Bugbear checks for some extra things that aren’t code style
errors, but are likely to introduce bugs or unexpected behaviour. A
good example is having mutable default function arguments, which get
shared between every call to the function and therefore mutating a value
in one place can unexpectedly cause it to change in another.

This commit enables all the extra warnings provided by Flake8 Bugbear,
except for:
- the line length one (because we already lint for that separately)
- B903 Data class should either be immutable or use `__slots__` because
  this seems to false-positive on some of our custom exceptions
- B902 Invalid first argument 'cls' used for instance method because
  some SQLAlchemy decorators (eg `declared_attr`) make things that
  aren’t formally class methods take a class not an instance as their
  first argument

It disables:
- _B306: BaseException.message is removed in Python 3_ because I think
  our exceptions have a custom structure that means the `.message`
  attribute is still present

Matches the work done in other repos:
- https://github.com/alphagov/notifications-admin/pull/3172/files
2020-12-22 16:26:45 +00:00
David McDonald
e35ea57ba2 Do not delete letters if not in final state
A few weeks ago, we deleted some pdf letters that had reached their
retention period. However, these letters were in the 'created' state so
it's very arguable that we should not have deleted them because we were
expecting to resend them and were unable to. Part of the reason for this
is that we marked the letters back to `created` as the status but we did
not nullify the `sent_at` timestamp, meaning the check on
ebb43082d5/app/dao/notifications_dao.py (L346)
did not catch it. Regardless of that check, which controls whether the
files were removed from S3, they were also archived into the
`notification_history` table as by default.

This commit does changes our code such that letters that are not in
their final state do not go through our retention process. This could
mean they violate their retention policy but that is likely the lesser
of two evils (the other being we delete them and are unable to resend
them).

Note, `sending` letters have been included in those not to be removed
because there is a risk that we give the letter to DVLA and put it in
`sending` but then they come back to us later telling us they've had
problems and require us to resend.
2020-12-16 10:50:11 +00:00
David McDonald
1bf9b29905 Document behaviour of s3 letter deleting
The behaviour was a bit of opaque so I have added tests around it so
it's clear what it is doing and why. No functionality has changed
2020-12-16 10:39:31 +00:00
David McDonald
219023f4c6 Fix test that was passing unintentionally
This test was added in
ebb43082d5

However, there are a few problems with it

1. The test name doesn't seem to match what the code is doing. It looks
   like instead that it is NOT trying to delete from s3 when the letter
   has not been sent and therefore I've updated the test name as such.

2. `delete_notifications_older_than_retention_by_type` was being called
   with `email` as it's argument which doesn't match. This is a test for
   letters. It definitely wouldn't do any looking in s3 for emails.

3. `created_at` needed bumping back into the past, past the default 7
   days retention so these letters would be considered old enough to
   delete

4. For the letter to not be sent, it needs to be in `created`, not in
   `sending` so I have updated the status. Note, there could be other
   statuses which class as 'not sent' but this is the most obvious one
   to test with
2020-12-15 09:48:08 +00:00
Leo Hemsted
3bc3ed88b3 use yield_per instead of limit
limit means we only return 50k letters, if there are more than that for
a service we'll skip them and they won't be picked up until the next
day.

If you remove the limit, sqlalchemy prefetches query results so it can
build up ORM results, for example collapsing joined rows into single
objects with chidren. SQLAlchemy streams the data into a buffer, and
normally will still prefetch the entire resultset so it can ensure
integrity of the session, (so that if you modify one result that is
duplicated further down in the results, both rows are updated in the
session for example). However, we don't care about that, but we do care
about preventing the result set taking up too much memory. We can use
`yield_per` to yield from sqlalchemy to the iterator (in this case the
`for letter in letters_awaiting_sending` loop in letters_pdf_tasks.py) -
this means every time we hit 10000 rows, we go back to the database to
get the next 10k. This way, we only ever need 10k rows in memory at a
time.

This has some caveats, mostly around how we handle the data the query
returns. They're a bit hard to parse but I'm pretty sure the notable
limitations are:

* It's dangerous to modify ORM objects returned by yield_per queries
* It's dangerous to join in a yield_per query if you think there will be
  more than one row per item (for example, if you join from notification
  to service, there'll be multiple result rows containing the same
  service, and if these are split over different yield chunks, then we
  may experience undefined behaviour.

These two limitations are focused around there being no guarantee of
having one unique row per item.

For more reading:
https://docs.sqlalchemy.org/en/13/orm/query.html?highlight=yield_per#sqlalchemy.orm.query.Query.yield_per
https://www.mail-archive.com/sqlalchemy@googlegroups.com/msg12443.html
2020-10-26 13:01:34 +00:00
Leo Hemsted
ed182c2a22 return just the columns we need for collating letters
previously we were returning the entire ORM object. Returning columns
has a couple of benefits:

* Means we can join on to services there and then, avoiding second
  queries to get the crown status of the service later in the collate
  flow.
* Massively reduces the amount of data we return - particularly free
  text fields like personalisation that could be potentially quite big.
  5 columns rather than 26 columns.
* Minor thing, but will skip some CPU cycles as sqlalchemy will no
  longer construct an ORM object and try and keep track of changes. We
  know this function doesn't change any of the values to persist them
  back, so this is an unnecessary step from sqlalchemy.

Disadvantages are:

* The dao_get_letters_to_be_printed return interface is now much more
  tightly coupled to the get_key_and_size_of_letters_to_be_sent_to_print
  function that calls it.
2020-10-23 20:01:18 +01:00
Leo Hemsted
4b61060d32 stream notifications when collating zip files
we had issues where we had 150k 2nd class notifications, and the collate
task never ran properly, presumably because the volume of data being
returned was too big.

to try and help with this, we can switch to streaming rather than using
`.all` and building up lists of data. This should help, though the
initial query may be a problem still.
2020-10-23 12:20:26 +01:00
Pea Tyczynska
9ac65ee95c Start sending letters from insolvency service again 2020-10-21 16:56:18 +01:00
Pea Tyczynska
30bd311eb1 Temporarily do not send letters from Insolvency Service
to DVLA. This is a temporary measure over the weekend so that
DVLA can catch up with all other letters.

We should revert this on Monday 19.10.2020
2020-10-16 16:13:32 +01:00
Rebecca Law
b2ff4277c9 Adding service_id to the sort order for the letters being sent to print.
We have had a few instances where letters have caused problems. Particularly for precompiled letters, often the issue comes from the same service.
The hope is that by adding a sort order this will help the print provider narrow down the problem.

There is a small degradation of the performance of the query, but it's not enough to concern me.
2020-10-15 09:39:07 +01:00
Chris Hill-Scott
cfda289746 Allow international letters to be cancelled
Our code was assuming that any notifications with `international` set to
`True` were text messages. It was then trying to look up delivery
information for a notification which wasn’t sent to a phone number,
causing an exception.
2020-09-09 10:55:55 +01:00
Pea Tyczynska
fff004da2b Turn statsd back on, but not for api main app, only for delivery
workers.
2020-07-09 10:10:24 +01:00
Pea Tyczynska
9c4205c7c6 Remove statsd decorators from dao functions
This done so that we do not use statsd on our http endpoint.
We decided we do not need metrics that this gave us. If we
change our minds, we will add Prometheus-friendly decorators
instead in the future.
2020-07-07 18:02:24 +01:00
Rebecca Law
ce32e577b7 Remove the use of schedule_for in post_notifications.
Years ago we started to implement a way to schedule a notification. We hit a problem but we never came up with a good solution and the feature never made it back to the top of the priority list.

This PR removes the code for scheduled_for. There will be another PR to drop the scheduled_notifications table and remove the schedule_notifications service permission

Unfortunately, I don't think we can remove the `scheduled_for` attribute from the notification.serialized method because out clients might fail if something is missing. For now I have left it in but defaulted the value to None.
2020-06-24 14:54:40 +01:00
Chris Hill-Scott
80fc5e6600 Paginate search results for notifications
The standard way that we indicate that there are more results than can
be returned is by paginating. So even though we don’t intend to paginate
the search results in the admin app, it can still use the presence or
absence of a ‘next’ link to determine whether or not to show a message
about only showing the first 50 results.
2020-05-06 12:13:00 +01:00
Chris Hill-Scott
625aad97c9 Limit search by recipient to 50 results
Things could get ugly if you use a short search string on a service with
lots of notifications…
2020-05-06 11:44:35 +01:00
Chris Hill-Scott
1e20cb6be9 Stop searching the to field
We were doing this temporarily while the `normalised_to` field was not
populated for letters. Once we have a week of data in the
`normalised_to` field we can stop looking in the `to` field. This should
improve performance because:
- it’s one `WHERE` clause not two
- we had to do `ilike` on the `to` field, because we don’t lowercase its
  contents – even if the two where clauses are somehow paralleized it’s
  is slower than a `like` comparison, which is case-sensitive

Depends on:
- [ ] Tuesday 5 May (7 full days after deploying https://github.com/alphagov/notifications-api/pull/2814)
2020-04-30 10:51:30 +01:00
Chris Hill-Scott
7897ae70ce Let users search for letters
Like we have search by email address or phone number, finding an
individual letter is a common task. At the moment users are having to
click through pages and pages of letters to find the one they’re looking
for.

We have to search in the `to` and `normalised_to` fields for now because
we’re not populating the `normalised_to` column for letters at the
moment.
2020-04-21 16:25:37 +01:00
Katie Smith
bfd40a843c Delete send-scheduled-notifications task
This was added a few years ago, but never used.
2020-04-02 14:45:18 +01:00
Rebecca Law
1444150d48 Add team key to the notifications to delete/insert into notifications/notification_history 2020-03-31 09:23:41 +01:00
Rebecca Law
5d0830d7f7 Rename function for clarity 2020-03-27 15:48:54 +00:00
Rebecca Law
46b72a6bbb Refactor process_ses_receipt
It is possible a service has data rention that is smaller than the time it takes to get a delivery receipt.
This PR refactors process_ses_receipt to update NotificationHistory if the Notifcation has already been purged.
2020-03-27 14:12:39 +00:00
Rebecca Law
8da73510c2 Delete all notification_status types.
This will ensure the whole day has been deleted. The stats table could get the wrong updates if there is partial data for a day.
2020-03-27 12:14:30 +00:00
Rebecca Law
69102db4cb Remove the hourly loop and replace it with a while loop, which check if we still have things to delete for the day.
We are already limiting the query by a query limit.
2020-03-24 14:54:06 +00:00
Rebecca Law
3ca73a35ed Change as per code review
- fix test name
 - make query filter consistent with each other
 - add comment for clarity
 - add inner loop to continue to insert and delete notifications while the delete count is greater than 0
2020-03-24 14:17:47 +00:00
Rebecca Law
f04210ba7d Refactor delete_notifications_older_than_retention_by_type to use a new strategy.
The insert_notification_history_delete_notifications function uses a temp table to store the data to insert and delete. This will save extra queries while performing the insert and delete operations.
The function is written in such a way that if the task is stop while processing when it's started up again it will just pick up where it left off.

I've made a decision to delete all test data in one query, I don't anticipate a problem with that.

The performance of this might also be better than last nights test because we are inserting everything we need for the NotificationHistory insert, so we don't need the join to Notifications to perform the insert.
2020-03-24 12:21:28 +00:00