Commit Graph

1187 Commits

Author SHA1 Message Date
Ben Thorner
a69d1635a1 Update FactStatus table in bulk for each service
Previously we were looping over data from the Notifications/History
table and then shovelling it into the status table, one row at a time
- plus an extra delete to clean up any existing data.

This replaces that with a batch insertion, similar to how we archive
notifications [1], but using a simple subquery (via "from_select" [2])
instead of a temporary table.

To make the select compatible with the insert, I've used "literal"
to inject the constant pieces of data, so each row has everything it
needs to go into the status table.

[1]: 9ce6d2fe92/app/dao/notifications_dao.py (L295)
[2]: https://docs.sqlalchemy.org/en/14/core/dml.html#sqlalchemy.sql.expression.Insert.from_select
2022-02-16 13:40:05 +00:00
Ben Thorner
966c4db8c6 Fix getting service IDs for status aggregation
Addresses [1].

Previously the query would always use UTC midnight, even after we
had switched to BST (+1h). We store timestamps as naive UTC in our
DB - without a timezone - but we want the query to work in terms
of GMT / BST so we adjust for that - BST midnight is 11PM in UTC.

[1]: https://github.com/alphagov/notifications-api/pull/3437#discussion_r791998690
2022-02-10 10:51:45 +00:00
Ben Thorner
6e8f121548 Standardise how we query midnight-to-midnight
Partially addresses [1] (lots more detail to read in the comment).
I've also added some tests for the status DAO function to confirm
it behaves as expected across timezones.

[1]: https://github.com/alphagov/notifications-api/pull/3437#discussion_r802634913
2022-02-10 10:51:27 +00:00
David McDonald
1d8fafcdf4 Remove unused functions
Can't see these being used anywhere so lets get rid of them
2022-02-07 15:58:04 +00:00
Rebecca Law
09c8fbe982 Merge pull request #3418 from alphagov/letters-too-long
Mark letters as validation-failed if the templated letter is too long.
2022-02-02 08:30:50 +00:00
Rebecca Law
6cd7a23d3c If there is an invalid letter that has not been updated to validation-failed because the update-validation-failed-for-templated-letter has not been picked up off the letter-tasks queue and the collate-letter-pdfs-to-be-sent has started.
1. The number of letters that we send to DVLA will be not be correct (see 20ead82463/app/celery/letters_pdf_tasks.py (L136))
This may raise an alert with DVLA when they find we have sent them fewer letter than we have reported.
2. When we get the PDF from S3 we will get a file not found 20ead82463/app/celery/letters_pdf_tasks.py (L244)
The error will not prevent the collate task from completing but we will see an alert email for the exception and raise questions.

Although this situation is very unlikely because we have a 15 minute window between the last letter deadline date and the time we kick off the collate task we should still mitigate these issues. I updated the queries to only return letters with billable_units > 0, all valid letters should have at least 1 billable unit.
2022-01-19 08:31:19 +00:00
Ben Thorner
9fc8b904c6 DRY up status aggregation tests (move DAO tests up)
The previous DAO tests were also confusing because they were testing
two functions at the same time, so moving the tests up to the task
level seems very reasonable, and will make it easier to change how
this code works in the next commits.
2022-01-11 16:11:36 +00:00
Ben Thorner
081e0cab88 Merge pull request #3417 from alphagov/optimise-status-query-180693991
Optimise query to populate notification statuses
2022-01-11 14:18:36 +00:00
Rebecca Law
2257cae398 Fix bug in organisation report for its services and usages.
If a service has not sent any SMS for the financial year the free allowance was showing up as 0 rather than the number in annual billing. The query has been updated to use an outer join so that the free allow will be returned when there is no ft_billing.

There is a potential performance enhancement to only return the data for the services of the organisation in the `fetch_sms_free_allowance_remainder_until_date` subquery. I will investigate in a subsequent PR.
2022-01-11 10:04:36 +00:00
Ben Thorner
394bf9abd9 Extend test for updating fact statuses
This covers that we only exclude test notifications and the key
type is copied over correctly. In the next commits we're going to
modify this part of the query, so it's important it's covered.
2022-01-05 16:49:30 +00:00
Pea Tyczynska
32cd7a0eb6 Merge pull request #3395 from alphagov/fix_org_usage_report
Fix calculating remaining free allowance for SMS
2021-12-21 15:02:54 +00:00
Pea Tyczynska
d334e405c5 Refactor tests for sms remainder to make them easier to read 2021-12-21 14:43:56 +00:00
Ben Thorner
de9ae08ecc Downgrade log for letter deletion exceptions
If the S3 object is missing [1], then that's what we want, so we
don't need such a severe log for it, but we still want to know as
it's not expected. This is separate to more general "ClientError"
exceptions, which could mean anything.

There weren't any tests to cover missing S3 objects, so I've added
one. I don't think we need a test for ClientErrors:

- If there was no handler, the task would fail and we'd learn about
it that way.

- The scope of the calling task is now much smaller, so it matters
less than it used to [2].

[1]: 81a79e56ce/app/letters/utils.py (L52)
[2]: f965322f25
2021-12-20 12:45:48 +00:00
Leo Hemsted
0dc0e184b9 clean up and rewrite notification_dao_delete_notifications
a bunch of these tests are now covered in the task test, so got rid of
some. Now that the "how long ago to delete" questions is asked in the
task rather than in the dao, and only one service is looked at at a
time, we don't need to worry about data retention, etc. Hopefully made
the tests simpler - there may still be some duplicates or overlaps
between the various cases.
2021-12-14 15:24:35 +00:00
Ben Thorner
c1f0c24d82 Trim down tests for DAO timeout function a bit
The first test is enough to cover that "created" and "delivered"
notifications aren't affected by this function.
2021-12-13 17:17:41 +00:00
Ben Thorner
76aeab24ce Rewrite DAO timeout method to take cutoff_time
Previously we specified the period and calculated the cutoff time
in the function. Passing it in means we can run the method multiple
times and avoid getting "new" notifications to time out in the time
it takes to process each batch.
2021-12-13 16:56:21 +00:00
Ben Thorner
b81a66da50 Fix assertions in tests for timeout DAO function
Previously most of the assertions were being run *before* we had
actually called the function. There was also a redundant block of
assertions that just asserted the initial state of the test data.
2021-12-13 16:48:30 +00:00
Ben Thorner
2fb432adaf Merge pull request #3383 from alphagov/email-sms-created-alert-180344153
Add new log / alert for 'created' email / SMS
2021-12-13 12:56:05 +00:00
David McDonald
7d8eed8228 Optimise queries run for creating pagination links
We have been running in to the problem in
pallets/flask-sqlalchemy#518 where
our page loads very slow when viewing a single page of notifications
for a service in the admin app. Tracing this back and using SQL
explain analyze I can see that getting the notifications takes about
a second but the second query to count how many notifications there
are (to work out if there is a next page of pagination) can take up
to 100 seconds.

As suggested in that issue, we do the pagination ourselves.
Our pagination doesn't need us to know exactly how many notifications
there are, just whether there are any on the next page and that can
be done without running the slow query to count how many
notifications in total by using `count_pages=False`.

This commit is analagous to
c68d1a2f23

The only difference is that in that case, the pagination links are
used to show prev and/or next links in the admin app. In this case,
the pagination links are only used to see if there is a page 2, and
if there is, say that we are only showing the first 50 results.
2021-12-10 17:47:27 +00:00
Pea Tyczynska
a74d1b026f Fix calculating remaining free allowance for SMS
The way it was done before, the remainder was incorrect in the
billing report and in the org usage query - it was the sms remainder
left at the start of the report period, not at the end of that period.

This became apparent when we tried to show sms_remainder on the org
usage report, where start date is always the start of the financial year.
We saw that sms sent by services did not reduce their free allowance
remainder according to the report. As a result of this, we had to
temporarily remove of sms_remainder column from the report, until
we fix the bug - it has been fixed now, yay!

I think the bug has snuck in partially because our fixtures for testing
this part of the code are quite complex, so it was
harder to see that numbers don't add up. I have added comments
to the tests to try and make it a bit clearer why the results are
as they are.

I also added comments to the code, and renamed some variables,
to make it easier to understand, as there are quite a few
moving parts in it - subqueries and the like.
I also renamed the fetch_sms_free_allowance_remainder method to
fetch_sms_free_allowance_remainder_until_date so it is clearer
what it does.
2021-12-09 18:58:10 +00:00
Ben Thorner
ab4cb029df Remove alert for email / sms in created
In response to [1].

[1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988

It turns out the code that inspired this new alert - in the old
"timeout-sending-notifications" task - was actually redundant as
we already have a task to "replay" notifications still in "created",
which is much better than just alerting about them.

It's possible the replayed notifications will also fail, but in
both cases we should see some kind of error due to this, so I don't
think we're losing anything by not having an alert.
2021-12-06 14:11:42 +00:00
Ben Thorner
0318229216 Stop 'timing out' old 'created' notifications
This is being replaced with a new alert and runbook [1]. It's not
always appropriate to change the status to 'technical-failure', and
the new alert means we'll act to fix the underlying issue promptly.

We'll look at tidying up the remaining code in the next commits.

[1]: https://github.com/alphagov/notifications-manuals/wiki/Support-Runbook#deal-with-email-or-sms-still-in-created
2021-12-06 14:00:36 +00:00
Ben Thorner
f96ba5361a Add new task to alert about created email / sms
This will log an error when email or SMS notifications have been
stuck in 'created' for too long - normally they should be 'sending'
in seconds, noting that we have a goal of < 10s wait time for most
notifications being processed our platform.

In the next commits we'll decouple similar functionality from the
existing 'timeout-sending-notifications' task.
2021-12-06 14:00:31 +00:00
Leo Hemsted
bab659c677 reduce number of services we try and delete notifications for
TLDR: Don't return as many services, and only return their IDs and not
the whole service objects.

Context:

the delete notifications nightly task has been taking longer and longer,
and to delete all three notification types in sequence it now takes up
to 8 hours.

This is because we were retrieving all services, loading them into
memory on the worker, and then trying to delete notifications for each
service in turn.

While it does use a fair chunk of IOPS/CPU on our postgres db, we're not
anywhere close to capacity on those (20% CPU, 4k IOPS out of 30k max)[1]

The real issue appears to be that the task is CPU bound on the periodic
worker - we see the worker spike up to 100% CPU regularly across the
whole 3am-11am period.

We also noticed that for each notification type the task first processes
services with custom data retention (not many but some of the biggest
users), then deals with all other services. We can see from looking at
kibana that, for example, the task starts at 3am, and the custom data
retention service email deletions are finished by 3:12am. The rest of
the emails don't get deleted until 5am, so we knew that the problem is
with how it handles the other services.

There are currently 17000 services in the database. On a typical day,
~800 services will have notifications that are over 7 days old and need
to be deleted. By only returning these services, we reduce the amount of
data transfer and serialisation that needs to happen. It takes about two
minutes to retrieve the distinct service ids from the notifications
table for sms notifications, but that is only 5% the size of the full
list so cuts down on a lot of processing

Also, by only returning service_ids rather than the whole `Service`
model we avoid sqlalchemy needing to do lots of data serialisation, when
we were only using the `Service.id` field from that result anyway.

[1] https://admin.cloud.service.gov.uk/organisations/55b1eb7d-e4c5-4359-9466-dd3ca5b0e457/spaces/80d769ff-7b01-49a4-9fa4-f87edd5328f9/services/6093d337-6918-4b97-9709-97529114eb90/metrics
[2] https://grafana-paas.cloudapps.digital/d/_GlGBNbmk/notify-apps?orgId=2&refresh=5s&var-space=production&var-app=notify-delivery-worker-periodic&from=now-24h&to=now
[3] https://kibana.logit.io/s/9423a789-282c-4113-908d-0be3b1bc9d1d/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:now-24h,mode:quick,to:now))&_a=(columns:!(message),index:'logstash-*',interval:auto,query:(query_string:(analyze_wildcard:!t,query:'%22Deleting%20email%20notifications%20for%20services%20without%20flexible%20data%20retention%22')),sort:!('@timestamp',desc))
2021-11-24 16:18:40 +00:00
David McDonald
106187ba04 Fix division by zero error on performance page
For preview and staging environments, we often send no messages
in a single day. This is currently causing a `DivisionByZero` error
that is rendering the page with no results. This makes it impossible
to look at preview/staging and see if the performance page is
working correctly or not.

(psycopg2.errors.DivisionByZero) division by zero

[SQL: SELECT CAST(ft_processing_time.bst_date AS TEXT) AS date, ft_processing_time.messages_total AS ft_processing_time_messages_total, ft_processing_time.messages_within_10_secs AS ft_processing_time_messages_within_10_secs, (ft_processing_time.messages_within_10_secs / CAST(ft_processing_time.messages_total AS FLOAT)) * %(param_1)s AS percentage
FROM ft_processing_time
WHERE ft_processing_time.bst_date >= %(bst_date_1)s AND ft_processing_time.bst_date <= %(bst_date_2)s ORDER BY ft_processing_time.bst_date]
[parameters: {'param_1': 100, 'bst_date_1': datetime.date(2021, 11, 12), 'bst_date_2': datetime.date(2021, 11, 19)}]
(Background on this error at: http://sqlalche.me/e/14/9h9h)

I've fixed this by falling back to 100.0% for days we send
no messages. Maybe some argument that it should be N/A rather than
100% but I think it doesn't really matter as this is only
going to affect preview and staging as we will never have a day
sending no messages in production.
2021-11-22 11:11:52 +00:00
Chris Hill-Scott
19ad11e383 Don’t repeat digits in security codes
People with dyslexia and dyscalculia find it difficult to transpose
codes which have consecutive, repeated digits[1].

This commits enhances the algorithm for generating codes to not repeat
the previous digit in a code.

This reduces the key space for our codes from 100,000 possibilities to
65,610 possibilities.

1. https://twitter.com/annaecook/status/1442567679710150662
2021-09-30 10:24:17 +01:00
Ben Thorner
be7272b44f Merge pull request #3314 from alphagov/next-stage-areas-migration-178986763
Switch "areas" field to "areas_2" format
2021-09-01 16:16:30 +01:00
Chris Hill-Scott
2c7e4657ce Don’t update email_access_validated_at on password reset
As of https://github.com/alphagov/notifications-admin/pull/4000/files
the admin app is doing this, so we don’t need to do it here as well.
2021-09-01 09:54:54 +01:00
Ben Thorner
ec1171f85c Switch "areas" field to "areas_2" format
The Admin app is now temporarily using the "areas_2" field while
we migrate "areas" to the new format [1].

[1]: https://github.com/alphagov/notifications-admin/pull/4004
2021-08-27 14:22:11 +01:00
Ben Thorner
8f39d476bd Start dual running with "areas" and (area) "ids"
This is necessary until:

- The Admin app is using the new "areas(_2)" format to store and
retrieve data.

- We've migrated all existing broadcast messages to use the new
format.

Note that "areas" / "ids" isn't actually used for anything except
printing out the PagerDuty message - it's not sent to the proxy [1].

[1]: 6edc6c70aa/app/celery/broadcast_message_tasks.py (L190-L193)
2021-08-26 15:34:24 +01:00
Pea Tyczynska
9d2f8347b2 get_broadcasts returns a list of broadcasts for gov.uk/alerts 2021-08-12 14:03:33 +01:00
Pea Tyczynska
0f7f219a55 dao_get_all_broadcast_messages returns just fields govuk alerts need 2021-08-11 14:43:27 +01:00
Pea Tyczynska
74c9ca2bf6 Fetch all broadcast messages that are or were transmitted
Regardless of channel.
Do not include:
- broadcasts older than 25.05.2021
- stubbed broadcasts
- broadcasts that were not transmitted. So only broadcasting,
cancelled and completed make the list;
2021-08-11 14:43:27 +01:00
Pea Tyczynska
c28e9451d4 Bump moto version to try solve dependencies version conflict
Also update mock import statements in some test files as they
stopped working with this dependency update.
2021-07-08 15:37:19 +01:00
Rebecca Law
35b20ba363 Correct the daily limits cache.
Last year we had an issue with the daily limit cache and the query that was populating it. As a result we have not been checking the daily limit properly. This PR should correct all that.

The daily limit cache is not being incremented in app.notifications.process_notifications.persist_notification, this method is and should always be the only method used to create a notification.
We increment the daily limit cache is redis is enabled (and it is always enabled for production) and the key type for the notification is team or normal.

We check if the daily limit is exceed in many places:
 - app.celery.tasks.process_job
 -  app.v2.notifications.post_notifications.post_notification
 - app.v2.notifications.post_notifications.post_precompiled_letter_notification
 - app.service.send_notification.send_one_off_notification
 - app.service.send_notification.send_pdf_letter_notification

If the daily limits cache is not found, set the cache to 0 with an expiry of 24 hours. The daily limit cache key is service_id-yyy-mm-dd-count, so each day a new cache is created.

The best thing about this PR is that the app.service_dao.fetch_todays_total_message_count query has been removed. This query was not performant and had been wrong for ages.
2021-06-22 16:15:36 +01:00
Rebecca Law
d4a42471cb Merge pull request #3267 from alphagov/fix-daily-totals-query
Improve the query to get today's totals for a service.
2021-06-16 07:34:01 +01:00
Rebecca Law
08bb5c657f Fix the query to get todays totals for a service.
The query had a group by on notification_type and notification_status, this not only slows the query down but is wrong. The query only looked at the first result, but this query would return as many rows as different notification types and status, meaning the results do not include the correct number.

Are we concerned that all status types are included. For example letters can be cancelled or have validation-failures which shouldn't be included in the daily limit check.
2021-06-14 15:29:21 +01:00
Katie Smith
0148b3dba6 Add new total_letters field to the billing report data
This adds total_letters to the data that is returned by the
`/platform-stats/data-for-billing-report` endpoint so that we can add
total letters as a column in the CSV file that can be downloaded.
2021-06-11 11:31:22 +01:00
Rebecca Law
684a882cf3 Revert "Do not include today's totals" 2021-06-02 16:06:33 +01:00
Rebecca Law
c668bed9d3 Merge pull request #3256 from alphagov/no-totals-for-high-volume-services
Do not include today's totals
2021-06-02 15:08:45 +01:00
Rebecca Law
a341536de0 - Add comment to test and new if statement
- Update assert in test
2021-06-02 14:13:31 +01:00
Rebecca Law
b170b5ed80 This change is a temporary fix to allow users for high volume services to use the admin app.
The trouble is the aggregate query to return the big blue numbers on the dashboard and /notifications/{notification_type} page is taking too long to return.
I have some ideas on how to improve the query, but should take some time to do some more research and test. In the meantime, let's just ignore "todays" total numbers for the high volume services. There are only two services that this will affect.
2021-06-02 10:31:38 +01:00
Rebecca Law
d4b5373ee5 Update tests for BST to be on the day BST starts.
Add a test for just after midnight of the date the stats are collected.
2021-06-01 16:07:37 +01:00
David McDonald
04e23ca6a9 Revert "Bump utils version for new invalid address character" 2021-06-01 10:53:28 +01:00
Rebecca Law
dfacba5053 Add a test for a bst date 2021-05-26 11:50:05 +01:00
Rebecca Law
f66f0a2e2d Bump moto dependency to resolve dependency conflicts 2021-05-25 14:20:25 +01:00
Rebecca Law
68d28aa83b The update of SQLAlchemy 1.4.10 has caused some conflicts in our code. This PR fixes most of those conflicts.
- sqlalchemy.sql.expression.case must include an else statement.
- clearly define list of columns for inbound_sms_history insert, getting the list from InboundSmsHistory.__table__.c was causing data type errors.
- remove relationships when not needed, the foreign key relationship is established in the creation of the column. This will get rid of the warnings referenced here: http://sqlalche.me/e/14/qzyx.
- update queries now that he user relationship in ServiceUser db model has been removed.
- move the check that a template is archived to the view instead of the dao method. The check was clearing the session before the version history could be done.

Deleting notifications in the night tasks still needs to be
investigated. The raw sql is causing an error.
2021-04-29 13:32:36 +01:00
Pea Tyczynska
df19a91b7f Update error messages after SQLAlchemy version update 2021-04-29 13:32:36 +01:00
Rebecca Law
85895a9e8b Revert "Scheduled weekly dependency update for week 16" 2021-04-28 10:17:16 +01:00
Rebecca Law
1b070d69a1 The update of SQLAlchemy 1.4.10 has caused some conflicts in our code. This PR fixes most of those conflicts.
- sqlalchemy.sql.expression.case must include an else statement.
- clearly define list of columns for inbound_sms_history insert, getting the list from InboundSmsHistory.__table__.c was causing data type errors.
- remove relationships when not needed, the foreign key relationship is established in the creation of the column. This will get rid of the warnings referenced here: http://sqlalche.me/e/14/qzyx.
- update queries now that he user relationship in ServiceUser db model has been removed.
- move the check that a template is archived to the view instead of the dao method. The check was clearing the session before the version history could be done.

Deleting notifications in the night tasks still needs to be
investigated. The raw sql is causing an error.
2021-04-26 11:50:30 +01:00