Commit Graph

900 Commits

Author SHA1 Message Date
stvnrlly
53204c307b tests are, uh, mostly passing 2022-10-05 01:12:35 +00:00
stvnrlly
57f4df8ed1 remove broadcast-related code, except migrations 2022-10-04 15:28:27 +00:00
Ryan Ahearn
538d2cbe4c Proactively specify aws region for s3 operations 2022-09-26 10:56:59 -04:00
Ryan Ahearn
8ede076708 Use correct access credentials for each bucket 2022-09-22 12:14:25 -04:00
Ryan Ahearn
cb4036b1b0 Disable letter-based S3 buckets 2022-09-21 11:22:55 -04:00
Christa Hartsock
c4cdaed683 Skip tests that fail because of timezone handling 2022-07-07 15:41:16 -07:00
Christa Hartsock
041a892e86 Pull admin base url from test config in tests 2022-07-07 15:41:16 -07:00
Christa Hartsock
af6495cd4c Get tests passing locally
When we cloned the repository and started making modifications, we
didn't initially keep tests in step. This commit tries to get us to a
clean test run by skipping tests that are failing and removing some
that we no longer expect to use (MMG, Firetext), with the intention that
we will come back in future and update or remove them as appropriate.

To find all tests skipped, search for `@pytest.mark.skip(reason="Needs
updating for TTS:`. There will be a brief description of the work that
needs to be done to get them passing, if known. Delete that line to make
them run in a standard test run (`make test`).
2022-07-07 15:41:15 -07:00
Ben Thorner
43dbc0891f Merge pull request #3546 from alphagov/notification-view-178125825
Use notification view for status / billing tasks
2022-05-26 11:03:38 +01:00
Ben Thorner
33645c7747 Use notification view for status / billing tasks
This fixes a bug where (letter) notifications left in sending would
temporarily get excluded from billing and status calculations once
the service retention period had elapsed, and then get included once
again when they finally get marked as delivered.*

Status and billing tasks shouldn't need to have knowledge about which
table their data is in and getting this wrong is the fundamental cause
of the bug here. Adding a view across both tables abstracts this away
while keeping the query complexity the same.

Using a view also has the added benefit that we no longer need to care
when the status / billing tasks run in comparison to the deletion task,
since we will retrieve the same data irrespective (see below for a more
detailed discussion on data integrity).

*Such a scenario is rare but has happened.

A New View
==========

I've included all the columns that are shared between the two tables,
even though only a subset are actually needed. Having extra columns
has no impact and may be useful in future.

Although the view isn't actually a table, SQLAlchemy appears to wrap
it without any issues, noting that the package doesn't have any direct
support for "view models". Because we're never inserting data, we don't
need most of the kwargs when defining columns.*

*Note that the "default" kwarg doesn't affect data that's retrieved,
only data that's written (if no value is set).

Data Integrity
==============

The (new) tests cover the main scenarios.

We need to be careful with how the view interacts with the deletion /
archiving task. There are two concerns here:

- Duplicates. The deletion task inserts before it deletes [^1], so we
could end up double counting. It turns out this isn't a problem because
a Postgres UNION is an implicit "DISTINCT" [^2]. I've also verified this
manually, just to be on the safe side.

- No data. It's conceivable that the query will check the history table
just before the insertion, then check the notifications table just after
the deletion. It turns out this isn't a problem either because the whole
query sees the same DB snapshot [^3][^4].*

*I can't think of a way to test this as it's a race condition, but I'm
confident the Postgres docs are accurate.

Performance
===========

I copied the relevant (non-PII) columns from Production for data going
back to 2022-04-01. I then ran several tests.

Queries using the new view still make use of indices on a per-table basis,
as the following query plan illustrates:

                                                                                          QUERY PLAN
      ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
       GroupAggregate  (cost=1130820.02..1135353.89 rows=46502 width=97) (actual time=629.863..756.703 rows=72 loops=1)
         Group Key: notifications_all_time_view.template_id, notifications_all_time_view.sent_by, notifications_all_time_view.rate_multiplier, notifications_all_time_view.international
         ->  Sort  (cost=1130820.02..1131401.28 rows=232506 width=85) (actual time=629.756..708.914 rows=217563 loops=1)
               Sort Key: notifications_all_time_view.template_id, notifications_all_time_view.sent_by, notifications_all_time_view.rate_multiplier, notifications_all_time_view.international
               Sort Method: external merge  Disk: 9320kB
               ->  Subquery Scan on notifications_all_time_view  (cost=1088506.43..1098969.20 rows=232506 width=85) (actual time=416.118..541.669 rows=217563 loops=1)
                     ->  Unique  (cost=1088506.43..1096644.14 rows=232506 width=725) (actual time=416.115..513.065 rows=217563 loops=1)
                           ->  Sort  (cost=1088506.43..1089087.70 rows=232506 width=725) (actual time=416.115..451.190 rows=217563 loops=1)
                                 Sort Key: notifications_no_pii.id, notifications_no_pii.job_id, notifications_no_pii.service_id, notifications_no_pii.template_id, notifications_no_pii.key_type, notifications_no_pii.billable_units, notifications_no_pii.notification_type, notifications_no_pii.created_at, notifications_no_pii.sent_by, notifications_no_pii.notification_status, notifications_no_pii.international, notifications_no_pii.rate_multiplier, notifications_no_pii.postage
                                 Sort Method: external merge  Disk: 23936kB
                                 ->  Append  (cost=114.42..918374.12 rows=232506 width=725) (actual time=2.051..298.229 rows=217563 loops=1)
                                       ->  Bitmap Heap Scan on notifications_no_pii  (cost=114.42..8557.55 rows=2042 width=113) (actual time=1.405..1.442 rows=0 loops=1)
                                             Recheck Cond: ((service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'::uuid) AND (notification_type = 'sms'::notification_type) AND (notification_status = ANY ('{sending,sent,delivered,pending,temporary-failure,permanent-failure}'::text[])) AND (created_at >= '2022-05-01 23:00:00'::timestamp without time zone) AND (created_at < '2022-05-02 23:00:00'::timestamp without time zone))
                                             Filter: ((key_type)::text = ANY ('{normal,team}'::text[]))
                                             ->  Bitmap Index Scan on ix_notifications_no_piiservice_id_composite  (cost=0.00..113.91 rows=2202 width=0) (actual time=1.402..1.439 rows=0 loops=1)
                                                   Index Cond: ((service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'::uuid) AND (notification_type = 'sms'::notification_type) AND (notification_status = ANY ('{sending,sent,delivered,pending,temporary-failure,permanent-failure}'::text[])) AND (created_at >= '2022-05-01 23:00:00'::timestamp without time zone) AND (created_at < '2022-05-02 23:00:00'::timestamp without time zone))
                                       ->  Index Scan using ix_notifications_history_no_pii_service_id_composite on notifications_history_no_pii  (cost=0.70..906328.97 rows=230464 width=113) (actual time=0.645..281.612 rows=217563 loops=1)
                                             Index Cond: ((service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'::uuid) AND ((key_type)::text = ANY ('{normal,team}'::text[])) AND (notification_type = 'sms'::notification_type) AND (created_at >= '2022-05-01 23:00:00'::timestamp without time zone) AND (created_at < '2022-05-02 23:00:00'::timestamp without time zone))
                                             Filter: (notification_status = ANY ('{sending,sent,delivered,pending,temporary-failure,permanent-failure}'::text[]))
       Planning Time: 18.032 ms
       Execution Time: 759.001 ms
      (21 rows)

Queries using the new view appear to be slower than without, but the
differences I've seen are minimal: the original queries execute in
seconds locally and in Production, so it's not a big issue.

Notes: Performance
==================

I downloaded a minimal set of columns for testing:

      \copy (
        select
          id, notification_type, key_type, created_at, service_id,
          template_id, sent_by, rate_multiplier, international,
          billable_units, postage, job_id, notification_status
        from notifications
      ) to 'notifications.csv' delimiter ',' csv header;

      CREATE TABLE notifications_no_pii (
          id uuid NOT NULL,
          notification_type public.notification_type NOT NULL,
          key_type character varying(255) NOT NULL,
          created_at timestamp without time zone NOT NULL,
          service_id uuid,
          template_id uuid,
          sent_by character varying,
          rate_multiplier numeric,
          international boolean,
          billable_units integer NOT NULL,
          postage character varying,
          job_id uuid,
          notification_status text
      );

      copy notifications_no_pii	 from '/Users/ben.thorner/Desktop/notifications.csv' delimiter ',' csv header;

      CREATE INDEX ix_notifications_no_piicreated_at ON notifications_no_pii USING btree (created_at);
      CREATE INDEX ix_notifications_no_piijob_id ON notifications_no_pii USING btree (job_id);
      CREATE INDEX ix_notifications_no_piinotification_type_composite ON notifications_no_pii USING btree (notification_type, notification_status, created_at);
      CREATE INDEX ix_notifications_no_piiservice_created_at ON notifications_no_pii USING btree (service_id, created_at);
      CREATE INDEX ix_notifications_no_piiservice_id_composite ON notifications_no_pii USING btree (service_id, notification_type, notification_status, created_at);
      CREATE INDEX ix_notifications_no_piitemplate_id ON notifications_no_pii USING btree (template_id);

And similarly for the history table. I then created a sepatate view
across both of these temporary tables using just these columns.

To test performance I created some queries that reflect what is run
by the billing [^5] and status [^6] tasks e.g.

      explain analyze select template_id, sent_by, rate_multiplier, international, sum(billable_units), count(*)
      from notifications_all_time_view
      where
      notification_status in ('sending', 'sent', 'delivered', 'pending', 'temporary-failure', 'permanent-failure')
      and key_type in ('normal', 'team')
      and created_at >= '2022-05-01 23:00'
      and created_at < '2022-05-02 23:00'
      and notification_type = 'sms'
      and service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'
      group by 1,2,3,4;

      explain analyze select template_id, job_id, key_type, notification_status, count(*)
      from notifications_all_time_view
      where created_at >= '2022-05-01 23:00'
      and created_at < '2022-05-02 23:00'
      and notification_type = 'sms'
      and service_id = 'c5956607-20b1-48b4-8983-85d11404e61f'
      and key_type in ('normal', 'team')
      group by 1,2,3,4;

Between running queries I restarted my local database and also ran
a command to purge disk caches [^7].

I tested on a few services:

- c5956607-20b1-48b4-8983-85d11404e61f on 2022-05-02 (high volume)
- 0cc696c6-b792-409d-99e9-64232f461b0f on 2022-04-06 (highest volume)
- 01135db6-7819-4121-8b97-4aa2d741e372 on 2022-04-14 (very low volume)

All execution results are of the same magnitude using the view compared
to the worst case of either table on its own.

[^1]: 00a04ebf54/app/dao/notifications_dao.py (L389)
[^2]: https://stackoverflow.com/questions/49925/what-is-the-difference-between-union-and-union-all
[^3]: https://www.postgresql.org/docs/current/transaction-iso.html
[^4]: https://dba.stackexchange.com/questions/210485/can-sub-selects-change-in-one-single-query-in-a-read-committed-transaction
[^5]: 00a04ebf54/app/dao/fact_billing_dao.py (L471)
[^6]: 00a04ebf54/app/dao/fact_notification_status_dao.py (L58)
[^7]: https://stackoverflow.com/questions/28845524/echo-3-proc-sys-vm-drop-caches-on-mac-osx
2022-05-19 15:14:32 +01:00
Ben Thorner
648952eb38 Standardise tests for nightly billing sub-task 2022-05-19 13:40:00 +01:00
Ben Thorner
a9104a2ec3 Simplify test for status aggregation
We don't benefit from testing a third service and simplifying the
test will make it easier to incorporate anothere edge case later.
2022-05-19 13:39:46 +01:00
Ben Thorner
458e997706 Recalculate billing rows for 10 days (prev. 4)
This effectively reverts [^1], which was only a temporary change.
I suspect the performance problem will go away with [^2].

While we've clearly been managing without this change, it resulted
in several rows being left as incorrect when letter receipts were
delayed. It makes sense for us to run this task for the same period
as we do to aggregate statuses, as status affects billing.

[^1]: e5c76ffda7
[^2]: https://github.com/alphagov/notifications-api/pull/3542
2022-05-17 17:38:08 +01:00
Leo Hemsted
6181c60f75 remove usage of notify_db fixture in unit tests
* notify_db fixture creates the database connection and ensures the test
  db exists and has migrations applied etc. It will run once per session
  (test run).
* notify_db_session fixture runs after your test finishes and deletes
  all non static (eg type table) data.

In unit tests that hit the database (ie: most of them), 99% of the time
we will need to use notify_db_session to ensure everything is reset. The
only time we don't need to use it is when we're querying things such as
"ensure get X works when database is empty". This is such a low
percentage of tests that it's easier for us to just use
notify_db_session every time, and ensure that all our tests run much
more consistently, at the cost of a small bit of performance when
running tests.

We used to use notify_db to access the session object for manually
adding, committing, etc. To dissuade usage of that fixture I've moved
that to the `notify_db_session`. I've then removed all uses of notify_db
that I could find in the codebase.

As a note, if you're writing a test that uses a `sample_x` fixture, all
of those fixtures rely on notify_db_session so you'll get the teardown
functionality for free. If you're just calling eg `create_x` db.py
functions, then you'll need to make you add notify_db_session fixture to
your test, even if you aren't manually accessing the session.
2022-05-04 11:36:54 +01:00
Ben Thorner
c27107fa74 Remove support for Reach provider
This provider was never active and support was never completed, so
there's little value in keeping all this potentially confusing code.
2022-04-29 12:28:08 +01:00
Ben Thorner
779b8e941f Rewrite broadcast Zendesk alert at approval time
The new alert happens earlier but is otherwise the same:

- We only create a ticket in Production.
- We only create a ticket on approval.

I took this opportunity to refactor the alert as a private function
and test this specifically in detail to avoid lots of repetitive
mocks, which are required when calling the main "update" function.

One test I haven't preserved was for when the "names" array is empty,
as this was added for a legacy data integrity scenario [^1].

[^1]: bf0bf4e31c
2022-04-05 12:57:08 +01:00
Ben Thorner
b439fd0718 Add boilerplate for Reach SMS callbacks
This is enough to update a notification in DB:

1. First create a notification in the UI and sent it.

2. Then reset its attributes to pretend it's for Reach.

    update notifications set
      sent_at = null,
      sent_by = null,
      notification_status='sending'
    where id='some-uuid';

3. Change "notification_id" to "<some-uuid>" in the code.

4. Call the boilerplate endpoint for Reach callbacks.

    curl -X POST localhost:6011/notifications/sms/reach

Interestingly there's no foreign key constraint on "sent_by" in the
DB, so this just works: the notification is updated.
2022-03-24 16:56:33 +00:00
Ben Thorner
a962721915 Move SMS client response tests to match their app/
This makes them easier to find and removes ambiguity with other
files related to "processing" in the app.
2022-03-24 16:47:30 +00:00
Leo Hemsted
00259893f1 automatically retry letters stuck in pending-virus-scan
Since sept 2019 we've had to log on to production around once every
twenty days to restart the virus scan task for a letter. Most of the
time this is just a case of making sure the file is in the scan bucket,
and then triggering the task. If the file isn't in the scan bucket we'd
need to do some more manual investigation to find out exactly where the
file got stuck, but I can only remember times when it's been in the scan
bucket.

So if the file is in the scan bucket, we can just check that with code
and kick the task off automatically.
2022-03-07 18:31:46 +00:00
Katie Smith
bff4e3a709 Fix test which was flakey due to order of items returned from db
The test was querying `FactNotificationStatus` and ordering the results
by bst_date and notification_type then checking the rows. However, the
bst_date and notification_type for each row is the same, so this test
could fail based on the order that the results came back in. By ordering
on the notification_status instead, we can be sure of the order of the
results.
2022-02-28 13:03:55 +00:00
Ben Thorner
574b1d3c63 Fix not deleting dead rows in status table
To address: https://github.com/alphagov/notifications-api/pull/3454#pullrequestreview-880302729

Since we now delete all rows before inserting fresh ones, we no
longer need to worry about conflicts. I've also extended the old
test to check all three kinds of overwrite: new, changed, gone.
2022-02-16 13:40:08 +00:00
Ben Thorner
1213463b8e Only aggregate status when necessary for a service
This takes a similar approach to the nightly deletion task so that
we only create sub-tasks when there are actually notifications to
aggregate for a given type and day [1].

We're making this change to stop the duplication errors we're getting
at the moment and ensure the task can scale to more messages and more
services. There are two parts to this:

- Each subtask should now run within the 5 minute visibility timeout.
However, they may still be duplicated if the parent task overruns [2].

- The parent task creates a mininal number of subtasks, and the query
to determine this is very fast for a normal process day (milliseconds).

Since all tasks will run quickly, there should be no more duplication.

In order to test this more nuanced task, I rewrote the tests:

- One test checks the subtask is called correctly.
- One test checks we create all the right subtasks.

[1]: https://github.com/alphagov/notifications-api/pull/3381
[2]: https://docs.google.com/document/d/1MaP6Nyy3nJKkuh_4lP1wuDm19X8LZITOLRd9n3Ax-xg/edit#heading=h.q3intzwqhfzl
2022-02-09 17:39:07 +00:00
Ben Thorner
018a253b6f Revert "Revert running status aggregation in parallel"
This reverts commit 0f6dea0deb.
2022-02-09 17:39:00 +00:00
Chris Hill-Scott
0614a70764 Merge pull request #3445 from alphagov/bump-utils-53.0.0
Bump utils to 53.0.0
2022-02-08 09:56:18 +00:00
Chris Hill-Scott
7f72d3a60f Bump utils to 53.0.0
Changes:

53.0.0
---

* `notifications_utils.columns.Columns` has moved to
  `notifications_utils.insensitive_dict.InsensitiveDict`
* `notifications_utils.columns.Rows` has moved to
  `notifications_utils.recipients.Rows`
* `notifications_utils.columns.Cell` has moved to
  `notifications_utils.recipients.Cell`

52.0.0
---

* Deprecate the following unused `redis_client` functions:
  - `redis_client.increment_hash_value`
  - `redis_client.decrement_hash_value`
  - `redis_client.get_all_from_hash`
  - `redis_client.set_hash_and_expire`
  - `redis_client.expire`

51.3.1
---

* Bump govuk-bank-holidays to cache holidays for next year.
2022-02-08 09:45:10 +00:00
David McDonald
1d8fafcdf4 Remove unused functions
Can't see these being used anywhere so lets get rid of them
2022-02-07 15:58:04 +00:00
Rebecca Law
09c8fbe982 Merge pull request #3418 from alphagov/letters-too-long
Mark letters as validation-failed if the templated letter is too long.
2022-02-02 08:30:50 +00:00
Rebecca Law
c01c81326c Update log message to something a little easier to read and query for. 2022-01-24 12:25:53 +00:00
Ben Thorner
0f6dea0deb Revert running status aggregation in parallel
The top-level task didn't run successfully after this was deployed
due to the worker being killed due to heavy disk usage. While the
more parallel version does log much more, it doesn't totally explain
the disk behaviour. Nonetheless, reverting it is sensible to give us
the time we need to investigate more.
2022-01-20 12:22:33 +00:00
Rebecca Law
6cd7a23d3c If there is an invalid letter that has not been updated to validation-failed because the update-validation-failed-for-templated-letter has not been picked up off the letter-tasks queue and the collate-letter-pdfs-to-be-sent has started.
1. The number of letters that we send to DVLA will be not be correct (see 20ead82463/app/celery/letters_pdf_tasks.py (L136))
This may raise an alert with DVLA when they find we have sent them fewer letter than we have reported.
2. When we get the PDF from S3 we will get a file not found 20ead82463/app/celery/letters_pdf_tasks.py (L244)
The error will not prevent the collate task from completing but we will see an alert email for the exception and raise questions.

Although this situation is very unlikely because we have a 15 minute window between the last letter deadline date and the time we kick off the collate task we should still mitigate these issues. I updated the queries to only return letters with billable_units > 0, all valid letters should have at least 1 billable unit.
2022-01-19 08:31:19 +00:00
Rebecca Law
841a4fc22f Mark letters as validation-failed if the templated letter is too long.
It is possible that the personalisation for a templated letter can make the letter exceed 10 pages or 5 sheets. We are not validating the letters posted via the API for this validation error. It is only possible to validate the letter once we create the PDF in notifications-template-preview. This means that the letter can only get a validation-failed status after the client has received a 201 from the POST to /v2/notifications.
NOTE: we only validate the preview row of a CSV for this validation error, this change will mean that it is possible for a letter to be marked as validation-failed after a successful file upload.

A new task to update the notification to `validation-failed` has been added to the API. If we find that the letter is too long once we have created the PDF we call the `update-validation-failed-for-templated-letter` task rather than `update-billable-units-for-letter` task.

New work flow for a letter in brief:
API - receives POST /v2/notifications
:: save to db
:: put CREATE_LETTERS_PDF task on queue for template preview to consume
TEMPLATE-PREVIEW - consumes task CREATE_LETTERS_PDF
:: create PDF
:: count pages of PDF
:: IF page count exceeds 10 pages
	 put in the letters-invalid-pdf S3 bucket with metadata (similar to the precompiled letters)
	 put `update-validation-failed-for-templated-letter` task on the queue for the API to consume
   ELSE
     put PDF in the `letters-pdf` bucket
     put `update-billable-units-for-letter` task on the queue
API - consumes `update-billable-units-for-letter` OR `update-validation-failed-for-templated-letter` task
:: IF `update-billable-units-for-letter` task:
   	update billable units for notification as usual
:: ELSE `update-validation-failed-for-templated-letter`:
   	update notification_status = `validation-failed`
ADMIN - view notification page for letter
:: show validation letter for templated letter

There will be 3 PRs in order to make this change, one for the API, template-preview and the admin app.

Deployment plan

Deploy Admin first
Deploy API
Deploy template-preview

Related PRs:
alphagov/notifications-template-preview#619
alphagov/notifications-admin#4107

https://www.pivotaltracker.com/story/show/169209742
2022-01-19 08:29:48 +00:00
Ben Thorner
9182ebf4e5 Parallelise status aggregation by service and day
This follows a similar approach as [1]. Recently we've seen lots
of errors from this task, which we think are a consequence of it
doing too much work and tripping Celery's visibility timeout.

While we can optimise the query [2], it's likely the errors will
return as the number of live services grows. Parallelising the
aggregation now will make it more futureproof.

[1]: https://github.com/alphagov/notifications-api/pull/3397
[2]: https://github.com/alphagov/notifications-api/pull/3417
2022-01-12 15:47:59 +00:00
Ben Thorner
9fc8b904c6 DRY up status aggregation tests (move DAO tests up)
The previous DAO tests were also confusing because they were testing
two functions at the same time, so moving the tests up to the task
level seems very reasonable, and will make it easier to change how
this code works in the next commits.
2022-01-11 16:11:36 +00:00
Leo Hemsted
49cc1b643f split delete task up into per service
we really don't gain anything by running each service delete in sequence
- we get the services, and then just loop through them deleting per
service. By deleting per service in separate tasks, we can take
advantage of parallelism. the only thing we lose is some log lines but I
don't think we're that interested in them.

only set query limit at the move_notifications dao function - the task
doesn't really care about the technical implementation of how it deletes
the notifications
2021-12-14 15:24:34 +00:00
Ben Thorner
87cd40d00a Scale timeout task to work on arbitrary volumes
Previously this was limited to 500K notifications. While we don't
expect to reach this limit, it's not impossible e.g. if we had a
repeat of the incident where one of our providers stopped sending
us status updates. Although that's not great, it's worse if our
code can't cope with the unexpectedly high volume.

This reuses the technique we have elsewhere [1] to keep processing
in batches until there's nothing left. Specifying a cutoff point
means the total amount of work to do can't keep growing.

[1]: 2fb432adaf/app/dao/notifications_dao.py (L441)
2021-12-13 17:14:28 +00:00
Ben Thorner
76aeab24ce Rewrite DAO timeout method to take cutoff_time
Previously we specified the period and calculated the cutoff time
in the function. Passing it in means we can run the method multiple
times and avoid getting "new" notifications to time out in the time
it takes to process each batch.
2021-12-13 16:56:21 +00:00
Ben Thorner
ab4cb029df Remove alert for email / sms in created
In response to [1].

[1]: https://github.com/alphagov/notifications-api/pull/3383#discussion_r759379988

It turns out the code that inspired this new alert - in the old
"timeout-sending-notifications" task - was actually redundant as
we already have a task to "replay" notifications still in "created",
which is much better than just alerting about them.

It's possible the replayed notifications will also fail, but in
both cases we should see some kind of error due to this, so I don't
think we're losing anything by not having an alert.
2021-12-06 14:11:42 +00:00
Ben Thorner
9bd2a9b427 Extract tests for conditionally creating callback
This will help ensure the function doesn't change arbitrarily, now
that it's used in multiple other places.
2021-12-06 14:11:41 +00:00
Ben Thorner
04da017558 DRY-up conditionally creating callback tasks
This removes 3 duplicate instances of the same code, which is still
tested implicitly via test_process_ses_receipt_tasks [1]. In the
next commit we'll make this test more explicit, to reflect that it's
now being reused elsewhere and shouldn't change arbitrarily.

We do lose the "print" statement from the command instance of the
code, but I think that's a very tolerable loss.

[1]: 16ec8ccb8a/tests/app/celery/test_process_ses_receipts_tasks.py (L94)
2021-12-06 14:11:34 +00:00
Ben Thorner
aea555fce2 Make test for timeout with callbacks consistent
This now matches the behaviour of the test above it: mocking out
the DAO function in order to focus on the specific behaviour of
the function under test.
2021-12-06 14:00:42 +00:00
Ben Thorner
5bf3fe6c0f Clarify no callbacks are sent in timeout test
This now complements the test below it, which we will refactor to
be consistent in the next commit.
2021-12-06 14:00:41 +00:00
Ben Thorner
8b7e81958d Delete duplicate 'timeout' tests for notifications
These scenarios are already covered by the DAO tests. It's enough
to just check the DAO function is called as expected.

While sometimes it can be better to have more end-to-end tests, the
convention across much of this app is to do unit tests.
2021-12-06 14:00:40 +00:00
Ben Thorner
c3e11d676f Remove unnecessary test_request_context manager
This doesn't affect how the tests run and just adds complexity.
2021-12-06 14:00:39 +00:00
Ben Thorner
05bd26d444 Fix names for a few tests in test_nightly_tasks.py
I find it really difficult to visually parse test files unless we
have a consistent convention for how we name our test functions.
In most of our tests the name of the test function starts with the
name of the function under test.
2021-12-06 14:00:38 +00:00
Ben Thorner
0318229216 Stop 'timing out' old 'created' notifications
This is being replaced with a new alert and runbook [1]. It's not
always appropriate to change the status to 'technical-failure', and
the new alert means we'll act to fix the underlying issue promptly.

We'll look at tidying up the remaining code in the next commits.

[1]: https://github.com/alphagov/notifications-manuals/wiki/Support-Runbook#deal-with-email-or-sms-still-in-created
2021-12-06 14:00:36 +00:00
Ben Thorner
f96ba5361a Add new task to alert about created email / sms
This will log an error when email or SMS notifications have been
stuck in 'created' for too long - normally they should be 'sending'
in seconds, noting that we have a goal of < 10s wait time for most
notifications being processed our platform.

In the next commits we'll decouple similar functionality from the
existing 'timeout-sending-notifications' task.
2021-12-06 14:00:31 +00:00
Ben Thorner
528223ed61 Use central NotifyCelery base class in utils
Note that the new base class doesn't include a bespoke feature we
had here: 'log_on_worker_shutdown'. We've agreed it's reasonable
to remove it for now as it was introduced many years ago and its
use case is unclear - we can always add it back if needed.
2021-11-16 13:58:12 +00:00
Chris Hill-Scott
0236318189 Republish gov.uk/alerts every night to clear down planned tests
We have made it so that gov.uk/alerts shows a ‘1 planned test’ banner
for the whole of the day when there has been an operator test on that
day.

We need to remove the banner when the day is over.

The most straightforward way to do this is to republish the site at the
start of every day. The gov.uk/alerts code[1] will work out if there are
or aren’t any planned tests to show that day.

1. 5a274af6d0/app/models/alerts.py (L38-L44)
2021-11-15 14:23:32 +00:00
Ben Thorner
d66c68d6d6 Merge pull request #3364 from alphagov/celery-headers-request-id-180213914
Move Celery task Request ID injection into headers
2021-11-12 11:10:29 +00:00
Ben Thorner
89a8dd1a03 Move Celery task Request ID injection into headers
Previously we passed along this piece of state via the kwargs for
a task, but this runs the risk of the task accidentally receiving
the extra kwarg unless we've covered all the code paths that could
invoke it directly e.g. retries don't invoke __call__.

This switches to using Celery "headers" to pass the extra state. It
turns out that a Celery has two "header" concepts, which leads to
some confusion and even a bug with the framework [1]:

- In older (pre v4.4) versions of Celery, the "headers" specified
by apply_async() would become _the_ headers in the message that
gets passed around workers, etc. These would be available later on
via "self.request.headers".

- Since Celery protocol v2, the meaning of "headers" in the message
changed to become (basically) _all_ metadata about the task [2],
with the "headers" option in apply_async() being merged [3] into
the big dict of metadata.

This makes using headers a bit confusing unfortunately, since the
data structure we put in is subtly different to what comes out in
the request context. Nonetheless, it still works. I've added some
comments to try and clarify it.

Note that one of the original tests is no longer necessary, since we
don't need to worry about argument passing styles with headers.

[1]: https://github.com/celery/celery/issues/4875
[2]: 663e4d3a0b (diff-07a65448b2db3252a9711766beec23372715cd7597c3e309bf53859eabc0107fR343)
[3]: 681a922220/celery/app/amqp.py (L495)
2021-11-10 18:03:40 +00:00