Commit Graph

929 Commits

Author SHA1 Message Date
Ben Thorner
cdc150de1b Change link test task to trigger both lambda
This modifies the previous "(_)send_link_test" method to trigger a
link test for a specific lambda. We then call the method with both
the primary and failover lambda in new orchestrator method.

Since the _invoke_lambda function doesn't raise exceptions if it
fails, there's no need to rescue anything in order to ensure the
second link test / invocation runs as well. It doesn't testing for
this, since it boils to an absence of code to raise any exception.

Note that, like the other parent tests, we only check the new method
works with a specific proxy client instance.
2021-07-19 16:00:56 +01:00
Ben Thorner
08f48379b4 Move ID generation into link test method
Unlike the other IDs which are stored in the DB, this isn't relevant
for the Celery task as it invokes a link test. Moving it into the
proxy client will also enable us to generate a second ID in the next
commits, where we start doing a link test for the failover lambda.
2021-07-19 16:00:55 +01:00
Ben Thorner
7fb65761f9 Always log when a lambda is invoked
This replaces the previous single-purpose log about a link test
with a more informative and generic one for all invocations.
2021-07-19 15:45:43 +01:00
Ben Thorner
b6774bf0f7 Generate Vodafone link test sequence nos in proxy
Previously the Celery task to trigger a link test had to know about
the special case of a sequence number for Vodafone. Since we're about
to change the client to perform multiple tests it makes sense to give
it the knowledge of how to generate number itself.

Note that we have to import the db inline to avoid a circular import,
since this module is itself imported by app/__init__.py.

Other invocations of the Vodafone client use stored sequence numbers
from the DB, which are called "message numbers" in that context. Since
the two use cases are very different (even the names are different!),
having them in two places shouldn't cause any confusion.
2021-07-19 15:43:36 +01:00
Leo Hemsted
2ad9a3a380 retry service callbacks on 429
if we're served a 429, put the item on the retry queue and retry the
same as if the service returned a 5xx. 429 is commonly returned for rate
limit exceeding, and retrying on a delay is a typical response to that.
2021-07-13 16:09:17 +01:00
Rebecca Law
18dd9050a4 - make sure when processing a job that we check the total_sent + job.notification_count against the service.message_limit. 2021-06-28 13:07:48 +01:00
Rebecca Law
fd7486d751 - Merge daily limit functions into one, refactor call for daily limit check from process_job
- refactor tests to standardise test names
- refactor some tests to be more clear
- remove unnecessary tests
- include missing test
2021-06-24 11:05:22 +01:00
Rebecca Law
35b20ba363 Correct the daily limits cache.
Last year we had an issue with the daily limit cache and the query that was populating it. As a result we have not been checking the daily limit properly. This PR should correct all that.

The daily limit cache is not being incremented in app.notifications.process_notifications.persist_notification, this method is and should always be the only method used to create a notification.
We increment the daily limit cache is redis is enabled (and it is always enabled for production) and the key type for the notification is team or normal.

We check if the daily limit is exceed in many places:
 - app.celery.tasks.process_job
 -  app.v2.notifications.post_notifications.post_notification
 - app.v2.notifications.post_notifications.post_precompiled_letter_notification
 - app.service.send_notification.send_one_off_notification
 - app.service.send_notification.send_pdf_letter_notification

If the daily limits cache is not found, set the cache to 0 with an expiry of 24 hours. The daily limit cache key is service_id-yyy-mm-dd-count, so each day a new cache is created.

The best thing about this PR is that the app.service_dao.fetch_todays_total_message_count query has been removed. This query was not performant and had been wrong for ages.
2021-06-22 16:15:36 +01:00
Rebecca Law
8af10eb1f0 Update the job_status to in-progress sooner.
We had a situation where the delivery-worker app instance was terminated before the job was marked as `in-progress`, presumably because the query to check the daily limits was taking too long to complete.
If the job was in progress the `check_job_status` task would have restarted the job.
Updating the status to in-progress sooner will help.
2021-06-15 07:58:17 +01:00
Rebecca Law
1bf5ce08b2 Add a error log for alert tasks.
Many of the team members do not look at emails from zendesk, adding a current_app.logger.error message for things we care about to give developers a better chance of seeing them.
I have purposely not added an erro log for `check_for_services_with_high_failure_rates_or_sending_to_tv_numbers` because it's not something we need to look at immediately.
2021-05-26 11:06:21 +01:00
Katie Smith
8365c749e4 Change letter zip file names for Insolvency Service letters
DVLA would like to be able to identify letters sent by the Insolvency
Service, so we are changing the zipfile name. They need all zipfile
names to have the same structure, so we can't just add a marker to files
sent by that service - we have to change all filenames.

The new format is like this:
`{NOTIFY}.{DATE}.{SEQUENCE_ID}.{UNIQUE_ID}.{SERVICE_ID}.{ORG_NAME}.{EXTENSION}`
2021-05-06 09:18:44 +01:00
Ben Thorner
23f4ae32df Merge pull request #3214 from alphagov/check-broadcast-suspended
Enforce service suspension for broadcasts
2021-04-28 15:01:11 +01:00
Ben Thorner
99bc29418e Move request_id injection into send_task override
This applies the same change we made in other apps [1][2]. Adding
the override here is special, though, because it means the others
will now get triggered, since this app is the start of the chain
of tasks for a request. We will also retain existing request_id
tracing for tasks within this app, since "apply_async" calls the
"send_task" method internally, which is the one we're overriding.

[1]: 6f3c118a1e
[2]: 2e08b7aa95
2021-04-27 10:35:21 +01:00
Katie Smith
e6357c91c9 Add more details to messages in send_broadcast_provider_message task
This ensures that the log messages both contain broadcast_event id and
broadcast_provider_message id. It also removes the broadcast_event
reference since this isn't particularly useful in helping to find an
event.
2021-04-20 15:34:49 +01:00
Katie Smith
c9c4bd8b44 Clarify log line when sending a link test
It wasn't clear what the ID in the message was. It's not possible to add
more details to the message - we don't create a broadcast message or
event for a link test.
2021-04-20 14:54:53 +01:00
Ben Thorner
a2af8b052a Split up authorisation vs. sequencing checks
While both of these are integrity errors (since we should never
reach this point in the code + data), this just means the original
method comment is still relevant to what immediately follows it.
2021-04-19 17:13:15 +01:00
Ben Thorner
ee52e3e2c9 Mirror integrity checks from the API
It makes sense to have these checks [1] here, since in future we may
add other ways of creating a broadcast event and omit them.

[1]: 3d71815956/app/broadcast_message/rest.py (L198)
2021-04-19 17:13:13 +01:00
Ben Thorner
0070473f31 Check for suspension before sending a broadcast
This mirrors the check we do for jobs, which are also a high-impact
task [1]. While this shouldn't be possible, just like other checks
we're adding it here to be doubly certain.

[1]: 3d71815956/app/celery/tasks.py (L74)
2021-04-19 17:13:12 +01:00
Ben Thorner
b2398fcaf4 Rename CBCProxyFatalException
We only actually use this when the data we're working with is in an
unexpected state, which is unrelated to the CBC Proxy. Using this
name also means we can re-use this exception in the next commits.

Note that we may still care if a broadcast message has expired, since
it's not expected that someone would send one in this condition.
2021-04-19 17:13:05 +01:00
Rebecca Law
34a378a60e Update the Zendesk ticket content for
`check_if_letters_still_in_created`

The message to Zendesk includes a list of notification ids, this isn't
really necessary and is included in the run book. Creation of the
Zendesk ticket can fail if the message is too long, removing the list of
ids can prevent that from happening.
2021-04-19 10:47:25 +01:00
Ben Thorner
be02573147 Fix apply_async not working with positional kwargs
Celery's apply_async function accepts 'kwargs' as (get ready to be
confused) either a positional argument, or a keyword argument:

Positional: apply_async(['args'], {'kw': 'args'})

Keyword: apply_async(args=['args'], kwargs={'kw': 'args'})

We rely on the positional form in at least one place [1]. This fixes
the overload of apply_async to cope with both forms, and continue to
pass through any other (confusion time again) keyword args to super(),
such as queue="queue".

Note that we've also decided to stop accepting other positional args,
since this is unnecessarily confusing, and we don't currently rely on
it in our code. This stops it creeping in in future.

[1]: fde927e00e/app/job/rest.py (L186)
2021-04-15 17:21:21 +01:00
Ben Thorner
fde927e00e Merge pull request #3205 from alphagov/celery-consistency-tweaks
Small refactor for new Celery / StatsD code
2021-04-15 11:40:42 +01:00
Ben Thorner
f85dad5acf Merge pull request #3203 from alphagov/remove-statsd-decorators
Remove redundant @statsd timing decorators
2021-04-14 10:04:04 +01:00
Ben Thorner
5eb265138b Remove unnecessary statsd_client parameter
It turns out this is available from the app object [1], and we were
already assuming this in the tests.

[1]: 48c6c822e8/notifications_utils/clients/statsd/statsd_client.py (L52)
2021-04-13 15:12:55 +01:00
Ben Thorner
ec6d87cd0f Simplify argument passing in apply_async
This avoids the need to keep in-sync with any future changes to the
signature, and reduces the amount of irrelevant code to read.
2021-04-13 15:12:45 +01:00
David McDonald
2e6d761691 Merge pull request #3204 from alphagov/broadcast-envars
Broadcast envars
2021-04-12 17:25:15 +01:00
David McDonald
295162c81d Move CBC proxy enable check
This change will make our development environments closer to production
even if they aren't hooked up to the CBC proxy lambda functions.

Now in development, we will create the broadcast event and create tasks
for each broadcast provider event. We will still not create actual
broadcast provider message rows in the DB and talk to the CBC proxies.

This should be helpful in development to catch any issues we introduce
to do with sending broadcast messaging. In time we may wish to have some
fake CBC proxies in the AWS tools account that we can interact with to
make it even more realistic.
2021-04-12 17:05:41 +01:00
Ben Thorner
e3e067c795 Remove redundant @statsd timing decorators
These are superseded by timing task execution generically in the
NotifyTask superclass [1]. Note that we need to wait until we've
gathered enough data under the new metrics before removing these.

[1]: https://github.com/alphagov/notifications-api/pull/3201#pullrequestreview-633549376
2021-04-12 15:19:18 +01:00
Ben Thorner
3e507eea55 Merge pull request #3201 from alphagov/revamp-celery-stats
Migrate towards new metrics for Celery tasks
2021-04-12 15:04:37 +01:00
Ben Thorner
ab8dd6d52c Duplicate metrics to StatsD for Celery tasks
Previously we used a '@statsd' decorator to time and count Celery
tasks [1]. Using a decorator isn't ideal since we need to remember
to add it to every task we define. In addition, it's not possible
to use data like the task name and queue.

In order to avoid breaking existing stats, this duplicates them as
new StatsD metrics until we have sufficient data to update dashboards
using the old ones. Using the CeleryTask superclass to send metrics
avoids a future maintenance overhead, and means we can include more
useful data in the StatsD metric. Note that the new metrics will sit
in StatsD until we add a mapping for them [2].

StatsD automatically produces a 'count' stat for timing metrics, so
we don't need to increment a separate counter for successful tasks.

[1]: dea5828d0e/app/celery/tasks.py (L65)
[2]: https://github.com/alphagov/notifications-aws/blob/master/paas/statsd/statsd-mapping.yml
2021-04-08 18:02:53 +01:00
Ben Thorner
248f5a0708 Include queue name in Celery task logs
This is mainly so we can use it in the new metrics we send to StatsD
in the following commits, but it should also be useful in the logs.
I've taken the opportunity to make the log format consistent between
success / failure, and with our Template Preview app [1].

[1]: f456433a5a/app/celery/celery.py (L19)
2021-04-08 18:02:51 +01:00
Ben Thorner
19be4faf45 Switch to monotonic time for task logs
This matches the approach we take in utils [1]. Monotonic time is
better because it avoids weird negative results due to clock shift.

[1]: 5d18ebd796/notifications_utils/statsd_decorators.py (L14)
2021-04-08 13:00:24 +01:00
Ben Thorner
054205835b Remove unused metric for SQS apply duration
This was added as part of a wider performance investigation [1]. I
checked with Leo, who made the change, and while the other metrics
are still be useful, there's no reason to keep this one.

[1]: 6e32ca5996 (diff-76936416943346b5f691dac57a64acebc6a1227293820d1d9af4791087c9fb9eR23)
2021-04-08 13:00:21 +01:00
Leo Hemsted
4a5b1c23bd only send zendesk P1 for alerts
we don't need to be re-notified when someone clicks cancel
2021-04-08 12:22:18 +01:00
Leo Hemsted
9bd8c0239c look for 'live', not 'production'
config['NOTIFY_ENVIRONMENT'] is hardcoded to `'live'` in the Live config
class. The values as seen on the environment which we send real messages
from:

```
>>> json.loads(os.environ['VCAP_APPLICATION'])['space_name']  # what cloudfoundry sets
'production'
>>> os.environ['NOTIFY_ENVIRONMENT']  # we set this from cloudfoundry
'production'
>>> current_app.config['NOTIFY_ENVIRONMENT']  # hardcoded in the Live config
'live'
>>> current_app.config['NOTIFICATION_QUEUE_PREFIX']  # pulled from env var of same name
'live'
>>> current_app.config['ENV']  # this is an unrelated flask variable
'production'
```
2021-04-08 12:17:22 +01:00
Leo Hemsted
df393e36c5 send a p1 when a broadcast goes out on production
it's important to keep tabs on when these things leave our system.
Sending a zendesk ticket that triggers a P1 is probably our simplest way
of notifying the team when this happens (it's what we do with out of
hours emergencies on the admin app too). We don't have any direct
pagerduty integrations from the api app, but we already have the zendesk
client hooked up.

After broadcasts go live, we may want to change this to a P2 (but even
then, there's arguments for keeping it P1 to start with I think).

Don't cause a P1 if it goes out on staging as that might be MNOs testing.
2021-04-06 11:32:19 +01:00
Katie Smith
3bfd084a77 Simplify send_delivery_status_to_service code
Now that https://github.com/alphagov/notifications-api/pull/3184 has
been deployed for a while, the `send_delivery_status_to_service` task will
always have `template_id` and `template_version` being passed in. This
means we don't need to check if those fields are there.
2021-03-30 08:50:42 +01:00
David McDonald
6d410daae4 Remove the emergency alerts canary
See https://github.com/alphagov/notifications-broadcasts-infra/pull/197
for why we no longer need this and we get to delete some code!
2021-03-26 18:31:53 +00:00
Pea Tyczynska
52c529ab3a Use personalisation to set client_reference for letters
which were sent through Notify interface only. This is done
to avoid performance dip from additional operation for
other notification types.
2021-03-24 14:55:10 +00:00
Ben Thorner
b2b14f39a3 Merge pull request #3183 from alphagov/remove-crown-letter-filename
Remove non/crown indicator in letter filenames
2021-03-24 13:06:58 +00:00
Katie Smith
27b3cece7d Send template id and version with delivery status callback
This adds the `template_id` and `template_version` fields to the data
sent to services from the `send_delivery_status_to_service` task.

We need to account for the task not being passed these fields at first
since there might be tasks retrying which don't have that data. Once all
tasks have been called with the new fields we can then update the code
to assume they are always there.

Since we only send delivery status callbacks for SMS and emails, I've
removed the tests where we call that task with letters.
2021-03-24 10:55:45 +00:00
Ben Thorner
8219b3c032 Remove non/crown indicator in letter filenames
This is not required by DVLA and since [1] we no longer care about
the end of letter filenames when collating them, so removing it is
safe to do. Note that the name of the ZIP files of collated letters
is based on a hash of the filenames, which needed updating in tests.

Before merging this we need to do a test run in Staging, so DVLA can
check that a mixture of the old / new filenames won't cause issues.

[1]: https://github.com/alphagov/notifications-api/pull/3172
2021-03-18 13:05:12 +00:00
Katie Smith
3b78f863d5 Check for incomplete pending jobs
We have a scheduled task that was checking for jobs still in progress.
We saw a case where a scheduled job was stuck in a `pending` status as a
result of an app shutting down. This changes the `check_job_status` task
so that it also checks for scheduled jobs which are still pending after
30 minutes.
2021-03-18 08:24:36 +00:00
Ben Thorner
c76e789f1e Reduce extra S3 ops when working with letter PDFs
Previously we did some unnecessary work:

- Collate task. This had one S3 request to get a summary of the object,
which was then used in another request to get the full object. We only
need the size of the object, which is included in the summary [1].

- Archive task. This had one S3 request to get a summary of the object,
which was then used to make another request to delete it. We still need
both requests, but we can remove the S3.Object in the middle.

[1]: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#objectsummary
2021-03-16 12:53:13 +00:00
Leo Hemsted
6784ae62a6 Raise Exception if letter PDF not in S3
Previously, the function would just return a presumed filename. Now that
it actually checks s3, if the file doesn't exist it'll raise an
exception. By default that's a StopIteration at the end of the bucket
iterator, which isn't ideal as this will get supressed if the function
is called within a generator loop further up or anything.

There are a couple of places where we expect the file may not exist, so
we define a custom exception to rescue specifically here. I did consider
subclassing boto's ClientError, but this wasn't straightforward as the
constructor expects to know the operation that failed, which for me is a
signal that it's not an appropriate (re-)use of the class.
2021-03-15 17:18:11 +00:00
Ben Thorner
b43a367d5f Relax lookup of letter PDFs in S3 buckets
Previously we generated the filename we expected a letter PDF to be
stored at in S3, and used that to retrieve it. However, the generated
filename can change over the course of a notification's lifetime e.g.
if the service changes from crown ('.C.') to non-crown ('.N.').

The prefix of the filename is stable: it's based on properties of the
notification - reference and creation - that don't change. This commit
changes the way we interact with letter PDFs in S3:

- Uploading uses the original method to generate the full file name.
The method is renamed to 'generate_' to distinguish it from the new one.

- Downloading uses a new 'find_' method to get the filename using just
its prefix, which makes it agnostic to changes in the filename suffix.

Making this change helps to decouple our code from the requirements DVLA
have on the filenames. While it means more traffic to S3, we rely on S3
in any case to download the files. From experience, we know S3 is highly
reliable and performant, so don't anticipate any issues.

In the tests we favour using moto to mock S3, so that the behaviour is
realistic. There are a couple of places where we just mock the method,
since what it returns isn't important for the test.

Note that, since the new method requires a notification object, we need
to change a query in one place, the columns of which were only selected
to appease the original method to generate a filename.
2021-03-15 13:55:44 +00:00
David McDonald
41d95378ea Remove everything for the performance platform
We no longer will send them any stats so therefore don't need the code
- the code to work out the nightly stats
- the performance platform client
- any configuration for the client
- any nightly tasks that kick off the sending off the stats

We will require a change in cronitor as we no longer will have this task
run meaning we need to delete the cronitor check.
2021-03-15 12:04:53 +00:00
David McDonald
8325431462 Move saving of processing time into separate task
We current do this as part of send-daily-performance-platform-stats but
now this moves it into its own separate task. This is for two reasons
- we will shortly get rid of the send-daily-performance-platform-stats
  task as we no longer will need to send anything to performance
  platform
- even if we did decide to keep the task
  send-daily-performance-platform-stats and remove the specific bits
  that relate to the performance platform, it's probably nicer to
  rewrite the new task from scratch to make sure it's all clear and easy
  to understand
2021-03-15 11:44:01 +00:00
Ben Thorner
0379d721e5 Add missing statsd timers to celery tasks
All other tasks in app/celery/*_tasks.py have timers on them. Some
of these timers will be useful to check before/after performance as
a way to reassure ourselves about the impact of [1].

[1]: https://github.com/alphagov/notifications-api/pull/3172
2021-03-12 12:32:22 +00:00
Ben Thorner
a91fde2fda Run auto-correct on app/ and tests/ 2021-03-12 11:45:45 +00:00