Commit Graph

184 Commits

Author SHA1 Message Date
Pea Tyczynska
c28e9451d4 Bump moto version to try solve dependencies version conflict
Also update mock import statements in some test files as they
stopped working with this dependency update.
2021-07-08 15:37:19 +01:00
Rebecca Law
1bf5ce08b2 Add a error log for alert tasks.
Many of the team members do not look at emails from zendesk, adding a current_app.logger.error message for things we care about to give developers a better chance of seeing them.
I have purposely not added an erro log for `check_for_services_with_high_failure_rates_or_sending_to_tv_numbers` because it's not something we need to look at immediately.
2021-05-26 11:06:21 +01:00
Rebecca Law
34a378a60e Update the Zendesk ticket content for
`check_if_letters_still_in_created`

The message to Zendesk includes a list of notification ids, this isn't
really necessary and is included in the run book. Creation of the
Zendesk ticket can fail if the message is too long, removing the list of
ids can prevent that from happening.
2021-04-19 10:47:25 +01:00
David McDonald
6d410daae4 Remove the emergency alerts canary
See https://github.com/alphagov/notifications-broadcasts-infra/pull/197
for why we no longer need this and we get to delete some code!
2021-03-26 18:31:53 +00:00
Katie Smith
3b78f863d5 Check for incomplete pending jobs
We have a scheduled task that was checking for jobs still in progress.
We saw a case where a scheduled job was stuck in a `pending` status as a
result of an app shutting down. This changes the `check_job_status` task
so that it also checks for scheduled jobs which are still pending after
30 minutes.
2021-03-18 08:24:36 +00:00
Ben Thorner
a91fde2fda Run auto-correct on app/ and tests/ 2021-03-12 11:45:45 +00:00
David McDonald
5526c89c34 Rename task and function for clarity
This doesn't just relate to precompiled letters, it's actually just
checking that there are not any letters still waiting for a virus check
that should not be. This change to the naming makes it more accurate
and therefore easy to understand
2021-02-10 15:23:53 +00:00
David McDonald
1b9d8252ec Rename task and function for clarity
This doesn't just relate to templated letters, it's actually just
checking that there are not any letters still in created that should not
be. This change to the naming makes it more accurate and therefore easy
to understand
2021-02-10 15:23:52 +00:00
David McDonald
3c0e609cc9 Add link to runbook for created letter alert
We've got the entry in the runbook, this will make it clear to go and
look at it.
2021-02-10 15:23:51 +00:00
David McDonald
070b79c27e Downgrade exceptions to warnings to reduce emails
We already trigger a zendesk ticket for these two cases, meaning that
whenever we get this situation, we get 3 emails. One for the zendesk
ticket, one from logit raising the fact an exception was raised and one
from cloudwatch raising the fact an exception was raised.

We don't need all these emails, a zendesk ticket is sufficient.
Downgrading to a warning means this event will still be findable in our
logs however.
2021-02-02 15:10:26 +00:00
David McDonald
20627d96ea Put all broadcast tasks on the broadcast worker 2021-01-13 17:21:40 +00:00
Leo Hemsted
e2fa0116a0 add CBC_PROXY_ENABLED config flag to control if tasks are triggered
previously we made some incorrect assumptions about set-up on staging
and prod - they currently don't have any cbc_proxy aws creds at all.

We shoudn't be attempting canaries or link tests when there's no AWS
infrastructure to connect to.

We also shouldn't bother writing a row into the database at all for the
broadcast_provider_message since we're not even attempting to send, and
we shouldn't get confused between messages that failed and messages we
never wanted to send at all.
2020-11-26 10:16:22 +00:00
Leo Hemsted
087cc5053d separate cbc proxy into separate clients
this is a pretty big and convoluted refactor unfortunately.

Previously:

There was one global `cbc_proxy_client` object in apps. This class has
the information about how to invoke the bt-ee lambda, and handles all
calls to lambda. This includes calls to the canary too (which is a
separate lambda).

The future:

There's one global `cbc_proxy_client`. This knows about the different
provider functions and lambdas, and you'll need to ask this client for a
proxy for your chosen provider. call cbc_proxy_client.get_proxy('ee')`
and it'll return you a proxy that knows what ee's lambda function is,
how to transform any content in a way that is exclusive to ee, and in
future how to parse any response from ee.

The present:

I also cleaned up some duplicate tests.
I'm really not sure about the names of some of these variables - in
particular `cbc_proxy_client` isn't a client - it's more of a java style
factory, where you call a function on it to get the client of your
choice.
2020-11-19 15:50:37 +00:00
Leo Hemsted
7cc83e04eb move BroadcastProvider from models.py to config.py
It's not something that is tied to a database table, and was causing
circular import issues
2020-11-19 15:50:37 +00:00
Toby Lorne
dda71bf685 celery: add task for triggering link tests
We want to periodically kick off some link tests, so that:

- we are periodically communicating with the CBC Proxies
- each CBC Proxy is periodically communicating with its CBC

This simulation of traffic to the CBC will give us advance warning if
something is going to break, or is broken, before someone tries to send
a real live message

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Richard <richard.baker@digital.cabinet-office.gov.uk>
Co-authored-by: Pea <pea.tyczynska@digital.cabinet-office.gov.uk>
2020-10-27 15:39:28 +00:00
Toby Lorne
be90455944 Add task to send canary to cbc proxy
Create and schedule a Celery task that tests if we
can send a canary message to cbc proxy.

This will help us know if something happens to our
connection to cbc proxy.

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Pea <pea.tyczynska@digital.cabinet-office.gov.uk>
Co-authored-by: Richard <richard.baker@digital.cabinet-office.gov.uk>
2020-10-27 10:38:09 +00:00
Rebecca Law
e4e61b1e4b Merge pull request #2978 from alphagov/add-reference-to-msg
Add reference to log message
2020-09-28 12:04:59 +01:00
Chris Hill-Scott
aace1bdd8a Allow 20 minutes before checking for missing rows
Since we’ve doubled the number of rows in a job, jobs can take twice as
long to insert all the notifications. We don’t check for missing rows
until we’re pretty confident that the original tasks have finished
processing. This means we need to double the time we wait to still be
as sure.
2020-09-26 11:38:38 +01:00
Rebecca Law
d7e53cdf50 Adding reference to message for "precompiled letters have been pending-virus-check for over 90 minutes"
This saves having to go to the db to get it.
2020-09-24 14:16:16 +01:00
Rebecca Law
dd126df122 We have a scheduled task to check that all the jobs have completed, this will catch if an app is shut down and the job is complete yet, we only wait 10 seconds before forcing the app to shut down.
The task was raising a JobIncompleteError, yet it's not an error the task is performing it's task correctly and calling the appropriate task to restart the job.
Also used apply_sync to create the task instead of send_task.
2020-07-22 17:00:20 +01:00
Pea Tyczynska
5d6f2da155 Rename task from create_letters_pdf to get_pdf_for_templated_letter
In a separate PR we will have to delete vestigial create_letters_pdf
tasks that now only redirects to get_pdf_for_templated_letter.
2020-05-11 13:33:05 +01:00
Leo Hemsted
d59b2f4cc9 make create_letters_pdf always use right queue
create-letters-pdf-tasks is for contacting template preview.
letters-tasks is for doing other things around letters (including putting things on template preview queue)
2020-04-22 16:07:51 +01:00
Katie Smith
bfd40a843c Delete send-scheduled-notifications task
This was added a few years ago, but never used.
2020-04-02 14:45:18 +01:00
David McDonald
87cb02fa1d Add runbook link to resolve letters pending virus scan 2020-03-30 12:07:05 +01:00
Rebecca Law
51f43563d3 Fix serialisation error when creating the create_letters_pdf in resend_created_notifications_older_than 2020-03-17 08:48:02 +00:00
David McDonald
3a0aece6a1 Up threshold for sms to telephone numbers
We were just ignoring the errors and our users were not fixing things.

Given that 500 texts cost approx £8 it's not the end of the world.

In the long run we may decide to just stop letting people try and send
messages to TV numbers but this is a quick fix to stop emails coming in
which we ignore.
2020-01-17 13:26:20 +00:00
David McDonald
61f1469a20 Improve developer experience for zendesk ticket
We don't need to log this as an exception. It's not an exception, it's
behaviour that is not ideal but is still expected so therefore I've
changed it to warn. Also it removes the email we get for the exception
which is not needed as we get the zendesk ticket instead.

I've also fixed the multiline string meaning the link to the runbook is
included in the zendesk ticket.
2019-12-17 11:47:23 +00:00
Pea Tyczynska
c00f82b81b Co-Authored-By: Chris Hill-Scott <me@quis.cc>
Use .format instead of concatenation to avoid type issues

Trying to concatenate uuid onto a string was throwing an error.

Also it is not possible to use uuid in parametrize statements
it seems as it messes up with running tests on multiple threads
2019-12-11 11:18:42 +00:00
Pea Tyczynska
87bc86efa7 Reference dev runbook for instructions in the zendesk ticket 2019-12-06 17:05:43 +00:00
Pea Tyczynska
1b7b26bf24 Query directly for services with high failure rate 2019-12-06 16:57:56 +00:00
Pea Tyczynska
b8de67ae54 Update error message to include a url to offending service 2019-12-06 16:57:54 +00:00
Pea Tyczynska
339b6c0ec7 Refactor a test so it doesn't do query that's tested elsewhere 2019-12-06 16:57:54 +00:00
Pea Tyczynska
cfbb080f57 Simplify failure rate by building separate query 2019-12-06 16:57:44 +00:00
Pea Tyczynska
53efd87e28 Check for services sending sms messages to tv numbers 2019-12-06 16:57:34 +00:00
Pea Tyczynska
d72ab4f4a6 Send zendesk ticket when services found with high failure rates 2019-12-06 16:57:04 +00:00
Leo Hemsted
f7fbd6de5b make 500s change priorities quicker
it's not acceptable for a constantly failing provider to take 50 minutes
to drain (5x reducing priority by 10). But similarly, we need _some_
delay, or a handful of concurrent failures will completely turn off a
provider, rendering the whole excercise kinda pointless. Setting the
delay before it tries to reduce priority again to one minute is nice
because it means that if one request times out and returns 502, then any
other requests that are in flight at that time will time out before the
one minute is up and not switch, but any requests made after the switch
that take sixty seconds to time out will affect it.
2019-11-28 13:29:39 +00:00
Leo Hemsted
cfe82f8f4a make 500 error provider switches also check for recent changes
moving the logic and the test from switch provider on slow delivery to
dao reduce sms provider priority
2019-11-28 13:29:39 +00:00
Leo Hemsted
2a392e7137 update switch provider scheduled task
it now looks at both providers and works out whether to deprioritise
one, rather than binary switching from one to the other. If anything
has altered the priorities in the last ten minutes it won't take any
action. If both providers are slow it also won't take any action.
2019-11-28 13:29:38 +00:00
Leo Hemsted
e29546cb65 flake8 2019-11-28 13:29:02 +00:00
Leo Hemsted
28da190a1c remove get_current_provider
the function no longer makes sense now that we send through both at
the same time. mostly just used in old tests that we'll end up rewriting
shortly anyway
2019-11-28 13:29:02 +00:00
Rebecca Law
4fd6f33af2 Merge pull request #2658 from alphagov/fix-letters-in-created-status
Alert if a letter doesn't make it past created status
2019-11-27 13:38:51 +00:00
Rebecca Law
e0b4b258aa Shortened the length of time to check for messages with the wrong state.
There is a chance that the there is an outstanding retry task that has yet to run but the task that are replayed here protect against the task running twice. So this just means it might get sent sooner than later.
2019-11-21 15:51:27 +00:00
Rebecca Law
ac4f0e8027 After a comment from @idavidmcdonald, I asked myself why are not creating the task to upload the pdf and update the notification.
The assumption was that S3 would throw an exception if the object was uploaded twice. That's not the case the default behaviour is that if a file already exists it will be overwritten. So it is completely safe to run the task from the alert.

It can also mean that we don't need to wait 4hours 15 minutes. Shall I decease the amount of time before restarting the task?
2019-11-19 16:04:21 +00:00
Rebecca Law
918975b0a6 Use sender_id from CSV metadata.
When we upload a CSV for a job, we add the sender_id as metadata to the file that is uploaded on S3.
There is more than one place where we process rows from that CSV.
 - process_job
 - scheduled_job
 - check_for_missing_rows_in_completed_jobs
 - check_job_status

All of these places need to use the sender_id, now the sender_id is always read from the file metadata.
In a subsequent PR we can remove the optional sender_id parameter from process_job task.
2019-11-15 15:42:29 +00:00
Rebecca Law
c42420c329 Add an alert when a letter is created but doesn't have a file in S3 for sending. We can tell this is the case because there is no updated_at and billable units are still 0.
At this point we are just creating a zendesk ticket - perhaps we can just call the create_letter_pdf task.
2019-11-13 16:39:59 +00:00
Rebecca Law
db5a50c5a7 Adding a scheduled task to processing missing rows from job
Sometimes a job finishes but has missed a row in the middle. It is a mystery why this is happening, it could be that the task to save the notifications has been dropped.
So until we solve the missing let's find missing rows and process them.

A new scheduled task has been added to find any "finished" jobs that do not have enough notifications created. If there are missing notifications the job processes those rows for the job.
Adding the new task to beat schedule will be done in the next commit.

A unique key constraint has been added to Notifications to ensure that the row is not added twice. Any index or constraint can affect performance, but this unique constraint should not affect it enough for us to notice.
2019-11-06 10:49:46 +00:00
Katie Smith
38243cf860 Stop calling fixtures as functions in the tests 2019-10-30 13:05:53 +00:00
Rebecca Law
f097abe82b Change the query to get the notifications for the check_templated_letter_state.
Now looking at the updated_at date, we are getting the alert if the notification was created_at:17:29 updated to created status at 17:30, so the letter is in the next days bucket.

Not sure if I want to make this change, there isn't an index on updated_at, so the query might be slow.
2019-08-16 10:37:51 +01:00
Katie Smith
a790acc091 Create a Zendesk ticket for letters in the wrong state
This creates a Zendesk ticket if either the
`check_precompiled_letter_state` or `check_templated_letter_state` tasks
fail.
2019-06-18 10:58:58 +01:00
Katie Smith
c518f6ca76 Add scheduled task to find old letters which still have 'created' status
Added a scheduled task to run once a day and check if there were any
letters from before 17.30 that still have a status of 'created'. This
logs an exception instead of trying to fix the error because the fix
will be different depending on which bucket the letter is in.
2019-06-18 10:58:58 +01:00