Commit Graph

851 Commits

Author SHA1 Message Date
David McDonald
070b79c27e Downgrade exceptions to warnings to reduce emails
We already trigger a zendesk ticket for these two cases, meaning that
whenever we get this situation, we get 3 emails. One for the zendesk
ticket, one from logit raising the fact an exception was raised and one
from cloudwatch raising the fact an exception was raised.

We don't need all these emails, a zendesk ticket is sufficient.
Downgrading to a warning means this event will still be findable in our
logs however.
2021-02-02 15:10:26 +00:00
David McDonald
86ea89cf76 Merge pull request #3098 from alphagov/downgrade-to-warning
Downgrade SMS provider request exceptions to warnings
2021-01-29 11:52:10 +00:00
David McDonald
ac6837cde5 Downgrade exception to warning for provider API call
When we send an HTTP request to our SMS providers, there is a
chance we get a 5xx status code back from them. Currently we log this as
two different exception level logs.

If a provider has a funny few minutes, we could end up with
hundreds of exceptions thrown and pagerduty waking someone up in the
middle of the night. These problems tend to pretty quickly fix
themselves as we balance traffic from one SMS to the other SMS provider
within 5 minutes.

By downgrading both exceptions to warning in the case of a
`SmsClientResponseException`, we will reduce the change of waking us up
in the middle of the night for no reason.

If the error is not a `SmsClientResponseException`, then we will still
log at the exception level as before as this is more unexpected and we
may want to be alerted sooner.

What we still want to happen though is that let's say both SMS providers
went down at the same time for 1 hour. We don't want our tasks to just
sit there, retrying every 5 minutes for the whole time without us being
aware (so we can at least raise a statuspage update). Luckily we will
still be alerted because our smoke tests will fail after 10 minutes and
raise a p1:
https://github.com/alphagov/notifications-functional-tests/blob/master/tests/functional/staging_and_prod/notify_api/test_notify_api_sms.py#L21
2021-01-18 17:00:21 +00:00
Chris Hill-Scott
4eb4ea1772 Use cache for tasks that save notifications
These tasks need to repeatedly get the same template and service from
the database. We should be able to improve their performance by getting
the template and service from the cache instead, like we do in the REST
endpoint code.
2021-01-18 10:25:24 +00:00
David McDonald
20627d96ea Put all broadcast tasks on the broadcast worker 2021-01-13 17:21:40 +00:00
David McDonald
c3ef23c771 Alert on 2nd class letters still in sending everyday
In 8285ef5f89
we turned off alerting on 2nd class letters still being in sending on
certain days of the week because we were only sending letters out on
Mon, Wed, Fri.

Now we have swapped back to sending out 2nd class letters on all
workdays so this change can be reverted. Note, I haven't reverted the
commit exactly but more so the behaviour, whilst leaving in some tests
to explicitly test 2nd class letters for the alert in case we change
this again.
2021-01-13 11:21:27 +00:00
David McDonald
977554781f Add better logging message for tech failure
So we can easily identify which notification ID failed
2020-12-30 17:28:21 +00:00
David McDonald
2480f91667 Raise better exception on InvalidParameterValue error
There are several reasons why we might get an `InvalidParameterValue`
from the SES API. One, as correctly identified before in
https://github.com/alphagov/notifications-api/pull/713/files
is if we allow an email address on our side that SES rejects.

However, there are other types of errors that could cause an
`InvalidParameterValue`. One example is a `Header too long: 'Subject'`
error that we have seen happen in production. This shouldn't raise an
`InvalidEmailError` as that is not appropriate.

Therefore, we introduce a new exception
`EmailClientNonRetryableException`, that represents any exception back
from an email client that we can use whenever we get a
`InvalidParameterValue` error.

Note, I chose `EmailClientNonRetryableException` rather than
`SESClientNonRetryableException` as our code needs to catch this
exception and it shouldn't be aware of what email client is being used,
it just needs to know that it came from one of the email clients (if in
time we have more than one).

In time, we may wish to extend the approach of having generic
`EmailClient` exceptions and `SMSClient` exceptions as this should be
the most extendable pattern and a good abstraction.
2020-12-30 17:18:16 +00:00
Rebecca Law
a2bb775b6f Merge pull request #3069 from alphagov/add-request-id-if-in-context
Pass request_id onto the task if called from a task
2020-12-23 12:26:04 +00:00
Rebecca Law
a1b31a6c20 Check for app_context and request in g to prevent Attribute Errors.
We can add a request_id for tasks that are not spawned by an HTTP request, for example scheduled or nightly tasks. That means you can match up all the tasks spawned by a single task, for example, create-night-billing spawns 4 tasks, those would all have the same idea. Not sure if that is helpful or not. Also it might be confusing to have a request_id for logs that were not started from a request so I have left it out.
2020-12-23 09:47:47 +00:00
Rebecca Law
025b51c801 If the request_id exists in the Flask global context, add it to the kwargs for the task.
The request_id is set is the task is created from a http request, if that task then calls through to another task this will set the request_id from the global context. We should then be able to follow the creation of a notification all the way from the original http request to the sending task.
2020-12-22 15:21:32 +00:00
David McDonald
fae7e917b5 Fix logging line to include response context
We have been getting log lines of the following:

`API POST request on
https://api.notifications.service.gov.uk/notifications/sms/mmg failed
with None`

It's not clear what error caused the request to fail because the value
of `api_error.response` is always `None`.

There appears to be something wrong with this logging.
`raise_for_status` will raise an `HTTPError`, so then there should be no
reason to then pass that error into another `HTTPError` (which is
causing the response to be lost).

We can instead simply catch the `HTTPError` and log it's status
code.

This might not be perfect, but it's definitely an improvement and should
give us some more context about why these requests occasionally fail.
2020-12-21 14:39:10 +00:00
Pea Tyczynska
95deb5a52f Move DATETIME_FORMAT from app to app.utils
To avoid cyclical import issues
2020-12-18 17:39:35 +00:00
Pea Tyczynska
45b806f6db Remove unused args from cancel broadcast call in tasks 2020-12-14 11:31:05 +00:00
Pea M. Tyczynska
a70b7c521e Merge pull request #3053 from alphagov/ibag-message-number
Add sequential message number to broadcast provider messages
2020-12-09 13:02:25 +00:00
Pea Tyczynska
def7a16765 Establish relation between provider message and message number
this is so we can access brodcast_provider_message_number from
BroadcastProviderMessage object
2020-12-09 11:41:22 +00:00
Pea Tyczynska
8af4b27fd6 Separate functions for cbc clients
Also move message_format to the clients.
2020-12-09 11:13:50 +00:00
Pea Tyczynska
553565bc91 Send message format to CBC
Either cap or ibag
2020-12-08 11:15:26 +00:00
Leo Hemsted
9502f17d84 flake8 fixes
a stricter flake8 bump. mostly things around f strings and format
strings, but a couple of bad placeholder names in loops
2020-12-07 15:24:02 +00:00
Pea Tyczynska
2952b70930 Only create sequential numbers for Vodafone messages 2020-12-07 13:13:13 +00:00
Pea Tyczynska
e95dc9450e Include message number in send_broadcast_provider_message 2020-12-07 13:13:12 +00:00
Pea Tyczynska
a186d2d296 Format sequential number into an 8 char long hex
As per Vodafone spec for ibag format message number
2020-12-07 13:13:11 +00:00
Pea Tyczynska
b34bffaae6 Sends sequential number to Vodafone as link test 2020-12-07 13:13:11 +00:00
Leo Hemsted
fd335e3d8b move available provider logic to the service model
make sure it's in an accessible place so we don't end up duplicating our
work
2020-12-03 22:50:50 +00:00
Leo Hemsted
72f8a15d4f respect service broadcast provider restrictions when sending 2020-12-03 13:39:09 +00:00
Leo Hemsted
e2fa0116a0 add CBC_PROXY_ENABLED config flag to control if tasks are triggered
previously we made some incorrect assumptions about set-up on staging
and prod - they currently don't have any cbc_proxy aws creds at all.

We shoudn't be attempting canaries or link tests when there's no AWS
infrastructure to connect to.

We also shouldn't bother writing a row into the database at all for the
broadcast_provider_message since we're not even attempting to send, and
we shouldn't get confused between messages that failed and messages we
never wanted to send at all.
2020-11-26 10:16:22 +00:00
Leo Hemsted
54fecf2182 Merge pull request #3035 from alphagov/broadcast-event-response
Send broadcast events per provider
2020-11-25 10:16:30 +00:00
David McDonald
43f1f48093 Add notification ID to SES bounce reason
At the moment we log everytime we get a bounce from SES, however we
don't link it to a particular notification so it's hard to know for what
sub reason a notifcation did not deliver by looking at the logs.

This commit changes this by now looking the bounce reason after we have
found the notification ID and including them together. So if you know
search for a notification ID in Kibana, you will see full logs for why
it failed to deliver.
2020-11-20 14:10:13 +00:00
Leo Hemsted
087cc5053d separate cbc proxy into separate clients
this is a pretty big and convoluted refactor unfortunately.

Previously:

There was one global `cbc_proxy_client` object in apps. This class has
the information about how to invoke the bt-ee lambda, and handles all
calls to lambda. This includes calls to the canary too (which is a
separate lambda).

The future:

There's one global `cbc_proxy_client`. This knows about the different
provider functions and lambdas, and you'll need to ask this client for a
proxy for your chosen provider. call cbc_proxy_client.get_proxy('ee')`
and it'll return you a proxy that knows what ee's lambda function is,
how to transform any content in a way that is exclusive to ee, and in
future how to parse any response from ee.

The present:

I also cleaned up some duplicate tests.
I'm really not sure about the names of some of these variables - in
particular `cbc_proxy_client` isn't a client - it's more of a java style
factory, where you call a function on it to get the client of your
choice.
2020-11-19 15:50:37 +00:00
Leo Hemsted
0257774cfa add get_earlier_provider_message fn to broadcast_event
replacing get_earlier_provider_messages. The old function returned the
previous references for earlier events for a broadcast_message. However,
these depend on the message sent to a specific provider, so the function
needs to change. It now takes in a provider, and only returns
broadcast_provider_messages sent to that provider. If there are earlier
broadcast_events without a provider_message for the chosen provider, it
raises an exception - you cannot cancel a message if all the previous
events have not been created properly (as we wouldn't know what
references to cancel).
2020-11-19 15:50:37 +00:00
Leo Hemsted
f12c949ae9 create broadcast_provider_message and use id from that instead
(instead of using the id from broadcast_event)

we need every XML blob we send to have a different ID. if we're sending
different XML blobs for each provider, then each one should have a
different identifier. So, instead of taking the identifier from the
broadcast_event, take it from the broadcast_provider_message instead.

Note: We're still going to the broadcast_event for most fields, to
ensure they stay consistent between different providers. The last thing
we want is for different phone networks to get different content
2020-11-19 15:50:37 +00:00
Leo Hemsted
7cc83e04eb move BroadcastProvider from models.py to config.py
It's not something that is tied to a database table, and was causing
circular import issues
2020-11-19 15:50:37 +00:00
Leo Hemsted
bc3512467b send messages to multiple providers
at the moment only EE is enabled (this is set in app.config, but also,
only EE have a function defined for them so even if another provider was
enabled without changing the dict in cbc_proxy.py we won't trigger
anything). this commit just adds wrapper tasks that check what providers
are enabled, and invokes the send function for each provider.

The send function doesn't currently distinguish between providers for
now - as we only have EE set up. in the future we'll want to separate
the cbc_proxy_client into separate clients for separate providers.
Different providers have different lambda functions, and have different
requirements. For example, we know that the two different CBC software
solutions handle references to previous messages differently.
2020-11-19 15:50:37 +00:00
David McDonald
224d9bf35a Log when we don't retry a callback
We don't retry any callbacks when it receives a 4xx status. We should
probably be aware of this happening and at the moment there is nothing
in our logs to easily identify whether the request failed and is being
retried or if it failed and is not being retried. This will enable us to
search our logs easily and figure out how much it's happening.

It's quite likely that we should in the future allow callbacks to retry
if they get a 429 http response (rate limiting) but we should do this in
a smart way (exponential backoff) and so this is a first step to being
aware of how big a problem it is in case we want to do something about
it.
2020-11-17 11:26:32 +00:00
Rebecca Law
29b6f84f6c Revert "Revert "Add a task to save-api-sms for high volume services."" 2020-10-29 11:12:46 +00:00
Toby Lorne
dd012d6831 client: cbc_proxy passes through sent/expires
A BroadcastEvent knows when an event was sent and should expire

We pass through these values directly to the CBC Proxy, because
BroadcastEvent knows how they should be formatted

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
2020-10-28 11:37:06 +00:00
Toby Lorne
dda71bf685 celery: add task for triggering link tests
We want to periodically kick off some link tests, so that:

- we are periodically communicating with the CBC Proxies
- each CBC Proxy is periodically communicating with its CBC

This simulation of traffic to the CBC will give us advance warning if
something is going to break, or is broken, before someone tries to send
a real live message

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Richard <richard.baker@digital.cabinet-office.gov.uk>
Co-authored-by: Pea <pea.tyczynska@digital.cabinet-office.gov.uk>
2020-10-27 15:39:28 +00:00
Pea Tyczynska
5cff30aa47 Send a log message informing that we are sending canary to CBC proxy 2020-10-27 10:45:32 +00:00
Toby Lorne
be90455944 Add task to send canary to cbc proxy
Create and schedule a Celery task that tests if we
can send a canary message to cbc proxy.

This will help us know if something happens to our
connection to cbc proxy.

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Pea <pea.tyczynska@digital.cabinet-office.gov.uk>
Co-authored-by: Richard <richard.baker@digital.cabinet-office.gov.uk>
2020-10-27 10:38:09 +00:00
Toby Lorne
d22adb7813 broadcasts: do not send description to cbc_proxy
"areas" and "simple_polygons" in "transmitted_areas" do not have the
same length

as an example, choosing the area "england" results in a single item in
"areas" but many polygons in "simple_polygons"

therefore zipping these two together gives a list of areas:
* of length 1
* containing only new grimsby

which is not what we want

as the CBC does not care about the areaDesc field within CAP, we should
omit it from the function invocation and delegate the contents of
areaDesc to the CBC Proxy implementation

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Richard <richard.baker@digital.cabinet-office.gov.uk>
Co-authored-by: David <david.mcdonald@digital.cabinet-office.gov.uk>
2020-10-26 15:16:33 +00:00
Leo Hemsted
902a2dde94 Merge pull request #3014 from alphagov/get-query-in-50k-batches
Optimise dao_get_letters_to_be_printed query
2020-10-26 13:23:17 +00:00
Toby Lorne
fdacb2e0d7 broadcasts: remove references to cbc stub
We are phasing out our cbc-proxy stub which displayed CAP XML messages

We are in the process of testing with real CBCs, so maintaining our own
stub is not useful

This commit
* removes the HTTP POST requests to the CBC proxy
* writes up the update/cancel methods of the cbc_client (not impl)

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
2020-10-26 10:23:49 +00:00
Leo Hemsted
ed182c2a22 return just the columns we need for collating letters
previously we were returning the entire ORM object. Returning columns
has a couple of benefits:

* Means we can join on to services there and then, avoiding second
  queries to get the crown status of the service later in the collate
  flow.
* Massively reduces the amount of data we return - particularly free
  text fields like personalisation that could be potentially quite big.
  5 columns rather than 26 columns.
* Minor thing, but will skip some CPU cycles as sqlalchemy will no
  longer construct an ORM object and try and keep track of changes. We
  know this function doesn't change any of the values to persist them
  back, so this is an unnecessary step from sqlalchemy.

Disadvantages are:

* The dao_get_letters_to_be_printed return interface is now much more
  tightly coupled to the get_key_and_size_of_letters_to_be_sent_to_print
  function that calls it.
2020-10-23 20:01:18 +01:00
Toby Lorne
aa002afd31 clients: cbc_proxy actions accepts areas param
related:
https://github.com/alphagov/notifications-broadcasts-infra/pull/23

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
2020-10-23 17:09:00 +01:00
Leo Hemsted
4b61060d32 stream notifications when collating zip files
we had issues where we had 150k 2nd class notifications, and the collate
task never ran properly, presumably because the volume of data being
returned was too big.

to try and help with this, we can switch to streaming rather than using
`.all` and building up lists of data. This should help, though the
initial query may be a problem still.
2020-10-23 12:20:26 +01:00
David McDonald
3dcb97c45a Remove 'INSOLVENCY' from zip files for insolvency letters
This is at request of DVLA. They would prefer to have zip files with the
same number of arguments in the name. After being offered a few
different options, such as including an org and service id for all zips,
they chose to just remove the 'INSOLVENCY' tag.

For more context see PR that added the tag
https://github.com/alphagov/notifications-api/pull/3006
2020-10-23 09:58:28 +01:00
David McDonald
890a66de46 Merge pull request #3003 from alphagov/staging-preview-cbc
Non-production environments invoke CBC Proxy during broadcast event creation
2020-10-22 16:21:11 +01:00
Pea Tyczynska
deb360846d Mark letters from insolvency service
So that DVLA knows to process them separately to avoid issues.
2020-10-21 16:56:18 +01:00
Pea Tyczynska
d745ba310e Divide letters by service when putting in ZIPs
When letters are sent to DVLA, we will now put them in a separate
ZIP file for each service, so that if there are printing issues
due to bad files from one service, other services will hopefully
not be affected by that.
2020-10-21 15:19:06 +01:00
Toby Lorne
2cc0a65851 celery: broadcast msg create invokes cbc proxy
When we create a broadcast message, we should invoke the cbc proxy to
send a cap message

Either a function will be invoked within AWS, or a noop function call
is made, depending on the environment

We have only implemented CB message creation in the CBC Proxy, without
polygons, therefore we:
* only invoke the CBC Proxy during message creation
* only send description, identifier, and hard-coded headline

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Pea <pea.tyczynska@digital.cabinet-office.gov.uk>
Co-authored-by: Katie <katie.smith@digital.cabinet-office.gov.uk>
2020-10-20 11:57:26 +01:00