8612 Commits

Author SHA1 Message Date
Pea Tyczynska
a4c20e8ba6 Return 404 if reference from cancel message does not match
If the reference from the cancel CAP XML we received via API does not
match any existing broadcast, return 404.

Do the same if service id doesn't match.

Also refactor code to cancel broadcast out into separate function

It should be a separate function that is only called by create_broadcast
function. This will prevent create_broadcast from becoming too
big and complex and doing too many things.
2022-01-19 15:42:27 +00:00
Pea Tyczynska
3b4a9d8942 Cancel broadcast via API
When a service sends us an XML CAP broadcast message with Cancel
status, and that broadcast is in broadcasting state, we cancel it.
2022-01-19 15:42:26 +00:00
Pea Tyczynska
940126abfb Reject unapproved broadcast upon cancel API request
When a service sends us a cancel broadcast XML via API, if that
broadcast was not approved yet, reject it.
2022-01-19 15:41:38 +00:00
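The cancel behaviour described in a4c20e8ba6, 3b4a9d8942 and 940126abfb could be sketched roughly as below. This is only an illustration: the function name, the `BroadcastNotFound` exception, the dictionary shape and the status strings (`pending-approval`, `broadcasting`, `rejected`, `cancelled`) are assumptions, not the code in the repository.

```python
class BroadcastNotFound(Exception):
    """Hypothetical error that an API layer would translate into a 404."""
    status_code = 404


def cancel_broadcast(broadcasts, service_id, reference):
    """Cancel or reject the broadcast matching both the reference and the service."""
    match = next(
        (
            b for b in broadcasts
            if b["reference"] == reference and b["service_id"] == service_id
        ),
        None,
    )
    if match is None:
        # Unknown reference, or a reference belonging to another service.
        raise BroadcastNotFound(f"No broadcast found for reference {reference}")

    if match["status"] == "pending-approval":
        # Not approved yet, so a cancel request rejects it.
        match["status"] = "rejected"
    elif match["status"] == "broadcasting":
        # Already live, so a cancel request cancels it.
        match["status"] = "cancelled"
    # Any other status is left unchanged in this sketch.
    return match
```

Keeping this in its own function, called only from the create path, matches the motivation given in a4c20e8ba6: the creating function stays small.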
Katie Smith
5cd6fcbb4f Merge pull request #3423 from alphagov/org-user-delte
Add endpoint to allow org team members to be removed
2022-01-13 08:39:32 +00:00
Katie Smith
ed725c1513 Add endpoint to allow org team members to be removed
This is similar to the corresponding endpoint for services. However,
it is a little simpler since we don't need to worry about always having
at least one team member for an organisation.

The new dao function added, `dao_remove_user_from_organisation`, is also
simpler than `dao_remove_user_from_service` since we don't have any
organisation permissions to deal with.
2022-01-11 15:20:48 +00:00
Ben Thorner
081e0cab88 Merge pull request #3417 from alphagov/optimise-status-query-180693991
Optimise query to populate notification statuses
2022-01-11 14:18:36 +00:00
Ben Thorner
63b5204fb0 Optimise query to populate notification statuses
Investigation with EXPLAIN and EXPLAIN ANALYZE for the notification
history table shows this is another instance of [1] but for the key
type column. Swapping "!=" for "IN" solves the problem.

[1]: https://github.com/alphagov/notifications-api/pull/3360
2022-01-11 13:22:04 +00:00
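A small sketch of the rewrite described in 63b5204fb0, using an in-memory SQLite table rather than the real Postgres schema. The table layout and the key type values `normal`, `team` and `test` are assumptions for illustration. The sketch only shows that the two filters select the same rows; it does not reproduce the query planner behaviour that motivated the change.

```python
import sqlite3

# Assumed key types to keep; only "test" is meant to be excluded.
KEY_TYPES_TO_KEEP = ("normal", "team")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notification_history (id INTEGER PRIMARY KEY, key_type TEXT)"
)
conn.executemany(
    "INSERT INTO notification_history (key_type) VALUES (?)",
    [("normal",), ("team",), ("test",), ("normal",)],
)

# Original form: a negative filter on the key type column.
not_equal_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM notification_history WHERE key_type != 'test' ORDER BY id"
    )
]

# Rewritten form: list the wanted values explicitly.
in_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM notification_history WHERE key_type IN (?, ?) ORDER BY id",
        KEY_TYPES_TO_KEEP,
    )
]
```

The two forms are only interchangeable while the set of key types is closed, so a new key type would need adding to the list.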
Ben Thorner
e4dcea5396 Merge pull request #3421 from alphagov/explain-status-task-180693991
Add comment to explain status aggregation approach
2022-01-11 12:33:38 +00:00
Rebecca Law
ff7ee2cb63 Merge pull request #3422 from alphagov/fix-organisation-billing-query
Fix bug in organisation report for its services and usages.
2022-01-11 11:43:01 +00:00
Rebecca Law
2257cae398 Fix bug in organisation report for its services and usages.
If a service has not sent any SMS for the financial year, the free allowance was showing up as 0 rather than the number in annual billing. The query has been updated to use an outer join so that the free allowance will be returned when there is no ft_billing.

There is a potential performance enhancement to only return the data for the services of the organisation in the `fetch_sms_free_allowance_remainder_until_date` subquery. I will investigate in a subsequent PR.
2022-01-11 10:04:36 +00:00
Ben Thorner
a7b39a930c Add comment to explain status aggregation approach
This relates to the performance optimisation work we're doing [1].
Before optimising the task, it's worth asking if we can do less -
the comment explains why it has to be this way.

Some references to back up the comment:

- We do status updates in either table [2].
- We don't allow duplicate receipts for emails [3].
- We don't allow duplicate receipts for SMS [4].
- We don't expect duplicate receipts for letters.

This is something we would need to revisit if we want to support
additional status updates - we could reject based on the age of the
notification, rather than the status.

[1]: https://github.com/alphagov/notifications-api/pull/3417
[2]: 20ead82463/app/dao/notifications_dao.py (L538)
[3]: 20ead82463/app/celery/process_ses_receipts_tasks.py (L58)
[4]: 20ead82463/app/dao/notifications_dao.py (L129-L135)
2022-01-10 18:15:54 +00:00
Ben Thorner
394bf9abd9 Extend test for updating fact statuses
This covers that we only exclude test notifications and the key
type is copied over correctly. In the next commits we're going to
modify this part of the query, so it's important it's covered.
2022-01-05 16:49:30 +00:00
Katie Smith
20ead82463 Merge pull request #3403 from alphagov/get-notis-post
Allow `get_all_notifications_for_service` to accept POST requests
2022-01-04 14:26:52 +00:00
Katie Smith
13b6d1e490 Remove unused test function
`set_up_get_all_from_hash` stopped being used in 52831813d8
2022-01-04 14:04:03 +00:00
Katie Smith
3530d26ba3 Use client fixture everywhere
There were a few tests which weren't using the `client` fixture but were
using the code it contains. This simplifies them to use the fixture.
2022-01-04 14:04:03 +00:00
Katie Smith
0b7410818e Allow get_all_notifications_for_service to accept POST requests
We want admin to send a POST request to this route if the data contains
a message recipient (a phone number or email address) so that this does
not show in the logs. This changes the route to accept both GET and POST
requests.
2022-01-04 14:04:03 +00:00
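One way a single route can serve both methods, as described in 0b7410818e, is to read the filter data from the query string for GET and from the JSON body for POST. The helper below is a framework-free sketch; its name and parameters are hypothetical and it is not the handler in the repository.

```python
def extract_filter_data(method, query_args, json_body):
    """Return the filter parameters for either supported HTTP method.

    GET carries the filters in the URL, which typically ends up in request
    logs. POST carries them in the body, which keeps a recipient out of the
    logged URL.
    """
    if method == "GET":
        return dict(query_args)
    if method == "POST":
        return dict(json_body or {})
    raise ValueError(f"Unsupported method: {method}")
```

For example, `extract_filter_data("POST", {}, {"to": "someone@example.com"})` returns the recipient filter without it ever appearing in the URL.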
Ben Thorner
494a01ba57 Merge pull request #3415 from alphagov/standard-freeze-180760212
Centralise documentation for updating dependencies
2021-12-29 16:01:08 +00:00
Ben Thorner
c03647fb4b Centralise documentation for updating dependencies
This follows the convention established in [1].

[1]: https://github.com/alphagov/notifications-antivirus/pull/83
2021-12-29 14:59:38 +00:00
Richard Baker
ead9814af9 Merge pull request #3412 from alphagov/reduce-db-pool-size
Reduce pool size from 30 to 15 connections
2021-12-24 11:40:37 +00:00
Richard Baker
10c09338c3 Merge pull request #3413 from alphagov/increase-timeout-more
Bump sqlalchemy statement timeout even higher for reporting worker
2021-12-24 09:46:33 +00:00
David McDonald
edad1c9a21 Bump sqlalchemy statement timeout even higher for reporting worker
We saw it fail again last night to calculate how many notifications
were sent for one of our services to put in the ft_notification_status
table. It ran into the sqlalchemy statement timeout again.

To get us through the holiday period, let's make it 2 hours, as surely
that will be enough, and then we can fix this properly.
2021-12-24 08:56:42 +00:00
sakisv
ad8cf3f3a6 Reduce pool size from 30 to 15 connections
Having a pool size of 30 connections means that if we receive a big
number of requests, with the current configuration, the API would end up
holding onto 30 connections per worker * 4 workers per instance * 35
instances = 4200 connections. With a limit of 5000 connections, this
means that we would only have 800 connections to share between the
workers or for overflow usage (btw, even the overflow for the API would
take us above the 5000 limit - 10 overflow connections per worker * 4 *
35 = 1400 connections, total 5600 _only_ for the API).

During our load tests this led to a deadlock situation where nothing
could retrieve connections to deal with a queue build-up.

The reduced pool size allowed for a much more graceful degradation of
the service where, after significant load we would increase the response
times but still manage to serve all the requests.
2021-12-23 19:28:17 +02:00
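The connection arithmetic in ad8cf3f3a6 can be checked with a few lines. The figures (4 workers per instance, 35 instances, a 5000 connection limit, 10 overflow connections per worker) come from the commit message; the variable and function names are just for illustration.

```python
WORKERS_PER_INSTANCE = 4
INSTANCES = 35
CONNECTION_LIMIT = 5000


def api_connections(per_worker):
    """Connections held across the whole API for a given per-worker count."""
    return per_worker * WORKERS_PER_INSTANCE * INSTANCES


old_pool = api_connections(30)            # 4200
overflow = api_connections(10)            # 1400
new_pool = api_connections(15)            # 2100

old_headroom = CONNECTION_LIMIT - old_pool   # 800 left for everything else
old_worst_case = old_pool + overflow         # 5600, above the limit
new_headroom = CONNECTION_LIMIT - new_pool   # 2900 left for everything else
```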
Rebecca Law
77084533fb Merge pull request #3411 from alphagov/increase-timeout-for-reporting-worker
Increase the SQL timeout for the `notify-delivery-worker-reporting` app.
2021-12-23 11:49:28 +00:00
Rebecca Law
603acc8b1e Increase the SQL timeout for the notify-delivery-worker-reporting app.
When running the nightly reporting tasks, we are seeing that some tasks are failing because the query is timing out. We need to revisit how to optimise the query, but this will at least let the process finish.
2021-12-23 11:41:49 +00:00
David McDonald
3a214da379 Merge pull request #3408 from alphagov/db-connection-close
Close DB connection whilst making HTTP to SMS providers
2021-12-22 11:02:13 +00:00
David McDonald
2584946823 Close DB connection whilst making HTTP to SMS providers
At the moment, when we are processing and sending an SMS we open
a DB connection at the start of the celery task and then close it
at the end of the celery task. Nice and simple.

However, during that celery task we make an HTTP call out to our
SMS providers. If our SMS providers have problems or response times
start to slow then it means we have an open DB connection sat waiting
for our SMS providers to respond which could take seconds. If our
SMS providers grind to a halt, this would cause all of the
celery tasks to hold on to their connections and we would run out
of DB connections and Notify would fall over.

We think we can solve this by closing the DB session which releases
the DB connection back to the pool.

Note, we've seen this happen in staging during load testing if our
SMS provider stub has fallen over. We've never seen it in production
and it may be less likely to happen as we are balancing traffic
across two providers and they generally have very good uptime.

One downside to be aware of is there could be a slight increase in
time spent to send an SMS as we will now spend a bit of extra time
closing the DB session and then reopening it again after the HTTP
request is done.

Note, there is no reason this approach couldn't be copied for our
email provider too if it appears successful.
2021-12-21 17:45:53 +00:00
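The idea in 2584946823 can be illustrated with stand-in classes. `FakePool`, `FakeSession` and `send_sms` below are hypothetical and only model the behaviour the message relies on: closing the session hands its connection back to the pool, and the next query checks one out again.

```python
class FakePool:
    """Stand-in for a connection pool that counts checked-out connections."""

    def __init__(self):
        self.checked_out = 0


class FakeSession:
    """Stand-in for a DB session that checks a connection out on first use."""

    def __init__(self, pool):
        self.pool = pool
        self.has_connection = False

    def query(self, what):
        if not self.has_connection:
            self.pool.checked_out += 1
            self.has_connection = True
        return f"result of {what}"

    def close(self):
        if self.has_connection:
            self.pool.checked_out -= 1
            self.has_connection = False


def send_sms(session, call_provider):
    notification = session.query("notification")  # checks out a connection
    session.close()                               # release it before the slow HTTP call
    response = call_provider(notification)        # no connection is held here
    session.query("update notification status")   # checks out a connection again
    session.close()
    return response
```

The cost is the extra close and reopen per task, which is the trade-off the commit message calls out.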
Pea Tyczynska
32cd7a0eb6 Merge pull request #3395 from alphagov/fix_org_usage_report
Fix calculating remaining free allowance for SMS
2021-12-21 15:02:54 +00:00
Pea Tyczynska
d334e405c5 Refactor tests for sms remainder to make them easier to read
2021-12-21 14:43:56 +00:00
Ben Thorner
e55b654a0b Merge pull request #3407 from alphagov/downgrade-inbound-log
Downgrade log about orphaned inbound SMS
2021-12-21 13:36:10 +00:00
Ben Thorner
f65fb519c7 Merge pull request #3404 from alphagov/remove-redundant-conditional-180477467
Remove redundant conditional for letter branding
2021-12-21 13:35:59 +00:00
Ben Thorner
3d30965193 Downgrade log about orphaned inbound SMS
We can't control who might be sending messages on inbound numbers
that we own i.e. this log isn't an actionable error. Looks like it
used to represent something that _was_ an error [1], but that's not
the case anymore, so it seems reasonable to downgrade it.

[1]: d99ab329eb (diff-80d123d9abb40f80a221979940657a2751cc7cb33f255aa8f352a8324023e022L125)
2021-12-21 12:49:00 +00:00
Ben Thorner
c52cb4a8a8 Merge pull request #3406 from alphagov/bump-utils-51-3-0-180693991
Bump utils to 51.3.0
2021-12-20 16:59:43 +00:00
Ben Thorner
491b7ce9ee Bump utils to 51.3.0
This brings in new logging for the NotifyCelery base class [1].

[1]: https://github.com/alphagov/notifications-utils/pull/938
2021-12-20 16:45:47 +00:00
Ben Thorner
f4d967c0f1 Merge pull request #3405 from alphagov/downgrade-delete-letter-log-180692253
Downgrade log for letter deletion exceptions
2021-12-20 13:39:24 +00:00
Ben Thorner
de9ae08ecc Downgrade log for letter deletion exceptions
If the S3 object is missing [1], then that's what we want, so we
don't need such a severe log for it, but we still want to know as
it's not expected. This is separate to more general "ClientError"
exceptions, which could mean anything.

There weren't any tests to cover missing S3 objects, so I've added
one. I don't think we need a test for ClientErrors:

- If there was no handler, the task would fail and we'd learn about
it that way.

- The scope of the calling task is now much smaller, so it matters
less than it used to [2].

[1]: 81a79e56ce/app/letters/utils.py (L52)
[2]: f965322f25
2021-12-20 12:45:48 +00:00
Ben Thorner
76da31c32a Remove redundant conditional for letter branding
This is no longer used when creating a service [1]. It was likely
added at a migration point when Admin _did_ specify branding.

[1]: 50c3c3e10c/app/main/views/add_service.py (L15-L22)
2021-12-16 17:54:33 +00:00
Pea Tyczynska
6c04deaec2 Get rid of unnecessary coalesce
2021-12-14 17:36:03 +00:00
Leo Hemsted
81a79e56ce Merge pull request #3397 from alphagov/delete-per-service
run the notification delete task per service
2021-12-14 15:37:22 +00:00
Leo Hemsted
228d72dc8f update log messages in delete task.
less prose, clearer output. (hopefully)
2021-12-14 15:24:35 +00:00
Leo Hemsted
0dc0e184b9 clean up and rewrite notification_dao_delete_notifications
a bunch of these tests are now covered in the task test, so got rid of
some. Now that the "how long ago to delete" question is asked in the
task rather than in the dao, and only one service is looked at at a
time, we don't need to worry about data retention, etc. Hopefully made
the tests simpler - there may still be some duplicates or overlaps
between the various cases.
2021-12-14 15:24:35 +00:00
Leo Hemsted
49cc1b643f split delete task up into per service
we really don't gain anything by running each service delete in sequence
- we get the services, and then just loop through them deleting per
service. By deleting per service in separate tasks, we can take
advantage of parallelism. the only thing we lose is some log lines but I
don't think we're that interested in them.

only set query limit at the move_notifications dao function - the task
doesn't really care about the technical implementation of how it deletes
the notifications
2021-12-14 15:24:34 +00:00
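The fan-out described in 49cc1b643f can be sketched as follows. A thread pool stands in for the task queue here, and both function names are hypothetical; the real change dispatches one task per service rather than using threads.

```python
from concurrent.futures import ThreadPoolExecutor


def delete_notifications_for_service(service_id):
    """Hypothetical per-service unit of work; returns the id it handled."""
    # A real task would delete this one service's old notifications here.
    return service_id


def delete_notifications_for_all_services(service_ids, max_workers=4):
    """Run one independent delete per service instead of looping in sequence."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() returns results in input order even though the work overlaps.
        return list(executor.map(delete_notifications_for_service, service_ids))
```

Because each unit only knows about one service, one slow service no longer holds up the deletes queued behind it.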
Leo Hemsted
bbc68293bb Merge pull request #3400 from alphagov/lxml-bump
bump lxml to fix security warning
2021-12-14 14:53:32 +00:00
Ben Thorner
7dd3e1fa87 Merge pull request #3399 from alphagov/timeout-stats-180344294
Add new metrics for slow / unknown delivery
2021-12-14 14:02:34 +00:00
Leo Hemsted
d916b07e80 remove old unused scripts
common_functions is full of AWS commands to manipulate workers running
on ec2 instances. We haven't done any of that for years since we moved
to PaaS

delete_sqs_queues contains scripts to get a list of sqs queues and put
their details in a csv, or take a details csv and then delete all those
queues.

it's not clear what the use-case was for it but no-one's used it for
years and we can just use the admin console if we really need to.
2021-12-14 14:02:28 +00:00
Leo Hemsted
b7c1fcb66d bump lxml to fix security warning
two vulnerabilities in <4.6.5 (GHSL-2021-1037 and GHSL-2021-1038)
https://github.com/lxml/lxml/blob/master/CHANGES.txt

also removes docopt as we don't use it except for a dev script (which we
might not need anyway)
2021-12-14 13:47:38 +00:00
Ben Thorner
c8cf057eba Record providers we time out notifications for
This will help us monitor issues with delivery receipts and keep
track of provider performance over time.

I'm not concerned about performance here:

- The number of notifications to time out is usually small.
- This task only runs once a day.
- Calls to StatsD are quick and cheap.
2021-12-14 13:04:39 +00:00
Ben Thorner
11278c47f5 Replace log with StatsD gauge for slow delivery
A gauge is more useful as we can visualise it and combine it with
other stats - we already have other stats for the total number of
notifications sent by provider, and we can extrapolate the number
of slow notifications using this, if needed.

We also still have logs to say the task is running, as well as a
log in the calling code when we actually make a switch [1], so
we're not losing anything by removing the log here.

[1]: a9306c4557/app/celery/scheduled_tasks.py (L117)
2021-12-14 13:03:43 +00:00
Ben Thorner
a9306c4557 Merge pull request #3398 from alphagov/infinity-timeout-180344153
Scale timeout task to work on arbitrary volumes
2021-12-14 11:21:26 +00:00
Ben Thorner
c1f0c24d82 Trim down tests for DAO timeout function a bit
The first test is enough to cover that "created" and "delivered"
notifications aren't affected by this function.
2021-12-13 17:17:41 +00:00
Ben Thorner
87cd40d00a Scale timeout task to work on arbitrary volumes
Previously this was limited to 500K notifications. While we don't
expect to reach this limit, it's not impossible e.g. if we had a
repeat of the incident where one of our providers stopped sending
us status updates. Although that's not great, it's worse if our
code can't cope with the unexpectedly high volume.

This reuses the technique we have elsewhere [1] to keep processing
in batches until there's nothing left. Specifying a cutoff point
means the total amount of work to do can't keep growing.

[1]: 2fb432adaf/app/dao/notifications_dao.py (L441)
2021-12-13 17:14:28 +00:00
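The batching technique described in 87cd40d00a can be sketched with an in-memory list standing in for the table. The function names, the `sending` and `temporary-failure` status strings, the 3 day cutoff and the batch size are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def timeout_batch(notifications, cutoff, batch_size):
    """Time out at most batch_size 'sending' notifications created before cutoff."""
    stale = [
        n for n in notifications
        if n["status"] == "sending" and n["created_at"] < cutoff
    ]
    batch = stale[:batch_size]
    for n in batch:
        n["status"] = "temporary-failure"
    return len(batch)


def timeout_all(notifications, cutoff, batch_size=2):
    """Keep processing batches until a batch comes back empty."""
    total = 0
    while True:
        updated = timeout_batch(notifications, cutoff, batch_size)
        if updated == 0:
            return total
        total += updated


now = datetime(2021, 12, 13, 17, 0)
cutoff = now - timedelta(days=3)
old = now - timedelta(days=4)

notifications = [{"status": "sending", "created_at": old} for _ in range(5)]
notifications.append({"status": "sending", "created_at": now})    # too recent
notifications.append({"status": "delivered", "created_at": old})  # wrong status

total_timed_out = timeout_all(notifications, cutoff, batch_size=2)
```

Fixing the cutoff before the loop starts is what bounds the work: notifications created while the loop runs are newer than the cutoff, so the loop cannot chase a growing backlog.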