"if you want something done right, then do it yourself"
The ORM was building a weird and inefficient query to get all the users
for a given organisation_id:
old query:
```sql
SELECT users.*
FROM users
WHERE (
    EXISTS (
        SELECT 1
        FROM user_to_organisation, organisation
        WHERE users.id = user_to_organisation.user_id AND
              organisation.id = user_to_organisation.organisation_id AND
              organisation.id = '63b9557c-22ea-42ac-bcba-edaa50e3ae51'
    )
) AND
users.state = 'active'
ORDER BY users.created_at
```
Note the cross join on user_to_organisation and organisation. There are a
lot of users in organisations on preview, as one is added every functional
test run. Even though those rows are all filtered out and the query returns
the correct data, the database still does the nested loop under the hood.
new query:
```sql
SELECT users.*
FROM users
JOIN user_to_organisation AS user_to_organisation_1 ON users.id = user_to_organisation_1.user_id
JOIN organisation ON organisation.id = user_to_organisation_1.organisation_id
WHERE organisation.id = '63b9557c-22ea-42ac-bcba-edaa50e3ae51' AND users.state = 'active'
ORDER BY users.created_at;
```
Much more straightforward.
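One way to get the new shape out of SQLAlchemy is to spell out the joins rather than relying on the relationship's EXISTS. This is only a sketch: the model names (User, Organisation) and the user_to_organisation association table are assumptions about the schema, not the actual code.

```python
# Sketch: explicit joins instead of the relationship EXISTS.
# Assumes `db`, `User`, `Organisation` and the `user_to_organisation`
# association table come from the application's models.
users = (
    db.session.query(User)
    .join(user_to_organisation, User.id == user_to_organisation.c.user_id)
    .join(Organisation, Organisation.id == user_to_organisation.c.organisation_id)
    .filter(
        Organisation.id == organisation_id,
        User.state == 'active',
    )
    .order_by(User.created_at)
    .all()
)
```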
* it doesn't delete service email reply-to addresses, letter contacts, or
  contact lists. I don't think the contact lists will ever be an issue, but
  it doesn't hurt to add them to the list of things to remove.
* it doesn't remove users from organisations before deleting the users
  (see the sketch after this list).

There may be more tables that link to Service that should be deleted, but
for now just add the ones that I could spot.
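A hedged sketch of the extra clean-up the bullets describe. The model names (ServiceEmailReplyTo, ServiceLetterContact, ServiceContactList) and the user_to_organisation table are hypothetical; the real delete function and its ordering may differ.

```python
def remove_service_related_rows(service):
    # delete rows that reference the service but were previously left behind
    ServiceEmailReplyTo.query.filter_by(service_id=service.id).delete()
    ServiceLetterContact.query.filter_by(service_id=service.id).delete()
    ServiceContactList.query.filter_by(service_id=service.id).delete()

    # detach the service's users from their organisations before the users
    # themselves are deleted
    user_ids = [user.id for user in service.users]
    db.session.execute(
        user_to_organisation.delete().where(
            user_to_organisation.c.user_id.in_(user_ids)
        )
    )
    db.session.commit()
```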
This task didn't have retries before, based on the assumption that if
the task failed it was likely to be due to a Boto error, so retrying
would not help because a file was probably not in the expected bucket.
During an incident with the database, we had some letters that were
stuck in the `pending-virus-check` state because this task failed.
This change adds retries to the task when the exception raised is not a
`BotoClientError`.
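A minimal sketch of that retry behaviour as a Celery task. The task name, retry settings and the `move_scanned_letter` helper are illustrative assumptions, not the real implementation.

```python
from boto.exception import BotoClientError


@notify_celery.task(bind=True, name="process-virus-scan", max_retries=5, default_retry_delay=300)
def process_virus_scan(self, filename):
    try:
        move_scanned_letter(filename)  # hypothetical helper doing the real work
    except BotoClientError:
        # a Boto error probably means the file isn't in the expected bucket,
        # so retrying won't help
        raise
    except Exception as e:
        # anything else (e.g. a database incident) is worth retrying
        self.retry(exc=e)
```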
SQS fails if the message body is over 256KB. Normally our messages are
quite small, but if we're using the new save-api-email task with an
email that has a large body, we can go over that limit. If so, handle
the exception and fall back to the existing code path (saving to the
database and sending a deliver-email task, which only has a notification
id).
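A hedged sketch of the fallback. It assumes the broker error for an oversize message surfaces as botocore's ClientError, and the task names, helper names and queue names are illustrative only.

```python
from botocore.exceptions import ClientError

try:
    # new path: put the whole serialised notification on the queue
    save_api_email.apply_async([serialised_notification], queue="save-api-email-tasks")
except ClientError:
    # the message body was over SQS's 256KB limit, so fall back to the old
    # path: save to the database now and send a deliver-email task, which
    # only carries the notification id
    notification = persist_notification(**notification_data)
    deliver_email.apply_async([str(notification.id)], queue="send-email-tasks")
```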
It is possible a service has a data retention period that is shorter than the time it takes to get a delivery receipt.
This PR refactors process_ses_receipt to update NotificationHistory if the Notification has already been purged.
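A minimal sketch of the idea, assuming Flask-SQLAlchemy models named Notification and NotificationHistory looked up by the SES reference; the column names are assumptions.

```python
from datetime import datetime

notification = Notification.query.filter_by(reference=reference).one_or_none()
if notification is not None:
    notification.status = new_status
else:
    # the service's data retention has already purged the row from
    # Notification, so record the receipt against NotificationHistory instead
    NotificationHistory.query.filter_by(reference=reference).update(
        {"status": new_status, "updated_at": datetime.utcnow()}
    )
db.session.commit()
```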
So that we keep a record of who first uploaded a list, it's better to
archive a list than to completely delete it.
The list in the database doesn’t contain any recipient info so this
isn’t a change to what data we’re retaining.
This means updating the endpoints that get contact lists to exclude ones
that are archived.
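A sketch of what the endpoint query change might look like, assuming a ServiceContactList model with an `archived` column; the names are assumptions.

```python
contact_lists = (
    ServiceContactList.query
    .filter_by(service_id=service_id, archived=False)  # exclude archived lists
    .order_by(ServiceContactList.created_at.desc())
    .all()
)
```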
Once a contact list is gone from the database there’s no way to
reference it again. Any jobs that used it have made their own copy.
So we can clean it up, meaning we’re not storing personal data longer
than we need to.
This was one of the things we de-scoped when we first shipped this feature.
In order to safely delete a list, we first need to make sure any jobs
aren’t referencing it.
Instead of saving the email notification to the db, add it to a queue to save later.
This is an attempt to alleviate pressure on the db from the API requests.
This initial PR is to trial it and see if we get an improvement in API performance and a reduction in queue pool errors. If we are happy with this we could remove the hard coding of the service id.
In a nutshell (see the sketch after this list):
- If POST /v2/notification/email is from our high volume service (hard coded for now), then create a notification and send it to a queue to persist it to the db.
- create a save_api_email task to persist the notification
- return the notification
- New worker app to process the save_api_email tasks.
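A hedged sketch of that flow. `notify_celery`, `encryption`, `persist_notification`, the queue name and `HIGH_VOLUME_SERVICE_ID` are illustrative assumptions, not the real implementation.

```python
HIGH_VOLUME_SERVICE_ID = "<hard-coded-service-id>"  # hard coded for now


@notify_celery.task(name="save-api-email")
def save_api_email(encrypted_notification):
    # worker side: decrypt the notification the API queued and persist it
    notification = encryption.decrypt(encrypted_notification)
    persist_notification(**notification)


# API side, inside the POST /v2/notification/email view:
if str(authenticated_service.id) == HIGH_VOLUME_SERVICE_ID:
    save_api_email.apply_async(
        [encryption.encrypt(notification_data)], queue="save-api-email-tasks"
    )
    return jsonify(notification_response), 201
```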
- fix test name
- make query filters consistent with each other
- add comment for clarity
- add inner loop to continue to insert and delete notifications while the delete count is greater than 0
The insert_notification_history_delete_notifications function uses a temp table to store the data to insert and delete. This saves extra queries while performing the insert and delete operations.
The function is written in such a way that if the task is stopped while processing, then when it's started up again it will just pick up where it left off.
I've made a decision to delete all test data in one query; I don't anticipate a problem with that.
The performance of this might also be better than last night's test because we are inserting everything we need for the NotificationHistory insert, so we don't need the join to Notifications to perform the insert.
We need to test the performance of this function, then we can call it from the task.
- Create a temporary table and insert the ids of the desired rows, limited to 10K (might be too low).
- Insert into NotificationHistory, selecting from Notifications where the id is in the temp table.
- Delete from Notifications where the id is in the temp table.
- Drop the temp table.
We should be able to iterate on this. The query stats for the query to create the temp table are very good: 17ms.
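Expressed as raw SQL through SQLAlchemy, the steps above might look roughly like this. The temp table name, batch size, parameters and abbreviated column list are all assumptions, not the real function.

```python
from sqlalchemy import text

# 1. stash the ids of the rows to move, capped at 10K for this batch
db.session.execute(
    text("""
        CREATE TEMP TABLE notifications_to_archive AS
        SELECT id FROM notifications
        WHERE service_id = :service_id AND created_at < :older_than
        ORDER BY created_at
        LIMIT :batch_size
    """),
    {"service_id": service_id, "older_than": older_than, "batch_size": 10000},
)

# 2. copy those rows into notification_history (column list abbreviated here)
db.session.execute(
    text("""
        INSERT INTO notification_history (id, service_id, notification_type, created_at, notification_status)
        SELECT id, service_id, notification_type, created_at, notification_status
        FROM notifications
        WHERE id IN (SELECT id FROM notifications_to_archive)
        ON CONFLICT (id) DO UPDATE SET notification_status = EXCLUDED.notification_status
    """)
)

# 3. delete the same rows from notifications
db.session.execute(
    text("DELETE FROM notifications WHERE id IN (SELECT id FROM notifications_to_archive)")
)

# 4. drop the temp table and commit
db.session.execute(text("DROP TABLE notifications_to_archive"))
db.session.commit()
```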
You should always use ORDER BY when doing a LIMIT/OFFSET, to guarantee
that the second time you run that query the order hasn't changed and you
aren't just repeating the task with an overlapping set of notifications.
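In SQLAlchemy terms, that just means keeping an order_by in front of the limit/offset; the model name and filter here are assumptions.

```python
batch = (
    Notification.query
    .filter(Notification.created_at < older_than)
    .order_by(Notification.created_at)  # without this, successive LIMIT/OFFSET pages can overlap
    .limit(10000)
    .offset(offset)
    .all()
)
```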
Luckily, in this case we haven't lost any data, because:
* we have an ON CONFLICT DO UPDATE statement, so when we returned
  duplicate rows it would just do an update
* when we delete, we cross-reference with the notification history, so if
  a row got missed, we won't delete it.
This resulted in, for example, govuk email still having a handful of
notifications in the table from the 9th, despite the task running
successfully every day until the 18th of March.
Order by created_at ascending so that we start with the oldest
notifications first, in case the task is interrupted part way through.
Insert/update, and then delete, notifications in hourly batches. This
means that if the task gets interrupted part-way through, we'll have at
least something to show for it. Previously we would insert and update
into the history table but might not delete from the notification table
properly.
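A sketch of the hourly batching; the helper names and datetimes are hypothetical, and the real task signature will differ.

```python
from datetime import timedelta

hour_start = start_of_day  # hypothetical: midnight of the day being archived
while hour_start < end_of_day:
    hour_end = hour_start + timedelta(hours=1)
    # insert/update history for this hour first, then delete the same rows,
    # so an interruption still leaves every completed hour fully moved
    insert_update_notification_history(service_id, hour_start, hour_end)
    delete_notifications(service_id, hour_start, hour_end)
    hour_start = hour_end
```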
Keeping the offsets and limits for confidence around reliability, and to
guard against queries timing out.
Keeping the join to notification_history to ensure we don't delete
anything prematurely while our DB is in a bit of a weird state with lots
of these tasks failing over the last week.
This runs on the new `sms-callbacks` queue. The function
`process_sms_client_response` has been replaced with a task of the same
name. This involved some reorganisation of the existing code and tests.