Once a contact list is gone from the database there's no way to
reference it again; any jobs that used it have already made their own copy.
So we can clean it up, meaning we're not storing personal data for longer
than we need to.
Instead of saving the email notification to the db immediately, add it to a queue to be saved later.
This is an attempt to alleviate pressure on the db from the api requests.
This initial PR is a trial to see if we get an improvement in api performance and a reduction in queue pool errors. If we are happy with this we could remove the hard-coding of the service id.
In a nutshell:
- If POST /v2/notification/email is from our high volume service (hard-coded for now), create the notification and put it on a queue so it can be persisted to the db later.
- create a save_api_email task to persist the notification (see the sketch after this list)
- return the notification
- New worker app to process the save_api_email tasks.
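A minimal sketch of the worker side of this, to show the shape of the idea rather than the actual implementation - the Celery app, broker and `persist_notification` helper below are assumptions; only the `save_api_email` task name comes from this PR:

```python
# Sketch only: the real task lives in the new worker app. persist_notification
# and the broker URL are assumptions; save_api_email is the task added here.
from celery import Celery

celery_app = Celery("notify-delivery-worker-save-api-email", broker="sqs://")


def persist_notification(notification_data):
    """Hypothetical stand-in for the db write the real task performs."""
    ...


@celery_app.task(bind=True, name="save-api-email", max_retries=5)
def save_api_email(self, notification_data):
    try:
        persist_notification(notification_data)
    except Exception as exc:
        # if the db is under pressure, retry later instead of dropping the email
        raise self.retry(exc=exc, countdown=60)
```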
We moved from sending statsd metrics to Hosted Graphite to sending them to
a statsd instance running on the PaaS. Therefore we no longer need to send
statsd metrics with a particular prefix, as the statsd app is only
receiving metrics from our apps (not from other users, as would have been
the case with HostedGraphite).
This should not change any behaviour, as the only place the environment
variable was being used was in the gunicorn config, and it was an empty
string, which is the default behaviour anyway as per:
https://docs.gunicorn.org/en/stable/settings.html#statsd-prefix
This will allow us to accept two different secrets and therefore allow us
to rotate the secret that the admin client is sending to the API
Due to how the notifications-python-client throws exceptions, we run
into exactly the same issue with not being able to distinguish if a
`TokenDecodeError` is thrown because the token was encrypted with a
different secret key or because there was a different error when
decoding. I've copied the TODO from `requires_auth` as this is exactly
the same issue.
I've also added a test case that was missing, for an out-of-date admin
token (old IAT).
And support this change across our code. Note, this is a halfway step:
the secret is now a list rather than a string, but it still only supports
a single secret, ie one item in the list.
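A rough sketch of the pattern this is working towards, assuming a PyJWT-style decode - the function and variable names are illustrative, not the API's actual code:

```python
# Illustrative only: try each configured admin client secret in turn, so a
# token signed with either the old or the new secret is accepted during rotation.
import jwt


def decode_admin_token(token, admin_client_secrets):
    last_error = None
    for secret in admin_client_secrets:  # eg ["new-secret", "old-secret"]
        try:
            return jwt.decode(token, secret, algorithms=["HS256"])
        except jwt.InvalidSignatureError as error:
            # signed with a different secret - try the next one
            last_error = error
    # none of the secrets matched (or the token is otherwise invalid)
    raise last_error
```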
These alerts are sent to our postal provider, and they usually arrive as the provider is getting ready to go home for the day or the weekend,
which means they get missed/overlooked. The provider has agreed to get the alert an hour earlier; perhaps that will improve the response time.
Added the queue and task names for the new template preview task to the
config. Also added the new bucket name that template preview will use
for the sanitised letters to the config for all environments.
we generally aim to share the load between the two providers equally
(more or less). When one provider has struggled, we deprioritise them;
this commit adds a function that gradually restores the balance. It runs
every five minutes, and if it's been more than an hour since the providers
were last changed it adjusts them back towards an even split. Except
it's not quite 50/50 due to #reasons (we want to slightly favour MMG):
it's actually 60/40. That's defined in a new dict in config.py.
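A simplified sketch of the adjustment logic, with hypothetical names - the real version reads and writes provider priorities in the db, and the resting point lives in the new dict in config.py:

```python
# Sketch only: once priorities have been stable for an hour, nudge them back
# towards the configured resting point (slightly favouring MMG).
from datetime import datetime, timedelta

SMS_PROVIDER_RESTING_POINTS = {"mmg": 60, "firetext": 40}  # the new config.py dict


def restore_provider_balance(priorities, last_updated_at, step=10, now=None):
    """priorities: dict of provider name -> current priority (summing to 100)."""
    now = now or datetime.utcnow()
    if now - last_updated_at < timedelta(hours=1):
        return priorities  # changed too recently - leave the balance alone

    adjusted = {}
    for name, priority in priorities.items():
        target = SMS_PROVIDER_RESTING_POINTS[name]
        if priority < target:
            adjusted[name] = min(priority + step, target)
        elif priority > target:
            adjusted[name] = max(priority - step, target)
        else:
            adjusted[name] = priority
    return adjusted
```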
these URLs never change, and it led to surprising issues where an
updated default MMG_URL wasn't actually respected on PaaS. These URLs
aren't private and don't need to be stored in credentials.
By not defining them in the manifest, we expect them to use the default
unless `cf set-env` has been specifically used to modify them in an app.
we don't use it since we wrote our own provider stubs for performance
tests.
this removes it from the api - it's still in the DB and will be
retrieved by queries, but is set to disabled on prod
the nightly tasks need to run after the create nightly notification
status task - so that test notifications are still there to record
stats for, and to stop the risk of deleting notifications part-way
through recording stats for them.
There's a bug in pysftp that appears to cause quadratic performance loss. See https://github.com/paramiko/paramiko/issues/1141 for more details.
As a temporary band-aid fix, lower the size of the files we're sending.
* The `_should_record_notification_in_history_table` function stopped being
used in this commit: c23ae15f32
* `NOTIFICATIONS_ALERT` stopped being used in this commit: 5aa37f09b6
we build up one personalisation dict, and then pass it into all the
different templates - so be careful editing things. also of note, we
check if agreement_signed_on_behalf_of is set, and if so send a different
template with slightly different wording to the person who clicked the
confirm button.
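Roughly how that fits together, as a sketch with assumed template ids and an injected send function rather than the real code:

```python
# Sketch only: one personalisation dict feeds every template, and the
# on-behalf-of case sends a differently worded template to the person who
# clicked confirm. Template ids and the send_notification callable are assumptions.
AGREEMENT_SIGNED_TEMPLATE_ID = "agreement-signed"                      # hypothetical
AGREEMENT_SIGNED_ON_BEHALF_OF_TEMPLATE_ID = "agreement-on-behalf-of"   # hypothetical


def send_agreement_confirmation(send_notification, service_name, signed_by_name,
                                agreement_signed_on_behalf_of=None):
    # one dict, reused by every template that gets sent - edit with care
    personalisation = {
        "service_name": service_name,
        "signed_by_name": signed_by_name,
        "on_behalf_of_name": agreement_signed_on_behalf_of or "",
    }

    if agreement_signed_on_behalf_of:
        template_id = AGREEMENT_SIGNED_ON_BEHALF_OF_TEMPLATE_ID
    else:
        template_id = AGREEMENT_SIGNED_TEMPLATE_ID

    send_notification(template_id=template_id, personalisation=personalisation)
```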
Added a scheduled task to run once a day and check if there were any
letters from before 17.30 that still have a status of 'created'. This
logs an exception instead of trying to fix the error because the fix
will be different depending on which bucket the letter is in.
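As a sketch of the kind of check the task does - the helper name and the way letters are fetched are assumptions; only the 17.30 cut-off and the 'created' status come from this change:

```python
# Sketch only: find letters created before today's 17.30 cut-off that are still
# in the 'created' state, and log rather than attempt a fix.
import logging
from datetime import datetime, time

logger = logging.getLogger(__name__)


def check_for_letters_still_in_created(letters, now=None):
    """letters: iterable of objects with .created_at and .status attributes."""
    now = now or datetime.utcnow()
    cutoff = datetime.combine(now.date(), time(17, 30))

    stuck = [x for x in letters if x.status == "created" and x.created_at < cutoff]
    if stuck:
        # the right fix depends on which bucket each letter is in, so just
        # raise the alarm for a human to investigate (the real task logs an exception)
        logger.error("%s letters created before 17.30 are still in 'created'", len(stuck))
```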
Added a task which runs twice a day on weekdays and checks for letters that have
been in the state of `pending-virus-check` for over 90 minutes. This is
just logging an exception for now, not trying to fix things, since we
will need to manually check where the issue was.
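For illustration, a twice-a-day weekday schedule for this sweep might look something like the following - the schedule key, task name and exact times are assumptions; the 90 minute threshold is from this change:

```python
# Sketch only: run the pending-virus-check sweep twice a day on weekdays and
# flag anything that has been in that state for more than 90 minutes.
from datetime import timedelta

from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    "check-for-letters-pending-virus-check": {
        "task": "check-for-letters-pending-virus-check",
        "schedule": crontab(hour="9,15", minute=0, day_of_week="mon-fri"),
    },
}

PENDING_VIRUS_CHECK_THRESHOLD = timedelta(minutes=90)
```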
Setting `STATSD_HOST` as an env variable allows us to switch to a
local statsd_exporter on a per-app basis.
This also changes `STATSD_ENABLED` to be on when `STATSD_HOST` is set,
avoiding the need to set it separately.
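In config terms this is roughly the following - a sketch of the pattern, not the exact settings module:

```python
# Sketch only: STATSD_ENABLED is derived from STATSD_HOST rather than being a
# separate switch, so pointing an app at a local statsd_exporter is one variable.
import os

STATSD_HOST = os.environ.get("STATSD_HOST")  # eg "localhost" for a local statsd_exporter
STATSD_PORT = 8125
STATSD_ENABLED = bool(STATSD_HOST)
```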
Similar to MMG, there's a new env variable FIRETEXT_URL that can be
used to override the Firetext api URL.
This will be used to stub out both providers during the load test or
can be used to run a local API against a fake provider endpoint.
antivirus is sometimes tough to get running locally - now in dev
antivirus is skipped unless `ANTIVIRUS_ENABLED=1` is set on the command
line. on all other environments it is always enabled.
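As a config sketch (the class names are assumed from a typical Flask config layout):

```python
# Sketch only: antivirus stays on everywhere except development, where it is
# opt-in via ANTIVIRUS_ENABLED=1.
import os


class Config:
    ANTIVIRUS_ENABLED = True


class Development(Config):
    ANTIVIRUS_ENABLED = os.environ.get("ANTIVIRUS_ENABLED") == "1"
```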
make a decorator that pings cronitor before and after each task run.
Designed for use with nightly tasks, so we have visibility if they
fail. We have a bunch of cronitor monitors set up - 5 character keys
that go into a URL that we then make a GET to with a self-explanatory
url path (run/fail/complete).
the cronitor URLs are defined in the credentials repo as a dictionary
of celery task names to URL slugs. If the name passed in to the
decorator isn't in that dict, the cronitor ping won't run.
to use it, all you need to do is call `@cronitor(my_task_name)`
instead of `@notify_celery.task`, and make sure that the task name and
the matching slug are included in the credentials repo (or locally,
json-dumped and stored in the CRONITOR_KEYS environment variable).
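A rough sketch of what such a decorator could look like - the real one also applies `@notify_celery.task`, which is omitted here, and everything other than the run/complete/fail paths and the CRONITOR_KEYS name is an assumption:

```python
# Sketch only: ping cronitor before and after the task body so failed or
# missing nightly runs show up. CRONITOR_KEYS maps task names to 5-character slugs.
import functools
import json
import os

import requests

CRONITOR_KEYS = json.loads(os.environ.get("CRONITOR_KEYS", "{}"))


def cronitor(task_name):
    slug = CRONITOR_KEYS.get(task_name)

    def ping(endpoint):  # endpoint is "run", "complete" or "fail"
        if slug:
            requests.get(f"https://cronitor.link/{slug}/{endpoint}", timeout=10)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            ping("run")
            try:
                result = func(*args, **kwargs)
            except Exception:
                ping("fail")
                raise
            ping("complete")
            return result

        return wrapper

    return decorator
```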
A recent issue with a long-running query (#2288) highlighted the
fact that even though the original HTTP connection might be closed
(for example after gorouter timeout of 15 minutes, which returns a
504 response to the client), the request worker will not be stopped.
This means that the worker is spending time and potentially DB
resources generating a response that will never be delivered.
Gunicorn's timeout setting only applies to sync workers and there
doesn't seem to be an option to interrupt individual requests in
gevent/eventlet deployments.
Since the most likely (and potentially most dangerous) scenario for
this is a long-running DB query, we can set a statement timeout on
our DB connections. This will raise a sqlalchemy.exc.OperationalError
(wrapping psycopg2.extensions.QueryCanceledError), interrupting the
request after the given timeout has been reached.
This is a Postgres client setting, so the database itself will abort
the transaction when it reaches the set timeout.
Since this will also apply to our celery tasks (including potentially
long-running nightly tasks) we set a timeout of 20 minutes to begin
with.
This can potentially be split in the future to set a different value
for each app, so that we could limit API requests even more.
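One way to apply it, assuming Flask-SQLAlchemy style engine options - the config key and plumbing here are illustrative; the 20 minute value is from this change:

```python
# Sketch only: ask Postgres to cancel any statement that runs longer than
# 20 minutes. The value is in milliseconds and is applied per connection.
SQLALCHEMY_ENGINE_OPTIONS = {
    "connect_args": {
        "options": "-c statement_timeout={}".format(20 * 60 * 1000),
    },
}
```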
We don't use FUNCTIONAL_TEST_PROVIDER_SERVICE_ID or
FUNCTIONAL_TEST_PROVIDER_SMS_TEMPLATE_ID anymore so we can safely
delete them from config and tests.