Setting `STATSD_HOST` for an env variable allows us to switch to a
local statsd_exporter on a per-app basis.
This also changes `STATSD_ENABLED` to be on when `STATSD_HOST` is set,
avoiding the need to set it separately.
Similar to MMG, there's a new env variable FIRETEXT_URL that can be
used to override the Firetext api URL.
This will be used to stub out both providers during the load test or
can be used to run a local API against a fake provider endpoint.
antivirus is sometimes tough to get running locally - now in dev
antivirus is skipped unless `ANTIVIRUS_ENABLED=1` is set on the command
line. on all other environments it is always enabled.
make a decorator that pings cronitor before and after each task run.
Designed for use with nightly tasks, so we have visibility if they
fail. We have a bunch of cronitor monitors set up - 5 character keys
that go into a URL that we then make a GET to with a self-explanatory
url path (run/fail/complete).
the cronitor URLs are defined in the credentials repo as a dictionary
of celery task names to URL slugs. If the name passed in to the
decorator isn't in that dict, it won't run.
to use it, all you need to do is call `@cronitor(my_task_name)`
instead of `@notify_celery.task`, and make sure that the task name and
the matching slug are included in the credentials repo (or locally,
json dumped and stored in the CRONITOR_KEYS environment variable)
A recent issue with a long-running query (#2288) highlighted the
fact that even though the original HTTP connection might be closed
(for example after gorouter timeout of 15 minutes, which returns a
504 response to the client), the request worker will not be stopped.
This means that the worker is spending time and potentially DB
resources generating a response that will never be delivered.
Gunicorn's timeout setting only applies to sync workers and there
doesn't seem to be an option to interrupt individual requests in
gevent/eventlet deployments.
Since the most likely (and potentially most dangerous) scenario for
this is a long-running DB query, we can set a statement timeout on
our DB connections. This will raise a sqlalchemy.exc.OperationalError
(wrapping psycopg2.extensions.QueryCanceledError), interrupting the
request after the given timeout has been reached.
This is a Postgres client setting, so the database itself will abort
the transaction when it reaches the set timeout.
Since this will also apply to our celery tasks (including potentially
long-running nightly tasks) we set a timeout of 20 minutes to begin
with.
This can potentially be split in the future to set a different value
for each app, so that we could limit API requests even more.
We don't use FUNCTIONAL_TEST_PROVIDER_SERVICE_ID or
UNCTIONAL_TEST_PROVIDER_SMS_TEMPLATE_ID anymore so we can safely
delete them from config and tests.
We don't seem to use recorded queries or modification tracking anywhere
in the app, and both features potentially increase memory usage.
This removes deprecated SQLALCHEMY_COMMIT_ON_TEARDOWN options. It's
been removed from the docs and the default matches the value we set
anyway.
Bumped notifications-utils to 3.7.0. Version 3.7.0 includes the
`convert_utc_to_bst` and `convert_bst_to_utc` functions and the
`LETTER_PROCESSING_DEADLINE` constant, so these have been removed from
this repo and anywhere using these has now been updated to get these
from `notifications-utils`.
Also bumped pytest by a patch version to bring in a bug fix.
When we first built letters you could only send them via a CSV upload, initially we needed a way to send those files to dvla per job.
We since stopped using this page. So let's delete it!
This was done so when notification is timed out from sending/pending
to temporary_failure, this change has to always be caught
in the ft_notification_status
Timeout sending notifications updates up to 4 days of notifications
older than 72 hours to correct failure status. It needs to run before
we update ft_notifications_status table. Otherwise the changes
don't get picked up. Notifications deletion tasks have to run
after those jobs in case our users set short data retention
policy.
- pass new, sanitised pdf for sending
- move invalid pdfs to a newly created bucket
- set status fro notifications that failed pdf validation to a new status validation-failed
- adjust existing tests
previously, we were confusing things by appending to CELERY_QUEUES in
both dev and test configs - these are executed at import time, so the
list contained all queues twice, regardless of what config you're
actually using.
Fortunately, the -Q command that we supply the workers with overrides
this config option, so other environments weren't affected. Given that,
we can tidy up this code by just declaring it in the base config every
time
There was a datetime bug in the query which resulted in files not being sent to the postal provider.
The trigger-letter-pdfs-for-day task is no longer needed, so rather than fix the query just call collate_letter_pdfs_for_day directly.
Less code is always better.
Deployment considerations: I realized this is strictly not backwards compatible if the scheduled job is in progress and a task is on the queue that no longer exists. This is ok since we will deploy this well before 17:50.
This header was introduced to ensure that no traffic was being
directed straight to the .cloudapps.digital domain. This was especially
useful for the non-production environments where access to the proper
domains is allowed for specific IP addresses while the cloudapps.digital
ones are open.
Moving to paas custom domains [1] will allow us to stop using the paas
proxies and, as a result, unbind the cloudapps.digital domain from our
apps.
This means that the X-Custom-Forwarder will become obsolete since all
our requests will be coming directly to our domain (albeit through
cloudfront) so any IP restriction can be implemented with a route
service [2].
1: https://docs.cloud.service.gov.uk/deploying_services.html#set-up-a-custom-domain-using-the-cdn-route-service
2: https://docs.cloud.service.gov.uk/deploying_services.html#route-services
Admin, API and utils were all defining a value for SMS_CHAR_COUNT_LIMIT.
This value has been updated in notifications-utils to allow text
messages to be 4 fragments long and notifications-api now gets the value of
SMS_CHAR_COUNT_LIMIT from notifications-utils instead of defining it in
config.
Also updated some tests to check for the higher limit.
Allows uploading documents to the Document Download API.
The client is configured with an API host and auth token. There's
no need for a flag to disable the client in the test environments
at the moment since the upload is only triggered by a specific
payload which would only be sent with an explicit goal of using
document download.
We've run into issues with redis expiring keys while we try and write
to them - short lived redis TTLs aren't really sustainable for keys
where we mutate the state. Template usage is a hash contained in redis
where we increment a count keyed by template_id each time a message is
sent for that template. But if the key expires, hincrby (redis command
for incrementing a value in a hash) will re-create an empty hash.
This is no good, as we need the hash to be populated with the last
seven days worth of data, which we then increment further. We can't
tell whether the hincrby created the key, so a different approach
entirely was needed:
* New redis key: <service_id>-template-usage-<YYYY-MM-DD>. Note: This
YYYY-MM-DD is BTC time so it lines up nicely with ft_billing table
* Incremented to from process_notification - if it doesn't exist yet,
it'll be created then.
* Expiry set to 8 days every time it's incremented to.
Then, at read time, we'll just read the last eight days of keys from
Redis, and sum them up. This works because we're only ever incrementing
from that one place - never setting wholesale, never recreating the
data from scratch. So we know that if the data is in redis, then it is
good and accurate data.
One thing we *don't* know and *cannot* reason about is what no key in
redis means. It could be either of:
* This is the first message that the service has sent today.
* The key was deleted from redis for some reason.
Since we set the TTL to so long, we'll never be writing to a key that
previously expired. But if there is a redis (or operator) error and the
key is deleted, then we'll have bad data - after any data loss we'll
have to rebuild the data.