Make a decorator that pings cronitor before and after each task run.

Designed for use with nightly tasks, so we have visibility if they
fail. We have a bunch of cronitor monitors set up, each identified by
a 5 character key. The key goes into a URL that we make a GET request
to, with a self-explanatory URL path (run/fail/complete).

The cronitor URLs are defined in the credentials repo as a dictionary
of celery task names to URL slugs. If the name passed in to the
decorator isn't in that dict, it won't run.

To use it, all you need to do is call `@cronitor(my_task_name)`
instead of `@notify_celery.task`, and make sure that the task name and
the matching slug are included in the credentials repo (or locally,
JSON dumped and stored in the CRONITOR_KEYS environment variable).
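
A minimal sketch of how such a decorator could look, not the actual
implementation. It assumes that "it won't run" means the pings are
skipped while the task itself still executes, and that the URL has the
shape `https://cronitor.link/<key>/<command>`. `load_keys`, `http_ping`
and the `ping`/`keys` parameters are hypothetical names added for
illustration, and registering the function as a celery task is left out.

```python
import functools
import json
import os
import urllib.request

# Assumption: illustrative URL shape. The commit only says the key goes
# into a URL with a run/fail/complete path.
CRONITOR_URL = "https://cronitor.link/{key}/{command}"


def load_keys():
    """Read the task name -> key mapping from the CRONITOR_KEYS env var."""
    return json.loads(os.environ.get("CRONITOR_KEYS", "{}"))


def http_ping(key, command):
    """Send a GET to cronitor; a monitoring failure must not break the task."""
    url = CRONITOR_URL.format(key=key, command=command)
    try:
        urllib.request.urlopen(url, timeout=5)
    except OSError:
        pass


def cronitor(task_name, ping=http_ping, keys=None):
    """Ping 'run' before the task and 'complete' or 'fail' after it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            task_keys = load_keys() if keys is None else keys
            key = task_keys.get(task_name)
            if key is None:
                # Unknown task name: skip the pings, still run the task.
                return func(*args, **kwargs)
            ping(key, "run")
            try:
                result = func(*args, **kwargs)
            except Exception:
                ping(key, "fail")
                raise
            ping(key, "complete")
            return result
        return wrapper
    return decorator
```

Passing `ping` explicitly makes it possible to check the
run/complete/fail sequence without making any HTTP requests.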

This addresses some problems that existed in the previous approach:

1. There was a race condition between checking for the existence of
   the .pid files and actually reading them.
2. If for some reason the .pid file was left behind after a process had
   died, the script would never know, because we discard the exit
   status of kill:

       kill -s ${1} ${APP_PID} || true

Currently, the admin app requests service statistics (with notification
counts grouped by status) and template statistics (with counts by
template) in order to display the service dashboard.

Service statistics are gathered from the FactNotificationStatus table
(counts for the last 7 days) combined with the Notification table
(counts for today).

Template statistics are currently gathered from redis cache, which
contains a separate counter per template per day. It's hard for us
to maintain consistency between redis and DB counts: currently the
cache doesn't update the count for cancelled letters, counter resets
in the middle of the day might produce a wrong result for the rest of
the week, and a cleared redis cache can't be repopulated for services
with low data retention periods.

Since FactNotificationStatus already contains separate counts for
each template_id we can use the existing logic with some additional
filters to get separate counts for each template and status combination,
which would allow us to populate the service dashboard page from one
query response.
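
A rough sketch of the shape such a query could take, using an in-memory
SQLite database as a stand-in for Postgres. The table and column names
below are a simplified, hypothetical schema, and
`template_status_counts` is an illustrative name rather than an
existing function.

```python
import sqlite3

# Hypothetical, simplified schema: the real tables have more columns.
SCHEMA = """
CREATE TABLE ft_notification_status (
    bst_date TEXT, service_id TEXT, template_id TEXT,
    status TEXT, notification_count INTEGER
);
CREATE TABLE notifications (
    service_id TEXT, template_id TEXT, status TEXT, created_at TEXT
);
"""

# Historical counts come from the fact table, today's counts from the
# notifications table; both are summed per (template_id, status).
QUERY = """
SELECT template_id, status, SUM(n) AS total
FROM (
    SELECT template_id, status, notification_count AS n
    FROM ft_notification_status
    WHERE service_id = :service_id
      AND bst_date >= :start_date AND bst_date < :today
    UNION ALL
    SELECT template_id, status, COUNT(*) AS n
    FROM notifications
    WHERE service_id = :service_id AND created_at >= :today
    GROUP BY template_id, status
)
GROUP BY template_id, status
ORDER BY template_id, status
"""


def template_status_counts(conn, service_id, start_date, today):
    """Return (template_id, status, total) rows for one service."""
    params = {
        "service_id": service_id,
        "start_date": start_date,
        "today": today,
    }
    return conn.execute(QUERY, params).fetchall()
```

Each row carries both the template and the status, so the same result
can be grouped by status for the service statistics and by template
for the template statistics.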

If a precompiled letter can't be opened (e.g. because it isn't a valid
PDF) we were setting its billable units to 0, but not moving it to the
invalid PDF bucket. If a precompiled letter failed sanitisation, we were
moving it to the invalid PDF bucket but not setting its billable units
to 0.

This commit makes sure that we always set the billable units to 0
and move the PDF to the right bucket if it fails sanitisation or can't be
opened.
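
The idea of sending both failure cases down one path can be sketched
as below. Everything here is hypothetical: the `Letter` class, the
bucket names, and the assumptions that an unreadable PDF raises
`ValueError` while failed sanitisation returns `None`.

```python
from dataclasses import dataclass

INVALID_PDF_BUCKET = "invalid-pdf"  # hypothetical bucket name
VALID_PDF_BUCKET = "letters-pdf"    # hypothetical bucket name


@dataclass
class Letter:
    billable_units: int = 0
    bucket: str = "scan"


def process_precompiled_letter(letter, pdf, count_billable_units, sanitise):
    """Return the sanitised PDF, or None if the letter was rejected."""
    try:
        # Assumption: raises ValueError when the PDF can't be opened.
        units = count_billable_units(pdf)
        # Assumption: returns None when sanitisation fails.
        sanitised = sanitise(pdf)
    except ValueError:
        sanitised = None

    if sanitised is None:
        # Single failure path: both cases get the same treatment.
        letter.billable_units = 0
        letter.bucket = INVALID_PDF_BUCKET
        return None

    letter.billable_units = units
    letter.bucket = VALID_PDF_BUCKET
    return sanitised
```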

A recent issue with a long-running query (#2288) highlighted the
fact that even though the original HTTP connection might be closed
(for example after the gorouter timeout of 15 minutes, which returns a
504 response to the client), the request worker will not be stopped.
This means that the worker is spending time and potentially DB
resources generating a response that will never be delivered.

Gunicorn's timeout setting only applies to sync workers and there
doesn't seem to be an option to interrupt individual requests in
gevent/eventlet deployments.

Since the most likely (and potentially most dangerous) scenario for
this is a long-running DB query, we can set a statement timeout on
our DB connections. This will raise a sqlalchemy.exc.OperationalError
(wrapping psycopg2.extensions.QueryCanceledError), interrupting the
request after the given timeout has been reached.

This is a Postgres setting applied to each client connection, so the
database itself will abort the statement when it reaches the set
timeout.

Since this will also apply to our celery tasks (including potentially
long-running nightly tasks) we set a timeout of 20 minutes to begin
with.

This can potentially be split in the future to set a different value
for each app, so that we could limit API requests even more.
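
One way to pass a statement timeout to Postgres is through the
connection `options` string, sketched below. The helper name and the
constant are hypothetical, and the commented usage assumes SQLAlchemy
with the psycopg2 driver; the actual change may wire this up
differently.

```python
# 20 minutes, matching the initial value described above.
STATEMENT_TIMEOUT_SECONDS = 20 * 60


def statement_timeout_connect_args(timeout_seconds):
    """Build connect_args that set statement_timeout on each connection."""
    # Postgres reads a bare statement_timeout number as milliseconds.
    timeout_ms = int(timeout_seconds * 1000)
    return {"options": "-c statement_timeout={}".format(timeout_ms)}


connect_args = statement_timeout_connect_args(STATEMENT_TIMEOUT_SECONDS)

# Hypothetical usage, assuming SQLAlchemy with the psycopg2 driver:
#   from sqlalchemy import create_engine
#   engine = create_engine(database_url, connect_args=connect_args)
```

Keeping the value in seconds in one place would make it simple to
supply a different timeout per app later on.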