We use exec to start awslogs_agent and then a tail to print logs to
stdout. The CF docs[1] recommend using exec to start processes, which
seems to imply that the container is considered up and running for as
long as those commands are running.
This commit ensures that if no celery processes are running we kill
any other processes that we have started, so that Cloud Foundry no
longer considers the container healthy and replaces it.
1: https://docs.cloudfoundry.org/devguide/deploy-apps/manifest.html#start-commands
In 4427827b2f, celery monitoring was changed from using PID files to
looking at the actual processes.
If the celery workers get OOM-killed (for instance), the container init
script would not restart them: `get_celery_pids` would find no
processes matching the string "celery", so the pipeline would fail
(`set -o pipefail`) and APP_PIDS would not be updated, but the script
would continue to run and never restart the celery processes.
We think the correct behaviour when the celery processes are killed
(i.e. there are no celery processes left running in a container) is to
kill the container. The PaaS should then schedule a new one, which may
remediate whatever caused the celery processes to be killed.
When no celery processes are detected, some diagnostic information
from the environment is written to the logs, e.g.:
```
CF_INSTANCE_ADDR=10.0.32.4:61012
CF_INSTANCE_INTERNAL_IP=10.255.184.9
CF_INSTANCE_GUID=81c57dbc-e706-411e-6a5f-2013
CF_INSTANCE_PORT=61012
CF_INSTANCE_IP=10.0.32.4
```
Then the script (which is the container entrypoint) exits 1.
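For illustration, here is a rough Python analogue of that check (the real entrypoint is a shell script, so the `psutil` usage below merely stands in for the shell pipeline):

```python
import os
import sys

import psutil


def ensure_celery_is_running():
    # Mirror get_celery_pids: look for live processes whose command
    # line contains the string "celery".
    celery_running = any(
        "celery" in " ".join(proc.info["cmdline"] or [])
        for proc in psutil.process_iter(["cmdline"])
    )
    if not celery_running:
        # Dump CF_INSTANCE_* diagnostics to the logs before dying.
        for key, value in os.environ.items():
            if key.startswith("CF_INSTANCE"):
                print(f"{key}={value}")
        sys.exit(1)
```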
Co-authors: @servingupaces, @tlwr
Make a decorator that pings cronitor before and after each task run.
It is designed for use with nightly tasks, so we have visibility when
they fail. We have a bunch of cronitor monitors set up - 5 character
keys that go into a URL to which we then make a GET request with a
self-explanatory URL path (run/fail/complete).
The cronitor URLs are defined in the credentials repo as a dictionary
of celery task names to URL slugs. If the name passed in to the
decorator isn't in that dict, no pings are made.
To use it, all you need to do is call `@cronitor(my_task_name)`
instead of `@notify_celery.task`, and make sure that the task name and
its matching slug are included in the credentials repo (or, locally,
JSON-dumped and stored in the CRONITOR_KEYS environment variable).
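A minimal sketch of the decorator's control flow, assuming a `https://cronitor.link/<key>/<command>` ping URL (the real endpoint and slugs live in the credentials repo) and leaving out the celery task registration the real decorator presumably performs:

```python
import functools
import json
import os

import requests

# Assumed ping URL shape; the per-task keys come from CRONITOR_KEYS.
CRONITOR_URL = "https://cronitor.link/{key}/{command}"


def cronitor(task_name):
    keys = json.loads(os.environ.get("CRONITOR_KEYS", "{}"))
    key = keys.get(task_name)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if key is None:
                # Task name isn't in the dict: skip the pings.
                return func(*args, **kwargs)
            requests.get(CRONITOR_URL.format(key=key, command="run"))
            try:
                result = func(*args, **kwargs)
            except Exception:
                requests.get(CRONITOR_URL.format(key=key, command="fail"))
                raise
            requests.get(CRONITOR_URL.format(key=key, command="complete"))
            return result

        return wrapper

    return decorator
```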
This addresses some problems that existed in the previous approach
(sketched below):
1. There was a race condition between checking for the existence of
the .pid files and actually reading them.
2. If for some reason a .pid file was left behind after its process had
died, the script would never know, because we do:
```
kill -s ${1} ${APP_PID} || true
```
and the `|| true` swallows any failure from `kill`.
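To make those two problems concrete, here is an illustrative Python rendering of the old PID-file check (the file path is an assumption, not the real script):

```python
import os

import psutil


def celery_running_via_pid_file(pid_file="/tmp/celery.pid"):
    # 1. Race: the file can vanish between this existence check and
    #    the read below.
    if not os.path.exists(pid_file):
        return False
    with open(pid_file) as f:
        pid = int(f.read())
    # 2. Staleness: a leftover file still yields a PID here, and the
    #    `kill ... || true` above hides the failure once it's dead.
    return psutil.pid_exists(pid)
```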
Currently, the admin app requests service statistics (notification
counts grouped by status) and template statistics (counts by
template) in order to display the service dashboard.
Service statistics are gathered from the FactNotificationStatus table
(counts for the last 7 days) combined with the Notification table
(counts for today).
Template statistics are currently gathered from the redis cache, which
holds a separate counter per template per day. It is hard for us to
keep the redis counts consistent with the DB: the counter is not
updated for cancelled letters, counter resets in the middle of the day
can produce a wrong result for the rest of the week, and a cleared
redis cache can't be repopulated for services with low data retention
periods.
Since FactNotificationStatus already contains separate counts for
each template_id, we can use the existing logic with some additional
filters to get separate counts for each template and status
combination, which allows us to populate the service dashboard page
from a single query response.
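As a rough sketch of the kind of query this enables (the column names here are assumptions based on the description above, not the real schema):

```python
from datetime import date, timedelta

from sqlalchemy import func


def template_usage_last_seven_days(session, FactNotificationStatus, service_id):
    # One query returning a row per (template, status) pair, instead
    # of separate service and template statistics requests.
    week_ago = date.today() - timedelta(days=7)
    return (
        session.query(
            FactNotificationStatus.template_id,
            FactNotificationStatus.notification_status,
            func.sum(FactNotificationStatus.notification_count).label("count"),
        )
        .filter(
            FactNotificationStatus.service_id == service_id,
            FactNotificationStatus.bst_date >= week_ago,
        )
        .group_by(
            FactNotificationStatus.template_id,
            FactNotificationStatus.notification_status,
        )
        .all()
    )
```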
If a precompiled letter can't be opened (e.g. because it isn't a valid
PDF), we were setting its billable units to 0 but not moving it to the
invalid PDF bucket. If a precompiled letter failed sanitisation, we
were moving it to the invalid PDF bucket but not setting its billable
units to 0.
This commit makes sure that we always both set the billable units to 0
and move the PDF to the invalid PDF bucket, whether it fails
sanitisation or can't be opened.
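A hedged sketch of the unified handling (bucket and helper names are made up for illustration; the real code presumably differs):

```python
import boto3

# Illustrative bucket names only.
SCAN_BUCKET = "letters-scan"
INVALID_PDF_BUCKET = "letters-invalid-pdf"


def handle_failed_precompiled_letter(notification, filename):
    # Both failure modes (unopenable PDF, failed sanitisation) now get
    # the same treatment: zero the billable units and move the file.
    notification.billable_units = 0

    s3 = boto3.client("s3")
    # S3 has no move primitive, so copy then delete.
    s3.copy_object(
        Bucket=INVALID_PDF_BUCKET,
        Key=filename,
        CopySource={"Bucket": SCAN_BUCKET, "Key": filename},
    )
    s3.delete_object(Bucket=SCAN_BUCKET, Key=filename)
```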