We use exec to start awslogs_agent and then a tail to print logs to
stdout. The CF docs[1] recommend using exec to start processes, which
seems to imply that as long as those commands are running the container
will remain up and running.
This commit ensures that if no celery processes are running we kill any
other processes we have started, so that the container is no longer
considered healthy by CloudFoundry and gets replaced.
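A minimal sketch of that cleanup (the use of `jobs -p` to find our
background children is an assumption, not the script's actual code):
```
# no celery processes left: kill the other processes we started
# (awslogs_agent, tail) so the container stops being healthy
kill $(jobs -p) 2>/dev/null || true
```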
1: https://docs.cloudfoundry.org/devguide/deploy-apps/manifest.html#start-commands
In 4427827b2f, celery monitoring was changed from using PID files to
actually looking at processes.
If the celery workers get OOM-killed (for instance), the container init
script would not restart them. This is because the output of
`get_celery_pids` would no longer contain any process matching the
string "celery", which made the pipeline fail (`-o pipefail`).
APP_PIDS would not get updated, but the script would carry on running,
so it never restarted the celery processes.
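A minimal sketch of the failure mode, reconstructed from the description
above (the function body and the surrounding check are assumptions):
```
set -o pipefail

get_celery_pids() {
  # when no process matches "celery", grep exits 1, so under pipefail
  # the whole pipeline fails even though ps succeeded
  ps ax | grep '[c]elery' | awk '{print $1}'
}

if pids=$(get_celery_pids); then
  APP_PIDS="${pids}"
fi
# once the workers are gone the branch is never taken: APP_PIDS keeps
# its stale value and the script carries on as if nothing happened
```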
We think the correct behaviour when celery processes are killed (i.e.
there are no more celery processes running in a container) is to kill
the container. The PaaS should then schedule new containers, which may
remediate whatever caused the celery processes to be killed.
When no celery processes are detected, the script writes some
diagnostic information from the environment to the logs, e.g.:
```
CF_INSTANCE_ADDR=10.0.32.4:61012
CF_INSTANCE_INTERNAL_IP=10.255.184.9
CF_INSTANCE_GUID=81c57dbc-e706-411e-6a5f-2013
CF_INSTANCE_PORT=61012
CF_INSTANCE_IP=10.0.32.4
```
Then the script (which is the container entrypoint) exits 1.
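A minimal sketch of the dump (the grep pattern is inferred from the
sample output above):
```
# log every CF_INSTANCE_* variable for diagnosis, then exit non-zero
# so the platform replaces the container
env | grep '^CF_INSTANCE_'
exit 1
```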
Co-authored-by: @servingupaces
Co-authored-by: @tlwr
This addresses some problems that existed in the previous approach:
1. There was a race condition between checking for the existence of the
.pid files and actually reading them.
2. If for some reason a .pid file was left behind after its process had
died, the script would never know, because we do:
```
kill -s ${1} ${APP_PID} || true
```
The `|| true` discards the exit status, so a failed kill (no such
process) looks the same as a successful one.
Previously the script would try to SIGTERM each celery process every
second for the nine-second timeout, and then SIGKILL every second after
that, with no upper bound.
This commit changes this to (sketched below):
* SIGTERM each process once.
* Wait nine seconds, checking each second whether the pid files are
still present.
* SIGKILL any remaining processes once.
* Exit.
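A minimal sketch of the new sequence (the pid-file location is an
assumption):
```
# SIGTERM each worker once
for pidfile in /tmp/celery*.pid; do
  kill -TERM "$(cat "${pidfile}")" || true
done

# wait up to nine seconds; workers remove their pid files when they
# shut down cleanly
for _ in 1 2 3 4 5 6 7 8 9; do
  sleep 1
  ls /tmp/celery*.pid >/dev/null 2>&1 || exit 0
done

# SIGKILL anything still holding a pid file, then exit
for pidfile in /tmp/celery*.pid; do
  kill -KILL "$(cat "${pidfile}")" || true
done
exit 0
```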
Our application servers and celery workers write logs both to a
file that is shipped to CloudWatch and to stdout, which is picked
up by CloudFoundry and sent to Logit Logstash.
This works with gunicorn and single-worker celery deployments. However,
`celery multi` daemonizes its worker processes, which detaches them
from stdout, so there's no log output in `cf logs` or Logit.
To fix this, we start a separate tail process to duplicate logs written
to a file to stdout, which should be picked up by CloudFoundry.
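A minimal sketch, assuming the log file path:
```
# mirror everything appended to the shared log file onto stdout, where
# CloudFoundry picks it up and forwards it to Logstash
touch /home/vcap/logs/app.log
tail -n 0 -f /home/vcap/logs/app.log &
```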
`exec` replaces the current shell with the command, which means that
script execution stops at that line.
Backgrounding it with `exec "$@" &` won't work either, because the
script moves straight on to the next command, which looks for the
`.pid` files that have not yet been created: celery takes a few seconds
to spin up all its processes.
Using `sleep X` to remedy this seems just wrong, given that
1. we can use `eval`, which blocks until the command returns
2. there is no obvious benefit to sticking with `exec`
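A minimal sketch of the difference (the surrounding script is elided):
```
# with exec the shell is replaced, so nothing below this line runs:
#   exec "$@"

# eval blocks until `celery multi` returns, i.e. until the workers have
# been daemonized and their pid files written, then the script goes on:
eval "$@"
ls celery*.pid
```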
The existing script would not work with `celery multi`, as it was
trying to put it in the background and then get its PID.
`celery multi` creates a number of worker processes and stores their
PIDs in files named celeryN.pid, where N is the index number of the worker
(starting at 1).
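For example (the app module and pid-file path are assumptions):
```
# start three workers named celery1..celery3; %n expands to the node
# name in each worker's pid-file path
celery multi start 3 -A app --pidfile=/tmp/%n.pid
ls /tmp/celery*.pid
# /tmp/celery1.pid  /tmp/celery2.pid  /tmp/celery3.pid
```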