We use exec to start awslogs_agent and then a tail to print logs to
stdout. The CF docs[1] recommend using exec to start processes, which seems
to imply that as long as there are commands running the container will
remain up and running.
This commit ensures that if there are no celery tasks running we
kill any other processes that we have started, so that the container will
no longer be considered healthy by CloudFoundry and will be replaced.
1: https://docs.cloudfoundry.org/devguide/deploy-apps/manifest.html#start-commands
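A minimal sketch of the resulting entrypoint shape; the agent path, log
path, and the celery check are assumptions, not the real script:
```
#!/usr/bin/env bash
# Illustrative only: paths and process names are assumptions.

# Start the logs agent and a tail in the background; exec means each
# background job becomes the real process rather than a wrapper shell.
(exec ./bin/awslogs-agent-launcher.sh) &
(exec tail -n0 -f /home/vcap/logs/app.log) &

# Once celery is no longer running, kill the processes we started above so
# the container stops looking healthy and CloudFoundry replaces it.
while pgrep -f celery > /dev/null; do
  sleep 1
done
kill $(jobs -p) 2> /dev/null || true
```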
In 4427827b2f, celery monitoring was
changed from using PID files to actually looking at processes.
If celery workers got OOM killed (for instance), the container init
script would not restart them. This is because `get_celery_pids` would
not find any processes containing the string celery, which would
cause the pipeline to fail (-o pipefail): APP_PIDS would not get updated,
but the script would continue to run, and so it never restarted
the celery processes.
We think the correct behaviour when celery processes are killed (i.e.
there are no more celery processes running in a container) is to kill
the container. The PaaS should then schedule new ones which may
remediate the cause of the celery processes being killed.
Upon detection of no celery processes running, some diagnostic
information from the environment is sent to the logs, e.g.:
```
CF_INSTANCE_ADDR=10.0.32.4:61012
CF_INSTANCE_INTERNAL_IP=10.255.184.9
CF_INSTANCE_GUID=81c57dbc-e706-411e-6a5f-2013
CF_INSTANCE_PORT=61012
CF_INSTANCE_IP=10.0.32.4
```
Then the script (which is the container entrypoint) exits 1.
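A sketch of the check described above; the function and variable names
follow the commit message, the rest is assumed:
```
# With `set -o pipefail`, a grep that matches nothing fails the whole
# pipeline, so an empty result has to be handled explicitly rather than
# silently leaving APP_PIDS stale.
get_celery_pids() {
  # the [c]elery trick stops grep matching its own command line
  APP_PIDS=$(ps ax -o pid,command | grep '[c]elery' | awk '{print $1}')
}

get_celery_pids
if [ -z "${APP_PIDS}" ]; then
  echo "No celery processes found - printing diagnostics and exiting"
  env | grep '^CF_INSTANCE_'   # CF_INSTANCE_ADDR, CF_INSTANCE_IP, etc.
  exit 1  # this script is the container entrypoint, so the container dies
fi
```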
Co-author: @servingupaces @tlwr
This addresses some problems that existed in the previous approach:
1. There was a race condition between the time we checked for the
existence of the .pid files and when we actually read them.
2. If for some reason the .pid file was left behind after a process had
died, the script would never know, because we do:
`kill -s ${1} ${APP_PID} || true`
Previously the script would:
try to SIGTERM each celery process every second for the 9 second
timeout, and then SIGKILL every second after that, with no upper bound.
This commit changes this to (sketched below):
* SIGTERM each process once.
* Wait nine seconds (checking each second whether the pid files are still
present).
* SIGKILL any remaining processes once.
* exit
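A sketch of that sequence, assuming the pid files live under
/home/vcap/app (the helper name is illustrative):
```
shutdown_celery() {
  pkill -TERM -f celery || true   # SIGTERM each process once

  for _ in $(seq 1 9); do         # wait up to nine seconds, checking each
    sleep 1                       # second whether pid files remain
    ls /home/vcap/app/celery*.pid > /dev/null 2>&1 || break
  done

  pkill -KILL -f celery || true   # SIGKILL any remaining processes once
  exit
}
```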
Remove pip-accel - it hasn't been updated in two years, and pins our
version of pip to a version that is several breaking changes old.
Make sure commands work if you're already in a venv - mostly by
checking for the presence of $VIRTUAL_ENV, and ensuring we use the correct
pip to install packages. Also clean up the commands a bit.
You need to `pip install celery[sqs]` to get the additional
dependencies that celery needs to use SQS queues - there are two libs:
boto3 and pycurl.
pycurl is a set of Python bindings around libcurl, so it needs to be
installed from source so it can link against your curl/SSL libs. On PaaS and
in Docker this works fine (we needed to add `libcurl4-openssl-dev` to the
Docker container), but on macOS it can't find OpenSSL. We need to pass
a couple of flags in:
* set the environment variable PYCURL_SSL_LIBRARY=openssl
* pass in the global options `build_ext` and `-I{openssl_headers_path}`.
As shown here:
https://github.com/pycurl/pycurl/issues/530#issuecomment-395403253
The env var is no biggie, but using any install-option flags disables
wheels for the whole `pip install` run. (See
https://github.com/pypa/pip/issues/2677 and
https://github.com/pypa/pip/issues/4118 for more context on the
install-options flags.) A whole bunch of our dependencies don't
install nicely from source (but do from wheel), so this commit installs
pycurl separately as an initial step, with the requisite flags, and
then installs the rest of the requirements as before.
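For example, the separate step might look like this on macOS, assuming
Homebrew's OpenSSL lives under /usr/local/opt/openssl (adjust the path
for your setup):
```
# install pycurl first, from source, pointing it at the right SSL headers;
# the --global-option flags disable wheels for this invocation only
PYCURL_SSL_LIBRARY=openssl pip install pycurl \
  --global-option=build_ext \
  --global-option="-I/usr/local/opt/openssl/include"

# then install the rest of the requirements, with wheels still enabled
pip install -r requirements.txt
```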
I've updated the makefile and bootstrap.sh files to reflect this, but
if you run `pip install -r requirements.txt` from scratch you will run
into issues.
The list of top-level dependencies is moved to requirements-app.txt,
which is used by `make freeze-requirements` to generate the full
list of requirements in requirements.txt.
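A rough sketch of what that target might do, assuming it uses a throwaway
virtualenv (the real target may differ):
```
# resolve the top-level dependencies in a clean venv, then pin everything
python -m venv /tmp/freeze-venv
/tmp/freeze-venv/bin/pip install -r requirements-app.txt
/tmp/freeze-venv/bin/pip freeze > requirements.txt
rm -rf /tmp/freeze-venv
```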
This is based on alphagov/digitalmarketplace-api#615, so rationale
from that PR applies here.
We had a problem with unpinned packages on new deployments leading
to failed tests (e.g. alphagov/notifications-admin#2144) which is
why we're implementing this now.
After re-evaluating pipenv again, this still seems like the least
disruptive approach:
* pyup.io has experimental support for Pipfile, but doesn't respect
version ranges or update hashes in the lock file
* CloudFoundry buildpack recognizes and supports Pipfiles out of the
box, but the support is relatively new. For example, until recently
CF would install dev packages during deployment. It's also based on
generating a requirements file from the Pipfile, which doesn't
properly support pinning VCS dependencies (e.g. it doesn't set the
#egg= version, meaning pip will not upgrade the package if it's
already installed).
* pipenv has a strict dependency resolution algorithm, which doesn't
appear to be well documented and can cause some unexpected failures.
For example, pipenv doesn't seem to be able to install the `awscli-cwlogs`
package at all, believing it to have a version conflict for `botocore`
(which it doesn't list as a direct dependency) while neither `pip` nor
`pip-tools` highlight any issues with it.
* While trying out `pipenv install` on our list of dependencies it would
regularly fail to install utils with a "Will try again." message.
While the installation succeeds after a retry, this doesn't inspire
confidence.
* The switch to Pipfile and pipenv-managed virtualenvs requires a series
of changes to `make` targets and scripts - replacing `pip install` with
`pipenv`, removing references to requirements files and prefixing
commands with `pipenv run`. While it's likely to simplify the overall
process of managing dependencies, it would require time to properly
implement across our applications and environments (Jenkins, PaaS,
docker containers, and dev machines).
Our application servers and celery workers write logs both to a
file that is shipped to CloudWatch and to stdout, which is picked
up by CloudFoundry and sent to Logit Logstash.
This works with gunicorn and single-worker celery deployments; however,
celery multi daemonizes worker processes, which detaches them from
stdout, so there's no log output in `cf logs` or Logit.
To fix this, we start a separate tail process to duplicate logs written
to a file to stdout, which should be picked up by CloudFoundry.
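Roughly, with an assumed log path:
```
# duplicate anything celery writes to its log file onto our stdout,
# where CloudFoundry will pick it up
tail -n0 -f /home/vcap/logs/app.log &
```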
`exec` replaces the current shell with the command it runs, which means
that script execution stops at that line.
Passing it to the background with `exec "$@" &` won't work either,
because the script will move directly to the next command, which
looks for the `.pid` files that have not yet been created, since celery
takes a few seconds to spin up all the processes.
Using `sleep X` to remedy this seems just wrong, given that:
1. we can use `eval`, which blocks until the command returns
2. there is no obvious benefit in sticking with `exec` (see the sketch
below)
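To make the contrast concrete:
```
exec "$@"    # replaces the shell: nothing below this line ever runs

exec "$@" &  # backgrounds the command, but the script immediately moves on
             # and looks for celery*.pid files that don't exist yet

eval "$@"    # blocks until `celery multi` returns, by which point the pid
             # files have been written, then the script carries on
```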
The existing script would not work with `celery multi`, as it was trying
to put it in the background and then get its PID.
`celery multi` creates a number of worker processes and stores their
PIDs in files named celeryN.pid, where N is the index number of the worker
(starting at 1).
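So once `celery multi` has returned, the worker PIDs can be read straight
from those files (the directory is an assumption):
```
# one pid per file: celery1.pid, celery2.pid, ...
APP_PIDS=$(cat /home/vcap/app/celery*.pid)
echo "celery worker pids: ${APP_PIDS}"
```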
If a PR is going to fail because tests aren’t passing, then you:
- should know about it as quickly as possible
- shouldn’t waste precious Jenkins CPU running subsequent tests
This commit adds the `-x` flag to pytest, which stops the test run as
soon as one failing test is discovered.
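The invocation becomes something like this (the test path is illustrative):
```
pytest -x tests/   # -x stops the run at the first failing test
```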
Changes the generate manifest script to parse the variables file as YAML
and only add variables to the manifest if they're already listed
in the `env` section.
This allows us to use a single variables file for all applications
and avoid duplicating secrets across multiple files while adding
only the relevant secrets to the application manifest.
`./scripts/generate_manifest.py` takes a path to a PaaS manifest file
and a list of variable files and prints a single CloudFoundry manifest.
The generated manifest replaces all `inherit` keys by loading the data
from parent manifests. This allows us to pipe the script output directly
to CF CLI, without saving it to disk, which minimises the risk of it
being accidentally included in the deployment artefact. The combined
manifest might differ from the results produced by CF CLI itself, so
the original manifest shouldn't normally be used on its own.
After combining the manifests the script will load and parse all listed
variable files and add them to the manifest's `env` by merging the files
together in the order they were listed (so in case of any key conflicts
the latest file overwrites previous values), upper-casing keys and
processing any list or dictionary values with `json.dumps`, so that they
can be set as environment variables.
This gives us a full list of environment variables that were previously
parsed from the CloudFoundry user services data.
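Example usage (the file names are illustrative), using process
substitution so the combined manifest never touches disk:
```
# generate the manifest and hand it straight to the CF CLI; later variable
# files overwrite earlier ones on key conflicts
cf push -f <(./scripts/generate_manifest.py manifests/api.yml \
    credentials/common.yml credentials/api.yml)
```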
We saw an issue where the app started, then immediately crashed due to
a setup error. However, Jenkins had already returned positively, and
the deploy continued.
cf-deploy should fail if the app doesn't start up.
We do this by looking through the CloudFoundry events, and aborting
if there are any `app.crash` events for the new GUID.
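A simplified version of the check; the real script filters events by the
new app GUID, and the variable name here is illustrative:
```
# abort the deploy if the freshly pushed app has crashed
if cf events "${APP_NAME}" | grep -q 'app\.crash'; then
  echo "new app instance crashed after deploy - aborting"
  exit 1
fi
```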
When generating a new migration we give it a number that increments
the latest existing migration on master. This means that when there
are multiple PRs open containing a migration and one of them gets
merged the others need to be updated to move their migration files
to apply on top of the recently merged one.
This requires renaming the file and changing migration references
for both the migration revision and down_revision.
If a PR introduced more than one migration, they all need to be updated
one after another, since each one needs to be renamed.
This adds a script to simplify this process. `./scripts/fix_migrations.py`
will check for any branch points. If it finds exactly one (which
should be the common case), it asks which migration should be moved
and renames / updates references to move the selected branch on top
of the other one.
It won't resolve any conflicts within the migrations themselves (e.g. if
both branches modified the same column) and it won't try to resolve
cases with more than one branch point.
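For illustration, with hypothetical migration names, the manual steps the
script automates look like this:
```
# our branch's migration 0103_add_template_type has to move on top of the
# freshly merged 0103_add_service_flag
git mv migrations/versions/0103_add_template_type.py \
       migrations/versions/0104_add_template_type.py
# then, inside 0104_add_template_type.py, update the references:
#   revision      = '0104_add_template_type'
#   down_revision = '0103_add_service_flag'
```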
click (http://click.pocoo.org/) is used by flask to run its CLI
commands. In removing flask_script (it's unmaintained), we had to
migrate all our commands to use click. This is a change for the better
in my eyes - you don't need to define the command in several places,
and it makes managing options a bit easier.
View diff with whitespace turned off unless you're a masochist.
Previously they were using the sample_service fixture under the hood, but
with full permissions added - this works fine, **unless** there's
already a service with the name "sample service" in the database. This
can happen for two reasons:
* A previous test didn't tear down correctly
* This test already invoked the sample_service fixture somehow
If this happens, we just return the existing service, without modifying
its values - values that we might change in tests, such as
research mode or letters permissions.
In the future, we'll have to be vigilant! and aware! and careful! not
to use sample_service if we're doing tests involving letters, since
they now create a service with a different name.
Previously we didn't do this because the tests all used the same DB
(test_notifications_api); however, @minglis shared a snippet that simply
creates one test db per thread.
Previously, in run_app_paas.sh, we captured stdout from the app and
piped it into the log file. However, this caused a bunch of
problems, mainly:
* exceptions with stack traces often weren’t formatted properly,
and kibana could not parse them
* celery logs were duplicated - we’d collect both the json logs and
the human readable stdout logs.
Instead, with the updated utils library, we can log json
straight to the appropriate directory.