Commit Graph

161 Commits

Author SHA1 Message Date
Leo Hemsted
b01ec05aaf run migrations if app is down
Normally we check the app's status page to see if migrations need
running. However, if the _status endpoint doesn't respond with a 200, we
don't necessarily want to abort the deploy - we may, for example, be
trying to deploy a code fix for that very status endpoint.

We don't know whether to run the migrations or not, so err on the side
of caution by running the migrations anyway. The migration itself might
be the fix that gets the app working, after all.

Had to do a little song and dance because sometimes the response won't
be populated before an exception is thrown.
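
A rough sketch of the check described above, assuming a Python deploy helper using requests; the status-payload field and function names are made up for illustration:

```
import requests

def should_run_migrations(status_url):
    """Decide whether to run migrations during a deploy.

    If the _status endpoint answers cleanly we can trust it; if it is
    down (non-200, or no response at all) we err on the side of caution
    and run the migrations anyway.
    """
    try:
        response = requests.get(status_url, timeout=10)
    except requests.RequestException:
        # the response object may never have been populated, so don't
        # try to inspect it - just run the migrations
        return True

    if response.status_code != 200:
        return True

    # illustrative field name - the real status payload differs
    return response.json().get("db_migration_needed", False)
```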
2020-06-26 15:28:28 +01:00
David McDonald
a237162106 Reduce concurrency and prefetch count of reporting celery app
We have seen the reporting app run out of memory multiple times when
dealing with overnight tasks. The app runs 11 worker threads and we
reduce this to 2 worker threads to put less pressure on a single
instance.

The number 2 was chosen as most of the tasks processed by the reporting
app only take a few minutes and usually only one or two take more than
an hour. This would mean that with 2 processes across our current 2
instances, a long running task should hopefully only wait behind a few
short running tasks before being picked up, and therefore we shouldn't
see a large increase in the overall time taken to run all our overnight
reporting tasks.

On top of reducing the concurrency for the reporting app, we also set
CELERYD_PREFETCH_MULTIPLIER=1. We do this as suggested by the celery
docs because this app deals with long running tasks.
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit

The change in prefetch multiplier should again optimise the overall time
it takes to process our tasks by ensuring that tasks are given to
instances that have (or will soon have) spare workers to deal with them,
rather than committing to putting all the tasks on certain workers in
advance.

Note, another improvement suggested by the optimisation docs is to set
`ACKS_LATE` on the long running tasks.
This setting would effectively change us from prefetching 1 task per
worker to prefetching 0 tasks per worker and further optimise how we
distribute our tasks across instances. However, we decided not to try
this setting as we weren't sure whether it would conflict with our
visibility_timeout. We decided not to spend the time investigating but
it may be worth revisiting in the future, as long as tasks are
idempotent.

Overall, this commit takes us from potentially having all 18 of our
reporting tasks get fetched onto a single instance to now having a
process that will ensure tasks are distributed more fairly across
instances based on when they have available workers to process the
tasks.
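
As a sketch, the two knobs described above look something like this in a celery 3.1-style config module (illustrative only - the real app may set concurrency on the worker command line instead):

```
# two worker processes per instance instead of eleven, so a single
# instance holds fewer long-running reporting tasks at once
CELERYD_CONCURRENCY = 2

# each worker process reserves only one task at a time, as the celery
# optimising docs suggest for long running tasks
CELERYD_PREFETCH_MULTIPLIER = 1
```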
2020-04-28 10:47:46 +01:00
David McDonald
5d88a1dbf4 Add -Ofair setting to reporting celery app
What this setting does is best described in
https://medium.com/@taylorhughes/three-quick-tips-from-two-years-with-celery-c05ff9d7f9eb#d7ec

This should be useful for the reporting app because tasks run by this
app are long running (many seconds). Ideally this code change will
mean that we are quicker to process the overnight reporting tasks,
so they all finish earlier in the morning (although they are not individually quicker).

This is only being set on the reporting celery app because this
change is trying to do the minimum possible to improve the reliability
and speed of our overnight reporting tasks. It may very well be useful
to set this flag on all our apps, but that should be done with some more
consideration, as some of them deal with much faster tasks (sub
0.5s) and so the flag may or may not still be appropriate there. Proper
investigation would be needed.

Note, the celery docs on this are also worth a read:
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit.
However, the language there can confuse this setting with the prefetch
limit. The distinction is that prefetching grabs items off the queue,
whereas -Ofair governs whether, once items have already been prefetched,
the master celery process hands them straight to the child (worker)
processes or holds them back.

Note, this behaviour is default for celery version 4 and above but we
are still on version 3.1.26 so we have to enable it ourselves.
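
Since -Ofair is a worker command-line option rather than a config setting, a hypothetical startup wrapper shows where it goes (module name and log level are illustrative, not the repo's exact invocation):

```
import subprocess

def start_reporting_worker():
    # -O fair: the master process only hands a prefetched task to a
    # child process that is actually free (the default from celery 4)
    subprocess.check_call([
        "celery", "-A", "run_celery", "worker",
        "--loglevel=INFO",
        "-O", "fair",
    ])
```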
2020-04-28 10:42:42 +01:00
Rebecca Law
db4b4d929d - If the task runs twice and the notification already exists, ignore the primary key violation (see the sketch below).
- Remove prints
- Add some more tests
- Only allow the new method to run for emails
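
A minimal sketch of the duplicate-handling idea, assuming SQLAlchemy; the function name is a placeholder, not the repo's real helper:

```
from sqlalchemy.exc import IntegrityError

def persist_notification_idempotently(db_session, notification):
    """Insert the notification; if the task already ran and the row
    exists, swallow the primary key violation instead of failing."""
    try:
        db_session.add(notification)
        db_session.commit()
    except IntegrityError:
        # already saved by a previous run of the task
        db_session.rollback()
```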
2020-03-25 12:39:15 +00:00
Rebecca Law
a13bcc6697 Reduce the pressure on the db for API post email requests.
Instead of saving the email notification to the db, add it to a queue to save later.
This is an attempt to alleviate pressure on the db from the api requests.
This initial PR is to trial it and see if we see an improvement in the api performance and a reduction in queue pool errors. If we are happy with this we could remove the hard coding of the service id.

In a nutshell:
 - If POST /v2/notification/email is from our high volume service (hard coded for now), then create a notification and send it to a queue so it can be persisted to the db.
 - create a save_api_email task to persist the notification
 - return the notification
 - New worker app to process the save_api_email tasks.
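
In outline, the flow might look something like this (task name, queue name and service id are placeholders, not the real values):

```
from celery import Celery

celery_app = Celery("notifications")

HIGH_VOLUME_SERVICE_ID = "hard-coded-service-id"  # placeholder


def persist_to_db(notification):
    """Placeholder for the existing synchronous save logic."""


@celery_app.task(name="save-api-email")
def save_api_email(notification):
    # runs on the new worker app, away from the API request path
    persist_to_db(notification)


def handle_post_email(service_id, notification):
    if service_id == HIGH_VOLUME_SERVICE_ID:
        # don't touch the db in the request: queue the save instead
        save_api_email.apply_async((notification,), queue="save-api-email-tasks")
    else:
        persist_to_db(notification)
    return notification
```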
2020-03-25 07:59:05 +00:00
Katie Smith
3a07d1e13d Create new sms-callbacks queue
The `delivery-worker-receipts` app will listen to this new queue, which will
be used for processing the responses from Firetext and MMG.
2020-03-19 13:41:14 +00:00
Leo Hemsted
a97948574a remove unused pytest flags 2020-02-13 12:52:12 +00:00
Rebecca Law
fe18512dd2 Change how the bash script is started.
By adding `exec` to the entrypoint bash script for the application, we can trap an EXIT from the script and execute our custom `on_exit` method, which checks if the application process is busy before terminating, waiting up to 10 seconds. We don't need to trap `TERM` so that's been removed again.

Written by:
@servingupaces
@tlwr
2019-10-31 16:41:16 +00:00
Toby Lorne
4ad2e30e52 Catch the TERM signal in the run_app_*paas scripts
When Cloud Foundry applications are to be rescheduled from one cell to
another, or they are stopped, they are sent a SIGTERM signal and 10
seconds later, a SIGKILL signal.

Currently the scripts trap the POSIX-defined EXIT handler, rather than
the signal directly.

In order for the signal to be properly propagated to celery and the
celery workers, the script should call the on_exit function when
receiving a TERM signal.
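
The scripts themselves are bash, but as a rough Python analogue the change amounts to wiring the same shutdown routine to TERM as well:

```
import signal
import sys

def shut_down_workers():
    """Placeholder for stopping celery and its worker processes."""

def on_exit(signum=None, frame=None):
    # run the same shutdown whether we exit normally or are sent TERM
    # by Cloud Foundry (which allows ~10s before SIGKILL)
    shut_down_workers()
    sys.exit(0)

signal.signal(signal.SIGTERM, on_exit)
```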

Signed-off-by: Toby Lorne <toby.lornewelch-richards@digital.cabinet-office.gov.uk>
Co-authored-by: Becca <rebecca.law@digital.cabinet.office.gov.uk>
Co-authored-by: Toby <toby.lornewelch-richards@digital.cabinet.office.gov.uk>
2019-10-29 17:21:16 +00:00
Leo Hemsted
3a0bf2b23e Add reporting worker
also remove references to unused statistics queue
2019-08-15 16:42:15 +01:00
Andy Paine
655d5a4e16 AUTO-413: Use an internal app for statsd preview
- We are running statsd exporter as an app with a public route for
  Prometheus to scrape
- This updates preview to send statsd metrics over the CF internal
  networking to the statsd exporter
- Removes the sidecar statsd exporters too
2019-05-23 11:10:33 +01:00
Pea Tyczynska
b59bca0fc2 Rename workers so they are less wordy xd 2019-05-01 14:51:43 +01:00
Pea Tyczynska
6163ca8b45 Change distribution of queues among notify delivery workers
This is so that the retry-tasks queue, which can have quite a lot of
load, has its own worker, and other queues are paired with queues
that flow similarly:
- letter-tasks with create-letters-pdf-tasks
- job-tasks with database-tasks
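
The pairing amounts to a worker-to-queues mapping along these lines (worker names are illustrative; the queue names follow the commit text):

```
WORKER_QUEUES = {
    "delivery-worker-retry": ["retry-tasks"],  # heavy load, own worker
    "delivery-worker-letters": ["letter-tasks", "create-letters-pdf-tasks"],
    "delivery-worker-jobs": ["job-tasks", "database-tasks"],
}

def queue_argument(worker_name):
    # e.g. passed to the worker as: celery worker -Q job-tasks,database-tasks
    return ",".join(WORKER_QUEUES[worker_name])
```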
2019-04-30 12:03:06 +01:00
Alexey Bezhan
570cbc3eab Add statsd_exporter to app PaaS startup scripts
`statsd_exporter` is only started if `STATSD_HOST` is set to `localhost`.
2019-04-24 13:50:13 +01:00
Leo Hemsted
95ca533df2 add delivery-celery-beat to paas_app_wrapper
Forget about beats by dre
2019-04-12 11:10:18 +01:00
Leo Hemsted
8cc5d40291 remove old manifest files and creation script 2019-04-10 15:21:30 +01:00
Leo Hemsted
7d9cd58e89 rename public-api to api
we don't use the public-api bit anywhere - even cloudwatch overwrites
based on CW_APP_NAME (which we can get rid of as this distinction is
gone)
2019-04-10 15:19:46 +01:00
Leo Hemsted
fe77d4f654 move commands from manifest to procfile
cf v3 doesn't appear to support commands in manifest files. The docs
say that it does, but in practice it fails on cf v3-zdt-push with the
error message "No process types returned from stager". This can be
solved by moving the command from the manifest to a Procfile.

However, the Procfile is part of the source code, and as such is the
same for each app. To get around this, make the Procfile command invoke
a new wrapper script, which checks the NOTIFY_APP_NAME env var and then
calls the correct command
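
The wrapper is a shell script in practice, but the dispatch it performs is roughly this (app names and commands are illustrative):

```
import os
import subprocess

COMMANDS = {
    "api": "gunicorn -c gunicorn_config.py application",
    "delivery-worker": "celery -A run_celery worker --loglevel=INFO",
    "delivery-celery-beat": "celery -A run_celery beat --loglevel=INFO",
}

def main():
    # the Procfile command is the same for every app; NOTIFY_APP_NAME
    # tells the wrapper which real command to run
    app_name = os.environ["NOTIFY_APP_NAME"]
    subprocess.check_call(COMMANDS[app_name], shell=True)

if __name__ == "__main__":
    main()
```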
2019-04-09 17:05:56 +01:00
Athanasios Voutsadakis
3528aab25b Kill the other processes started by the script
We use exec to start awslogs_agent and then a tail to print logs to
stdout. CF docs[1] recommend using exec to start processes, which seems
to imply that as long as there are commands running the container will
remain up and running.

This commit ensures that if there are no celery tasks running we will
kill any other processes that we have started, so that the container will
no longer be considered healthy by cloudfoundry and will be replaced.

1: https://docs.cloudfoundry.org/devguide/deploy-apps/manifest.html#start-commands
2019-01-23 16:23:58 +00:00
Toby Lorne
afcdf1f9a1 Exit if celery processes are not running
In 4427827b2f celery monitoring was
changed from using PID files to actually looking at processes.

If celery workers get OOM killed (for instance), the container init
script would not restart them. This is because `get_celery_pids` would
not find any processes containing the string celery, which would cause
the pipe to fail (-o pipefail). APP_PIDS would not get updated but the
script would continue to run, so the script never restarted the celery
processes.

We think the correct behaviour when celery processes are killed (i.e.
there are no more celery processes running in a container) is to kill
the container. The PaaS should then schedule new ones which may
remediate the cause of the celery processes being killed.

Upon detection of no celery processes running, some diagnostic
information from the environment is sent to the logs, e.g.:

```
CF_INSTANCE_ADDR=10.0.32.4:61012
CF_INSTANCE_INTERNAL_IP=10.255.184.9
CF_INSTANCE_GUID=81c57dbc-e706-411e-6a5f-2013
CF_INSTANCE_PORT=61012
CF_INSTANCE_IP=10.0.32.4
```

Then the script (which is the container entrypoint) exits 1.
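
As a rough Python analogue of the shell behaviour (the real entrypoint is bash and uses get_celery_pids):

```
import os
import subprocess
import sys

def get_celery_pids():
    # roughly equivalent to grepping the process list for "celery"
    result = subprocess.run(["pgrep", "-f", "celery"],
                            capture_output=True, text=True)
    return result.stdout.split()

def exit_if_celery_dead():
    if not get_celery_pids():
        # dump some diagnostic information before giving up
        for key, value in sorted(os.environ.items()):
            if key.startswith("CF_INSTANCE"):
                print(f"{key}={value}")
        # a non-zero exit marks the container unhealthy so the PaaS
        # schedules a replacement
        sys.exit(1)
```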

Co-author: @servingupaces @tlwr
2019-01-23 14:28:53 +00:00
Athanasios Voutsadakis
4427827b2f Handle celery PIDs more reliably
This addresses some problems that existed in the previous approach:

1. There was a race condition that could occur between the time we were
looking for the existence of the .pid files and actually reading them.

2. If for some reason the .pid file was left behind after a process had
died, the script would never know because we do:

    kill -s ${1} ${APP_PID} || true
2019-01-17 16:55:44 +00:00
Leo Hemsted
80454579ee fix multi worker exit script
previously the script would:

try to SIGTERM each celery process every second for the 9 second
timeout, and then SIGKILL every second after that, with no upper bound.

This commit changes this to:
* SIGTERM each process once.
* Wait nine seconds (checking if the pid files are still present each
  second)
* SIGKILL any remaining processes once.
* exit
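
In Python terms (the real script works on celery's .pid files in bash), the new sequence is roughly:

```
import os
import signal
import time

def process_is_running(pid):
    try:
        os.kill(pid, 0)  # signal 0 only checks the process exists
        return True
    except OSError:
        return False

def stop_workers(pids, timeout=9):
    # SIGTERM each process once
    for pid in pids:
        os.kill(pid, signal.SIGTERM)

    # wait up to `timeout` seconds, checking once per second
    for _ in range(timeout):
        pids = [pid for pid in pids if process_is_running(pid)]
        if not pids:
            return
        time.sleep(1)

    # SIGKILL anything still left, once, then exit
    for pid in pids:
        os.kill(pid, signal.SIGKILL)
```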
2019-01-02 15:15:46 +00:00
Rebecca Law
68cea04210 Fixed error message 2018-11-09 16:40:58 +00:00
Leo Hemsted
2ed50e760f Revert "Celery 4" 2018-10-09 13:27:49 +01:00
Leo Hemsted
bfc4343b0e remove pip-accel and make sure commands work if you're in a venv
remove pip-accel - it's not been updated in two years, and pins our
version of pip to a version that is several breaking changes old.

make sure commands work if you're already in a venv - mostly by
checking for presence of $VIRTUAL_ENV, and ensuring we use the correct
pip to install packages. Also clean up the commands a bit.
2018-10-04 15:52:51 +01:00
Leo Hemsted
640f00b0e8 install celery with sqs support
you need to `pip install celery[sqs]` to get the additional
dependencies that celery needs to use SQS queues - there are two libs -
boto3 and pycurl.

pycurl is a set of python bindings around curl, so it needs to be
installed from source so it can link to your curl/ssl libs. On paas and
in docker this works fine (needed to add `libcurl4-openssl-dev` to the
docker container), but on macos it can't find openssl. We need to pass
a couple of flags in:

* set the environment variable PYCURL_SSL_LIBRARY=openssl
* pass in the global options `build_ext` and `-I{openssl_headers_path}`.

As shown here:
https://github.com/pycurl/pycurl/issues/530#issuecomment-395403253

Env var is no biggie, but using any install-option flags disables
wheels for the whole pip install run. (See
https://github.com/pypa/pip/issues/2677 and
https://github.com/pypa/pip/issues/4118 for more context on the
install-options flags). A whole bunch of our dependencies don't
install nicely from source (but do from wheel), so this commit installs
pycurl separately as an initial step, with the requisite flags, and
then installs the rest of the requirements as before.

I've updated the makefile and bootstrap.sh files to reflect this, but
if you run `pip install -r requirements.txt` from scratch you will run
into issues.
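
The actual change lives in the Makefile and bootstrap.sh; a rough Python rendition of the same two-step ordering, with an illustrative openssl header path:

```
import os
import subprocess
import sys

def install_requirements(openssl_headers="/usr/local/opt/openssl/include"):
    env = dict(os.environ, PYCURL_SSL_LIBRARY="openssl")

    # step 1: pycurl on its own, with the flags it needs to find openssl
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "pycurl",
         "--global-option=build_ext",
         "--global-option=-I{}".format(openssl_headers)],
        env=env,
    )

    # step 2: everything else still installs from wheels as before
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"])
```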
2018-10-03 14:11:30 +01:00
Alexey Bezhan
75940c9566 Pin all application requirements in requirements.txt
The list of top-level dependencies is moved to requirements-app.txt,
which is used by `make freeze-requirements` to generate the full
list of requirements in requirements.txt.

This is based on alphagov/digitalmarketplace-api#615, so rationale
from that PR applies here.

We had a problem with unpinned packages on new deployments leading
to failed tests (e.g. alphagov/notifications-admin#2144) which is
why we're implementing this now.

After re-evaluating pipenv again, this still seems like the least
disruptive approach:

* pyup.io has experimental support for Pipfile, but doesn't respect
  version ranges or update hashes in the lock file
* CloudFoundry buildpack recognizes and supports Pipfiles out of the
  box, but the support is relatively new. For example until recently
  CF would install dev packages during deployment. It's also based on
  generating a requirements file from the Pipfile, which doesn't
  properly support pinning VCS dependencies (eg it doesn't set the
  #egg= version, meaning pip will not upgrade the package if it's
  already installed).
* pipenv has a strict dependency resolution algorithm, which doesn't
  appear to be well documented and can cause some unexpected failures.
  For example, pipenv doesn't seem to be able to install `awscli-cwlogs`
  package at all, believing it to have a version conflict for `botocore`
  (which it doesn't list as a direct dependency) while neither `pip` nor
  `pip-tools` highlight any issues with it.
* While trying out `pipenv install` on our list of dependencies it would
  regularly fail to install utils with a "Will try again." message.
  While the installation succeeds after a retry, this doesn't inspire
  confidence.
* The switch to Pipfile and pipenv-managed virtualenvs requires a series
  of changes to `make` targets and scripts - replacing `pip install` with
  `pipenv`, removing references to requirements files and prefixing
  commands with `pipenv run`. While it's likely to simplify the overall
  process of managing dependencies, it would require time to properly
  implement across our applications and environments (Jenkins, PaaS,
  docker containers, and dev machines).
2018-07-10 14:59:04 +01:00
Alexey Bezhan
676e3ec39a Stream delivery worker logs to stdout when running on PaaS
Our application servers and celery workers write logs both to a
file that is shipped to CloudWatch and to stdout, which is picked
up by CloudFoundry and sent to Logit Logstash.

This works with gunicorn and single-worker celery deployments; however,
celery multi daemonizes worker processes, which detaches them from
stdout, so there's no log output in `cf logs` or Logit.

To fix this, we start a separate tail process to duplicate logs written
to a file to stdout, which should be picked up by CloudFoundry.
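
The fix itself is in the shell startup script, but it boils down to starting something like this alongside the daemonised workers (log path is illustrative):

```
import subprocess

def stream_logs_to_stdout(logfile="/home/vcap/logs/app.log"):
    # copy the file-based celery logs to stdout so CloudFoundry picks
    # them up, since celery multi detaches workers from stdout
    return subprocess.Popen(["tail", "-n", "0", "-F", logfile])
```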
2018-06-29 11:49:02 +01:00
Athanasios Voutsadakis
d365921093 Address PR comments
Make timeout 9s on both files
Use more descriptive variable name
Declare the logs dir as a constant
2018-03-06 16:11:42 +00:00
Athanasios Voutsadakis
3045f6233b Use continue instead of break on SIGKILL
`break`ing would keep trying to kill the same process and never move to
the next
2018-02-28 14:50:23 +00:00
Athanasios Voutsadakis
b158c66705 Set timeout to 9
This will allow us an extra second to `kill -9` any remaining processes.
2018-02-28 14:37:02 +00:00
Athanasios Voutsadakis
7d562f9e85 Get empty array when no .pid file exists 2018-02-28 14:36:40 +00:00
Athanasios Voutsadakis
1adee47230 Always get the latest PIDs to check 2018-02-28 14:36:06 +00:00
Athanasios Voutsadakis
138d4eee25 Better logging of PIDs array 2018-02-28 14:34:48 +00:00
Athanasios Voutsadakis
bef80e3414 Use eval to run the command instead of exec
`exec` replaces the current shell to run the command, which means that
the script execution stops at that line.

Passing it to the background with `exec "$@" &` won't work either,
because the script will move directly to the next command where it
looks for the `.pid` files that have not yet been created because celery
takes a few seconds to spin up all the processes.

Using `sleep X` to remedy this seems just wrong given that

1. we can use `eval`, which blocks until the command returns
2. there is no obvious benefit in sticking with `exec`
2018-02-28 14:16:15 +00:00
Athanasios Voutsadakis
9a8e771d95 Lower the termination timeout to 10s
Cloudfoundry only gives us 10s to terminate our processes:
https://docs.cloudfoundry.org/devguide/deploy-apps/app-lifecycle.html#shutdown
2018-02-27 17:44:16 +00:00
Athanasios Voutsadakis
7c36c3894d Add a start script for running celery multi
The existing script would not work with `celery multi` as it was trying
to put it in the background and then get its pid.

`celery multi` creates a number of worker processes and stores their
PIDs in files named celeryN.pid, where N is the index number of the worker
(starting at 1).
2018-02-27 17:37:09 +00:00
Chris Hill-Scott
de11751919 Merge pull request #1661 from alphagov/pytest-env
Automatically set environment vars before tests
2018-02-15 09:34:49 +00:00
Chris Hill-Scott
faa76755c3 Allow up to 10 tests to fail before stopping
This is a sensible compromise between 1 test and ALL THE TESTS.

Uses the `maxfail` option as documented here:
https://docs.pytest.org/en/latest/usage.html#stopping-after-the-first-or-n-failures
2018-02-14 13:27:06 +00:00
Chris Hill-Scott
845aad1183 Remove environment_test.sh
These config variables are now set in `pytest.ini` instead.
2018-02-14 12:19:13 +00:00
Chris Hill-Scott
e87ed0f68a Stop pytest on first failing test
If a PR is going to fail because tests aren’t passing then you:
- should know about it as quickly as possible
- shouldn’t waste precious Jenkins CPU running subsequent tests

This commit adds the `-x` flag to pytest, which stops the test run as
soon as one failing test is discovered.
2018-02-14 11:54:40 +00:00
Alexey Bezhan
40c27c6e70 Don't parse AWS credentials from VCAP_SERVICES in run_app_paas
AWS credentials are provided in the environment variables directly,
so we don't need to parse them from VCAP_SERVICES
2018-01-09 10:45:03 +00:00
Alexey Bezhan
0ad5c184c2 Only update existing manifest variables when adding env secrets
Changes generate manifest script to parse variables file as YAML
and only add variables to the manifest if they're already listed
in the `env` section.

This allows us to use a single variables file for all applications
and avoid duplicating secrets across multiple files while adding
only the relevant secrets to the application manifest.
2018-01-09 10:45:03 +00:00
Alexey Bezhan
da1a7ceb0d Add a script to generate PaaS manifest with environment variables
`./scripts/generate_manifest.py` takes a path to a PaaS manifest file
and a list of variable files and prints a single CloudFoundry manifest.

The generated manifest replaces all `inherit` keys by loading the data
from parent manifests. This allows us to pipe the script output directly
to CF CLI, without saving it to disk, which minimises the risk of it
being accidentally included in the deployment artefact. The combined
manifest might differ from the results produced by CF CLI itself, so
the original manifest shouldn't normally be used on its own.

After combining the manifests the script will load and parse all listed
variable files and add them to the manifest's `env` by merging the files
together in the order they were listed (so in case of any key conflicts
the latest file overwrites previous values), upper-casing keys and
processing any list or dictionary values with `json.dumps`, so that they
can be set as environment variables.

This gives us a full list of environment variables that were previously
parsed from the CloudFoundry user services data.
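
The merging behaviour described above looks roughly like this (a simplified sketch, not the script itself):

```
import json

def merge_variable_files(manifest_env, variable_files):
    """Merge already-parsed YAML variable files into a manifest's env.

    Later files win on key conflicts; keys are upper-cased and any
    list/dict values are JSON-encoded so they can be set as
    environment variables.
    """
    merged = {}
    for variables in variable_files:
        merged.update(variables)

    for key, value in merged.items():
        if isinstance(value, (list, dict)):
            value = json.dumps(value)
        manifest_env[key.upper()] = value

    return manifest_env
```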
2018-01-09 10:45:03 +00:00
Athanasios Voutsadakis
2470faf04b Remove unused scripts
These were referenced by appspec and have not been used since we
migrated to paas, so they can be removed.
2018-01-02 14:25:40 +00:00
Leo Hemsted
c6e6fad01f if apps crash on startup, then fail deploy process
we saw an issue where the app started, then immediately crashed due to
a setup error. However, jenkins had already returned positively, and
the deploy continued.

cf-deploy should fail if the app doesn't start up.

We do this by looking through the cloudfoundry events, and aborting
if there are any `app.crash` events for the new GUID.
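
A sketch of that check using the cf CLI; the real script filters the events for the newly pushed GUID rather than just the app name:

```
import subprocess
import sys

def fail_if_app_crashed(app_name):
    events = subprocess.run(["cf", "events", app_name],
                            capture_output=True, text=True, check=True).stdout

    if "app.crash" in events:
        print("{} crashed during startup, aborting deploy".format(app_name))
        sys.exit(1)
```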
2017-12-14 14:23:32 +00:00
Leo Hemsted
da468681f2 add warning to fix_migrations script for when the filename is different 2017-12-07 10:41:10 +00:00
Athanasios Voutsadakis
444876222e Touch log files in the initialization
The logs agent would complain about the missing gunicorn error log file.
2017-12-05 13:36:45 +00:00
Leo Hemsted
82acec87b6 flake8 the entire repo now that the tests are good 👌 2017-11-28 17:29:43 +00:00
Leo Hemsted
90e9a2f1b3 use flake8 instead of pycodestyle
since there are thousands and thousands of errors in the test files
at the moment, I propose fixing those errors in a separate PR for now.
2017-11-28 14:28:01 +00:00