When we get a callback from SES, we identify the notification by the
SES reference that we set on the notification after sending. When we
wrote the log message, we assumed that we'd always have a notification
for every callback, so if one couldn't be found we would log an error.
That assumption doesn't hold, for a few reasons:
* We might receive a callback before the sender worker has persisted
the reference to the database.
* We might have deleted the notification, especially if the service has
a short data retention period.
* We sometimes receive callbacks for references that we have no record
of whatsoever (this is quite alarming, but we have no way of knowing
why it happens).
The error logs were firing pretty frequently, and we don't have a
realistic way to fix the underlying causes at the moment, so let's cut
down on the noise and downgrade them to info level for now.
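A minimal sketch of the change in Python; `handle_ses_callback` and
`dao_get_notification_by_reference` are made-up names standing in for
the real callback task and DAO lookup:
```python
import logging

logger = logging.getLogger(__name__)


def dao_get_notification_by_reference(reference):
    """Stand-in for the real DAO lookup (hypothetical name)."""
    return None


def handle_ses_callback(reference):
    notification = dao_get_notification_by_reference(reference)
    if notification is None:
        # Previously this was logger.error; a missing notification is often
        # expected (a race with the sender worker persisting the reference,
        # a data retention purge, or a reference we have no record of), so
        # we downgrade it to info.
        logger.info("SES callback received for unknown reference %s", reference)
        return None
    # ... update the notification status as before ...
    return notification
```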
Step 1 of 2 of turning on folders for all services.
We think the feature will be useful to the majority of services, and
we've done enough research to be confident that it's mature enough to
release to everyone.
Previously, the check was too loose: checking `"name" in str(exc)`
returns false positives.
By changing from three if statements to a loop we can cut down on
unnecessary code (and ensure that the returned objects are consistent),
and by matching on the full check constraint name we can be sure that
we're only capturing exactly the right errors. Additionally, we no
longer return the original data in the error message - it's obvious
what the name is, because it will be populated in the form you just
filled in.
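Roughly, the pattern looks like the sketch below; the constraint names
and messages are placeholders, not the real ones from the model:
```python
CHECK_CONSTRAINT_MESSAGES = {
    # Placeholder constraint names and messages - the real ones come from
    # the check constraints defined on the model.
    "ck_example_name_not_null": "Enter a name",
    "ck_example_email_address_not_null": "Enter an email address",
    "ck_example_is_default": "Choose a default",
}


def handle_integrity_error(exc):
    for constraint_name, message in CHECK_CONSTRAINT_MESSAGES.items():
        # Match on the full constraint name rather than a substring like
        # "name", so we only capture exactly the right errors and return
        # the same shape of response for each one.
        if constraint_name in str(exc):
            return {"result": "error", "message": message}, 400
    raise exc
```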
However, until we can create a letter without a logo, we will still
default to hm-government, because the dvla_organisation is set on the
service. This does simplify the code.
Also removed the inserts to letter_branding in the data migration file,
because we can deploy this change before the rest of the work is
finished; we will need to add those inserts later.
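A rough sketch of the fallback, with made-up attribute and helper
names:
```python
from collections import namedtuple

Branding = namedtuple("Branding", "filename")
Service = namedtuple("Service", "letter_branding")

HM_GOVERNMENT = "hm-government"


def letter_logo_filename(service):
    # Until a letter can be created without a logo, fall back to the
    # default implied by the service's dvla_organisation, which today is
    # always hm-government.
    if service.letter_branding is not None:
        return service.letter_branding.filename
    return HM_GOVERNMENT


assert letter_logo_filename(Service(letter_branding=None)) == HM_GOVERNMENT
assert letter_logo_filename(Service(Branding("custom-logo"))) == "custom-logo"
```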
We use exec to start awslogs_agent, and then a tail to print logs to
stdout. The CF docs[1] recommend using exec to start processes, which
seems to imply that as long as there are commands running the container
will remain up and running.
This commit ensures that if there are no celery processes running we
will kill any other processes that we have started, so that the
container will no longer be considered healthy by Cloud Foundry and
will be replaced.
1: https://docs.cloudfoundry.org/devguide/deploy-apps/manifest.html#start-commands
In 4427827b2f, celery monitoring was changed from using PID files to
actually looking at processes.
If celery workers get OOM-killed (for instance), the container init
script would not restart them. This is because `get_celery_pids` would
not find any processes containing the string celery, which would cause
its pipe to fail (`-o pipefail`). `APP_PIDS` would not get updated, but
the script would continue to run, so it never restarted the celery
processes.
We think the correct behaviour when the celery processes are killed
(i.e. there are no more celery processes running in a container) is to
kill the container. The PaaS should then schedule new containers, which
may remediate whatever caused the celery processes to be killed.
When no running celery processes are detected, some diagnostic
information from the environment is sent to the logs, e.g.:
```
CF_INSTANCE_ADDR=10.0.32.4:61012
CF_INSTANCE_INTERNAL_IP=10.255.184.9
CF_INSTANCE_GUID=81c57dbc-e706-411e-6a5f-2013
CF_INSTANCE_PORT=61012
CF_INSTANCE_IP=10.0.32.4
```
Then the script (which is the container entrypoint) exits 1.
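The real entrypoint is a shell script, but the check is roughly
equivalent to this Python sketch (the process pattern and output format
are assumptions):
```python
import os
import subprocess
import sys


def celery_is_running():
    # pgrep exits non-zero when nothing matches - the same condition that
    # tripped up the original pipeline under `set -o pipefail`.
    result = subprocess.run(["pgrep", "-f", "celery"], capture_output=True)
    return result.returncode == 0


def main():
    if not celery_is_running():
        # Dump some diagnostic context, then exit non-zero so the PaaS
        # replaces the container.
        for key in sorted(os.environ):
            if key.startswith("CF_INSTANCE"):
                print(f"{key}={os.environ[key]}")
        sys.exit(1)


if __name__ == "__main__":
    main()
```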
Co-authors: @servingupaces, @tlwr