This was after we saw an instance of the API failing its healthcheck
even though it was still healthy enough to serve requests to users.
This follows the change we also made to template-preview and admin to
increase the health check timeout. Unlike those apps, where we set it to
10 seconds, we have been less lenient here and chosen only 2.5 seconds.
This was at the suggestion of Toby from PaaS: the api should generally
have quicker response times, and letting an instance stick around for 10
seconds while it was unable to serve requests successfully would cause
more annoyance for users.
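A minimal sketch of what this might look like in a Cloud Foundry manifest,
assuming the attribute being tuned is `health-check-invocation-timeout`; the
app name is illustrative and whether CF accepts a fractional value here
hasn't been checked:

```
applications:
  - name: notify-api                      # illustrative name
    health-check-type: http
    health-check-http-endpoint: /status
    # fail the check if a single request takes longer than 2.5 seconds
    health-check-invocation-timeout: 2.5
```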
these URLs never change, and it led to surprising issues where an
updated default MMG_URL wasn't actually respected on PaaS. These URLs
aren't private and don't need to be stored in credentials.
By not defining them in the manifest, we expect apps to use the defaults
unless `cf set-env` has been specifically used to override them on an app.
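For illustration only (the app name and the one variable shown are made up),
the idea is an env block that simply doesn't mention the provider URLs, with
`cf set-env` as the per-app escape hatch:

```
applications:
  - name: notify-api                  # illustrative
    env:
      # provider URLs such as MMG_URL are deliberately not set here, so the
      # defaults in the app's own config apply. To override on a single app:
      #   cf set-env notify-api MMG_URL https://example.com
      #   cf restage notify-api
      NOTIFY_ENVIRONMENT: preview     # made-up example of a var we do set
```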
we don't use it since we wrote our own provider stubs for performance
tests.
this removes it from the api - it's still in the DB and will be
retrieved by queries, but is set to disabled on prod
deploys take up to five minutes, during which notify-paas-autoscaler
can't scale the app. We saw 502s due to a large volume of traffic
coming in during that time, and we couldn't react because we were
deploying.
If we scale up to 25 instances, the autoscaler won't be able to scale
back down until after the deploy has finished.
- We are running the statsd-exporter on the PaaS now so we can use the
internal UDP route to talk to it
- Only updating in preview and staging for now, so that we can get the
  dashboards fully up to date before switching prod
- We are running a statsd exporter on tools to collect all our statsd
metrics for scraping by Prometheus
- Update preview to point there instead of at the local one, which has
  issues with redeployment and DNS changes
- We've been seeing an issue where, when traffic spikes, the http health
  checks take over 1s and PaaS kills the app
- Port health checks don't care about being stuck in a queue, so they
  should continue to work even under high load
- We have functional tests to catch if a deployment brings up the app
(and so passes port health check) but then doesn't work
- We are running statsd exporter as an app with a public route for
Prometheus to scrape
- This updates preview to send statsd metrics over the CF internal
  networking to the statsd exporter (see the sketch after this list)
- Removes the sidecar statsd exporters too
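A hypothetical env fragment for a worker showing what talking to the exporter
over the internal network could look like; the hostname, port and variable
names are assumptions rather than the real values (`apps.internal` is just
CF's default internal domain):

```
applications:
  - name: notify-delivery-worker                       # illustrative
    env:
      # internal route to the exporter app, reached over the container network (UDP)
      STATSD_HOST: notify-statsd-exporter.apps.internal
      STATSD_PORT: "8125"                              # port is a guess
```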
all apps get a route assigned when using v3-zdt-push.
> By default, the web process has a route and one instance. Other processes have zero instances by default.
([source](https://docs.cloudfoundry.org/devguide/multiple-processes.html))
When we push apps to multiple environments they need different routes
or the second push will fail, so this means that we need to define
routes ourselves for every app.
We're also manually flagging the health-check as either "http" or
"process" - http for the api, process for all others.
If not specified, the healthcheck is set to `port` by cloudfoundry - we've
seen some issues when upgrading the deployment from v2 to v3 using
port: it adds apps to the load balancer when they're not ready, which can
result in 404s. By setting the healthcheck to http it'll wait for the
/status endpoint to return 200, which in turn waits for flask to get
everything up and running properly.
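Roughly what that looks like per app, with a made-up hostname; only the
explicit route and the health check settings come from the text above:

```
applications:
  - name: notify-api                              # illustrative
    routes:
      - route: api.notify-preview.example.com     # must differ per environment
    health-check-type: http
    health-check-http-endpoint: /status           # only routed once this returns 200
```

Worker apps instead get `health-check-type: process`, since they have no
endpoint to poll.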
This is so that the retry-tasks queue, which can have quite a lot of
load, has its own worker, and other queues are paired with queues
that flow similarly (see the sketch after this list):
- letter-tasks with create-letters-pdf-tasks
- job-tasks with database-tasks
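A hypothetical sketch of the resulting worker definitions; the app names and
the celery command line are assumptions, only the queue groupings come from
the list above:

```
applications:
  - name: notify-delivery-worker-retry            # names are made up
    health-check-type: process
    command: celery -A run_celery worker -Q retry-tasks
  - name: notify-delivery-worker-letters
    health-check-type: process
    command: celery -A run_celery worker -Q letter-tasks,create-letters-pdf-tasks
  - name: notify-delivery-worker-jobs
    health-check-type: process
    command: celery -A run_celery worker -Q job-tasks,database-tasks
```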
Running `statsd_exporter` alongside the app process allows us to get
StatsD metrics pushed by workers to Prometheus.
This requires adding a route to the worker instances and binding the
RE prometheus discovery service. So this approach won't work for API
and admin since they already have `gunicorn` bound to the `$PORT`.
Since we're not ready to switch all apps to Prometheus metrics at once,
and we don't currently have a way to push statsd metrics to multiple
destinations, we're using a configuration setting in the manifest template
to switch individual workers over in specific environments.
`local_statsd` contains a list of environments where the app should
use local `statsd_exporter` for pushing statsd metrics instead of
HostedGraphite.
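A hypothetical fragment of the manifest template showing how the setting
might be consumed; apart from `local_statsd` itself, the variable names and
hostnames are made up:

```
    env:
{% if environment in local_statsd %}
      STATSD_HOST: localhost                      # the statsd_exporter running alongside the worker
{% else %}
      STATSD_HOST: statsd.hostedgraphite.com      # hostname is a guess
{% endif %}
```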
NOTIFY_APP_NAME follows precedent and just tries to strip 'notify-'
from the beginning of the string.
`instances` is left out of the manifest entirely if not defined - the app
will come up with the same number of instances as are currently present,
and then the autoscaler will take over anyway
newer versions of the cf api don't allow you to have multiple apps per
manifest file. So, instead of our current inheritance-based model, move
to the newer jinja-based model already used by doc-dl/antivirus/
template-preview.
the new single manifest.yml.j2 file sets a bunch of variables based on
the CF_APP variable - things like NOTIFY_APP_NAME, default instances,
etc. Then the manifest is built up to define all of the app options
based on these defaults. Things default to sensible values, which can
vary based on environment.
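As a rough sketch of the shape of such a template - names, numbers and
structure here are illustrative guesses, not the real manifest.yml.j2:

```
{# per-app defaults derived from CF_APP; everything below is illustrative #}
{% set NOTIFY_APP_NAME = CF_APP | replace('notify-', '', 1) %}
{% if CF_APP == 'notify-api' %}
  {% set default_instances = {'production': 20, 'preview': 2} %}
{% else %}
  {% set default_instances = {} %}
{% endif %}

applications:
  - name: {{ CF_APP }}
{% if environment in default_instances %}
    instances: {{ default_instances[environment] }}
{% endif %}
    env:
      NOTIFY_APP_NAME: {{ NOTIFY_APP_NAME }}
      NOTIFY_ENVIRONMENT: {{ environment }}
```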
When adding new environment variables, you'll need to add them to the
manifest file. If they're JSON-encoded lists, you'll need to pass them
back through the `tojson` filter, or jinja2 will print them as python
lists, with single quotes around strings.
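For example (the variable name is made up; the single quotes keep YAML
treating the rendered JSON as a string):

```
    env:
      # without tojson this renders as ['foo', 'bar'] - a python repr, not valid JSON
      ALLOWED_DOMAINS: '{{ allowed_domains | tojson }}'
```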