Remove http healthcheck for api instances

This was introduced in #1811 as a way to avoid sending traffic to newly
created apps where gunicorn had not started yet, as is the case during
a scaling event. These days we depend mostly on scheduled scaling and we
rarely need to scale above the scheduled values.

Yesterday we had an event where, during a traffic spike, the healthcheck
failed. This caused the instance to be killed, and every connection it
was handling at the time received a 5XX response.

However, this instance was not unhealthy and was serving traffic. The
problem stems from a combination of using async workers, having to limit
the number of database connections and a thread holding onto a db
connection for the entire duration of the request.

Specifically, we end up having requests queued up in gunicorn waiting
for other requests to finish and release the db connection. Some pages
such as the dashboard generate queries that can take >5s.

If a healthcheck request is sent during a traffic spike and the instance in
question is "unfortunate" enough to have been handed a few of these long
running queries, the healthcheck request will be queued up behind these
slow requests and will fail to receive a response within 1s [docs].
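
The queuing effect described above can be illustrated with a small
simulation. This is only a sketch under simplifying assumptions, not a
model of the real app: requests are served first-in first-out, each one
holds one slot from a fixed pool for its whole duration, and the
healthcheck waits for a slot like any other request. The function name,
pool size and durations are hypothetical.

```python
import heapq

def healthcheck_latency(pool_size, queued_durations, healthcheck_duration=0.01):
    """Seconds until a healthcheck arriving at t=0 completes.

    Assumes (for illustration only) FIFO ordering and that every request
    already queued ahead of the healthcheck holds one slot for its whole
    duration.
    """
    # Min-heap of the times at which each slot next becomes free.
    free_at = [0.0] * pool_size
    heapq.heapify(free_at)
    for duration in queued_durations:
        start = heapq.heappop(free_at)           # earliest free slot
        heapq.heappush(free_at, start + duration)
    start = heapq.heappop(free_at)               # slot the healthcheck gets
    return start + healthcheck_duration

TIMEOUT = 1.0  # the 1s limit mentioned above

quiet = healthcheck_latency(pool_size=2, queued_durations=[])
busy = healthcheck_latency(pool_size=2, queued_durations=[5.0, 5.0])
print("quiet instance passes:", quiet <= TIMEOUT)
print("busy instance passes:", busy <= TIMEOUT)
```

With two slots and two 5s requests ahead of it, the healthcheck cannot
start until a slot frees up, so it misses the 1s limit even though the
instance is still serving traffic.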

Ideally we should be able to configure the healthcheck timeout to a
value of our choosing, since we can end up in this situation again in
the future.

docs: https://docs.cloudfoundry.org/devguide/deploy-apps/healthchecks.html#types
Athanasios Voutsadakis
2018-09-18 10:34:00 +01:00
parent 39b01f145f
commit 2e12c68d2e

@@ -56,8 +56,6 @@ memory: 1G
applications:
- name: notify-api
-health-check-type: http
-health-check-http-endpoint: /_status?simple=true
- name: notify-api-db-migration
command: sleep infinity