Change HTTP health check invocation timeout to 10 seconds

We have seen multiple issues in production where health checks have
failed for our applications because the health check endpoint took
longer than 1 second (the default health check invocation timeout) to
respond. The instance was then marked as unhealthy and restarted. These
restarts have dropped in-flight requests and caused 502s for our users.
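
To illustrate the failure mode, here is a minimal sketch of an HTTP probe
with an invocation timeout. It is not the platform's actual health check
implementation. The `probe` and `start_slow_app` helpers, the 0.3 second
delay and the scaled-down timeouts are all hypothetical, chosen only so
the example runs quickly.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def probe(url, invocation_timeout):
    """Return True only if the endpoint answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=invocation_timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


class SlowStatusHandler(BaseHTTPRequestHandler):
    """Hypothetical app whose status endpoint is slow but still working."""

    delay_seconds = 0.3  # assumed delay, purely for illustration

    def do_GET(self):
        time.sleep(self.delay_seconds)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        except OSError:
            pass  # the probe already gave up and closed the connection

    def log_message(self, *args):
        pass  # keep the example quiet


def start_slow_app():
    """Start the slow app on a free local port and return (server, url)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), SlowStatusHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}/_status?simple=true"


if __name__ == "__main__":
    server, url = start_slow_app()
    # A short timeout reports the slow-but-alive app as unhealthy.
    print("short timeout healthy:", probe(url, invocation_timeout=0.1))
    # A longer timeout lets the same app pass the check.
    print("long timeout healthy:", probe(url, invocation_timeout=2.0))
    server.shutdown()
```

The same app passes or fails depending only on the timeout, which is the
situation described above: an instance that can still serve requests is
treated as unhealthy because its response arrived too late.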

We are not entirely sure why the health checks sometimes take longer
than expected. One hypothesis is that large amounts of traffic slow the
response times of the apps. However, we have also seen contradictory
evidence: health checks can still fail even when apps are getting very
low levels of traffic. There could also be an issue with the health
check process itself.

Regardless of the cause, we think that changing the timeout to 10
seconds might stop our apps being restarted when they are in fact still
healthy enough to serve requests to users. The PaaS team will also
investigate the health check process itself to see if this sheds any
more light on the situation.

10 seconds was a fairly arbitrary choice, picked to be significantly
longer than 1 second.
David McDonald
2019-11-28 13:30:12 +00:00
parent b24a59f3ad
commit f8dc3936fc

@@ -27,6 +27,7 @@ applications:
health-check-type: http
health-check-http-endpoint: '/_status?simple=true'
+health-check-invocation-timeout: 10
services:
- logit-ssl-syslog-drain