mirror of
https://github.com/GSA/notifications-api.git
synced 2025-12-09 06:32:11 -05:00
disable prometheus writing to files from celery apps
## The existing situation

To support multiple processes and eventlets recording metrics in parallel, prometheus uses files to store metrics. When you write a metric from a multiprocess app, it writes to a file. Prometheus identifies whether your app is multiprocess by checking for a `prometheus_multiproc_dir` environment variable (in either upper or lower case). Prometheus reads this variable at module level (i.e. at import time).

Assuming it will always be used within a web server, the gds_metrics library auto-sets this variable to `/tmp` on import, to ensure that prometheus will always be set up correctly.

We also have a variety of metrics set up when we create the app. These are generally sensible metrics, such as counting the number of database connections in use by measuring sqlalchemy connection events.

## The problem

We have seen our notify-delivery-worker-reporting app run out of disk space. The CELERYD_MAX_TASKS_PER_CHILD flag is set on that app, which restarts each worker process every time a task runs (to avoid memory issues). However, we've recently massively decreased the size and increased the number of tasks to parallelise nightly tasks. Each time a worker process restarts, it writes a new file to disk. This meant that we quickly ran out of disk space, and then the entire app instance was killed.

The big rub is that we don't log prometheus metrics from our worker apps! They don't expose an endpoint, so there's no way to scrape them, and we aren't getting any value from prometheus there anyway. But because they use the same codebase, they import gds_metrics and get the file-writing behaviour regardless.

## The solution

gds_metrics sets the multiproc env var. However, by importing prometheus FIRST we ensure that the env var is still unset at the point prometheus reads it, and thus prometheus will harmlessly store the metrics in memory.

So that notify-api still runs with the env var set, and its stats are shared across all the gunicorn processes, we only add this import to run_celery.py, ahead of the app imports.
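The import-order effect described above can be illustrated with a minimal, stdlib-only sketch. It does not use the real prometheus_client or gds_metrics packages: `FakeMetricsBackend`, `fake_gds_metrics_import`, and `simulate` are hypothetical stand-ins, and the only assumption carried over from the commit message is that one library reads the variable once at import time while the other sets it as an import side effect.

```python
import os

ENV_VAR = "prometheus_multiproc_dir"
ENV_NAMES = (ENV_VAR, ENV_VAR.upper())


class FakeMetricsBackend:
    """Hypothetical stand-in for a library that picks its storage mode once."""

    def __init__(self):
        # Checked a single time, the way a module-level statement would be.
        self.multiprocess = any(name in os.environ for name in ENV_NAMES)


def fake_gds_metrics_import():
    """Hypothetical stand-in for a wrapper that sets the variable on import."""
    os.environ.setdefault(ENV_VAR, "/tmp")


def simulate(backend_first: bool) -> bool:
    """Return True if the backend ends up in file-backed (multiprocess) mode."""
    # Start from a clean environment and remember what was there before.
    saved = {name: os.environ.pop(name, None) for name in ENV_NAMES}
    try:
        if backend_first:
            backend = FakeMetricsBackend()  # variable not set yet
            fake_gds_metrics_import()       # setting it now is too late
        else:
            fake_gds_metrics_import()       # variable set first
            backend = FakeMetricsBackend()  # backend sees it
        return backend.multiprocess
    finally:
        # Remove anything the simulation set, then restore the old values.
        for name in ENV_NAMES:
            os.environ.pop(name, None)
        for name, value in saved.items():
            if value is not None:
                os.environ[name] = value


if __name__ == "__main__":
    print("backend imported first:", simulate(backend_first=True))
    print("wrapper imported first:", simulate(backend_first=False))
```

In this sketch, creating the backend first corresponds to the celery entry point (in-memory metrics), while the wrapper-first order corresponds to the web app, where the variable is already set when the backend looks for it.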
@@ -2,6 +2,11 @@

 from flask import Flask

+# import prometheus before any other code. If gds_metrics is imported first it will write a prometheus file to disk
+# that will never be read from (since we don't have prometheus celery stats). If prometheus is imported first,
+# prometheus will simply store the metrics in memory
+import prometheus_client  # noqa
+
 # notify_celery is referenced from manifest_delivery_base.yml, and cannot be removed
 from app import notify_celery, create_app  # noqa
