From 06807cf84bcb40840283cbaeb9463148951c4dee Mon Sep 17 00:00:00 2001
From: Kenneth Kehl <@kkehl@flexion.us>
Date: Mon, 25 Nov 2024 11:57:06 -0800
Subject: [PATCH 1/3] change celery pool support from prefork to threads

---
 Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index acd31f390..ae79b930c 100644
--- a/Makefile
+++ b/Makefile
@@ -50,7 +50,8 @@ run-celery: ## Run celery, TODO remove purge for staging/prod
 		-A run_celery.notify_celery worker \
 		--pidfile="/tmp/celery.pid" \
 		--loglevel=INFO \
-		--concurrency=4
+		--pool=threads \
+		--concurrency=10
 
 
 .PHONY: dead-code

From 738b4f063e647241769e73edf6d75c9db487bd54 Mon Sep 17 00:00:00 2001
From: Kenneth Kehl <@kkehl@flexion.us>
Date: Mon, 25 Nov 2024 12:17:38 -0800
Subject: [PATCH 2/3] add adr

---
 ...0-adr-celery-pool-support-best-practice.md | 23 +++++++++++++++++++
 1 file changed, 23 insertions(+)
 create mode 100644 docs/adrs/0010-adr-celery-pool-support-best-practice.md

diff --git a/docs/adrs/0010-adr-celery-pool-support-best-practice.md b/docs/adrs/0010-adr-celery-pool-support-best-practice.md
new file mode 100644
index 000000000..b8525e654
--- /dev/null
+++ b/docs/adrs/0010-adr-celery-pool-support-best-practice.md
@@ -0,0 +1,23 @@
+# Make best use of celery worker pools
+
+Status: N/A
+Date: N/A
+
+### Context
+Our API application started with the celery `prefork` pool (the default) and a concurrency of 4. We repeatedly encountered instability, which we initially attributed to a resource leak. As a result, we added the configuration `worker-max-tasks-per-child=500`, which is a best practice. When we ran a load test of 25000 simulated messages, however, we continued to see stability issues: the app crashed after 4 hours and required a restage. After running `cf app notify-api-production` and observing that `cpu entitlement` was off the charts at 10000% to 12000% for the workers, and after doing some further reading, we concluded that `prefork` is perhaps not the best pool type for the API application.
+
+The problem with `prefork` is that each process has a tendency to hang onto the CPU allocated to it, even if it is not being used. Our application is not computationally intensive and largely consists of downloading strings from S3, parsing the strings, and sending them out as SMS messages. Based on the determination that our app is likely I/O bound, we elected to run an experiment in which we changed the pool to `threads` and increased concurrency to `10`. We expect memory usage and CPU usage to decrease and the app to remain available.
+
+### Decision
+
+### Consequences
+
+### Author
+@kenkehl
+
+### Stakeholders
+@ccostino
+@stvnrlly
+
+### Next Steps
+- Run an after-hours load test with production configured to `--pool=threads` and `--concurrency=10` (concurrency can be cautiously increased once we know it works)

From 8c6f7ede0bb0043297f57c95fc0a841ea95ada0d Mon Sep 17 00:00:00 2001
From: Kenneth Kehl <@kkehl@flexion.us>
Date: Mon, 25 Nov 2024 15:26:50 -0800
Subject: [PATCH 3/3] oops add pool=threads to manifest

---
 manifest.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/manifest.yml b/manifest.yml
index a8a3e7f2b..39e842730 100644
--- a/manifest.yml
+++ b/manifest.yml
@@ -26,7 +26,7 @@ applications:
       - type: worker
         instances: ((worker_instances))
         memory: ((worker_memory))
-        command: newrelic-admin run-program celery -A run_celery.notify_celery worker --loglevel=INFO --concurrency=4
+        command: newrelic-admin run-program celery -A run_celery.notify_celery worker --loglevel=INFO --pool=threads --concurrency=10
       - type: scheduler
         instances: 1
         memory: ((scheduler_memory))
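
Reviewer note on the ADR's I/O-bound reasoning: the sketch below is not taken from the repository. It is a minimal, hedged illustration of why raising thread concurrency from 4 to 10 could help when tasks spend most of their time waiting on the network. `send_message` and `run_batch` are hypothetical names, and `time.sleep` stands in for the S3 download and SMS send described in the ADR. It uses the standard library's `ThreadPoolExecutor` only as an analogy for a thread-based worker pool; real Celery behavior under load still needs the after-hours test listed in Next Steps.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def send_message(message_id: int, io_delay: float = 0.05) -> int:
    """Hypothetical stand-in for one I/O-bound task (download, parse, send)."""
    # Sleeping releases the GIL, much like a blocking network call would,
    # so other threads can make progress while this one waits.
    time.sleep(io_delay)
    return message_id


def run_batch(num_messages: int, concurrency: int) -> float:
    """Process num_messages simulated tasks and return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(send_message, range(num_messages)))
    # pool.map preserves input order, so every message is accounted for.
    assert results == list(range(num_messages))
    return time.perf_counter() - start


if __name__ == "__main__":
    for concurrency in (4, 10):
        elapsed = run_batch(40, concurrency)
        print(f"concurrency={concurrency}: {elapsed:.2f}s")
```

Because the simulated tasks only wait, the batch with 10 threads should finish sooner than the batch with 4. A CPU-bound workload would not show the same gain under threads, which is why the ADR's assumption that the app is I/O bound matters.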