From 5d88a1dbf46585070593d209d68e162a43e3a5a0 Mon Sep 17 00:00:00 2001
From: David McDonald
Date: Mon, 27 Apr 2020 15:12:44 +0100
Subject: [PATCH 1/2] Add -Ofair setting to reporting celery app

What this setting does is best described in
https://medium.com/@taylorhughes/three-quick-tips-from-two-years-with-celery-c05ff9d7f9eb#d7ec

This should be useful for the reporting app because the tasks it runs are
long running (many seconds). Ideally this change will mean that we are
quicker to process the overnight reporting tasks, so they all finish
earlier in the morning (although no individual task will be quicker).

We are only setting this on the reporting celery app because this change
is trying to do the minimum possible to improve the reliability and speed
of our overnight reporting tasks. It may well be useful to set this flag
on all our apps, but that should be done with more consideration, as some
of them deal with much faster tasks (sub 0.5s) for which the flag may or
may not be appropriate. Proper investigation would be needed.

Note, the celery docs on this are also worth a read:
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit.
However, the language there can confuse this setting with the prefetch
limit. The distinction is that prefetching grabs items off the queue,
whereas -Ofair controls what happens once items have already been
prefetched: whether the master celery process hands them straight to the
child (worker) processes or not.

Note, this behaviour is the default from celery version 4 onwards, but we
are still on version 3.1.26 so we have to enable it ourselves.
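The hand-off difference is easiest to see in a toy model. The sketch below is illustrative only, not Celery's actual internals: `assign_default`, `assign_fair`, and the worker names are made up for this example.

```python
# Toy model of task hand-off inside one worker instance.
# Default behaviour: the master hands prefetched tasks to child processes
# round-robin, even if a child is still busy with a long task.
# -Ofair behaviour: a prefetched task is only handed to an idle child;
# everything else stays pending until a child frees up.

def assign_default(tasks, workers):
    """Round-robin: task i goes to worker i % n, busy or not."""
    queues = {w: [] for w in workers}
    for i, task in enumerate(tasks):
        queues[workers[i % len(workers)]].append(task)
    return queues

def assign_fair(tasks, workers, busy):
    """Fair: hand a task only to an idle worker; the rest stay pending."""
    queues = {w: [] for w in workers}
    pending = []
    for task in tasks:
        idle = [w for w in workers if w not in busy and not queues[w]]
        if idle:
            queues[idle[0]].append(task)
        else:
            pending.append(task)
    return queues, pending

# With w1 stuck on a long task, round-robin still queues work behind it,
# while the fair scheme leaves that work pending for whoever frees first.
print(assign_default(["t1", "t2", "t3"], ["w1", "w2"]))
print(assign_fair(["t1", "t2", "t3"], ["w1", "w2"], busy={"w1"}))
```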
---
 scripts/paas_app_wrapper.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/paas_app_wrapper.sh b/scripts/paas_app_wrapper.sh
index d574694f9..b855d9192 100755
--- a/scripts/paas_app_wrapper.sh
+++ b/scripts/paas_app_wrapper.sh
@@ -29,7 +29,7 @@ case $NOTIFY_APP_NAME in
       -Q periodic-tasks 2> /dev/null
     ;;
   delivery-worker-reporting)
-    exec scripts/run_app_paas.sh celery -A run_celery.notify_celery worker --loglevel=INFO --concurrency=11 \
+    exec scripts/run_app_paas.sh celery -A run_celery.notify_celery worker --loglevel=INFO --concurrency=11 -Ofair \
       -Q reporting-tasks 2> /dev/null
     ;;
   delivery-worker-priority)

From a237162106b1ca0c304e8d92f8f8184a289426a6 Mon Sep 17 00:00:00 2001
From: David McDonald
Date: Mon, 27 Apr 2020 15:42:48 +0100
Subject: [PATCH 2/2] Reduce concurrency and prefetch count of reporting
 celery app

We have seen the reporting app run out of memory multiple times when
dealing with overnight tasks. The app runs 11 worker processes; we reduce
this to 2 to put less pressure on a single instance. The number 2 was
chosen because most of the tasks processed by the reporting app take only
a few minutes, and usually only one or two take more than an hour. With 2
processes across each of our current 2 instances, a long running task
should only have to wait behind a few short running tasks before being
picked up, so we shouldn't see a large increase in the overall time taken
to run all our overnight reporting tasks.

On top of reducing the concurrency for the reporting app, we also set
CELERYD_PREFETCH_MULTIPLIER=1. We do this as suggested by the celery docs
because this app deals with long running tasks.
https://docs.celeryproject.org/en/3.1/userguide/optimizing.html#optimizing-prefetch-limit

The change in prefetch multiplier should again optimise the overall time
it takes to process our tasks, by ensuring that tasks are given to
instances that have (or will soon have) spare workers to deal with them,
rather than committing all the tasks to certain workers in advance.

Note, another optimisation suggested by the docs is to set `ACKS_LATE` on
the long running tasks. This setting would effectively change us from
prefetching 1 task per worker to prefetching 0 tasks per worker, and
further optimise how we distribute our tasks across instances. However,
we decided not to try this setting as we weren't sure whether it would
conflict with our visibility_timeout. We decided not to spend the time
investigating, but it may be worth revisiting in the future, as long as
tasks are idempotent.

Overall, this commit takes us from potentially having all 18 of our
reporting tasks fetched onto a single instance to a process that ensures
tasks are distributed more fairly across instances, based on when they
have available workers to process them.
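The arithmetic behind that claim can be sketched as follows, assuming Celery 3.1's default prefetch multiplier of 4; `prefetch_capacity` is a made-up helper for illustration, not a Celery API.

```python
# Back-of-envelope sketch of why the old settings let one instance grab
# every overnight task: each worker instance reserves roughly
# concurrency * prefetch_multiplier tasks from the queue.

def prefetch_capacity(concurrency, multiplier):
    """Approximate number of tasks a single worker instance will reserve."""
    return concurrency * multiplier

old = prefetch_capacity(11, 4)  # old settings: 44 tasks reserved
new = prefetch_capacity(2, 1)   # new settings: 2 tasks reserved

# With 18 reporting tasks, one instance could previously reserve them all;
# now it reserves at most 2 at a time, leaving the rest on the queue for
# other instances to pick up.
print(old >= 18, new >= 18)  # → True False
```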
---
 app/config.py               | 3 +++
 manifest.yml.j2             | 7 ++++++-
 scripts/paas_app_wrapper.sh | 2 +-
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/app/config.py b/app/config.py
index 3e38d8521..42902f8e1 100644
--- a/app/config.py
+++ b/app/config.py
@@ -176,6 +176,9 @@ class Config(object):
     CELERY_TASK_SERIALIZER = 'json'
     # on reporting worker, restart workers after each task is executed to help prevent memory leaks
     CELERYD_MAX_TASKS_PER_CHILD = os.getenv('CELERYD_MAX_TASKS_PER_CHILD')
+    # we can set celeryd_prefetch_multiplier to 1 for celery apps which handle only long running tasks
+    if os.getenv('CELERYD_PREFETCH_MULTIPLIER'):
+        CELERYD_PREFETCH_MULTIPLIER = int(os.getenv('CELERYD_PREFETCH_MULTIPLIER'))
     CELERY_IMPORTS = (
         'app.celery.tasks',
         'app.celery.scheduled_tasks',
diff --git a/manifest.yml.j2 b/manifest.yml.j2
index 918b4b50e..80c991309 100644
--- a/manifest.yml.j2
+++ b/manifest.yml.j2
@@ -30,7 +30,12 @@
     'notify-delivery-worker-research': {},
     'notify-delivery-worker-sender': {'disk_quota': '2G', 'memory': '3G'},
     'notify-delivery-worker-periodic': {},
-    'notify-delivery-worker-reporting': {'additional_env_vars': {'CELERYD_MAX_TASKS_PER_CHILD': 1}},
+    'notify-delivery-worker-reporting': {
+        'additional_env_vars': {
+            'CELERYD_MAX_TASKS_PER_CHILD': 1,
+            'CELERYD_PREFETCH_MULTIPLIER': 1,
+        }
+    },
     'notify-delivery-worker-priority': {},
     'notify-delivery-worker-letters': {},
     'notify-delivery-worker-retry-tasks': {},
diff --git a/scripts/paas_app_wrapper.sh b/scripts/paas_app_wrapper.sh
index b855d9192..9e4c7f3e0 100755
--- a/scripts/paas_app_wrapper.sh
+++ b/scripts/paas_app_wrapper.sh
@@ -29,7 +29,7 @@ case $NOTIFY_APP_NAME in
       -Q periodic-tasks 2> /dev/null
     ;;
   delivery-worker-reporting)
-    exec scripts/run_app_paas.sh celery -A run_celery.notify_celery worker --loglevel=INFO --concurrency=11 -Ofair \
+    exec scripts/run_app_paas.sh celery -A run_celery.notify_celery worker --loglevel=INFO --concurrency=2 -Ofair \
       -Q reporting-tasks 2> /dev/null
     ;;
delivery-worker-priority)