Celery: running a single celery beat + multiple celery workers at scale - Kubernetes

I have a single celery beat instance running with:
celery -A app:celery beat --loglevel=DEBUG
and three workers running with:
celery -A app:celery worker -E --loglevel=ERROR -n n1
celery -A app:celery worker -E --loglevel=ERROR -n n2
celery -A app:celery worker -E --loglevel=ERROR -n n3
The same Redis DB is used as the message broker for all workers and beat.
All workers are started on the same machine for development purposes, while in production they will be deployed to different Kubernetes pods. The main idea of using multiple workers is to distribute 50-150 tasks between different Kube pods, each running on a 4-8 core machine. We expect that no pod will take more tasks than it has cores as long as any worker exists with fewer tasks than available cores, so that the maximum number of tasks is executed concurrently.
So I am having trouble testing it locally.
Here the local beat triggers three tasks:
[2021-08-23 21:35:32,700: DEBUG/MainProcess] Current schedule:
<ScheduleEntry: task-5872-accrual Task5872Accrual() <crontab: 36 21 * * * (m/h/d/dM/MY)>
<ScheduleEntry: task-5872-accrual2 Task5872Accrual2() <crontab: 37 21 * * * (m/h/d/dM/MY)>
<ScheduleEntry: task-5872-accrual3 Task5872Accrual3() <crontab: 38 21 * * * (m/h/d/dM/MY)>
[2021-08-23 21:35:32,700: DEBUG/MainProcess] beat: Ticking with max interval->5.00 minutes
[2021-08-23 21:35:32,701: DEBUG/MainProcess] beat: Waking up in 27.29 seconds.
[2021-08-23 21:36:00,017: DEBUG/MainProcess] beat: Synchronizing schedule...
[2021-08-23 21:36:00,026: INFO/MainProcess] Scheduler: Sending due task task-5872-accrual (Task5872Accrual)
[2021-08-23 21:36:00,035: DEBUG/MainProcess] Task5872Accrual sent. id->96e671f8-bd07-4c36-a595-b963659bee5c
[2021-08-23 21:36:00,035: DEBUG/MainProcess] beat: Waking up in 59.95 seconds.
[2021-08-23 21:37:00,041: INFO/MainProcess] Scheduler: Sending due task task-5872-accrual2 (Task5872Accrual2)
[2021-08-23 21:37:00,043: DEBUG/MainProcess] Task5872Accrual2 sent. id->532eac4d-1d10-4117-9d7e-16b3f1ae7aee
[2021-08-23 21:37:00,043: DEBUG/MainProcess] beat: Waking up in 59.95 seconds.
[2021-08-23 21:38:00,027: INFO/MainProcess] Scheduler: Sending due task task-5872-accrual3 (Task5872Accrual3)
[2021-08-23 21:38:00,029: DEBUG/MainProcess] Task5872Accrual3 sent. id->68729b64-807d-4e13-8147-0b372ce536af
[2021-08-23 21:38:00,029: DEBUG/MainProcess] beat: Waking up in 5.00 minutes.
I expected that each worker would take a single task to balance the load between workers, but unfortunately that is not how they were distributed.
So I am not sure whether different workers synchronize with each other to distribute the load between them smoothly. If not, can I achieve that somehow? I tried searching Google, but the results are mostly about concurrency between tasks in a single worker; what should I do if I need to run more tasks concurrently than a single machine in the Kube cluster has cores for?

You should do two things in order to achieve what you want:
Run workers with the -O fair option. Example: celery -A app:celery worker -E --loglevel=ERROR -n n1 -O fair
Make workers prefetch as little as possible with worker_prefetch_multiplier=1 in your config.
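For example, a minimal sketch of that setting (the celeryconfig.py module name and the task_acks_late line are illustrative assumptions, not part of the answer above):

# celeryconfig.py - hypothetical config module, loaded e.g. via app.config_from_object("celeryconfig")
# With a prefetch multiplier of 1, each worker process reserves only one message
# at a time, so an idle worker (or idle pod) can pick up a waiting task instead of
# it sitting prefetched behind a busy worker.
worker_prefetch_multiplier = 1
# Often combined with low prefetch: acknowledge a task only after it finishes,
# so unfinished work can be redelivered to another worker if needed.
task_acks_late = True

Together with -O fair, tasks are handed out only to worker processes that are actually free, which is what lets the 50-150 tasks spread across the pods instead of piling up on one worker.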

Related

Airflow Webserver Shutting down

My Airflow webserver shut down abruptly around the same time, at about 16:37 GMT.
My Airflow scheduler runs fine (no crash) and tasks still run.
There is not much in the logs except:
Handling signal: ttou
Worker exiting (pid: 118711)
ERROR - No response from gunicorn master within 120 seconds
ERROR - Shutting down webserver
Handling signal: term
Worker exiting
Worker exiting
Worker exiting
Worker exiting
Worker exiting
Shutting down: Master
Is memory the cause?
My cfg settings for the webserver are standard:
# Number of seconds the webserver waits before killing gunicorn master that doesn't respond
web_server_master_timeout = 120
# Number of seconds the gunicorn webserver waits before timing out on a worker
web_server_worker_timeout = 120
# Number of workers to refresh at a time. When set to 0, worker refresh is
# disabled. When nonzero, airflow periodically refreshes webserver workers by
# bringing up new ones and killing old ones.
worker_refresh_batch_size = 1
# Number of seconds to wait before refreshing a batch of workers.
worker_refresh_interval = 30
Update:
OK, it doesn't crash every day, but today I got a log showing gunicorn was unable to restart the workers:
ERROR - [0/0] Some workers seem to have died and gunicorn did not restart them as expected
Update: 30 October 2020
[CRITICAL] WORKER TIMEOUT (pid:108237)
I am getting this even though I have increased the timeout to 240, twice the default value.
Does anyone know why this keeps arising?
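For reference, the timeout increase mentioned in the update would look like this in airflow.cfg (which of the two keys was raised is not stated, so showing both is an assumption):

# airflow.cfg - webserver timeouts raised to 240 seconds, twice the default
web_server_master_timeout = 240
web_server_worker_timeout = 240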

How to remove all due tasks from celery scheduler DatabaseScheduler

My project has a lot of pending tasks like task.com-43 that get executed every 5 seconds. I want to remove all my pending tasks.
→ celery -A Project beat --loglevel=debug --scheduler django_celery_beat.schedulers:DatabaseScheduler
celery beat v4.2.1 (windowlicker) is starting.
__ - ... __ - _
LocalTime -> 2018-12-30 08:44:30
Configuration ->
. broker -> redis://localhost:6379//
. loader -> celery.loaders.app.AppLoader
. scheduler -> django_celery_beat.schedulers.DatabaseScheduler
. logfile -> [stderr]@%DEBUG
. maxinterval -> 5.00 seconds (5s)
[2018-12-30 08:44:30,310: DEBUG/MainProcess] Setting default socket timeout to 30
[2018-12-30 08:44:30,311: INFO/MainProcess] beat: Starting...
[2018-12-30 08:44:30,312: DEBUG/MainProcess] DatabaseScheduler: initial read
[2018-12-30 08:44:30,312: INFO/MainProcess] Writing entries...
[2018-12-30 08:44:30,312: DEBUG/MainProcess] DatabaseScheduler: Fetching database schedule
[2018-12-30 08:44:30,348: DEBUG/MainProcess] Current schedule:
[2018-12-30 08:44:30,418: INFO/MainProcess] Scheduler: Sending due task task5.com-43 (project_monitor_tasks)
[2018-12-30 08:44:30,438: DEBUG/MainProcess] beat: Synchronizing schedule...
[2018-12-30 08:44:30,438: INFO/MainProcess] Writing entries...
[2018-12-30 08:44:30,455: DEBUG/MainProcess] project_monitor_tasks sent. id->d440432f-111d-4c96-ab4f-00923f4cf7e1
[2018-12-30 08:44:30,464: DEBUG/MainProcess] beat: Waking up in 4.93 seconds.
[2018-12-30 08:44:35,413: INFO/MainProcess] Scheduler: Sending due task task.com-43 (project_monitor_tasks)
[2018-12-30 08:44:35,414: DEBUG/MainProcess] project_monitor_tasks sent. id->ff0438ce-9fb9-4ab0-aa8a-8a7636c67d90
[2018-12-30 08:44:35,424: DEBUG/MainProcess] beat: Waking up in 4.98 seconds.
[2018-12-30 08:44:40,419: INFO/MainProcess] Scheduler: Sending due task task.com-43 (project_monitor_tasks)
[2018-12-30 08:44:40,420: DEBUG/MainProcess] project_monitor_tasks sent. id->d0022780-7d5f-4e7b-965e-9fda0d607cbe
[2018-12-30 08:44:40,431: DEBUG/MainProcess] beat: Waking up in 4.98 seconds.
[2018-12-30 08:44:45,425: INFO/MainProcess] Scheduler: Sending due task task.com-43 (project_monitor_tasks)
[2018-12-30 08:44:45,427: DEBUG/MainProcess] project_monitor_tasks sent. id->9b3eb775-60d5-4daa-a019-e0dfae932380
[2018-12-30 08:44:45,439: DEBUG/MainProcess] beat: Waking up in 4.98 seconds.
....
....
I'm using Redis as the backend database for the project tasks. I tried purging Celery and flushing Redis, but it is still executing all the pending tasks.
ps auxww | grep 'celery worker' | awk '{print $2}' | xargs kill -9 ## Stopping all workers first
celery -A project purge  ## Discarding all messages waiting in the broker
redis-cli FLUSHALL  ## Wiping everything stored in Redis
service redis-server restart
One way you can remove all the tasks is by deleting them from the Periodic Task models, but first stop all your workers and purge all project tasks.
The answer to the question is here:
https://stackoverflow.com/a/33047721/10372434
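A minimal sketch of that approach through the Django ORM (assuming django_celery_beat is installed, as the --scheduler option above implies; stop the workers and purge the queue first):

# Run inside the Django project, e.g. in python manage.py shell
from django_celery_beat.models import PeriodicTask

# Deleting the entries removes the schedule, so beat stops sending these tasks.
PeriodicTask.objects.all().delete()

# Alternatively, keep the entries but disable them:
# PeriodicTask.objects.update(enabled=False)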

celery beat instantly stopping with resource error

The celery beat log looks like this; after the last line it just stops and never continues or recovers.
[2018-08-20 11:20:59,002: INFO/MainProcess] Scheduler: Sending due task check result delays every 10sec (notify_delay)
[2018-08-20 11:21:00,000: INFO/MainProcess] Scheduler: Sending due task load abnormal schedules (load_abnormal_schedules)
[2018-08-20 11:21:00,004: INFO/MainProcess] Scheduler: Sending due task check close schedule every 5sec (close_schedule)
[2018-08-20 11:21:05,000: INFO/MainProcess] Scheduler: Sending due task check close schedule every 5sec (close_schedule)
[2018-08-20 11:21:10,000: INFO/MainProcess] Scheduler: Sending due task check close schedule every 5sec (close_schedule)
[2018-08-20 11:21:14,002: INFO/MainProcess] Scheduler: Sending due task check result delays every 10sec (notify_delay)
[2018-08-20 11:21:15,000: INFO/MainProcess] Scheduler: Sending due task load abnormal schedules (load_abnormal_schedules)
[2018-08-20 11:21:15,003: INFO/MainProcess] Scheduler: Sending due task check close schedule every 5sec (close_schedule)
[2018-08-20 11:21:20,000: INFO/MainProcess] Scheduler: Sending due task check close schedule every 5sec (close_schedule)
[2018-08-20 11:21:25,000: INFO/MainProcess] Scheduler: Sending due task check close schedule every 5sec (close_schedule)
[2018-08-20 11:21:29,003: INFO/MainProcess] Scheduler: Sending due task check result delays every 10sec (notify_delay)
It runs inside a Docker container. When I checked via top, it showed a high CPU percentage:
120549 root 20 0 356016 150144 16388 S 23.4 1.0 3:36.33 celery
Then, when I ssh into the container and try the celery beat command, the error below is returned:
root@4a298cc9c6e2:/usr/src/app# celery -A ghost beat -l info --pidfile=
celery beat v4.2.0 (windowlicker) is starting.
__ - ... __ - _
LocalTime -> 2018-08-20 11:32:51
Configuration ->
. broker -> amqp://ghost:**@ghost-rabbitmq:5672/ghost
. loader -> celery.loaders.app.AppLoader
. scheduler -> celery.beat.PersistentScheduler
. db -> celerybeat-schedule
. logfile -> [stderr]@%INFO
. maxinterval -> 5.00 minutes (300s)
[2018-08-20 11:32:51,526: INFO/MainProcess] beat: Starting...
[2018-08-20 11:32:51,535: ERROR/MainProcess] Removing corrupted schedule file 'celerybeat-schedule': error(11, 'Resource temporarily unavailable')
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/kombu/utils/objects.py", line 42, in __get__
return obj.__dict__[self.__name__]
KeyError: 'scheduler'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 476, in setup_schedule
self._store = self._open_schedule()
File "/usr/local/lib/python3.6/site-packages/celery/beat.py", line 466, in _open_schedule
return self.persistence.open(self.schedule_filename, writeback=True)
File "/usr/local/lib/python3.6/shelve.py", line 243, in open
return DbfilenameShelf(filename, flag, protocol, writeback)
File "/usr/local/lib/python3.6/shelve.py", line 227, in __init__
Shelf.__init__(self, dbm.open(filename, flag), protocol, writeback)
File "/usr/local/lib/python3.6/dbm/__init__.py", line 94, in open
return mod.open(file, flag, mode)
_gdbm.error: [Errno 11] Resource temporarily unavailable
Take note that I'm only using pure celery and not django-celery-beat
My dear friend, every time you bring up your Docker container, celery wants to create a celerybeat.pid file, and if one already exists an error is raised. So you should add a command that deletes the current celerybeat.pid file, via an entrypoint in your Dockerfile, like this:
COPY entrypoint.sh /code/entrypoint.sh
RUN chmod +x /code/entrypoint.sh
ENTRYPOINT ["/code/entrypoint.sh"]
And you should create an entrypoint.sh file like the one below:
#!/bin/sh
# Remove stale pid files left over from a previous container run
rm -rf /code/badpanty/*.pid
# Hand control back to the container's command
exec "$@"
I hope it's helpful.

Why do I have 4 celery processes instead of the 2 I expected?

I have configured celery to run 2 workers, each with a concurrency of 1. My /etc/default/celeryd file contains (amongst other settings):
CELERYD_NODES="worker1 worker2"
CELERYD_OPTS="-Q:worker1 central -c:worker1 1 -Q:worker2 RetailSpider -c:worker2 1"
In other words, I expect 2 workers and since concurrency is 1, 1 process per worker; one worker consumes from the queue 'central' and the other consumes from a queue called 'RetailSpider'. Both have concurrency 1.
Also sudo service celeryd status shows:
celery init v10.1.
Using config script: /etc/default/celeryd
celeryd (node worker1) (pid 46610) is up...
celeryd (node worker2) (pid 46621) is up...
However, what is puzzling me is the output of ps aux|grep 'celery worker', which is:
scraper 34384 0.0 1.0 348780 77780 ? S 13:07 0:00 /opt/scraper/evo-scrape/venv/bin/python -m celery worker --app=evofrontend --loglevel=INFO -Q central -c 1 --logfile=/opt/scraper/evo-scrape/evofrontend/logs/celery/worker1.log --pidfile=/opt/scraper/evo-scrape/evofrontend/run/celery/worker1.pid --hostname=worker1@scraping0-evo
scraper 34388 0.0 1.0 348828 77884 ? S 13:07 0:00 /opt/scraper/evo-scrape/venv/bin/python -m celery worker --app=evofrontend --loglevel=INFO -Q RetailSpider -c 1 --logfile=/opt/scraper/evo-scrape/evofrontend/logs/celery/worker2.log --pidfile=/opt/scraper/evo-scrape/evofrontend/run/celery/worker2.pid --hostname=worker2@scraping0-evo
scraper 46610 0.1 1.2 348780 87552 ? Sl Apr26 1:55 /opt/scraper/evo-scrape/venv/bin/python -m celery worker --app=evofrontend --loglevel=INFO -Q central -c 1 --logfile=/opt/scraper/evo-scrape/evofrontend/logs/celery/worker1.log --pidfile=/opt/scraper/evo-scrape/evofrontend/run/celery/worker1.pid --hostname=worker1@scraping0-evo
scraper 46621 0.1 1.2 348828 87920 ? Sl Apr26 1:53 /opt/scraper/evo-scrape/venv/bin/python -m celery worker --app=evofrontend --loglevel=INFO -Q RetailSpider -c 1 --logfile=/opt/scraper/evo-scrape/evofrontend/logs/celery/worker2.log --pidfile=/opt/scraper/evo-scrape/evofrontend/run/celery/worker2.pid --hostname=worker2@scraping0-evo
What are the additional 2 processes - the ones with ids 34384 and 34388?
(This is a Django project)
EDIT:
I wonder if this is somehow related to the fact that celery by default launches as many concurrent worker processes as there are CPUs/cores available. This machine has 2 cores, hence 2 per worker. However, I would have expected the -c:worker1 1 and -c:worker2 1 options to override that.
I added --concurrency=1 to CELERYD_OPTS and also CELERYD_CONCURRENCY = 1 to settings.py. I then killed all processes and restarted celeryd, yet I still saw 4 processes (2 per worker).

Why do I have many more jobs `started` than running or suspended?

According to the bqueues manual page:
STARTED
Number of job slots used by running or
suspended jobs owned by users or user groups in
the queue.
According to bqueues, I have 369 jobs started:
$ bqueues -r lotus | egrep '(STARTED|gholl)'
USER/GROUP SHARES PRIORITY STARTED RESERVED CPU_TIME RUN_TIME ADJUST
gholl 10 0.006 369 0 2334366.5 723589 0.000
But when I run bjobs, it only shows 24 jobs that are running or suspended:
$ bjobs | egrep '(RUN|SUSP)' | wc -l
24
What explains the discrepancy between 24 jobs running and 369 jobs started?
The number in STARTED refers to the number of slots. One job may take up more than one slot if it uses multiple threads. For example, if a job is submitted using bsub with the flag -n 16, then that job will use 16 slots. 23×16+1=369, so in the example above, user gholl has 23 jobs using 16 slots each and 1 job using a single slot.