Will celeryd restart drop currently running tasks? - celery

If I have two worker processes doing long-time operations. If I use /etc/init.d/celeryd restart as in the official document to restart them when they are in the middle of processing tasks, what happens then? Will they wait till they finish the tasks before shutting down? If new tasks keep coming right now, are they lining up in the queue till any worker finishes restarting? Or will celery start a new worker before old ones are shut down and routing new tasks to it so that there won't be any time no workers are available?

Related

Queries regarding celery scalability

I have few questions regarding celery. Please help me with that.
Do we need to put the project code in every celery worker? If yes, if I am increasing the number of workers and also I am updating my code, what is the best way to update the code in all the worker instances (without manually pushing code to every instance everytime)?
Using -Ofair in celery worker as argument disable prefetching in workers even if have set PREFETCH_LIMIT=8 or so?
IMPORTANT: Does rabbitmq broker assign the task to the workers or do workers pull the task from the broker?
Does it make sense to have more than one celery worker (with as many subprocesses as number of cores) in a system? I see few people run multiple celery workers in a single system.
To add to the previous question, whats the performance difference between the two scenarios: single worker (8 cores) in a system or two workers (with concurrency 4)
Please answer my questions. Thanks in advance.
Do we need to put the project code in every celery worker? If yes, if I am increasing the number of workers and also I am updating my code, what is the best way to update the code in all the worker instances (without manually pushing code to every instance everytime)?
Yes. A celery worker runs your code, and so naturally it needs access to that code. How you make the code accessible though is entirely up to you. Some approaches include:
Code updates and restarting of workers as part of deployment
If you run your celery workers in kubernetes pods this comes down to building a new docker image and upgrading your workers to the new image. Using rolling updates this can be done with zero downtime.
Scheduled synchronization from a repository and worker restarts by broadcast
If you run your celery workers in a more traditional environment or for some reason you don't want to rebuild whole images, you can use some central file system available to all workers, where you update the files e.g. syncing a git repository on a schedule or by some trigger. It is important you restart all celery workers so they reload the code. This can be done by remote control.
Dynamic loading of code for every task
For example in omega|ml we provide lambda-style serverless execution of
arbitrary python scripts which are dynamically loaded into the worker process.
To avoid module loading and dependency issues it is important to keep max-tasks-per-child=1 and use the prefork pool. While this adds some overhead it is a tradeoff that we find is easy to manage (in particular we run machine learning tasks and so the little overhead of loading scripts and restarting workers after every task is not an issue)
Using -Ofair in celery worker as argument disable prefetching in workers even if have set PREFETCH_LIMIT=8 or so?
-O fair stops workers from prefetching tasks unless there is an idle process. However there is a quirk with rate limits which I recently stumbled upon. In practice I have not experienced a problem with neither prefetching nor rate limiting, however as with any distributed system it pays of to think about the effects of the asynchronous nature of execution (this is not particular to Celery but applies to all such such systems).
IMPORTANT: Does rabbitmq broker assign the task to the workers or do workers pull the task from the broker?
Rabbitmq does not know of the workers (nor do any of the other broker supported by celery) - they just maintain a queue of messages. That is, it is the workers that pull tasks from the broker.
A concern that may come up with this is what if my worker crashes while executing tasks. There are several aspects to this: There is a distinction between a worker and the worker processes. The worker is the single task started to consume tasks from the broker, it does not execute any of the task code. The task code is executed by one of the worker processes. When using the prefork pool (which is the default) a failed worker process is simply restarted without affecting the worker as a whole or other worker processes.
Does it make sense to have more than one celery worker (with as many subprocesses as number of cores) in a system? I see few people run multiple celery workers in a single system.
That depends on the scale and type of workload you need to run. In general CPU bound tasks should be run on workers with a concurrency setting that doesn't exceed the number of cores. If you need to process more of these tasks than you have cores, run multiple workers to scale out. Note if your CPU bound task uses more than one core at a time (e.g. as is often the case in machine learning workloads/numerical processing) it is the total number of cores used per task, not the total number of tasks run concurrently that should inform your decision.
To add to the previous question, whats the performance difference between the two scenarios: single worker (8 cores) in a system or two workers (with concurrency 4)
Hard to say in general, best to run some tests. For example if 4 concurrently run tasks use all the memory on a single node, adding another worker will not help. If however you have two queues e.g. with different rates of arrival (say one for low frequency but high-priority execution, another for high frequency but low-priority) both of which can be run concurrently on the same node without concern for CPU or memory, a single node will do.

What if i schedule tasks for celery to perform every minute and it is not able to complete it in time?

If I schedule the task for every minute and if it is not able to be getting completed in the time(one minute). Would the task wait in queue and it will go on like this? if this happens then after few hours it will be overloaded. Is there any solution for this kind of problems?
I am using beat and worker combination for this. It is working fine for less records to perform tasks. but for large database, I think this could cause problem.
Task is assign to queue (RabbitMQ for example).
Workers are queue consumers, more workers (or worker with high concurrency) - more tasks could be handled in parallel.
Your periodic task produce messages of the same type (I guess) and your celery router route them to the same queue.
Just set your workers to consume messages from that queue and that's all.
celery worker -A celeryapp:app -l info -Q default -c 4 -n default_worker#%h -Ofair
In the example above I used -c 4 for concurrency of four (eqv. to 4 consumers/workers). You can also start move workers and let them consume from the same queue with -Q <queue_name> (in my example it's default queue).
EDIT:
When using celery (the worker code) you are initiate Celery object. In Celery constructor you are setting your broker and backend (celery used them as part of the system)
for more info: http://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html#application

Gracefully update running celery pod in Kubernetes

I have a Kubernetes cluster running Django, Celery, RabbitMq and Celery Beat. I have several periodic tasks spaced out throughout the day (so as to keep server load down). There are only a few hours when no tasks are running, and I want to limit my rolling-updates to those times, without having to track it manually. So I'm looking for a solution that will allow me to fire off a script or task of some sort that will monitor the Celery server, and trigger a rolling update once there's a window in which no tasks are actively running. There are two possible ways I thought of doing this, but I'm not sure which is best, nor how to implement either one.
Run a script (bash or otherwise) that checks up on the Celery server every few minutes, and initiates the rolling-update if the server is inactive
Increment the celery app name before each update (in the Beat run command, the Celery run command, and in the celery.py config file), create a new Celery pod, rolling-update the Beat pod, and then delete the old Celery 12 hours later (a reasonable time span for all running tasks to finish)
Any thoughts would be greatly appreciated.

Will celery handle tasks gracefully if supervisorctl stop/start/restart is used?

In a case recently, I had to restart some inexplicably idle workers run by supervisord. We are thinking about adding a periodic restart, say, once or twice a day.
This could easily be done using supervisorctl, but is there any chance tasks will be lost while the restart occurs?

Setting up deployments with Capistrano, Sidekiq and Monit

My application uses Sidekiq to handle long (several minutes) running background tasks. Deployments are done with Capistrano 2 and all processes are monitored with Monit.
I have used capistrano-sidekiq to manage the sidekiq process during deployments but it has not worked perfectly. Some times during the deployment a new sidekiq process is started but the old one is not killed. I believe this happens because capistrano-sidekiq is not operating through Monit during the deployment.
Second problem is that because my background tasks can take several minutes to complete my deployment should allow two sidekiq processes to co-exisit. The old sidekiq process should be allowed to complete the tasks it is processing and a new sidekiq process should start taking new tasks into processing.
I have been thinking about something like this into my deploy script
When deployment starts:
I tell Monit to unmonitor the sidekiq process
I stop the current sidekiq process and give it 10 minutes to finish its tasks
After the code has been updated:
I start a new sidekiq process and tell Monit to start monitoring it.
I may need to move the sidekiq process pid file into the release directory if the pid file is not removed until the stopped sidekiq process has eventually been killed.
How does this sound? Any caveats spotted?
EDIT:
Found a good thread about this same issue.
http://librelist.com/browser//sidekiq/2014/6/5/rollback-signal-after-usr1/#f6898deccb46801950f40ad22e75471d
Seems reasonable to me. The only possible issue is losing track of the old Sidekiq's PID but you should be able to use ps and grep for "stopping" to find old Sidekiqs.