Celery - signal to detect idle workers

I have 1 Celery broker and several Celery workers, all communicating with RabbitMQ. In my setup, I send several tasks to my Celery workers, they process all the tasks (it takes ~1 hour), and then I manually terminate the workers.
I want to move towards a system where, if a Celery worker is 'idle' (which I define as: has 0 active tasks for a period of timeout_seconds, which I will define beforehand), the worker is terminated programmatically. All workers will have approximately the same number of tasks to run, and will all go 'idle' around the same time.
I have code set up that lets me terminate workers, but I am not sure how to detect that a worker is 'idle' and ready for termination. I think I want to use a signal, but it doesn't look like there is one that fits my requirement.

Where I work we have a task that does basically what you want - it automatically scales the cluster up/down depending on the "situation". The key to this process is the Celery inspect/control API, so I suggest you get familiar with it. This area is not well documented, so start with the following:
insp = celery_app.control.inspect()
active_queues = insp.active_queues()
# Note: between these two calls some nodes may shut down and disappear
# from the dictionary, so you may need to deal with this...
active_stats = insp.active()
You can do this in a separate IPython session while your Celery cluster runs tasks, and look at what is there...
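To give an idea of how that can be turned into the idle detection you describe, here is a minimal sketch (my own, not production code): it polls active() and sends the built-in 'shutdown' remote-control command to any worker that has had zero active tasks for timeout_seconds. The function name and the polling interval are assumptions, not part of Celery:
import time

def shutdown_idle_workers(celery_app, timeout_seconds=300, poll_interval=30):
    # worker name -> timestamp when it was first seen with zero active tasks
    idle_since = {}
    while True:
        active = celery_app.control.inspect().active() or {}
        if not active:
            break  # no workers responded; nothing left to watch
        now = time.time()
        for worker, tasks in active.items():
            if tasks:
                idle_since.pop(worker, None)  # busy again, reset its timer
                continue
            first_idle = idle_since.setdefault(worker, now)
            if now - first_idle >= timeout_seconds:
                # 'shutdown' is a built-in remote-control command; destination
                # limits it to the one idle worker.
                celery_app.control.broadcast('shutdown', destination=[worker])
                idle_since.pop(worker, None)
        time.sleep(poll_interval)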

Related

Celery prefetched tasks stuck behind other tasks

I am running into an issue on an ECS cluster that includes multiple Celery workers and requires up-scaling.
Some background:
I have a task which is running potentially for a few hours.
Celery workers on an ECS cluster are currently scaled based on queue depth using Flower. Whenever the queue depth is larger than 1, it scales up a worker to potentially receive more tasks.
The broker used is Redis.
I have set the worker_prefetch_multiplier to 1, and each worker's concurrency equals 4.
The problem definition:
Because of these settings, each worker prefetches 4 tasks on top of the 4 it is executing, before the queue depth starts to grow. So with a single worker running, 8 tasks have to be invoked before the 9th task raises the queue depth to 1: 4 tasks will be in the STARTED state and 4 tasks will be in the RECEIVED state. When the number of worker nodes is scaled up to 2, only the 9th task will be sent to the new worker. This means that the 4 tasks in the RECEIVED state are "stuck" behind the 4 tasks in the STARTED state for potentially a few hours, which is undesirable.
Investigated solutions:
When searching for a solution, one finds in Celery's documentation (https://docs.celeryproject.org/en/stable/userguide/optimizing.html) that the only way to disable prefetching is to use acks_late=True for the tasks. That does ensure no tasks are prefetched, but it also causes other problems, like replicating tasks on newly scaled worker nodes, which is DEFINITELY not what I want.
The -O fair option on the worker is also often suggested as a solution, but seemingly it still leaves tasks in the RECEIVED state.
Currently, I am thinking of a somewhat complex solution to this problem, so I would be very happy to hear other suggestions. The currently proposed solution is to set the concurrency to -c 2 (instead of -c 4). This would mean that 2 tasks are prefetched on the first worker node and 2 tasks are started. All other tasks will end up in the queue, triggering a scaling event. Once ECS has scaled up to two worker nodes, I will scale the concurrency of the first worker from 2 to 4, releasing the prefetched tasks.
Any ideas/suggestions?
I have found a solution to this problem (in this thread: https://github.com/celery/celery/issues/6500) with the help of #samdoolin. I will provide the full answer here for people who run into the same issue.
Solution:
The solution provided by #samdoolin is to monkeypatch the can_consume functionality of the Consumer so that it only consumes a message when there are fewer reserved requests than the worker can handle (the worker's concurrency). In my case that means it won't consume requests if there are already 4 requests active. Any further request instead accumulates in the queue, resulting in the expected behavior. I can then easily scale the number of ECS containers holding a single worker based on the queue depth.
In practice this would look something like (thanks again to #samdoolin):
from celery import Celery
from celery.loaders.app import AppLoader


class SingleTaskLoader(AppLoader):

    def on_worker_init(self):
        # called when the worker starts, before logging setup
        super().on_worker_init()
        """
        STEP 1:
        monkey patch kombu.transport.virtual.base.QoS.can_consume()
        to prefer to run a delegate function,
        instead of the builtin implementation.
        """
        import kombu.transport.virtual
        builtin_can_consume = kombu.transport.virtual.QoS.can_consume

        def can_consume(self):
            """
            monkey patch for kombu.transport.virtual.QoS.can_consume
            if self.delegate_can_consume exists, run it instead
            """
            if delegate := getattr(self, 'delegate_can_consume', False):
                return delegate()
            else:
                return builtin_can_consume(self)

        kombu.transport.virtual.QoS.can_consume = can_consume

        """
        STEP 2:
        add a bootstep to the celery Consumer blueprint
        to supply the delegate function above.
        """
        from celery import bootsteps
        from celery.worker import state as worker_state

        class Set_QoS_Delegate(bootsteps.StartStopStep):

            requires = {'celery.worker.consumer.tasks:Tasks'}

            def start(self, c):

                def can_consume():
                    """
                    delegate for QoS.can_consume
                    only fetch a message from the queue if the worker has
                    no other messages
                    """
                    # note: reserved_requests includes active_requests
                    return len(worker_state.reserved_requests) == 0

                # types...
                # c: celery.worker.consumer.consumer.Consumer
                # c.task_consumer: kombu.messaging.Consumer
                # c.task_consumer.channel: kombu.transport.virtual.Channel
                # c.task_consumer.channel.qos: kombu.transport.virtual.QoS
                c.task_consumer.channel.qos.delegate_can_consume = can_consume

        # add bootstep to Consumer blueprint
        self.app.steps['consumer'].add(Set_QoS_Delegate)


# Create a Celery application as normal with the custom loader and any required **kwargs
celery = Celery(loader=SingleTaskLoader, **kwargs)
Then we start the worker via the following line:
celery -A proj worker -c 4 --prefetch-multiplier -1
Make sure you don't forget the --prefetch-multiplier -1 option, which disables fetching new requests altogether; this makes sure the can_consume monkeypatch is actually used.
Now, when the Celery app is up and you request 6 tasks, 4 will be executed as expected and 2 will stay in the queue instead of being prefetched. This is the expected behavior without actually setting acks_late=True.
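To sanity-check that the patch is active, something like the following can be run against the worker (the long_task import is a placeholder for one of your own tasks):
from celery import group
from proj.tasks import long_task   # placeholder for one of your own long-running tasks

# With a single worker started as: celery -A proj worker -c 4 --prefetch-multiplier -1
group(long_task.s(i) for i in range(6)).apply_async()

insp = celery.control.inspect()    # `celery` is the app created with SingleTaskLoader
print({w: len(t) for w, t in (insp.active() or {}).items()})    # should show 4 active
print({w: len(t) for w, t in (insp.reserved() or {}).items()})  # should show 0 reserved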
One last note: according to Celery's documentation, it should also be possible to pass the path to the SingleTaskLoader when starting the worker on the command line, like this:
celery -A proj --loader path.to.SingleTaskLoader worker -c 4 --prefetch-multiplier -1
Unfortunately, this did not work for me, but it can be worked around by passing the loader class to the constructor instead, as shown above.

In an IPython cluster how can I gracefully interrupt a worker

I want to run some jobs in a cluster, but I want to be able to kill the job if it is taking too long. Can I do this gracefully from the client, and still have the worker available to do more jobs?
My scenario is that I want to investigate how different machine learning classifiers and hyperparameters affect the time to run .fit(). If the time takes too long, I just want to abandon the task and move on to the next one.
I can find the PIDs of the workers, and I can use kill() to send a signal from the client, but sending SIGINT, SIGHUP or SIGABRT all seem to kill the worker outright rather than just interrupting it. I can't put any logic in the worker code, because it's the atomic call to .fit() that I want to time and interrupt.
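One pattern that might address this (a sketch based on my own assumptions, not a verified ipyparallel recipe): since CPython turns SIGINT into a KeyboardInterrupt in the main thread, the function you ship to the engine can absorb the interrupt around the .fit() call so the engine process itself survives. Whether this works in practice depends on the engine not overriding the SIGINT handler and on the estimator's C code noticing the signal:
import time

def timed_fit(clf, X, y):
    # Shipped to the engine, e.g. view.apply_async(timed_fit, clf, X, y);
    # the client sends SIGINT to the engine's PID if it runs too long.
    start = time.time()
    try:
        clf.fit(X, y)
        return 'ok', time.time() - start
    except KeyboardInterrupt:
        # Swallow the interrupt so the engine stays alive for the next job.
        return 'aborted', time.time() - start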

Give an entire Celery chain priority over new tasks

I want to launch a chain of Celery tasks, and have them all execute before any newly arriving tasks do. I'll have a single worker process handling all tasks.
I guess the easiest thing to do would be to not make them a chain at all, but instead launch a single task that synchronously calls a sequence of functions. But I'd like to take advantage of Celery retries, allowing each task to be retried a different number of times.
What's the best way to do this?
If you have a single worker running a single process then, as far as I can tell from working with Celery (this is not explicitly documented), you should get the behavior you want.
If you want to use multiple worker processes then you may need to set CELERYD_PREFETCH_MULTIPLIER to 1.
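For reference, a minimal sketch of the kind of chain described above, with each task given its own retry limit (the task bodies and broker URL are placeholders):
from celery import Celery, chain

app = Celery('proj', broker='amqp://localhost')   # placeholder broker URL

@app.task(bind=True, max_retries=5, default_retry_delay=10)
def fetch(self, url):
    try:
        ...                            # real work goes here
        return url
    except Exception as exc:
        raise self.retry(exc=exc)

@app.task(bind=True, max_retries=2, default_retry_delay=60)
def process(self, data):
    try:
        ...                            # real work goes here
        return data
    except Exception as exc:
        raise self.retry(exc=exc)

# Run with a single worker process (or CELERYD_PREFETCH_MULTIPLIER = 1), per the
# answer above, so the linked tasks run back to back ahead of newly arriving tasks.
chain(fetch.s('http://example.com'), process.s()).apply_async()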

Conditional restart of supervisord processes?

I've been using supervisord for a while -- outstanding tool. The one use case I haven't been able to figure out is, how to configure jobs to be restarted until a condition is met, then stop restarting.
Example: let's say you have a bunch of work to do, like scaling thousands of images, or servicing millions of requests on a queue. A useful pattern would be to run many workers in parallel to work on that backlog. You could set up a supervisord job that ensures 100 workers are running, and if any of them crash, supervisord will spin up replacements so the pool of workers won't shrink.
That's great until the work is done. Maybe when the backlog is gone, the number of workers should scale down to 1 or 0. Supervisord will keep spinning processes up to maintain the total of 100, even if each new process checks for work, sees none, and shuts down very quickly.
Is there a way for a process instance or process family to communicate with supervisord to say, the autorestart behavior is no longer needed? Better yet, is there a way to scale the number of worker processes up and down based on some condition (like the number of files in a directory or ??)?
I know it can be done by updating the supervisord.conf file and running supervisorctl reload, but I'd prefer something that's more declarative and self-managing if such a thing exists.
Is there a way for a process instance or process family to communicate with supervisord to say, the autorestart behavior is no longer needed?
You can wind down an activity by making sure your processes exit with different exitcode(s) when there is no work and making those the expected exitcodes with autorestart=unexpected in the configuration.
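As an illustration of that (file names, program name, and the queue helper are made up), the worker script exits 0 when the backlog is empty, and the config treats only 0 as an expected exit:
# worker.py - drains a backlog, then exits with the "expected" code.
# Matching supervisord section (illustrative):
#   [program:imgworker]
#   command=python /opt/app/worker.py
#   numprocs=100
#   process_name=%(program_name)s_%(process_num)02d
#   autorestart=unexpected
#   exitcodes=0          ; exit 0 means "no more work", so no replacement is started
import sys
from myqueue import pop_job, process_job   # hypothetical queue helpers

def main():
    while True:
        job = pop_job()
        if job is None:
            sys.exit(0)    # backlog empty: expected exit, supervisord does not restart
        process_job(job)   # a crash exits nonzero ("unexpected") and is restarted

if __name__ == '__main__':
    main()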
Better yet, is there a way to scale the number of worker processes up and down based on some condition (like number of files in a directory or ??).
The trouble is that the automatic state transitions don't allow for getting processes running again from an expected EXITED state. AFAIK the only way to do this is with the XML-RPC API's startProcess, so you would need to write or find an appropriate event listener that watches for your start condition and then uses the API.
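For example (server URL and process names are placeholders), the standard library's xmlrpc.client can call supervisord's RPC interface to bring exited workers back when your start condition is met:
from xmlrpc.client import ServerProxy

# Requires [inet_http_server] (or a socket-based equivalent) enabled in supervisord.conf.
server = ServerProxy('http://localhost:9001/RPC2')

# Start a single exited process; startProcess takes 'group:name' and a wait flag.
server.supervisor.startProcess('imgworker:imgworker_00', True)

# Or bring the whole pool back up when a backlog reappears.
server.supervisor.startProcessGroup('imgworker', True)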
An alternate design is to wrap your worker process in an event handler watching PROCESS COMMUNICATION Events and have one normal subprocess communicating new tasks to a pool of event listeners. But that model doesn't currently eliminate a pool of waiting processes when there is no work, it just organizes the control task in a way that may make it easier to separate out task related logic and resource usage.

Least load scheduler

I'm working on a system that uses several hundred workers in parallel (physical devices evaluating small tasks). Some workers are faster than others, so I was wondering what the easiest way is to load balance tasks across them without a priori knowledge of their speed.
I was thinking about keeping track of the number of tasks a worker is currently working on with a simple counter and then sorting the list to get the worker with the lowest active task count. This way slow workers would get some tasks but not slow down the whole system. The reason I'm asking is that the current round-robin method is causing hold-ups with some really slow workers (100 times slower than others) that keep accumulating tasks and blocking new tasks.
It should be a simple matter of sorting the list according to the current number of active tasks, but since I would be sorting the list several times a second (average work time per task is below 25 ms), I fear that this might become a major bottleneck. So is there a simple way of getting the worker with the lowest task count without having to sort over and over again?
EDIT: The tasks are pushed to the workers via an open TCP connection. Since the dependencies between the tasks are rather complex (exclusive resource usage) let's say that all tasks are assigned to start with. As soon as a task returns from the worker all tasks that are no longer blocked are queued, and a new task is pushed to the worker. The work queue will never be empty.
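On the sorting concern: with a few hundred workers, even a plain min() over a dict of counters is cheap, but if you want to avoid scanning at all, a min-heap with lazy invalidation gives the least-loaded worker in O(log n) per update. A rough sketch (class and method names are mine):
import heapq
import itertools

class LeastLoadedPool:
    def __init__(self, workers):
        self._tick = itertools.count()              # tie-breaker for equal loads
        self._load = {w: 0 for w in workers}        # worker -> active task count
        self._heap = [(0, next(self._tick), w) for w in workers]
        heapq.heapify(self._heap)

    def acquire(self):
        # Pop entries until one matches the worker's current load (stale ones are skipped).
        while True:
            load, _, worker = heapq.heappop(self._heap)
            if load == self._load[worker]:
                break
        self._load[worker] += 1
        heapq.heappush(self._heap, (self._load[worker], next(self._tick), worker))
        return worker

    def release(self, worker):
        # Call this when a task result comes back from the worker.
        self._load[worker] -= 1
        heapq.heappush(self._heap, (self._load[worker], next(self._tick), worker))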
How about this system:
Worker reaches the end of its task queue
Worker requests more tasks from load balancer
Load balancer assigns N tasks (where N is probably more than 1, perhaps 20 - 50 if these tasks are very small).
In this system, since you are assigning new tasks when the workers are actually done, you don't have to guess at how long the remaining tasks will take.
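A minimal sketch of that pull model (class and method names are made up; wiring it to the existing TCP connections is left out):
import threading
from collections import deque

class BatchBalancer:
    def __init__(self, tasks, batch_size=20):
        self._pending = deque(tasks)
        self._batch_size = batch_size
        self._lock = threading.Lock()

    def request_batch(self, worker_id):
        # Called by a worker when its local queue runs dry; fast workers simply
        # ask more often, so they receive more work without any speed measurements.
        with self._lock:
            n = min(self._batch_size, len(self._pending))
            return [self._pending.popleft() for _ in range(n)]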
I think that you need to provide more information about the system:
How do you get a task to a worker? Does the worker request it or does it get pushed?
How do you know if a worker is out of work, or even how much work it is doing?
How are the physical devices modeled?
What you want to do is avoid tracking anything and find a more passive way to distribute the work.