Celery: does the worker stop at maximum capacity?

I set concurrency to 10 on a small server.
I submitted 20 jobs to it and expected 10 to run, but concurrency never went beyond 5.
Why is this happening? Is Celery automatically limiting itself based on memory and CPU?
I should add that it's a small server with 512 MB of RAM running video transcoding operations (ffmpeg).

You have to check your CELERYD_PREFETCH_MULTIPLIER setting.
The default is 4.
It means that each of your 10 worker processes will prefetch 4 tasks and run them one by one:
# Send 20 tasks
worker-1: prefetch 4 tasks: 1-4
worker-2: prefetch 4 tasks: 5-8
worker-3: prefetch 4 tasks: 9-12
worker-4: prefetch 4 tasks: 13-16
worker-5: prefetch 4 tasks: 17-20
worker-6: no more tasks to fetch
worker-7: no more tasks to fetch
worker-8: no more tasks to fetch
worker-9: no more tasks to fetch
worker-10: no more tasks to fetch
Prefetching is useful to limit network use: if tasks are very fast to complete, it greatly reduces communication between the broker and the worker and boosts performance.
But if tasks are slow, it will unbalance the workers' loads.
edit:
For your case (tasks of ~30 minutes), use a prefetch of 1.
Also benchmark 10 Celery processes with 1 thread each (concurrency=1) against 1 process with 10 threads (concurrency=10); it can perform a little better.
More info in this doc:
http://celery.readthedocs.org/en/latest/userguide/optimizing.html#prefetch-limits
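For reference, a minimal config sketch of the above (using the old-style uppercase setting names from this answer; newer Celery releases spell them worker_prefetch_multiplier, worker_concurrency and task_acks_late):

# celeryconfig.py -- sketch only, values are assumptions for this scenario
CELERYD_CONCURRENCY = 10          # 10 worker processes
CELERYD_PREFETCH_MULTIPLIER = 1   # each process reserves only one task at a time
# With very long tasks it is also common to ack late, so a task reserved by a
# crashed worker gets redelivered (see the optimizing guide linked above):
CELERY_ACKS_LATE = True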

Related

Celery worker and worker process

What is the relation between a worker and a worker process in celery? Does it make sense to run multiple workers on a single machine?
Here is the system configuration: 8 cores and 32 GB RAM.
The Celery configuration I tried was:
celery -A Comments_DB worker --loglevel=INFO --concurrency=8
I want to increase the number of requests processed in a given time frame. Which is a better approach?
a. 2 workers with concurrency set to 8 each (2*8 = 16), or
b. 1 worker with concurrency set to 16 (1*16 = 16)?
Could anyone please clarify?
A worker (parent process) will have one or more worker processes (child processes). That way, if any of the children dies because of an error or because it hit its max-tasks limit, the parent can kick off another child process.
One parent process with a concurrency of 16 will generally perform better than two processes with a concurrency of 8 each, because there is less process overhead with one process than with two. You might want two workers if you had multiple queues and wanted to make sure that a slow queue wasn't blocking important tasks on other queues from being processed.
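As an illustration of option (b), a hedged sketch (these are the modern lowercase setting names; the broker URL and the per-child task limit are assumptions, not taken from the question):

from celery import Celery

app = Celery("Comments_DB", broker="amqp://guest@localhost//")
app.conf.worker_concurrency = 16           # one parent worker, 16 child processes
app.conf.worker_max_tasks_per_child = 100  # parent replaces a child after 100 tasks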

Pauses between jobs

We run a BOSH deployment on GCE and it is generally very stable. We scale workers from 2 at off-peak to 6 during the day, on 4-core / 3.6 GB RAM / 100 GB SSD machines.
Performance is generally good, but on occasion there are long pauses (2-3 minutes) between jobs, particularly when a lot of jobs are in progress. The spinner in the UI stops moving for the previously completed job, but the next one does not start.
I presume this is related to provisioning containers for the next job, but is there some way we can speed up that operation? Is there some resource we can increase or tune that would reduce that lag?
Thanks.

Spark over YARN: some tasks are extremely slow

I am using a cluster of 12 virtual machines, each of which has 16 GB of memory and 6 cores (except the master node, which has only 2 cores). Each worker node was assigned 12 GB of memory and 4 cores.
When I submit a Spark application to YARN, I set the number of executors to 10 (of the 12 machines, 1 acts as the cluster manager and 1 hosts the application master), and to maximize the parallelism of my application, most of my RDDs have 40 partitions, the same as the total number of cores across all executors.
The problem I encounter is that in some random stages, some tasks take much longer to process than others, which results in poor parallelism. As you can see in the first picture, executor 9 ran its tasks for over 30 s while other tasks finished within 1 s. Furthermore, the reason for the extra time also varies: sometimes it is just computation, but sometimes it is scheduler delay, deserialization, or shuffle read. As you can see, the cause in the second picture is different from the first.
My guess is that once a task is assigned to a specific slot, there are not enough resources on the corresponding machine, so the JVM is waiting for CPUs. Is my guess correct? And how should I configure my cluster to avoid this situation?
[Screenshots referenced above: the first stage is dominated by computing time, the second by scheduler delay and deserialization.]
To get a specific answer you need to share more about what you're doing, but most likely the partitions in one or more of your stages are unbalanced, i.e. some are much bigger than others. The result is a slowdown, since each of those partitions is handled by a single task. One way to solve it is to increase the number of partitions or change the partitioning logic.
Also, when a big task finishes, shipping its data to the other tasks takes longer as well, which is another reason other tasks may run long.
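As a rough illustration of that suggestion, a PySpark sketch (spark, data_path and heavy_transform are assumptions, not from the question):

# Raise the partition count above the 40 available cores so a single skewed
# partition matters less, or repartition by a well-distributed key.
rdd = spark.sparkContext.textFile(data_path)

balanced = rdd.repartition(80)                 # option 1: more, smaller partitions
result1 = balanced.map(heavy_transform).count()

pairs = rdd.map(lambda line: (hash(line) % 80, line))
spread = pairs.partitionBy(80)                 # option 2: explicit partitioning logic
result2 = spread.mapValues(heavy_transform).count()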

Celery: per task concurrency limits (# of workers per task)?

Is it possible to set the concurrency (the number of simultaneous worker processes) on a per-task level in Celery? I'm looking for something more fine-grained than CELERYD_CONCURRENCY (which sets the concurrency for the whole daemon).
The usage scenario is: I have a single celeryd running different types of tasks with very different performance characteristics - some are fast, some very slow. For some I'd like to run as many as I can as quickly as I can; for others I'd like to ensure only one instance is running at any time (i.e. a concurrency of 1).
You can use automatic routing to route tasks to different queues which will be processed by celery workers with different concurrency levels.
celeryd-multi start fast slow -Q:fast fast -Q:slow slow -c:slow 3 -c:fast 5
This command launches two Celery workers: one consuming the fast queue with a concurrency of 5, and one consuming the slow queue with a concurrency of 3.
CELERY_ROUTES = {
    "tasks.a": {"queue": "slow"},
    "tasks.b": {"queue": "fast"},
}
Tasks of type tasks.a will then be routed to the slow queue and tasks.b tasks to the fast queue.
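A minimal sketch of the module these route entries refer to (the broker URL is an assumption; the task names tasks.a and tasks.b come from the routing table above):

# tasks.py
from celery import Celery

app = Celery("tasks", broker="amqp://guest@localhost//")
app.conf.CELERY_ROUTES = {
    "tasks.a": {"queue": "slow"},
    "tasks.b": {"queue": "fast"},
}

@app.task
def a():
    ...  # slow work, handled by the worker consuming "slow" (concurrency 3)

@app.task
def b():
    ...  # quick work, handled by the worker consuming "fast" (concurrency 5)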

process thread scheduling

I have the following query regarding the scheduling of process threads.
a) If my process A has 3 threads, can these threads be scheduled concurrently on different CPUs on an SMP machine, or will they be given time slices on the same CPU?
b) Suppose I have two processes, A with 3 threads and B with 2 threads (all threads of the same priority). Is the CPU time (time slice) allocated to each thread dependent on the number of threads in its process?
Correct me if I am wrong: is it the case that CPU time is allocated to the process and then shared among its threads, i.e. the time slice given to process A's threads is smaller than that given to process B's threads?
This depends on your OS and thread implementation. POSIX threads defines an interface for controlling how threads are scheduled: whether each thread is scheduled against all other threads in the system, or each process is scheduled equally and its threads share that allocation. Not all scheduling types are supported on all platforms.
On Linux, with NPTL, the default behavior is to schedule all threads equally, so a process with 10 threads might get 10 times as much CPU time as a process with 1 thread, if all eleven threads are CPU bound.