Celery worker and worker process - celery

What is the relation between a worker and a worker process in celery? Does it make sense to run multiple workers on a single machine?
Here is the system configuration:
8 cores and 32 GB RAM.
The celery configuration I tried was as below
celery -A Comments_DB worker --loglevel=INFO --concurrency=8
I want to increase the number of requests processed in a given time frame. Which is a better approach?
a. 2 workers with concurrency set to 8 each (2*8 = 16), or
b. 1 worker with concurrency set to 16 (1*16 = 16)?
Could anyone please clarify?

A worker (parent process) will have one or more worker processes (child processes). That way if any of the children die because of an error or because of a max task limit, the parent can kick off another child process.
One worker (parent process) with a concurrency of 16 will generally perform better than two workers with a concurrency of 8 each, because one parent process carries less overhead than two. You might want two workers if you had multiple queues and wanted to make sure that a slower queue wasn't blocking important tasks in another queue from processing.
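The parent/child relationship described above can be sketched with the standard library alone. This is only an illustration of a parent replacing a child that exits, not how Celery implements its pool; the child command and the run_with_respawn helper are hypothetical, and a single child slot is modeled for brevity.

```python
import subprocess
import sys

# Hypothetical child: handles one "task", then exits with the given code.
CHILD = "import sys; sys.exit(int(sys.argv[1]))"


def run_with_respawn(exit_codes):
    """Keep one child slot filled, starting a new child each time one exits.

    Returns a list of (pid, returncode), one entry per child that was started.
    """
    history = []
    for code in exit_codes:
        child = subprocess.Popen([sys.executable, "-c", CHILD, str(code)])
        child.wait()  # the parent notices that the child has exited
        history.append((child.pid, child.returncode))
        # The loop continues, so the parent starts a replacement child.
    return history


if __name__ == "__main__":
    # A non-zero code stands for a crash, zero for a clean exit (task limit).
    for pid, returncode in run_with_respawn([1, 0, 1]):
        print(f"child {pid} exited with {returncode}")
```

The parent survives every child exit, which is the property that lets a worker keep its configured concurrency over time.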

Related

Celery: dynamically allocate concurrency based on worker memory

My celery use case: spin up a cluster of celery workers and send many tasks to that cluster, and then terminate the cluster when all of the tasks have completed (usually ~2 hrs).
I currently have it set up to use the default concurrency, which is not optimal for my use case. I see it is possible to specify a --concurrency argument in celery, which sets the number of tasks that a worker will run in parallel. This is also not ideal for my use case, because, for example:
cluster A might have very memory intensive tasks and --concurrency=1 makes sense, but
cluster B might be memory light, and --concurrency=50 would optimize my workers.
Because I use these clusters very often for very different types of tasks, I don't want to have to manually profile the task beforehand and manually set the concurrency each time.
My desired behaviour is to have memory thresholds. So, for example, I can set in a config file:
min_worker_memory = .6
max_worker_memory = .8
Meaning that the worker will increment concurrency by 1 until the worker crosses over the threshold of using more than 80% memory. Then, it will decrement concurrency by 1. It will keep that concurrency for the lifetime of the cluster unless the worker memory falls below 60%, at which point it will increment concurrency by 1 again.
Are there any existing celery settings that I can leverage to do this, or will I have to implement this logic on my own? max memory per child seems somewhat close to what I want, but it ends with processes being killed, which is not what I want.

Unfortunately, Celery does not provide an autoscaler that scales up or down depending on memory usage. However, being a well-designed piece of software, it gives you an autoscaler interface that you can implement however you like. I am sure that with the help of the psutil package you can create your own autoscaler. See the Celery documentation on autoscaling for the interface.
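The threshold rule from the question can be sketched as a small pure function. This is a minimal sketch and not a Celery autoscaler: next_concurrency and its floor parameter are hypothetical names, the rule is simplified to "hold steady between the two thresholds", and the memory fraction is assumed to come from somewhere like psutil.virtual_memory().percent / 100. Wiring the decision into a custom autoscaler class is left out.

```python
def next_concurrency(current, memory_fraction,
                     min_worker_memory=0.6, max_worker_memory=0.8, floor=1):
    """Return the concurrency to use after one memory reading.

    memory_fraction is the used memory as a value between 0 and 1.
    """
    if memory_fraction > max_worker_memory:
        # Over the upper threshold: back off by one, but never below the floor.
        return max(current - 1, floor)
    if memory_fraction < min_worker_memory:
        # Under the lower threshold: there is room for one more process.
        return current + 1
    # Between the thresholds: keep the current concurrency.
    return current


if __name__ == "__main__":
    concurrency = 1
    # Hypothetical memory readings taken at regular intervals.
    for reading in [0.20, 0.35, 0.55, 0.85, 0.70, 0.50]:
        concurrency = next_concurrency(concurrency, reading)
        print(reading, concurrency)
```

Keeping the decision separate from the measurement makes the rule easy to test without starting any workers.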

Autoscale Cadence clients to consume millions of Activities or run millions of workflow instances

We have millions of activities to run, or millions of workflow instances being created.
Can we create multiple instances of a worker, or run a worker with multiple threads?
Basically, I want to know how we can autoscale when there are millions of activities to perform or millions of workflow instances being created.

I see two questions here:
Can we create multiple instances of a worker, or run a worker with multiple threads?
Yes, absolutely. The way you scale out the load is by adding more worker processes.
Can we autoscale?
Autoscaling is possible by watching the schedule-to-start latency of activity and workflow tasks. This latency is the time a task spends in a task queue before being picked up by a worker. Ideally, if there are enough workers, it is expected to be zero. If the workers cannot keep up with the load, it grows as tasks back up in the queue.
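A scaling rule based on schedule-to-start latency might look like the following sketch. It does not use any Cadence API: the function name, the thresholds and the worker limits are all hypothetical, and the latency samples are assumed to be collected from whatever metrics system the workers report to.

```python
def desired_workers(latencies_s, current_workers,
                    scale_up_above_s=1.0, scale_down_below_s=0.1,
                    min_workers=1, max_workers=100):
    """Pick a worker count from recent schedule-to-start latencies (seconds)."""
    if not latencies_s:
        # No samples in this window: leave the fleet as it is.
        return current_workers
    worst = max(latencies_s)
    if worst > scale_up_above_s:
        # Tasks are waiting in the queue: add a worker, up to the limit.
        return min(current_workers + 1, max_workers)
    if worst < scale_down_below_s:
        # Tasks are picked up almost immediately: one worker can go.
        return max(current_workers - 1, min_workers)
    return current_workers


if __name__ == "__main__":
    print(desired_workers([0.0, 0.2, 2.5], current_workers=4))
    print(desired_workers([0.01, 0.02], current_workers=4))
```

Using the worst sample in the window makes the rule react quickly to a backlog; a percentile would be a reasonable alternative if the samples are noisy.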

What is Starvation scenario in Spark streaming?

In the famous word count example for spark streaming, the spark configuration object is initialized as follows:
/* Create a local StreamingContext with two working thread and batch interval of 1 second.
The master requires 2 cores to prevent from a starvation scenario. */
val sparkConf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("WordCount")
If I change the master from local[2] to local, or do not set the master at all, I do not get the expected output; in fact, no word counting happens at all.
The comment says:
"The master requires 2 cores to prevent from a starvation scenario" that's why they have done setMaster("local[2]").
Can somebody explain to me why it requires 2 cores, and what a starvation scenario is?

From the documentation:
[...] note that a Spark worker/executor is a long-running task, hence it occupies one of the cores allocated to the Spark Streaming application. Therefore, it is important to remember that a Spark Streaming application needs to be allocated enough cores (or threads, if running locally) to process the received data, as well as to run the receiver(s).
In other words, one thread is used to run the receiver and at least one more is necessary for processing the received data. For a cluster, the number of allocated cores must be greater than the number of receivers, otherwise the system cannot process the data.
Hence, when running locally, you need at least 2 threads and when using a cluster at least 2 cores need to be allocated to your system.
A starvation scenario refers to this type of problem, where some threads are not able to execute at all while others make progress.
There are two classical problems where starvation is well known:
Dining philosophers
Readers-writers problem. Here it is possible to synchronize the threads so that either the readers or the writers starve. It is also possible to make sure that no starvation occurs.
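The receiver-versus-processing starvation described above can be reproduced without Spark. The sketch below is an analogy in plain Python, not Spark code: a thread pool stands in for local[N], a long-running "receiver" task occupies one thread, and a "processor" task only runs if a second thread is available. All names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def processing_ran(num_threads, wait_s=0.5):
    """Return True if the processing task got to run while the receiver was active."""
    stop_receiver = threading.Event()
    processed = threading.Event()

    def receiver():
        # Long-running task: holds its thread until it is told to stop.
        stop_receiver.wait()

    def processor():
        processed.set()

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        pool.submit(receiver)
        pool.submit(processor)
        ran = processed.wait(timeout=wait_s)
        # Release the receiver so the pool can shut down cleanly.
        stop_receiver.set()
    return ran


if __name__ == "__main__":
    print("1 thread: ", processing_ran(1))
    print("2 threads:", processing_ran(2))
```

With one thread the receiver holds the only slot and the processing task stays queued, which mirrors what happens with setMaster("local"); with two threads both make progress.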

Celery: per task concurrency limits (# of workers per task)?

Is it possible to set the concurrency (the number of simultaneous workers) on a per-task level in Celery? I'm looking for something more fine-grained than CELERYD_CONCURRENCY (which sets the concurrency for the whole daemon).
The usage scenario is: I have a single celeryd running different types of tasks with very different performance characteristics. Some are fast, some very slow. For some I'd like to do as many as I can as quickly as I can; for others I'd like to ensure only one instance is running at any time (i.e. a concurrency of 1).

You can use automatic routing to route tasks to different queues, which are then processed by Celery workers with different concurrency levels.
celeryd-multi start fast slow -c:slow 3 -c:fast 5
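This command launches two Celery workers named fast and slow, with concurrency levels of 5 and 3 respectively. Each worker also has to consume from the queue of the same name (for example via the -Q option) for the routing below to have the intended effect.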
CELERY_ROUTES = {"tasks.a": {"queue": "slow"}, "tasks.b": {"queue": "fast"}}
Tasks named tasks.a are routed to the slow queue and tasks named tasks.b to the fast queue.
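To make the mapping concrete, the lookup that such a routing table expresses can be sketched in plain Python. This is not Celery's router: queue_for is a hypothetical helper, and the fallback name "celery" is assumed here because it is Celery's default queue name.

```python
CELERY_ROUTES = {"tasks.a": {"queue": "slow"}, "tasks.b": {"queue": "fast"}}


def queue_for(task_name, routes=CELERY_ROUTES, default_queue="celery"):
    """Return the queue a task name maps to, or the default queue if unrouted."""
    return routes.get(task_name, {}).get("queue", default_queue)


if __name__ == "__main__":
    for name in ("tasks.a", "tasks.b", "tasks.c"):
        print(name, "->", queue_for(name))
```

Any task without an entry ends up on the default queue, so a worker has to consume that queue as well if such tasks exist.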

process thread scheduling

I have the following query regarding the scheduling of process threads.
a) If my process A has 3 threads, can these threads be scheduled concurrently on different CPUs in an SMP machine, or will they be given time slices on the same CPU?
b) Suppose I have two processes, A with 3 threads and B with 2 threads (all threads have the same priority). Does the CPU time allocated to each thread (its time slice) depend on the number of threads in the process or not?
Correct me if I am wrong: is it the case that CPU time is allocated to the process and then shared among its threads, i.e. the time slice given to each of process A's threads is smaller than that given to each of process B's threads?

This depends on your OS and thread implementation. POSIX threads defines an interface for choosing how threads are scheduled (the contention scope): whether each thread is scheduled equally across the system, or each process is scheduled equally and its threads share the process's time. Not all scheduling types are supported on all platforms.
On Linux with NPTL, the default behavior is to schedule all threads equally, so a process with 10 threads might get 10 times as much CPU time as a process with 1 thread, if all eleven threads are CPU bound.
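The difference between the two scheduling models can be worked out for the example in the question (A with 3 threads, B with 2). The sketch below is only arithmetic under simplifying assumptions: a single CPU, all threads CPU bound and of equal priority, and perfectly fair sharing. per_thread_share is a hypothetical helper and does not query the OS.

```python
from fractions import Fraction


def per_thread_share(threads_per_process, per_thread=True):
    """Return the CPU share of one thread in each process.

    per_thread=True:  every thread in the system is scheduled equally.
    per_thread=False: every process is scheduled equally, and its threads
                      split the process's share.
    """
    total_threads = sum(threads_per_process.values())
    process_count = len(threads_per_process)
    shares = {}
    for name, thread_count in threads_per_process.items():
        if per_thread:
            shares[name] = Fraction(1, total_threads)
        else:
            shares[name] = Fraction(1, process_count) / thread_count
    return shares


if __name__ == "__main__":
    workload = {"A": 3, "B": 2}
    print(per_thread_share(workload, per_thread=True))
    print(per_thread_share(workload, per_thread=False))
```

Under per-thread scheduling each thread gets 1/5 of the CPU, so A receives 3/5 in total; under per-process scheduling each of A's threads gets 1/6 and each of B's threads gets 1/4, which is the situation the question describes.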