Frappe: Can we configure different number of workers for different types of Queue - erpnext

In common_site_config we have background_workers attribute. When we set this attribute to a certain value, that many number of workers are created per queue. In Frappe we have 3 queues, short, long and default. So if we set background_workers:4 we get 4 workers for each of these 3 queues.
Is it possible to set number of workers for a specific queue, like, for short and default - 5 workers - so that load can be shared and run in parallel, for long - only 1 worker so that all the work on long queue can run in sequence.
Can we set number of workers per queue type?

The values in the common_site_config.json file are used by bench to generate a corresponding Procfile (development) or supervisor.conf (production).
You can update the configuration files directly to fine tune these settings as per your needs. Checkout how supervisorctl (or honcho) works - their respective documentations.
The supervisor.conf file for your bench can be found under the config folder. After adding an additional section for your worker, a bench restart may be enough to setup your new workers.

Related

How to structure Google OR-Tools MPSolver to assign multiple tasks per worker?

I'm following the MPSolver java examples in an attempt to assign two tasks per worker. A worker must perform two tasks or no tasks, a task can be assigned to only one worker. A matrix indicates the tasks that a worker can perform. The goal is to maximize the number of worker-task combinations. Some workers and tasks may not get assigned. The examples cover single worker-task assignments, but I don't follow how to structure MPSolver constraints to associate two tasks per worker.
Subsequently, I need to minimize the amount of time each worker spends. The time worked is associated with individual workers.

What's the point of having a single celery worker with multiple queues?

continuing How does a Celery worker consuming from multiple queues decide which to consume from first?
I've setup a single worker and have it listen to two queues. I understand from the above linked question that the worker would consume messages from those two queues in round-robin or in the order they arrived (depending on celery version).
So what's the purpose of this setting? Why is it different than a single queue? Would that be helpful only for monitoring, or is there an operational benefit i'm missing here?
In most scenarios you will have your worker subscribed only to a single queue, however there are scenarios when having ability to subscribe to multiple queues makes sense.
Here is one. Imagine you have a Celery cluster of 10 machines. They perform various tasks, and among them there is a task that downloads files from remote file-server. However, the owner of the file-server whitelisted only two of your 10 machine IPs, so basically only two of them can download files from that particular file-server. Typically you will have Celery workers on these two machines subcribe to an additional queue, called "download" for an example, and schedule download tasks by sending them to the "download" queue.
This is a very common scenario where a subset of your nodes can do particular thing (access remote servers - file servers, database servers, etc).
One could argue "why not have just the 'download' queue on these two machines?" - that may be a waste of resources.

Celery prefetched tasks stuck behind other tasks

I am running into an issue on an ECS cluster including multiple Celery workers when the cluster requires up-scaling.
Some background:
I have a task which is running potentially for a few hours.
Celery workers on an ECS cluster are currently scaled based on queue depth using Flower. Whenever the queue depth is larger than 1, it scales up a worker to potentially receive more tasks.
The broker used is Redis.
I have set the worker_prefetch_multiplier to 1, and each worker's concurrency equals 4.
The problem definition:
Because of these settings, each of the workers prefetches 4 tasks, before filling the queue depth. So let's say we have a single worker running, it requires 8 tasks to be invoked before the queue depth fills to 1 on the 9th task. 4 tasks will be in the STARTED state and 4 tasks will be in the RECEIVED state. Whenever, scaling up the number of worker nodes to 2, only the 9th task will be send to this worker. However, this means that the 4 tasks in the RECEIVED state are "stuck" behind the 4 tasks in the STARTED state for potentially a few hours, which is undesirable.
Investigated solutions:
When searching for a solution one finds in Celery's documentation (https://docs.celeryproject.org/en/stable/userguide/optimizing.html) that the only way to disable prefetching is to use acks_late=True for the tasks. It indeed solves the problem that no tasks are prefetched, but it also causes other problems like replicating tasks on newly scaled worker nodes, which is DEFINITELY not what I want.
Also ofter the setting -O fair on the worker is considered to be a solution, but seemingly it still creates tasks in the RECEIVED state.
Currently, I am thinking of a little complex solution to this problem, so I would be very happy to hear other solutions. The current proposed solution is to set the concurrency to -c 2 (instead of -c 4). This would mean that 2 tasks will be prefetched on the first worker node and 2 tasks are started. All other tasks will end up in the queue, requiring a scaling event. Once ECS scaled up to two worker nodes, I will scale the concurrency of the first worker from 2 to 4 releasing the prefetched tasks.
Any ideas/suggestions?
I have found a solution for this problem (in these posts: https://github.com/celery/celery/issues/6500) with the help of #samdoolin. I will provide the full answer here for people that have the same issue as me.
Solution:
The solution provided by #samdoolin is to monkeypatch the can_consume functionality of the Consumer with a functionality to consume a message only when there are less reserved requests than the worker can handle (the worker's concurrency). In my case that would mean that it won't consume requests if there are already 4 requests active. Any request is instead accumulated in the queue, resulting in the expected behavior. Then I can easily scale the number of ECS containers holding a single worker based on the queue depth.
In practice this would look something like (thanks again to #samdoolin):
class SingleTaskLoader(AppLoader):
def on_worker_init(self):
# called when the worker starts, before logging setup
super().on_worker_init()
"""
STEP 1:
monkey patch kombu.transport.virtual.base.QoS.can_consume()
to prefer to run a delegate function,
instead of the builtin implementation.
"""
import kombu.transport.virtual
builtin_can_consume = kombu.transport.virtual.QoS.can_consume
def can_consume(self):
"""
monkey patch for kombu.transport.virtual.QoS.can_consume
if self.delegate_can_consume exists, run it instead
"""
if delegate := getattr(self, 'delegate_can_consume', False):
return delegate()
else:
return builtin_can_consume(self)
kombu.transport.virtual.QoS.can_consume = can_consume
"""
STEP 2:
add a bootstep to the celery Consumer blueprint
to supply the delegate function above.
"""
from celery import bootsteps
from celery.worker import state as worker_state
class Set_QoS_Delegate(bootsteps.StartStopStep):
requires = {'celery.worker.consumer.tasks:Tasks'}
def start(self, c):
def can_consume():
"""
delegate for QoS.can_consume
only fetch a message from the queue if the worker has
no other messages
"""
# note: reserved_requests includes active_requests
return len(worker_state.reserved_requests) == 0
# types...
# c: celery.worker.consumer.consumer.Consumer
# c.task_consumer: kombu.messaging.Consumer
# c.task_consumer.channel: kombu.transport.virtual.Channel
# c.task_consumer.channel.qos: kombu.transport.virtual.QoS
c.task_consumer.channel.qos.delegate_can_consume = can_consume
# add bootstep to Consumer blueprint
self.app.steps['consumer'].add(Set_QoS_Delegate)
# Create a Celery application as normal with the custom loader and any required **kwargs
celery = Celery(loader=SingleTaskLoader, **kwargs)
Then we start the worker via the following line:
celery -A proj worker -c 4 --prefetch-multiplier -1
Make sure that you don't forget the --prefetch-multiplier -1 option, which disables fetching new requests at all. This is will make sure that it uses the can_consume monkeypatch.
Now, when the Celery app is up, and you request 6 tasks, 4 will be executed as expected and 2 will end in the queue instead of being prefetched. This is the expected behavior without actually setting acks_late=True.
Then there is one last note I'd like to make. According to Celery's documentation, it should also be possible to pass the path to the SingleTaskLoader when starting the worker in the command line. Like this:
celery -A proj --loader path.to.SingleTaskLoader worker -c 4 --prefetch-multiplier -1
For me this did not work unfortunately. But it can be solved by actually passing it to the constructor.

Difference between executing StreamTasks in the same instance v/s multiple instances

Say I have a topic with 3 partitions
Method 1: I run one instance of Kafka Streams, it starts 3 tasks [0_0,0_1,0_2] and each of these tasks consume from one partition.
Method 2: I spin up three instance of the same streams application, here again three tasks are started but now, it is distributed among the 3 instances that was created.
Which method is preferable and why?
In method 1 do all the tasks run as a part of the same thread, and in method 2, they run on different threads, or is it different?
Consider that the streams application has a very simple topology, and does only mapping of values from a single stream
By default, a single KafkaStreams instance runs one thread, thus in "Method 1" all three tasks are executed by a single thread. In "Method 2" each task is executed by its own thread. Note, that you can also configure multiple thread pre KafkaStreams instance via num.stream.threads configuration parameter. If you set it to 3 for "Method 1" both method are more or less the same. How many threads you need, depends on your workload, ie, how many messages you need to process per time unit and how expensive the computation is. It also depends on the hardware: for a single-core CPU, it may not make sense to configure more than one thread, but you should deploy multiple instances on multiple machines to get more hardware. Hence, if your workload is lightweight one single-threaded instance might be enough.
Also note, that you may be network bound. For this case, starting more thread would not help, but you want to scale out to multiple machines, too.
The last consideration is fault-tolerance. Even if a single thread/instance may be powerful enough to not lag, what should happen if the instance crashes? If you only have one instance, the whole computation goes down. If you run two instances, the second instance would take over all the work and your application stays online.

Spring Batch - gridSize

I want some clear picture in this.
I have 2000 records but I limit 1000 records in the master for partitioning using rownum with gridSize=250 and partition across 5 slaves running in 10 machines.
I assume 1000/250= 4 steps will be created.
Whether data info sent to 4 slaves leaving 1 slave idle? If number
of steps is more than the number of slave java process, I assume the
data would be eventually distributed across all slaves.
Once all steps completed, would the slave java process memory is
freed (all objects are freed from memory as the step exists)?
If all steps completed for 1000/250=4 steps, to process the
remaining 1000 records, how can I start my new job instance without
scheduler triggers the job.
Since, you have not shown your Partitioner code, I would try to answer only on assumptions.
You don't have to assume about number of steps ( I assume 1000/250= 4 steps will be created ), it would be number of entries you create in java.util.Map<java.lang.String,ExecutionContext> that you return from your partition method of Partitioner Interface.
partition method takes gridSize as argument and its up to you to make use of this parameter or not so if you decide to do partitioning based on some other parameter ( instead of evenly distributing count ) then you can do that. Eventually, number of partitions would be number of entries in returned map and values stored in ExecutionContext can be used for fetching data in readers and so on.
Next, you can choose about number of steps to be started in parallel by setting appropriate TaskExecutor and concurrencyLimit values i.e. you might create 100 steps in partition but want to start only 4 steps in parallel and that can very well be achieved by configuration settings on top of partitioner.
Answer#1: As already pointed, data distribution has to be coded by you in your reader using ExecutionContext information you created in partitioner. It doesn't happen automatically.
Answer#2: Not sure what you exactly mean but yes, everything gets freed after completion and information is saved in meta data.
Answer#3: As already pointed out, all steps would be created in one go for all the data. Which steps run for which data and how many run in parallel can be controlled by readers and configuration.
Hope it helps !!