If we use multiple task queues in Temporal, how does my worker know which one to poll for tasks? - cadence-workflow

If I set up 10 video task queues in Temporal matching and we have 5 matching services, will Temporal assign 2 video task queues to each matching service?
If I set up 10 video task queues in Temporal matching and we have 50 workers for them, how are the workers assigned to the task queues they poll? Do 5 workers poll each queue? How do we decide which worker polls which video task queue? Can anyone explain the principle a little?

By default, a Temporal task queue is configured with 4 partitions, so 10 task queues have 40 partitions in total. Temporal uses consistent hashing to place partitions on matching hosts. Note that this algorithm doesn't guarantee an exactly even distribution, but with 5 matching hosts each host will end up with 8 partitions on average.
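To make the "8 on average, but not exactly" point concrete, here is a minimal sketch of consistent hashing. It is not Temporal's actual implementation: the hash function, the number of virtual nodes and the names `video-N` and `matching-N` are all assumptions chosen for illustration.

```python
import bisect
import hashlib
from collections import Counter


def _hash(key: str) -> int:
    """Stable 64-bit hash of a string (illustrative choice, not Temporal's)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def place_partitions(task_queues, partitions_per_queue, hosts, vnodes=100):
    """Map every task queue partition to a host on a consistent-hash ring."""
    # Each host appears on the ring at `vnodes` pseudo-random points.
    ring = sorted((_hash(f"{h}#{v}"), h) for h in hosts for v in range(vnodes))
    points = [point for point, _ in ring]
    placement = {}
    for tq in task_queues:
        for part in range(partitions_per_queue):
            key = f"{tq}/{part}"
            # The owner is the first ring point clockwise from the key's hash.
            idx = bisect.bisect_right(points, _hash(key)) % len(ring)
            placement[key] = ring[idx][1]
    return placement


queues = [f"video-{i}" for i in range(10)]      # hypothetical queue names
hosts = [f"matching-{i}" for i in range(5)]     # hypothetical host names
placement = place_partitions(queues, 4, hosts)
per_host = Counter(placement.values())

print(len(placement), "partitions in total")    # 10 queues * 4 partitions
print(dict(per_host))                           # counts vary around the mean of 8
```

The per-host counts always add up to 40, but an individual host can hold noticeably more or fewer than 8 partitions.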
Temporal doesn't assign workers to task queues. Your code does that. When a worker is created a task queue name is a required parameter. In the majority of cases, you don't need to use multiple task queues. A single queue can support almost any throughput if configured with the appropriate number of partitions.
The reasons for using more than one task queue for a given application:
To route requests to separate pools of workers or specific processes
To rate limit a certain type of request
To specify per-worker limits (rate and number of parallel tasks) for a certain type of request


Is there a Cadence metric that can help spot overloads for each specific activity worker?

My company would like to automatically scale the activity workers and each of the workflow workers independently according to the load on a tasklist.
Reading the docs I have found the following metrics for activity workers:
However these seem to be global metrics for activity workers. Is there a Cadence metric that would allow me to spot overloads for each specific activity worker?
We have 4 different activity workers : A, B, C and D
We would like to scale independently A or B or C or D without impacting the others
Understand scheduled_to_start_latency
scheduled_to_start_latency measures the time from when a task is scheduled to when a worker starts it. During that interval, the task is transferred from the matching service to an activity worker.
These are the potential hotspots when this latency gets high:
The matching service is too hot to dispatch tasks -- in this case, check the CPU/memory of the matching nodes to confirm
The tasklist is overloaded because it defaults to one partition, which is mapped to only one matching node: https://cadenceworkflow.io/docs/operation-guide/maintain/#scale-up-a-tasklist-using-scalable-tasklist-feature -- in this case, use the tasks-per-second metric to confirm the task rate of the tasklist
The activity worker is overloaded.
How to monitor activity worker being overloaded
CPU, memory, thread usage and garbage collection metrics of the activity worker are usually enough to make sure a worker is not overloaded.
You can also use scheduled_to_start_latency, but high latency can have any of the causes listed above. Use other metrics to rule those out.

Multilevel feedback queue using SLURM

I was wondering if there is a way to construct a multilevel feedback queue using Slurm.
I have set up 3 partitions (fastqueue, mediumqueue and slowqueue), which have different time limits (2 minutes, 5 minutes and 10 minutes respectively). What I want to achieve is that all jobs are submitted to the fast queue at first, but when the time limit for the job is exceeded, the job is requeued in the next queue (medium queue). This would imply that fast jobs aren't slowed down by slow jobs.
Is it possible to achieve some kind of multilevel feedback queue using Slurm?
Instead of using multiple partitions to achieve this, can a multilevel queue be simulated by playing around with the job priorities?
Any pointers would be greatly appreciated!

Storm+Kafka not parallelizing as expected

We are having an issue regarding parallelism of tasks inside a single topology. We cannot manage to get a good, fluent processing rate.
We are using Kafka and Storm to build a system with different topologies, where data is processed following a chain of topologies connected using Kafka topics.
We are using Kafka 1.0.0 and Storm 1.2.1.
The load is small in number of messages, about 2000 per day, however each task can take quite some time. One topology in particular can take a variable amount of time to process each task, usually between 1 and 20 minutes. If processed sequentially, the throughput is not enough to process all incoming messages. All topologies and Kafka system are installed in a single machine (16 cores, 16 GB of RAM).
As messages are independent and can be processed in parallel, we are trying to use Storm concurrent capabilities to improve the throughput.
For that the topology has been configured as follows:
4 workers
parallelism hint set to 10
Message size when reading from Kafka large enough to read about 8 tasks in each message.
Kafka topics use replication-factor = 1 and partitions = 10.
With this configuration, we observe the following behavior in this topology.
About 7-8 tasks are read in a single batch from Kafka by the Storm topology (task size is not fixed), max message size of 128 kB.
About 4-5 tasks are computed concurrently. Work is more or less evenly distributed among workers. Some workers take 1 task, others take 2 and process them concurrently.
As tasks are being finished, the remaining tasks start.
A starvation problem is reached when only 1-2 tasks remain to be processed. Other workers wait idle until all tasks are finished.
When all tasks are finished, the message is confirmed and sent to the next topology.
A new batch is read from Kafka and the process starts again.
We have two main issues. First, even with 4 workers and 10 parallelism hint, only 4-5 tasks are started. Second, no more batches are started while there is work pending, even if it is just 1 task.
It is not a problem of not having enough work to do, as we tried inserting 2000 tasks at the beginning, so there is plenty of work to do.
We have tried to increase the parameter "maxSpoutsPending", expecting that the topology would read more batches and queue them at the same time, but it seems they are being pipelined internally, and not processed concurrently.
Topology is created using the following code:
private static StormTopology buildTopologyOD() {
    // ZkHosts implements the BrokerHosts marker interface.
    BrokerHosts hosts = new ZkHosts(configuration.getProperty(ZKHOSTS));
    TridentKafkaConfig tridentConfigCorrelation = new TridentKafkaConfig(hosts, configuration.getProperty(TOPIC_FROM_CORRELATOR_NAME));
    tridentConfigCorrelation.scheme = new RawMultiScheme();
    tridentConfigCorrelation.fetchSizeBytes = Integer.parseInt(configuration.getProperty(MAX_SIZE_BYTES_CORRELATED_STREAM));
    OpaqueTridentKafkaSpout spoutCorrelator = new OpaqueTridentKafkaSpout(tridentConfigCorrelation);
    TridentTopology topology = new TridentTopology();
    Stream existingObject = topology.newStream("kafka_spout_od1", spoutCorrelator)
            .each(new Fields("bytes"), new ProcessTask(), new Fields(RESULT_FIELD, OBJECT_FIELD));
    // Create a state factory to produce outputs to Kafka topics.
    TridentKafkaStateFactory stateFactory = new TridentKafkaStateFactory()
            .withKafkaTopicSelector(new ODTopicSelector())
            .withTridentTupleToKafkaMapper(new ODTupleToKafkaMapper());
    existingObject.partitionPersist(stateFactory, new Fields(RESULT_FIELD, OBJECT_FIELD), new TridentKafkaUpdater(), new Fields(OBJECT_FIELD));
    return topology.build();
}
and config created as:
private static Config createConfig(boolean local) {
    Config conf = new Config();
    conf.setMaxSpoutPending(1); // Also tried 2..6
    return conf;
}
Is there anything we can do to improve the performance, either by increasing the number of parallel tasks and/or by avoiding starvation while a batch is finishing?
I found an old post on storm-users by Nathan Marz regarding setting parallelism for Trident:
I recommend using the "name" function to name portions of your stream
so that the UI shows you what bolts correspond to what sections.
Trident packs operations into as few bolts as possible. In addition,
it never repartitions your stream unless you've done an operation
that explicitly involves a repartitioning (e.g. shuffle, groupBy,
partitionBy, global aggregation, etc). This property of Trident
ensures that you can control the ordering/semi-ordering of how things
are processed. So in this case, everything before the groupBy has to
have the same parallelism or else Trident would have to repartition
the stream. And since you didn't say you wanted the stream
repartitioned, it can't do that. You can get a different parallelism
for the spout vs. the each's following by introducing a repartitioning
operation, like so:
I think you might want to set parallelismHint for your spout as well as your .each.
Regarding processing multiple batches concurrently, you are right that that is what maxSpoutPending is for in Trident. Try checking in Storm UI that your max spout pending value is actually picked up. Also try enabling debug logging for the MasterBatchCoordinator. You should be able to tell from that logging whether multiple batches are in flight at the same time or not.
When you say that multiple batches are not processed concurrently, do you mean by ProcessTask? Keep in mind that one of the properties of Trident is that state updates are ordered with regard to batches. If you have e.g. maxSpoutPending=3 and batch 1, 2 and 3 in flight, Trident won't emit more batches for processing until batch 1 is written, at which point it'll emit one more. So slow batches can block emitting more, even if 2 and 3 are fully processed, they have to wait for 1 to finish and get written.
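The head-of-line blocking described above can be sketched with a small simulation. This is a toy model, not Storm code: it assumes a batch starts processing as soon as it is emitted, that a new batch is emitted only when the batch `max_pending` positions earlier has been written, and that writes happen strictly in batch order. The function name and the durations are made up for illustration.

```python
def simulate_ordered_commits(durations, max_pending):
    """Return (emit_times, commit_times) for batches committed in order.

    durations[i] is the processing time of batch i (arbitrary time units).
    """
    emit, commit = [], []
    for i, duration in enumerate(durations):
        # A slot frees up only when batch i - max_pending has been written.
        start = 0.0 if i < max_pending else commit[i - max_pending]
        finish = start + duration
        # State updates are ordered: batch i cannot be written before batch i - 1.
        previous_commit = commit[i - 1] if i > 0 else 0.0
        emit.append(start)
        commit.append(max(finish, previous_commit))
    return emit, commit


# One slow batch (20 minutes) followed by four fast ones (1 minute each).
emit, commit = simulate_ordered_commits([20, 1, 1, 1, 1], max_pending=3)
print("emitted at:  ", emit)    # [0.0, 0.0, 0.0, 20.0, 20.0]
print("committed at:", commit)  # [20.0, 20.0, 20.0, 21.0, 21.0]
```

Batches 2 and 3 finish after 1 minute but are not written until minute 20, and batches 4 and 5 are not even emitted until then, which matches the stall you are seeing.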
If you don't need the batching and ordering behavior of Trident, you could try regular Storm instead.
More of a side note, but you might want to consider migrating off storm-kafka to storm-kafka-client. It's not important for this question, but you won't be able to upgrade to Kafka 2.x without doing it, and it's easier before you get a bunch of state to migrate.

Kafka Streams thread number

I am new to Kafka Streams and am currently confused about the maximum parallelism of a Kafka Streams application. I went through the following link and did not find the answer I am looking for.
If I have 2 input topics, one with 10 partitions and the other with 5 partitions, and only one Kafka Streams application instance is running to process these two input topics, what is the maximum thread number I can have in this case? 10 or 15?
If I have 2 input topics, one with 10 partitions and the other with 5 partitions
Sounds good. So you have 15 total partitions. Let's assume you have a simple processor topology, without joins and aggregations, so that all 15 partitions are just being statelessly transformed.
Then, each of the 15 input partitions will map to a single Kafka Streams "task". If you have 1 thread, input from these 15 tasks will be processed by that 1 thread. If you have 15 threads, each task will have a dedicated thread to handle its input. So you can run 1 application with 15 threads or 15 applications with 1 thread and it's logically similar: you process 15 tasks in 15 threads. The only difference is that 15 applications with 1 thread each allow you to spread your load across JVMs.
Likewise, if you start 15 instances of the application, each instance with 1 thread, then each application will be assigned 1 task, and each 1 thread in each application will handle its given 1 task.
what is the maximum thread number I can have in this case? 10 or 15?
You can set your maximum thread count to anything. If your thread count across all instances exceeds the total number of tasks, then some of the threads will remain idle.
I recommend reading https://docs.confluent.io/current/streams/architecture.html#parallelism-model, if you haven't yet. Also, study the logs your application produces when it starts up. Each thread logs the tasks it gets assigned, like this:
[2018-01-04 16:45:26,859] INFO (org.apache.kafka.streams.processor.internals.StreamThread:351) stream-thread [entities-eb9c0a9b-ecad-48c1-b4e8-715dcf2afef3-StreamThread-3] partition assignment took 110 ms.
current active tasks: [0_0, 0_2, 1_2, 2_2, 3_2, 4_2, 5_2, 6_2, 7_2, 8_2, 9_2, 10_2, 11_2, 12_2, 13_2, 14_2]
current standby tasks: []
previous active tasks: []
Dmitry's answer does not seem to be completely correct.
Then, each of the 15 input partitions will map to a single Kafka Streams "task"
Not in general. It depends on the "structure" of your topology. It could also be only 10 tasks.
Otherwise, excellent answer from Dmitry!
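A small sketch of the two cases, for illustration only. It assumes the usual rule that each sub-topology gets as many tasks as the largest partition count among its source topics, and it uses a plain round-robin to hand tasks to threads, which is a simplification of the real assignor. The function names and topic names are made up.

```python
def count_tasks(sub_topologies):
    """sub_topologies: list of {source_topic: partition_count} dicts."""
    # One task per partition of the widest source topic in each sub-topology.
    return sum(max(topics.values()) for topics in sub_topologies)


def assign_round_robin(num_tasks, num_threads):
    """Spread task indices over threads; extra threads get an empty list."""
    threads = [[] for _ in range(num_threads)]
    for task in range(num_tasks):
        threads[task % num_threads].append(task)
    return threads


# Each topic is processed on its own: two sub-topologies.
independent = count_tasks([{"topic-a": 10}, {"topic-b": 5}])
# Both topics feed the same sub-topology (e.g. they are merged).
combined = count_tasks([{"topic-a": 10, "topic-b": 5}])
print(independent, combined)  # 15 10

# 20 threads for 15 tasks: 5 threads have nothing to do.
idle = sum(1 for tasks in assign_round_robin(independent, 20) if not tasks)
print(idle)  # 5
```

So the useful upper bound on the thread count is the number of tasks, which is 15 or 10 depending on how the topology is structured.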

How to put a rate limit on a celery queue?

I read this in the celery documentation for Task.rate_limit:
Note that this is a per worker instance rate limit, and not a global rate limit. To enforce a global rate limit (e.g., for an API with a maximum number of requests per second), you must restrict to a given queue.
How do I put a rate limit on a celery queue?
It turns out this can't be done at the queue level across multiple workers.
It can be done at the queue level for 1 worker, or at the queue level for each worker separately.
So if you set 10 jobs/minute on 5 workers, your workers will collectively process up to 50 jobs per minute.
So to process only 10 jobs per minute overall, either use a single worker, or use 5 workers with a limit of 2/minute each.
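The arithmetic above as a tiny sketch. The helper names are hypothetical, not part of Celery, and the split assumes every worker consumes from the queue and gets the same limit.

```python
def aggregate_rate(per_worker_limit, workers):
    """Total tasks per minute when every worker has the same limit."""
    return per_worker_limit * workers


def per_worker_rate_limit(global_limit_per_minute, workers):
    """Celery-style rate string to give each worker for a global target."""
    per_worker = global_limit_per_minute / workers
    # :g drops the trailing ".0" for whole numbers, e.g. 2.0 -> "2".
    return f"{per_worker:g}/m"


print(aggregate_rate(10, 5))          # 50
print(per_worker_rate_limit(10, 5))   # 2/m
```

Keep in mind that the total changes whenever workers are added or removed, so the per-worker value has to be recomputed when the pool is resized.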
Update: How to exactly put the limit in settings/configuration:
task_annotations = {'tasks.<task_name>': {'rate_limit': '10/m'}}
or change the same for all tasks:
task_annotations = {'*': {'rate_limit': '10/m'}}
10/m means 10 tasks per minute, /s would mean per second. More details here: Task annotations setting
Hey, I was trying to find a way to rate limit a queue and found out that Celery can't do that. However, Celery can control the rate per task, see this:
So as a workaround, maybe you can set up one task per queue (which makes sense in a lot of situations) and put the limit on the task.
You can set this limit in the Flower worker pane.
There is a blank field there for entering your limit.
The format that is suggested to be used is also like the below:
The rate limits can be specified in seconds, minutes or hours by appending “/s”, “/m” or “/h” to the value. Tasks will be evenly distributed over the specified time frame.
Example: “100/m” (hundred tasks a minute). This will enforce a minimum delay of 600ms between starting two tasks on the same worker instance.
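To see where the 600ms comes from, here is a minimal sketch that turns a rate string into the implied delay between task starts. The parser is my own illustration, not Celery's code, and it only handles the "N/s", "N/m" and "N/h" forms quoted above (a bare number is treated as per second).

```python
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def min_delay_seconds(rate_limit: str) -> float:
    """Minimum spacing between task starts implied by a rate string."""
    count, _, unit = rate_limit.partition("/")
    unit = unit or "s"  # assumption: no suffix means "per second"
    return _UNIT_SECONDS[unit] / float(count)


print(min_delay_seconds("100/m"))  # 0.6 seconds, i.e. 600ms
print(min_delay_seconds("10/m"))   # 6.0 seconds
```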