Kafka Streams thread number - apache-kafka

I am new to Kafka Streams and am confused about the maximum parallelism of a Kafka Streams application. I went through the following link but did not find the answer I am looking for.
https://docs.confluent.io/current/streams/faq.html#streams-faq-scalability-maximum-parallelism
If I have 2 input topics, one with 10 partitions and the other with 5 partitions, and only one Kafka Streams application instance is running to process these two input topics, what is the maximum number of threads I can have in this case? 10 or 15?

If I have 2 input topics, one with 10 partitions and the other with 5 partitions
Sounds good. So you have 15 total partitions. Let's assume you have a simple processor topology, without joins and aggregations, so that all 15 partitions are just being statelessly transformed.
Then each of the 15 input partitions will map to a single Kafka Streams "task". If you have 1 thread, input from these 15 tasks will be processed by that 1 thread. If you have 15 threads, each task will have a dedicated thread to handle its input. So you can run 1 application instance with 15 threads or 15 instances with 1 thread each, and it's logically similar: you process 15 tasks in 15 threads. The only difference is that 15 instances with 1 thread each let you spread the load across JVMs.
Likewise, if you start 15 instances of the application, each with 1 thread, then each instance will be assigned 1 task, and its single thread will handle that task.
what is the maximum thread number I can have in this case? 10 or 15?
You can set your maximum thread count to anything. If your thread count across all instances exceeds the total number of tasks, then some of the threads will remain idle.
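As a rough sketch of that arithmetic in plain Java (no Kafka dependency): `num.stream.threads`, `application.id` and `bootstrap.servers` are real Kafka Streams settings, but the class name, method names and the values used here are made up for illustration, and the task count of 15 assumes the simple stateless topology described above.

```java
import java.util.Properties;

// Hypothetical helper, not part of Kafka Streams.
final class ThreadBudget {
    // Threads that can actually receive work: at most one per task.
    static int busyThreads(int totalTasks, int totalThreads) {
        return Math.min(totalTasks, totalThreads);
    }

    // Threads beyond the task count have nothing to do and stay idle.
    static int idleThreads(int totalTasks, int totalThreads) {
        return Math.max(0, totalThreads - totalTasks);
    }

    // Per-instance configuration; the id and address are placeholder values.
    static Properties config(int threads) {
        Properties props = new Properties();
        props.put("application.id", "example-app");       // assumed name
        props.put("bootstrap.servers", "localhost:9092"); // assumed address
        props.put("num.stream.threads", Integer.toString(threads));
        return props;
    }

    public static void main(String[] args) {
        int tasks = 15; // assumption: one task per input partition
        System.out.println(busyThreads(tasks, 20) + " busy, "
                + idleThreads(tasks, 20) + " idle");
    }
}
```

With 15 tasks, configuring 20 threads would leave 5 of them idle, so 15 is the largest thread count that still adds parallelism in this scenario.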
I recommend reading https://docs.confluent.io/current/streams/architecture.html#parallelism-model, if you haven't yet. Also, study the logs your application produces when it starts up. Each thread logs the tasks it gets assigned, like this:
[2018-01-04 16:45:26,859] INFO (org.apache.kafka.streams.processor.internals.StreamThread:351) stream-thread [entities-eb9c0a9b-ecad-48c1-b4e8-715dcf2afef3-StreamThread-3] partition assignment took 110 ms.
current active tasks: [0_0, 0_2, 1_2, 2_2, 3_2, 4_2, 5_2, 6_2, 7_2, 8_2, 9_2, 10_2, 11_2, 12_2, 13_2, 14_2]
current standby tasks: []
previous active tasks: []

Dmitry's answer does not seem to be completely correct.
Then, each of the 15 input partitions will map to a single Kafka Streams "task"
Not in general. It depends on the "structure" of your topology. There could also be only 10 tasks.
Otherwise, excellent answer from Dmitry!
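To make the "depends on the structure" point concrete: Kafka Streams creates tasks per sub-topology, and a sub-topology gets as many tasks as the largest partition count among its input topics. The sketch below only models that arithmetic in plain Java; the class and method names are hypothetical.

```java
import java.util.List;

// Hypothetical model of the task-count rule, not Kafka Streams code.
final class TaskCount {
    // Each inner list holds the partition counts of the input topics
    // that feed one sub-topology.
    static int totalTasks(List<List<Integer>> subTopologies) {
        int total = 0;
        for (List<Integer> partitionCounts : subTopologies) {
            int max = 0;
            for (int partitions : partitionCounts) {
                max = Math.max(max, partitions);
            }
            total += max; // one task per partition of the widest input topic
        }
        return total;
    }

    public static void main(String[] args) {
        // Topics processed independently: two sub-topologies.
        System.out.println(totalTasks(List.of(List.of(10), List.of(5))));
        // Both topics feeding the same sub-topology (for example, merged).
        System.out.println(totalTasks(List.of(List.of(10, 5))));
    }
}
```

If the two topics are processed independently, the result is 10 + 5 = 15 tasks. If both feed the same sub-topology, the result is max(10, 5) = 10 tasks, which is the case this answer refers to.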

Related

If we use multiple task queues in Temporal, how does my worker know which one to poll?

If I set up 10 video task queues in Temporal matching and we have 5 matching services, will Temporal assign 2 video task queues to each matching service?
If I set up 10 video task queues in Temporal matching and we have 50 workers for them, how are the workers assigned to the task queues they poll? Do 5 workers poll each queue? How do we decide which worker polls which video task queue? Can anyone explain the principle a little bit?
If I set up 10 video task queues in Temporal matching and we have 5
matching services, will Temporal assign 2 video task queues to each
matching service?
By default, a Temporal Task Queue is configured with 4 partitions, so 10 task queues have 40 partitions in total. Temporal uses consistent hashing to place partitions onto matching hosts. Note that this algorithm doesn't guarantee an exactly even distribution, but on average each of the 5 hosts will end up with 8 partitions.
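The uneven-but-roughly-balanced placement can be illustrated with a toy hash ring. This is a generic consistent-hashing sketch in plain Java, not Temporal's implementation: the hash function, the number of virtual nodes, and the queue, partition and host names are all assumptions chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy consistent-hash ring; every name and constant here is hypothetical.
final class PartitionPlacement {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    PartitionPlacement(String[] hosts, int virtualNodes) {
        // Each host owns several points on the ring to smooth the spread.
        for (String host : hosts) {
            for (int v = 0; v < virtualNodes; v++) {
                ring.put(hash(host + "#" + v), host);
            }
        }
    }

    // Mixes the bits of String.hashCode; an arbitrary choice for this sketch.
    static int hash(String s) {
        int h = s.hashCode();
        h ^= (h >>> 16);
        h *= 0x85ebca6b;
        h ^= (h >>> 13);
        return h;
    }

    // A key belongs to the first host point at or after its hash, wrapping around.
    String hostFor(String key) {
        Map.Entry<Integer, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // Counts how many task queue partitions land on each host.
    static Map<String, Integer> place(int queues, int partitionsPerQueue, String[] hosts) {
        PartitionPlacement placement = new PartitionPlacement(hosts, 100);
        Map<String, Integer> counts = new HashMap<>();
        for (String host : hosts) {
            counts.put(host, 0);
        }
        for (int q = 0; q < queues; q++) {
            for (int p = 0; p < partitionsPerQueue; p++) {
                counts.merge(placement.hostFor("video-" + q + "/" + p), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] hosts = {"matching-1", "matching-2", "matching-3", "matching-4", "matching-5"};
        System.out.println(place(10, 4, hosts));
    }
}
```

The per-host counts always sum to 40, but individual hosts will usually hold somewhat more or fewer than 8 partitions.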
If I set up 10 video task queues in Temporal matching and we have 50
workers for them, how are the workers assigned to the task queues they
poll? Do 5 workers poll each queue? How do we decide which worker polls
which video task queue? Can anyone explain the principle a little bit?
Temporal doesn't assign workers to task queues. Your code does that: when a worker is created, a task queue name is a required parameter. In the majority of cases, you don't need to use multiple task queues. A single queue can support almost any throughput if configured with the appropriate number of partitions.
The reasons for using more than one task queue for a given application:
To route requests to separate pools of workers or to specific processes
To rate limit a certain type of request
To specify per-worker limits (rate and number of parallel tasks) for a certain type of request

Does Kafka Streams library kill idle StreamThreads?

Say, KStream topology is simple: input topic -> process -> output topic. Partitions of input topic = 4.
If there is a single instance of app running with num.stream.threads=4, all 4 StreamThreads are utilized.
If a second instance is launched (with num.stream.threads=4), stream tasks are now distributed between the two. Task 0_1 and 0_2 on first instance, Task 0_3 and 0_4 on second instance.
On the first instance, does the Kafka Streams library kill the threads that had been running 0_3 and 0_4 so far?
For your case, where the input topic has only 4 partitions, what happens when you start 8 instances with num.stream.threads=1?
4 instances become idle but are not killed. They stay up and are assigned a task if any of the instances that already hold one goes down.
The same thing happens when you start multiple threads in one instance. In your case there are 8 threads across 2 instances, 4 in each, and the scenario described above applies: 4 of your threads become idle and remain idle until they are assigned a task because the other instance went down.
More references:
streams-faq-scalability-maximum-parallelism
kafka-streams-internals-StreamThread
Let’s take an example. Imagine your application is reading from an input
topic that has 5 partitions. How many app instances can we run here?
The short answer is that we can run up to 5 instances of this
application, because the application’s maximum parallelism is 5. If we
run more than 5 app instances, then the “excess” app instances will
successfully launch but remain idle. If one of the busy instances goes
down, one of the idle instances will resume the former’s work.
You can get more information about the threads by setting up metrics.
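The scenarios above can be reproduced with a small simulation. It is a deliberate simplification in plain Java: tasks are dealt round-robin to instances and then to threads, whereas the real Kafka Streams assignor also considers stickiness and standby tasks. The class and method names are hypothetical.

```java
// Simplified round-robin model of task assignment; not the real assignor.
final class IdleThreads {
    // Returns load[instance][thread] = number of tasks given to that thread.
    static int[][] assign(int tasks, int instances, int threadsPerInstance) {
        int[][] load = new int[instances][threadsPerInstance];
        int[] tasksOnInstance = new int[instances];
        for (int task = 0; task < tasks; task++) {
            int instance = task % instances;
            int thread = tasksOnInstance[instance] % threadsPerInstance;
            load[instance][thread]++;
            tasksOnInstance[instance]++;
        }
        return load;
    }

    // Threads with no task are idle; they keep running and wait for a rebalance.
    static int idle(int[][] load) {
        int idle = 0;
        for (int[] instance : load) {
            for (int tasksOnThread : instance) {
                if (tasksOnThread == 0) {
                    idle++;
                }
            }
        }
        return idle;
    }

    public static void main(String[] args) {
        System.out.println(idle(assign(4, 1, 4))); // one instance, 4 threads
        System.out.println(idle(assign(4, 2, 4))); // second instance added
        System.out.println(idle(assign(4, 8, 1))); // 8 single-threaded instances
    }
}
```

With 4 tasks, a single instance with 4 threads has no idle threads, two such instances have 4 idle threads between them, and 8 single-threaded instances leave 4 instances idle.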

Kafka streams threads and count of records being processed

Say we have a topic with 2 partitions and n producers writing data to it, so millions of message records are spread over the 2 partitions.
Say we have 2 threads (i.e. 2 separate instances) running the Streams processor. In this setup, Thread-1 (i.e. Streaming Task-1) got partition P-1 and Thread-2 (i.e. Streaming Task-2) got partition P-2 for processing.
The ask: how do we find out how many message records Streaming-Task-1 has processed so far, or on a given day, say 28th September, 2KK?
The bigger question: Streaming-Task-1 would never know the total count of message records processed; it only knows the count it has processed itself.
Can it ever know the count processed by the other task, Task-2?
There are several ways to accomplish what you are asking. If you are using the DSL, I suggest you take a look at the word count example (https://docs.confluent.io/current/streams/quickstart.html). With a map operation you can produce all the counts you want relatively simply.
If you are not using the DSL, you can still do the same with a couple of processors and state stores.
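The idea behind the map-then-count approach can be shown without Kafka at all. The sketch below is plain Java over an in-memory list, where each record is reduced to the partition it arrived on; the class and method names are hypothetical. In the DSL, the analogous step would be re-keying the records and then grouping and counting, which routes all records with the same key to one task.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// In-memory illustration of "re-key, then count"; not Kafka Streams code.
final class RecordCounts {
    // Count per partition: what each task could track on its own.
    static Map<Integer, Long> perPartition(List<Integer> recordPartitions) {
        return recordPartitions.stream()
                .collect(Collectors.groupingBy(p -> p, Collectors.counting()));
    }

    // Total count: map every record to one constant key, then count that key.
    static long total(List<Integer> recordPartitions) {
        return recordPartitions.stream()
                .collect(Collectors.groupingBy(p -> "all", Collectors.counting()))
                .getOrDefault("all", 0L);
    }

    public static void main(String[] args) {
        List<Integer> records = List.of(0, 1, 1, 0, 1); // partition of each record
        System.out.println(perPartition(records));
        System.out.println(total(records));
    }
}
```

The trade-off of the constant-key approach is that the total is maintained by a single task, so it answers the "does any task know the total" question at the cost of funnelling one small record per input record through one partition.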

Storm+Kafka not parallelizing as expected

We are having an issue regarding parallelism of tasks inside a single topology. We cannot manage to get a good, fluent processing rate.
We are using Kafka and Storm to build a system with different topologies, where data is processed following a chain of topologies connected using Kafka topics.
We are using Kafka 1.0.0 and Storm 1.2.1.
The load is small in number of messages, about 2000 per day, however each task can take quite some time. One topology in particular can take a variable amount of time to process each task, usually between 1 and 20 minutes. If processed sequentially, the throughput is not enough to process all incoming messages. All topologies and Kafka system are installed in a single machine (16 cores, 16 GB of RAM).
As messages are independent and can be processed in parallel, we are trying to use Storm concurrent capabilities to improve the throughput.
For that the topology has been configured as follows:
4 workers
parallelism hint set to 10
Message size when reading from Kafka large enough to read about 8 tasks in each message.
Kafka topics use replication-factor = 1 and partitions = 10.
With this configuration, we observe the following behavior in this topology.
About 7-8 tasks are read in a single batch from Kafka by the Storm topology (task size is not fixed), max message size of 128 kB.
About 4-5 tasks are computed concurrently. Work is more or less evenly distributed among workers. Some workers take 1 task, others take 2 and process them concurrently.
As tasks are being finished, the remaining tasks start.
A starvation problem is reached when only 1-2 tasks remain to be processed. Other workers wait idle until all tasks are finished.
When all tasks are finished, the message is confirmed and sent to the next topology.
A new batch is read from Kafka and the process starts again.
We have two main issues. First, even with 4 workers and 10 parallelism hint, only 4-5 tasks are started. Second, no more batches are started while there is work pending, even if it is just 1 task.
It is not a problem of not having enough work to do, as we tried inserting 2000 tasks at the beginning, so there is plenty of work to do.
We have tried to increase the parameter "maxSpoutsPending", expecting that the topology would read more batches and queue them at the same time, but it seems they are being pipelined internally, and not processed concurrently.
Topology is created using the following code:
private static StormTopology buildTopologyOD() {
    // This is the marker interface BrokerHosts.
    BrokerHosts hosts = new ZkHosts(configuration.getProperty(ZKHOSTS));
    TridentKafkaConfig tridentConfigCorrelation = new TridentKafkaConfig(hosts, configuration.getProperty(TOPIC_FROM_CORRELATOR_NAME));
    tridentConfigCorrelation.scheme = new RawMultiScheme();
    tridentConfigCorrelation.fetchSizeBytes = Integer.parseInt(configuration.getProperty(MAX_SIZE_BYTES_CORRELATED_STREAM));
    OpaqueTridentKafkaSpout spoutCorrelator = new OpaqueTridentKafkaSpout(tridentConfigCorrelation);
    TridentTopology topology = new TridentTopology();
    Stream existingObject = topology.newStream("kafka_spout_od1", spoutCorrelator)
        .shuffle()
        .each(new Fields("bytes"), new ProcessTask(), new Fields(RESULT_FIELD, OBJECT_FIELD))
        .parallelismHint(Integer.parseInt(configuration.getProperty(PARALLELISM_HINT)));
    // Create a state factory to produce outputs to Kafka topics.
    TridentKafkaStateFactory stateFactory = new TridentKafkaStateFactory()
        .withProducerProperties(kafkaProperties)
        .withKafkaTopicSelector(new ODTopicSelector())
        .withTridentTupleToKafkaMapper(new ODTupleToKafkaMapper());
    existingObject.partitionPersist(stateFactory, new Fields(RESULT_FIELD, OBJECT_FIELD), new TridentKafkaUpdater(), new Fields(OBJECT_FIELD));
    return topology.build();
}
and config created as:
private static Config createConfig(boolean local) {
    Config conf = new Config();
    conf.setMaxSpoutPending(1); // Also tried 2..6
    conf.setNumWorkers(4);
    return conf;
}
Is there anything we can do to improve the performance, either by increasing the number of parallel tasks or/and avoiding starvation while finishing to process a batch?
I found an old post on storm-users by Nathan Marz regarding setting parallelism for Trident:
I recommend using the "name" function to name portions of your stream
so that the UI shows you what bolts correspond to what sections.
Trident packs operations into as few bolts as possible. In addition,
it never repartitions your stream unless you've done an operation
that explicitly involves a repartitioning (e.g. shuffle, groupBy,
partitionBy, global aggregation, etc). This property of Trident
ensures that you can control the ordering/semi-ordering of how things
are processed. So in this case, everything before the groupBy has to
have the same parallelism or else Trident would have to repartition
the stream. And since you didn't say you wanted the stream
repartitioned, it can't do that. You can get a different parallelism
for the spout vs. the each's following by introducing a repartitioning
operation, like so:
stream.parallelismHint(1).shuffle().each(…).each(…).parallelismHint(3).groupBy(…);
I think you might want to set parallelismHint for your spout as well as your .each.
Regarding processing multiple batches concurrently, you are right that that is what maxSpoutPending is for in Trident. Try checking in Storm UI that your max spout pending value is actually picked up. Also try enabling debug logging for the MasterBatchCoordinator. You should be able to tell from that logging whether multiple batches are in flight at the same time or not.
When you say that multiple batches are not processed concurrently, do you mean by ProcessTask? Keep in mind that one of the properties of Trident is that state updates are ordered with regard to batches. If you have e.g. maxSpoutPending=3 and batch 1, 2 and 3 in flight, Trident won't emit more batches for processing until batch 1 is written, at which point it'll emit one more. So slow batches can block emitting more, even if 2 and 3 are fully processed, they have to wait for 1 to finish and get written.
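The blocking effect described above can be seen in a small timing model. This is a simplified sketch in plain Java, not Storm code: it assumes a batch is emitted as soon as a pending slot frees up, that a slot frees only when the oldest in-flight batch is written, and that writes happen strictly in batch order. The class name, method name and durations are made up.

```java
import java.util.Arrays;

// Simplified timing model of ordered batch commits; not Storm's scheduler.
final class BatchPipeline {
    // durations[i] is the processing time of batch i (any time unit).
    // Returns the time at which each batch is emitted. Requires maxPending >= 1.
    static long[] emitTimes(long[] durations, int maxPending) {
        int n = durations.length;
        long[] emit = new long[n];
        long[] commit = new long[n];
        for (int i = 0; i < n; i++) {
            // The first maxPending batches start at once; later ones wait for
            // the batch maxPending positions earlier to be written.
            emit[i] = i < maxPending ? 0 : commit[i - maxPending];
            long finished = emit[i] + durations[i];
            // A batch cannot be written before the previous batch is written.
            commit[i] = i == 0 ? finished : Math.max(finished, commit[i - 1]);
        }
        return emit;
    }

    public static void main(String[] args) {
        long[] minutes = {20, 1, 1, 1}; // one slow batch followed by fast ones
        System.out.println(Arrays.toString(emitTimes(minutes, 3)));
        System.out.println(Arrays.toString(emitTimes(minutes, 1)));
    }
}
```

In this model, with one 20-minute batch at the front, the fourth batch is emitted at minute 20 with a pending limit of 3 and at minute 22 with a limit of 1, so raising the limit helps little while a slow batch holds the head of the line.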
If you don't need the batching and ordering behavior of Trident, you could try regular Storm instead.
More of a side note, but you might want to consider migrating off storm-kafka to storm-kafka-client. It's not important for this question, but you won't be able to upgrade to Kafka 2.x without doing it, and it's easier before you get a bunch of state to migrate.

How many Kafka Streams apps are recommended to run on a single machine in production?

In our architecture, we plan to run roughly three JVM processes on one machine, and each JVM can host up to 15 Kafka Streams apps.
If I am not wrong, each Kafka Streams app spawns one Java thread. So this seems like an awkward architecture, with around 45 Kafka Streams apps running on a single machine.
So I have a question in three parts:
1) Is my understanding correct that each Kafka Streams app spawns one Java thread? Also, does each Kafka Streams app open a new TCP connection to the Kafka broker?
2) Is there a way to share one TCP connection among multiple Kafka Streams apps?
3) Is it difficult (not recommended) to run 45 streams on a single machine?
The answer to this is definitely NO unless there is a real use case in production.
Multiple answers:
a KafkaStreams instance starts one processing thread by default (you
can configure more processing threads, too)
internally, KafkaStreams uses two KafkaConsumers and one KafkaProducer
(if you turn on EOS, it uses even more KafkaProducers): a KafkaConsumer
starts a background heartbeat thread and a KafkaProducer starts a
background sender thread => you get 4 threads in total (processing, 2x
heartbeat, sender) -- if you configure two processing threads, you end
up with 8 threads in total, etc.
there is more than one TCP connection as the consumer and the producer
(and the restore consumer, if you enable StandbyTasks) connect to the
cluster
it's not possible to share any TCP connections at the moment (this would
require a major rewrite of consumers and producers)
how many threads you can run efficiently depends on your hardware and
workload... monitor your CPU utilization and see how busy your machine is...
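The thread arithmetic from this answer, written out as a small sketch. It follows the answer's own counting (per processing thread: the thread itself, two consumer heartbeat threads and one producer sender thread, with EOS off); actual JVM thread counts vary with the Kafka version and configuration. The class and method names are hypothetical.

```java
// Back-of-the-envelope estimate based on the counting in the answer above.
final class StreamsThreadEstimate {
    static final int HEARTBEAT_THREADS = 2; // one per KafkaConsumer
    static final int SENDER_THREADS = 1;    // one per KafkaProducer (EOS off)

    // Threads created by one KafkaStreams instance.
    static int threadsPerInstance(int processingThreads) {
        return processingThreads * (1 + HEARTBEAT_THREADS + SENDER_THREADS);
    }

    // Threads created by several identical instances on one machine.
    static int threadsOnMachine(int instances, int processingThreadsPerInstance) {
        return instances * threadsPerInstance(processingThreadsPerInstance);
    }

    public static void main(String[] args) {
        System.out.println(threadsPerInstance(1));
        System.out.println(threadsPerInstance(2));
        System.out.println(threadsOnMachine(45, 1));
    }
}
```

Under these assumptions, 45 single-threaded apps on one machine come to roughly 45 x 4 = 180 threads, most of which are lightweight background threads rather than processing threads.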
Each Kafka Streams job spawns a single thread by default. If the thread
count is set to n, it can process n Kafka partitions in parallel.
If a single machine does not have the capacity to run a large number of
threads, parallelism can be achieved by starting the Streams
application with the same application name (application.id) on another
machine in the same cluster. Kafka Streams will detect the additional
instance and handle the distribution of work in the background.
Is it difficult (not recommended) to run 45 streams on a single machine?
The answer to this is definitely NO unless there is a real use
case in production. -- Unless your system has that many cores
or the input has 45 partitions, this is not necessary.