How many Kafka Streams apps are recommended to run on a single machine in production?

In our architecture, we plan to run roughly three JVM processes on one machine, and each JVM can host up to 15 Kafka Streams apps.
And if I am not wrong, each Kafka Streams app spawns one Java thread. So this seems like an awkward architecture, with around 45 Kafka Streams apps running on a single machine.
So, I have a question in three parts:
1) Is my understanding correct that each Kafka Streams app spawns one Java thread? Also, does each Kafka Streams app open a new TCP connection to the Kafka broker?
2) Is there a way to share one TCP connection among multiple Kafka Streams apps?
3) Is it difficult (not recommended) to run 45 streams on a single machine?
The answer to this is definitely NO unless there is a real use case in production.

Multiple answers:
A KafkaStreams instance starts one processing thread by default (you can configure more processing threads, too).
Internally, KafkaStreams uses two KafkaConsumers and one KafkaProducer per processing thread (if you turn on EOS, it uses even more KafkaProducers). A KafkaConsumer starts a background heartbeat thread and a KafkaProducer starts a background sender thread, so you get 4 threads in total (processing, 2x heartbeat, sender). If you configure two processing threads, you end up with 8 threads in total, and so on.
There is more than one TCP connection, as the consumer and the producer (and the restore consumer, if you enable StandbyTasks) each connect to the cluster.
It's not possible to share any TCP connections at the moment (this would require a major rewrite of consumers and producers).
How many threads you can run efficiently depends on your hardware and workload. Monitor your CPU utilization and see how busy your machine is.
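To put numbers on the thread count described above, here is a rough back-of-the-envelope sketch in plain Java. It only encodes the "4 threads per processing thread" figure from this answer (no EOS); the class and method names are made up for illustration and are not part of any Kafka API.

```java
public class StreamsThreadEstimate {
    // Threads per processing thread as counted in the answer (EOS disabled):
    // 1 processing + 2 consumer heartbeat + 1 producer sender.
    static final int THREADS_PER_STREAM_THREAD = 4;

    /** Rough thread count for one KafkaStreams instance. */
    static int threadsPerInstance(int streamThreads) {
        return streamThreads * THREADS_PER_STREAM_THREAD;
    }

    /** Rough thread count for several identically configured instances on one machine. */
    static int threadsOnMachine(int instances, int streamThreadsPerInstance) {
        return instances * threadsPerInstance(streamThreadsPerInstance);
    }

    public static void main(String[] args) {
        System.out.println(threadsPerInstance(1));   // prints 4
        System.out.println(threadsPerInstance(2));   // prints 8
        System.out.println(threadsOnMachine(45, 1)); // prints 180
    }
}
```

So 45 single-threaded instances would mean on the order of 180 threads by this count, before any other JVM threads. Whether that is acceptable still depends on hardware and workload.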

Each Kafka Streams job spawns a single thread by default. If the number of threads is set to n, it provides parallelism for processing up to n Kafka partitions.
If a single machine does not have the capacity to run a large number of threads, parallelism can be achieved by starting the Streams application with the same application name (application.id) on another machine in the same cluster. Kafka Streams identifies the new instance and handles the distribution of work in the background.
"Is it difficult (not recommended) to run 45 streams on a single machine?" The answer to this is definitely NO unless there is a real use case in production. Unless your system has that many cores or the input has 45 partitions, this is not necessary.


High memory consumption in MessageChannelPartitionHandler with a large number of partitions

Our use case: we use remote partitioning. The job is divided into multiple partitions, and workers process these partitions via ActiveMQ.
The job fails with a memory issue in the handle method of MessageChannelPartitionHandler, where it holds a large number of StepExecutions in memory (we have around 20K StepExecutions/partitions in this case).
We overrode MessageChannelPartitionHandler to submit messages to ActiveMQ in a controlled way. Even when we poll the replies from the database, we hit database connection timeouts, and after we increased the idle connection setting this approach also fails, because it cannot hold all those StepExecutions in memory.
With both our custom handler and the stock MessageChannelPartitionHandler we face similar issues, and these StepExecutions are required for aggregation at the master. Is there an alternative way of achieving this?
Can someone help us understand a better way of handling these long-running, large-volume data processing scenarios?

Does Kafka Streams library kill idle StreamThreads?

Say, KStream topology is simple: input topic -> process -> output topic. Partitions of input topic = 4.
If a single instance of the app is running with 4 stream threads, all 4 StreamThreads are utilized.
If a second instance is launched (also with 4 stream threads), the stream tasks are now distributed between the two: tasks 0_1 and 0_2 on the first instance, tasks 0_3 and 0_4 on the second instance.
On the first instance, does the Kafka Streams library kill the threads that were running 0_3 and 0_4 so far?
Consider your case, where the input topic has only 4 partitions, and suppose you start 8 instances: 4 instances become idle but are not killed. They stay running and are assigned a task if any instance that already holds tasks goes down.
The same thing happens when you start multiple threads in one instance. In your case there are 8 threads across 2 instances, 4 per instance, and the scenario explained above applies: 4 of your threads become idle and stay idle until they are assigned a task because another instance goes down.
More reference:
Let’s take an example. Imagine your application is reading from an input
topic that has 5 partitions. How many app instances can we run here?
The short answer is that we can run up to 5 instances of this
application, because the application’s maximum parallelism is 5. If we
run more than 5 app instances, then the “excess” app instances will
successfully launch but remain idle. If one of the busy instances goes
down, one of the idle instances will resume the former’s work.
You can see more information about the threads by setting up metrics.
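The busy/idle split described in this answer comes down to simple arithmetic. Here is a small sketch in plain Java, assuming a simple topology where each input partition maps to exactly one task; the class and method names are hypothetical and nothing here calls the Kafka Streams library.

```java
public class StreamThreadAssignment {
    /** Threads that get at least one task: at most one per partition. */
    static int busyThreads(int partitions, int totalThreads) {
        return Math.min(partitions, totalThreads);
    }

    /** Threads that start up but receive no task. They stay alive as spare capacity. */
    static int idleThreads(int partitions, int totalThreads) {
        return Math.max(0, totalThreads - partitions);
    }

    public static void main(String[] args) {
        // The question's setup: 4 partitions, 2 instances x 4 threads = 8 threads.
        System.out.println(busyThreads(4, 8)); // prints 4
        System.out.println(idleThreads(4, 8)); // prints 4
        // The quoted example: 5 partitions and, say, 7 single-threaded instances.
        System.out.println(busyThreads(5, 7)); // prints 5
        System.out.println(idleThreads(5, 7)); // prints 2
    }
}
```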

ActiveMQ Artemis. Reliable cluster with synchronous replication

I want to configure a cluster with the following expected behavior:
The cluster must be HA (at least 3 nodes).
I have queues in which it is important to maintain processing order. The consumer always reads such a queue in a single thread. Once it has taken a message, we consider our task complete.
I don't need load balancing - it is important for me to maintain the order of messages.
I want to avoid split-brain.
If we have 3 nodes, then if 1 of the nodes fails, the cluster should continue to work.
I tried the following configurations:
master + slave + slave with replication.
It works, but it does not solve the split-brain problem.
master + slave + slave + Pinger.
As far as I understand, this does not give a 100% guarantee of detecting network problems, so we can still get split-brain.
3 pairs of live/backup nodes.
This solves the split-brain problem, but how can we avoid the following situation:
The producer sends a message to group A, into a queue where processing order matters.
Group A crashes (1/3 of all nodes, 2 of 6).
The message stays stored in the journal of group A.
The cluster continues to work.
The producer sends a message to group B, into a queue where processing order matters.
The consumer gets this second message first, so the required message order is not preserved.
How should I build a cluster to solve these problems?
You can't achieve the behavior you want using replication. You need to use a shared store between the nodes. If you must use 3 nodes then I would recommend master + slave + slave. Otherwise I'd recommend master + slave.
Also, for what it's worth, replication is not synchronous within the broker. It is asynchronous and non-blocking. However, it is still reliable. For example, when a broker is configured for HA with replication and it receives a durable message from a client it will persist that message to disk and send it to the replicated backup concurrently without blocking. However, it will wait for both operations to finish before responding to the client that it has received the message. This allows much greater message throughput than using a synchronous architecture internally although the whole process will appear to be synchronous to external clients.
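As an illustration of the pattern described above (persist and replicate concurrently, acknowledge only after both finish), here is a minimal sketch using plain java.util.concurrent. It is not Artemis code: persistToJournal, sendToBackup and the "ack:" string are hypothetical stand-ins invented for this example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReplicatedWriteSketch {
    // Hypothetical stand-ins for the local journal write and the send to the backup.
    static void persistToJournal(String message) { /* pretend disk write */ }
    static void sendToBackup(String message) { /* pretend network send */ }

    /** Starts both operations without blocking; the ack completes only after both are done. */
    static CompletableFuture<String> receive(String message, ExecutorService pool) {
        CompletableFuture<Void> persisted =
                CompletableFuture.runAsync(() -> persistToJournal(message), pool);
        CompletableFuture<Void> replicated =
                CompletableFuture.runAsync(() -> sendToBackup(message), pool);
        return CompletableFuture.allOf(persisted, replicated)
                .thenApply(ignored -> "ack:" + message);
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // The client only sees the ack, so the exchange looks synchronous from outside.
            System.out.println(receive("order-1", pool).join()); // prints ack:order-1
        } finally {
            pool.shutdown();
        }
    }
}
```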
Also, it's worth noting that work is underway to change how replication works to make it more robust against split brain and to enable a single master + slave pair that is suitable for production use.

Storm+Kafka not parallelizing as expected

We are having an issue regarding parallelism of tasks inside a single topology. We cannot manage to get a good, fluent processing rate.
We are using Kafka and Storm to build a system with different topologies, where data is processed following a chain of topologies connected using Kafka topics.
We are using Kafka 1.0.0 and Storm 1.2.1.
The load is small in number of messages, about 2000 per day; however, each task can take quite some time. One topology in particular can take a variable amount of time to process each task, usually between 1 and 20 minutes. If processed sequentially, the throughput is not enough to process all incoming messages. All topologies and the Kafka system are installed on a single machine (16 cores, 16 GB of RAM).
As messages are independent and can be processed in parallel, we are trying to use Storm concurrent capabilities to improve the throughput.
For that the topology has been configured as follows:
4 workers
parallelism hint set to 10
Message size when reading from Kafka large enough to read about 8 tasks in each message.
Kafka topics use replication-factor = 1 and partitions = 10.
With this configuration, we observe the following behavior in this topology.
About 7-8 tasks are read in a single batch from Kafka by the Storm topology (task size is not fixed), max message size of 128 kB.
About 4-5 tasks are computed concurrently. Work is more or less evenly distributed among workers. Some workers take 1 task, others take 2 and process them concurrently.
As tasks are being finished, the remaining tasks start.
A starvation problem is reached when only 1-2 tasks remain to be processed. Other workers wait idle until all tasks are finished.
When all tasks are finished, the message is confirmed and sent to the next topology.
A new batch is read from Kafka and the process starts again.
We have two main issues. First, even with 4 workers and 10 parallelism hint, only 4-5 tasks are started. Second, no more batches are started while there is work pending, even if it is just 1 task.
It is not a problem of not having enough work to do, as we tried inserting 2000 tasks at the beginning, so there is plenty of work to do.
We have tried to increase the parameter "maxSpoutsPending", expecting that the topology would read more batches and queue them at the same time, but it seems they are being pipelined internally, and not processed concurrently.
Topology is created using the following code:
private static StormTopology buildTopologyOD() {
    // This is the marker interface BrokerHosts.
    BrokerHosts hosts = new ZkHosts(configuration.getProperty(ZKHOSTS));
    TridentKafkaConfig tridentConfigCorrelation = new TridentKafkaConfig(hosts, configuration.getProperty(TOPIC_FROM_CORRELATOR_NAME));
    tridentConfigCorrelation.scheme = new RawMultiScheme();
    tridentConfigCorrelation.fetchSizeBytes = Integer.parseInt(configuration.getProperty(MAX_SIZE_BYTES_CORRELATED_STREAM));
    OpaqueTridentKafkaSpout spoutCorrelator = new OpaqueTridentKafkaSpout(tridentConfigCorrelation);
    TridentTopology topology = new TridentTopology();
    Stream existingObject = topology.newStream("kafka_spout_od1", spoutCorrelator)
            .each(new Fields("bytes"), new ProcessTask(), new Fields(RESULT_FIELD, OBJECT_FIELD));
    // Create a state factory to produce outputs to Kafka topics.
    TridentKafkaStateFactory stateFactory = new TridentKafkaStateFactory()
            .withKafkaTopicSelector(new ODTopicSelector())
            .withTridentTupleToKafkaMapper(new ODTupleToKafkaMapper());
    existingObject.partitionPersist(stateFactory, new Fields(RESULT_FIELD, OBJECT_FIELD), new TridentKafkaUpdater(), new Fields(OBJECT_FIELD));
    return topology.build();
}
and config created as:
private static Config createConfig(boolean local) {
    Config conf = new Config();
    conf.setMaxSpoutPending(1); // Also tried 2..6
    return conf;
}
Is there anything we can do to improve the performance, either by increasing the number of parallel tasks or/and avoiding starvation while finishing to process a batch?
I found an old post on storm-users by Nathan Marz regarding setting parallelism for Trident:
I recommend using the "name" function to name portions of your stream
so that the UI shows you what bolts correspond to what sections.
Trident packs operations into as few bolts as possible. In addition,
it never repartitions your stream unless you've done an operation
that explicitly involves a repartitioning (e.g. shuffle, groupBy,
partitionBy, global aggregation, etc). This property of Trident
ensures that you can control the ordering/semi-ordering of how things
are processed. So in this case, everything before the groupBy has to
have the same parallelism or else Trident would have to repartition
the stream. And since you didn't say you wanted the stream
repartitioned, it can't do that. You can get a different parallelism
for the spout vs. the each's following by introducing a repartitioning
operation, like so:
I think you might want to set parallelismHint for your spout as well as your .each.
Regarding processing multiple batches concurrently, you are right that that is what maxSpoutPending is for in Trident. Try checking in Storm UI that your max spout pending value is actually picked up. Also try enabling debug logging for the MasterBatchCoordinator. You should be able to tell from that logging whether multiple batches are in flight at the same time or not.
When you say that multiple batches are not processed concurrently, do you mean by ProcessTask? Keep in mind that one of the properties of Trident is that state updates are ordered with regard to batches. If you have e.g. maxSpoutPending=3 and batch 1, 2 and 3 in flight, Trident won't emit more batches for processing until batch 1 is written, at which point it'll emit one more. So slow batches can block emitting more, even if 2 and 3 are fully processed, they have to wait for 1 to finish and get written.
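To make the ordering effect above concrete, here is a toy model in plain Java. It is not Storm's scheduling code, just a sketch under two assumptions taken from this explanation: at most maxPending batches are in flight, and a batch can only be written after the previous one is written. All names and the sample durations are made up.

```java
import java.util.Arrays;

public class OrderedBatchWindow {
    /**
     * Returns the time at which each batch starts processing, given how long each
     * batch takes and how many batches may be in flight. Writes happen in batch order.
     */
    static long[] startTimes(long[] durations, int maxPending) {
        int n = durations.length;
        long[] start = new long[n];
        long[] written = new long[n];
        for (int i = 0; i < n; i++) {
            // A slot frees up only when the oldest in-flight batch has been written.
            start[i] = (i < maxPending) ? 0 : written[i - maxPending];
            long finished = start[i] + durations[i];
            // Ordered writes: batch i cannot be written before batch i-1.
            written[i] = (i == 0) ? finished : Math.max(finished, written[i - 1]);
        }
        return start;
    }

    public static void main(String[] args) {
        // One slow batch (20 minutes) followed by fast ones (1 minute), maxPending = 3.
        long[] starts = startTimes(new long[] {20, 1, 1, 1, 1}, 3);
        System.out.println(Arrays.toString(starts)); // prints [0, 0, 0, 20, 20]
    }
}
```

In this model the second and third batches finish after 1 minute, yet the fourth cannot start until minute 20, when the slow first batch is finally written. That matches the starvation pattern described in the question.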
If you don't need the batching and ordering behavior of Trident, you could try regular Storm instead.
More of a side note, but you might want to consider migrating off storm-kafka to storm-kafka-client. It's not important for this question, but you won't be able to upgrade to Kafka 2.x without doing it, and it's easier before you get a bunch of state to migrate.

How to design task distribution with ZooKeeper

I am planning to write an application which will have distributed Worker processes. One of them will be the Leader, which will assign tasks to the other processes. Designing the Leader election process is quite simple: each process tries to create an ephemeral node in the same path. Whoever is successful becomes the leader.
Now, my question is how to design the process of distributing the tasks evenly? Any recipe for this?
I'll elaborate a little on the environment setup:
Suppose there are 10 worker machines, each running one process, and one of them becomes the leader. Tasks are submitted to a queue; the Leader takes them and assigns each to a worker. The worker processes get notified whenever a task is submitted.
I am not sure I understand your algorithm for Leader election, but the recommended way of implementing this is to use sequential ephemeral nodes with the algorithm that avoids the "herd" effect.
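The core of the sequential-node approach can be sketched without a ZooKeeper connection: the child with the lowest sequence number is the leader, and every other process watches only the child just before its own, so a failure wakes up one process instead of the whole herd. The sketch below is plain Java working on a list of child names; the "n_" prefix and the class and method names are assumptions for illustration, and it relies on ZooKeeper appending a 10-digit zero-padded counter to sequential node names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class ElectionSketch {
    /** Parses the 10-digit sequence suffix of a sequential znode name. */
    static int sequence(String znode) {
        return Integer.parseInt(znode.substring(znode.length() - 10));
    }

    static List<String> sortedBySequence(List<String> children) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingInt(ElectionSketch::sequence));
        return sorted;
    }

    /** The child with the lowest sequence number is the leader. */
    static String leader(List<String> children) {
        return sortedBySequence(children).get(0);
    }

    /** The znode this process should watch; empty if this process is the leader. */
    static Optional<String> predecessorToWatch(List<String> children, String me) {
        List<String> sorted = sortedBySequence(children);
        int idx = sorted.indexOf(me);
        if (idx < 0) {
            throw new IllegalArgumentException("unknown znode: " + me);
        }
        return idx == 0 ? Optional.empty() : Optional.of(sorted.get(idx - 1));
    }

    public static void main(String[] args) {
        List<String> children = List.of("n_0000000007", "n_0000000003", "n_0000000005");
        System.out.println(leader(children));                             // prints n_0000000003
        System.out.println(predecessorToWatch(children, "n_0000000007")); // prints Optional[n_0000000005]
    }
}
```

In a real client the child list would come from the election znode and the watch would be set on the returned predecessor; that wiring is left out here.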
Distribution of tasks can be done with a simple distributed queue and does not strictly need a Leader. The producer enqueues tasks and consumers keep a watch on the tasks node - a triggered watch will lead the consumer to take a task and delete the associated znode. There are certain edge conditions to consider with requeuing tasks from failed consumers.
I would recommend the section "Example: Master-Worker Application" of the book "ZooKeeper: Distributed Process Coordination".
The example demonstrates how to distribute tasks to workers using znodes and common ZooKeeper commands.
Consider using an actor singleton service pattern. For example, in Scala there is Akka which solves this class of problem with less code.