I'm working on a project that's implemented with spring-cloud-stream on top of Apache Kafka. We are currently tuning the Kafka configs to make the application more performant.
One thing I noticed is the num.stream.threads config for the Kafka Streams application. From reading the documentation, my understanding of this config is that once this value is set, the Streams application will have that number of threads to process the Kafka streams.
However, when reading the application's logs I saw that these stream threads are also assigned to the input topic partitions.
Here are my questions:
If stream threads are responsible for consuming from the input topics, does that mean we cannot usefully have more stream threads than the total number of input topic partitions?
If my assumption for the first question is correct, then what about multiple input topics for one streams application? Let's say we have a streams application that consumes from 2 input topics and produces to 1 output topic, and both input topics were created with 12 partitions. If I set num.stream.threads to 12, will each stream thread's consumer be consuming from 2 partitions, one from each topic?
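For reference, here's roughly how this config is set when using the plain Kafka Streams API (not spring-cloud-stream); the application id and bootstrap servers below are placeholders:

```java
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class StreamsThreadConfig {
    public static Properties buildConfig() {
        Properties props = new Properties();
        // application.id also serves as the consumer group id for the app
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // One JVM with 12 processing threads; stream tasks (derived from the
        // input topic partitions) are distributed across these threads.
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 12);
        return props;
    }
}
```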
Kafka version: 1.0.0
Let's say the streams application uses the low-level Processor API, maintains state, and reads from a topic with 10 partitions. Please clarify whether the internal topic is expected to be created with the same number of partitions, or whether it follows the broker default. If it's the latter, and we need to increase the partitions of the internal topic, is there any option for that?
Kafka Streams will create the topic for you. And yes, it will create it with the same number of partitions as your input topic. During startup, Kafka Streams also checks if the topic has the expected number of partitions and fails if not.
The internal topic is basically a regular topic like any other, and you can change its number of partitions via the command line tools, just as for any other topic. However, this should never be required. Also note that dropping/adding partitions will mess up your state.
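To make the scenario concrete, here is a minimal sketch of a Processor API topology with a state store; the topic and store names are hypothetical, and it assumes the default key/value serdes are set to String in the streams config. The changelog topic backing the "counts" store is the kind of internal topic discussed above:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class StatefulTopology {

    // Counts occurrences of each key in a local state store.
    static class CountingProcessor extends AbstractProcessor<String, String> {
        private KeyValueStore<String, Long> store;

        @Override
        @SuppressWarnings("unchecked")
        public void init(ProcessorContext context) {
            super.init(context);
            store = (KeyValueStore<String, Long>) context.getStateStore("counts");
        }

        @Override
        public void process(String key, String value) {
            Long count = store.get(key);
            store.put(key, count == null ? 1L : count + 1L);
        }
    }

    public static Topology build() {
        Topology topology = new Topology();
        topology.addSource("source", "input-topic"); // placeholder: topic with 10 partitions
        topology.addProcessor("counter", CountingProcessor::new, "source");
        // Kafka Streams creates the backing changelog topic
        // "<application.id>-counts-changelog" with the same partition
        // count as the input topic.
        topology.addStateStore(
            Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("counts"),
                Serdes.String(), Serdes.Long()),
            "counter");
        return topology;
    }
}
```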
I am studying Kafka Streams: KStream, KTable, GlobalKTable, etc. Now I am confused about them.
What exactly is GlobalKTable?
But overall, if I have a topic with N partitions and one Kafka Streams application, after I send some data to the topic, how many streams (partitions?) will I have?
I made some attempts and I noticed that the mapping is 1:1. But what if I make the topic replicated over different brokers?
Thank you all
I'll try to answer your questions as you have them listed here.
A GlobalKTable has all partitions available in each instance of your Kafka Streams application. But a KTable is partitioned over all of the instances of your application. In other words, all instances of your Kafka Streams application have access to all records in the GlobalKTable; hence it is typically used for more static data and for looking up records in joins.
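As a rough sketch of the difference (topic names below are hypothetical), a stream joined against a GlobalKTable needs no repartitioning, because every instance holds the full table:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class GlobalVsLocalTables {
    public static void build(StreamsBuilder builder) {
        // Every instance materializes ALL partitions of this topic locally,
        // which makes it suitable as a lookup table.
        GlobalKTable<String, String> customers = builder.globalTable("customers");

        // By contrast, each instance materializes only the partitions
        // assigned to it (shown for comparison; not used in the join below).
        KTable<String, String> accounts = builder.table("accounts");

        // Orders keyed by order id; the value holds the customer id
        // (a simplification for this sketch).
        KStream<String, String> orders = builder.stream("orders");

        // Join against the global table: the mapper extracts the table key
        // from the stream record, and no repartitioning is required.
        orders.join(customers,
                    (orderId, customerId) -> customerId,
                    (customerId, customer) -> customer + " ordered (" + customerId + ")")
              .to("enriched-orders");
    }
}
```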
As for a topic with N partitions: if you have one Kafka Streams application, it will consume and work with all records from the input topic. If you were to spin up another instance of your streams application, then each instance would process half of the partitions, giving you higher throughput due to the parallelization of the work.
For example, if you have input topic A with four partitions and one Kafka Streams application, then the single application processes all records. But if you were to launch two instances of the same Kafka Streams application, then each instance will process records from 2 partitions; the workload is split across all running instances with the same application.id.
Topics can be replicated across different brokers in Kafka, and a replication factor of 3 is typical in production. A replication factor of 3 means the records for a given partition are stored on the lead broker for that partition and two other follower brokers (assuming a three-node broker cluster).
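For instance, creating a topic with four partitions and a replication factor of 3 via the Java AdminClient might look like this (the broker address and topic name are placeholders):

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // 4 partitions, replication factor 3: each partition has one
            // leader and two follower replicas on other brokers.
            NewTopic topic = new NewTopic("input-topic-A", 4, (short) 3); // placeholder name
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```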
Hope this clears things up some.
-Bill
I have a Kafka Streams application that consumes from topic 'A' with 10 partitions with around 10k messages per second. I am confused about what will be better for my application.
To run multiple Kafka Streams application instances with the same consumer group,
OR
To run a single Kafka Streams application with a higher num.stream.threads.
As mentioned in the Confluent blog:
The maximum parallelism at which your application may run is bounded by the maximum number of stream tasks, which itself is determined by the maximum number of partitions of the input topic(s) the application is reading from. For example, if your input topic has 5 partitions, then you can run up to 5 application instances.
So in terms of message processing, there is no difference between running 10 application instances and running a single streams application with 10 threads. The exception is that 10 application instances can run on different JVMs spread across different machines, which can help improve throughput.
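As a sketch, the two deployments differ only in configuration; because both use the same application.id, all running instances form the same consumer group (names below are placeholders):

```java
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class DeploymentOptions {
    // Option A: one JVM processing all 10 partitions with 10 threads.
    public static Properties singleInstance() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-a-processor"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 10);
        return props;
    }

    // Option B: ten JVMs, each with the default single thread. Because they
    // share the same application.id, each instance ends up with one of the
    // 10 stream tasks.
    public static Properties oneOfTenInstances() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-a-processor"); // same id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1); // the default
        return props;
    }
}
```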
I have 2 instances of my Kafka Streams application consuming from 2 partitions of a single topic.
Will a single partition's data be only in one application instance, or in both? And say one application instance is down, will I have issues? How will interactive queries solve this?
Do I need to use a GlobalKTable?
Each Kafka Streams application instance will be mapped to one or more partitions, based on how many partitions the input topics have.
If you run 2 instances for an input topic with 2 partitions, each instance will consume from one partition. If one instance goes down, Kafka Streams will rebalance the workload onto the remaining instance, and it will consume from both partitions.
You can refer to the architecture in detail here: https://docs.confluent.io/current/streams/architecture.html
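Regarding interactive queries, which the answer above doesn't cover: here is a minimal sketch of querying a local state store, using the pre-2.5 store() signature to match the era of this thread; the store name "counts" is hypothetical. If a key lives on the other instance, metadataForKey tells you which host to forward the query to:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;
import org.apache.kafka.streams.state.StreamsMetadata;

public class InteractiveQueryExample {
    public static Long lookupCount(KafkaStreams streams, String key) {
        // Query the partitions of the "counts" store hosted by THIS instance.
        ReadOnlyKeyValueStore<String, Long> store =
            streams.store("counts", QueryableStoreTypes.<String, Long>keyValueStore());
        Long local = store.get(key);
        if (local != null) {
            return local;
        }
        // The key may be hosted by the other instance; find out which one
        // and forward the query there (e.g. over REST, not shown here).
        StreamsMetadata owner =
            streams.metadataForKey("counts", key, Serdes.String().serializer());
        System.out.println("Key is hosted at " + owner.host() + ":" + owner.port());
        return null;
    }
}
```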