I have a Kafka Streams application that consumes from topic 'A', which has 10 partitions and receives around 10k messages per second. I am confused about which of the following will be better for my application:
Running multiple Kafka Streams application instances with the same consumer group,
OR
Running a single Kafka Streams application with a higher num.stream.threads.
As mentioned in the Confluent blog:
The maximum parallelism at which your application may run is bounded by the maximum number of stream tasks, which itself is determined by the maximum number of partitions of the input topic(s) the application is reading from. For example, if your input topic has 5 partitions, then you can run up to 5 application instances.
So, in terms of message processing, there is no difference between running 10 application instances and running a single Streams application with 10 threads. The exception is that with 10 instances you can run them on different JVMs spread across different machines, which can help with some throughput improvement.
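For illustration, here is a minimal sketch of the single-instance variant (the application id and bootstrap server below are made up; topic 'A' is from the question). Deploying this same jar with the same application.id on several machines instead would simply split the same 10 tasks across those JVMs:

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class SingleInstanceTenThreads {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // doubles as the consumer group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 10);              // one thread per task/partition of topic A
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("A").foreach((key, value) -> {
            // message processing logic goes here
        });

        new KafkaStreams(builder.build(), props).start();
    }
}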
Also see this
I understand that the Kafka consumer group concept enables parallel processing of partitions by the consumers within the same group: if I spin up multiple consumer objects, each belonging to the same consumer group id, the load is balanced across those consumer instances according to whichever of the natively available partition assignment strategy configs is in use.
We were trying out the RoundRobin assignment strategy config. We created an application that subscribes to 2 topics (say topic-1 and topic-2, each having 10 partitions) and created 10 consumer objects per topic, the aspiration being that we would have 20 consumer objects, one processing each of the 20 partitions across the 2 topics. So far so good when we run one instance of this application, as each consumer is attached to one partition of a topic.
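For reference, a rough sketch of how one of those consumer objects might be configured (the group id and bootstrap server are illustrative; the topics and assignment strategy follow the description above):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RoundRobinConsumerFactory {
    // Builds one of the 20 consumer objects; called with "topic-1" for 10 of them
    // and "topic-2" for the other 10, as in the setup described above.
    static KafkaConsumer<String, String> buildConsumer(String topic) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "shared-consumer-group");   // same group id in every app instance
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                "org.apache.kafka.clients.consumer.RoundRobinAssignor");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList(topic));
        return consumer;
    }
}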
When we spin up another instance of this application, it does the same thing and attaches itself to the same consumer group id, so we now have 40 consumer objects overall. They get load balanced in such a way that 10 consumers from the earlier instance (1) release their partition assignments to the new instance (2), and the result looks like an equal distribution, with each instance sharing half of the load (10 consumers in each instance processing messages from 10 partitions, and 10 consumers in each instance staying idle).
Now the fun starts when we spin up a 3rd instance of the application. We expected the load balance to be something like 7, 7 and 6 partitions per application instance; however, it is turning out to be 7, 8 and 5, and sometimes even seemingly random allocations, which I don't consider a fair, equal distribution.
We understand that we cannot have more consumers than the partitions available; we were looking for finer load balancing across the application instances sharing the same consumer group id, so as not to overburden one particular instance.
Are we missing some config here, or some fundamental understanding? Please guide, many thanks!
I'm working on a project that's implemented with spring-cloud-stream on top of Apache Kafka. We are currently tuning the Kafka configs to make the application more performant.
One thing I noticed is the num.stream.threads config for the Kafka Streams application. From the documentation, my understanding of this config is that once this value is set, the Streams application will have that number of threads to process the Kafka streams.
However, when reading the application's log I saw that these stream threads are also assigned to the input topic partitions.
Here are my questions:
1. If stream threads are responsible for consuming from the input topics, does that mean we cannot have more stream threads than the total number of partitions of the input topic?
2. If my assumption in the first question is correct, then what about multiple input topics for one Streams application? Say we have a Streams application that consumes from 2 input topics and produces to 1 output topic, and both input topics were created with 12 partitions. If I set num.stream.threads to 12, then each stream thread's consumer will be consuming from 2 partitions, one from each topic, right?
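A minimal sketch of the setup from question 2 (topic and application names are made up). Assuming both input topics are read by the same sub-topology, Kafka Streams creates 12 tasks, each owning partition N of both topics, so with 12 threads every thread runs one task and reads two partitions, one per topic:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class TwoTopicsTwelveThreads {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "two-topic-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 12);   // matches the 12 partitions per topic
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // One source node reading both 12-partition topics => 12 tasks,
        // each task paired with partition N of in-topic-1 and in-topic-2.
        builder.stream(Arrays.asList("in-topic-1", "in-topic-2"))
               .to("out-topic");

        new KafkaStreams(builder.build(), props).start();
    }
}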
I am using the Kafka Streams Processor API to construct a Kafka Streams application that retrieves messages from a Kafka topic. I have two consumer applications with the same Kafka Streams configuration; the only difference is the message size. The 1st one has messages of 2000 characters (3 KB), while the 2nd one has messages of 34000 characters (60 KB).
In my second consumer application I am getting a lot of lag, which increases gradually with the traffic, while my first application is able to process the messages over the same period without any lag.
My Streams configuration parameters are as below:
application.id=Application1
default.key.serde=org.apache.kafka.common.serialization.Serdes$StringSerde
default.value.serde=org.apache.kafka.common.serialization.Serdes$StringSerde
num.stream.threads=1
commit.interval.ms=10
topology.optimization=all
Thanks
In order to consume messages faster, you need to increase the number of partitions (if that hasn't been done already, depending on the current value) and then do one of the following two things:
1) increase the value for the config num.stream.threads within your application
or
2) start several applications with the same consumer group (the same application.id).
As for me, increasing num.stream.threads is preferable (up to the number of CPUs of the machine your app runs on). Try increasing this value gradually, e.g. go from 4 to 6 to 8, and monitor the consumer lag of your application.
By increasing num.stream.threads your app will be able to consume messages in parallel, assuming you have enough partitions.
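A small sketch of this, assuming Kafka Streams 2.8 or newer: instead of restarting with a new num.stream.threads value each time, you can also add threads at runtime via addStreamThread() while watching the lag (the application id and server address below are illustrative):

import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;

public class ScaleStreamThreads {
    static KafkaStreams start(Topology topology) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "Application1");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);   // modest starting point
        KafkaStreams streams = new KafkaStreams(topology, props);
        streams.start();
        return streams;
    }

    // Call while consumer lag keeps growing and the machine still has CPU headroom.
    static void addOneThread(KafkaStreams streams) {
        Optional<String> added = streams.addStreamThread();
        added.ifPresent(name -> System.out.println("Added stream thread: " + name));
    }
}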
I have 2 instances of my Kafka Streams application consuming from 2 partitions of a single topic.
Will a single partition's data go to only one application instance, or to both? And say one application instance goes down: will I have issues? How will interactive queries solve this?
Do I need to use a GlobalKTable?
Each Kafka Streams application instance will be mapped to one or more partitions, based on how many partitions the input topics have.
If you run 2 instances for an input topic with 2 partitions, each instance will consume from one partition. If one instance goes down, Kafka Streams will rebalance the workload onto the remaining instance, and it will consume from both partitions.
You can refer to the architecture in detail here: https://docs.confluent.io/current/streams/architecture.html
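On the interactive-queries part of the question, here is a hedged sketch (the store name, key type and Kafka 2.5+ APIs are assumptions). Because each instance holds only the state for its own partitions, a query first locates the instance that owns the key and then reads the store there; this also requires application.server to be set on each instance:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyQueryMetadata;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.HostInfo;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class InteractiveQuerySketch {
    static Long lookup(KafkaStreams streams, String key) {
        // Find out which of the 2 instances currently hosts this key's partition.
        KeyQueryMetadata metadata =
                streams.queryMetadataForKey("counts-store", key, Serdes.String().serializer());
        HostInfo owner = metadata.activeHost();
        // If 'owner' is not this instance, forward the request to owner.host():owner.port()
        // (e.g. over REST); otherwise read the local store directly:
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType("counts-store",
                        QueryableStoreTypes.<String, Long>keyValueStore()));
        return store.get(key);
    }
}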
I have a consumer that is supposed to read messages from a topic. This consumer actually reads the messages and writes them to a time series database. We have multiple instances of the time series database running as a cluster on multiple physical machines.
Our plan is to deploy the consumer on all the machines where the time series service is running. So if I have 5 nodes on which the time series service is running, I will install one consumer instance per node. All those consumer instances belong to the same consumer group. In pictures, the setup looks like below:
As you can see, producers P1 and P2 write into 2 partitions, namely partition 1 and partition 2, of the Kafka topic. I then have 4 instances of the time series service, with one consumer running per instance. How should I consume properly so that I do not end up with duplicate messages in my time series database?
Edit: After reading through the Kafka documentation, I came across these two statements:
If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.
If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.
So in my case above, is it behaving like a queue? Is my understanding correct?
If all consumers belong to one group (have the same group.id), then the Kafka topic will behave for you as a queue.
Important: there is no reason to have more consumers than partitions, as consumers (out-of-the-box Kafka consumers) are scaled by partitions.
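A minimal sketch of one such consumer instance (the group id and topic name are made up; it assumes each time-series node runs exactly one copy of it). Because every copy shares the same group.id, each of the 2 partitions is owned by exactly one consumer at a time and the remaining instances stay idle, so no message is written twice:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TimeSeriesSinkConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "timeseries-writers");   // identical on every node
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");      // commit only after a successful write

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("metrics-topic"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // write record.value() into the local time series database
                }
                consumer.commitSync();
            }
        }
    }
}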