Kafka Consumer distribution not working as expected - apache-kafka

I have Three topics each having three partitions on a cluster of kafka.
now, there are total 9 partitions. and when i create 9 consumers... the 6 are being idle. only three consumers are being used.
the expectation is: each consumer should pickup one partitions and hence, 9 consumer should pick up documents from 9 partitions
but what happens is:
one consumer picks up messages from three paritions one of different topic.
e.g. i have three topics Topic_A,Topic_B and Topic_C and three partitions each. hence parititions are as below:
Topic_A_0, Topic_A_1, Topic_A_2, Topic_B_0, Topic_B_1, Topic_B_2,
Topic_C_0, Topic_C_1, Topic_C_2
When i create 9 consumers,
the distribution works as below:
Consumer1: Topic_A_0,Topic_B_0,Topic_C_0
Consumer2: Topic_A_1,Topic_B_1,Topic_C_1
Consumer3: Topic_A_2,Topic_B_2,Topic_C_2
Consumer4,Consumer5,Consumer6,Consumer7,Consumer8,Consumer9 are idle
It should be
Consumer1: Topic_A_0
Consumer2: Topic_A_1
Consumer3: Topic_A_2
Consumer4: Topic_B_0
Consumer5: Topic_B_1
Consumer6: Topic_B_2
Consumer7: Topic_C_0
Consumer8: Topic_C_1
Consumer9: Topic_C_2
Is there any configuration i need to let all 9 consumer pick up messages from 9 unique parititons?

Make sure your all your consumers are subscribing to same set of topics under the same consumer group id. For the list of topics, you can pass a predefined list or a regular expression for consumers to subscribe from. The consumer-id can be set using group.id property in consumer.

The default partition assignment strategy doesn't work across topics so this is the expected behaviour. A similar question here : Kafka Consumers are balanced across topics

Related

Will consumers in a group subscribe to one topic each, if there is a single partition per topic?

I'm using Debezium to log changes in my database, and Debezium generates change events in a topic for each table that exists in my database.These change records are consumed to populate another database.
If I restrict each topic to only have 1 partition, and let's say I have 4 consumers running, when the consumers subscribe to topics, will the 4 consumers divide the topics among themselves? (they would distribute partitions among themselves, but here 1 topic = 1 partition)
Would the above setup mean that for each table, the generated events on the topic will always be executed in order because there's at most 1 consumer acting on the topic at the same time?
I'm currently trying to get Kafka to restrict 1 partition per topic, and have 2 consumers. But the 2 consumers seem to pick up different topics every now and then and not be 'sticky' to the topics.
Yes, if there are 4 topics, 1 partition in each topic and 4 consumers, the consumers will evenly distribute the partitions among themselves which will result in 1 topic per consumer.
However, to get a "sticky assignment" you would need to give the consumer groups static group IDs. Otherwise, when there are failures and a rebalance is triggered, the consumer can be assigned a different partition (in this case a different topic).

consume kafka topics with different partition numbers

Hi i have a kafka consumer (using spring kafka dependency) that listens to multiple topics. Lets say i have 3 topics which are topicA, topicB and topicC. In my application i consume all three topics in one consumer like below.
#KafkaListener(topics = "topicA,topicB,topicC", groupId = "myGroup", concurrency="3")
My topics have partitions and those number of partitions are deferent from each. Lets say my topicA has 3 partitions. topicB have 6 partitions and topicC has 9 partitions. How should i determine a number for "concurrency" option in #KafkaListener. (I'm confused since topicB and topicC contain 6 and 9 partitions respectively. So should i change the concurrency to 6 or 9 ? or should i change it to 18 which is total number of partitions from 3 topics)
I know that on the consumer side, Kafka always gives a single partition’s data to one consumer thread and the degree of parallelism in the consumer (within a consumer group) is bounded by the number of partitions being consumed.
My main goal is to consume parallelly by using concurrency option in #kafkalistener
If you set the concurrency to 18, with the default partition assignor, if the concurrency is greater than the number of partitions, you will have idle consumers. The partitions from different topics have no bearing on how the partitions are distributed.
You can use a custom partition assignor (in the consumer configuration) to distribute the partitions differently.
See https://kafka.apache.org/documentation/#consumerconfigs_partition.assignment.strategy
Also see the discussion about RoundRobinAssignor here https://docs.spring.io/spring-kafka/docs/current/reference/html/#using-ConcurrentMessageListenerContainer
Or, simply add 3 separate #KafkaListener annotations to the method, one for each topic, with different concurrencies.

Kafka PartitionStrategy + horizontal scaling

I'm confused to what degree partition assignment is a client side concern partition.assignment.strategy and what part is handled by Kafka.
For example, say I have one kafka topic with 100 partitions.
If I make 1 app that runs 5 threads of consumers, with a partition.assignment.strategy of RangeAssignor then I should get 5 consumers each consuming 25 partitions.
Now if I scale this app by deploying it 4 times, and using the same consumer group. Will kafka first divide 25 partitions to each of these apps on its side, and only then are these 25 partitions further subdivided by the app using the PartitionStrategy?
Which would result neatly in 4 apps with 5 consumers each, consuming 5 partitions each.
The behavior of the default Assignors is well documented in the Javadocs.
RangeAssignor is the default Assignor, see its Javadoc for example of assignment it generates: http://kafka.apache.org/21/javadoc/org/apache/kafka/clients/consumer/RangeAssignor.html
If you have 20 consumers using RangeAssignor that are consuming from a topic with 100 partitions, each consumer will be assigned 5 partitions.
Because RangeAssignor assigns partitions topic by topic, it can create really unbalanced assignments if you have topics with very few partitions. In that case, RoundRobinAssignor works better
As part of group management, the consumer will keep track of the list of consumers that belong to a particular group and will trigger a rebalance operation if any one of the following events are triggered:
Number of partitions change for any of the subscribed topics
A subscribed topic is created or deleted
An existing member of the consumer group is shutdown or fails.
A new member is added to the consumer group.
Most likely point no. 4 is your case and the strategy used will be the same(partition.assignment.strategy). Not that this is not applicable if you have explicitly specified the partition to be consumed by your consumer

Kafka Consumer from different group consuming from different partition of Topic

I have a scenario where I have deployed 4 instances of Kafka Consumer on different nodes. My topic has 4 partitions. Now, I want to configure the Consumers in such a way that they all fetch from different partitions of the topic.
I know for a fact that if the Consumers are from the same consumer group, they ensure that the partitions are split equally. But in my case, they are not in the same group.
In order to achieve what you want you need the consumers being in the same consumer group. Only in this case a "competing consumer" pattern is applied : each consumer receives 1 partition from the 4, so you have 4 consumers each one reading from 1 partition and receiving messages for that partitions.
When consumers are part of different consumer groups, each consumer will be assigned to all 4 partitions receiving messages from all of them in a publish/subscribe way.

How kafka partitions behave

Can you explain how kafka partitions works for this scenario
If i produce 9 (1-9) messages round robin with 1 topic & 3 partitions.
Does it means that:
Partition 1 contains: [1,4,7]
Partition 2 contains: [2,5,8]
Partition 3 contains: [3,6,9]
?
Also how many consumers can get all the data 3? why?
Can you explain?
I guess also that consumer group can solve it but not sure why
Can you explain how kafka partitions works for this scenario
Your understanding is correct.
Also how many consumers can get all the data 3? why?
Depends on how many consumers you have in your consumer group.
If you only have 1 consumer in a group, it will get all the messages from all partitions.
If you have 2 consumers in a group, each will claim a subset of the partitions, e.g. 1st consumer will get all messages from partitions 1 and 2 and the 2nd consumer will get messages from partition 3.
If you have 3 consumers in a group, each will get one partition assigned.
If you have more than 3 consumers in a group, 3 consumers will get one partition each and the remaining consumers will not get any messages, just act as redundancy in case of failover.
The distribution of messages in the partitions is correct if and only if you publish messages without keys. In Kafka it is common to publish messages as (Key, Value) pairs and if you produce messages this way then the default partitioner will ensure that all messages of the same key will get put in the same partition. It does this by using a hashing function on each of the keys that maps to one of the available partitions. In the extreme case where all your messages have the same key, then they would all go to the same partition. If your messages all had either a string key "foo" or a key called "bar" then all the messages with key "foo" may go to partition 3 and all the messages with key "bar" may go to partition 1.
In terms of your question about consumers, you can have an unlimited number of consumers. If each consumer has a unique group.id then they are considered independent and they will each get their own full set of the messages from all partitions.
However if you have consumers that share the same group.id then they are said to be in a consumer group and each will get an exclusive and roughly equal subset of the partitions. If you had 3 consumers in the same group they would get 1 partition each. If you added any more than 3 consumers in the same group then the first 3 will get 1 partition each and all the others will be standby consumers than only become active if one of the 3 active consumers leaves the group.
The distribution of the messages through the partitions is correct in the idea. The partitions are the paralelism unit of Kafka.
You can have 3 consumers which will each handle one partition, but you can also have only 1 consumer which will get the data from the 3 partitions. It depends on the throughput you can have/want for each consumer.
Concerning the consumer groups :
If all your consumers have the same consumer group, the messages will be load balanced over the consumers
If your consumers have different consumer groups, then each messages will be broadcast to all consumer processes
FYI : the messages order is only kept within a partition, that is why messages coming from different partitions could be unordered.