consume kafka topics with different partition numbers - apache-kafka

Hi i have a kafka consumer (using spring kafka dependency) that listens to multiple topics. Lets say i have 3 topics which are topicA, topicB and topicC. In my application i consume all three topics in one consumer like below.
#KafkaListener(topics = "topicA,topicB,topicC", groupId = "myGroup", concurrency="3")
My topics have partitions and those number of partitions are deferent from each. Lets say my topicA has 3 partitions. topicB have 6 partitions and topicC has 9 partitions. How should i determine a number for "concurrency" option in #KafkaListener. (I'm confused since topicB and topicC contain 6 and 9 partitions respectively. So should i change the concurrency to 6 or 9 ? or should i change it to 18 which is total number of partitions from 3 topics)
I know that on the consumer side, Kafka always gives a single partition’s data to one consumer thread and the degree of parallelism in the consumer (within a consumer group) is bounded by the number of partitions being consumed.
My main goal is to consume parallelly by using concurrency option in #kafkalistener

If you set the concurrency to 18, with the default partition assignor, if the concurrency is greater than the number of partitions, you will have idle consumers. The partitions from different topics have no bearing on how the partitions are distributed.
You can use a custom partition assignor (in the consumer configuration) to distribute the partitions differently.
See https://kafka.apache.org/documentation/#consumerconfigs_partition.assignment.strategy
Also see the discussion about RoundRobinAssignor here https://docs.spring.io/spring-kafka/docs/current/reference/html/#using-ConcurrentMessageListenerContainer
Or, simply add 3 separate #KafkaListener annotations to the method, one for each topic, with different concurrencies.

Related

Does Kafka rebalancing algorithm balance across topics?

Does Kafka rebalancing algorithm work across topics?
Suppose I have 5 topics, each with 10 partitions, and 20 instances of consumer application in the same consumer group subscribing each to these 5 topics.
Will Kafka try to balance 50 partitions evenly across 20 instances?
Or will it balance only within a topic, and thus 10 first instances may (or likely to) receive all 50 partitions, while 10 other instances may stay idle?
I know that in older days Kafka did not balance across topics, but what about current versions?
The assignment of consumer instances to partitions depends on the Consumer Configuration partition.assignment.strategy. Its default value is class org.apache.kafka.clients.consumer.RangeAssignor but you can also select RoundRobinAssignor, StickyAssignor or you can even build your own strategy by extending the abstract class AbstractPartitionAssignor.
I think for your case the RoundRobin assignment strategy would lead to a more balanced asignment. The difference between the strategies Range and RoundRobin are depicted in the diagram below.
In your case (having 10 partitions in each topic and 20 consumer instances) the Range strategy would lead to 10 instances being idle. However, using the RoundRobin strategy would keep all instances busy as it follows the principle: The partitions will be uniformly distributed in that the largest
difference between assignments should be one partition.
Please note that consumer assignment to topic partitions is different to a Rebalance. A Rebalance is initiated when
A consumer leave the Consumer Group (eg.g by failing to send a heartbeat or by explicitly requesting to leave)
A new consumer joins the ConsumerGroup
A consumer changes its topic subscriptions
a change in the subscribed topic such as increase/decrease of partitions.
During a rebalance the consumption is paused for the entire consumerGroup and the assignment is happening again based on your selected strategy.
You can choose RoundRobin as partition assignor instead of default Range assignment to get all instances consuming.
Range Assignor:
Range assignor works on each topic, and it will divide partitions into several ranges based on the total number of consumer. Then all consumers will be sorted by lexicographic order and each consumer will take a range of partitions.
For you case, you have 10 partitions for each topics and total 20 consumers. Then coordinator will assign 1 partition for each of first 10 consumers. In this case, you will get 10 idle consumers.
And the same thing happens for each topic, so you will get first 10 consumers has been assigned 5 partitions(1 for each topic) and other 10 will be idle.
Round-Robin Assignor:
Round-Robin assignor will list all partitions for all topics subscribed by consumer group. And each consumer will take partitions round-robin.
For you case, coordinator will list all partitions like:
t1p1, t1p2, t1p3 ... t5p9, t5p10
And all 20 consumers will take partitions in this order, so finally you will get:
Consumer1: t1p1, t3p1, t5p1
Consumer2: t1p2, t3p2, t5p2
.
.
.
Consumer 10: t2p10, t4p10
It could be more balanced than Range Assignor.

Number of consumers in kafka comsumer-group

If a producer has 3 topics and 4 partitions each topic, should the consumer group contains 4 or 12 consumers?
I want to achieve ideal consumption.
There should be one consumer each partition for ideal consumption. So, for your case, 12 consumers should be ideal.
If you have N partitions, then you can have up to N consumers within the same consumer group each of which reading from a single partition. When you have less consumers than partitions, then some of the consumers will read from more than one partition. Also, if you have more consumers than partitions then some of the consumers will be inactive and will receive no messages at all.
You cannot have multiple consumers -within the same consumer group- consuming data from a single partition. Therefore, in order to consume data from the same partition using N consumers, you'd need to create N distinct consumer groups too.
Note that partitioning enhances the parallelism within a Kafka cluster. If you create thousands of consumers to consume data from only one partition, I suspect that you will lose some level of parallelism.
If you have 3 topics with 4 partition each.
For best optimisation you should have 4 consumers per consumer group.
Reason : If you have more than 4 consumers ,your extra consumers would be left ideal, because 4 consumers will be assigned 4 partitions with 1 consumer assigned 1 partition. So in short more than 4 consumers is not required per consumer group.
If you have less consumers say 2 consumers for 4 topics , each consumer will consume messages from 2 partitions each which will overload it.
There is no limit in number of consumer groups which subscribe to a topic.

How to dynamically add consumers in consumer group kafka

How should I know when i have to scale the consumer in consumer group . What are the triggers for the consumers to scale when there is a fast producer ?
One straight forward approach would be to get the consumer lag(this can be computed as the difference between committed offset and beginning_offset) and if the lag computed in the last n times is increasing you can scale up and vice versa. You might've to consider some edge cases for example in case consumers have gone down and lag would be increasing and the auto-scaling function might spawn more threads/machines).
In Kafka while creating a topic, need to provide number of partitions and replication factor.
Let say there is one topic called TEST with 10 partitions, for parallel consumption of data need to create consumer group with 10 consumers, where each consumer will be consuming the data from the respective partition.
Here is the catch, if the topic is having 10 partitions and consumer group is having 12 consumers then two consumer remain idle until one of the consumer dies.
if the topic is having 10 partitions and consumer group has 8 consumers then 6 consumers will consume the data from 6 partitions (one consumer->one partition) whereas remaining two consumers will be responsible for consuming the data from two partitions (one consumer-> 2 partitions). its means last two-consumers consumes the data from four partitions.
Hence first thing is to decide number of partition for your kafka topic, more partitions means more parallelism.
whenever any new consumer is added or removed to the consumer group rebalacing is taken care by kafka.
Actually auto-scale is not a good idea because in Kafka message order is guaranteed in partition.
From Kafka docs:
Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent
by the same producer as a record M2, and M1 is sent first, then M1
will have a lower offset than M2 and appear earlier in the log.
A consumer instance sees records in the order they are stored in the log.
If you add more partitions and more consumers with respect to number of partitions, then you cannot satisfy ordering guarantee of messages.
Suppose that you have 10 partitions and your number of key is 102, then this message will be sent to partition: 102 % 10 = 2
But if you increase number of partitions to 15 for instance, then messages with same key (102) will be sent to a different partition: 102 % 15 = 12
As you see with this approach it is impossible to guarantee ordering of the messages with same keys.
Note: By the way Kafka uses murmur2(record.key())) % num partitions algorithm by default. The calculations above is just an example.

Kafka PartitionStrategy + horizontal scaling

I'm confused to what degree partition assignment is a client side concern partition.assignment.strategy and what part is handled by Kafka.
For example, say I have one kafka topic with 100 partitions.
If I make 1 app that runs 5 threads of consumers, with a partition.assignment.strategy of RangeAssignor then I should get 5 consumers each consuming 25 partitions.
Now if I scale this app by deploying it 4 times, and using the same consumer group. Will kafka first divide 25 partitions to each of these apps on its side, and only then are these 25 partitions further subdivided by the app using the PartitionStrategy?
Which would result neatly in 4 apps with 5 consumers each, consuming 5 partitions each.
The behavior of the default Assignors is well documented in the Javadocs.
RangeAssignor is the default Assignor, see its Javadoc for example of assignment it generates: http://kafka.apache.org/21/javadoc/org/apache/kafka/clients/consumer/RangeAssignor.html
If you have 20 consumers using RangeAssignor that are consuming from a topic with 100 partitions, each consumer will be assigned 5 partitions.
Because RangeAssignor assigns partitions topic by topic, it can create really unbalanced assignments if you have topics with very few partitions. In that case, RoundRobinAssignor works better
As part of group management, the consumer will keep track of the list of consumers that belong to a particular group and will trigger a rebalance operation if any one of the following events are triggered:
Number of partitions change for any of the subscribed topics
A subscribed topic is created or deleted
An existing member of the consumer group is shutdown or fails.
A new member is added to the consumer group.
Most likely point no. 4 is your case and the strategy used will be the same(partition.assignment.strategy). Not that this is not applicable if you have explicitly specified the partition to be consumed by your consumer

Kafka Consumer distribution not working as expected

I have Three topics each having three partitions on a cluster of kafka.
now, there are total 9 partitions. and when i create 9 consumers... the 6 are being idle. only three consumers are being used.
the expectation is: each consumer should pickup one partitions and hence, 9 consumer should pick up documents from 9 partitions
but what happens is:
one consumer picks up messages from three paritions one of different topic.
e.g. i have three topics Topic_A,Topic_B and Topic_C and three partitions each. hence parititions are as below:
Topic_A_0, Topic_A_1, Topic_A_2, Topic_B_0, Topic_B_1, Topic_B_2,
Topic_C_0, Topic_C_1, Topic_C_2
When i create 9 consumers,
the distribution works as below:
Consumer1: Topic_A_0,Topic_B_0,Topic_C_0
Consumer2: Topic_A_1,Topic_B_1,Topic_C_1
Consumer3: Topic_A_2,Topic_B_2,Topic_C_2
Consumer4,Consumer5,Consumer6,Consumer7,Consumer8,Consumer9 are idle
It should be
Consumer1: Topic_A_0
Consumer2: Topic_A_1
Consumer3: Topic_A_2
Consumer4: Topic_B_0
Consumer5: Topic_B_1
Consumer6: Topic_B_2
Consumer7: Topic_C_0
Consumer8: Topic_C_1
Consumer9: Topic_C_2
Is there any configuration i need to let all 9 consumer pick up messages from 9 unique parititons?
Make sure your all your consumers are subscribing to same set of topics under the same consumer group id. For the list of topics, you can pass a predefined list or a regular expression for consumers to subscribe from. The consumer-id can be set using group.id property in consumer.
The default partition assignment strategy doesn't work across topics so this is the expected behaviour. A similar question here : Kafka Consumers are balanced across topics