Does the Kafka rebalancing algorithm work across topics?
Suppose I have 5 topics, each with 10 partitions, and 20 instances of a consumer application in the same consumer group, each subscribing to all 5 topics.
Will Kafka try to balance 50 partitions evenly across 20 instances?
Or will it balance only within each topic, so that the first 10 instances may (or are likely to) receive all 50 partitions while the other 10 instances stay idle?
I know that in older versions Kafka did not balance across topics, but what about current versions?
The assignment of partitions to consumer instances depends on the consumer configuration partition.assignment.strategy. Its default value is org.apache.kafka.clients.consumer.RangeAssignor, but you can also select RoundRobinAssignor or StickyAssignor, or even build your own strategy by extending the abstract class AbstractPartitionAssignor.
I think for your case the RoundRobin assignment strategy would lead to a more balanced assignment. The difference between the Range and RoundRobin strategies is explained below.
In your case (10 partitions in each topic and 20 consumer instances) the Range strategy would leave 10 instances idle. Using the RoundRobin strategy, however, would keep all instances busy, as it follows the principle that partitions are distributed as uniformly as possible: the largest difference between any two assignments should be at most one partition.
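For reference, here is a minimal sketch of how the strategy can be switched on the consumer; the bootstrap server, group id and topic names are placeholders:

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.RoundRobinAssignor;
import org.apache.kafka.common.serialization.StringDeserializer;

public class RoundRobinConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Switch from the default RangeAssignor to RoundRobinAssignor
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, RoundRobinAssignor.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Subscribe to all 5 topics (names are placeholders)
            consumer.subscribe(Arrays.asList("topic1", "topic2", "topic3", "topic4", "topic5"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s-%d: %s%n", record.topic(), record.partition(), record.value());
                }
            }
        }
    }
}
```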
Please note that the assignment of consumers to topic partitions is not the same thing as a rebalance. A rebalance is initiated when:
A consumer leaves the consumer group (e.g. by failing to send a heartbeat or by explicitly requesting to leave)
A new consumer joins the consumer group
A consumer changes its topic subscriptions
A subscribed topic changes, e.g. its number of partitions increases or decreases
During a rebalance, consumption is paused for the entire consumer group and the assignment is performed again based on your selected strategy.
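If you want to observe which partitions your instance loses and gains during a rebalance, you can pass a ConsumerRebalanceListener when subscribing. A minimal sketch (the topic name is a placeholder and the consumer is assumed to be configured elsewhere):

```java
import java.util.Collection;
import java.util.Collections;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceLogger {

    // Assumes 'consumer' has already been configured and created elsewhere.
    public static void subscribeWithListener(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("topic1"), new ConsumerRebalanceListener() {
            @Override
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // Called before partitions are taken away from this instance;
                // a good place to commit offsets for the revoked partitions.
                System.out.println("Revoked: " + partitions);
            }

            @Override
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                // Called once the rebalance completes, with this instance's new assignment.
                System.out.println("Assigned: " + partitions);
            }
        });
    }
}
```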
You can choose RoundRobin as the partition assignor instead of the default Range assignor to get all instances consuming.
Range Assignor:
The Range assignor works topic by topic: it divides each topic's partitions into ranges based on the total number of consumers, sorts the consumers in lexicographic order, and gives each consumer one range of partitions.
In your case you have 10 partitions per topic and 20 consumers in total, so for each topic the coordinator will assign 1 partition to each of the first 10 consumers, leaving 10 consumers idle.
Since the same thing happens for every topic, the first 10 consumers will each be assigned 5 partitions (1 from each topic) and the other 10 will be idle.
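A quick way to convince yourself of this is to reproduce the range calculation by hand (each consumer gets numPartitions / numConsumers partitions, and the first numPartitions % numConsumers consumers get one extra). The following is a plain simulation of that rule, not Kafka's actual assignor code:

```java
// Simulates the Range assignor for a single topic:
// 10 partitions, 20 consumers -> consumers 11..20 get nothing.
public class RangeAssignmentDemo {
    public static void main(String[] args) {
        int numPartitions = 10;
        int numConsumers = 20;
        int perConsumer = numPartitions / numConsumers;   // 0
        int extra = numPartitions % numConsumers;         // the first 10 consumers get 1 extra

        int nextPartition = 0;
        for (int consumer = 0; consumer < numConsumers; consumer++) {
            int count = perConsumer + (consumer < extra ? 1 : 0);
            StringBuilder assigned = new StringBuilder();
            for (int i = 0; i < count; i++) {
                assigned.append("p").append(nextPartition++).append(" ");
            }
            String out = (count == 0) ? "idle" : assigned.toString().trim();
            System.out.println("consumer-" + consumer + ": " + out);
        }
    }
}
```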
Round-Robin Assignor:
The Round-Robin assignor lists all partitions of all topics subscribed to by the consumer group, and each consumer takes partitions in round-robin fashion.
For your case, the coordinator will list all partitions like:
t1p1, t1p2, t1p3 ... t5p9, t5p10
All 20 consumers then take partitions in this order, so finally you will get:
Consumer1: t1p1, t3p1, t5p1
Consumer2: t1p2, t3p2, t5p2
.
.
.
Consumer20: t2p10, t4p10
This gives a more balanced assignment than the Range assignor.
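Here is a small plain-Java simulation of that distribution (5 topics with 10 partitions each, 20 consumers) which reproduces the listing above; it is an illustration of the principle, not Kafka's actual assignor code:

```java
import java.util.ArrayList;
import java.util.List;

public class RoundRobinAssignmentDemo {
    public static void main(String[] args) {
        int topics = 5, partitionsPerTopic = 10, consumers = 20;

        // Build the full partition list: t1p1, t1p2, ..., t5p10
        List<String> partitions = new ArrayList<>();
        for (int t = 1; t <= topics; t++) {
            for (int p = 1; p <= partitionsPerTopic; p++) {
                partitions.add("t" + t + "p" + p);
            }
        }

        // Hand partitions out one by one, round-robin over the consumers
        List<List<String>> assignment = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            assignment.add(new ArrayList<>());
        }
        for (int i = 0; i < partitions.size(); i++) {
            assignment.get(i % consumers).add(partitions.get(i));
        }

        // Prints Consumer1: [t1p1, t3p1, t5p1] ... Consumer20: [t2p10, t4p10]
        for (int c = 0; c < consumers; c++) {
            System.out.println("Consumer" + (c + 1) + ": " + assignment.get(c));
        }
    }
}
```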
Related
If a producer has 3 topics with 4 partitions each, should the consumer group contain 4 or 12 consumers?
I want to achieve ideal consumption.
For ideal consumption there should be one consumer per partition. So, for your case, 12 consumers would be ideal.
If you have N partitions, then you can have up to N consumers within the same consumer group, each of which reads from a single partition. When you have fewer consumers than partitions, some of the consumers will read from more than one partition. And if you have more consumers than partitions, some of the consumers will be inactive and will receive no messages at all.
You cannot have multiple consumers within the same consumer group consuming data from a single partition. Therefore, in order to consume data from the same partition with N consumers, you'd need to create N distinct consumer groups.
Note that partitioning enhances the parallelism within a Kafka cluster. If you create thousands of consumers to consume data from only one partition, I suspect that you will lose some level of parallelism.
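As a sketch of that last point: two consumers that differ only in their group.id and subscribe to the same topic will each independently receive all of the topic's messages. The broker address, group ids and topic name below are placeholders:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TwoGroupsDemo {

    // Builds a consumer that only differs by its consumer group id.
    static KafkaConsumer<String, String> consumerInGroup(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }

    public static void main(String[] args) {
        // Both consumers subscribe to the same topic but sit in different groups,
        // so each group independently consumes every partition of the topic.
        KafkaConsumer<String, String> a = consumerInGroup("group-A"); // placeholder group id
        KafkaConsumer<String, String> b = consumerInGroup("group-B"); // placeholder group id
        a.subscribe(Collections.singletonList("my-topic"));           // placeholder topic
        b.subscribe(Collections.singletonList("my-topic"));
        // ...poll each consumer in its own thread (KafkaConsumer is not thread-safe);
        // both groups will see all messages of the topic.
    }
}
```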
If you have 3 topics with 4 partitions each, then for best optimisation you should have 4 consumers per consumer group.
Reason: if you have more than 4 consumers, your extra consumers will be left idle, because each topic's 4 partitions will be assigned to 4 consumers, 1 partition per consumer. So in short, more than 4 consumers per consumer group is not required.
If you have fewer consumers, say 2 consumers for 4 partitions, each consumer will consume messages from 2 partitions, which may overload it.
There is no limit on the number of consumer groups that can subscribe to a topic.
How should I know when I have to scale the consumers in a consumer group? What are the triggers for scaling the consumers when there is a fast producer?
One straightforward approach would be to get the consumer lag (this can be computed as the difference between the latest end offset of a partition and the offset committed by the group). If the lag computed over the last n checks is increasing you can scale up, and vice versa. You might have to consider some edge cases, for example when consumers have gone down: the lag would keep increasing and the auto-scaling function might spawn more threads/machines.
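A sketch of how that lag could be computed with the admin client, assuming a reasonably recent kafka-clients version (the listOffsets admin API); the broker address and group id are placeholders:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            String groupId = "my-group"; // placeholder

            // Offsets the group has committed, per partition
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId)
                         .partitionsToOffsetAndMetadata().get();

            // Latest (end) offsets for the same partitions
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResultInfo> endOffsets =
                    admin.listOffsets(request).all().get();

            // Lag = end offset - committed offset; scale up if this keeps growing over time
            long totalLag = 0;
            for (TopicPartition tp : committed.keySet()) {
                long lag = endOffsets.get(tp).offset() - committed.get(tp).offset();
                totalLag += lag;
                System.out.println(tp + " lag=" + lag);
            }
            System.out.println("total lag=" + totalLag);
        }
    }
}
```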
In Kafka, while creating a topic, you need to provide the number of partitions and the replication factor.
Let's say there is one topic called TEST with 10 partitions. For parallel consumption of data you need to create a consumer group with 10 consumers, where each consumer will consume data from its respective partition.
Here is the catch: if the topic has 10 partitions and the consumer group has 12 consumers, then two consumers remain idle until one of the other consumers dies.
If the topic has 10 partitions and the consumer group has 8 consumers, then 6 consumers will consume data from 6 partitions (one consumer -> one partition) whereas the remaining two consumers will be responsible for two partitions each (one consumer -> 2 partitions); that means the last two consumers consume data from four partitions.
Hence the first thing is to decide the number of partitions for your Kafka topic; more partitions means more parallelism.
Whenever a new consumer is added to or removed from the consumer group, rebalancing is taken care of by Kafka.
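For completeness, creating such a topic programmatically looks roughly like this (the broker address and the replication factor of 1 are placeholder values):

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Topic TEST with 10 partitions and replication factor 1 (placeholder value)
            NewTopic topic = new NewTopic("TEST", 10, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```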
Actually, auto-scaling by adding partitions is not a good idea, because in Kafka message order is guaranteed only within a partition.
From Kafka docs:
Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
A consumer instance sees records in the order they are stored in the log.
If you add more partitions, and more consumers to match the new number of partitions, then you cannot satisfy the ordering guarantee for messages with the same key.
Suppose that you have 10 partitions and your key, taken as a number, is 102; then this message will be sent to partition 102 % 10 = 2.
But if you increase the number of partitions to 15, for instance, then messages with the same key (102) will be sent to a different partition: 102 % 15 = 12.
As you can see, with this approach it is impossible to guarantee the ordering of messages with the same key.
Note: by the way, Kafka by default uses murmur2(record.key()) % numPartitions to pick the partition. The calculation above is just an example.
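You can observe this effect with Kafka's own hash helper (Utils.murmur2 plus Utils.toPositive, which is what the default partitioner applies to the serialized key). Utils is an internal helper class and the key below is just an example value; this is purely an illustration:

```java
import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.utils.Utils;

public class KeyToPartitionDemo {
    public static void main(String[] args) {
        byte[] key = "102".getBytes(StandardCharsets.UTF_8); // example key

        // Same formula the default partitioner applies to keyed records
        int with10Partitions = Utils.toPositive(Utils.murmur2(key)) % 10;
        int with15Partitions = Utils.toPositive(Utils.murmur2(key)) % 15;

        // The same key may land on a different partition once the partition count
        // changes, so per-key ordering across the resize is no longer guaranteed.
        System.out.println("10 partitions -> partition " + with10Partitions);
        System.out.println("15 partitions -> partition " + with15Partitions);
    }
}
```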
Let's say in Kafka I have 4 partitions of a topic 'A' and I have 20 consumers of Consumer Group 'AC'. I don't need any ordering, but I want to process the messages faster by scaling my consumer instances. Please note all messages are independent and can be processed independently.
I looked at the consumer configuration partition.assignment.strategy, but I am not sure whether I can achieve dynamic assignment of consumers to partitions depending on message availability.
One partition is assigned to exactly one consumer in the group. In your case only 4 of your 20 consumers are actually working. You have to increase the number of partitions if you want more consumers to be assigned.
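Increasing the partition count of an existing topic can be done with the admin client, for example as sketched below (the broker address is a placeholder, and topic 'A' is raised from 4 to 20 partitions); keep in mind that this changes the key-to-partition mapping, as discussed above:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class IncreasePartitionsDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Raise topic 'A' from 4 to 20 partitions so all 20 consumers can get one each
            admin.createPartitions(
                    Collections.singletonMap("A", NewPartitions.increaseTo(20))
            ).all().get();
        }
    }
}
```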
I'm confused about to what degree partition assignment is a client-side concern (partition.assignment.strategy) and what part is handled by Kafka.
For example, say I have one kafka topic with 100 partitions.
If I make 1 app that runs 5 consumer threads, with a partition.assignment.strategy of RangeAssignor, then I should get 5 consumers each consuming 20 partitions.
Now if I scale this app by deploying it 4 times, using the same consumer group, will Kafka first divide 25 partitions to each of these apps on its side, and only then further subdivide those 25 partitions within the app using the assignment strategy?
That would result neatly in 4 apps with 5 consumers each, each consumer consuming 5 partitions.
The behavior of the default Assignors is well documented in the Javadocs.
RangeAssignor is the default Assignor, see its Javadoc for example of assignment it generates: http://kafka.apache.org/21/javadoc/org/apache/kafka/clients/consumer/RangeAssignor.html
If you have 20 consumers using RangeAssignor that are consuming from a topic with 100 partitions, each consumer will be assigned 5 partitions.
Because RangeAssignor assigns partitions topic by topic, it can create really unbalanced assignments if you have topics with very few partitions. In that case, RoundRobinAssignor works better.
As part of group management, the consumer will keep track of the list of consumers that belong to a particular group and will trigger a rebalance operation if any one of the following events occurs:
Number of partitions change for any of the subscribed topics
A subscribed topic is created or deleted
An existing member of the consumer group is shut down or fails.
A new member is added to the consumer group.
Most likely point no. 4 is your case, and the strategy used will be the same (partition.assignment.strategy). Note that this is not applicable if you have explicitly specified the partitions to be consumed by your consumer.
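To make that last distinction concrete: with subscribe() the group coordinator applies the configured assignment strategy and rebalances happen automatically, whereas with assign() you pick the partitions yourself and no group rebalancing applies. A minimal sketch (topic name and partition numbers are placeholders):

```java
import java.util.Arrays;
import java.util.Collections;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SubscribeVsAssign {

    // Group management: partitions are assigned by the coordinator using
    // partition.assignment.strategy, and rebalances are triggered automatically.
    static void useGroupManagement(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("topic1")); // placeholder topic
    }

    // Manual assignment: you choose the partitions yourself; the group's
    // assignment strategy and rebalancing do not apply.
    static void useManualAssignment(KafkaConsumer<String, String> consumer) {
        consumer.assign(Arrays.asList(
                new TopicPartition("topic1", 0),
                new TopicPartition("topic1", 1)));
    }
}
```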
Suppose we have 4 consumers with the same group id and one topic with 3 partitions. If a producer posts a message to partition 1, which consumer will get this message?
I'll give a bit more detailed answer here.
So the first point is that there is no reason to have more consumer threads (and each consumer has at least 1 consumer thread) than the number of partitions being consumed. If you have more consumer threads than partitions, some of them will simply end up idle and waste resources. So given the example you attached, there is no point in having 4 consumers for 3 partitions.
The second point: the partition assignment depends on the strategy chosen by the consumers in the group. The two classic partition assignment strategies are Range and RoundRobin (newer clients also offer a Sticky assignor). If you are using the Range strategy you can predict which partitions will be consumed by each consumer after a rebalance. With the RoundRobin strategy, though, you can't predict the partition assignments beforehand.
The detailed answer that explains how consumer rebalancing works and how partitions are assigned is here.
You can also view the current partition assignments for your consumer group in ZooKeeper at /consumers/[group_id]/owners/[topic]/[partition] (this applies to the old ZooKeeper-based consumers; newer clients keep group metadata in Kafka itself).