Kafka PartitionStrategy + horizontal scaling - apache-kafka

I'm confused to what degree partition assignment is a client side concern partition.assignment.strategy and what part is handled by Kafka.
For example, say I have one kafka topic with 100 partitions.
If I make 1 app that runs 5 threads of consumers, with a partition.assignment.strategy of RangeAssignor then I should get 5 consumers each consuming 25 partitions.
Now if I scale this app by deploying it 4 times, and using the same consumer group. Will kafka first divide 25 partitions to each of these apps on its side, and only then are these 25 partitions further subdivided by the app using the PartitionStrategy?
Which would result neatly in 4 apps with 5 consumers each, consuming 5 partitions each.

The behavior of the default Assignors is well documented in the Javadocs.
RangeAssignor is the default Assignor, see its Javadoc for example of assignment it generates: http://kafka.apache.org/21/javadoc/org/apache/kafka/clients/consumer/RangeAssignor.html
If you have 20 consumers using RangeAssignor that are consuming from a topic with 100 partitions, each consumer will be assigned 5 partitions.
Because RangeAssignor assigns partitions topic by topic, it can create really unbalanced assignments if you have topics with very few partitions. In that case, RoundRobinAssignor works better

As part of group management, the consumer will keep track of the list of consumers that belong to a particular group and will trigger a rebalance operation if any one of the following events are triggered:
Number of partitions change for any of the subscribed topics
A subscribed topic is created or deleted
An existing member of the consumer group is shutdown or fails.
A new member is added to the consumer group.
Most likely point no. 4 is your case and the strategy used will be the same(partition.assignment.strategy). Not that this is not applicable if you have explicitly specified the partition to be consumed by your consumer

Related

consume kafka topics with different partition numbers

Hi i have a kafka consumer (using spring kafka dependency) that listens to multiple topics. Lets say i have 3 topics which are topicA, topicB and topicC. In my application i consume all three topics in one consumer like below.
#KafkaListener(topics = "topicA,topicB,topicC", groupId = "myGroup", concurrency="3")
My topics have partitions and those number of partitions are deferent from each. Lets say my topicA has 3 partitions. topicB have 6 partitions and topicC has 9 partitions. How should i determine a number for "concurrency" option in #KafkaListener. (I'm confused since topicB and topicC contain 6 and 9 partitions respectively. So should i change the concurrency to 6 or 9 ? or should i change it to 18 which is total number of partitions from 3 topics)
I know that on the consumer side, Kafka always gives a single partition’s data to one consumer thread and the degree of parallelism in the consumer (within a consumer group) is bounded by the number of partitions being consumed.
My main goal is to consume parallelly by using concurrency option in #kafkalistener
If you set the concurrency to 18, with the default partition assignor, if the concurrency is greater than the number of partitions, you will have idle consumers. The partitions from different topics have no bearing on how the partitions are distributed.
You can use a custom partition assignor (in the consumer configuration) to distribute the partitions differently.
See https://kafka.apache.org/documentation/#consumerconfigs_partition.assignment.strategy
Also see the discussion about RoundRobinAssignor here https://docs.spring.io/spring-kafka/docs/current/reference/html/#using-ConcurrentMessageListenerContainer
Or, simply add 3 separate #KafkaListener annotations to the method, one for each topic, with different concurrencies.

Does Kafka rebalancing algorithm balance across topics?

Does Kafka rebalancing algorithm work across topics?
Suppose I have 5 topics, each with 10 partitions, and 20 instances of consumer application in the same consumer group subscribing each to these 5 topics.
Will Kafka try to balance 50 partitions evenly across 20 instances?
Or will it balance only within a topic, and thus 10 first instances may (or likely to) receive all 50 partitions, while 10 other instances may stay idle?
I know that in older days Kafka did not balance across topics, but what about current versions?
The assignment of consumer instances to partitions depends on the Consumer Configuration partition.assignment.strategy. Its default value is class org.apache.kafka.clients.consumer.RangeAssignor but you can also select RoundRobinAssignor, StickyAssignor or you can even build your own strategy by extending the abstract class AbstractPartitionAssignor.
I think for your case the RoundRobin assignment strategy would lead to a more balanced asignment. The difference between the strategies Range and RoundRobin are depicted in the diagram below.
In your case (having 10 partitions in each topic and 20 consumer instances) the Range strategy would lead to 10 instances being idle. However, using the RoundRobin strategy would keep all instances busy as it follows the principle: The partitions will be uniformly distributed in that the largest
difference between assignments should be one partition.
Please note that consumer assignment to topic partitions is different to a Rebalance. A Rebalance is initiated when
A consumer leave the Consumer Group (eg.g by failing to send a heartbeat or by explicitly requesting to leave)
A new consumer joins the ConsumerGroup
A consumer changes its topic subscriptions
a change in the subscribed topic such as increase/decrease of partitions.
During a rebalance the consumption is paused for the entire consumerGroup and the assignment is happening again based on your selected strategy.
You can choose RoundRobin as partition assignor instead of default Range assignment to get all instances consuming.
Range Assignor:
Range assignor works on each topic, and it will divide partitions into several ranges based on the total number of consumer. Then all consumers will be sorted by lexicographic order and each consumer will take a range of partitions.
For you case, you have 10 partitions for each topics and total 20 consumers. Then coordinator will assign 1 partition for each of first 10 consumers. In this case, you will get 10 idle consumers.
And the same thing happens for each topic, so you will get first 10 consumers has been assigned 5 partitions(1 for each topic) and other 10 will be idle.
Round-Robin Assignor:
Round-Robin assignor will list all partitions for all topics subscribed by consumer group. And each consumer will take partitions round-robin.
For you case, coordinator will list all partitions like:
t1p1, t1p2, t1p3 ... t5p9, t5p10
And all 20 consumers will take partitions in this order, so finally you will get:
Consumer1: t1p1, t3p1, t5p1
Consumer2: t1p2, t3p2, t5p2
.
.
.
Consumer 10: t2p10, t4p10
It could be more balanced than Range Assignor.

How to dynamically add consumers in consumer group kafka

How should I know when i have to scale the consumer in consumer group . What are the triggers for the consumers to scale when there is a fast producer ?
One straight forward approach would be to get the consumer lag(this can be computed as the difference between committed offset and beginning_offset) and if the lag computed in the last n times is increasing you can scale up and vice versa. You might've to consider some edge cases for example in case consumers have gone down and lag would be increasing and the auto-scaling function might spawn more threads/machines).
In Kafka while creating a topic, need to provide number of partitions and replication factor.
Let say there is one topic called TEST with 10 partitions, for parallel consumption of data need to create consumer group with 10 consumers, where each consumer will be consuming the data from the respective partition.
Here is the catch, if the topic is having 10 partitions and consumer group is having 12 consumers then two consumer remain idle until one of the consumer dies.
if the topic is having 10 partitions and consumer group has 8 consumers then 6 consumers will consume the data from 6 partitions (one consumer->one partition) whereas remaining two consumers will be responsible for consuming the data from two partitions (one consumer-> 2 partitions). its means last two-consumers consumes the data from four partitions.
Hence first thing is to decide number of partition for your kafka topic, more partitions means more parallelism.
whenever any new consumer is added or removed to the consumer group rebalacing is taken care by kafka.
Actually auto-scale is not a good idea because in Kafka message order is guaranteed in partition.
From Kafka docs:
Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent
by the same producer as a record M2, and M1 is sent first, then M1
will have a lower offset than M2 and appear earlier in the log.
A consumer instance sees records in the order they are stored in the log.
If you add more partitions and more consumers with respect to number of partitions, then you cannot satisfy ordering guarantee of messages.
Suppose that you have 10 partitions and your number of key is 102, then this message will be sent to partition: 102 % 10 = 2
But if you increase number of partitions to 15 for instance, then messages with same key (102) will be sent to a different partition: 102 % 15 = 12
As you see with this approach it is impossible to guarantee ordering of the messages with same keys.
Note: By the way Kafka uses murmur2(record.key())) % num partitions algorithm by default. The calculations above is just an example.

When to create new Consumer in ConsumerGroup

I am newbie in Kafka world and was reading about Consumer and ConsumerGroup.I got the difference between them and understand why we need ConsumerGroup in Kafka.
But here my question is When we should decide when to create new Consumer within same Group.
When we have huge amount of data?
Could someone help me to understand any real use case.
Thanks
I think some very good points have already been mentioned and here are my few cents. As your primary question seems to be "When" to add a consumer in a group...
There are 2 scenarios I could think of:
If one or more consumers in a Consumer group are overloaded by consumption from multiple partitions and you intend to distribute that load and increase parallelism. In this case, you could add consumers and trigger a rebalance.
If the partitions in a topic are increasing. This is quite a tricky scenario and may disturb the existing consumers in some ways. Following are a few examples of when this might happen:
a) If the semantics of your data are changing as partitioning a topic
based on the semantics is quite a common use case
b) If the data volume is increasing and the semantics are also changing
c) If only the volume is increasing that is leading to Scenario 1
However, as you've pointed out in your question - if only the volume is increasing and the consumers in a group are nicely mapped to the partitions on a 1-to-1 basis then you may be better off leaving things as they are. Otherwise, you might end up in the Scenario 2b.
Hope this helps!
In Apache Kafka, the level of parallelism is defined by the number of partitions. The higher the number of partitions, the higher the level of parallelism one can achieve. Depending on the volume of data, you should set the number of partitions to the desired value. Note that you can not have more active consumers than number of partitions.
For example, assume that you have a topic test with 5 partitions and a consumer group test-group. At any given time, only 5 consumers can be active withing test-group. Say we've got 1000 messages in topic test, then each of the 5 active consumers will consume (approximately) 200 messages. In case you run more than 5 partitions, the remaining will be inactive meaning that they won't consumer any messages at all. Similarly, if you have less consumers than partitions, then some of your active consumers will consumer messages from more than one partition.
Another -less straight-forward- example would be the following (taken from):
In this scenario, we do have two topics (A and B), each of which has 3 partitions. Two consumers belonging to the same consumer group are consuming messages from both topics.
As mentioned above, Kafka scales the topic consumption by distributing partitions among a consumer group. A consumer group is nothing, but a set of consumers sharing the common identifier.
A consumer is responsible to consumer messages from one or more partitions. If there is a single consumer running in the consumer group, it will consume data from all partitions. If there are multiple consumers running with in same group, they distribute the load in consumes from different-different partitions.
Maximum number of consumers are equal to the maximum number of partitions. If the consumers number exceeds than number of partitions, excessive consumers will be idle.
Let's say if there is a topic with 4 partitions. There are two consumer groups A and B. Group A has two consumers C1,C2. Both consumers will consume from approx 2 and 2 partitions.
While in Consumer Group B, there are 4 consumers, each consumer will consume from one partition.
When to use single consumer or multiple consumer : It depends on the use case. If you want a consolidated output from the processing where the calculations are based on the entire data in the topic, you should use single consumer unless you have a post processing logic to merge the output from each consumer.
If you are just reading the data and want to parallelize the process by distributing load, use multiple consumers

Kafka Consumer distribution not working as expected

I have Three topics each having three partitions on a cluster of kafka.
now, there are total 9 partitions. and when i create 9 consumers... the 6 are being idle. only three consumers are being used.
the expectation is: each consumer should pickup one partitions and hence, 9 consumer should pick up documents from 9 partitions
but what happens is:
one consumer picks up messages from three paritions one of different topic.
e.g. i have three topics Topic_A,Topic_B and Topic_C and three partitions each. hence parititions are as below:
Topic_A_0, Topic_A_1, Topic_A_2, Topic_B_0, Topic_B_1, Topic_B_2,
Topic_C_0, Topic_C_1, Topic_C_2
When i create 9 consumers,
the distribution works as below:
Consumer1: Topic_A_0,Topic_B_0,Topic_C_0
Consumer2: Topic_A_1,Topic_B_1,Topic_C_1
Consumer3: Topic_A_2,Topic_B_2,Topic_C_2
Consumer4,Consumer5,Consumer6,Consumer7,Consumer8,Consumer9 are idle
It should be
Consumer1: Topic_A_0
Consumer2: Topic_A_1
Consumer3: Topic_A_2
Consumer4: Topic_B_0
Consumer5: Topic_B_1
Consumer6: Topic_B_2
Consumer7: Topic_C_0
Consumer8: Topic_C_1
Consumer9: Topic_C_2
Is there any configuration i need to let all 9 consumer pick up messages from 9 unique parititons?
Make sure your all your consumers are subscribing to same set of topics under the same consumer group id. For the list of topics, you can pass a predefined list or a regular expression for consumers to subscribe from. The consumer-id can be set using group.id property in consumer.
The default partition assignment strategy doesn't work across topics so this is the expected behaviour. A similar question here : Kafka Consumers are balanced across topics