How do I know when to scale out the consumers in a consumer group? What are the triggers for scaling consumers when there is a fast producer?
One straightforward approach is to monitor consumer lag (computed as the difference between the log-end offset and the committed offset for each partition) and, if the lag measured over the last n intervals keeps increasing, scale up, and vice versa. You may have to consider some edge cases: for example, if consumers have gone down, lag will be increasing and the auto-scaling function might spawn more threads/machines even though adding capacity won't help.
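As a minimal sketch of the lag computation, assuming the Java client (2.4+, for the committed(Set) overload) and a monitoring consumer configured with the group.id of the group being watched:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class LagMonitor {
        // Lag per partition = log-end offset (where the producer is)
        //                   - committed offset (where the group is).
        public static Map<TopicPartition, Long> computeLag(
                KafkaConsumer<?, ?> consumer, Set<TopicPartition> partitions) {
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
            Map<TopicPartition, OffsetAndMetadata> committed = consumer.committed(partitions);
            Map<TopicPartition, Long> lag = new HashMap<>();
            for (TopicPartition tp : partitions) {
                OffsetAndMetadata om = committed.get(tp);
                long consumed = (om == null) ? 0L : om.offset();
                lag.put(tp, endOffsets.getOrDefault(tp, 0L) - consumed);
            }
            return lag;
        }
    }

Sampling this periodically and scaling up only when the total lag keeps growing across several samples filters out short bursts.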
In Kafka, when creating a topic, you need to provide the number of partitions and the replication factor.
Let's say there is one topic called TEST with 10 partitions. For parallel consumption of the data you need to create a consumer group with 10 consumers, where each consumer reads from its respective partition.
Here is the catch: if the topic has 10 partitions and the consumer group has 12 consumers, then two consumers remain idle until one of the other consumers dies.
If the topic has 10 partitions and the consumer group has 8 consumers, then 6 consumers will consume data from 6 partitions (one consumer -> one partition), whereas the remaining two consumers will each be responsible for two partitions (one consumer -> 2 partitions). That means the last two consumers consume the data from four partitions between them.
Hence the first thing to decide is the number of partitions for your Kafka topic; more partitions means more parallelism.
Whenever a consumer is added to or removed from the consumer group, rebalancing is taken care of by Kafka.
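As a quick sketch of the Java AdminClient call that sets both values at creation time (the topic name TEST matches the example above; the broker address is an assumption):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // Topic TEST: 10 partitions, replication factor 2.
                NewTopic topic = new NewTopic("TEST", 10, (short) 2);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }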
Actually, auto-scaling is not a good idea, because in Kafka message order is only guaranteed within a partition.
From Kafka docs:
Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log. A consumer instance sees records in the order they are stored in the log.
If you add more partitions (and more consumers to match the number of partitions), then you can no longer satisfy the ordering guarantee for messages.
Suppose that you have 10 partitions and your key is 102; then messages with this key will be sent to partition 102 % 10 = 2.
But if you increase the number of partitions to 15, for instance, then messages with the same key (102) will be sent to a different partition: 102 % 15 = 12.
As you can see, with this approach it is impossible to guarantee the ordering of messages with the same key once the partition count changes.
Note: Kafka actually uses murmur2(record.key()) % numPartitions by default; the calculation above is just a simplified example.
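For illustration, a sketch of that default mapping using the murmur2 helper bundled with the Kafka clients library (Utils.murmur2 and Utils.toPositive are public utility methods there; relying on them outside the client internals is an assumption):

    import java.nio.charset.StandardCharsets;
    import org.apache.kafka.common.utils.Utils;

    public class PartitionForKey {
        // Mirrors the DefaultPartitioner's mapping for keyed records.
        static int partitionFor(String key, int numPartitions) {
            byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
        }

        public static void main(String[] args) {
            // The same key usually maps to a different partition once the count changes.
            System.out.println(partitionFor("102", 10));
            System.out.println(partitionFor("102", 15));
        }
    }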
I have a Kafka topic with 20 partitions and 5 consumers belonging to the same consumer group. That means we have 4 partitions per consumer. Let's say:
consumer-0 is assigned to the partition-0, partition-1, partition-2 and partition-3
consumer-1 is assigned to the partition-4, partition-5, partition-6 and partition-7
consumer-2 is assigned to the partition-8, partition-9, partition-10 and partition-11
consumer-3 is assigned to the partition-12, partition-13, partition-14 and partition-15
consumer-4 is assigned to the partition-16, partition-17, partition-18 and partition-19
The producer evenly sends 10 messages to the topic. In this case, only partitions 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9 are going to receive messages. The remaining ones will stay empty.
Our problem is that consumer-0 and consumer-1 will each process 4 messages while, at the same time, consumer-2 will process only two. Also, consumer-3 and consumer-4 won't do any processing, since their partitions are idle.
On the producer side, we are working with the DefaultPartitioner (kafka-client 2.3.1) so that the records are evenly sent to the partitions. We would like to ask if it is possible to produce messages fairly based on the Kafka consumers rather than the partitions. This way, each consumer would process only two messages and the processing load would be fairly distributed between the consumers.
I think the calculations you made are irrelevant, because there's no realistic scenario in which only 10 messages will be sent; and if that really is the situation, you should consider using fewer partitions and correspondingly fewer consumers in the consumer group.
You can assume that for a larger number of records in the stream, your producer will distribute the load roughly evenly between partitions and therefore between consumers, and then you don't care whether consumer-1 received 1000 records and consumer-2 received 998.
Remember also that if the load varies, and during low phases you don't want consumers to be idle but to handle similar loads, it is completely OK that some consumers get 4 messages, others 2, and others 0, because processing 4 messages is basically still being "idle" relative to the loads you are expecting, and these differences are so minor they don't really count; so let Kafka do the magic for the higher loads, when processing power/time really matters.
In general, I do not think it is a good design to try to force a producer to partition the data based on the consumers. A Kafka topic should separate the dependencies between a producer and a consumer and encapsulate them from each other.
Two main reasons not to try to achieve this:
a Kafka topic is meant to be consumed by multiple consumer groups and they are (hopefully) all independent of each other in terms of consumer threads.
a consumer group and its consumers are not stable, as one of them could die and a rebalance could happen. You would then need a sticky partition assignment strategy, which adds more complexity to your consumer. And what if one of the 5 consumers dies forever? You would not be able to read the messages of its four partitions. Remember, a consumer group is a "moving thing", and I recommend letting Kafka handle it as much as possible.
I understand this might not actually answer your question. If you want proper balancing, you should match the number of partitions with the number of consumer threads and ensure on the producer side that all messages are produced in a balanced way across the partitions.
Remember that even when using the DefaultPartitioner with as many as 20 partitions, you can still end up producing the data unbalanced, as it depends on the hash values of your keys.
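If you truly need even distribution regardless of keys, one option is to pin each record to an explicit partition via the ProducerRecord overload that takes a partition number; a minimal sketch, with my-topic and the broker address as placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ExplicitPartitionProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            int numPartitions = 20; // matches the topic in the question
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 10; i++) {
                    // Pin each record to an explicit partition instead of key hashing.
                    producer.send(new ProducerRecord<>("my-topic", i % numPartitions,
                            null, "message-" + i));
                }
            }
        }
    }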
If a producer writes to 3 topics with 4 partitions per topic, should the consumer group contain 4 or 12 consumers?
I want to achieve ideal consumption.
There should be one consumer per partition for ideal consumption. So, in your case, 12 consumers should be ideal.
If you have N partitions, then you can have up to N consumers within the same consumer group, each of which reads from a single partition. When you have fewer consumers than partitions, some of the consumers will read from more than one partition. And if you have more consumers than partitions, some of the consumers will be inactive and receive no messages at all.
You cannot have multiple consumers -within the same consumer group- consuming data from a single partition. Therefore, in order to consume data from the same partition using N consumers, you'd need to create N distinct consumer groups too.
Note that partitioning enhances the parallelism within a Kafka cluster. If you create thousands of consumers to consume data from only one partition, I suspect that you will lose some level of parallelism.
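A minimal sketch of that multiple-consumer-group pattern (the topic TEST and the broker address are placeholders); the only difference between the two consumers is the group.id, so each one independently sees every message:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class TwoGroups {
        static KafkaConsumer<String, String> consumerInGroup(String groupId) {
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            p.put("group.id", groupId);
            p.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            p.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            KafkaConsumer<String, String> c = new KafkaConsumer<>(p);
            c.subscribe(Collections.singletonList("TEST"));
            return c;
        }

        public static void main(String[] args) {
            // Different group.ids: each consumer independently reads ALL partitions
            // of TEST, because offsets are tracked per consumer group.
            KafkaConsumer<String, String> a = consumerInGroup("group-a");
            KafkaConsumer<String, String> b = consumerInGroup("group-b");
            // poll() each one (e.g., from its own thread); both see every message.
            a.close();
            b.close();
        }
    }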
If you have 3 topics with 4 partitions each, then for best optimisation you should have 4 consumers per consumer group.
Reason: if you have more than 4 consumers, your extra consumers will be left idle, because each topic's 4 partitions will be assigned to 4 consumers, with 1 consumer assigned 1 partition. So, in short, more than 4 consumers per consumer group is not required.
If you have fewer consumers, say 2 consumers for the 4 partitions, each consumer will consume messages from 2 partitions, which may overload them.
There is no limit on the number of consumer groups that can subscribe to a topic.
I'm confused about to what degree partition assignment is a client-side concern (partition.assignment.strategy) and what part is handled by Kafka itself.
For example, say I have one kafka topic with 100 partitions.
If I make 1 app that runs 5 consumer threads, with a partition.assignment.strategy of RangeAssignor, then I should get 5 consumers each consuming 20 partitions.
Now suppose I scale this app by deploying it 4 times, using the same consumer group. Will Kafka first divide 25 partitions to each of these app instances on its side, and only then are those 25 partitions further subdivided within the app by the assignment strategy?
That would result neatly in 4 apps with 5 consumers each, consuming 5 partitions each.
The behavior of the default Assignors is well documented in the Javadocs.
RangeAssignor is the default Assignor; see its Javadoc for an example of the assignments it generates: http://kafka.apache.org/21/javadoc/org/apache/kafka/clients/consumer/RangeAssignor.html
If you have 20 consumers using RangeAssignor that are consuming from a topic with 100 partitions, each consumer will be assigned 5 partitions.
Because RangeAssignor assigns partitions topic by topic, it can create really unbalanced assignments if you have topics with very few partitions. In that case, RoundRobinAssignor works better.
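For reference, the strategy is a plain consumer config; a minimal sketch that opts into RoundRobinAssignor (the group name and broker address are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class RoundRobinConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "my-group");                // assumed group name
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            // Override the default RangeAssignor:
            props.put("partition.assignment.strategy",
                    "org.apache.kafka.clients.consumer.RoundRobinAssignor");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // subscribe() and poll() as usual; the group leader applies this strategy
                // across all members when computing the assignment.
            }
        }
    }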
As part of group management, the consumer will keep track of the list of consumers that belong to a particular group and will trigger a rebalance operation if any one of the following events occurs:
Number of partitions change for any of the subscribed topics
A subscribed topic is created or deleted
An existing member of the consumer group is shut down or fails.
A new member is added to the consumer group.
Most likely point no. 4 is your case, and the strategy used will be the same one (partition.assignment.strategy). Note that this is not applicable if you have explicitly specified the partitions to be consumed by your consumer.
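That manual-assignment mode looks like this; a minimal sketch (the topic name and broker address are placeholders) where assign() bypasses group management entirely, so none of the four triggers above cause a rebalance:

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class ManualAssignment {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("enable.auto.commit", "false"); // no group.id, so no group commits
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // assign() pins the consumer to exact partitions; no rebalancing happens.
                consumer.assign(Arrays.asList(
                        new TopicPartition("my-topic", 0),
                        new TopicPartition("my-topic", 1)));
                // poll() as usual.
            }
        }
    }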
I am a newbie in the Kafka world and was reading about Consumer and ConsumerGroup. I got the difference between them and understand why we need ConsumerGroups in Kafka.
But my question is: when should we decide to create a new consumer within the same group?
When we have a huge amount of data?
Could someone help me understand a real use case?
Thanks
I think some very good points have already been mentioned, and here are my two cents. As your primary question seems to be "when" to add a consumer to a group...
There are 2 scenarios I could think of:
If one or more consumers in a consumer group are overloaded by consumption from multiple partitions and you intend to distribute that load and increase parallelism, you can add consumers and trigger a rebalance.
If the partitions in a topic are increasing. This is quite a tricky scenario and may disturb the existing consumers in some ways. Following are a few examples of when this might happen:
a) If the semantics of your data are changing, as partitioning a topic based on the semantics is quite a common use case
b) If the data volume is increasing and the semantics are also changing
c) If only the volume is increasing, leading to Scenario 1
However, as you've pointed out in your question, if only the volume is increasing and the consumers in a group are nicely mapped to the partitions on a 1-to-1 basis, then you may be better off leaving things as they are. Otherwise, you might end up in Scenario 2b.
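For completeness, growing a topic's partition count (Scenario 2) is an AdminClient call; a minimal sketch with my-topic and the broker address as placeholders. Existing records stay on their old partitions, and the key-to-partition mapping changes for new records, which is exactly what can disturb existing consumers:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewPartitions;

    public class IncreasePartitions {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // Grow "my-topic" to 12 partitions in total; triggers a group rebalance.
                admin.createPartitions(Collections.singletonMap(
                        "my-topic", NewPartitions.increaseTo(12))).all().get();
            }
        }
    }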
Hope this helps!
In Apache Kafka, the level of parallelism is defined by the number of partitions. The higher the number of partitions, the higher the level of parallelism one can achieve. Depending on the volume of data, you should set the number of partitions to the desired value. Note that you cannot have more active consumers than the number of partitions.
For example, assume that you have a topic test with 5 partitions and a consumer group test-group. At any given time, only 5 consumers can be active within test-group. Say we've got 1000 messages in topic test; then each of the 5 active consumers will consume (approximately) 200 messages. If you run more than 5 consumers, the remaining ones will be inactive, meaning that they won't consume any messages at all. Similarly, if you have fewer consumers than partitions, then some of your active consumers will consume messages from more than one partition.
Another, less straightforward, example is the following: two topics (A and B), each of which has 3 partitions, with two consumers belonging to the same consumer group consuming messages from both topics.
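To make the first example concrete, here's a minimal consumer sketch for test-group (the broker address is assumed); run up to 5 copies of it and Kafka spreads the 5 partitions of test across the instances:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class TestGroupConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "test-group");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("test"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }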
As mentioned above, Kafka scales topic consumption by distributing partitions among a consumer group. A consumer group is nothing but a set of consumers sharing a common identifier.
A consumer is responsible for consuming messages from one or more partitions. If there is a single consumer running in the consumer group, it will consume data from all partitions. If there are multiple consumers running within the same group, they distribute the load by consuming from different partitions.
The maximum number of active consumers equals the number of partitions. If the number of consumers exceeds the number of partitions, the excess consumers will be idle.
Let's say there is a topic with 4 partitions and two consumer groups, A and B. Group A has two consumers, C1 and C2; each will consume from roughly 2 partitions.
In consumer group B, there are 4 consumers; each consumer will consume from one partition.
When to use a single consumer or multiple consumers: it depends on the use case. If you want a consolidated output from the processing, where the calculations are based on the entire data in the topic, you should use a single consumer, unless you have post-processing logic to merge the output from each consumer.
If you are just reading the data and want to parallelize the processing by distributing the load, use multiple consumers.
Can you explain how Kafka partitions work for this scenario?
If I produce 9 messages (1-9) round-robin to 1 topic with 3 partitions, does it mean that:
Partition 1 contains: [1,4,7]
Partition 2 contains: [2,5,8]
Partition 3 contains: [3,6,9]?
Also, how many consumers can get all the data? 3? Why? Can you explain?
I also guess that a consumer group can solve it, but I'm not sure why.
Can you explain how Kafka partitions work for this scenario?
Your understanding is correct.
Also, how many consumers can get all the data? 3? Why?
Depends on how many consumers you have in your consumer group.
If you only have 1 consumer in a group, it will get all the messages from all partitions.
If you have 2 consumers in a group, each will claim a subset of the partitions, e.g. 1st consumer will get all messages from partitions 1 and 2 and the 2nd consumer will get messages from partition 3.
If you have 3 consumers in a group, each will get one partition assigned.
If you have more than 3 consumers in a group, 3 consumers will get one partition each and the remaining consumers will not get any messages, just act as redundancy in case of failover.
The distribution of messages across the partitions is correct if and only if you publish messages without keys. In Kafka it is common to publish messages as (key, value) pairs, and if you produce messages this way, the default partitioner will ensure that all messages with the same key are put in the same partition. It does this by applying a hash function to each key that maps it to one of the available partitions. In the extreme case where all your messages have the same key, they would all go to the same partition. If your messages all had either the string key "foo" or the key "bar", then all the messages with key "foo" might go to partition 3 and all the messages with key "bar" might go to partition 1.
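A minimal sketch of such keyed production (the topic name and broker address are placeholders); which concrete partition "foo" or "bar" hashes to is not guaranteed, but it is stable for a fixed partition count:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KeyedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Every record keyed "foo" lands on one partition; likewise "bar".
                producer.send(new ProducerRecord<>("my-topic", "foo", "value-1"));
                producer.send(new ProducerRecord<>("my-topic", "foo", "value-2"));
                producer.send(new ProducerRecord<>("my-topic", "bar", "value-3"));
            }
        }
    }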
In terms of your question about consumers, you can have an unlimited number of consumers. If each consumer has a unique group.id then they are considered independent and they will each get their own full set of the messages from all partitions.
However, if you have consumers that share the same group.id, then they are said to be in a consumer group, and each will get an exclusive and roughly equal subset of the partitions. If you had 3 consumers in the same group, they would get 1 partition each. If you added any more than 3 consumers to the same group, the first 3 would get 1 partition each and all the others would be standby consumers that only become active if one of the 3 active consumers leaves the group.
The distribution of the messages through the partitions is correct in principle. Partitions are the parallelism unit of Kafka.
You can have 3 consumers which will each handle one partition, but you can also have only 1 consumer which will get the data from the 3 partitions. It depends on the throughput you can have/want for each consumer.
Concerning the consumer groups:
If all your consumers have the same consumer group, the messages will be load balanced over the consumers
If your consumers have different consumer groups, then each message will be broadcast to all consumer processes
FYI: message order is only kept within a partition, which is why messages coming from different partitions may arrive unordered.