Apache Kafka Consumer Election

Let's say we have two consumers for a topic with one partition. Initially, the first consumes messages from the topic and the second remains idle. If the first fails, the second takes over and starts consuming the messages.
When the first comes alive again, will it start consuming messages again and make the second idle?
How can this be achieved?

Consumers are a part of a consumer group, defined by the consumer's group.id. For n partitions, the max number of active consumers in a consumer group is n. You can have more, but they will be idle.
For example, imagine a topic with 6 partitions. If you have 6
consumers in a consumer group, each consumer will read from 1
partition. If you have 12, six of the consumers will remain idle while
the other six consume from 1 partition. If you have 3 consumers, each
consumer will read from 2 partitions.
In your case, for a topic with 1 partition, only 1 consumer for each consumer group can be actively consuming at a time. If you have 2 consumers in your consumer group, then consumer-1 will consume all messages from the single partition. If that consumer fails, consumer-2 will start consuming at the last known offset of consumer-1. If consumer-1 comes back online, it will remain idle until consumer-2 fails. All consumers are treated equally.
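A hedged sketch of that failover behavior, using a tiny pure-Python model of Kafka's even assignment rather than the real client (the function and names below are illustrative only):

```python
def assign(partitions, consumers):
    # mimic Kafka's even partition assignment across live group members;
    # with more consumers than partitions, the surplus consumers get nothing
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# one partition, two consumers: consumer-2 sits idle
print(assign(["p0"], ["consumer-1", "consumer-2"]))

# consumer-1 dies -> rebalance: the survivor takes over at the last offset
print(assign(["p0"], ["consumer-2"]))
```

Which member owns the partition after each rebalance is decided by the group's assignor; all consumers are treated equally, so no member should rely on owning a particular partition.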

Related

Number of consumers in kafka consumer-group

If a producer has 3 topics with 4 partitions per topic, should the consumer group contain 4 or 12 consumers?
I want to achieve ideal consumption.
There should be one consumer per partition for ideal consumption. So, in your case, 12 consumers would be ideal.
If you have N partitions, then you can have up to N consumers within the same consumer group, each of which reads from a single partition. When you have fewer consumers than partitions, some of the consumers will read from more than one partition. Also, if you have more consumers than partitions, then some of the consumers will be inactive and will receive no messages at all.
You cannot have multiple consumers -within the same consumer group- consuming data from a single partition. Therefore, in order to consume data from the same partition using N consumers, you'd need to create N distinct consumer groups too.
Note that partitioning enhances the parallelism within a Kafka cluster. If you create thousands of consumers to consume data from only one partition, I suspect that you will lose some level of parallelism.
If you have 3 topics with 4 partitions each, then for best optimisation you should have 4 consumers per consumer group.
Reason: if you have more than 4 consumers, your extra consumers will be left idle, because the 4 partitions will be assigned to 4 consumers, with 1 consumer assigned 1 partition. So, in short, more than 4 consumers per consumer group is not required.
If you have fewer consumers, say 2 consumers for 4 partitions, each consumer will consume messages from 2 partitions, which will overload it.
There is no limit on the number of consumer groups that can subscribe to a topic.

How to dynamically add consumers in consumer group kafka

How do I know when to scale the consumers in a consumer group? What are the triggers for scaling the consumers when there is a fast producer?
One straightforward approach would be to track the consumer lag (computed as the difference between the log end offset and the committed offset); if the lag computed over the last n samples keeps increasing, you can scale up, and vice versa. You might have to consider some edge cases: for example, if consumers have gone down, the lag will be increasing and the auto-scaling function might spawn more threads/machines.
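A minimal sketch of that lag heuristic in Python; the offset values are hypothetical and would come from your monitoring (e.g. the output of kafka-consumer-groups.sh or the AdminClient):

```python
def consumer_lag(log_end_offset, committed_offset):
    # lag for one partition: how far the consumer trails the log end
    return log_end_offset - committed_offset

def should_scale_up(lag_samples):
    # scale up only if lag grew across every one of the last n samples
    return all(a < b for a, b in zip(lag_samples, lag_samples[1:]))

samples = [120, 340, 760, 1500]   # hypothetical lag readings over time
print(should_scale_up(samples))   # True: lag is strictly increasing
```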
In Kafka, when creating a topic you need to provide the number of partitions and the replication factor.
Say there is a topic called TEST with 10 partitions. For parallel consumption of data, you need to create a consumer group with 10 consumers, where each consumer consumes data from its respective partition.
Here is the catch: if the topic has 10 partitions and the consumer group has 12 consumers, then two consumers remain idle until one of the other consumers dies.
If the topic has 10 partitions and the consumer group has 8 consumers, then 6 consumers will consume data from 6 partitions (one consumer -> one partition), whereas the remaining two consumers will each be responsible for consuming data from two partitions (one consumer -> 2 partitions). That means the last two consumers consume data from four partitions in total.
Hence, the first thing is to decide the number of partitions for your Kafka topic; more partitions means more parallelism.
Whenever a consumer is added to or removed from the consumer group, rebalancing is taken care of by Kafka.
Actually, auto-scaling is not a good idea, because Kafka guarantees message order only within a partition.
From Kafka docs:
Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent
by the same producer as a record M2, and M1 is sent first, then M1
will have a lower offset than M2 and appear earlier in the log.
A consumer instance sees records in the order they are stored in the log.
If you add more partitions, and more consumers to match, then you can no longer satisfy the ordering guarantee for messages with the same key.
Suppose you have 10 partitions and your key value is 102; then a message with this key will be sent to partition 102 % 10 = 2.
But if you increase the number of partitions to 15, for instance, then messages with the same key (102) will be sent to a different partition: 102 % 15 = 12.
As you can see, with this approach it is impossible to guarantee the ordering of messages with the same key.
Note: Kafka actually uses murmur2(record.key()) % numPartitions by default; the calculations above are just an example.
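The arithmetic above as a runnable check; the plain modulo stands in for the real murmur2 hash, so this is an illustration, not Kafka's actual partitioner:

```python
def partition_for(key_hash, num_partitions):
    # simplified stand-in for Kafka's murmur2(record.key()) % num_partitions
    return key_hash % num_partitions

print(partition_for(102, 10))  # 2  -> key 102 lands on partition 2
print(partition_for(102, 15))  # 12 -> same key, different partition
```

Any message keyed 102 that was already written to partition 2 stays there, so after repartitioning its successors land elsewhere and per-key ordering is lost.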

Kafka consumer in group skips the partitions

I have a single consumer that consumes a topic. The topic has 6 partitions, and the single consumer is assigned to the group.
I poll like below:
consumer.poll(10000)
I exit the consumer fetch loop when no records are returned.
From the documentation, I believe poll returns empty when there are no records to consume, and a duration of 10000 ms is enough to rebalance and fetch records.
Most of the time, poll consumes records from all partitions, but sometimes poll fetches records from 3 partitions and returns empty records without consuming the other 3 partitions.
BTW, I use the 2.0.1 Kafka client, and the Kafka server version is 2.2.0 (Scala 2.11).
Does anyone have an idea why my consumer skips the other partitions and returns empty records? What should I do to consume all partitions?
The max.poll.records parameter is 500 by default. So sometimes it is not possible to get all messages from all partitions in the topic with a single poll().
max.poll.records: The maximum number of records returned in a single
call to poll().
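A toy model of why one poll() may not drain everything (no broker involved; the 500-record cap mirrors the default above, and exiting on the first empty poll is exactly the bug to avoid):

```python
MAX_POLL_RECORDS = 500  # client default for max.poll.records

def polls_to_drain(backlog, max_poll_records=MAX_POLL_RECORDS):
    # simulate repeated poll() calls, each returning at most
    # max_poll_records records, until the backlog is empty
    polls = 0
    while backlog > 0:
        backlog -= min(backlog, max_poll_records)
        polls += 1
    return polls

print(polls_to_drain(1200))  # 3: a single poll() would miss 700 records
```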
By the way, having just one consumer in a group is not an appropriate way to consume a topic with multiple partitions. As a best practice, the number of consumers in a consumer group should equal the number of partitions in the subscribed topic (Kafka assigns partitions to consumers evenly by default). Otherwise you cannot scale the load horizontally, and having partitions is not very meaningful in that case.
Kafka always assigns partitions to consumers. It is not possible to have a partition which is not assigned to a consumer (as long as the topic is subscribed by at least one live consumer).
But in your case, because you exit the consumer, it takes some time (session.timeout.ms) for Kafka to consider this consumer dead. If you start the consumer again without waiting session.timeout.ms to pass, then Kafka believes there are two active consumers in the consumer group and assigns the partitions evenly to these two consumers (like: partitions 0, 1, 2 to consumer-1 and partitions 3, 4, 5 to consumer-2). But after Kafka realizes that one of the consumers is dead, a rebalance is started in the consumer group and all partitions are assigned to the one remaining active consumer.
session.timeout.ms: The timeout used to detect client failures when
using Kafka's group management facility. The client sends periodic
heartbeats to indicate its liveness to the broker. If no heartbeats
are received by the broker before the expiration of this session
timeout, then the broker will remove this client from the group and
initiate a rebalance. Note that the value must be in the allowable
range as configured in the broker configuration by
group.min.session.timeout.ms and group.max.session.timeout.ms
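The liveness rule in that quote can be illustrated in a few lines (the 10000 ms timeout below is just an example value, not the actual default):

```python
def live_members(last_heartbeat_ms, now_ms, session_timeout_ms=10_000):
    # a member stays in the group only while its most recent heartbeat
    # arrived within session.timeout.ms; anyone else is removed,
    # which triggers a rebalance
    return [member for member, ts in last_heartbeat_ms.items()
            if now_ms - ts <= session_timeout_ms]

heartbeats = {"consumer-1": 0, "consumer-2": 9_000}
print(live_members(heartbeats, now_ms=12_000))  # ['consumer-2']
```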
You can check the current partition assignment for your consumer group with this CLI command on the broker side:
./kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group yourConsumerGroup

Kafka: what will happen to a message if a consumer group member goes down?

Suppose we have a topic named Topic and a consumer group CG with three consumers. The group offset equals 0.
The consumers start to read messages.
Read sequence:
Consumer 1 reads message 1.
Consumer 2 reads message 2.
Consumer 3 reads message 3.
Consumer 2 commits message 2.
Consumer 3 commits message 3.
Consumer 1 fails on processing of message 1 and goes down.
The question is: what will happen with message 1?
Or maybe I misunderstand how consumers read messages, as I am new to Kafka.
Version: Apache Kafka 2.4.0
You missed partitions: multiple consumers in the same group will never consume messages from the same partition.
Suppose you have a topic with three partitions (P0, P1, P2) and three consumers (C1, C2, C3) in the same group; then each one will consume from its own partition.
If any consumer fails to commit its offset and goes down, then on restart it will again start consuming messages from the previously committed offset (in your case 0).
Suppose you have a topic with 5 partitions (P0, P1, P2, P3, P4) and three consumers (C1, C2, C3) in the same group. The consumers will try to balance the load by each taking roughly two partitions:
C1 consumes from P0 and P1, C2 consumes from P2 and P3, and the remaining C3 consumes from P4.
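That 5-partition split matches what a range-style assignment would produce; a hypothetical sketch (not the real assignor code):

```python
def range_assign(num_partitions, consumers):
    # range-style split: the first (num_partitions % len(consumers))
    # members each take one extra partition
    base, extra = divmod(num_partitions, len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = base + (1 if i < extra else 0)
        assignment[consumer] = [f"P{p}" for p in range(start, start + count)]
        start += count
    return assignment

print(range_assign(5, ["C1", "C2", "C3"]))
# {'C1': ['P0', 'P1'], 'C2': ['P2', 'P3'], 'C3': ['P4']}
```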
Each consumer will have partitions assigned to them.
Let's say we have 6 partitions:
Consumer 1: Partition 1 & 2
Consumer 2: Partition 3 & 4
Consumer 3: Partition 5 & 6
When Consumer 1 goes down, the consumer group will re-balance and assign the free partitions to the other consumers, thus giving us the following setup:
Consumer 2: Partition 1 & 3 & 4
Consumer 3: Partition 2 & 5 & 6
The other consumers will start off from the last committed offset on that partition.

Kafka Consumer from different group consuming from different partition of Topic

I have a scenario where I have deployed 4 instances of Kafka Consumer on different nodes. My topic has 4 partitions. Now, I want to configure the Consumers in such a way that they all fetch from different partitions of the topic.
I know for a fact that if the Consumers are from the same consumer group, they ensure that the partitions are split equally. But in my case, they are not in the same group.
In order to achieve what you want, you need the consumers to be in the same consumer group. Only in that case is a "competing consumers" pattern applied: each consumer receives 1 of the 4 partitions, so you have 4 consumers, each one reading from 1 partition and receiving messages only for that partition.
When the consumers are part of different consumer groups, each consumer will be assigned all 4 partitions, receiving messages from all of them in a publish/subscribe fashion.
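The two modes can be modeled in a few lines of plain Python (no broker; the structures are purely illustrative): within one group a message reaches exactly one member, while every distinct group gets its own copy:

```python
def deliver(message_partition, groups):
    # each group receives every message once, routed to whichever
    # member of that group owns the message's partition
    deliveries = {}
    for group, assignment in groups.items():
        for member, partitions in assignment.items():
            if message_partition in partitions:
                deliveries[group] = member
    return deliveries

# one group of 4 members with the partitions split: competing consumers
same_group = {"grp": {f"c{i}": [i] for i in range(4)}}
print(deliver(2, same_group))       # only member c2 gets the message

# four groups of one member each, all owning every partition: pub/sub
separate_groups = {f"grp{i}": {"c0": [0, 1, 2, 3]} for i in range(4)}
print(deliver(2, separate_groups))  # every group gets a copy
```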