How multiple consumer group consumers work across partition on the same topic in Kafka? - apache-kafka

I was reading this SO answer and many such blogs.
What I know:
Multiple consumers can run on a single partition when running multiple consumers with multiple consumer group id and only one consumer from a consumer group can consume at a given time from a partition.
My question is related to multiple consumers from multiple consumer groups consuming from the same topic:
What happens in the case of multiple consumers(different groups) consuming a single topic(eventually the same partition)?
Do they get the same data?
How offset is managed? Is it separate for each consumer?
(Might be opinion based) How do you or generally recommended way is to handle overlapping data across two consumers of a separate group operating on a single partition?
Edit:
"overlapping data": means two consumers of separate consumer groups operating on the same partition getting the same data.

Yes they get the same data. Kafka only stores one copy of the data in the topic partitions' commit log. If consumers are not in the same group then they can each get the same data using fetch requests from the clients' consumer library. The assignment of which partitions each group member will get is managed by the lead consumer of each group. The entire process in detailed steps is documented here https://community.hortonworks.com/articles/72378/understanding-kafka-consumer-partition-assignment.html
Offsets are "managed" by the consumers, but "stored" in a special __consumer_offsets topic on the Kafka brokers.
Offsets are stored for each (consumer group, topic, partition) tuple. This combination is also used as the key when publishing offsets to the __consumer_offsets topic so that log compaction can delete old unneeded offset commit messages and so that all offsets for the same (consumer group, topic, partition) tuple are stored in the same partition of the __consumer_offsets topic (which defaults to 50 partitions)

Each consumer group gets every message from a subscribed topic.
Yes
Offset are stored by partition. For example let's say you have a topic with 2 partitions and a consumer group named cg made up of 2 consumers. In that case Kafka assigns each of the consumers one of the partitions. Then the consumers fetch the offset for the partition they were assigned to from Kafka (e.g. consumer 'asks' Kafka: "What is the offset for this topic for consumer group cg partition 1", or partition 2 for the other consumer). After getting the correct offset the consumer polls some Kafka broker for the next message in that partition.
I'm not entirely sure what you mean by overlapping data, can you clarify a bit or give an example?

Related

Will consumers in a group subscribe to one topic each, if there is a single partition per topic?

I'm using Debezium to log changes in my database, and Debezium generates change events in a topic for each table that exists in my database.These change records are consumed to populate another database.
If I restrict each topic to only have 1 partition, and let's say I have 4 consumers running, when the consumers subscribe to topics, will the 4 consumers divide the topics among themselves? (they would distribute partitions among themselves, but here 1 topic = 1 partition)
Would the above setup mean that for each table, the generated events on the topic will always be executed in order because there's at most 1 consumer acting on the topic at the same time?
I'm currently trying to get Kafka to restrict 1 partition per topic, and have 2 consumers. But the 2 consumers seem to pick up different topics every now and then and not be 'sticky' to the topics.
Yes, if there are 4 topics, 1 partition in each topic and 4 consumers, the consumers will evenly distribute the partitions among themselves which will result in 1 topic per consumer.
However, to get a "sticky assignment" you would need to give the consumer groups static group IDs. Otherwise, when there are failures and a rebalance is triggered, the consumer can be assigned a different partition (in this case a different topic).

Is Consumer Offset managed at Consumer group level or at the individual consumer inside that consumer group?

I am trying to find out is there any offsets at consumer group level as well. Is Consumer Offset is at Consumer group level or at the individual consumer inside that consumer group in Kafka ?
Offsets are tracked at ConsumerGroup level.
Imagine you have 4 consumer threads in one ConsumerGroup consuming from one topic with 4 partitions. If you now stop all 4 threads and restart just a single one with the same group the one threads will know where all 4 threads left off consuming and continue from there.
"you are saying one offset (basically a shared int/long value) will be shared/updated by all the consumers in a consumer group?"
Yes, this is correct. Remember that a single partition of a topic can be read only by one consumer thread within a group. Two consumer threads of the same ConsumerGroup will never consume a single topic partition at the same time. The offsets of the consumers groups are stored in an internal Kafka topic called __consumer_offsets. In this topic you basically have a key/value pair, where your key is basically the concatenation of
ConsumerGroup
Topic
Partition within Topic
and your value is the offset. This internal __consumer_offsets topic is available to all consumers so the information is shared.

Clarifications on Apache Kafka

I have few questions on Apache Kafka.
Can a single partition be assigned to more than one consumer from the same group?
Where is the offset stored? Is it in the partition or at the consumer.
Just like the producer always post the record to the lead partition and the records gets replicated to other partitions, Does Kafka consumer reads the data from the lead partition?
Lets say, that a consumer is reading from a partition and the consumer is running a long process. In this case, the rate at which the producer is updating the partition will be faster than the rate at which the consumer is consuming from the same partition. Is there a way we can speed up the consumption from that partition?
Can we create a checkpoint in the commit log on the partition so that the consumer can start processing from that specific checkpoint? This would be useful, if I want to perform the audit from a specific checkpoint onward?
Can a single partition be assigned to more than one consumer from the same group?
No, one partition can be consumed at most from one consumer within the same consumer group as described here: "This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group."
Where is the offset stored? Is it in the partition or at the consumer.
The offsets for each consumer group is stored in an internal kafka topic called __consumer_offsets as described here: "The coordinator of each group is chosen from the leaders of the internal offsets topic __consumer_offsets, which is used to store committed offsets."
Just like the producer always post the record to the lead partition and the records gets replicated to other partitions, Does Kafka consumer reads the data from the lead partition?
Yes it does. The leader partition is the only "client-facing" partition as described here: "'leader' is the node responsible for all reads and writes for the given partition.".
EDIT:
Is there a way we can speed up the consumption from that partition?
The measure to speed up consumption is to increase the partitions of the topic so you can have more consumer threads reading from that topic and process the data in parallel. At the same time you need to make sure that your data is evenly distributed accross partitions.

How multiple consumers from different consumer groups read from same partition?

I have a use case where i have 2 consumers in different consumer groups(cg1 and cg2) subscribing to same topic(Topic A) with 4 partitions.
What happens if both consumers are reading from same partition and one of them failed and other one commited the offset?
In Kafka the offset management is done by Consumer Group per Partition.
If you have two consumer groups reading the same topic and even partition a commit from one consumer group will not have any impact to the other consumer group. The consumer groups are completely discoupled.
One consumer of a consumer group can read data from a single topic partition. A single consumer can't read data from multiple partitions of a topic.
Example Consumer 1 of Consumer Group 1 can read data of only single topic partition.
Offset management is done by the zookeeper.
__consumer_offsets: Every consumer group maintains its offset per topic partitions. Since v0.9 the information of committed offsets for every consumer group is stored in this internal topic (prior to v0.9 this information was stored on Zookeeper).
When the offset manager receives an OffsetCommitRequest, it appends the request to a special compacted Kafka topic named __consumer_offsets. Finally, the offset manager will send a successful offset commit response to the consumer, only when all the replicas of the offsets topic receive the offsets.
simultaneously two consumers from two different consumer groups(cg1 and cg2) can read the data from same topic.
In kafka 1: Offset management is taken care by zookeeper.
In kafka 2: offsets of each consumer is stored at __Consumer_offsets topic
Offset used for keeping the track of consumers (how much records consumed by consumers), let say consumer-1 consume 10 records and consumer-2 consume-20 records and suddenly consumer-1 got died now whenever the consumer-1 will up then it will start reading from 11th record onward.

Does consumer consume from replica partitions if multiple consumers running under same consumer group?

I am writing a kafka consumer application. I have a topic with 4 partitions - 1 is leader and 3 are followers. Producer uses key to identify a partition to push a message.
If I write a consumer and run it on different nodes or start 4 instances of same consumer, how message consuming will happen ? Does all 4 instances will get same messages ?
What happens in the case of multiple consumer(same group) consuming a single topic?
Do they get same data?
How offset is managed? Is it separate for each consumer?
I would suggest that you read at least first few chapters of confluent's definitive guide to kafka to get a priliminary understanding of how kafka works.
I've kept my answers brief. Please refer to the book for detailed explanation.
How offset is managed? Is it separate for each consumer?
Depends on the group id. Only one offset is managed for a group.
What happens in the case of multiple consumer(same group) consuming a single topic?
Consumers can be multiple - all can be identified by the same or different groups.
If 2 consumers belong to the same group, both will not get all messages.
Do they get same data?
No. Once a message is sent and a read is committed, the offset is incremented for that group. So a different consumer with the same group will not receive that message.
Hope that helps :)
What happens in the case of multiple consumer(same group) consuming a single topic?
Answer: Producers send records to a particular partition based on the record’s key here. The default partitioner for Java uses a hash of the record’s key to choose the partition. When there are multiple consumers in same consumer group, each consumer gets different partition. So, in this case, only single consumer receives all the messages. When the consumer which is receiving messages goes down, group coordinator (one of the brokers in the cluster) triggers rebalance and then that partition is assigned to one of the available consumer.
Do they get same data?
Answer: If consumer commits consumed messages to partition and goes down, so as stated above, rebalance occurs. The consumer who gets this partition, will not get messages. But if consumer goes down before committing its then the consumer who gets this partition, will get messages.
How offset is managed? Is it separate for each consumer?
Answer: No, offset is not separate to each consumer. Partition never gets assigned to multiple consumers in same consumer group at a time. The consumer who gets partition assigned, gets offset as well by default.