Kafka: Use common consumer group to access multiple topics

Our cluster runs Kafka 0.11 and has strict restrictions on consumer groups. We cannot use arbitrary consumer groups, so an admin has to create each consumer group we need.
We run Kafka Connect HDFS sinks to read data from topics and write it to HDFS. All the topics have only one partition.
I am considering the following two patterns for using consumer groups with the Kafka HDFS sink.
As shown in the pictures:
Case 1: Each topic has its own Consumer Group
Case 2: All the topics have a common Consumer Group
I am aware that when a topic has multiple partitions and a consumer fails, another consumer in the same consumer group takes over that partition.
My question:
Does the same thing happen when multiple topics share the same consumer group? That is, if a consumer (an HDFS sink connector) fails, will another consumer (another HDFS sink connector) take over the work and read from that topic?
Update: Each Kafka HDFS sink connector is subscribed to only one topic.

I'm surprised that all the answers saying "yes" are wrong. I just tested it: using the same group.id for consumers of different topics works well and does NOT mean that they share messages, because for Kafka the key is (topic, group) rather than just (group). Here is what I did:
1) Created 2 different topics, T1 and T2, with 2 partitions in each topic.
2) Created 2 consumers with the same group, xxx.
3) Assigned consumer C1 to T1 and consumer C2 to T2.
4) Produced messages to T1: only consumer C1, assigned to T1, processed them.
5) Produced messages to T2: only consumer C2, assigned to T2, processed them.
6) Killed consumer C1 and repeated steps 4-5: only consumer C2 processed messages from T2, and messages from T1 were not processed.
Conclusion: Consumers with the same group name that are subscribed to different topics will NOT consume messages from each other's topics, because the key is (topic, group).
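The experiment above can be replayed with a small toy model. This is only a sketch, not Kafka's real assignor: the `assign` function and its round-robin rule are my own simplification, and it assumes each consumer is subscribed only to its own topic, as in the steps above.

```python
from collections import defaultdict

def assign(subscriptions, partitions_per_topic):
    """Toy group assignment (hypothetical helper, not a Kafka API).

    Each topic's partitions are spread round-robin over the live members
    that subscribe to that topic. Partitions of a topic that no live
    member subscribes to are returned as unowned.
    """
    assignment = defaultdict(list)
    unowned = []
    for topic, count in sorted(partitions_per_topic.items()):
        members = sorted(m for m, topics in subscriptions.items() if topic in topics)
        for p in range(count):
            if members:
                assignment[members[p % len(members)]].append((topic, p))
            else:
                unowned.append((topic, p))
    return dict(assignment), unowned

partitions = {"T1": 2, "T2": 2}
group_xxx = {"C1": {"T1"}, "C2": {"T2"}}  # same group, different topics

before, unowned_before = assign(group_xxx, partitions)

del group_xxx["C1"]  # kill consumer C1
after, unowned_after = assign(group_xxx, partitions)

print(before)         # C1 owns T1's partitions, C2 owns T2's
print(after)          # C2 still owns only T2's partitions
print(unowned_after)  # T1's partitions have no owner
```

In this model C2 never picks up T1 because it never subscribed to T1, which matches what the test observed.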

Absolutely yes. The Kafka consumers should subscribe to both topics, and then Kafka will assign the partitions (per topic) to the currently active members of the consumer group.
Regardless of whether each topic has one or multiple partitions, the remaining consumers take over the partitions of every topic whenever a consumer in the same group fails.
When a failure happens, Kafka always triggers a rebalance to distribute the partitions among the remaining active consumers of the group, and as a consequence the work on those topics continues.

Yes, as long as both consumers subscribe() to the same set of topics (topicA and topicB), the partitions of all topics will be distributed across all consumers.
In your case this would mean that if one of the consumers fails, both topics will be assigned to the surviving consumer.
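A minimal sketch of that shared-subscription case, under the assumption that every member subscribes to both topics and each topic has a single partition. The `rebalance` helper and the member names `sink-1` and `sink-2` are made up for illustration; real assignors differ in detail.

```python
def rebalance(members, topic_partitions):
    """Round-robin all partitions over the live members (hypothetical helper).

    Assumes every member has the same subscription, so any member may own
    any partition.
    """
    live = sorted(members)
    if not live:
        return {}
    owners = {m: [] for m in live}
    for i, tp in enumerate(sorted(topic_partitions)):
        owners[live[i % len(live)]].append(tp)
    return owners

topic_partitions = [("topicA", 0), ("topicB", 0)]

healthy = rebalance({"sink-1", "sink-2"}, topic_partitions)
after_failure = rebalance({"sink-2"}, topic_partitions)  # sink-1 has failed

print(healthy)        # one topic per consumer
print(after_failure)  # the survivor owns both topics
```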

The question is: if a consumer in a consumer group fails, will the other consumers in the same group pick up the subscribed topics and resume processing?
The accepted answer covers the scenario where topics are manually assigned to consumers. With automatic assignment (i.e., subscribe), the idle consumers in the group should pick up the failed consumer's work and start reading from the last committed offset. If they did not, that would break the consumer group's parallelism architecture.
Just look at this answer: Kafka consumer for multiple topic

Related

What is the need of consumer group in kafka?

I don't understand the practical use case of the consumer group in Kafka.
A partition can be read by only one consumer in a consumer group, so each consumer reads only a subset of a topic's records.
Can someone help with any practical scenario where the consumer group helps?
It's for parallel processing of event messages from the specific topic.
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
Read more here:
https://docs.confluent.io/5.3.3/kafka/introduction.html#consumers
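To make the load-balancing versus broadcast behaviour concrete, here is a small simulation. It is only a sketch: the `deliver` function and the rule "partition p belongs to consumer p % n" are assumptions for illustration, not how the broker actually routes records.

```python
def deliver(records, groups):
    """Simulate which consumer sees which record (hypothetical helper).

    records: list of (partition, value)
    groups:  {group_id: [consumer, ...]}
    Within a group, partition p is owned by consumer p % len(consumers),
    so each record reaches exactly one consumer per group.
    """
    received = {}
    for group_id, consumers in groups.items():
        for partition, value in records:
            owner = consumers[partition % len(consumers)]
            received.setdefault((group_id, owner), []).append(value)
    return received

records = [(0, "a"), (1, "b"), (0, "c"), (1, "d")]

# Same group: the records are split between the two consumers.
same_group = deliver(records, {"g": ["c1", "c2"]})

# Different groups: every consumer sees every record.
separate_groups = deliver(records, {"g1": ["c1"], "g2": ["c2"]})

print(same_group)
print(separate_groups)
```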

What Happens when there is only one partition in Kafka topic and multiple consumers?

I have a Kafka topic with only one partition, and I don't understand what will happen in the following cases. How will messages be delivered to consumers?
If all consumers are in the same group
If all consumers are in different groups
I am not sure whether consumers will receive unique messages or duplicate ones.
Each consumer subscribes to one or more partitions in a topic, and each consumer belongs to a consumer group. Below are the two scenarios:
When all consumers belong to the same group: each consumer will try to subscribe to a different partition. If there is only one partition, only one consumer will get the messages, while the other consumers will be idle.
When all consumers belong to different consumer groups: each consumer will get the messages from all partitions. Partition subscription is based on the consumer groups.
It depends on the consumer groups. Consumers within the same consumer group don't read the data again from the same partitions once the read offsets have been committed.
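For the single-partition case, the two scenarios can be sketched as follows. The `active_and_idle` helper is hypothetical, and which consumer ends up active in a real cluster depends on the assignor and member ids; here it is simply the first one in sorted order.

```python
def active_and_idle(consumers, partition_count):
    """Split one group's consumers into those that own a partition and those
    that sit idle (hypothetical helper; at most one consumer per partition)."""
    ordered = sorted(consumers)
    return ordered[:partition_count], ordered[partition_count:]

# Three consumers in the same group, topic with one partition.
active, idle = active_and_idle(["c1", "c2", "c3"], 1)

# Three consumers, each in its own group: every group has one active consumer.
per_group = {
    group: active_and_idle([consumer], 1)
    for group, consumer in [("g1", "c1"), ("g2", "c2"), ("g3", "c3")]
}

print(active, idle)
print(per_group)
```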

How multiple consumer group consumers work across partition on the same topic in Kafka?

I was reading this SO answer and many such blogs.
What I know:
Multiple consumers can read from a single partition when they belong to different consumer groups, and only one consumer from a given consumer group can consume from a partition at a time.
My question is about multiple consumers from multiple consumer groups consuming from the same topic:
What happens when multiple consumers (in different groups) consume a single topic (and eventually the same partition)?
Do they get the same data?
How are offsets managed? Are they separate for each consumer?
(Might be opinion based) What is the generally recommended way to handle overlapping data across two consumers from separate groups operating on a single partition?
Edit:
"overlapping data": means two consumers of separate consumer groups operating on the same partition getting the same data.
Yes they get the same data. Kafka only stores one copy of the data in the topic partitions' commit log. If consumers are not in the same group then they can each get the same data using fetch requests from the clients' consumer library. The assignment of which partitions each group member will get is managed by the lead consumer of each group. The entire process in detailed steps is documented here https://community.hortonworks.com/articles/72378/understanding-kafka-consumer-partition-assignment.html
Offsets are "managed" by the consumers, but "stored" in a special __consumer_offsets topic on the Kafka brokers.
Offsets are stored for each (consumer group, topic, partition) tuple. This combination is also used as the key when publishing offsets to the __consumer_offsets topic so that log compaction can delete old unneeded offset commit messages and so that all offsets for the same (consumer group, topic, partition) tuple are stored in the same partition of the __consumer_offsets topic (which defaults to 50 partitions)
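A toy model of that keying, to show why two groups reading the same partition do not disturb each other. The `OffsetStore` class, the topic name `events`, and the group names are all made up for this sketch, and it assumes a group with no committed offset starts from 0 (similar in spirit to auto.offset.reset=earliest).

```python
class OffsetStore:
    """In-memory stand-in for committed offsets (not the real __consumer_offsets)."""

    def __init__(self):
        self._committed = {}

    def commit(self, group, topic, partition, offset):
        # One entry per (consumer group, topic, partition) tuple.
        self._committed[(group, topic, partition)] = offset

    def fetch(self, group, topic, partition):
        # Assumption: start from the beginning when nothing was committed.
        return self._committed.get((group, topic, partition), 0)

log = ["m0", "m1", "m2", "m3"]  # the single stored copy of partition 0
store = OffsetStore()

def poll(group, max_records):
    """Read from the group's committed position, then commit the new position."""
    start = store.fetch(group, "events", 0)
    batch = log[start:start + max_records]
    store.commit(group, "events", 0, start + len(batch))
    return batch

first_a = poll("group-a", 3)   # group-a reads m0..m2
first_b = poll("group-b", 1)   # group-b independently starts at m0
second_a = poll("group-a", 3)  # group-a resumes at m3

print(first_a, first_b, second_a)
```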
Each consumer group gets every message from a subscribed topic.
Yes
Offsets are stored per partition. For example, let's say you have a topic with 2 partitions and a consumer group named cg made up of 2 consumers. In that case Kafka assigns each of the consumers one of the partitions. Then each consumer fetches from Kafka the offset for the partition it was assigned (e.g. one consumer 'asks' Kafka: "What is the offset for this topic for consumer group cg, partition 1?", and the other asks the same for partition 2). After getting the correct offset, the consumer polls a Kafka broker for the next message in that partition.
I'm not entirely sure what you mean by overlapping data, can you clarify a bit or give an example?

How does Kafka help to realize the abstraction of queuing as well as publish-subscribe?

The Kafka documentation states that:
Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers.
If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.
I've a couple of doubts regarding this:
1) Why will the published message go to a single consumer instance of a consumer group? Isn't it the responsibility of consumers to read from the partitions? What does "go" even mean here?
2) The consumers which are interested in particular topics should just read from the partitions they're interested in. What's the relevance of consumer groups?
3) And how does this help to realize the abstraction of queue and publish-subscribe?
In Kafka a topic can have multiple partitions. If a consumer group has X consumers, the partitions of that topic are split among them. (For example, if you have 1 topic with 2 partitions and a consumer group with 2 consumers, each consumer will consume from 1 partition; in the same scenario, if the consumer group has only 1 consumer, that consumer will read from both partitions.)
The consumer group basically coordinates (acts as a coordinator for) the different consumers with the topic(s) and partitions. If you have 4 consumers in the same CG and 1 crashes, the consumer group gives the partitions of the crashed consumer to the other consumers available in the same CG, so the information in those partitions is still processed (if the CG didn't redistribute the partitions, some of them would never be read after a consumer crash).
If the consumers are in the same CG, then the information that is sent to the topic is distributed among them. If each of the consumers has a different CG, then they will all get all the messages.
Hope it's more clear now, the Kafka documentation needs improvement.

Kafka consumer & Partition query

I am new to Kafka and have read a few tutorials. I couldn't understand the relationship between consumers and partitions.
Please address my queries below.
As per the documentation, only one consumer in a group can consume a message. Why do we need to create more consumers in that same group? What is the benefit?
Are consumers assigned to individual partitions by ZK? If yes, and the producer sends a message to a different partition, how will the other partition's consumer consume the message?
I have one topic with 3 partitions. I post a message and it goes to P0. I have 5 consumers (in different consumer groups). Will all consumers read the message from P0? If I add many more consumers, will they all read messages from the same P0?
If all consumers read from the same P0, how can performance be high?
How does rebalancing work? Does it happen when you add a consumer group, or when you add a consumer to the same group?
Please clarify my questions and give some example.
Yes, only one consumer in a consumer group can consume messages from a given partition; the rest of the consumers in the same group are assigned to the remaining partitions so that processing happens in parallel. The advantage is parallel processing.
Yes, partitions are assigned to consumers by ZK (this applies to the old ZooKeeper-based consumer; newer consumer clients get their assignment through the group coordinator on the brokers). Allocation is based on the partition count and the consumer count. Example: topic Test has 3 partitions (P1, P2, and P3). We have one consumer (C1). C1 will read messages from all partitions. If you add one more consumer (C2) to the same group, P1 and P2 go to C1 and P3 goes to C2. If you add one more consumer (C3), then P1=C1, P2=C2 and P3=C3. The number of consumers should not be greater than the number of partitions for that topic.
The point above answers this one.
Rebalancing happens when you add a consumer to the same consumer group.
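The P1/P2/P3 example above can be reproduced with a range-style split. This is a sketch of the idea only (contiguous ranges, with the first consumers taking one extra partition when the division is uneven); the `range_assign` function is my own and is not a Kafka API.

```python
def range_assign(consumers, partitions):
    """Give each consumer a contiguous range of partitions (illustrative helper).

    The first `extra` consumers (in sorted order) get one partition more
    than the rest; consumers beyond the partition count get nothing.
    """
    ordered = sorted(consumers)
    base, extra = divmod(len(partitions), len(ordered))
    result, start = {}, 0
    for i, consumer in enumerate(ordered):
        size = base + (1 if i < extra else 0)
        result[consumer] = partitions[start:start + size]
        start += size
    return result

partitions = ["P1", "P2", "P3"]

one = range_assign(["C1"], partitions)
two = range_assign(["C1", "C2"], partitions)
three = range_assign(["C1", "C2", "C3"], partitions)
four = range_assign(["C1", "C2", "C3", "C4"], partitions)

for result in (one, two, three, four):
    print(result)
```

With four consumers the last one ends up with an empty list, which is why adding more consumers than partitions does not add throughput.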