I am reading the Kafka documentation and trying to understand how it works

I am reading the Kafka documentation and trying to understand how it works. This is regarding consumers. In brief, a topic is divided into a number of partitions. There are a number of consumer groups, each having a number of consumer instances. Now, my question is: does each partition send the "same" message to each consumer group, which in turn hands it to a specific consumer instance within the group?
If so, how does Kafka ensure that the message is processed by only one consumer?
Kindly guide me if I am missing something.

Well, to put it simply:
We have a topic divided into partitions.
We have consumers that consume data from those topics.
Consumers are part of a consumer group when they share the same group.id.
Within a consumer group, each partition of a topic is consumed by exactly one consumer.
Example :
Topic "test" with 3 partitions.
Consumer group A : with 3 consumers
Consumer group B : with 2 consumers.
The two consumer groups A and B both consume data from the topic "test", and each group receives every message.
Within group A, each of the 3 consumers will consume one partition, whereas in group B (with 2 consumers), one consumer will read 2 partitions and the other will consume the remaining one.
If we add a third consumer group with only one consumer in it, that consumer will read all 3 partitions of the topic.
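The example above can be sketched in a few lines of plain Python. This is only a toy model, not Kafka client code: the function name assign_round_robin, the consumer names and the group "C" are made up for illustration, and the real assignment strategy may hand out the partitions in a different order. Kafka numbers partitions from 0, so the 3 partitions are 0, 1 and 2.

```python
def assign_round_robin(partitions, consumers):
    """Deal the partitions out to the consumers of ONE group, one at a time."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


partitions = [0, 1, 2]  # topic "test" with 3 partitions

# Hypothetical consumer names; each group is assigned independently.
groups = {
    "A": ["A-1", "A-2", "A-3"],
    "B": ["B-1", "B-2"],
    "C": ["C-1"],
}

assignments = {
    group_id: assign_round_robin(partitions, members)
    for group_id, members in groups.items()
}

for group_id, assignment in assignments.items():
    print(group_id, assignment)
# A {'A-1': [0], 'A-2': [1], 'A-3': [2]}
# B {'B-1': [0, 2], 'B-2': [1]}
# C {'C-1': [0, 1, 2]}
```

Every group covers all 3 partitions, and inside a group no partition appears twice.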
Hope that helps; let me know if anything is unclear.


How do Kafka consumers of the same group share messages between them?

Say, there is a Consumer group. (Consumers with the same group ID).
The Consumer group is consuming Topic A from a Broker.
Topic A has 4 partitions, and there are 4 Consumers in that group.
Each consumer consumes a different partition (Consumer 1 takes the messages in partition 1, Consumer 2 takes the messages in partition 2, and so on), because that's what a consumer group does in Kafka. Within the consumer group, each consumer has 1/4 of the topic.
My question: how do they share the messages so that they all have Topic A?
How do they combine those bits and pieces, and where does this take place?
If my computer (consumer 1 of group A) consumes Topic A from a Broker, and my friend's computer (consumer 2 of group A) consumes other pieces of the same Topic, how do we combine the message in Topic A?
I thought of a 'Consumer' as a computer or a server consuming a topic from a broker. That's why I got confused about consumer groups.
A consumer is a client program, and I can have many consumers on my computer or server. A consumer group is simply a set of consumer processes that share the same group ID, whether they run on one machine or on several.
So I don't need to worry about the consumers in a group sharing bits of messages to complete a topic. Previously, I thought of each consumer as a server or a computing resource, so they had to communicate somehow. That's how I got confused. They don't need to communicate with each other over the network or need a pool to share the partitions they consumed.
Consumer 1 can read from partition 1, Consumer 2 can read from partition 2, and if Consumers 1 and 2 share the same group ID (consumer group), Consumer 1 doesn't need to read from partition 2, and Consumer 2 doesn't need to read from partition 1. Between them, the group already covers the whole topic. Boom!
I posted an answer to help someone who thought like me.

Scaling up kafka consumer applications

Let's say I have one consumer group which is subscribed to 4 topics, and the partitions for each topic are:
First topic => 5 partitions
Second topic => 3 partitions
Third topic => 2 partitions
Fourth topic => 1 partition
Total number of partitions = 11. So how many application instances can I run in total:
5 (the maximum number of partitions in any input topic) or 11?
In Kafka, how far you can scale consumers depends on the number of partitions.
Let's assume you have one topic with 3 partitions, and you have 2 different consumer apps (different consumer groups) which do different work.
You can scale up to 3 consumers per consumer group.
A single consumer (consumer group A) can consume messages from all 3 partitions.
Two consumers in the same consumer group cannot consume from a single partition.
Take a look at this image: https://hadoopabcd.files.wordpress.com/2015/04/consumer-group.png
Read more about consumer groups in this blog series: https://dzone.com/articles/understanding-kafka-consumer-groups-and-consumer-l
Ideally, the number of consumers in a consumer group should equal the number of partitions. If that is not the case, you can have more than one consumer group; Kafka allows two consumers from different consumer groups to read from the same partition. How many consumers you run depends entirely on the resources you have available.
Suppose you have an application that needs to read messages from a Kafka topic, run some validations against them, and write the results to another data store. In this case your application will create a consumer object, subscribe to the appropriate topic, and start receiving messages, validating them and writing the results. This may work well for a while, but what if the rate at which producers write messages to the topic exceeds the rate at which your application can validate them? If you are limited to a single consumer reading and processing the data, your application may fall farther and farther behind, unable to keep up with the rate of incoming messages. Obviously there is a need to scale consumption from topics. Just like multiple producers can write to the same topic, we need to allow multiple consumers to read from the same topic, splitting the data between them.
Kafka consumers are typically part of a consumer group. When multiple consumers are subscribed to a topic and belong to the same consumer group, each consumer in the group will receive messages from a different subset of the partitions in the topic.
Please refer to this https://www.safaribooksonline.com/library/view/kafka-the-definitive/9781491936153/ch04.html
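For the "5 or 11?" question with four topics, the answer appears to depend on how the group spreads partitions over its members. The sketch below is a simplified model, not the client's actual code. It compares a range-style strategy, which splits each topic on its own, with a round-robin-style strategy, which pools the partitions of all topics. The topic names, the "app-N" instance names and the helper functions are assumptions made up for illustration.

```python
def range_assign(topics, consumers):
    """Per-topic split: each topic's partitions are divided among the consumers."""
    assignment = {c: [] for c in consumers}
    for topic, num_partitions in topics.items():
        per_consumer, extra = divmod(num_partitions, len(consumers))
        start = 0
        for i, c in enumerate(consumers):
            count = per_consumer + (1 if i < extra else 0)
            assignment[c] += [(topic, p) for p in range(start, start + count)]
            start += count
    return assignment


def round_robin_assign(topics, consumers):
    """Pooled split: all topic-partitions are dealt out one at a time."""
    assignment = {c: [] for c in consumers}
    pooled = [(t, p) for t, n in sorted(topics.items()) for p in range(n)]
    for i, tp in enumerate(pooled):
        assignment[consumers[i % len(consumers)]].append(tp)
    return assignment


def busy(assignment):
    """Number of consumers that own at least one partition."""
    return sum(1 for parts in assignment.values() if parts)


# Hypothetical topic names with the partition counts from the question.
topics = {"first": 5, "second": 3, "third": 2, "fourth": 1}
consumers = [f"app-{i}" for i in range(11)]  # 11 instances in one group

range_busy = busy(range_assign(topics, consumers))
round_robin_busy = busy(round_robin_assign(topics, consumers))
print(range_busy, round_robin_busy)  # 5 11
```

In this model, instances beyond the 5th sit idle under the per-topic split, while the pooled split keeps all 11 busy. Which behaviour you get in practice depends on the consumer's partition.assignment.strategy setting.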

How do consumers from multiple consumer groups work across partitions of the same topic in Kafka?

I was reading this SO answer and many such blogs.
What I know:
Multiple consumers can read from a single partition when they belong to different consumer groups (different group IDs), but only one consumer from any given consumer group can consume from a partition at a time.
My question is related to multiple consumers from multiple consumer groups consuming from the same topic:
What happens in the case of multiple consumers (different groups) consuming a single topic (eventually the same partition)?
Do they get the same data?
How are offsets managed? Are they separate for each consumer?
(Might be opinion based) What is the generally recommended way to handle overlapping data across two consumers of separate groups operating on a single partition?
"overlapping data": means two consumers of separate consumer groups operating on the same partition getting the same data.
Yes they get the same data. Kafka only stores one copy of the data in the topic partitions' commit log. If consumers are not in the same group then they can each get the same data using fetch requests from the clients' consumer library. The assignment of which partitions each group member will get is managed by the lead consumer of each group. The entire process in detailed steps is documented here https://community.hortonworks.com/articles/72378/understanding-kafka-consumer-partition-assignment.html
Offsets are "managed" by the consumers, but "stored" in a special __consumer_offsets topic on the Kafka brokers.
Offsets are stored for each (consumer group, topic, partition) tuple. This combination is also used as the key when publishing offsets to the __consumer_offsets topic, so that log compaction can delete old, unneeded offset commit messages and so that all offsets for the same (consumer group, topic, partition) tuple are stored in the same partition of the __consumer_offsets topic (which defaults to 50 partitions).
Each consumer group gets every message from a subscribed topic.
Offsets are stored per partition. For example, let's say you have a topic with 2 partitions and a consumer group named cg made up of 2 consumers. In that case Kafka assigns each of the consumers one of the partitions. Then the consumers fetch the offset for the partition they were assigned from Kafka (e.g. one consumer 'asks' Kafka: "What is the offset for this topic for consumer group cg, partition 1?", and the other consumer asks the same for partition 2). After getting the correct offset, the consumer polls the Kafka broker for the next messages in that partition.
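To make the offset bookkeeping concrete, here is a minimal in-memory sketch. It assumes a single topic "test" with one partition and uses plain dictionaries in place of the broker. The names log, committed and poll, and the second group "other", are illustrative only and not part of any Kafka API.

```python
# One stored copy of the data: (topic, partition) -> list of messages.
log = {("test", 0): ["m0", "m1", "m2", "m3"]}

# Committed offsets, keyed by (consumer group, topic, partition).
committed = {}


def poll(group, topic, partition, max_records=2):
    """Read the next records for this group and commit the new position."""
    key = (group, topic, partition)
    start = committed.get(key, 0)  # a group with no commit starts at 0 here
    records = log[(topic, partition)][start:start + max_records]
    committed[key] = start + len(records)
    return records


first_cg = poll("cg", "test", 0)        # ['m0', 'm1']
second_cg = poll("cg", "test", 0)       # ['m2', 'm3']
first_other = poll("other", "test", 0)  # ['m0', 'm1']

print(committed)
# {('cg', 'test', 0): 4, ('other', 'test', 0): 2}
```

Group "other" starts from its own position and sees the same messages that "cg" already read, because the two groups never share an offset entry.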
I'm not entirely sure what you mean by overlapping data, can you clarify a bit or give an example?

Kafka Consumer from different group consuming from different partition of Topic

I have a scenario where I have deployed 4 instances of Kafka Consumer on different nodes. My topic has 4 partitions. Now, I want to configure the Consumers in such a way that they all fetch from different partitions of the topic.
I know for a fact that if the Consumers are from the same consumer group, they ensure that the partitions are split equally. But in my case, they are not in the same group.
To achieve what you want, the consumers need to be in the same consumer group. Only in this case is the "competing consumers" pattern applied: each consumer is assigned 1 of the 4 partitions, so you have 4 consumers, each reading from 1 partition and receiving the messages for that partition.
When the consumers are part of different consumer groups, each consumer will be assigned all 4 partitions and will receive messages from all of them, in a publish/subscribe way.
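The difference between the two setups can be seen by counting what each consumer receives. The following is a rough simulation under assumed names (deliver, the c0..c3 consumers and the g0..g3 group IDs are hypothetical), with 4 partitions holding 2 messages each. It does not talk to a real broker.

```python
from collections import defaultdict

# 4 partitions with 2 messages each (8 messages in the topic).
partitions = {p: [f"p{p}-m{i}" for i in range(2)] for p in range(4)}


def deliver(groups):
    """groups maps group.id -> consumer names; returns consumer -> messages."""
    received = defaultdict(list)
    for group_id, members in groups.items():
        for p, messages in partitions.items():
            # Within a group, each partition has exactly one owner.
            owner = members[p % len(members)]
            received[owner].extend(messages)
    return dict(received)


# Competing consumers: 4 consumers share one group.id.
same_group = deliver({"g": ["c0", "c1", "c2", "c3"]})

# Publish/subscribe: 4 consumers, each alone in its own group.
separate_groups = deliver({f"g{i}": [f"c{i}"] for i in range(4)})

print({c: len(m) for c, m in same_group.items()})
# {'c0': 2, 'c1': 2, 'c2': 2, 'c3': 2}
print({c: len(m) for c, m in separate_groups.items()})
# {'c0': 8, 'c1': 8, 'c2': 8, 'c3': 8}
```

With one shared group the 8 messages are split four ways; with four separate groups every consumer gets all 8.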

If you have less consumers than partitions, what happens?

If you have less consumers than partitions, does that simply mean you will not consume all the messages on a given topic?
In a cloud environment, how are you supposed to keep track of how many consumers are running and how many are pointing to a given topic#partition?
What if you have multiple consumers on a given topic#partition? I guess the consumer has to somehow keep track of what messages it has already processed in case of duplicates?
In fact, each consumer belongs to a consumer group. When the Kafka cluster sends data to a consumer group, all records of a partition are sent to a single consumer in the group.
If there are more partitions than consumers in a group, some consumers will consume data from more than one partition. If there are more consumers in a group than partitions, some consumers will get no data. If you add new consumer instances to the group, they will take over some partitions from the old members. If you remove a consumer from the group (or the consumer dies), its partitions will be reassigned to the other members.
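The four cases described above (fewer consumers than partitions, more consumers than partitions, a consumer joining, a consumer leaving) can be walked through with a toy rebalance function. This is an assumption-laden sketch: rebalance and the member names c1..c4 are invented here, and a real rebalance may move partitions differently, for example to minimise movement.

```python
def rebalance(partitions, members):
    """Recompute the whole assignment for the current set of group members."""
    ordered = sorted(members)
    assignment = {m: [] for m in ordered}
    for i, p in enumerate(partitions):
        assignment[ordered[i % len(ordered)]].append(p)
    return assignment


partitions = [0, 1, 2]  # a topic with 3 partitions

# Fewer consumers than partitions: one consumer owns everything.
one = rebalance(partitions, {"c1"})
# A second consumer joins and takes over a partition.
two = rebalance(partitions, {"c1", "c2"})
# More consumers than partitions: the extra consumer gets nothing.
four = rebalance(partitions, {"c1", "c2", "c3", "c4"})
# c1 dies: its partition is reassigned to a surviving member.
after_crash = rebalance(partitions, {"c2", "c3", "c4"})

print(one)          # {'c1': [0, 1, 2]}
print(two)          # {'c1': [0, 2], 'c2': [1]}
print(four)         # {'c1': [0], 'c2': [1], 'c3': [2], 'c4': []}
print(after_crash)  # {'c2': [0], 'c3': [1], 'c4': [2]}
```

In every case all 3 partitions stay covered, so no messages are skipped just because the group is small.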
Now let's take a look at your questions:
If you have less consumers than partitions, does that simply mean you will not consume all the messages on a given topic?
NO. Some consumers in the same consumer group will consume data from more than one partition.
In a cloud environment, how are you supposed to keep track of how many consumers are running and how many are pointing to a given topic#partition?
Kafka will take care of it. If new consumers join the group, or old consumers die, Kafka will rebalance the partitions.
What if you have multiple consumers on a given topic#partition?
You CANNOT have multiple consumers (in a consumer group) consuming data from a single partition. However, if there is more than one consumer group, the same partition can be consumed by one (and only one) consumer in each consumer group.
1) No, it means that one consumer will handle more than one partition.
2) Kafka never assigns the same partition to more than one consumer in a group, because that would violate the ordering guarantee within a partition.
3) You could implement ConsumerRebalanceListener in your client code; it gets called whenever partitions are assigned to or revoked from the consumer.
You might want to take a look at this article, specifically the "Assigning partitions to consumers" part. In it I have a sample where you create a topic with 3 partitions and then a consumer with a ConsumerRebalanceListener that tells you which consumer is handling which partition. You can play around with it by starting 1 or more consumers and seeing what happens. The sample code is on GitHub.