I am a beginner learning Kafka and was going through topics and producers. As per my understanding,
a topic is just a logical name for a group of partitions, and the partitions are spread across the nodes.
Is my understanding correct that for a given topic with, let's say, 5 partitions, all 5 partitions will be on 5 different brokers? And if there is another topic with 5 partitions, its 5 partitions will also be on those 5 brokers? Effectively, for this configuration, each of the 5 brokers would host two partitions, one from each topic. Am I right?
Another point concerns how the producer posts messages and the consumer consumes them. The producer has a list of brokers configured and posts the message to a topic via that list. The message is always written to the leader partition, i.e. one of the partitions on a broker, and is then replicated to the other partitions on other brokers. If the producer is configured with only one broker, is the message still posted to the leader partition, even when the configured broker is not the one hosting the leader for that topic? For example: a topic named events has 5 partitions on 5 brokers; broker-2 contains the leader partition, but the producer is configured with broker-1 alone.
I also read that the producer can specify the partition while posting the message. If so, doesn't that contradict the statement that the producer always posts to the leader partition? And if the producer posts to a specific partition and the broker hosting that partition is down, the message will not be posted. Also, in distributed systems it is not a best practice to pin down a specific partition. Am I missing something here?
Does the consumer also read from the leader partition, or does the consumer group assign different consumers to different partitions?
I'm trying to wrap my head around Kafka, and the thing that confuses me is partitions. In all or most of the examples I have seen, the consumers/producers seem to have implicit knowledge of the partitions (which partition to write messages to, which partition to read messages from). Is this correct? I initially thought that partitions were internal to the system and that consumers/producers don't need to know partition information. If they do need to know it, aren't we exposing the inner structure of the topic to the outside world to a certain extent?
In Kafka, every partition of a topic is hosted on a set of brokers, with at most one leader broker per partition. Within a consumer group, it makes no sense to have more consumers than partitions, because the extra consumers would sit idle. One consumer can be assigned multiple partitions, but a partition cannot be shared by multiple consumers of the same group. So the number of partitions must be chosen according to the throughput you expect. The number of partitions of a topic can be increased, but never decreased. When consumers read a partition, they actually connect to its leader broker to consume messages.
The partition leader can change, though; the consumer then gets an error and has to request metadata from the cluster to learn the new partition leader. At consumer startup, partitions are assigned according to the Kafka consumer parameter partition.assignment.strategy. Of course, if consumers of the same consumer group start at different times, there will be a partition rebalance.
So in the end, a client does need quite a lot of information about the Kafka cluster structure.
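To make the "consumers vs. partitions" rule above concrete, here is a minimal sketch in plain Python. It is only a toy model of a range-style assignment, not the Kafka client's actual assignor; the function name `assign_partitions` and the consumer names are made up for illustration.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Toy range-style assignment within ONE consumer group.

    Hypothetical helper for illustration; not the Kafka client's assignor code.
    """
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    if not consumers:
        return assignment
    ordered = sorted(consumers)
    base, extra = divmod(len(partitions), len(ordered))
    start = 0
    for i, consumer in enumerate(ordered):
        # The first `extra` consumers receive one additional partition.
        count = base + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment


if __name__ == "__main__":
    # 3 partitions, 2 consumers: one consumer owns two partitions.
    print(assign_partitions([0, 1, 2], ["c1", "c2"]))
    # 3 partitions, 4 consumers: the fourth consumer gets nothing and sits idle.
    print(assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4"]))
```

Each partition ends up with exactly one owner in the group, which is why adding consumers beyond the partition count does not add throughput.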
I read the Kafka documentation, but I am still confused when people talk about data and partitions.
In the documentation I see that the client sends a message to a partition.
Then the partition replicates the message to replicas (across brokers).
And the consumer reads data from the partition.
I have a topic which has 2 partitions.
Let's say I have one producer, which sends messages to partition#1.
But I have 2 consumers: one reads from partition#1, and the second from partition#2.
Does it mean that partition#1 will have 50% of the messages and partition#2 the other 50%? Or, when the client sends data to partition#1, should partition#1 replicate the data not only across brokers but also across partitions?
About your specific example ...
If your producer sends messages without a key, the default partitioner (in the producer itself) applies a round-robin algorithm to spread messages across partitions: message 1 to partition 1, message 2 to partition 2, message 3 to partition 1, and so on. (Newer client versions may stick to one partition for a whole batch before switching, but the split still evens out over many messages.) So you are right: partition 1 will get 50% of the messages. One consumer reading from partition 1 will get 50% of the sent messages; the other 50% will go to the other consumer reading from partition 2. This is how Kafka achieves higher throughput and handles more consumers.
It's important to add that when a partition has multiple replicas, one of them is designated the "leader" and the others are "followers". Message exchange always happens through the "leader". The "followers" are just copies. They are used when the broker hosting the "leader" replica goes down: another broker hosting a "follower" replica is then elected as the new "leader".
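As a rough illustration of the leader/follower idea, here is a small toy model in plain Python. It is a sketch under heavy simplifications (followers copy the leader's log immediately, and the "election" just picks the first surviving replica); the class `ToyPartition` and the broker names are hypothetical and have nothing to do with Kafka's real code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyPartition:
    """One partition replicated on several brokers (toy model, not Kafka code)."""
    replicas: List[str]                       # broker names hosting a copy
    leader: str = ""
    logs: Dict[str, List[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.leader = self.replicas[0]
        self.logs = {broker: [] for broker in self.replicas}

    def produce(self, message: str) -> int:
        # Clients only talk to the leader; followers copy the leader's log.
        self.logs[self.leader].append(message)
        for broker in self.replicas:
            if broker != self.leader:
                self.logs[broker] = list(self.logs[self.leader])
        return len(self.logs[self.leader]) - 1    # offset of the new message

    def fail_broker(self, broker: str) -> None:
        # Assumes at least one replica survives the failure.
        self.replicas.remove(broker)
        del self.logs[broker]
        if broker == self.leader:
            # Simplified "election": promote the first surviving follower.
            self.leader = self.replicas[0]


if __name__ == "__main__":
    p = ToyPartition(replicas=["broker-1", "broker-2", "broker-3"])
    p.produce("a")
    p.produce("b")
    p.fail_broker("broker-1")
    print(p.leader, p.logs[p.leader])
```

Note that replication here is between copies of the same partition on different brokers; nothing is ever copied from one partition to another.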
I hope this helps.
Say I have a topic T1 with 3 partitions, i.e. P1, P2 and P3, where P1 is the leader and the rest are followers.
Now there are 2 producers that want to push to the same topic T1. I believe P1 will be the leader for both of them? Also, will a single offset be maintained for both of them, or is the offset maintained per partition per producer?
Now I have a single consumer polling from T1. Will it get messages from both producers by default, or does it have to explicitly mention the producer name if it wants messages from a specific producer?
The leader does not depend on the producers or consumers, so P1 will always be returned as the leader. Offsets do not matter to producers; they are tracked per consumer group. The offset determines which messages have been read and committed by a consumer group.
The consumer will always read all the messages; it does not matter which producer published them.
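A tiny sketch may help with the "two producers, one consumer" case. This is a toy append-only log in plain Python, not Kafka code; `ToyPartitionLog` is a made-up name, and the producer name is stored with each record only so the interleaving is visible in the output.

```python
from typing import List, Tuple


class ToyPartitionLog:
    """Append-only log for one partition (toy model, not Kafka code)."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, str]] = []    # (producer name, payload)

    def append(self, producer: str, payload: str) -> int:
        # The offset is simply the record's position in this partition's log,
        # no matter which producer wrote it.
        self.records.append((producer, payload))
        return len(self.records) - 1

    def read_from(self, offset: int) -> List[Tuple[int, str, str]]:
        # A consumer gets every record from `offset` onward, whoever wrote it.
        return [(i, p, v) for i, (p, v) in enumerate(self.records) if i >= offset]


if __name__ == "__main__":
    log = ToyPartitionLog()
    log.append("producer-A", "x")
    log.append("producer-B", "y")
    log.append("producer-A", "z")
    for record in log.read_from(0):
        print(record)
```

There is one sequence of offsets per partition, shared by all producers, and the consumer has no producer filter to apply.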
You may be mixing up replicas and partitions. When you say you have a topic with 3 partitions, it means your records will be distributed among them according to the record key (or the partitioner algorithm).
There is no 'leader partition'. Instead, each partition has a leader broker that handles it. In your case you will have 3 leaders, each of them managing one of your 3 partitions.
An interesting post regarding Kafka partitions:
Understanding Kafka Topics and Partitions
Yannick
I was reading this SO answer and many such blogs.
What I know:
Multiple consumers can read a single partition when they belong to different consumer groups (different group IDs), but only one consumer from a given consumer group can consume from a partition at a time.
My question is related to multiple consumers from multiple consumer groups consuming from the same topic:
What happens when multiple consumers (in different groups) consume a single topic (and eventually the same partition)?
Do they get the same data?
How is the offset managed? Is it separate for each consumer?
(Might be opinion-based) What is the generally recommended way to handle overlapping data across two consumers of separate groups operating on a single partition?
Edit:
"Overlapping data" means two consumers of separate consumer groups operating on the same partition and getting the same data.
Yes, they get the same data. Kafka only stores one copy of the data in the topic partitions' commit log. If consumers are not in the same group, then they can each get the same data using fetch requests from the clients' consumer library. The assignment of which partitions each group member will get is managed by the lead consumer of each group. The entire process is documented in detailed steps here: https://community.hortonworks.com/articles/72378/understanding-kafka-consumer-partition-assignment.html
Offsets are "managed" by the consumers, but "stored" in a special __consumer_offsets topic on the Kafka brokers.
Offsets are stored for each (consumer group, topic, partition) tuple. This combination is also used as the key when publishing offsets to the __consumer_offsets topic, so that log compaction can delete old, unneeded offset commit messages and so that all offsets for the same (consumer group, topic, partition) tuple are stored in the same partition of the __consumer_offsets topic (which defaults to 50 partitions).
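The per-tuple bookkeeping described above can be sketched with a plain dictionary. This is a toy model, not the real __consumer_offsets implementation; `ToyOffsetStore`, `poll`, and the group names are hypothetical. One assumption to flag: a group with no committed offset starts at 0 here, which roughly corresponds to running with auto.offset.reset=earliest.

```python
from typing import Dict, List, Tuple

OffsetKey = Tuple[str, str, int]    # (consumer group, topic, partition)


class ToyOffsetStore:
    """Committed offsets keyed by (group, topic, partition); toy model only."""

    def __init__(self) -> None:
        self.committed: Dict[OffsetKey, int] = {}

    def commit(self, group: str, topic: str, partition: int, next_offset: int) -> None:
        self.committed[(group, topic, partition)] = next_offset

    def fetch(self, group: str, topic: str, partition: int) -> int:
        # Assumption: no committed offset means "start from the beginning".
        return self.committed.get((group, topic, partition), 0)


def poll(log: List[str], store: ToyOffsetStore, group: str, topic: str,
         partition: int, max_records: int = 2) -> List[str]:
    """Read the next batch for `group` and commit the new position."""
    start = store.fetch(group, topic, partition)
    batch = log[start:start + max_records]
    store.commit(group, topic, partition, start + len(batch))
    return batch


if __name__ == "__main__":
    partition_log = ["m0", "m1", "m2"]          # the single stored copy of the data
    store = ToyOffsetStore()
    print(poll(partition_log, store, "billing", "events", 0))
    print(poll(partition_log, store, "audit", "events", 0))
    print(poll(partition_log, store, "billing", "events", 0))
```

Both groups read the same stored records, and advancing one group's offset has no effect on the other's.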
Each consumer group gets every message from a subscribed topic.
Yes
Offsets are stored per partition. For example, let's say you have a topic with 2 partitions and a consumer group named cg made up of 2 consumers. In that case Kafka assigns each of the consumers one of the partitions. Then the consumers fetch the offset for the partition they were assigned from Kafka (e.g. one consumer 'asks' Kafka: "What is the offset for this topic for consumer group cg, partition 1?", and the other consumer asks the same for partition 2). After getting the correct offset, the consumer polls the Kafka broker for the next message in that partition.
I'm not entirely sure what you mean by overlapping data, can you clarify a bit or give an example?
Just wanna understand the basics properly.
Let's say I've a topic called "myTopic" that has 3 partitions P0, P1 & P2.
Each of these partitions will have a leader and the data (messages) for this topic is distributed across these partitions.
1. Producer will always write to the leader of the partition in a round robin fashion based on the load on the broker. Is that right?
2. How does the producer know the leader of the partition?
3. Consumer reading a particular topic should read all partitions of that topic? Is that correct?
Appreciate your help.
Producer will always write to the leader of the partition in a round robin fashion based on the load on the broker. Is that right?
By default, yes.
That said, a producer can also decide to use a custom partitioning scheme, i.e. a different strategy for choosing which partitions data is written to.
How does the producer know the leader of the partition?
Through the Kafka protocol.
Consumer reading a particular topic should read all partitions of that topic? Is that correct?
By default, yes.
That said, you can also implement e.g. consumer applications that implement custom logic, e.g. a "sampling" consumer that only reads from 1 out of N partitions.
Producer will always write to the leader of the partition
Yes, always.
in a round robin fashion based on the load on the broker
No. If a partition is explicitly set on a ProducerRecord then that partition is used. Otherwise, if a custom partitioner implementation is provided, that determines the partition. Otherwise, if the msg key is not null, the hash of the key will be used to consistently send msgs with the same key to the same partition. If the msg key is null, only then the msg will indeed be sent to any partition in a round-robin fashion. However, this is irrespective of the load on the broker.
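The order of precedence described above can be sketched like this. It is a toy model in plain Python, not the Java client's partitioner: `ToyPartitioner` is a hypothetical class, and `zlib.crc32` is used only as a stand-in for the hash function the real client applies to the key.

```python
import zlib
from itertools import count
from typing import Callable, Optional


class ToyPartitioner:
    """Toy model of how a partition is picked for a record (not Kafka code)."""

    def __init__(self, num_partitions: int,
                 custom: Optional[Callable[[Optional[bytes], int], int]] = None) -> None:
        self.num_partitions = num_partitions
        self.custom = custom                # optional custom partitioner
        self._counter = count()             # state for the round-robin fallback

    def choose(self, key: Optional[bytes] = None, partition: Optional[int] = None) -> int:
        if partition is not None:           # 1. explicit partition on the record
            return partition
        if self.custom is not None:         # 2. custom partitioner, if configured
            return self.custom(key, self.num_partitions)
        if key is not None:                 # 3. same key -> same partition
            # crc32 is a stand-in hash chosen for this sketch.
            return zlib.crc32(key) % self.num_partitions
        # 4. no key: plain round robin, independent of broker load
        return next(self._counter) % self.num_partitions


if __name__ == "__main__":
    p = ToyPartitioner(num_partitions=3)
    print(p.choose(partition=2))
    print(p.choose(key=b"user-1"), p.choose(key=b"user-1"))
    print([p.choose() for _ in range(4)])
```

Whichever branch picks the partition, the record is then sent to the broker that currently leads that partition, so choosing a partition explicitly does not bypass the leader.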
How does the producer know the leader of the partition?
By periodically asking the broker for metadata.
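Here is a rough sketch of what "asking the broker for metadata" means for a producer that was configured with a single broker. Everything in it is a toy: `CLUSTER_LEADERS`, `fetch_metadata`, and `ToyProducer` are hypothetical names, there is no networking, and the refresh is triggered by hand rather than periodically or on error.

```python
from typing import Dict, Tuple

TopicPartition = Tuple[str, int]

# Hypothetical cluster state: which broker currently leads each partition.
CLUSTER_LEADERS: Dict[TopicPartition, str] = {
    ("events", 0): "broker-2",
    ("events", 1): "broker-3",
}


def fetch_metadata(bootstrap_broker: str) -> Dict[TopicPartition, str]:
    # In this toy, whichever broker is asked returns the leaders for the whole
    # cluster; the bootstrap broker is only the first point of contact.
    return dict(CLUSTER_LEADERS)


class ToyProducer:
    def __init__(self, bootstrap_broker: str) -> None:
        self.bootstrap_broker = bootstrap_broker
        self.leaders = fetch_metadata(bootstrap_broker)    # cached metadata

    def refresh(self) -> None:
        # A real client would do this periodically and after certain errors.
        self.leaders = fetch_metadata(self.bootstrap_broker)

    def send(self, topic: str, partition: int, payload: str) -> str:
        """Return the broker the record would be sent to (the cached leader)."""
        return self.leaders[(topic, partition)]


if __name__ == "__main__":
    producer = ToyProducer("broker-1")
    print(producer.send("events", 0, "hello"))     # goes to the leader, not broker-1
    CLUSTER_LEADERS[("events", 0)] = "broker-4"    # leadership moves
    producer.refresh()
    print(producer.send("events", 0, "hello"))
```

So a producer configured with broker-1 alone can still write to a partition led by broker-2; the configured list is only used to discover the rest of the cluster.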
Consumer reading a particular topic should read all partitions of that topic? Is that correct?
Consumers form consumer groups. If there are multiple consumer instances in a consumer group, each consumes a subset of the partitions. But the consumer group as a whole consumes from all partitions. That is, unless you decide to go "low-level" and manage that yourself, which you can do.