Can you explain what kafka-topics.sh --describe is showing me?
I am following a tutorial video and have also been reading the Apache documentation, but I need a little more clarity about what I'm looking at in the following columns of this graphic.
Leader: Is this pointing to the 3rd broker or is this pointing to the 3rd partition [2]?
Replicas: Is this pointing to brokers:partitions?
Isr: Is this pointing to brokers:partitions?
I would greatly appreciate it if someone could explain what columns A, B, C, and D are.
Topic name: "install_test2"
The topic has 4 partitions (partition 0, partition 1, partition 2, partition 3), and the replication factor for this topic is 2. This means the data in each partition is stored (replicated) on 2 brokers for redundancy. In Kafka every partition has a leader, and by default all requests from producers and consumers for that partition are sent to the leader.
The Leader column (column B in your image) shows the broker id of the leader for each partition. (Kafka tries to distribute partition leadership evenly across brokers for load balancing.)
The Replicas column (column C in your image) shows the ids of the brokers that replicate the data for each partition. The first id is the preferred leader: Kafka will try to make this broker the leader of the partition.
ISR (column D in your image) stands for in-sync replicas. When a message is sent to a topic partition, it is first received and stored by the leader. If the replication factor is greater than 1, the replica brokers send fetch requests to the leader, and the data is replicated to them. A follower (replica) broker is in sync if it is not too far behind the leader (see the quote below). If a partition leader fails, Kafka chooses a broker from the ISR as the new leader.
From Kafka docs:
Configuration parameter replica.lag.time.max.ms now refers not just to
the time passed since last fetch request from replica, but also to
time since the replica last caught up. Replicas that are still
fetching messages from leaders but did not catch up to the latest
messages in replica.lag.time.max.ms will be considered out of sync.
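To make the column meanings above concrete, here is a minimal Python sketch that parses text shaped like kafka-topics.sh --describe output. The sample text is hypothetical (the original screenshot is not available, so the broker ids are made up), and parse_describe and lagging_replicas are illustrative helper names, not part of any Kafka tooling.

```python
import re

# Hypothetical sample in the style of `kafka-topics.sh --describe` output.
# The broker ids are invented for illustration only.
SAMPLE = """\
Topic: install_test2  PartitionCount: 4  ReplicationFactor: 2
    Topic: install_test2  Partition: 0  Leader: 2  Replicas: 2,0  Isr: 2,0
    Topic: install_test2  Partition: 1  Leader: 0  Replicas: 0,1  Isr: 0,1
    Topic: install_test2  Partition: 2  Leader: 1  Replicas: 1,2  Isr: 1
    Topic: install_test2  Partition: 3  Leader: 2  Replicas: 2,1  Isr: 2,1
"""

# Matches the per-partition lines; the summary line has no "Partition:" field.
LINE = re.compile(
    r"Partition:\s*(\d+)\s+Leader:\s*(-?\d+)\s+Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]*)"
)


def parse_describe(text):
    """Return one dict per partition line found in `text`."""
    rows = []
    for line in text.splitlines():
        match = LINE.search(line)
        if match is None:
            continue  # skip the topic summary line
        partition, leader, replicas, isr = match.groups()
        rows.append({
            "partition": int(partition),
            "leader": int(leader),                              # broker id, not a partition
            "replicas": [int(b) for b in replicas.split(",")],  # broker ids
            "isr": [int(b) for b in isr.split(",") if b],       # broker ids
        })
    return rows


def lagging_replicas(row):
    """Broker ids that hold a replica but are currently not in the ISR."""
    return [b for b in row["replicas"] if b not in row["isr"]]


rows = parse_describe(SAMPLE)
for row in rows:
    print(
        f"partition {row['partition']}: leader=broker {row['leader']}, "
        f"preferred leader=broker {row['replicas'][0]}, "
        f"out of sync={lagging_replicas(row)}"
    )
```

In this made-up sample, every number in the Leader, Replicas, and Isr columns is a broker id; only the Partition column refers to partitions.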
Related
I am a beginner learning Kafka and was going through topics and producers. As per my understanding:
A topic is just a logical name for a group of partitions, and the partitions are spread across the nodes.
Is my understanding correct that for a given topic with, let's say, 5 partitions, all 5 partitions will be on 5 different brokers? And if there is another topic with 5 partitions, then its 5 partitions will also be on those 5 brokers? Effectively, in this configuration, each of the 5 brokers would hold two partitions, one from each topic. Am I right?
Another point, about producing and consuming: the producer has a list of brokers configured and posts a message to a topic via that list of brokers. The message is always written to the leader partition, i.e. one of the partitions on a broker, and is then replicated to the other replicas on other brokers. If the producer is configured with only one broker, is the message still posted to the leader partition, even when the configured broker is not the one holding the leader partition for that topic? Example: topic name "events" with 5 partitions on 5 brokers; broker-2 contains the leader partition, but the producer is configured with broker-1 alone.
I also read that the producer can specify the partition when posting a message. If so, does this not contradict the idea that the producer always posts to the leader partition? And if the producer posts to a specific partition and the broker containing that partition is down, the message will not be posted. Also, in distributed systems it is not a best practice to pin a specific partition. Am I missing something here?
Does the consumer also read from the leader partition, or does the consumer group assign different consumers to different partitions?
For example, I have a topic that has 2 partitions, and a producer using the DefaultPartitioner (round-robin, I assumed) writes to the topic. At some point, partition 1 becomes unavailable because all of its replica brokers go offline. Assuming the messages have no specified keys, will the producer resend the messages to partition 2, or will it simply get stuck?
That is an interesting question and we should look at it from a broader (cluster) perspective.
At some point, partition 1 becomes unavailable because all of the replica brokers go offline.
I see the following scenarios:
1. All replica brokers of partition one are different from the replica brokers of partition two.
2. All replica brokers of partition one are the same as for partition two.
3. Some replica brokers are the same as for partition two.
In scenario "1" you still have enough brokers alive, because the replication factor is a topic-wide setting, not a partition-based one. In that case, as soon as the first broker goes down, its data will be moved to another broker to ensure that your partition always has enough in-sync replicas.
In scenario "2" both partitions become unavailable and your KafkaProducer will eventually time out. What happens next depends on whether you have other brokers that are alive and can take on the data of the partitions.
In scenario "3" the dead replicas would be shifted to running brokers. During that time the KafkaProducer will write only to partition 2, as it is the only available partition in the topic. As soon as partition 1 has enough in-sync replicas again, the producer will resume producing to both partitions.
Actually, I could think of many more scenarios. If you need a more concrete answer you need to specify
how many brokers you have,
what your replication factor actually is and
in what order the brokers go down.
Assuming the messages have no specified keys, will the producer resend the messages to partition 2?
The KafkaProducer will not re-send to partition 2 the data that was previously sent to partition 1. Whatever was written to partition 1 will stay in partition 1.
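As a rough illustration of the keyless case, the sketch below assumes a round-robin partitioner that restricts itself to partitions that currently have a leader. make_keyless_partitioner and pick_partition are hypothetical names, and the real producer's behaviour depends on the client version and configured partitioner (newer clients use sticky partitioning for keyless records), so treat this as a model of the idea rather than the client's actual code.

```python
from itertools import count


def make_keyless_partitioner():
    """Build a round-robin chooser for records that have no key."""
    counter = count()

    def pick_partition(all_partitions, available_partitions):
        # Prefer partitions that currently have a leader; if none are
        # available, fall back to the full list (the send would then
        # block or time out in a real client).
        candidates = available_partitions or all_partitions
        return candidates[next(counter) % len(candidates)]

    return pick_partition


pick = make_keyless_partitioner()

# Both partitions of the topic are available: new records alternate.
both_up = [pick([1, 2], [1, 2]) for _ in range(4)]

# Partition 1 has lost all its replicas: new records go to partition 2 only.
# Records already written to partition 1 are not moved anywhere.
one_down = [pick([1, 2], [2]) for _ in range(4)]

print(both_up, one_down)
```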
I'm trying to understand how the number of partitions affects the partition leader (broker).
Let's say I have a Kafka cluster with 1 ZooKeeper node, 3 brokers, and 1 schema registry. My topic replication factor is 1. Now say I have two topics, A and B, with 5 partitions each.
Suppose I send a message to topic A with key key1 and, based on the partitioning strategy, it ends up in partition 5 of topic A, whose leader is broker 2.
In this scenario, if I send a message to topic B with the same key key1, can we assume that it would go to partition 5 on broker 2?
There is no correlation between leadership and partitioning.
You can guarantee that the same key will be hashed the same way and go to the same partition number (assuming matching partition counts), but you cannot guarantee which broker will be the leader.
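A small sketch of that point, under loud assumptions: zlib.crc32 is used here only as a stand-in hash (Kafka's Java client hashes key bytes with murmur2, which is not reproduced here), and the leader tables for topics A and B are invented. partition_for is a hypothetical helper name.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index with a deterministic hash.

    Stand-in only: CRC32 replaces the murmur2 hash of the real client.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical leader assignments: partition index -> leader broker id.
leaders = {
    "A": {0: 1, 1: 2, 2: 3, 3: 1, 4: 2},
    "B": {0: 3, 1: 1, 2: 2, 3: 3, 4: 1},
}

p_a = partition_for("key1", 5)  # partition chosen in topic A
p_b = partition_for("key1", 5)  # partition chosen in topic B

print(f"topic A: partition {p_a}, leader broker {leaders['A'][p_a]}")
print(f"topic B: partition {p_b}, leader broker {leaders['B'][p_b]}")
```

The partition index matches across the two topics because the key and the partition count match, while the leader broker comes from a separate, per-topic table.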
A general question. Assume a topic has 3 Kafka partitions on different servers (brokers), and each partition has 10 messages with offsets 0, 1, ..., 9 standing in for timestamps (a greater number means the message arrived more recently and has spent less time in the partition). Let's say one partition happens to go offline because its server is down. What is Kafka's strategy for rebalancing the 10 messages in the offline partition into the other partitions?
Visually, we have
broker 1 partition: |1-0|1-1|1-2|1-3|1-4|1-5|1-6|1-7|1-8|1-9|
broker 2 partition: |2-0|2-1|2-2|2-3|2-4|2-5|2-6|2-7|2-8|2-9|
broker 3 partition: |3-0|3-1|3-2|3-3|3-4|3-5|3-6|3-7|3-8|3-9|
Now if broker 3 is down, how will 3-0 to 3-9 be inserted into broker 1 and broker 2?
(My assumption is that by default they will be split roughly half and half at random and appended to the tails of broker 1 and broker 2 based on their timestamps in broker 3, and that maybe this behavior can be configured in code?)
Thanks in advance.
If a partition only exists on a single broker (replication factor 1) then when this broker is offline, the partition is not available. This is what you drew in your question.
To keep data available even when brokers go down you have to create topics with a replication factor greater than 1.
Then the data of the partition will be replicated onto several brokers, and if one of them goes offline, user traffic will be redirected to the available replicas.
I suggest you go through the Replication section in the docs to understand how this works.
The diagram below will help you understand how Kafka replicates partitions. If one broker is down, the consumer can read from the other broker because Kafka replicates the data. (Of course, you need to configure replication as shown below.)
For example, if broker 1 dies, broker 2 will become the leader of topic1-part1, and a consumer can read from it.
ZooKeeper will detect that a broker (and the partitions it leads) is down, and another leader will be appointed.
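The failover described in these answers can be modelled in a few lines of Python. This is a simplified sketch, not the controller's real algorithm: the partition layout is hypothetical (the original diagram is not available), fail_broker is an invented helper, and real leader election has further rules that are ignored here.

```python
def fail_broker(partitions, dead_broker):
    """Return a new partition table after `dead_broker` goes offline."""
    updated = {}
    for name, state in partitions.items():
        # The dead broker drops out of the in-sync replica set.
        isr = [b for b in state["isr"] if b != dead_broker]
        leader = state["leader"]
        if leader == dead_broker:
            # Promote the first surviving in-sync replica, in replica-list
            # order; None means the partition has no leader and is offline.
            leader = next((b for b in state["replicas"] if b in isr), None)
        updated[name] = {
            "replicas": state["replicas"],
            "isr": isr,
            "leader": leader,
        }
    return updated


# Hypothetical layout: two replicated partitions and one with replication factor 1.
partitions = {
    "topic1-part1": {"replicas": [1, 2], "isr": [1, 2], "leader": 1},
    "topic1-part2": {"replicas": [2, 3], "isr": [2, 3], "leader": 2},
    "solo-part0": {"replicas": [1], "isr": [1], "leader": 1},
}

after = fail_broker(partitions, dead_broker=1)
for name, state in after.items():
    print(name, "leader:", state["leader"], "isr:", state["isr"])
```

With broker 1 gone, the replicated partition gets a new leader from its ISR, while the partition with replication factor 1 is left without a leader. Its messages are not redistributed to other partitions.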
I am trying to understand the relationship between replication factor and consumer groups. Example: number of partitions = 2, replication factor = 3, number of consumers in the consumer group = 4. In this case:
How many consumers will receive the message?
How will replication affect the number of consumers that receive it?
For your first question, since you have two partitions in your example, only 2 of the 4 consumers will actually get data. The other two consumers will not have any partitions assigned to them, because there aren't any partitions left for that consumer group. If you had a different consumer group, then those consumers would still be assigned partitions.
Additionally, in this case, you mention there's only a single message coming through. Depending on which partition it's assigned to, the message will only be sent to that partition. So in this case, only one of the four consumers will get the message, the one that had that partition assigned to it.
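The assignment logic above can be sketched as follows. This is a simplified round-robin model with hypothetical consumer names; Kafka's actual assignors differ in detail, but each partition is still owned by exactly one consumer within a group. assign and receivers are illustrative helpers, not Kafka APIs.

```python
def assign(partitions, consumers):
    """Give each partition to exactly one consumer of the group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def receivers(assignment, partition):
    """Consumers in the group that would see a message written to `partition`."""
    return [c for c, owned in assignment.items() if partition in owned]


# 2 partitions, 4 consumers in one group (replication factor plays no role here).
group = assign([0, 1], ["c1", "c2", "c3", "c4"])
idle = [c for c, owned in group.items() if not owned]

print(group)
print("idle consumers:", idle)
print("message in partition 1 goes to:", receivers(group, 1))
```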
As for your second question, the replication factor configuration in Kafka doesn't affect the number of messages consumers receive. Replication, as far as consumers and producers are concerned, is an internal Kafka cluster detail that they don't need to worry about. As long as they're producing to and consuming from the leader of the partition, that's all they need to know. One topic could have replication factor 2 and another replication factor 10, and they would both behave identically to producers and consumers.
There are a few more details in the official Kafka documentation: https://kafka.apache.org/documentation/#theconsumer
To give some additional details on the replication factor: it has no relation whatsoever to the number of consumers receiving messages from the topic. Replication serves only one major purpose, and that is high availability. So, let's say you have 3 brokers in a cluster, and for a topic my-topic you've set the replication factor to 2. Now, if at most one broker goes down at some point, you'd still be okay, as the messages for the topic are replicated on another broker.
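A quick way to check the "at most one broker" claim is to enumerate failures over a replica layout. The layout for my-topic below is hypothetical, and survives is an invented helper that only asks whether each partition keeps at least one live replica; it ignores ISR membership and settings such as min.insync.replicas.

```python
def survives(replica_map, dead_brokers):
    """True if every partition still has at least one replica on a live broker."""
    return all(
        any(broker not in dead_brokers for broker in replicas)
        for replicas in replica_map.values()
    )


# Hypothetical layout: 3 brokers, 3 partitions, replication factor 2.
my_topic = {0: [1, 2], 1: [2, 3], 2: [3, 1]}

# Any single broker failure leaves every partition with a live replica.
single_failures = {broker: survives(my_topic, {broker}) for broker in (1, 2, 3)}

# Losing brokers 1 and 2 together takes partition 0 offline.
double_failure = survives(my_topic, {1, 2})

print(single_failures, double_failure)
```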