Kafka producer message flow - apache-kafka

I have a topic called Topic1 with two partitions. Assume that serverA is leader for Topic1, partition1 and serverB is a follower.
What will happen if my client publishes to serverB (i.e. in the broker.list I specify only serverB)? How does the message propagate? Is it sent to serverB and then to serverA?

I've found this document to be very helpful in explaining what's going on internally with Kafka.
There's an API used by the producer to ask any one Kafka server for the list of partitions and all metadata for those partitions. That metadata includes the leader broker for each partition. The producer calls the partitioner to get the target partition and then talks directly with that partition's Kafka broker leader to write the message onto the partition. The leader will handle communications with any other brokers managing replicas of the partition.

To publish a message to a partition, the client first finds the leader of the partition (via a metadata request to a broker; it is the brokers, not the client, that track cluster state in ZooKeeper) and sends the message to that leader.
The leader writes the message to its local log. Each follower constantly pulls new messages from the leader over a single socket channel, writes each received message to its own log, and sends an acknowledgment back to the leader. Once the leader receives acknowledgments from all replicas in the ISR, the message is committed.
So to answer your question: if the client bootstraps against serverB, it learns from the metadata response that serverA is the leader for partition1 of Topic1, and therefore sends the message to serverA. (Here I assumed that the partitioner routes the message to partition1.)
All of this is handled by the Kafka producer; end-user applications need not worry about these details.
You can read more about this here
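The whole flow above can be modeled in a few lines. This is a toy sketch, not real Kafka: the broker names follow the question, crc32 stands in for Kafka's actual murmur2 partitioner, and replication to the ISR is only mentioned in a comment.

```python
import zlib

# Cluster view returned by a metadata request: topic -> partition -> leader.
# Assignment here is invented for the example.
CLUSTER_METADATA = {"Topic1": {0: "serverB", 1: "serverA"}}

def fetch_metadata(bootstrap_broker, topic):
    # Any broker from broker.list can answer the metadata request.
    return CLUSTER_METADATA[topic]

def partition_for(key, num_partitions):
    # Stand-in partitioner; real Kafka hashes the key bytes with murmur2.
    return zlib.crc32(key.encode()) % num_partitions

def publish(bootstrap_broker, topic, key):
    metadata = fetch_metadata(bootstrap_broker, topic)
    partition = partition_for(key, len(metadata))
    leader = metadata[partition]
    # The producer sends directly to the leader; the leader then
    # replicates to the followers in the ISR (not modeled here).
    return partition, leader

partition, leader = publish("serverB", "Topic1", key="some-key")
print(partition, leader)
```

Note that even though the producer only bootstrapped against serverB, the write still goes to whichever broker leads the chosen partition.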

Related

How client will automatically detect a new leader when the primary one goes down in Kafka?

Consider the below scenario:
I have a Kafka broker cluster (localhost:9002, localhost:9003, localhost:9004, localhost:9005).
Let's say localhost:9002 is my primary(leader) for the cluster.
Now my producer is producing data and sending it to the broker(localhost:9002).
If my primary broker (localhost:9002) goes down, a new leader will be elected with the help of ZooKeeper or some other consensus mechanism (say localhost:9003 is now the new leader).
So, in the above scenario, can someone please explain how the Kafka client (producer) gets notified about the new broker configuration (localhost:9003), and how it connects to the new leader and starts producing data again?
Kafka clients automatically receive the necessary metadata from the cluster when reading from or writing to a topic, including in the case of a leadership change.
In general, the client sends a (read/write) request to one of the bootstrap servers listed in the configuration bootstrap.servers. This initial request (hence "bootstrap") returns details on which broker the topic-partition leader is located, so that the client can communicate directly with that broker. Each individual broker holds the metadata for the entire cluster, i.e. it also knows the partition leaders hosted on other brokers.
Now, if one of your brokers goes down and the leadership of a topic partition switches, your producer will be notified about it through that mechanism.
There is a KafkaProducer configuration called metadata.max.age.ms which you can tune to refresh metadata on your producer even if there is no leadership change happening:
"The period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes to proactively discover any new brokers or partitions."
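The refresh-interval idea can be sketched with a tiny cache model. This is pure Python with no Kafka client; the 300,000 ms value is borrowed from the metadata.max.age.ms default, everything else is a simplification:

```python
# Toy model of the producer's metadata cache: entries older than
# max_age_ms are refetched even without any leadership change.
class MetadataCache:
    def __init__(self, max_age_ms, fetch):
        self.max_age_ms = max_age_ms
        self.fetch = fetch          # callable: topic -> metadata
        self.entries = {}           # topic -> (metadata, fetched_at_ms)

    def get(self, topic, now_ms):
        entry = self.entries.get(topic)
        if entry is None or now_ms - entry[1] > self.max_age_ms:
            self.entries[topic] = (self.fetch(topic), now_ms)
        return self.entries[topic][0]

fetches = []
def fake_fetch(topic):
    fetches.append(topic)           # count how often we hit the cluster
    return {"leader": "broker-1"}

cache = MetadataCache(max_age_ms=300_000, fetch=fake_fetch)
cache.get("Topic1", now_ms=0)        # first access -> metadata fetch
cache.get("Topic1", now_ms=100)      # still fresh -> served from cache
cache.get("Topic1", now_ms=400_000)  # older than max age -> refetch
print(len(fetches))  # 2
```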
Just a few notes on your question:
The term "Kafka broker cluster" does not really exist. You have a Kafka cluster containing one or more Kafka brokers.
You do not have a broker that is the "primary (leader) for the cluster"; instead, each topic partition has its own leader. Perhaps you mean the Controller, which runs on one of the brokers within your cluster.

Does a kafka broker always check if it's the leader while responding to read/write requests

I am seeing org.apache.kafka.common.errors.NotLeaderForPartitionException on my producer, which I understand happens when the producer tries to produce messages to a broker that is not the leader for the partition.
Does that mean each time a leader fulfills a write request, it first checks whether it is the leader or not?
If yes, does that translate to a ZooKeeper request for every write request, to know whether the node is the leader?
How the Producer Gets Metadata About Brokers
The producer sends a metadata request with a list of topics to one of the brokers you supplied when configuring the producer.
The response from the broker contains a list of partitions in those topics and the leader for each partition. The producer caches this information and therefore knows where to direct the messages. The broker answers from its own cached view of the cluster state, so there is no ZooKeeper round trip for every write.
When the Producer Will Refresh Metadata
I think this depends on which Kafka client you use. There are some small differences between the Ruby, Java, and other Kafka clients. For example, in Java:
the producer fetches metadata when the client is initialized, and then periodically updates it based on the expiration time;
the producer also forces a metadata update when a request error occurs, such as an InvalidMetadataException.
In the ruby-kafka client, however, metadata is usually refreshed on initialization or when an error occurs.
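The force-update-on-error behavior can be sketched like this. It is a pure Python simulation; the error class merely mimics NotLeaderForPartitionException, and the broker names are made up:

```python
# Sketch of refresh-on-error: on a NotLeaderForPartition-style error the
# producer refetches metadata and retries against the new leader.
class NotLeaderError(Exception):
    pass

cluster = {"leader": "broker-1"}        # mutable "real" cluster state

def broker_write(broker, record):
    if broker != cluster["leader"]:
        raise NotLeaderError(broker)    # broker knows it is not the leader
    return "ok"

def send(cached_leader, record, fetch_metadata):
    try:
        return broker_write(cached_leader, record), cached_leader
    except NotLeaderError:
        new_leader = fetch_metadata()   # forced metadata refresh
        return broker_write(new_leader, record), new_leader

# Leadership moves while the producer still caches the old leader.
cluster["leader"] = "broker-2"
result, leader = send("broker-1", b"payload", lambda: cluster["leader"])
print(result, leader)  # ok broker-2
```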

Does zookeeper return the Kafka broker's dns? (the one which has the leader partition)

I'm new to Kafka and I want to ask a question.
If there are 3 Kafka brokers (kafka1, kafka2, kafka3) in the same Kafka cluster,
and topic=test (replication=2),
kafka1 has the leader partition and kafka2 has the follower partition.
If the producer sends data to kafka3, then how does the data get stored on kafka1 and kafka2?
I heard that if the producer sends data to kafka3, then ZooKeeper finds the broker that has the leader partition and returns that broker's DNS name or IP address.
And then the producer resends to that broker using the metadata.
Is that right? If it's wrong, please tell me how it works.
Thanks a lot!
Every kafka topic partition has its own leader. So if you have 2 partitions, kafka assigns a leader for each partition. They might end up on the same kafka node, or they might be on different ones.
When a producer connects to the kafka cluster, it learns who the partition leaders are. All writes must go through the corresponding partition leader, which is responsible for keeping track of the in-sync replicas.
Consumers likewise only talk to the corresponding partition leaders to get data.
If a partition leader goes down, one of the replicas becomes the leader, and all producers and consumers are notified about this change.
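The failover described above can be modeled in a few lines. Pure Python sketch; the broker names follow the question, and promoting the first remaining in-sync replica is a simplification of the real election:

```python
# Toy failover: when the leader of a partition dies, a surviving in-sync
# replica is promoted, and the next metadata fetch reflects the change.
partition = {"leader": "kafka1", "isr": ["kafka1", "kafka2"]}

def broker_down(broker):
    if partition["leader"] == broker:
        partition["isr"].remove(broker)
        partition["leader"] = partition["isr"][0]  # promote a replica

def metadata():
    # What a metadata request would report for this partition.
    return partition["leader"]

assert metadata() == "kafka1"
broker_down("kafka1")
print(metadata())  # kafka2
```

Clients do not need to be reconfigured: their next metadata fetch (periodic or triggered by a leadership error) simply returns the new leader.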

how producers find the kafka leader

The producers send messages after setting up a list of Kafka brokers as follows.
props.put("bootstrap.servers", "127.0.0.1:9092,127.0.0.1:9093,127.0.0.1:9094");
I wonder how "producers" know which of the three brokers holds the partition leader.
For a typical distributed system you would have a load-balancing server or a virtual IP, but how is the load distributed for Kafka?
Does the producer program try to connect to one broker at random and ask it which broker has the partition leader?
A Kafka cluster contains multiple broker instances. For each partition, at any given time exactly one broker is the leader, while the remaining replicas are the in-sync replicas (ISR) holding the replicated data. When the leader broker goes down unexpectedly, one of the ISR becomes the leader.
Kafka elects one broker's replica of a partition as the leader using ZooKeeper. When a producer publishes a message to a partition in a topic, the message is forwarded to that partition's leader.
According to Kafka documentation:
The partitions of the log are distributed over the servers in the
Kafka cluster with each server handling data and requests for a share
of the partitions. Each partition is replicated across a configurable
number of servers for fault tolerance.
Each partition has one server which acts as the "leader" and zero or
more servers which act as "followers". The leader handles all read and
write requests for the partition while the followers passively
replicate the leader. If the leader fails, one of the followers will
automatically become the new leader. Each server acts as a leader for
some of its partitions and a follower for others so load is well
balanced within the cluster.
You can find topic and partition leader using this piece of code.
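The load-balancing idea in the quoted documentation can be illustrated with a small assignment sketch. This is pure Python; the broker names and the round-robin layout are illustrative assumptions, not Kafka's actual assignment algorithm:

```python
# Spread partition leaders round-robin over brokers so each broker leads
# some partitions and follows others, balancing load across the cluster.
def assign(partitions, brokers, replication):
    layout = {}
    for p in range(partitions):
        replicas = [brokers[(p + i) % len(brokers)] for i in range(replication)]
        layout[p] = {"leader": replicas[0], "followers": replicas[1:]}
    return layout

layout = assign(partitions=6, brokers=["b1", "b2", "b3"], replication=2)
leaders = [layout[p]["leader"] for p in range(6)]
print(leaders)  # ['b1', 'b2', 'b3', 'b1', 'b2', 'b3']
```

With 6 partitions over 3 brokers, each broker ends up leading exactly 2 partitions, which is the "load is well balanced" property the docs describe.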
It's quite an old question, but I had the same question and, after researching it, I want to share the answer because I hope it can help others.
To determine the leader of a partition, the producer uses a request type called a metadata request, which includes a list of topics the producer is interested in.
The broker's response specifies which partitions exist in those topics, the replicas for each partition, and which replica is the leader.
Metadata requests can be sent to any broker because all brokers have a metadata cache that contains this information.
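A minimal model of that "any broker can answer" property (pure Python, no Kafka involved; broker and topic names are made up):

```python
# Every broker holds a metadata cache covering the whole cluster, so a
# metadata request sent to any broker returns the same answer.
CLUSTER_STATE = {"test": {0: {"leader": "kafka1", "replicas": ["kafka1", "kafka2"]}}}

class Broker:
    def __init__(self, name, shared_state):
        self.name = name
        self.metadata_cache = shared_state   # kept in sync cluster-wide

    def handle_metadata_request(self, topics):
        return {t: self.metadata_cache[t] for t in topics}

brokers = [Broker(n, CLUSTER_STATE) for n in ("kafka1", "kafka2", "kafka3")]
responses = [b.handle_metadata_request(["test"]) for b in brokers]
print(responses[0] == responses[1] == responses[2])  # True
```

This is why bootstrap.servers does not need to list every broker: any reachable broker can tell the client where the leaders are.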

I wonder if the way Kafka works is what I understand

I wonder if the structure as I understand it is correct.
On Kafka's side there are producers, consumers, and the zookeeper ensemble.
Communication is organized by topic; a partition is the unit of parallelism, and only the replica leader can handle reads and writes.
The producers ask a random Kafka broker "which partitions does the topic have, and who is the leader?".
That broker asks ZooKeeper, and responds.
The producer that gets the answer sends a message to the Kafka broker.
The broker then receives it, saves it, and informs ZooKeeper (the offset is increased).
Here, the ZooKeeper node forwards the information it receives to the ZooKeeper leader, who informs the other ZooKeeper nodes.
If the ZooKeeper leader dies, a new leader is chosen via the ZooKeeper atomic broadcast (Zab) protocol.
In addition, if a Kafka partition leader dies, the Kafka Controller detects it and assigns a new partition leader.
When the entire process is completed, the message is delivered to consumers.
Is this correct?
I have a question.
If only one producer is very busy, does only the partition receiving that producer's data become very busy?
Does it have to be managed manually?
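On that last question: with key-based partitioning, records sharing the same key always land on the same partition, which is exactly how one partition can become a hotspot. A toy illustration (crc32 stands in for Kafka's murmur2 hash; the key name is invented):

```python
import zlib
from collections import Counter

# One "hot" key always hashes to the same partition, so that partition
# absorbs all of the key's traffic.
def partition_for(key, num_partitions):
    return zlib.crc32(key.encode()) % num_partitions

load = Counter(partition_for("hot-customer", 4) for _ in range(1000))
print(len(load))  # 1 -> all 1000 records landed on a single partition
```

In practice this is not managed manually: load is spread by choosing keys with enough cardinality, or by sending keyless records, which the producer distributes across partitions itself.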