Is my understanding of how Kafka works correct?

I wonder whether the architecture as I understand it is correct.
Around Kafka there are producers, consumers, and the ZooKeeper ensemble.
Communication happens through topics, a partition is the unit of parallelism, and only the replica leader handles reads and writes.
A producer asks an arbitrary Kafka broker which partitions exist and who their leaders are.
That broker looks the information up in ZooKeeper and responds.
The producer that gets the answer sends its message to the Kafka broker.
The broker then receives the message, stores it, and informs a ZooKeeper node (the offset increases).
That ZooKeeper node forwards the information to the ZooKeeper leader, which informs the other ZooKeeper nodes.
If the ZooKeeper leader dies, a new leader is chosen via ZooKeeper Atomic Broadcast (ZAB).
In addition, if a Kafka partition leader dies, the Kafka Controller notices and manages the election of a new partition leader.
When the entire process is complete, the message is delivered to consumers.
Is this correct?
I have a question.
If a single producer is very busy, does only the partition that receives its data become very busy?
Does that have to be managed manually?
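Not part of the original question, but a minimal sketch of the hot-partition concern, assuming a broker at localhost:9092 and a topic named test with several partitions (all placeholders): records that share one key land on one partition, while records without a key are spread out by the default partitioner.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class PartitionLoadSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records sharing a key are hashed to the same partition,
            // so one very active key can make a single partition "hot".
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("test", "busy-key", "value-" + i));
            }
            // Records with a null key are spread across partitions by the
            // default partitioner, so no manual balancing is needed there.
            for (int i = 0; i < 1000; i++) {
                producer.send(new ProducerRecord<>("test", null, "value-" + i));
            }
        }
    }
}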

Related

How will the client automatically detect the new leader when the primary one goes down in Kafka?

Consider the below scenario:
I have a Kafka broker cluster (localhost:9002, localhost:9003, localhost:9004, localhost:9005).
Let's say localhost:9002 is my primary (leader) for the cluster.
Now my producer is producing data and sending it to the broker (localhost:9002).
If my primary broker (localhost:9002) goes down, a new leader will be elected with the help of ZooKeeper or some other consensus algorithm (say localhost:9003 is now the new leader).
So, in the above scenario, can someone please explain how the Kafka client (producer) will get notified about the new broker configuration (localhost:9003) and how it will connect to the new leader and start producing data again?
Kafka clients automatically receive the necessary metadata from the cluster on each request when reading from or writing to a topic, including in case of a leadership change.
In general, the client sends an initial (read/write) request to one of the bootstrap servers listed in the configuration bootstrap.servers. This initial request (hence "bootstrap") returns the details of which broker hosts the topic partition leader, so that the client can communicate directly with that broker. Each individual broker holds the metadata for the entire cluster, which includes knowledge of the partition leaders hosted on other brokers.
Now, if one of your brokers goes down and the leadership of a topic partition switches, your producer will be notified about it through that mechanism.
There is a KafkaProducer configuration called metadata.max.age.ms which you can modify to refresh metadata on your producer even if there is no leadership change happening:
"The period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes to proactively discover any new brokers or partitions."
Just a few notes on your question:
The term "Kafka broker cluster" does not really exists. You have a Kafka cluster containing one or multiple Kafka brokers.
You do not have a broker as a "primary(leader) for the cluster" but you have for each TopicPartition a leader. Maybe you mean the Controller which is located on one of the brokers within your cluster.

Does ZooKeeper return the DNS name of the Kafka broker that has the leader partition?

I'm new to Kafka and I want to ask a question.
Suppose there are 3 Kafka brokers (kafka1, kafka2, kafka3) in the same Kafka cluster,
with topic=test (replication=2),
where kafka1 has the leader partition and kafka2 has the follower partition.
If the producer sends data to kafka3, how does the data get stored on kafka1 and kafka2?
I heard that if the producer sends data to kafka3, then ZooKeeper finds the broker that has the leader partition and returns that broker's DNS name or IP address.
And then the producer resends the data to that broker using the metadata.
Is that right? If it's wrong, please tell me how it works.
Thanks a lot!
Every Kafka topic partition has its own leader. So if you have 2 partitions, Kafka assigns a leader for each partition. The leaders might end up on the same Kafka node or on different ones.
When the producer connects to the Kafka cluster, it learns about the partition leaders. All writes must go through the corresponding partition leader, which is responsible for keeping track of the in-sync replicas.
Consumers likewise only talk to the corresponding partition leaders to fetch data.
If a partition leader goes down, one of the replicas becomes the leader, and all producers and consumers are notified about this change.
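Not part of the original answer, but a small sketch of how to see that per-partition leadership yourself, assuming a topic named test and a broker reachable at kafka1:9092 (both placeholders), using the Kafka AdminClient:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

import java.util.Collections;
import java.util.Properties;

public class ShowPartitionLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Any broker in the cluster can answer, not just the one holding the leader.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka1:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("test"))
                    .all().get().get("test");
            // Each partition reports its own leader broker, replica list, and ISR.
            desc.partitions().forEach(p ->
                    System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}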

How do producers find the Kafka leader?

The producer sends messages after setting up a list of Kafka brokers as follows.
props.put("bootstrap.servers", "127.0.0.1:9092,127.0.0.1:9093,127.0.0.1:9094");
I wonder how the producer knows which of the three brokers holds the partition leader.
For a typical distributed system you would have either a load-balancing server or a virtual IP, but how is the load distributed in Kafka?
Does the producer program try to connect to one broker at random and look for the broker with the partition leader?
A Kafka cluster contains multiple broker instances. For each partition, at any given time exactly one broker is the leader, while the remaining replicas are the in-sync replicas (ISR) which contain the replicated data. When the leader broker is taken down unexpectedly, one of the ISR becomes the leader.
Kafka chooses one of a partition's replicas as leader using ZooKeeper. When a producer publishes a message to a partition in a topic, it is forwarded to that partition's leader.
According to Kafka documentation:
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
You can find topic and partition leader using this piece of code.
EDIT:
The producer sends a meta request with a list of topics to one of the brokers you supplied when configuring the producer.
The response from the broker contains a list of partitions in those topics and the leader for each partition. The producer caches this information and therefore knows where to direct the messages.
It's quite an old question, but I had the same question, and after researching it I want to share the answer because I hope it can help others.
To determine the leader of a partition, the producer uses a request type called a metadata request, which includes a list of topics the producer is interested in.
The broker's response specifies which partitions exist in those topics, the replicas for each partition, and which replica is the leader.
Metadata requests can be sent to any broker because all brokers have a metadata cache that contains this information.
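For illustration only (the topic name and broker address are assumptions), the same metadata is exposed to application code through KafkaProducer.partitionsFor(), which issues such a metadata request under the hood:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class MetadataRequestSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "127.0.0.1:9092"); // any broker can answer the metadata request
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // partitionsFor() triggers the same metadata request the producer
            // uses internally; each PartitionInfo includes the partition leader.
            for (PartitionInfo p : producer.partitionsFor("test")) {
                System.out.printf("partition %d: leader=%s, replicas=%d%n",
                        p.partition(), p.leader(), p.replicas().length);
            }
        }
    }
}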

Apache Kafka Topic Partition Message Handling

I'm a bit confused about topic partitioning in Apache Kafka, so I'm charting out a simple use case and I would like to know what happens in different scenarios. Here it is:
I have a Topic T that has 4 partitions TP1, TP2, TP3, and TP4.
Assume that I have 8 messages, M1 to M8. Now when my producer sends these messages to the topic T, how will they be received by the Kafka broker under the following scenarios?
Scenario 1: There is only one Kafka broker instance that has Topic T with the aforementioned partitions.
Scenario 2: There are two Kafka broker instances, with each node having the same Topic T with the aforementioned partitions.
Now assuming that Kafka broker instance 1 goes down, how will the consumers react? I'm assuming that my consumer was reading from broker instance 1.
I'll answer your questions by walking you through partition replication, because you need to learn about replication to understand the answer.
A single broker is considered the "leader" for a given partition. All produces and consumes occur with the leader. Replicas of the partition are replicated to a configurable number of other brokers. The leader handles replicating a produce to the other replicas. Other replicas that are caught up to the leader are called "in-sync replicas." You can configure what "caught up" means.
A message is only made available to consumers when it has been committed to all in-sync replicas.
If the leader for a given partition fails, the Kafka controller will elect a new leader from the list of in-sync replicas and consumers will begin consuming from this new leader. Consumers will have a few milliseconds of added latency while the new leader is elected. A new controller will also be elected automatically if the controller fails (this adds more latency, too).
If the topic is configured with no replicas (replication factor 1), then when the leader of a given partition fails, consumers can't consume from that partition until the broker that was the leader is brought back online. Or, if it is never brought back online, the data previously produced to that partition will be lost forever.
To answer your question directly:
Scenario 1: if replication is configured for the topic, and there exists an in-sync replica for each partition, a new leader will be elected, and consumers will only experience a few milliseconds of latency because of the failure.
Scenario 2: now that you understand replication, I believe you'll see that this scenario is Scenario 1 with a replication factor of 2.
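Not part of the original answer, but a minimal sketch of how Scenario 2 might be set up, assuming a broker reachable at localhost:9092 (a placeholder): Topic T is created with 4 partitions and a replication factor of 2, so each partition has a leader plus one replica that can take over on failure.

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder address
        try (AdminClient admin = AdminClient.create(props)) {
            // Topic T: 4 partitions, each replicated on 2 brokers, so losing one
            // broker still leaves an in-sync replica to become the new leader.
            NewTopic topic = new NewTopic("T", 4, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}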
You may also be interested to learn about acks in the producer.
In the producer, you can configure acks such that the produce is acknowledged when:
the message is put on the producer's socket buffer (acks=0)
the message is written to the log of the lead broker (acks=1)
the message is written to the log of the lead broker, and replicated to all other in-sync replicas (acks=all)
Further, you can configure the minimum number of in-sync replicas required to commit a produce. Then, in the event when not enough in-sync replicas exist given this configuration, the produce will fail. You can build your producer to handle this failure in different ways: buffer, retry, do nothing, block, etc.
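As a hedged example (the broker address and topic name are placeholders), a producer configured with acks=all plus a send callback that reacts to failures might look like this:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class AcksSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Wait until the leader and all in-sync replicas have the record.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.RETRIES_CONFIG, "5");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("T", "key", "value"), (metadata, exception) -> {
                if (exception != null) {
                    // For example, NotEnoughReplicasException when fewer than
                    // min.insync.replicas brokers are currently in sync.
                    exception.printStackTrace();
                } else {
                    System.out.printf("committed to partition %d at offset %d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        }
    }
}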

Kafka producer message flow

I have a topic called Topic1 with two partitions. Assume that serverA is the leader for Topic1, partition1 and serverB is a follower.
What will happen if in my client I publish to serverB (in the broker.list I specify only serverB)? How does the message propagate? Is it sent to serverB and then to serverA?
I've found this document to be very helpful in explaining what's going on internally with Kafka.
There's an API used by the producer to ask any one Kafka server for the list of partitions and all metadata for those partitions. That metadata includes the leader broker for each partition. The producer calls the partitioner to get the target partition and then talks directly with that partition's Kafka broker leader to write the message onto the partition. The leader will handle communications with any other brokers managing replicas of the partition.
To publish a message to a partition, the client first finds the leader of the partition (via a metadata request to one of the brokers, which track the cluster state through ZooKeeper) and sends the message to that leader.
The leader writes the message to its local log. Each follower constantly pulls new messages from the leader using a single socket channel. The follower writes each received message to its own log and sends an acknowledgment back to the leader. Once the leader receives the acknowledgment from all replicas in the ISR, the message is committed.
So to answer your question: if the client publishes to serverB, it learns from the metadata that serverA is the leader for Topic1 partition1, and therefore sends the message to serverA. (Here I assumed that the partitioner sends the message to partition1.)
All of this is handled by the Kafka producer; the end-user application need not worry about these details.
You can read more about this here
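To illustrate that last point, here is a minimal sketch (serverA, serverB, and Topic1 come from the question; the port and the rest are assumptions): the application lists only serverB as a bootstrap server and calls send(); the client library looks up the partition leader (serverA) from the metadata serverB returns and routes the write there.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class SimpleSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Only serverB is listed; the client discovers serverA (the partition
        // leader) from the metadata that serverB returns.
        props.put("bootstrap.servers", "serverB:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The partitioner picks the partition; the producer then writes
            // directly to that partition's leader (serverA in this example).
            producer.send(new ProducerRecord<>("Topic1", "key", "hello"));
            producer.flush();
        }
    }
}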