Why does the Kafka producer take a broker endpoint when being initialized instead of the ZooKeeper endpoint?

If I have multiple brokers, which broker should my producer use? Do I need to manually switch the broker to balance the load? Also why does the consumer only need a zookeeper endpoint instead of a broker endpoint?
quick example from tutorial:
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

which broker should my producer use? Do I need to manually switch the broker to balance the load?
Kafka runs as a cluster, meaning a set of nodes, so when producing anything you need to give the producer the LIST of brokers that you've configured for your application; below is a small note taken from their documentation.
“metadata.broker.list” defines where the Producer can find a one or more Brokers to determine the Leader for each topic. This does not need to be the full set of Brokers in your cluster but should include at least two in case the first Broker is not available. No need to worry about figuring out which Broker is the leader for the topic (and partition), the Producer knows how to connect to the Broker and ask for the meta data then connect to the correct Broker.
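For illustration, a minimal sketch of a producer configured this way with the current Java client (broker addresses and topic name are placeholders; the modern equivalent of metadata.broker.list is bootstrap.servers):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerBootstrapSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // A subset of the cluster is enough; the producer fetches metadata from
        // these brokers and then connects to whichever broker leads each partition.
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test", "key", "hello"));
        }
    }
}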
Hope this clears up some of your confusion.
Also why does the consumer only need a zookeeper endpoint instead of a broker endpoint?
This is not technically correct, as there are two types of consumer APIs available: the High Level and the Low Level (Simple) consumer.
The high level consumer basically takes care of most things for you, like leader detection, threading issues, etc., but it does not provide much control over messages, which is exactly the purpose of the other alternative, the Simple or Low Level consumer, where you will see that you need to provide the brokers and partition-related details yourself.
So the consumer needs a ZooKeeper endpoint only when you go with the high level API; when using the Simple consumer you do need to provide that other information.

Kafka sets a single broker as the leader for each partition of each topic. The leader is responsible for handling both reads and writes to that partition. You cannot decide to read or write from a non-Leader broker.
So, what does it mean to provide a broker or list of brokers to the kafka-console-producer ? Well, the broker or brokers you provide on the command-line are just the first contact point for your producer. If the broker you list is not the leader for the topic/partition you need, your producer will get the current leader info (called "topic metadata" in kafka-speak) and reconnect to other brokers as necessary before sending writes. In fact, if your topic has multiple partitions it may even connect to several brokers in parallel (if the partition leaders are different brokers).
Second question: why does the consumer require a zookeeper list for connections instead of a broker list? The answer to that is that kafka consumers can operate in "groups" and zookeeper is used to coordinate those groups (how groups work is a larger issue, beyond the scope of this question). Zookeeper also stores broker lists for topics, so the consumer can pull broker lists directly from zookeeper, making an additional --broker-list a bit redundant.

Kafka Producer API does not interact directly with Zookeeper. However, the High Level Consumer API connects to Zookeeper to fetch/update the partition offset information for each consumer. So, the consumer API would fail if it cannot connect to Zookeeper.

All the above answers are correct for older versions of Kafka, but things have changed with the arrival of Kafka 0.9.
Now there is no longer any direct interaction with ZooKeeper from either the producer or the consumer. Another interesting thing is that with 0.9 Kafka has removed the dissimilarity between the High-level and Low-level APIs, since both now follow a unified consumer API.
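As an illustration of that unified API, a minimal sketch of the post-0.9 consumer, which only needs bootstrap.servers and no ZooKeeper endpoint (broker addresses, group id and topic are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class NewConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // no zookeeper.connect needed
        props.put("group.id", "test-group");                         // example group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("test"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}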

Related

Kafka, questions about setting up

I'm testing Kafka on Linux, but I don't know what's wrong because the test results are different from what I understand.
Let me explain the setting.
Currently, three brokers are configured with Kafka version 2.8.1 on CentOS 7, using ports 9092, 9093, and 9094 respectively.
For the producer, all three ports were put into the broker list setting and the producer was then executed.
kafka-console-producer.bat --broker-list localhost:9092,localhost:9093,localhost:9094 --topic test
For the consumers, three were set up so that one could be attached to each of the three ports.
1. kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic test
2. kafka-console-consumer.bat --bootstrap-server localhost:9093 --topic test
3. kafka-console-consumer.bat --bootstrap-server localhost:9094 --topic test
If I were to explain what I understood here:
In the case of Kafka, the leader broker acts as the controller, and the follower brokers copy the leader broker's data.
If one of the follower brokers dies, a disconnection message simply appears on the consumer connected to that broker. The other brokers operate normally.
If the leader broker dies, one of the follower brokers is changed to the leader broker, and the changed leader broker acts as the controller.
If I were to explain the problem:
If I kill the leader broker and check with the describe option, another follower broker has changed to leader, but both the producer and the consumers cannot find the new leader and fail.
Even if I kill the broker running on port 9092 while it is not the leader, the producer and consumers fail.
Question.
If the leader broker dies, should the producer and consumer also set up a new connection?
Am I misunderstanding the producer and consumer?
Is there anything wrong with the setting?
I'm testing Kafka on Linux
But you're using batch files, which are for Windows, and connecting to localhost...
so that they could be attached to each of the three ports.
This isn't how Kafka distributes load. You can only have one consumer thread active per topic partition. Unclear how many partitions your topic has, but if you have only one and that specific broker died (it is the only replica and leader), this explains why your clients would stop working.
Besides that, Kafka generally runs on the same port on multiple hosts. Using one host is not truly fault-tolerant, and is a waste of resources (CPU, RAM, and disk).
Regarding producers, there is a property for retries that can be configured; I'm not sure if the console producer overrides the default or not, but it should connect to the next available broker upon a new request.
For consumers, the same; however, you'll want to make sure your offsets.topic.replication.factor (and the transactions topic factor, if you use transactions) is higher than 1; otherwise, consumers will be unable to read anything (or transactions will not work; they are enabled by default in newer versions).
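As a hedged sketch of those settings (values below are illustrative, not recommendations), the relevant broker-side properties and the producer retry knob might look like:

# server.properties on each broker
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=2

# producer configuration
retries=5          # how many times a failed send is retried against the (new) leader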

Kafka Topic, Broker, ZooKeeper architecture overview

I have read a bunch of articles regarding Kafka architecture, but I'm still brand new to this, and when it came to coding there was some confusion about whether I'm getting things correctly.
From what I understand Kafka server, broker and node are synonyms. There can be a few brokers within a Kafka cluster. There is a Kafka topic (T1) and it consists of a few partitions (P1, P2, ...). These partitions can be replicated across the brokers (B1, B2, ...). B1 can be leader for P1, B2 for P2 and so on. Do we say that topic T1 is defined for a broker or for the cluster, and if we treat a topic as a set of partitions, can we say 'topic replicas'?
From the official Kafka documentation:
bootstrap.servers: A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,.... Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).
So from what I understand, defining host1:port1,host2:port2 says that there are two brokers.
In this case, does ZooKeeper automatically distribute a message to a leader when executing bin/kafka-console-producer.sh --broker-list host1:port1,host2:port2 --topic test ? (I believe somewhere I have read that a producer should read broker id from ZooKeeper, but wouldn't it be unnecessary here?)
Is it equal to publishing using bin/kafka-console-producer.sh --zookeeper host1:z_port1,host2:z_port2 --topic test ?
How should I basically understand bin/kafka-configs.sh --zookeeper host1:z_port1,host2:z_port2? We have only one zookeeper instance?
Do we say that there is topic T1 defined for broker or cluster, and if we treat topic as set of partitions can we say 'topic replicas'?
1) Cluster. 2) Partitions are individually replicated across multiple brokers, often more than the replication factor itself. The more proper term would be the "in-sync replicas (ISR)".
does ZooKeeper automatically distribute a message to a leader when executing
Zookeeper does not, no. Your client communicates with one of the brokers, which returns the full list of brokers in the cluster along with metadata about which broker is the leader for which topic-partitions. The client then individually connects and produces to the leader broker of each calculated partition.
Is it equal to publishing
Producing*, yes.
We have only one zookeeper instance?
One Zookeeper cluster can manage multiple Kafka clusters via a feature called a chroot, the root directory in the Zookeeper znodes that contains information about the managed service.
Also, kafka-topics command can now use --bootstrap-server, not --zookeeper
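For example, with a recent Kafka release both tools can be pointed at the brokers directly (host and topic names are placeholders):

bin/kafka-topics.sh --bootstrap-server host1:9092 --describe --topic test
bin/kafka-configs.sh --bootstrap-server host1:9092 --entity-type topics --entity-name test --describe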

Does Kafka broker store metadata?

Does the Kafka broker store the metadata that the producer API uses (e.g. which partitions are leaders for a topic, etc.)? As per my understanding this metadata is stored in ZooKeeper; is that correct? If it is true, then how are brokers updated by ZooKeeper with the latest information?
All Kafka brokers can answer a metadata request that describes the current state of the cluster: what topics there are, which partitions those topics have, which broker is the leader for those partitions etc.
ZooKeeper is responsible for:
Electing a controller broker - and making sure there is only one
Cluster membership - allowing brokers to join a cluster
Topic configuration - which topics exist, how many partitions each has, where are the replicas, who is the preferred leader, what configuration overrides are set for each topic
Quotas - how much data is each client allowed to read and write
ACLs - who is allowed to read and write to which topic
There is regular communication between Kafka and ZooKeeper such that ZooKeeper knows a Kafka broker is still alive (ZooKeeper heartbeat mechanism) and also in response to events such as a topic being created or a replica falling out of sync for a topic-partition.
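If you want to see this metadata yourself on a ZooKeeper-based cluster, the bundled shell can list the relevant znodes (paths below are the standard ones, assuming no chroot):

bin/zookeeper-shell.sh localhost:2181 ls /brokers/ids     # live broker ids
bin/zookeeper-shell.sh localhost:2181 get /controller     # current controller broker
bin/zookeeper-shell.sh localhost:2181 ls /brokers/topics  # topics known to the cluster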
Kafka is a distributed system and is built to use Zookeeper which is responsible for controller election, topic configuration, clustering etc.
More precisely, Zookeeper initiates controller election. The controller broker is a single broker in the Kafka cluster which takes care of leader broker and followers for every partition. When a particular broker is taken down, the controller lets other replicas know (in order to handle partition leaders etc). Moreover, when the controller fails then Zookeeper initiates new elections in order to elect the new broker which will act as the controller.
Furthermore, Zookeeper knows which brokers are part of the Kafka cluster and which are still alive. Similarly, it is also aware of topic-specific information such as which topics exist, how many partitions each has, where are the replicas and so on.
Zookeeper also stores information regarding quotas and ACLs, i.e. what volume of data each client is allowed to consume/produce and also, who is allowed to consume or produce from a particular topic.

How producers find the Kafka partition leader

The producer sends messages after being set up with a list of Kafka brokers as follows.
props.put("bootstrap.servers", "127.0.0.1:9092,127.0.0.1:9092,127.0.0.1:9092");
I wonder how the producer knows which of the three brokers holds the partition leader.
For a typical distributed service you either have a load-balancing server or a virtual IP, but how is the load handled for Kafka?
Does the producer try to connect to one broker at random and then look up which broker holds the partition leader?
A Kafka cluster contains multiple broker instances. For each partition, at any given time exactly one broker is the leader while the remaining replicas are the in-sync replicas (ISR), which contain the replicated data. When the leader broker is taken down unexpectedly, one of the ISR becomes the leader.
Kafka chooses one of a partition's replicas as leader using ZooKeeper. When a producer publishes a message to a partition in a topic, it is forwarded to that partition's leader.
According to Kafka documentation:
The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.
Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
You can find topic and partition leader using this piece of code.
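(The snippet linked above is not reproduced in this dump; as an illustration only, a similar lookup with the Java AdminClient, where the broker address and topic name are placeholders, could be:)

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

public class LeaderLookupSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "127.0.0.1:9092");  // any reachable broker works
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singletonList("test"))
                    .values().get("test").get();
            // Each partition entry reports its current leader and in-sync replicas
            desc.partitions().forEach(p ->
                    System.out.printf("partition %d leader %s isr %s%n",
                            p.partition(), p.leader(), p.isr()));
        }
    }
}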
EDIT:
The producer sends a meta request with a list of topics to one of the brokers you supplied when configuring the producer.
The response from the broker contains a list of partitions in those topics and the leader for each partition. The producer caches this information and therefore, it knows where to redirect the messages.
It's quite an old question, but I had the same question, and after researching it I want to share the answer because I hope it can help others.
To determine the leader of a partition, the producer uses a request type called a metadata request, which includes a list of topics the producer is interested in.
The broker's response specifies which partitions exist in those topics, the replicas for each partition, and which replica is the leader.
Metadata requests can be sent to any broker because all brokers have a metadata cache that contains this information.
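To see this from the producer side, the Java client exposes the cached metadata via partitionsFor(), which issues a metadata request if the topic is not yet in the local cache (broker address and topic name are placeholders):

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.serialization.StringSerializer;

public class MetadataRequestSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "127.0.0.1:9092");  // any broker can answer the metadata request
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (PartitionInfo info : producer.partitionsFor("test")) {
                System.out.printf("partition %d leader %s isr %s%n",
                        info.partition(), info.leader(), Arrays.toString(info.inSyncReplicas()));
            }
        }
    }
}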

Apache Kafka Producer Broker Connection

I have a set of Kafka broker instances running as a cluster. I have a client that is producing data to Kafka:
props.put("metadata.broker.list", "broker1:9092,broker2:9092,broker3:9092");
When we monitor using tcpdump, I can see that only the connections to broker1 and broker2 are ESTABLISHED while for the broker3, there is no connection from my producer. I have a single topic with just one partition.
My questions:
How is the relation between the number of brokers and topic partitions? Should I always have number of brokers = number of partitions?
Why, in my case, am I not able to connect to broker3? Or at least my network monitoring does not show that a connection from my producer is established with broker3?
It would be great if I could get some deeper insight into how the connection to the brokers work from a Producer stand point.
Obviously, your producer does not need to connect to broker3 :)
I'll try to explain you what happens when you are producing data to Kafka:
You spin up some brokers, let's say 3, then create some topic foo with 2 partitions, replication factor 2. Quite simple example, yet could be a real case for someone.
You create a producer with metadata.broker.list (or bootstrap.servers in new producer) configured to these brokers. Worth mentioning, you don't necessarily have to specify all the brokers in your cluster, in fact you can specify only 1 of them and it will still work. I'll explain this in a bit too.
You send a message to topic foo using your producer.
The producer looks up its local metadata cache to see which brokers are leaders for each partition of topic foo and how many partitions your foo topic has. As this is the first send from the producer, the local cache contains nothing.
Producer sends a TopicMetadataRequest to each broker in metadata.broker.list sequentially until first successful response. That's why I mentioned 1 broker in that list would work as long as it's alive.
Returned TopicMetadataResponse will contain the information about requested topics, in your case it's foo and brokers in the cluster. Basically, this response contains the following:
list of brokers in the cluster, where each broker has an ID, host and port. This list may not contain the entire list of brokers in the cluster, but should contain at least the list of brokers that are responsible for servicing the subject topic.
list of topic metadata, where each entry has topic name, number of partitions, leader broker ID for each partition and ISR broker IDs for each partition.
Based on TopicMetadataResponse your producer builds up its local cache and now knows exactly that the request for topic foo partition 0 should go to broker X.
Based on number of partitions in a topic, producer partitions your message and accumulates it with the knowledge that it should be sent as a part of batch to some broker.
When the batch is full or linger.ms timeout passes, your producer flushes the batch to the broker. By "flushes" I mean "opens a new connection to a broker or reuses an existing one, and sends the ProduceRequest".
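For reference, the two settings that control this flush behaviour can be set explicitly on the producer (values below are the usual defaults, shown only as an example):

batch.size=16384   # maximum batch size in bytes per partition
linger.ms=0        # how long to wait for more records before sending a partially full batch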
The producer does not need to open unnecessary connections to all brokers, as the topic you are producing to may not be serviced by some brokers, and your cluster could be quite large. Imagine a 1000 broker cluster with lots of topics, but one of topics has just one partition - you only need that one connection, not 1000.
In your particular case I'm not 100% sure why you have 2 open connections to brokers, if you have just a single partition, but I assume one connection was opened during metadata discovery and was cached for reusing, and the second one is the actual broker connection to produce data. However, I might be wrong in this case.
But anyway, there is no need at all to have a connection for the third broker.
Regarding your question about "Should I always have number of brokers = number of partitions?" the answer is most likely no. If you explain what you are trying to achieve, maybe I'll be able to point you in the right direction, but this is too broad to explain in general. I recommend reading this to clarify things.
UPD to answer the question in comment:
Metadata cache is updated in 2 cases:
If the producer fails to communicate with a broker for any reason - this includes the case when the broker is not reachable at all and the case when the broker responds with an error (like "I'm not the leader for this partition anymore, go away")
If no failures happen, the client still refreshes metadata every metadata.max.age.ms (https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/CommonClientConfigs.java#L42-L43) to discover new brokers and partitions itself.
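For example, the refresh interval is itself a producer setting (the value below is the documented default):

metadata.max.age.ms=300000   # force a metadata refresh every 5 minutes even without failures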