We have kafka cluster with 3 kafka brokers nodes and 3 zookeepers servers
kafka version - 10.1 ( hortonworks )
from my understanding since all meta data is located on the zookeeper servers , and kafka brokers are using this data ( kafka talk with zookeeper server via port 2181 )
I just wondering if each kafka machine talk with other kafka in the cluster , or maybe kafka are get/put the data only on/from the zookeepers servers ?
So dose kafka service need to communicate with other kafka in the cluster ? ,
Or maybe kafka machines get all is need only from the zookeepers server ?
Kafka brokers certainly need to communicate with each other, most importantly to replica data. Data produced to Kafka is replicated across brokers for fault-tolerance and data durability. Partition followers send FetchRequests to partition leaders in order to replicate the data.
Additionally, the Controller broker sends a LeaderAndIsr request to brokers whenever a partition leader/follower is changed - that's how it informs brokers to start leading a partition or replicating it.
I would recommend these two introductory articles of mine in order to help you get more context:
https://hackernoon.com/thorough-introduction-to-apache-kafka-6fbf2989bbc1
https://hackernoon.com/apache-kafkas-distributed-system-firefighter-the-controller-broker-1afca1eae302
Related
I have just started learning kafka and continuously I am coming across a term bootstrap-server.
Which server does it represent in my kafka cluster?
It is the url of one of the Kafka brokers which you give to fetch the initial metadata about your Kafka cluster.
The metadata consists of the topics, their partitions, the leader brokers for those partitions etc.
Depending upon this metadata your producer or consumer produces or consumes the data.
You can have multiple bootstrap-servers in your producer or consumer configuration. So that if one of the broker is not accessible, then it falls back to other.
We know that a kafka cluster can have 100s or 1000nds of brokers (kafka servers). But how do we tell clients (producers or consumers) to which to connect? Should we specify all 1000nds of kafka brokers in the configuration of clients? no, that would be troublesome and the list will be very lengthy. Instead what we can do is, take two to three brokers and consider them as bootstrap servers where a client initially connects. And then depending on alive or spacing, those brokers will point to a good kafka broker.
So bootstrap.servers is a configuration we place within clients, which is a comma-separated list of host and port pairs that are the addresses of the Kafka brokers in a "bootstrap" Kafka cluster that a Kafka client connects to initially to bootstrap itself.
A host and port pair uses : as the separator.
localhost:9092
localhost:9092,another.host:9092
So as mentioned, bootstrap.servers provides the initial hosts that act as the starting point for a Kafka client to discover the full set of alive servers in the cluster.
Special Notes:
Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list does not have to contain the full set of servers (you may want more than one, though, in case a server is down).
Clients (producers or consumers) make use of all servers irrespective of which servers are specified in bootstrap.servers for bootstrapping.
bootstrap.servers is a comma-separated list of host and port pairs that are the addresses of the Kafka brokers in a "bootstrap" Kafka cluster that a Kafka client connects to initially to bootstrap itself.
Kafka broker
A Kafka cluster is made up of multiple Kafka Brokers. Each Kafka Broker has a unique ID (number). Kafka Brokers contain topic log partitions. Connecting to one broker bootstraps a client to the entire Kafka cluster. For failover, you want to start with at least three to five brokers. A Kafka cluster can have, 10, 100, or 1,000 brokers in a cluster if needed.
more information: check this, official doc
So my question is this: If i have a server running Kafka (And zookeeper), and another machine only consuming messages, does the consumer machine need to run zookeeper too? Or does the server take care of all?
No.
Role of Zookeeper in Kafka is:
Broker registration: (cluster membership) with heartbeats mechanism to keep the list current
Storing topic configuration: which topics exist, how many partitions each
has, where are the replicas, who is the preferred leader, list of ISR for
partitions
Electing controller: The controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all the partitions.
So Zookeeper is required only for kafka broker. There is no need to have Zookeper on the producer or consumer side.
The consumer does not need zookeeper
You have not mentioned which version of Kafka or the clients you're using.
Kafka consumers using 0.8 store their offsets in Zookeeper, so it is required for them. However, no, you would not run Zookeeper and consumers on the same server
From 0.9 and later, clients are separate from needing it (unless you want to manage external connections to Zookeeper on your own for storing data)
we have with 3 kafka machine and 3 zookeper servers
while kafka machines are not co-hosted with zookeper server ( kafka are on different machines , OS is redhat version 7.x )
in order to get the brokers id , we do the following on the zookeper servers
cd /usr/hdp/current/zookeeper-server/bin
./zkCli.sh
ls /brokers/ids
results should be the three brokers id's as
1011 1012 1013
my question is - in which way zookeper know that broker is up?
or to be more specific
which cli zookeper execute in order to identify that kafka broker is up ?
Zookeeper is basically a distributed key-value store. Upon startup, a Kafka broker connects to Zookeeper (using the zookeeper.connect setting) and create a znode (a key-value pair) with its own broker.id under /brokers/ids. Kafka brokers then stay connected to Zookeeper while they are running.
The znode is created as "Ephemeral" (this is a feature of Zookeeper). It means that Zookeeper will delete it if the broker disconnects.
This way, Zookeeper knows at any time which brokers are alive (it does not necessarily mean the broker is healthy!). This is used by brokers to discover the other brokers in a cluster.
I've configured a cluster of Kafka brokers and a cluster of Zk instances using kafka_2.11-1.1.0 distribution archive.
For Kafka brokers I've configured config/server.properties
broker.id=1,2,3
zookeeper.connect=box1:2181,box2:2181,box3:2181
For Zk instances I've configured config/zookeeper.properties:
server.1=box1:2888:3888
server.2=box3:2888:3888
server.3=box3:2888:3888
I've created a basic producer and a basic consumer and I don't know why I am able to write messages / read messages even if I shut down all the Zookeeper
instances and have all the Kafka brokers up and running.
Even booting up new consumers, producers works without any issue.
I thought having a quorum of Zk instances is a vital point for a Kafka cluster.
For both consumer and producer, I've used following configuration:
bootrapServers=box1:9092,box2:9092,box3:9092
Thanks
I thought having a quorum of Zk instances is a vital point for a Kafka cluster.
Zookeeper quorum is vital for managing partition lists, leaders, etc. In general, ZK is necessary for management that is done by the cluster coordinator in the cluster.
Basically, right now (with ZK down), you cannot modify topics (as the partition metadata is stored in ZK), start up / shut down brokers (as they use ZK for discovery) and other similar operations.
Even booting up new consumers, producers works without any issue.
Producer/consumer operations reach out to brokers only. The broker instance can still append to the log, and can still communicate with other brokers to have replication. So it is possible to send a message, get it received by broker and saved to disk, with other brokers replicating (as they are continuously sending fetch requests to the leader (and they know who this partition's leader is because they saved that data when ZK was still running)).
I am trying to create a tool that can kill kafka producers and consumers randomly in order to simulate production environment. Is there a way to find the active producers and consumers in a kafka cluster? I want to exactly know which thread in which host is acting as the producer or the consumer?
I started by getting the I.P. of at least one kafka broker from the HDP cluster.
I checked the open connections on the kafka broker on its specified port (by default it is 6667) and retrieved the IP addresses of the connected machines.
Using their IP Addresses, I found out the processes that are connected to the kafka broker.
Thus, I figured out the machines that are kafka producers and consumers.