How to retrieve zookeeper host details from Kafka brokerslist - apache-kafka

I have a list of brokers for my Kafka cluster. How can I get the Zookeeper hosts using the broker list?

If I got your question right you want to register your brokers at a zookeeper cluster. This actually works the other way round: You have to tell each broker where your zookeeper-server (or cluster) can be found. Have a look at the broker configuration setting zookeeper.connect. Together with the broker.id it will register each broker at the zookeeper cluster.
Example:
broker.id=1
zookeeper.connect=zk-host-1:2181,zk-host-2:2181,zk-host-3:2181
Hope that answers your question.

You cannot.
Zookeeper is intended to be abstracted away. There is no API or method that returns the Zookeeper hosts a broker is connected to.
You'll need to SSH to a broker in that list (which you could do from Java) and read the zookeeper.connect value from its configuration.
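Once a broker's server.properties is in hand (for example, copied over SSH), pulling the Zookeeper hosts out of zookeeper.connect is mechanical. The following is a minimal Python sketch, assuming the file uses plain key=value lines. The helper name zookeeper_hosts is made up for illustration and is not part of any Kafka library. It also handles the optional chroot path that may follow the last host.

```python
def zookeeper_hosts(properties_text):
    """Return (hosts, chroot) parsed from the zookeeper.connect line.

    hosts is a list of (host, port) tuples; chroot is "" when absent.
    Hypothetical helper, not part of any Kafka library.
    """
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "zookeeper.connect":
            value = value.strip()
            # An optional chroot path starts at the first "/".
            servers, slash, path = value.partition("/")
            chroot = slash + path
            hosts = []
            for entry in servers.split(","):
                host, _, port = entry.strip().partition(":")
                # 2181 is Zookeeper's conventional client port.
                hosts.append((host, int(port) if port else 2181))
            return hosts, chroot
    raise KeyError("zookeeper.connect not found")


sample = """
# Kafka server properties file
broker.id=1
zookeeper.connect=zk-host-1:2181,zk-host-2:2181,zk-host-3:2181/kafka
"""
print(zookeeper_hosts(sample))
```

This sketch does not handle IPv6 literals, and it only reads the file; fetching it from the broker is left to whatever SSH tooling you already use.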

Related

Kafka Static IP and Service Discovery

I have a three-node Kafka cluster that also has a three-node Zookeeper ensemble managing it. My configuration for this cluster looks like
Node 1
IP - 192.168.1.11
Kafka Port - 9092
Zookeeper Port - 2181
Node 2
IP - 192.168.1.12
Kafka Port - 9092
Zookeeper Port - 2181
Node 3
IP - 192.168.1.13
Kafka Port - 9092
Zookeeper Port - 2181
For each of these nodes I have both the Zookeeper and Kafka configuration files. My sample Zookeeper config file looks like
# Zookeeper server config
dataDir=/tmp/zookeeper
clientPort=2181
initLimit=5
syncLimit=2
server.1=192.168.1.11:2889:3889
server.2=192.168.1.12:2889:3889
server.3=192.168.1.13:2889:3889
Each Zookeeper instance needs to know about every other Zookeeper instance, and from what I have seen, even when managing massive Kafka clusters, there are usually fewer than 10 Zookeeper nodes. So here we would need to keep track of at most 10 IPs. Also, from my understanding, these IPs are not very volatile and rarely change, if ever.
For my Kafka configuration file I have the following on each node
# Kafka server properties file
broker.id=<ID for this node>
log.dirs=/tmp/kafka-logs
zookeeper.connect=192.168.1.11:2181,192.168.1.12:2181,192.168.1.13:2181
zookeeper.connection.timeout.ms=36000
listeners=PLAINTEXT://<IP of this node>:9092
Now it makes sense to me that each Kafka node we introduce into our cluster has to be aware of all the Zookeeper nodes so it can be managed. But the issue for me is that as we scale the Kafka nodes up or down, we are less certain about their IPs. For example, if I wanted to create a new Kafka topic, I would use the kafka-topics.sh shell script that ships with Kafka and type something like
kafka-topics.sh --create --topic MyTopic --bootstrap-server <IP of one of the Kafka nodes>
# Could also use the broker-list option instead of bootstrap-server to allow multiple IPs
The problem for me is, we never know which Kafka IPs are up and running, so passing the IPs to --bootstrap-server seems like a guessing game, or I need to manually check a working node for its IP.
So for Kafka, how do I configure a static IP (maybe virtual IP?) so that other services that use my Kafka cluster can always connect to it? How do I perform service discovery for a cluster with changing IPs?
there is usually less than 10 Zookeeper nodes
According to Kafka: The Definitive Guide, 7 is generally the maximum size of a Zookeeper ensemble for large Kafka clusters. Personally, I've not seen more than 5 on a Kafka cluster serving millions of events a day...
You could make a DNS record that resolves to the healthy instances
However, if IPs aren't static, then clients, in general, would have issues because partition leaders are identified by address and broker ID. If an ID moves to a new IP, or an IP no longer resolves to a (healthy) Kafka broker, your clients start experiencing errors.
Note: both bootstrap-server and broker-list accept multiple addresses, but only the console producer uses broker-list param
There are also other ways to create topics, such as Terraform where you could statically store the Kafka addresses as a variable in source code and rarely ever change it. In particular, you don't need to list every IP each time you use a Kafka client, only a handful
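One rough way to avoid the "guessing game" is to keep a short candidate list and probe it before calling a tool. The following is a minimal Python sketch using only the standard library; the function name reachable_brokers is hypothetical, and the candidate addresses you pass in would come from your own inventory or DNS.

```python
import socket


def reachable_brokers(candidates, timeout=1.0):
    """Return the (host, port) pairs that accept a TCP connection.

    A successful connect only shows that something is listening on the
    port; it does not prove the listener is a healthy Kafka broker.
    """
    alive = []
    for host, port in candidates:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                alive.append((host, port))
        except OSError:
            # Refused, timed out, or unresolvable: treat as down.
            pass
    return alive
```

The result can be joined into a comma-separated host:port string and passed to --bootstrap-server. Because the client only needs one live address to discover the rest of the cluster, probing a handful of candidates is usually enough.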

Kafka broker setup

To connect to a Kafka cluster I've been provided with a set of bootstrap servers with name and port:
s1:9092
s2:9092
s3:9092
Kafka and Zookeeper are running on the instance s4. From reading https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-properties-bootstrap-servers.html, it states:
bootstrap server is a comma-separated list of host and port pairs that
are the addresses of the Kafka brokers in a "bootstrap" Kafka cluster
that a Kafka client connects to initially to bootstrap itself.
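To make the "comma-separated list of host and port pairs" concrete, here is a small Python sketch that splits such a string and rejects malformed entries. The function name parse_bootstrap_servers is invented for this example; real Kafka clients do their own parsing internally.

```python
def parse_bootstrap_servers(value):
    """Split "host:port,host:port" into a de-duplicated list of (host, port)."""
    servers = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the last colon so the port is always the final field.
        host, sep, port_text = entry.rpartition(":")
        if not sep or not host or not port_text.isdigit():
            raise ValueError(f"expected host:port, got {entry!r}")
        port = int(port_text)
        # TCP ports are limited to 1-65535.
        if not 1 <= port <= 65535:
            raise ValueError(f"port out of range in {entry!r}")
        if (host, port) not in servers:
            servers.append((host, port))
    return servers


print(parse_bootstrap_servers("s1:9092,s2:9092,s3:9092"))
```

Checking a list this way before handing it to a client catches mistyped ports and repeated hosts early, which are easy to miss in a hand-edited configuration.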
I reference the above bootstrap server definition as I'm trying to understand the relationship between the kafka brokers s1,s2,s3 and kafka,zookeeper running on s4.
To connect to the Kafka cluster, I set the broker to a CSV list of 's1,s2,s3'. When I send messages to the CSV list of brokers, to verify the messages are added to the topic, I ssh onto the s4 box and view the messages on the topic.
What is the link between the Kafka brokers s1,s2,s3 and s4? I cannot ssh onto any of the brokers s1,s2,s3 as these brokers do not seem accessible using ssh, should s1,s2,s3 be accessible?
The individual responsible for the setup of the Kafka box is no longer available, and I'm confused as to how this configuration works. I've searched for config references of the brokers s1,s2,s3 on s4 but there does not appear to be any configuration.
When Kafka is being set up and configured what allows the linking between the brokers (in this case s1,s2,s3) and s4?
I start Kafka and Zookeeper on the same server, s4.
Should Kafka and Zookeeper also be running on s1,s2,s3?
What is the link between the Kafka brokers s1,s2,s3 and s4?
As per the Kafka documentation about adding nodes to a cluster, each server must share the same zookeeper.connect string and have a unique broker.id to be part of the cluster.
You may check which nodes are in the cluster via zookeeper-shell with an ls /brokers/ids, or via the Kafka AdminClient API, or kafkacat -L
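The membership rule (same zookeeper.connect, unique broker.id) can be expressed as a small check over a set of broker configs. This is only a Python sketch: the function name check_cluster_configs and the dict layout are assumptions for illustration, and the s4:2181 address in the example is a guess at where Zookeeper listens, not something stated in the question.

```python
from collections import Counter


def check_cluster_configs(configs):
    """Return a list of problems found in a set of broker configs.

    configs is a list of dicts with "broker.id" and "zookeeper.connect"
    keys. An empty result means the configs satisfy both rules.
    """
    problems = []
    id_counts = Counter(c["broker.id"] for c in configs)
    for broker_id, count in sorted(id_counts.items()):
        if count > 1:
            problems.append(f"broker.id {broker_id} is used by {count} brokers")
    connect_strings = {c["zookeeper.connect"] for c in configs}
    if len(connect_strings) > 1:
        problems.append("brokers point at different zookeeper.connect strings")
    return problems


# Assumed layout: three brokers that all register with Zookeeper on s4.
brokers = [
    {"broker.id": 1, "zookeeper.connect": "s4:2181"},
    {"broker.id": 2, "zookeeper.connect": "s4:2181"},
    {"broker.id": 3, "zookeeper.connect": "s4:2181"},
]
print(check_cluster_configs(brokers))
```

Under that assumption, the "link" between s1, s2, s3 and s4 is simply the zookeeper.connect value in each broker's own server.properties, which is why nothing about the brokers needs to be configured on s4 itself.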
should s1,s2,s3 be accessible?
Via SSH? They don't have to be.
They should respond to TCP connections from your Kafka client machines on their Kafka server ports, though
Should Kafka and Zookeeper also be running on s1,s2,s3?
You should not have 4 Zookeeper servers in a cluster (odd numbers, only)
Otherwise, you've at least been given some ports for Kafka on those machines, therefore Kafka should be running on them.

Is it possible to produce to a kafka topic when only 1 of the brokers is reachable?

Is it possible to produce to a Kafka topic when only 1 of the brokers is reachable from the producer, none of the zookeeper nodes are reachable from the producer, but all of the brokers are healthy and are reachable from each other?
For example, this would be required if I were to produce messages via an SSH tunnel. If this were for a temporary push I could possibly create the topic with replication factor 1 and have all partitions assigned to the broker in question, and reassign the partitions after the fact, but I'm hoping there is a more flexible setup.
This is all using the java client.
Producers don't interact with Zookeeper so it's not an issue.
The only requirement for Producers is to be able to connect to the brokers that are leaders for the partitions they want to use.
If the broker you connect to is the leader for the partitions you want to use, then yes you can produce to it.
Otherwise it's not going to work. Also creating a topic may not help as its partitions could be assigned to any brokers. Also in order to create a topic, a client has to connect to the controller which may not be the broker you can reach.
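The leader requirement can be illustrated with a toy model. The Python sketch below does not talk to Kafka; the function name producible_partitions and the partition-to-leader mapping are hypothetical, standing in for the metadata a real client would fetch from the cluster.

```python
def producible_partitions(leaders, reachable):
    """Return the partitions whose leader broker the producer can reach.

    leaders maps partition number -> leader broker id.
    reachable is the set of broker ids the producer can connect to.
    """
    return sorted(p for p, leader in leaders.items() if leader in reachable)


# Hypothetical topic with three partitions led by brokers 1, 2, and 3.
leaders = {0: 1, 1: 2, 2: 3}
print(producible_partitions(leaders, reachable={1}))
```

With only broker 1 reachable, only the partition it leads is usable, so records routed to any other partition would fail to send.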
If you can only connect to 1 "thing", you may want to consider using something like a REST Proxy. Your "isolated" environment could send REST requests to the proxy which is able to connect to all brokers in the cluster.

Does a Kafka client connect to Zookeeper, or is it behind the scenes?

Kafka client code refers directly to the broker IP and port. If that broker is down, will Zookeeper direct the client to another broker? Is Zookeeper always behind the scenes?
In the case you provide only one broker address in the client code, and it goes down, plus your client restarts, then your client will also be down. Zookeeper will not be used here because the broker will not be reachable.
If you give more than one broker address in the client, then it's more resilient in that the Kafka Controller process periodically fetches a list of all alive brokers in the cluster from Zookeeper and is responsible for sending that information back to the clients via the leader of the partitions they get assigned. Zookeeper is indirectly used here, but does not communicate with any external clients
If I understood the question correctly, the answer is no.
Kafka clients need connections only to Kafka brokers; Zookeeper isn't involved at all. Clients need to write to and read from partition leaders on the brokers.
If the Kafka brokers set in the brokers list aren't available, the clients cannot connect and cannot start to send/receive messages.
Only in the old version 0.8.0 was Zookeeper involved, for consumers that saved their offsets in Zookeeper. Starting from 0.9.0, consumers save offsets in a Kafka topic, so Zookeeper isn't needed anymore.

Kafka and Zookeeper dependencies

My company is about to introduce Kafka. However, I was not able to comprehend why neither the Zookeeper nor the Kafka configuration seems to require specifying the other's existence.
For example, I can find neither a definition of the Kafka IP in the Zookeeper config nor a definition of the Zookeeper IP in the Kafka config.
Can someone explain?
For the Kafka server you should have a server.properties file. It contains the property zookeeper.connect.
official documentation: https://kafka.apache.org/documentation/#brokerconfigs