Zookeeper install via Ambari - apache-zookeeper

I am performing an install via Ambari 1.7 and would like some clarification regarding the Zookeeper installation. The setup involves three Zookeeper and three Kafka instances.
The Ambari UI asks me to specify Zookeeper master(s) and Zookeeper clients/slaves. Should I choose all three Zookeeper nodes as masters and install the Zookeeper client on each Kafka server?
Zookeeper doesn't have any master node(s), so I am a little confused by Ambari's master/slave terminology.

Zookeeper Server is considered a MASTER component in Ambari terminology. Kafka requires that Zookeeper Server be installed on at least one node in the cluster, so the only requirement you have is to install Zookeeper Server on one of the nodes in your cluster for Kafka to function. Kafka does not require Zookeeper clients on each Kafka node.
You can determine all this information by looking at the Service configurations for KAFKA and ZOOKEEPER. The configuration is specified in the metainfo.xml file for each component under the stack definition. The location of the definitions will differ based on the version of Ambari you have installed.
On newer versions of Ambari this location is:
/var/lib/ambari-server/resources/common-services/<service name>/<service version>
On older versions of Ambari this location is:
/var/lib/ambari-server/resources/stacks/HDP/<stack version>/services/<service name>
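For example, assuming the newer layout and the stock stack definitions (paths are illustrative and may differ on your install), you can inspect each component's role and required cardinality directly from the metainfo.xml files:
ls /var/lib/ambari-server/resources/common-services/ZOOKEEPER/
# list the declared components, their category (MASTER/SLAVE/CLIENT) and how many instances are required
grep -E '<(name|category|cardinality)>' /var/lib/ambari-server/resources/common-services/KAFKA/*/metainfo.xml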

Related

How to set up a Kafka Schema Registry cluster

I have set up a Zookeeper and Kafka broker cluster. I want to set up multiple Schema Registry instances for failover.
The Zookeeper cluster has 3 nodes.
The Kafka broker cluster has 3 nodes.
Could you please give detailed steps on how to set up multiple Schema Registry instances?
I am using Confluent 5.0.
Schema Registry is designed to work as a distributed service using a single-master architecture, so at any given time there will be only one master and the rest of the nodes refer back to it. You can refer to the Schema Registry architecture here.
You can run a 3-node Schema Registry cluster (it can run on the same nodes as Zookeeper/Kafka). As you are using Confluent 5.0, you can use the Confluent CLI:
confluent start schema-registry
Update schema-registry.properties on each node:
#zookeeper urls
kafkastore.connection.url=zookeeper-1:2181,zookeeper-2:2181,...
#make every node eligible to become master for failover
master.eligibility=true
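If instances share a host or you want a non-default port, each instance would also set its own listener; this extra line is a sketch, not part of the original answer's configuration:
#listener for this instance (illustrative)
listeners=http://0.0.0.0:8081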
On the consumer and producer side, pass the list of Schema Registry URLs in Consumer.props & Producer.props:
props.put("schema.registry.url","http://schemaregistry-1:8081,http://schemaregistry-2:8081,http://schemaregistry-3:8081")
*By default the Schema Registry port is 8081.
Hope this helps.

2 clusters of zookeeper servers in a hadoop+kafka cluster - is it possible?

We have a Kafka cluster with the following details:
3 Kafka machines
3 Zookeeper servers
We also have a Hadoop cluster that includes datanode machines, and all applications are using the Zookeeper servers, including the Kafka machines.
Now we want to make the following change:
We want to add 3 additional Zookeeper servers that will form a separate cluster, and only the Kafka machines will use these additional Zookeeper servers.
Is it possible?
Yes, this can be achieved: edit ha.zookeeper.quorum in the Hadoop configuration so that it is separate from zookeeper.connect in the Kafka configuration, giving you two individual Zookeeper clusters.
However, I don't think Ambari or Cloudera Manager, for example, allow you to view or configure more than one Zookeeper cluster at a time.
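A minimal sketch of what the two configurations might look like, assuming hostnames zk-hadoop-1..3 and zk-kafka-1..3 (names are illustrative; the Hadoop setting lives in XML and is shown here as key=value for brevity):
# Hadoop (core-site.xml / hdfs-site.xml), used by the HDFS HA failover controllers
ha.zookeeper.quorum=zk-hadoop-1:2181,zk-hadoop-2:2181,zk-hadoop-3:2181
# Kafka (server.properties on each broker)
zookeeper.connect=zk-kafka-1:2181,zk-kafka-2:2181,zk-kafka-3:2181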
Yes, that's possible. Kafka uses Zookeeper to perform various distributed coordination tasks, such as deciding which Kafka broker is responsible for allocating partition leaders, and storing metadata on topics in the broker.
After stopping Kafka, the original Zookeeper cluster's data can be copied to the new cluster with a tool; zkcopy is a Zookeeper data transfer utility made for this.
But if your Kafka cluster cannot stop working, you need to think carefully about how to transfer the Zookeeper data to the additional Zookeeper servers while it is live.
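Roughly, a zkcopy invocation looks like the following (a sketch only; the hostnames are illustrative and the exact jar name and flags are in the zkcopy README):
# copy the znode tree from the old ensemble to the new one
java -jar zkcopy.jar --source old-zk-1:2181/ --target new-zk-1:2181/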

Kafka connect cluster setup or launching connect workers

I am going through Kafka Connect, and I am trying to get the concepts.
Let us say I have a Kafka cluster (nodes k1, k2 and k3) set up and running, and now I want to run Kafka Connect workers on different nodes, say c1 and c2, in distributed mode.
A few questions:
1) To run or launch Kafka Connect in distributed mode I need to use the command ../bin/connect-distributed.sh, which is available on the Kafka cluster nodes. Do I need to launch Kafka Connect from one of the Kafka cluster nodes, or does any node from which I launch Kafka Connect need to have the Kafka binaries so that I can use ../bin/connect-distributed.sh?
2) Do I need to copy my connector plugins to any Kafka cluster node (or to all cluster nodes?) from where I do step 1?
3) How does Kafka copy these connector plugins to the worker node before starting the JVM process on the worker node? The plugin is what contains my task code, and it needs to be on the worker in order to start the process there.
4) Do I need to install anything on the Connect cluster nodes c1 and c2, like Java or anything Kafka Connect related?
5) In some places it says to use Confluent Platform, but I would like to start with Apache Kafka Connect alone first.
Can someone please shed some light? Even a pointer to some resources would help.
Thank you.
1) In order to have a highly available kafka-connect service you need to run at least two instances of connect-distributed.sh on two distinct machines that have the same group.id. You can find more details regarding the configuration of each worker here; a sketch of a worker configuration is also shown after this list. For improved performance, Connect should be run independently of the broker and Zookeeper machines.
2) Yes, you need to place all your connectors under plugin.path (normally under /usr/share/java/) on every machine on which you are planning to run kafka-connect.
3) kafka-connect will load the connectors on startup. You don't need to handle this. Note that if your kafka-connect instance is running and a new connector is added, you need to restart the service.
4) You need to have Java installed on all your machines. For Confluent Platform particularly:
Java 1.7 and 1.8 are supported in this version of Confluent Platform
(Java 1.9 is currently not supported). You should run with the
Garbage-First (G1) garbage collector. For more information, see the
Supported Versions and Interoperability.
5) It depends. Confluent was founded by the original creators of Apache Kafka and it comes as a more complete distribution adding schema management, connectors and clients. It also comes with KSQL which is quite useful if you need to act on certain events. Confluent simply adds on top of the Apache Kafka distribution, it's not a modified version.
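As a concrete illustration of point 1, a minimal connect-distributed.properties for workers c1 and c2 might look like the following (all values are illustrative; adjust hosts, topic names and paths to your setup):
# Kafka brokers the workers talk to
bootstrap.servers=k1:9092,k2:9092,k3:9092
# workers sharing this group.id form one Connect cluster
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# internal topics used to store offsets, connector configs and status
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
# where the connector jars are placed on each worker
plugin.path=/usr/share/java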
The answer given by Giorgos is correct. I ran a few connectors and now I understand it better.
I am just trying to put it differently.
In Kafka Connect there are two things involved: the worker and the connector. Below are the details of running distributed Kafka Connect.
A Kafka Connect worker is a Java process on which the connector/connect task runs. So the first thing is to launch the worker. To run a worker we need Java installed on that machine, the Kafka Connect sh/bat files used to launch the worker, and the Kafka libs used by the worker; for this we simply copy/install Kafka on the worker machine. We also need to copy all the connector and connect-task related jars/dependencies into the "plugin.path" defined in the worker properties file. The worker machine is now ready; to start the worker we invoke ./bin/connect-distributed.sh ./config/connect-distributed.properties, where connect-distributed.properties holds the configuration for the worker. The same thing has to be repeated on each machine where we want to run Kafka Connect.
Now the worker Java process is running on all machines. The worker config has a group.id property, and the workers which share the same value for this property form a group/cluster of workers.
Each worker process exposes a REST endpoint (default http://localhost:8083/connectors). To launch/start a connector on the running workers, we HTTP POST a connector config JSON; based on the given config, the workers start the connector and its tasks across the above group/cluster of workers.
Example: Connect post,
curl -X POST -H "Content-Type: application/json" --data '{"name": "local-file-sink", "config": {"connector.class":"FileStreamSinkConnector", "tasks.max":"3", "file":"test.sink.txt", "topics":"connect-test" }}' http://localhost:8083/connectors
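Once the connector is posted, you can inspect it through the same REST API (the connector name below matches the one posted above):
curl http://localhost:8083/connectors
curl http://localhost:8083/connectors/local-file-sink/status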

Can I query any zookeeper node to get any data?

I have a small Zookeeper cluster of 3 nodes. I also have another piece of software that needs to be configured to talk to Zookeeper, also running as a cluster of 3 nodes on the same hosts.
I don't know anything about how Zookeeper works. Do I have to configure this other software to talk to all hosts, or should it work to just configure it to talk to the localhost Zookeeper?
Put another way, can a query to any Zookeeper node get any data?
If you have a ZooKeeper cluster, you can query any ZooKeeper node and get eventually consistent data.
For how ZooKeeper works, you can check this post: Explaining Apache ZooKeeper
A lot of good projects use ZooKeeper as a backbone: HBase, Kafka, and more; look into those projects to learn further.
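As a quick illustration (a sketch, assuming a three-node ensemble on hosts zk1, zk2 and zk3, and that Kafka happens to be one of its clients), you can point the CLI at any single member and read the same data:
# connect to any one ensemble member; reads can be served by any node
bin/zkCli.sh -server zk2:2181
# then, inside the CLI:
ls /
ls /brokers/ids
# clients normally pass the whole ensemble so they can fail over, e.g. zk1:2181,zk2:2181,zk3:2181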

Configure zookeeper for kafka

I want to install Kafka on my CentOS 6.5 machine. From the Kafka installation tutorial, I came to know that it needs Zookeeper to run. I have already installed HBase, which also uses the Zookeeper service internally, and the Zookeeper service only starts when I start the HBase service.
So in order to install Kafka, do I need to install Zookeeper separately? Please suggest.
Kafka is designed to use Zookeeper by default. If you have already installed Zookeeper on your system, you can create a bash script to start Zookeeper whenever you start Kafka. In your Zookeeper installation directory there should be zkServer.sh start (to start Zookeeper), and in the Kafka installation directory kafka-server-start.sh (to start Kafka).
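For example, a start-up sequence might look like this (the install paths are illustrative):
# start Zookeeper first, then the Kafka broker
/opt/zookeeper/bin/zkServer.sh start
/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties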
Kafka's architecture works best on a distributed platform; if you are only experimenting with a pseudo-cluster, you can look at alternative message brokers like HiveMQ or RabbitMQ.
You can look further discussions at: Kafka: Is Zookeeper a must?
Installing a shared Zookeeper cluster is the best practice. You can use it for both HBase and Kafka (just define a different root dir in ZK for each).
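A minimal sketch of sharing one ensemble via separate root dirs (chroots), assuming hosts zk1, zk2 and zk3 (names are illustrative; the HBase settings live in hbase-site.xml and are shown as key=value for brevity):
# Kafka server.properties - keep Kafka's data under /kafka
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka
# HBase hbase-site.xml - keep HBase's data under /hbase
hbase.zookeeper.quorum=zk1,zk2,zk3
zookeeper.znode.parent=/hbase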