I want to install Kafka on my CentOS 6.5 machine. From the Kafka installation tutorial I learned that it needs ZooKeeper to run. I have already installed HBase, which also uses the ZooKeeper service internally, and that ZooKeeper service only starts when I start the HBase service.
So in order to install Kafka, do I need to install ZooKeeper separately? Please suggest.
Kafka is designed to use ZooKeeper by default. If you have already installed ZooKeeper on your system, you can create a bash script that starts ZooKeeper whenever you start Kafka. In your ZooKeeper installation directory there is zkServer.sh (run zkServer.sh start to start ZooKeeper), and in the Kafka installation directory there is kafka-server-start.sh (to start Kafka).
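For example, a minimal wrapper script might look like this (the installation paths are illustrative assumptions; adjust them to your machine):
#!/bin/bash
# Illustrative paths; replace with your actual ZooKeeper and Kafka directories.
ZK_HOME=/opt/zookeeper
KAFKA_HOME=/opt/kafka
# Start ZooKeeper first, since the Kafka broker registers itself in ZooKeeper.
"$ZK_HOME/bin/zkServer.sh" start
# Give ZooKeeper a moment to come up, then start the Kafka broker.
sleep 5
"$KAFKA_HOME/bin/kafka-server-start.sh" "$KAFKA_HOME/config/server.properties"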
Kafka's architecture works best on a distributed platform; if you are only experimenting with a pseudo-cluster, you can look at alternative message brokers like HiveMQ or RabbitMQ.
You can find further discussion at: Kafka: Is Zookeeper a must?
Installing a dedicated ZooKeeper cluster is the best practice. You can use it for both HBase and Kafka (just define a different root dir for each in ZooKeeper).
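As a sketch (hostnames are placeholders), both services can share one ensemble by giving Kafka a chroot in server.properties, e.g.:
# Kafka keeps its state under the /kafka chroot instead of the ZooKeeper root.
zookeeper.connect=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/kafka
HBase, for its part, stores its data under the znode set by zookeeper.znode.parent in hbase-site.xml (default /hbase), so the two won't collide.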
On the Kafka-manager GitHub page it is written that:
The minimum configuration is the zookeeper hosts which are to be used for kafka manager state. This can be found in the application.conf file in conf directory. The same file will be packaged in the distribution zip file; you may modify settings after unzipping the file on the desired server.
kafka-manager.zkhosts="my.zookeeper.host.com:2181"
You can specify multiple zookeeper hosts by comma delimiting them, like so:
kafka-manager.zkhosts="my.zookeeper.host.com:2181,other.zookeeper.host.com:2181"
Alternatively, use the environment variable ZK_HOSTS if you don't want to hardcode any values.
ZK_HOSTS="my.zookeeper.host.com:2181"
So my questions are:
Does Kafka-manager already contain Zookeeper when I download it?
Should I install Zookeeper for Kafka Manager separately, or use the already-installed Zookeeper used for Apache Kafka?
How many Zookeeper instances are required for Kafka-Manager?
If I should install a Zookeeper dedicated to Kafka-Manager, is it okay to install it on the same machine on which Kafka-Manager is installed, or should I create another Zookeeper cluster on different machines?
I wonder what is the best practice?
Does Kafka-manager already contain Zookeeper when I download it?
No. It's just a web application. You can use the Zookeeper that's used by Kafka, though.
That should answer the rest of your question...
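For example (hostnames are placeholders), you can point Kafka Manager at the same ensemble your brokers already use via the ZK_HOSTS variable mentioned above:
# Reuse the ZooKeeper ensemble that Kafka already uses; hosts are placeholders.
export ZK_HOSTS="my.zookeeper.host.com:2181,other.zookeeper.host.com:2181"
bin/kafka-manager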
If I already have Kafka running on premises, is Kafka Connect just a configuration on top of my existing Kafka, or does Kafka Connect require its own server/environment separate from that of my existing Kafka?
Kafka Connect is part of Apache Kafka, but it runs as a separate process, called a Kafka Connect Worker. Except in a sandbox environment, you would usually deploy it on a separate machine/node from your Kafka brokers.
Conceptually, it runs separate from your brokers: [diagram: Kafka Connect workers shown as separate processes from the Kafka broker cluster]
You can run Kafka Connect on a single node, or as part of a cluster (for throughput and redundancy).
You can read more here about the installation, configuration, and architecture of Kafka Connect.
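As a quick sketch of the two deployment modes, using the scripts and sample properties files that ship with Apache Kafka:
# Standalone mode: a single worker; connector config is passed as a properties file.
bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties
# Distributed mode: run this on each worker node; connectors are then
# submitted via the worker's REST API instead of properties files.
bin/connect-distributed.sh config/connect-distributed.properties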
Kafka Connect is its own configuration on top of your bootstrap-server's configuration.
For Kafka Connect you can choose between a standalone server or distributed connect servers and you'll have to update the corresponding properties file to point to your currently running Kafka server(s).
Look under {kafka-root}/config and you'll see connect-standalone.properties and connect-distributed.properties.
You'll basically update connect-standalone or connect-distributed properties based on your needs.
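As a rough sketch (broker hostnames/ports are placeholders), the key lines to update in connect-distributed.properties are:
# Point the worker at your existing brokers (placeholder hosts/ports).
bootstrap.servers=k1:9092,k2:9092,k3:9092
# Workers that share this group.id form one Connect cluster.
group.id=connect-cluster
# Directory the worker scans for connector plugin jars.
plugin.path=/usr/share/java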
I am going through Kafka Connect, and I am trying to get the concepts.
Let us say I have a Kafka cluster (nodes k1, k2, and k3) set up and running; now I want to run Kafka Connect workers on different nodes, say c1 and c2, in distributed mode.
A few questions:
1) To run or launch Kafka Connect in distributed mode I need to use the command ../bin/connect-distributed.sh, which is available on the Kafka cluster nodes. So do I need to launch Kafka Connect from one of the Kafka cluster nodes, or does whichever node I launch Kafka Connect from need to have the Kafka binaries so that I can use ../bin/connect-distributed.sh?
2) Do I need to copy my connector plugins to the node (or to all cluster nodes?) from which I do step 1?
3) How does Kafka copy these connector plugins to the worker node before starting the JVM process on it? The plugin is what contains my task code, and it needs to be copied to the worker in order to start the process there.
4) Do I need to install anything on the Connect cluster nodes c1 and c2, such as Java or anything Kafka Connect related?
5) In some places it says to use the Confluent Platform, but I would like to start with Apache Kafka Connect alone first.
Can someone please throw some light on this? Even a pointer to some resources would help.
Thank you.
1) In order to have a highly available kafka-connect service you need to run at least two instances of connect-distributed.sh on two distinct machines that have the same group.id. You can find more details regarding the configuration of each worker here. For improved performance, Connect should be run independently of the broker and Zookeeper machines.
2) Yes, you need to place all your connectors under plugin.path (normally under /usr/share/java/) on every machine on which you plan to run kafka-connect.
3) kafka-connect will load the connectors on startup. You don't need to handle this. Note that if your kafka-connect instance is running and a new connector is added, you need to restart the service.
4) You need to have Java installed on all your machines. For Confluent Platform particularly:
Java 1.7 and 1.8 are supported in this version of Confluent Platform (Java 1.9 is currently not supported). You should run with the Garbage-First (G1) garbage collector. For more information, see the Supported Versions and Interoperability.
5) It depends. Confluent was founded by the original creators of Apache Kafka, and its platform comes as a more complete distribution, adding schema management, connectors, and clients. It also comes with KSQL, which is quite useful if you need to act on certain events. Confluent simply adds on top of the Apache Kafka distribution; it's not a modified version.
The answer given by Giorgos is correct. I ran a few connectors and now I understand it better.
I am just trying to put it differently.
In Kafka Connect there are two things involved: one is the worker and the second is the connector. Below are the details of running Kafka Connect in distributed mode.
A Kafka Connect worker is a Java process on which the connector/connect tasks run. So the first thing is to launch the workers. To run a worker we need Java installed on that machine, the Kafka Connect sh/bat files to launch the worker, and the Kafka libs that the worker will use; for this we simply copy/install Kafka on the worker machine. We also need to copy all the connector and connect-task related jars/dependencies into the "plugin.path" defined in the worker properties file. The worker machine is then ready; to start the worker we invoke ./bin/connect-distributed.sh ./config/connect-distributed.properties, where connect-distributed.properties holds the worker configuration. The same thing has to be repeated on each machine where we need to run Kafka Connect, roughly as sketched below.
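A minimal per-machine setup might look like this (the release archive, connector name, and paths are illustrative assumptions, not fixed values):
#!/bin/bash
# Per-machine setup for a Connect worker; names and paths below are placeholders.
KAFKA_HOME=/opt/kafka
# 1) Java plus the Kafka distribution provide the worker scripts and libraries.
mkdir -p "$KAFKA_HOME"
tar -xzf kafka_2.12-2.3.0.tgz -C "$KAFKA_HOME" --strip-components=1
# 2) Copy the connector jars (and their dependencies) into plugin.path.
mkdir -p /usr/share/java/my-connector
cp my-connector/*.jar /usr/share/java/my-connector/
# 3) Start the worker; connect-distributed.properties must carry the same
#    group.id on every machine that should join this Connect cluster.
"$KAFKA_HOME/bin/connect-distributed.sh" "$KAFKA_HOME/config/connect-distributed.properties"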
Now the worker Java process is running on all machines. The worker config has a group.id property; the workers that share the same value for this property form a group/cluster of workers.
Each worker process exposes a REST endpoint (default http://localhost:8083/connectors). To launch/start a connector on the running workers, we HTTP-POST a connector config JSON; based on the given config, the workers will start the connector and the requested number of tasks across the above group/cluster of workers.
Example connector POST:
curl -X POST -H "Content-Type: application/json" --data '{"name": "local-file-sink", "config": {"connector.class":"FileStreamSinkConnector", "tasks.max":"3", "file":"test.sink.txt", "topics":"connect-test" }}' http://localhost:8083/connectors
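Once submitted, the same REST API (default port 8083, as above) can be used to inspect or remove the connector:
# List connectors known to the Connect cluster.
curl http://localhost:8083/connectors
# Check the state of the connector and its tasks.
curl http://localhost:8083/connectors/local-file-sink/status
# Remove the connector when done.
curl -X DELETE http://localhost:8083/connectors/local-file-sink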
Kafka ships with a ZooKeeper.
Is it OK to use it in production?
bin/zookeeper-server-start.sh
I want to use SASL with Kafka. However, I can't find a way to achieve it with the official ZooKeeper. I did make it work with the ZooKeeper bundled in Kafka. Therefore I want to know if it's OK to use the ZooKeeper that ships with Kafka in a production environment.
Yes, the ZooKeeper that comes bundled with Apache Kafka is great for production use. No need to install any different version of ZooKeeper from anywhere else.
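For reference, the bundled ZooKeeper is started with the script and sample config shipped in the Kafka distribution (run from the Kafka installation directory):
# Start the bundled ZooKeeper first, then the broker; both
# properties files ship with the Apache Kafka distribution.
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties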
I am performing an install via Ambari 1.7 and would like some clarification regarding the ZooKeeper installation. The setup involves three (3) ZooKeeper and three (3) Kafka instances.
The Ambari UI asks me to specify Zookeeper master(s) and Zookeeper clients/slaves. Should I choose all three Zookeeper nodes as masters and install the Zookeeper client on each Kafka server?
Zookeeper doesn't have any master node(s), and I am a little confused by this Ambari master/slave terminology.
Zookeeper Server is considered a MASTER component in Ambari terminology. Kafka has the requirement that Zookeeper Server be installed on at least one node in the cluster. Thus the only requirement you have is to install Zookeeper server on one of the nodes in your cluster for Kafka to function. Kafka does not require Zookeeper clients on each Kafka node.
You can determine all this information by looking at the Service configurations for KAFKA and ZOOKEEPER. The configuration is specified in the metainfo.xml file for each component under the stack definition. The location of the definitions will differ based on the version of Ambari you have installed.
On newer versions of Ambari this location is:
/var/lib/ambari-server/resources/common-services/<service name>/<service version>
On older versions of Ambari this location is:
/var/lib/ambari-server/resources/stacks/HDP/<stack version>/services/<service name>