Snowflake Kafka connector doubts and questions

I am running a 3-server Kafka cluster and using the Snowflake connector, configured through the Kafka Connect REST API, to push data to a Snowflake database. The three servers are separate VMs running on AWS.
1. Do the ZooKeeper services need to be up and running on all 3 Kafka servers, or is 1 enough? If ZooKeeper has to run on all 3 servers, does each one need a different port configuration, for example:
1.a: zookeeper.connect=xx.xx.xx.xxx:2181, xx.xx.xx.xxx:2182, xx.xx.xx.xxx:2183, or should it be 2181 in every server.properties file?
1.b: PLAINTEXT://localhost:9091 on server 1, PLAINTEXT://localhost:9092 on server 2 and PLAINTEXT://localhost:9093 on server 3? And should this be localhost or the IP address?
1.c: server.1=<zookeeper_1_IP>:2888:3888, server.1=<zookeeper_2_IP>:2888:3888, server.1=<zookeeper_3_IP>:2888:3888 (the 2888:3888 part needs to be the same on each server, right?)
1.d: Does clientPort=2181 need to be the same for the services on all 3 VMs, or does it need to be different?
1.e: Should listeners = PLAINTEXT://your.host.name:9092 use a separate port on each server, like
VM-Server1:9092, VM-Server2:9093, VM-Server3:9094? And on the worker nodes (Server2 and Server3), should I give the master server's IP or the worker node's own IP?
What should the connector configuration submitted through the REST API contain for the item "tasks.max":"1"? I am going with a 3-server Kafka cluster and would be starting the distributed connector on all 3 machines.
I get duplicates as soon as I start the distributed connector service on the 2nd server. How can these duplicate records be avoided? If only 1 distributed connector is running, there are no duplicates, but then the lag increases. Please advise.
Create a /data/zookeeper/myid file with the value 1 for zookeeper1, 2 for zookeeper2 and 3 for zookeeper3. Is this necessary when the ZooKeeper servers are on different VMs?
Once started, the distributed connector service runs for some time and then gets disconnected.
Are there any other parameters or best practices that should be followed for a 3-server cluster architecture?

Kafka and Zookeeper
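You only need one Kafka broker and one ZooKeeper server, although having more would provide fault tolerance. With a single standalone ZooKeeper server you don't need to manually create anything such as a myid file; a myid file is only required on each server of a multi-server ZooKeeper ensemble, where its value must match that server's server.N entry.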
The ports don't need to be the same, but it is obviously easier to draw a network diagram and automate the configuration if they are.
Regarding Kafka listeners, read this post. For Zookeeper, follow its documentation if you want to create a cluster.
Or use Amazon MSK / Confluent Cloud, etc. instead of EC2, and this is all done for you.
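To make the port and listener points above concrete, here is a minimal Python sketch that renders the relevant entries for three VMs. The host names are hypothetical placeholders, the same ports are reused on every VM (which is fine because each VM has its own address), and only the entries discussed in the question are shown; a real zoo.cfg also needs settings such as tickTime, initLimit and syncLimit.

```python
# Hypothetical host names: substitute the private DNS names or IPs of your VMs.
HOSTS = ["vm1.example.internal", "vm2.example.internal", "vm3.example.internal"]


def broker_properties(index: int, hosts: list[str]) -> dict[str, str]:
    """Return server.properties entries for the broker running on hosts[index]."""
    host = hosts[index]
    return {
        # Must be unique per broker.
        "broker.id": str(index + 1),
        # Bind on all interfaces; the same port on every VM is fine.
        "listeners": "PLAINTEXT://0.0.0.0:9092",
        # Address handed to clients: this VM's own name, not localhost
        # and not the address of some "master" node.
        "advertised.listeners": f"PLAINTEXT://{host}:9092",
        # Identical on every broker: the full ZooKeeper ensemble.
        "zookeeper.connect": ",".join(f"{h}:2181" for h in hosts),
    }


def zookeeper_properties(hosts: list[str]) -> dict[str, str]:
    """Return the zoo.cfg entries that are shared by every ensemble member."""
    props = {"clientPort": "2181", "dataDir": "/data/zookeeper"}
    for n, h in enumerate(hosts, start=1):
        # One line per ensemble member, each with its own number.
        props[f"server.{n}"] = f"{h}:2888:3888"
    return props


def render(props: dict[str, str]) -> str:
    """Format a dict as key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


if __name__ == "__main__":
    for i, host in enumerate(HOSTS):
        print(f"# server.properties on {host}")
        print(render(broker_properties(i, HOSTS)))
        print()
    print("# zoo.cfg entries (same on every VM)")
    print(render(zookeeper_properties(HOSTS)))
```

Only broker.id and advertised.listeners differ between the three VMs in this sketch; everything else is the same file on each machine.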
Kafka Connect
tasks.max can be as high as you want, but if you have a source connector, then multiple tasks (threads) will probably cause duplicates, yes.
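For reference, tasks.max is part of the connector configuration that is sent to the Connect REST API, not a worker setting. The sketch below only builds such a request with the Python standard library and does not send it. The connector name, topic name and worker URL are hypothetical, 8083 is the default Connect REST port, and the Snowflake-specific settings (account URL, user, key, database, schema) are deliberately left out; the connector class name should be checked against the Snowflake documentation for your version.

```python
import json
import urllib.request


def connector_payload(name: str, topics: list[str], tasks_max: int) -> dict:
    """Build the JSON body for POST /connectors."""
    return {
        "name": name,
        "config": {
            # Class name as published by Snowflake; verify for your version.
            "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
            # Upper bound on the number of tasks spread across the workers.
            "tasks.max": str(tasks_max),
            "topics": ",".join(topics),
            # Snowflake connection settings intentionally omitted here.
        },
    }


def build_request(worker_url: str, payload: dict) -> urllib.request.Request:
    """Create (but do not send) the HTTP request that registers the connector."""
    return urllib.request.Request(
        url=f"{worker_url}/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    payload = connector_payload("snowflake-sink", ["orders"], tasks_max=3)
    request = build_request("http://localhost:8083", payload)
    print(request.get_method(), request.full_url)
    print(json.dumps(payload, indent=2))
```

In distributed mode the request would be sent once, to any one worker (for example with urllib.request.urlopen(request)); the workers in the same Connect cluster then share the tasks among themselves.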

Related

Configuring kafka connect with multi brokers

Steps
I have two Kafka brokers, and I have started the ZooKeeper, Kafka server and Kafka Connect services.
I have one source connector, which reads data from a database.
If I start the connector [connector 1] through the REST API, the load balancer sends the request to one of the Kafka servers [server 1]. Server 1 then stores and runs the connector, but server 2 does not know about connector 1, which is running on server 1.
Expectation
If Kafka server 1 goes down, Kafka server 2 should be able to run the connector that was running on the failed server 1.
When a connector is started, the Kafka servers should know how many connectors are running, so that if one broker fails to do the job, another server can continue it.
Reality
Kafka server 2 does not take over the job as required.
Is there any way to achieve this through Kafka configuration?
Kindly suggest some ideas.
Kafka Server 1
Kafka Server 2
It appears that you have started all the processes in a single pod.
You should run Kafka, ZooKeeper, and Connect as separate services in different pods.
I suggest you refer to the Confluent or Strimzi sites to find Kafka Kubernetes Helm Charts / Operators.
But to answer the question: you can give one or more brokers in the bootstrap.servers value of connect-distributed.properties. Connect then discovers every broker in the Kafka cluster from those addresses, and will reconnect in the event that one broker is unavailable.
"Kafka servers" (brokers) do not run connectors; Kafka Connect workers do.
If you want to run a cluster of Connect workers, you also need to set rest.advertised.host.name (and rest.advertised.port, if needed) on each worker so that they can communicate with each other.
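As a rough illustration of the worker settings mentioned above, the following Python sketch renders the entries of connect-distributed.properties that matter for forming one Connect cluster. The host names, group id and topic names are hypothetical examples; converters and other required settings are omitted.

```python
def worker_properties(worker_host: str, brokers: list[str],
                      group_id: str = "connect-cluster") -> dict[str, str]:
    """Return connect-distributed.properties entries for one Connect worker."""
    return {
        # List several brokers so the worker can bootstrap even if one is down.
        "bootstrap.servers": ",".join(brokers),
        # Workers with the same group.id and storage topics form one cluster.
        "group.id": group_id,
        "config.storage.topic": "connect-configs",
        "offset.storage.topic": "connect-offsets",
        "status.storage.topic": "connect-status",
        # The address the other workers use to reach this worker's REST API.
        "rest.advertised.host.name": worker_host,
        "rest.advertised.port": "8083",
    }


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    brokers = ["kafka-1.example.internal:9092", "kafka-2.example.internal:9092"]
    for host in ["connect-1.example.internal", "connect-2.example.internal"]:
        print(f"# connect-distributed.properties on {host}")
        for key, value in worker_properties(host, brokers).items():
            print(f"{key}={value}")
        print()
```

With settings like these, a connector created through either worker's REST API is stored in the shared config topic, so the surviving worker can pick up its tasks when the other one stops.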

How to handle failure scenario for Kafka and ZooKeeper in Kubernetes

I have a ZooKeeper setup running on server1, server2 and server3, and Kafka is likewise running on server1, server2 and server3.
The setup runs in Kubernetes.
Problem statement:
If one ZooKeeper server goes down, the entire setup goes down, because Kafka depends on ZooKeeper. Am I right?
If Q1 is correct, is there any way to set things up so that Kafka keeps running when one ZooKeeper server goes down?
How do I expose the Kafka port in a Kubernetes setup?
What is the recommended way to persist data in Kubernetes on a production server?
I fail to see how Zookeeper questions are related to k8s... But you definitely should set affinity rules such that Zookeeper and Kafka are not on the same physical servers or sharing same disks
If one ZooKeeper out of three goes down, the remaining two still form a majority, so the ensemble keeps working and Kafka is unaffected. If a second one goes down, quorum is lost: no ZooKeeper can be elected leader, and this effectively can crash or stall Kafka, yes.
To mitigate that risk, you can choose to run 5 ZooKeepers, in which case you have to lose 3 servers to reach the same state. The Definitive Guide book covers these concepts in the first few chapters.
Regarding the other questions - NodePorts and PVCs, generally speaking.
Use one of the popular Kafka Operators on Github and you'll not need to think too hard about setting those properties
You still must manually perform Kafka admin tasks in any installation... You can use extra services like Cruise Control if you want to reduce that workload, though

How many ZooKeeper servers do we need in order to support 18 Kafka machines

We have 6 Kafka machines (they are physical machines, Dell hardware)
and 3 ZooKeeper servers.
We want to add 12 Kafka machines to the cluster.
In that case, how many ZooKeeper servers should there be
in order to support 18 Kafka machines?
Well, your question was tagged with Hadoop, but for Kafka alone, 3 will "work", but 5-7 is "better".
But these should be dedicated ZooKeeper servers for Kafka, not shared with Hadoop services such as the NameNode, Hive, HBase, etc., especially at the level of 30+ Hadoop servers. This is because ZooKeeper is very latency sensitive and needs lots of memory to handle these types of processes.
This can easily be done in Ambari with specific server configs, without letting Ambari use its templates to populate the single ZooKeeper quorum that it tracks (finding that setting in every service is painful enough that it's really worth not using Ambari for configs at all, and using Puppet or Ansible, etc. instead, but I digress).
Keep in mind, your cluster will be entirely unbalanced, with all existing data on the original third of the brokers: adding brokers will not move existing data or cause replicas to get assigned to the new brokers for existing topics.
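Moving existing partitions onto the new brokers is a separate step, typically done with the kafka-reassign-partitions tool and a reassignment JSON file. As a minimal sketch of what such a file contains, the Python below generates a simple round-robin plan. The topic name, partition count and broker ids are made-up values, and the placement logic is only an illustration, not what Kafka or Cruise Control would compute.

```python
import json


def round_robin_plan(topic: str, partitions: int, replication_factor: int,
                     broker_ids: list[int]) -> dict:
    """Build a reassignment plan that spreads replicas evenly over broker_ids."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    entries = []
    for partition in range(partitions):
        # Consecutive brokers, starting one position further for each partition.
        replicas = [
            broker_ids[(partition + offset) % len(broker_ids)]
            for offset in range(replication_factor)
        ]
        entries.append({"topic": topic, "partition": partition, "replicas": replicas})
    return {"version": 1, "partitions": entries}


if __name__ == "__main__":
    # Hypothetical topic spread over brokers 1..18 (6 old + 12 new).
    plan = round_robin_plan("events", partitions=18, replication_factor=3,
                            broker_ids=list(range(1, 19)))
    print(json.dumps(plan, indent=2))
```

The printed JSON is in the format accepted by the tool's --reassignment-json-file option; the tool can also propose a plan itself, and Cruise Control automates the same job.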

Can I query any zookeeper node to get any data?

I have a small zookeeper cluster of 3 nodes. I also have another software that needs to be configured to talk to zookeeper, also running in a cluster of 3 nodes, on the same host.
I don't know anything about how zookeeper works. Do I have to configure this other software to talk to all hosts, or should it work to just configure it to talk to localhost zookeeper?
Put another way, can I query any ZooKeeper node to get any data?
If you have a ZooKeeper cluster, you can query any ZooKeeper node and get eventually consistent data.
For how ZooKeeper works, you can check this awesome post: Explaining Apache ZooKeeper
Lots of good projects use ZooKeeper as a backbone, such as HBase and Kafka; search for them and learn from those projects for more detail.

Kafka on Multiple Servers

I followed this link to install Kafka + ZooKeeper. It all works well, and I am now setting up Kafka + ZooKeeper on 2 servers.
I have setup the kafka/config/server.properties to have:
Server 1: broker.id = 0
Server 1: zookeeper.connect = localhost:2181,99.99.99.91:2181
Server 2: broker.id = 1
Server 2: zookeeper.connect = localhost:2181,99.99.99.92:2181
I am wondering the following:
When I publish a topic, does it go to both Instances, or just the server it's loaded on?
In order to use multiple servers like this, would I be required to use something like HAProxy with say 3 servers?
Is there anything important I am missing with using 3 servers?
Thanks for any answers.
To begin with, I am assuming that "instance" here refers to a Kafka server (broker) instance.
Question #1: Information about newly created topics is stored in ZooKeeper, and the key metadata is loaded into every broker's metadata cache.
Question #2: There is no need to configure any kind of proxy server for a Kafka cluster. It manages failover and load balancing itself.
Question #3: Assigning an unused id and data directory to each broker, as you did for those two, is enough.
Last but not least, because ZAB requires a majority, it is best to use an odd number of machines for the ZooKeeper quorum.
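The majority requirement can be checked with a little arithmetic. This small Python sketch (function names are my own) shows why an even-sized ensemble buys nothing over the next smaller odd size.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} servers: quorum {quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Three servers tolerate one failure and so do four; five tolerate two and so do six. The extra even-numbered machine adds cost without adding fault tolerance.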