Kafka on Multiple Servers - apache-kafka

I followed this link to install Kafka + Zookeeper. It all works well, yet I am setting up Kafka + Zookeeper on 2 servers.
I have setup the kafka/config/server.properties to have:
Server 1: broker.id = 0
Server 1: zookeeper.connect = localhost:2181,99.99.99.91:2181
Server 2: broker.id = 1
Server 2: zookeeper.connect = localhost:2181,99.99.99.92:2181
I am wondering the follwing:
When I publish a topic, does it go to both Instances, or just the server it's loaded on?
In order to use multiple servers like this, would I be required to use something like HAProxy with say 3 servers?
Is there anything important I am missing with using 3 servers?
Thanks for any answers.

At the very beginning, I would be assuming that the instance mentioned here refers to a Kafka server instance.
Question #1: Information for the new-created topics will be stored in Zookeeper and some key information will be loaded into all brokers' metadata cache buffer.
Question #2: No need to configure any kind of proxy servers for Kafka cluster. It's self managed to implement the fail-over and load-balancing.
Question #3: Assigning an unused id and data directory like what you did for those two brokers is enough.
At last but not least, due to the fact that ZAB requires a majority, it is best to use an odd number of machines as the zookeeper quorum.

Related

Snowflake Kafka connector doubts and questions

I am using 3 server cluster for the Kafka Configuration, with Snowflake connector REST API to push the data to Snowflake database: All are 3 different VMs running on AWS
1.In this, does we require 3 kafka individual server zookeeper-services needs to be up and running in cluster else only 1 is enough, as if it needs to be executed in all the 3 servers zookeeper services, does it require different port configurations like for ex:
1.a:zookeeper.connect=xx.xx.xx.xxx:2181, xx.xx.xx.xxx:2182, xx.xx.xx.xxx:2183 else it should be 2181 in all the servers.properties file
1.b:PLAINTEXT://localhost:9091 in server1, PLAINTEXT://localhost:9092 and PLAINTEXT://localhost:9093 (Even in this it should be localhost else IP Address) that needs to be given?
1.c:server.1=<zookeeper_1_IP>:2888:3888, server.1=<zookeeper_2_IP>:2888:3888, server.1=<zookeeper_3_IP>:2888:3888 (Over here on each server the 2888:3888 needs to be same right?)
1.d:Clientport=2181 needs to be the same across the services in all 3 VMs else it needs to be different?
1.e:Does the listeners = PLAINTEXT://your.host.name:9092 on each server should have separate port like
VM-Server1:9092, VM-Server2:9093, VM-Server3:9094. Else the master server-IP should be given in the worker-nodes that is Server2 and Server3 else the own server IP of that worker-node
What should be the configuration for connector in regards with REST-API for the configuration item "tasks.max":"1". As I am going with 3 server cluster for Kafka and would be starting the 3 distribute-connector on all the 3 machines
I am getting duplicates, if I am starting the services of distributed connector in the 2nd server, how these duplicate records can be avoided. But yes if its only 1 distributed-connector is running the services, then there are no duplicates. Please advice, as the lag gets increased if only 1 distributed-connector services is up and running.
Create /data/zookeeper/myid file and give value 1 for zookeeper1 , 2 for zookeeper2 and 3for zookeeper3. Is this necessary when you are in different VM?
The distributed-connector services once started executing for sometime and then it gets disconnected
Any other parameter for the 3 server cluster architecture and best practices which needs to be followed
Kafka and Zookeeper
You only need one Kafka broker and Zookeeper server, although having more would provide fault tolerance. You don't need to manually create anything in Zookeeper such as myid files.
The ports don't need to be the same, but it is obviously easier to draw a network diagram and automate the configuration if they are.
Regarding Kafka listeners, read this post. For Zookeeper, follow its documentation if you want to create a cluster.
Or use Amazon MSK / Confluent Cloud, etc. instead of EC2, and this is all done for you.
Kafka Connect
tasks.max can be as much as you want, but if you have a source connector, then multiple threads will probably cause duplicates, yes.

What is the difference between Kafka Cluster and Kafka Broker?

Has Kafka cluster and Kafka broker the same meaning?
I know cluster has multiple brokers (Is this wrong?).
But when I write code to produce messages, I find awkward option.
props.put("bootstrap.servers", "kafka001:9092, kafka002:9092, kafka003:9092");
Is this broker address or cluster address? If this is broker address, I think it is not good because we have to modify above address when brokers count changes.
(But it seems like broker address..)
Additionally, I saw in MSK in amazon, we can add broker to each AZ.
It means, we cannot have many broker. (Three or four at most?)
And they guided we should write this broker addresses to bootstrap.serveroption as a,` seperated list.
Why they don't guide us to use clusters address or ARN?
A Kafka cluster is a group of Kafka brokers.
When using the Producer API it is not required to mention all brokers within the cluster in the bootstrap.servers properties. The Producer configuration documentation on bootstrap.servers gives the full details:
A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,.... Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).
All brokers within a cluster share meta information of other brokers in the same cluster. Therefore, it is sufficient to mention even only one broker in the bootstrap-servers properties. However, you should still mention more than one in case of the one broker being not available for whatever reason.

How to handle failure senario for kafka and zookeeper in kubernetes

What I have zookeeper setup which is running on server1, server2 and server3 and similarly kafka also running in server1, server2 and server3.
Setup are running in kubernetes.
Problem statement:
In case one zookeeper setup get down entire setup will get down, because kafka is depended to zookeeper. am i right?
If Q1 correct - Is there any way to make setup like if one zookeeper server will get down then kafka should run as it is?
How to expose kafka port in kubernetes setup ?
what is the recommended way to persist data in kubernetes for production server ?
I fail to see how Zookeeper questions are related to k8s... But you definitely should set affinity rules such that Zookeeper and Kafka are not on the same physical servers or sharing same disks
If one Zookeeper out of three goes down, you'll end up with a split brain event in that no single Zookeeper knows which should be responsible for leadership. This effectively can crash or corrupt Kafka, yes.
To mitigate that risk, you can choose to run 5 Zookeepers, in which case you can lose up to 3 servers to reach the same state. The Definitive Guide book covers these concepts in the first few chapters
Regarding the other questions - NodePorts and PVCs, generally speaking.
Use one of the popular Kafka Operators on Github and you'll not need to think too hard about setting those properties
You still must manually perform Kafka admin tasks in any installation... You can use extra services like Cruise Control if you want to reduce that workload, though

Building a Kafka Cluster using two servers only

I'm planning to build a Kafka Cluster using two servers, and host Zookeeper on these two servers as well.
The Question is, since Kafka requires Zookeeper to run, what is the best cluster build for zookeeper to implement Kafka Cluster on two servers?
for eg. I'm currently running two zookeepers on both servers and one Kafka on each server, and in the Kafka configuration they point to all Zookeepers.
Is there a better way to do this?
First of all, you don't have to setup Zookeper and Kafka in the same server. One of the roles of Zookeeper is electing controller. (one of the brokers which is responsible for maintaining the leader/follower relationship for all the partitions) For election; majority of Zookeper nodes must be alive. In your case even one Zookeeper instance is down, you cannot select controller. So there is no difference between having one Zookeper or two. That's why it is recommended to have at least 3 nodes in Zookeeper cluster. By this way you can handle failure of one Zookeeper node.
An addition to this, it is highly recommended to have at least three brokers in your Kafka cluster to maintain both consistency and high availability. (link1, link2)
UPDATE:
As long as you are limited to only two servers, then you can consider sacrificing from high availability by set up your broker by setting min.insync.replicas=2 and having topics with replication.factor=2. If HA is more important than data loss, then you can use min.insync.replicas=1 (default) broker config with again topic replication.factor=2. In this circumstance, your options are these IMHO. (Having one or two Zookeepers is not important as I mentioned above)
I am often faced with the same problem as you do #frisky5 where i would like to achieve a "suboptimal" HA system using only 2 nodes, and thus workarounds are always needed with cloud-native frameworks that rely on the assumption that clusters will have lot of nodes available.
That ain't always the case in real life, is it ;) ?
That being said, i see you essentially having 2 options:
Externalize zookeeper configuration on a replicated storage system using 2 nodes (e.g. DRBD)
Replicate Kafka data volumes entirely on the second nodes and use 2 one-node Kafka clusters that you switch on and off depending on who is the current master node.
I would go for the first option. In that case you would have 2 Kafka servers and one zookeeper server whose ip needs to be static (virtual ip). When the zookeeper node goes down, it is restarted one the second node with same VIP, but it needs to access the synchronized data folder.
I am not too familiar with zookeepers internals and i can't tell you whether it will go in conflict when starting up on a data store who "wasn't its own" but i would guess it makes sense for you to test it using a simple rsync setup.
Another way to achieve consensus if you are using a k3s based kubernetes cluster would be to rely on internal k8s distributed consensus mechanics to "tell Kafka" which node is the leader. This works for the postgresoperator by chruncydata because Patroni is cool ( https://patroni.readthedocs.io/en/latest/kubernetes.html ) 😎 but i am not sure if Kafka/zookeeper are that flexible and can communicate with a rest API to set their locks ...
Once you have achieved this intermediate step, then you can use a PostgreSQL db as external source of truth for k3s and then it is as simple as syncing the postgres data folder between the machines (easily done with rsync). The beauty of this approach is that it is way more generic and could be used for other systems too.
Let me know what do you think about these two approaches and whether you manage to setup a test environment. If you do on GitHub i can help you out with implementation

Kafka in distributed system

I am new to kafka , i am running kafka in a single machine as of now. I want to run kafka in an distributed environment on multiple machines. There is no proper documentation for this. Any documentation or suggestion on this will be really helpful.
Adding on to the previous answer by user2720864
Let us assume that Kafka system with below configuration is needed.
7 Kafka nodes
3 Zoo keepers
To achieve this install 7 Kafka instances, in 7 different server/vm(instances), and in each of these instances set a different broker-id, this will let the zookeeper identify the different kafka nodes for bookkeeping, maintenance.
broker.id=X (/config/server.properties)
To start zookeepers, you can use 3 of the previous kafka instances or can use new servers to start zookeepers. Once the servers on which zookeepers run are decided, change the /config/server.properties to specify zookeepers.
zookeeper.connect=hostname1:port1,hostname2:port2
In a distributed environment its nice to have 3 zoo keepers. While there is only one zookeeper which acts as a true master, other 2 zookeepers act as fail overs. When the master fails one of the two ZKs will take over as master.
I found this link to be very useful, it helped me clarify a lot of things about kafka architecture.
This is a good reference for all the configurations on the property files in kafka.
Hope this helps!
Basically you need to do the follwing
1) Set up kafka on all the machines
2) Configure the config/server1.properties properties file to specify an unique id for each machines. You can do that by setting the broker.id properties in the config file. e.g. broker.id=1, broker.id=2. For every brokers this id should be unique. This is how every node is identified in a kafka cluster.
3) Start kafka in all nodes
You can refer Step 6: Setting up a multi broker cluster from their official quick start page.
Also here is a nice article worth taking a look