Installing kafka cluster - apache-kafka

I want to install 2 node Kafka cluster on Amazon EC2.
I follow the steps from this link: https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafka-on-ubuntu-14-04
Also, I want to have ZooKeeper on both nodes, because if I have it on only one node and that node dies, my Kafka cluster dies.
In step 9 (Installing multi-node cluster), they say that I need to modify zookeeper.connect in kafka server properties, so that it has comma separated list of ip:port for each node where zookeeper is installed.
On the other hand, when I want to create a topic, in the script I only specify 1 zookeeper!
1) Will the other zookeeper node know that the topic has been created?
2) In case that 1 zookeeper node fails, will the other one takeover?
3) When the failed node comes back up, will it pick up the information about topics again from the node that stayed alive?
Regards,
Srdjan

You should create a cluster with no less than three nodes. Like Serejja mentioned, it should be odd-numbered for fault-tolerance.
3,5,7,9 etc.
For Kafka, you should specify a --replication-factor when creating the topic. In a three node cluster, it's recommended to set it to two or three.
In this scenario, if one of the brokers goes down, the replicas on the remaining nodes keep the data available, and once the unavailable node comes back online, the data will propagate to it.
The Kafka Documentation is fantastic, and I recommend further reading of the Replication topic.

Related

Kafka setup strategy for replication?

I have two VM servers (say S1 and S2) and need to install Kafka in cluster mode, with a topic that has only one partition and two replicas (one leader and one follower) for reliability.
I got the high-level idea from this cluster setup and want to confirm whether the strategy below is correct.
First, set up ZooKeeper as a cluster on both nodes for high availability (HA). If I set up ZooKeeper on a single node only and that node goes down, the complete cluster will be down, right? Is ZooKeeper still mandatory in the latest Kafka version? It looks like it is a must for older versions: Is Zookeeper a must for Kafka?
Start the kafka broker on both nodes . It can be on same port as it is hosted on different nodes.
Create Topic on any node with partition 1 and replica as two.
zookeeper will select any broker on one node as leader and another as follower
Producer will connect to any broker and start publishing the message.
If the leader goes down, ZooKeeper will select another node as leader automatically. I am not sure how 2 replicas will be maintained then, as only one node is live.
Is above strategy correct ?
Useful resources
ISR
ISR vs replication factor
First, set up ZooKeeper as a cluster on both nodes for high
availability (HA). If I set up ZooKeeper on a single node only and that
node goes down, the complete cluster will be down, right? Is ZooKeeper
still mandatory in the latest Kafka version? It looks like it is a must
for older versions: Is Zookeeper a must for Kafka?
Answer: Yes. ZooKeeper is still required until KIP-500 is released. ZooKeeper is responsible for electing the controller, storing metadata about the Kafka cluster, and managing broker membership. Ideally, a ZooKeeper ensemble should have at least 3 nodes so that it can tolerate one node failure: the 2 healthy nodes still form a majority and can elect a controller. You should also consider running the ZooKeeper ensemble on machines other than the ones Kafka is installed on, so that the failure of one server does not take down both a ZooKeeper node and a Kafka broker.
Start the kafka broker on both nodes . It can be on same port as it is
hosted on different nodes.
Answer: You should first start Zookeeper cluster, then Kafka cluster. Same ports on different nodes are appropriate.
Create Topic on any node with partition 1 and replica as two.
Answer: Partitions are used for horizontal scalability. If you don't need that, one partition is okay. With a replication factor of 2, one node will be the leader and the other the follower at any time. But that is not enough to completely avoid data loss while also providing HA. The ideal configuration for avoiding data loss without compromising HA is at least 3 Kafka brokers, a topic replication factor of 3, min.insync.replicas=2 in the broker config, and acks=all in the producer config.
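To make the arithmetic behind that recommendation concrete, here is a minimal Python sketch. It is only a model of the rule, not Kafka code: the function names are made up for illustration, and it assumes a producer using acks=all, in which case a write is rejected once the in-sync replica set is smaller than min.insync.replicas.

```python
def can_accept_write(in_sync_replicas: int, min_insync_replicas: int) -> bool:
    """Model of the acks=all rule: a write succeeds only while the
    in-sync replica set is at least min.insync.replicas."""
    return in_sync_replicas >= min_insync_replicas


def tolerated_broker_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """Number of replicas of a partition that can be lost while
    acks=all writes are still accepted (hypothetical helper)."""
    return max(replication_factor - min_insync_replicas, 0)


# Compare a two-broker setup with the recommended three-broker setup.
for rf, min_isr in [(2, 1), (2, 2), (3, 2)]:
    print(
        f"replication factor {rf}, min.insync.replicas {min_isr}: "
        f"tolerates {tolerated_broker_failures(rf, min_isr)} failure(s)"
    )
```

With a replication factor of 2 you have to choose: min.insync.replicas=2 stops writes as soon as one broker is down, while min.insync.replicas=1 keeps accepting writes that exist on a single broker only. A replication factor of 3 with min.insync.replicas=2 avoids both problems for a single broker failure.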
zookeeper will select any broker on one node as leader and another as
follower
Answer: Controller broker is responsible for maintaining the leader/follower relationship for all the partitions. One broker will be partition leader and another one will be follower. You can check partition leaders/followers with this command.
bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic my-replicated-topic
Producer will connect to any broker and start publishing the message.
Answer: Yes. Setting only one broker as bootstrap.servers is enough to connect to Kafka cluster. But for redundancy you should provide more than one broker in bootstrap.servers.
bootstrap.servers: A list of host/port pairs to use for establishing
the initial connection to the Kafka cluster. The client will make use
of all servers irrespective of which servers are specified here for
bootstrapping—this list only impacts the initial hosts used to
discover the full set of servers. This list should be in the form
host1:port1,host2:port2,.... Since these servers are just used for the
initial connection to discover the full cluster membership (which may
change dynamically), this list need not contain the full set of
servers (you may want more than one, though, in case a server is
down).
If the leader goes down, ZooKeeper will select another node as leader
automatically. I am not sure how 2 replicas will be maintained then,
as only one node is live.
Answer: If the controller broker goes down, ZooKeeper will select another broker as the new controller. If the broker that leads your partition goes down, one of the in-sync replicas will become the new leader (the controller broker is responsible for this). But of course, if you have just two brokers and one of them is down, there is no second broker left to replicate to. That's why you should have at least 3 brokers in your Kafka cluster.
Yes, ZooKeeper is still needed in Kafka 2.4, but you can read about KIP-500, which plans to remove the dependency on ZooKeeper in the near future and use the Raft algorithm to form the quorum.
As you already understood, if you install ZooKeeper on a single node, it will run in standalone mode and you won't have any resiliency. The classic ZooKeeper ensemble consists of 3 nodes, which allows you to lose 1 ZooKeeper node.
After pointing your Kafka brokers to the right ZK cluster you can start your brokers and the cluster will be up and running.
In your example, I would suggest adding another node to gain better resiliency and meet the replication factor you wanted, while still being able to lose one node without losing data.
Bear in mind that using a single partition means you are limited to a single active consumer per consumer group. The rest of the consumers will be idle.
I suggest reading this blog about Kafka best practices and how to choose the number of topics/partitions in a Kafka cluster.
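The single-partition limit mentioned above can be illustrated with a small Python sketch. It is a simplified round-robin model, not Kafka's actual assignor, and the consumer names are hypothetical; the point it shows is that a partition is owned by exactly one consumer within a group.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Give each partition to exactly one consumer of the group, round-robin.
    Simplified model for illustration; Kafka's real assignors differ in detail."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment: Dict[str, List[int]] = {name: [] for name in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


group = ["consumer-a", "consumer-b", "consumer-c"]  # hypothetical names
assignment = assign_partitions(1, group)
idle = [name for name, partitions in assignment.items() if not partitions]
print(assignment)  # {'consumer-a': [0], 'consumer-b': [], 'consumer-c': []}
print(idle)        # ['consumer-b', 'consumer-c']
```

With three partitions instead of one, the same group would have no idle members, which is why the partition count caps the useful size of a consumer group.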

How to scale Zookeeper with kafka

I am working on scaling the Kafka cluster in prod. Confluent provides an easy way to add Kafka brokers. However, how do I know how to scale ZooKeeper along with Kafka? What should the ratio be? Right now we have 5 ZooKeeper nodes for 5 Kafka brokers. If I have 10 Kafka brokers, how many ZooKeeper nodes should there be?
ZooKeeper works as a coordination service for Apache Kafka and stores the metadata of the Kafka cluster. A ZooKeeper cluster is called an ensemble.
The number of servers in a ZooKeeper ensemble should be odd (3, 5, etc.). This number determines how fault tolerant your cluster is. With a three-node ensemble, you can run with one node missing.
With a five-node ensemble, you can run with two nodes missing and your cluster will still be available.
You can add as many ZooKeeper servers as you need, depending on how many failures the system should survive. However, a ZooKeeper cluster of more than 7 nodes is not recommended because of the added latency and communication overhead between the nodes.

kafka with multiple zookeeper config

A bit confused about clustering setup:
Zookeeper could be setup as a cluster by configuring myid (1,2,3...) in the file and having for example zookeeper1:2888:3888, zookeeper2:2889:3889 in the zoo.cfg file
For Kafka, in the server.properties file, is it a must to specify the full list of ZooKeeper servers for the zookeeper.connect parameter, or is just 1 enough? Are there any differences?
I've seen practices of specifying the full list of zookeeper server even when creating a topic, e.g. /opt/kafka/bin/kafka-topics.sh --create --zookeeper x.x.x.x:2181,x.x.x.x:2181,x.x.x.x:2181 --replication-factor 1 --partitions 1 --topic sample_test
---Production and DR setup (large latency is expected between production and dr)---
Let's say, having 1 Kafka (kafka1) and 1 zookeeper server (zookeeper1) in production, 1 kafka (kafka2) and 1 zookeeper server (zookeeper2) in DR, and form those 2 zookeepers into a cluster;
running uReplicator to replicate data in production to DR; from uReplicator example, it seems the configuration is like: kafka1 (in production) is connecting to "zookeeper1:2181/cluster1", and kafka2 (in DR) is connecting to "zookeeper1:2181/cluster2", what's the meaning of "/cluster1", "/cluster2"? what's the right config for this scenario, what's the idea of having kafka2 in DR connects to zookeeper1 in prod?
is it a must to specify the full list of ZooKeeper servers for the zookeeper.connect parameter
It is good practice to put at least 3 or 5. If you only put one, and that goes down, Kafka will likely not work as expected, or fail out.
in DR, and form those 2 zookeepers into a cluster
It's generally not encouraged to share ZooKeeper clusters between Kafka clusters, as high-volume Kafka clusters put a reasonable amount of load on ZooKeeper.
Though, as you point out
connecting to "zookeeper1:2181/cluster1", and kafka2 (in DR) is connecting to "zookeeper1:2181/cluster2", what's the meaning of "/cluster1", "/cluster2"?
This is called a Chroot in Zookeeper. Think of it like a directory, or namespace for each unique Kafka cluster within the Zookeeper cluster.
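As a rough illustration of how such a connection string breaks down, the Python sketch below splits a zookeeper.connect value into its server list and its chroot path. The helper name is made up, and it assumes the usual form host1:port1,host2:port2/chroot, where the chroot is written once at the end of the string.

```python
from typing import List, Tuple


def split_zookeeper_connect(connect: str) -> Tuple[List[str], str]:
    """Split a zookeeper.connect value into (servers, chroot).
    Assumes the chroot, if any, appears once after the last host:port."""
    hosts_part, slash, chroot = connect.partition("/")
    servers = [s.strip() for s in hosts_part.split(",") if s.strip()]
    # Without an explicit chroot, the cluster lives at the ZooKeeper root.
    return servers, "/" + chroot if slash else "/"


print(split_zookeeper_connect("zookeeper1:2181/cluster1"))
# (['zookeeper1:2181'], '/cluster1')
print(split_zookeeper_connect("zookeeper1:2181,zookeeper2:2181/cluster2"))
# (['zookeeper1:2181', 'zookeeper2:2181'], '/cluster2')
```

Two Kafka clusters that use the same servers but different chroots, such as /cluster1 and /cluster2, keep their metadata in separate subtrees of the same ZooKeeper ensemble.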
what's the idea of having kafka2 in DR connects to zookeeper1 in prod?
Well, you wouldn't. If Kafka2 has its own unique topic data that is not being replicated to Kafka1, then pointing at ZooKeeper data that says those topics exist on Kafka2 but not on Kafka1 will only result in confusion and errors.
Also, I don't know how uReplicator works as opposed to MirrorMaker, but you'll also want to prepare a DR strategy for ZooKeeper, not only Kafka.
You have two questions in there. I'll try to tackle the first one at least:
Specifying only one zookeeper server:port is usually enough, but in production instances/properties, you always want to configure all of them. If one of the servers is down, but the cluster is still up and running (say, 2 out of 3 Zookeeper servers are up), Kafka will try the next server in the config, until it finds one it can talk to. However, if the only one you chose to put happens to be down at that exact time, the server won't be able to talk to Zookeeper at all. It's best to always include the entire list of zookeeper servers in configuration.
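The difference between listing one server and listing all of them can be sketched in a few lines of Python. This is a deliberately simplified model: the reachability check is a stand-in function supplied by the caller, the host names are hypothetical, and the real ZooKeeper client's ordering and retry behavior may differ.

```python
from typing import Callable, Optional


def first_reachable(connect: str, is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first server in the comma-separated list that answers,
    or None if every listed server is down (simplified model)."""
    for server in connect.split(","):
        server = server.strip()
        if server and is_up(server):
            return server
    return None


down = {"zk1:2181"}  # pretend this server is currently unavailable


def is_up(server: str) -> bool:
    return server not in down


print(first_reachable("zk1:2181", is_up))                    # None
print(first_reachable("zk1:2181,zk2:2181,zk3:2181", is_up))  # zk2:2181
```

The ensemble itself is healthy in both cases (2 of 3 servers are up), but only the full list lets the broker find a server it can talk to.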

Kafka- How to automatically use the second cluster when the first cluster is down?

I am trying to replicate data from one Kafka cluster to another using MirrorMaker. If the master cluster is down, is it possible to automatically send the Kafka messages to the second cluster? And is it also possible to automatically synchronise cluster 1 with cluster 2 when cluster 1 is up again, with little manual intervention?
Any help is highly appreciated.
I think you meant to ask how to maintain copies between Kafka brokers, that together are considered to be a Kafka Cluster.
If that's the case, it's pretty simple: configure a Kafka cluster and create a topic with a replication factor equal to the number of nodes in the cluster.
For example:
Let's say we want 3 brokers in our Kafka cluster. You'll need to prepare a different configuration file for each broker, start them up as a cluster, and then create a topic with a replication factor of 3.
Kafka will be responsible for maintaining the Fault Tolerance.
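As a sketch of what "a different configuration file for each broker" means in practice, the Python snippet below prints the handful of server.properties entries that typically differ per broker. The property names (broker.id, listeners, log.dirs, zookeeper.connect) are standard Kafka broker settings; the host names, paths, and the helper function are assumptions made for illustration.

```python
def broker_properties(broker_id: int, zookeeper_connect: str, port: int = 9092) -> str:
    """Build the per-broker part of a server.properties file.
    Paths and ports here are illustrative defaults, not requirements."""
    lines = [
        f"broker.id={broker_id}",                 # must be unique per broker
        f"listeners=PLAINTEXT://:{port}",         # same port is fine on separate hosts
        f"log.dirs=/tmp/kafka-logs-{broker_id}",  # hypothetical data directory
        f"zookeeper.connect={zookeeper_connect}", # identical on every broker
    ]
    return "\n".join(lines) + "\n"


zk = "zk1:2181,zk2:2181,zk3:2181"  # hypothetical ZooKeeper ensemble
for broker_id in (1, 2, 3):
    print(f"# server-{broker_id}.properties")
    print(broker_properties(broker_id, zk))
```

Once the three brokers are running with these files, a topic created with a replication factor of 3 gets one replica on each broker.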
For further info on actually doing the configuration, watch these videos on YouTube:
https://www.youtube.com/channel/UCDLPjuuYHxPbHdN8RXxrGdw

zookeeper failover for kafka cluster

I am wondering whether there is any way to set up ZooKeeper failover for a Kafka cluster.
For example, I want to set up 2 ZooKeeper instances for my Kafka cluster. If one ZooKeeper instance fails, the Kafka servers should still be able to read topic metadata from the second one.
Any advice is highly appreciated.
Zookeeper works as a so-called quorum – a cluster of nodes that forms a consensus based on simple majority votes.
For production, you should use 3 or 5 Zookeeper instances in a quorum.
If you're using 3, your cluster can survive losing one server (because the remaining two form a simple majority). With 5, you can lose two servers because 3 is a majority of 5.
2 is a bad idea because your cluster won't work if 1 node goes down.
Please check this question
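The majority arithmetic above is easy to check with a short Python sketch. The function names are made up for illustration; the only rule assumed is the one stated in this answer, namely that the ensemble stays available while a simple majority of its nodes is up.

```python
def majority(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a simple majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Nodes that can fail while a majority is still available."""
    return ensemble_size - majority(ensemble_size)


for size in (1, 2, 3, 4, 5):
    print(
        f"{size} node(s): majority = {majority(size)}, "
        f"can lose {tolerated_failures(size)}"
    )
```

The output shows why even sizes add nothing: 2 nodes tolerate no failures, just like 1 node, and 4 nodes tolerate a single failure, just like 3.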
$KAFKA_HOME/config/server.properties
Here you can set multiple ZooKeeper servers:
zookeeper.connect=<server1>:2181,<server2>:2181,<server3>:2181
Follow the 2n+1 (quorum) rule for the number of ZooKeeper nodes.