I am having difficulty understanding how the leader/follower mechanism works. Let's say I am building a distributed application with 2 master nodes, 6 slave nodes, and 3 ZooKeeper nodes, with one ZooKeeper node being the leader and, of the 2 master nodes, 1 being active and connected to the ZooKeeper leader.
My questions are:
Are my master nodes called masters just because they are connected to the ZooKeeper leader, i.e. is my node called a master because its znode is connected to the ZooKeeper leader?
Does a leader election happen when the leader ZooKeeper node dies? How will it impact our master: will our master be connected to the newly elected leader?
If our application's master node dies, will the standby master node be notified if it watches the master's znode? If so, is it enough for our standby node to have an ephemeral sequential node, or is there anything else we need to do to make it the active master?
The ZooKeeper documentation says that writes go only through the leader, which broadcasts them to the follower nodes, and that reads are served directly by the follower nodes.
Does this have any relation to the read and write design of my application? That is, I intend to have writes go through my master and reads go through my slaves. Does ZooKeeper's broadcasting have anything to do with that, or are ZooKeeper's writes completely separate from the application's writes?
Sorry if anything I asked doesn't make sense; please help me understand. Any resources that explain this would be very helpful.
Assuming that you are using Curator to elect the master, I will explain how master election works in the Curator recipe; that should answer your questions.
Master election uses two features of ZooKeeper: ephemeral nodes and sequential nodes.
The app node that gets the lowest sequence number is elected master, and its session becomes the ephemeral owner of that znode.
After your master app node dies, ZooKeeper deletes its ephemeral znode and notifies all the nodes that are watching that znode.
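To make the idea concrete, here is a minimal in-memory sketch of the lowest-sequence-number rule. It does not talk to a real ZooKeeper and is not Curator's implementation; `FakeZooKeeper`, the `candidate-` prefix, and the session names are hypothetical and exist only for illustration.

```python
class FakeZooKeeper:
    """In-memory stand-in for ephemeral sequential znodes (not a real client)."""

    def __init__(self):
        self._next_seq = 0
        self.znodes = {}  # znode name -> owning session

    def create_ephemeral_sequential(self, session):
        # The sequence suffix grows monotonically, like a sequential znode.
        name = f"candidate-{self._next_seq:010d}"
        self._next_seq += 1
        self.znodes[name] = session
        return name

    def close_session(self, session):
        # Ephemeral znodes disappear when their owning session ends.
        self.znodes = {n: s for n, s in self.znodes.items() if s != session}

    def current_master(self):
        # The session that owns the lowest sequence number is the master.
        if not self.znodes:
            return None
        return self.znodes[min(self.znodes)]


zk = FakeZooKeeper()
zk.create_ephemeral_sequential("app-1")
zk.create_ephemeral_sequential("app-2")
first_master = zk.current_master()   # app-1 holds the lowest number
zk.close_session("app-1")            # the master dies, its znode is deleted
second_master = zk.current_master()  # the standby now holds the lowest number
print(first_master, second_master)
```

In a real deployment the standby would learn about the deletion through a watch rather than by polling `current_master()`.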
Related
I am reading an article about Kafka basics. It says that if one Kafka broker (brokerX) in a cluster dies, copies of brokerX's data will move to the other live brokers in the cluster.
If that is the case, will ZooKeeper or the Kafka controller copy brokerX's data folder to the live brokers, like a copy-paste from one machine's hard disk to another (a physical copy)?
Or do the live brokers share a common location, so that ZooKeeper or the controller just links or points to brokerX's location (a logical copy)?
I am having a hard time understanding this. Could someone help me?
If a broker dies, it's dead. There's no background process that will copy data off of it
The replication of topics only happens while the broker is running
Plus, that image is wrong. The partitions = 2 means exactly that. A third partition doesn't just appear when a broker dies
This all depends on whether the topic has a replication factor > 1. In that case, brokers holding follower replicas are constantly sending fetch requests to the leader replica (on a specific broker), with the goal of staying caught up with the leader (the follower replicas and the leader replica having the same records stored on disk).
So when a broker goes down, all it takes is for the controller to select and promote an in-sync replica to take over as the leader of the partition (that is the default; it can be configured to fall back to a replica that is not in sync). No copy/paste is required: every broker holding an in-sync replica of a partition of that topic, whether as follower or leader, was already storing the same records before the failure.
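As a rough sketch of that selection step (not Kafka's actual controller code), the function below promotes the first live in-sync replica and only falls back to an out-of-sync one when explicitly allowed. The function name, its parameters, and the broker ids are assumptions made for illustration.

```python
def choose_new_leader(replicas, isr, live_brokers, allow_unclean=False):
    """Return the broker id to promote, or None if no eligible replica is alive."""
    # Prefer a live replica that is in sync: promoting it loses no data.
    for broker in replicas:
        if broker in isr and broker in live_brokers:
            return broker
    # Optional fallback: any live replica, at the risk of losing messages.
    if allow_unclean:
        for broker in replicas:
            if broker in live_brokers:
                return broker
    return None


# Broker 1 (the old leader) died; broker 2 was in sync, broker 3 was lagging.
clean = choose_new_leader([1, 2, 3], isr={1, 2}, live_brokers={2, 3})
# Only the dead leader was in sync: no leader unless unclean election is allowed.
stuck = choose_new_leader([1, 2, 3], isr={1}, live_brokers={2, 3})
unclean = choose_new_leader([1, 2, 3], isr={1}, live_brokers={2, 3}, allow_unclean=True)
print(clean, stuck, unclean)
```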
If a broker dies, the behaviour depends on the role of the dead broker. If it was not the leader for its partition, there is no problem: when the broker comes back online, it copies all the missing data from the leader replica. If the dead broker was the leader for the partition, a new leader is elected according to some rules. If the newly elected leader was in sync before the old leader died, there is no message loss; the follower brokers sync their replicas from the new leader, as will the failed leader once it is up again. If the newly elected leader was not in sync, you might lose some messages. In any case, you can tune the behaviour of your Kafka cluster by setting various parameters to balance speed, data integrity, and reliability.
The controller elects a new leader from the ISR for a partition when the current one dies. My understanding is that this data is persisted in Zookeeper. What happens when a Zookeeper node dies during this write? Could this mean that some brokers might still see a different leader for the partition whose leader was just elected?
I tried digging around the docs but could not find anything satisfactory.
My understanding is that this data is persisted in Zookeeper
Yes, This ISR set is persisted to ZooKeeper whenever it changes. (Reference)
What happens when a Zookeeper node dies during this write?
Zookeeper works on a quorum, meaning a majority of the servers in the cluster. (See this SO answer) (snippet below)
With a 3 node cluster, the majority is 2 nodes. So you can tolerate
only 1 node not being in sync at the same time.
With a 5 node cluster, the majority is 3 nodes. So you can tolerate
only 2 nodes not being in sync at the same time.
As long as a majority is available, the decision is made, so the leader election will go ahead.
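The quoted 3-node and 5-node figures follow from simple majority arithmetic. A small sketch (the function names are mine, not part of any ZooKeeper API):

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can be down while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

Note that going from 3 to 4 servers raises the quorum size without raising the number of failures you can survive.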
I have two VM servers (say S1 and S2) and need to install Kafka in cluster mode, with a topic that has only one partition and two replicas (one the leader, the other a follower) for reliability.
I got the high-level idea from this cluster setup and want to confirm whether the strategy below is correct.
First, set up Zookeeper as a cluster on both nodes for high availability (HA). If I set up zk on a single node only and that node goes down, the complete cluster will be down. Right? Is it mandatory to use zk in the latest Kafka version as well? It looks like it is a must for older versions: Is Zookeeper a must for Kafka?
Start the Kafka broker on both nodes. They can use the same port since they are hosted on different nodes.
Create the topic on any node with 1 partition and a replication factor of two.
Zookeeper will select the broker on one node as leader and the other as follower.
Producer will connect to any broker and start publishing messages.
If the leader goes down, Zookeeper will select another node as leader automatically. I am not sure how 2 replicas will be maintained now, as there is only one node live?
Is the above strategy correct?
Useful resources
ISR
ISR vs replication factor
First set up zookeeper as cluster on both nodes for high
availability(HA). If I do setup zk on single node only and then that
node goes down, complete cluster will be down. Right ? Is it mandatory
to use zk in latest kafka version also ? Looks it is must for older
version Is Zookeeper a must for Kafka?
Answer: Yes. Zookeeper is still required until KIP-500 is released. Zookeeper is responsible for electing the controller, storing metadata about the Kafka cluster, and managing broker membership (link). Ideally the number of Zookeeper nodes should be at least 3, so that you can tolerate one node failure (2 healthy Zookeeper nodes, a majority of the cluster, are still capable of selecting a controller). You should also consider setting up the Zookeeper cluster on machines other than the ones Kafka is installed on, so that the failure of one server won't take out both a Zookeeper node and a Kafka node.
Start the kafka broker on both nodes . It can be on same port as it is
hosted on different nodes.
Answer: You should start the Zookeeper cluster first, then the Kafka cluster. Using the same port on different nodes is fine.
Create Topic on any node with partition 1 and replica as two.
Answer: Partitions are used for horizontal scalability. If you don't need this, one partition is okay. With a replication factor of 2, one of the nodes will be the leader and the other will be the follower at any time. But that is not enough to completely avoid data loss while also providing HA. The ideal configuration for avoiding data loss without compromising HA is at least 3 Kafka brokers, a replication factor of 3 for topics, min.insync.replicas=2 as broker config, and acks=all as producer config. (you can check this for more information)
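To see why two brokers are not enough, here is a simplified model of the acks=all / min.insync.replicas check. It is only a sketch of the rule, not broker code, and `write_accepted` and its parameters are hypothetical names.

```python
def write_accepted(isr_size, min_insync_replicas, acks="all"):
    """Simplified rule: with acks=all the leader rejects a write when the
    in-sync replica set is smaller than min.insync.replicas."""
    if acks != "all":
        return True  # min.insync.replicas is only enforced for acks=all
    return isr_size >= min_insync_replicas


# 3 brokers, replication factor 3, min.insync.replicas=2:
rf3_one_down = write_accepted(isr_size=2, min_insync_replicas=2)  # still writable
rf3_two_down = write_accepted(isr_size=1, min_insync_replicas=2)  # rejected

# 2 brokers, replication factor 2: one broker down leaves an ISR of 1.
rf2_strict = write_accepted(isr_size=1, min_insync_replicas=2)    # no availability
rf2_relaxed = write_accepted(isr_size=1, min_insync_replicas=1)   # writable, one copy only
print(rf3_one_down, rf3_two_down, rf2_strict, rf2_relaxed)
```

With two brokers you have to pick between rejecting writes after a single failure and accepting writes that exist on only one disk.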
zookeeper will select any broker on one node as leader and another as
follower
Answer: Controller broker is responsible for maintaining the leader/follower relationship for all the partitions. One broker will be partition leader and another one will be follower. You can check partition leaders/followers with this command.
bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic my-replicated-topic
Producer will connect to any broker and start publishing the message.
Answer: Yes. Setting only one broker as bootstrap.servers is enough to connect to Kafka cluster. But for redundancy you should provide more than one broker in bootstrap.servers.
bootstrap.servers: A list of host/port pairs to use for establishing
the initial connection to the Kafka cluster. The client will make use
of all servers irrespective of which servers are specified here for
bootstrapping—this list only impacts the initial hosts used to
discover the full set of servers. This list should be in the form
host1:port1,host2:port2,.... Since these servers are just used for the
initial connection to discover the full cluster membership (which may
change dynamically), this list need not contain the full set of
servers (you may want more than one, though, in case a server is
down).
If leader goes down, zookeeper will select another node as leader
automatically . Not sure how replica of 2 will be maintained now as
there is only one node live now ?
Answer: If the controller broker goes down, Zookeeper will select another broker as the new controller. If the broker that is the leader of your partition goes down, one of the in-sync replicas will become the new leader (the controller broker is responsible for this). But of course, if you have just two brokers and one of them is down, replication won't be possible. That's why you should have at least 3 brokers in your Kafka cluster.
Yes - ZooKeeper is still needed on Kafka 2.4, but you can read about KIP-500 which plans to remove the dependency on ZooKeeper in the near future and start using the Raft algorithm in order to create the quorum.
As you already understood, if you install ZK on a single node it will work in standalone mode and you won't have any resiliency. The classic ZK ensemble consists of 3 nodes, which allows you to lose 1 ZK node.
After pointing your Kafka brokers to the right ZK cluster you can start your brokers and the cluster will be up and running.
In your example, I would suggest adding another node in order to gain better resiliency and meet the replication factor that you wanted, while still being able to lose one node without losing data.
Bear in mind that using a single partition means that you are limited to a single active consumer per consumer group. The rest of the consumers will be idle.
I suggest you read this blog about Kafka best practices and how to choose the number of topics/partitions in a Kafka cluster.
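A toy round-robin assignment shows the effect. This is a simplification rather than one of Kafka's real assignors, and the consumer names are made up.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin.
    Each partition goes to exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


single = assign_partitions([0], ["c1", "c2", "c3"])
three = assign_partitions([0, 1, 2], ["c1", "c2", "c3"])
idle = [c for c, parts in single.items() if not parts]
print(single, idle)
print(three)
```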
There are two common strategies for keeping replicas in sync, primary-backup replication and quorum-based replication as stated here
In primary-backup replication, the leader waits until the write
completes on every replica in the group before acknowledging the
client. If one of the replicas is down, the leader drops it from the
current group and continues to write to the remaining replicas. A
failed replica is allowed to rejoin the group if it comes back and
catches up with the leader. With f replicas, primary-backup
replication can tolerate f-1 failures.
In the quorum-based approach, the leader waits until a write completes
on a majority of the replicas. The size of the replica group doesn’t
change even when some replicas are down. If there are 2f+1 replicas,
quorum-based replication can tolerate f replica failures. If the
leader fails, it needs at least f+1 replicas to elect a new leader.
I have a question about the statement "If the leader fails, it needs at least f+1 replicas to elect a new leader" in the quorum-based approach. Why is a quorum (majority) of f+1 replicas required to elect a new leader? Why isn't just any replica out of the f+1 in-sync replicas (ISR) selected? Why do we need an election instead of a simple selection?
For the election, how does Zookeeper elect the final leader out of the remaining replicas? Does it compare which replica was updated most recently? Also, why do I need an odd number (say 3) of Zookeeper nodes to elect a leader instead of an even number (say 2)?
Also why do I need the uneven number(say 3) of zookeper to elect a
leader instead even number(say 2) ?
In a quorum-based system like Zookeeper, a leader election requires a simple majority of the "ensemble", i.e. the nodes that form the zK cluster. So a 3-node ensemble can tolerate one node failure, because the remaining two still form a majority and remain operational. In a 4-node ensemble, on the other hand, you need at least 3 nodes alive to form a majority, so it can also tolerate only 1 node failure. A 5-node ensemble can tolerate 2 node failures.
Now you see that a 3-node and a 4-node cluster can each effectively tolerate only 1 node failure, so it makes sense to have an odd number of nodes to maximise the number of nodes that can be down for a given cluster size.
zK leader election relies on a Paxos-like protocol called ZAB. Every write goes through the leader, and the leader generates a transaction id (zxid) and assigns it to each write request. The id represents the order in which the writes are applied on all replicas. A write is considered successful if the leader receives an ack from a majority. An explanation of ZAB.
My question is why quorum(majority) of at f+1 replicas is required to
elect a new leader ? Why not any replica out of f+1
in-synch-replica(ISR) is selected ? Why do we need election instead of
just simple any selection?
As for why election instead of selection - in general, in a distributed system with eventual consistency, you need to have an election because there is no easy way to know which of the remaining nodes has the latest data and is thus qualified to become a leader.
In the case of Kafka, in a setup with multiple replicas and ISRs, there could potentially be multiple nodes whose data is as up to date as the leader's.
Kafka uses zookeeper only as an enabler for leader election. If a Kafka partition leader is down, Kafka cluster controller gets informed of this fact via zK and cluster controller chooses one of the ISR to be the new leader. So you can see that this "election" is different from that of a new leader election in a quorum based system like zK.
Which broker among the ISR is "selected" is a bit more complicated (see) -
Each replica stores messages in a local log and maintains a few important offset positions in the log. The log end offset (LEO) represents the tail of the log. The high watermark (HW) is the offset of the last committed message. Each log is periodically synced to disks. Data before the flushed offset is guaranteed to be persisted on disks.
So when a leader fails, a new leader is elected as follows:
Each surviving replica in ISR registers itself in Zookeeper.
The replica that registers first becomes the new leader. The new leader chooses its Log End Offset(LEO) as the new High Watermark (HW).
Each replica registers a listener in Zookeeper so that it will be informed of any leader change. Every time a replica is notified about a new leader:
If the replica is not the new leader (it must be a follower), it truncates its log to its HW and then starts to catch up from the new leader.
The leader waits until all surviving replicas in ISR have caught up or a configured time has passed. The leader writes the current ISR to Zookeeper and opens itself up for both reads and writes.
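The steps above can be sketched as a small simulation. This is a simplified model of the described procedure, not Kafka code: `Replica`, `fail_over`, and the sample logs are hypothetical, the high watermark is modelled as the number of committed messages, and a follower's log up to its HW is assumed to be a prefix of the new leader's log.

```python
class Replica:
    def __init__(self, broker_id, log, hw):
        self.broker_id = broker_id
        self.log = list(log)  # messages in offset order
        self.hw = hw          # high watermark, modelled as the committed count

    @property
    def leo(self):
        return len(self.log)  # log end offset: next offset to be written


def fail_over(registration_order):
    """registration_order: surviving ISR replicas in the order they registered."""
    leader, followers = registration_order[0], registration_order[1:]
    leader.hw = leader.leo                     # new leader takes its LEO as the new HW
    for follower in followers:
        follower.log = follower.log[:follower.hw]           # truncate to own HW
        follower.log.extend(leader.log[follower.leo:])      # catch up from the leader
        follower.hw = leader.hw
    return leader


r1 = Replica(1, ["a", "b", "c"], hw=2)
r2 = Replica(2, ["a", "b", "c", "d"], hw=2)
new_leader = fail_over([r2, r1])  # r2 registered first
print(new_leader.broker_id, new_leader.hw, r1.log)
```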
Now you can probably appreciate the benefit of a primary-backup model compared to a quorum model: using the above strategy, a 3-node Kafka cluster with 2 ISRs can tolerate 2 node failures, including a leader failure, at the same time and still get a new leader elected (though that new leader would have to reject new writes for a while, until one of the failed nodes comes back and catches up with the leader).
The price to pay is of course higher write latency: in a 3-node Kafka cluster with 2 ISRs, the leader has to wait for an acknowledgement from both followers in order to acknowledge the write to the client (producer), whereas in a quorum model a write can be acknowledged as soon as one of the followers acknowledges it.
So depending on the use case, Kafka lets you trade latency for durability. 2 ISRs mean you sometimes have higher write latency, but higher durability. If you run with only one ISR, then if you lose the leader and an ISR node, you either have no availability or you can choose an unclean leader election, in which case you have lower durability.
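A back-of-the-envelope model of that latency difference, assuming the leader's own write is instantaneous and at least one follower exists. The function names and the millisecond figures are invented for illustration.

```python
def primary_backup_latency(follower_ack_ms):
    """Leader waits for every in-sync follower, so the slowest one dominates."""
    return max(follower_ack_ms)


def quorum_latency(follower_ack_ms):
    """Leader waits only until a majority (itself included) has the write."""
    cluster_size = len(follower_ack_ms) + 1
    follower_acks_needed = cluster_size // 2  # majority minus the leader itself
    return sorted(follower_ack_ms)[follower_acks_needed - 1]


acks = [5, 40]  # one fast follower, one slow follower, in milliseconds
slow_path = primary_backup_latency(acks)
fast_path = quorum_latency(acks)
print(slow_path, fast_path)
```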
Update - Leader election and preferred replicas:
All nodes which make up the cluster are already registered in zK. When one of the nodes fails, zK notifies the controller node (which is itself elected via zK). When that happens, one of the live ISRs is selected as the new leader. But Kafka has the concept of a "preferred replica" to balance leadership distribution across cluster nodes. This is enabled using auto.leader.rebalance.enable=true, in which case the controller will try to hand leadership back to the preferred replica. The preferred replica is the first broker in the partition's list of replicas. This is all a bit complicated, but only a Kafka admin needs to know about it.
I have just started to study the ZooKeeper architecture and its communication with HBase.
I have a question about leader election in a ZooKeeper cluster.
As far as I have learned, ZooKeeper selects its leader using the transaction id, but when we freshly start a ZooKeeper cluster, the transaction id of every ZooKeeper node will be zero. How will it select its leader then?
Can anyone please explain in detail?
Thanks in advance
There are several values ZooKeeper takes into consideration during an election, namely epoch, zxid and id, which you can verify from the implementation of FastLeaderElection.totalOrderPredicate.
To answer your question: if all your zk nodes start at the same time in a fresh cluster, the one with the biggest id (which you specify in the myid file) will be elected as the leader.
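The comparison boils down to ordering votes by epoch, then zxid, then server id. Below is a simplified Python rendering of that ordering; the real Java method also takes quorum weights into account, and `Vote`, `beats`, and `pick_leader` are names I made up for this sketch.

```python
from collections import namedtuple

Vote = namedtuple("Vote", ["epoch", "zxid", "server_id"])


def beats(new, current):
    """True if `new` should replace `current`: higher epoch wins,
    then higher zxid, then higher server id (the value in myid)."""
    return (new.epoch, new.zxid, new.server_id) > (
        current.epoch, current.zxid, current.server_id)


def pick_leader(votes):
    best = votes[0]
    for vote in votes[1:]:
        if beats(vote, best):
            best = vote
    return best.server_id


# Fresh cluster: every epoch and zxid is 0, so the largest myid decides.
fresh_leader = pick_leader([Vote(0, 0, 1), Vote(0, 0, 2), Vote(0, 0, 3)])
# After some writes: the most up-to-date zxid wins before the id is considered.
later_leader = pick_leader([Vote(1, 7, 1), Vote(1, 5, 3), Vote(1, 7, 2)])
print(fresh_leader, later_leader)
```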