The title may look like silly, but I really can't understand the zookeeper failover policy when the master is down although I read a lot of docs about Zookeeper. My question as following:
If I have three nodes zookeeper, then the master is down, so how do
the remaining two nodes elect the new master(right now it's an even
number nodes, how do they vote by majority).
If one of the remaining
two nodes down, then the last one will become master and continually
serve the service, right please?
If even number zookeeper nodes can
run very well, why I have to setup odd number of Zookeeper nodes?
I think you're misunderstanding what majority we're talking about here. The majority that is important is not among the remaining nodes, but among the entire cluster. So what you need to ask is: 'Can 2 nodes form a majority among 3 nodes'? And the answer is that they can, and therefore they can elect a leader.
(I do not know exactly how Zookeeper solves leader voting, but the important thing is that the goals for the nodes is not to become leader but to decide on one leader. And to convince you it's possible here's a vary simple (but slow) way of solving it: The nodes vote at random, if they have formed a majority they elect that leader else they vote again.)
No, this will not be the case.
Since the cluster is still configured around being a 3 node cluster the one node that is left can not form a majority and can therefor not elect a leader.
This is one of the reasons why a 2 node cluster can actually be worse then a 1 node one, if one of the nodes go down the cluster stops.
You do not have to, it is just recommended. And a good reason to have a odd number is that if you get a net-split that divides your cluster into two parts of same size no side can elect a leader. (This is not possible if you run a odd number of nodes.)
You can also see it as a buy-one-get-one-free type of deal, if you have 4 nodes only 1 can go down, but if you get 5 nodes it's okay for 2 to go down. But if you get 6 nodes it's still just 2 nodes that can go down without the cluster going down.
Related
If find the replica set requirement a bit confusing, and I'm probably missing something obvious (like under which condition there are elections).
I understand that in normal operations you need quorum, and a voting takes place and to get a majority you need and odd numbers of machines.
But since we use a replica set for failover, if the master dies, then we are left with an even number of voting members, which based on my limited experience lengthen the time to elect a primary.
Also according to the documentation, the addition of a voting member doesn't start an election, it would seem that starting (booting) you replica set with an even number of nodes would make more sense?
So if we start say with 4 machines in the replica set, and one machine dies, there is a re-election with 3 machines, fast quorum. We add a machine back to get back to our normal operation state, no re-election and we are back to our normal operation conditions.
Can someone shed a light on this?
TL;DR: With single master systems, even partitions make it impossible to determine which remainder still has a majority, taking both systems down.
Let N be a cluster of four machines:
One machine dies, the others resume operation. Good.
Two machines die, we're offline because we no longer get a majority. Bad.
Let M be a cluster of three machines:
One machine dies, the others resume operation. Good.
Two machines die, we're offline because we no longer get a majority. Bad.
=> Same result at 3/4 of the cost.
Now, let's add an assumption or two:
We're also going to operate some kind of server application that uses the database
The network can be partitioned
Let's say you have two datacenters, one with two database instances and the backend server machines. If the connection to the backup center (which has one MongoDB instance) fails, you're still online.
Now if you added a second MongoDB instance at the backup data center, a network partition would, despite seemingly higher redundancy, yield lower availability since we'd lose the majority in case of a network partition and can't continue to operate.
=> Less availability at higher cost. But that doesn't answer the question yet.
Let's say you're really worried about availability: You have two data centers, with backend servers in both datacenters, anycast IPs, the whole deal. Now the network between the two DCs is partitioned, but some clients connect to DC A while other reach DC B. How do you now determine which datacenter may accept writes? It's not possible - this is why the odd number is necessary.
You don't actually need Anycast IPs, BGP or any fancy stuff for the problem to become real, any writing application (like a worker, a stale request, anything) would require later merging different writes, which is a completely different concurrency scheme.
I am new to zookeeper. I have configured it on a single machine. But I came across the words "ensemble" and "quorum" in the documentation of zookeeper.
Can anyone please tell me the difference between these?
Ensemble
Quorum
This answer is for those who still have doubt understanding Ensemble and Quorum. Ensemble is nothing but a cluster of Zookeeper servers, where in Quorum defines the rule to form a healthy Ensemble. Which is defined using a formula Q = 2N+1 where Q defines number of nodes required to form a healthy Ensemble which can allow N failure nodes. You will understand about this formula in the following example.
Before I start with an example, I want to define 2 things-
Cluster: Group of connected nodes/servers (now on will use node) with one node as Leader/Master and rest as Followers/Slaves.
Healthy Ensemble: A cluster with only one active Leader at any given point of time, hence fault tolerant.
Let me explain with an example, which is used commonly across while defining Ensemble and Quorum.
Lets say you have 1 zookeeper node. No need to worry here as we need more than 1 node to form a cluster.
Now take 2 nodes. There is no problem forming a cluster but there is problem to form a healthy Ensemble, because - Say the connection between these 2 nodes are lost, then both nodes will think the other node is down, so both of them try to act as Leader, which leads to inconsistency as they can't communicate with each other. Which means cluster of 2 nodes can't even afford even a single failure, so what is the use of this cluster??. They are not saying you can't make a cluster of 2 nodes, all they are saying is - it is same as having single node, as both don't allow even a single failure. Hope this is clear
Now take 3 nodes. There is no problem forming a cluster or healthy Ensemble - as this can allow 1 failure according the formula above 3 = 2N+1 => N = (3-1)/2 = 1. So when the next failure occurs (either connection or node failure), no node will be elected as Leader, hence the Ensemble won't serve any write/update/delete services, hence the states of the client cluster remains consistent across zookeeper cluster nodes. So the Leader election won't happen until there is majority nodes available and connected, where Majority m = (n/2)+1, where n stands for number of nodes available when the previous election happened. So here, 1st election happened with 3 nodes (as its a 3 node cluster). Then there was a 1st failure, so remaining 2 nodes can conduct election, as they have majority m = (3/2)+1 = 2. Then 2nd failure happened, now they don't have majority as there is only one node available for election, but the majority required is m = (2/2)+1 = 2.
Now take 4 nodes. There is no problem forming a cluster or healthy Ensemble, but having 4 nodes is same as 3 nodes, because both allows only 1 failure. Lets derive it from the Quorum formula 4 = 2N+1 => N = (4-1)/2 = ⌊1.5⌋ = 1 //floor(1.5)=1
Now take 5 nodes. There is no problem forming a cluster or healthy Ensemble - as this can allow 2 failure according the formula above 5 = 2N+1 => N = (5-1)/2 = 2.
Now take 6 nodes. There is no problem forming a cluster or healthy Ensemble, but having 6 nodes is same as 5 nodes, because both allows only 2 failure. Lets derive it from the Quorum formula 6 = 2N+1 => N = (6-1)/2 = ⌊2.5⌋ = 2
Conclusion:
To form a Quorum we need atleast 3 nodes - as 2 node cluster can't even handle single failure
Its good to form an Ensemble of odd number of nodes - as n (even number) nodes tends to allow same number of failure as of n-1 (odd number) nodes
Its not good to have more nodes, as they add latency into performance. Suggested Production cluster size is 5 - if one server is down for maintenance, it still can handle one more failure.
Ensemble is an array of nodes (or servers, if you like) that form your Distributed Computer Ecosystem.
Quorum is when things get interesting. On a specific assignment/job, Quorum ensures that a healthy leader-follower majority can be maintained. In other words, a conduct which ensures that majority vote can be obtained to proceed with an activity (e.g. commit/update/delete etc.). In Replication strategy, quorum is a must to have.
Lets try and use non-technical examples:
1) In your company - there is a board formed by 5 directors (ensemble).
|d1, d2, d3, d4, d5|----- BoD
2) Each director has equal say in each decision. But a majority if 3 directors at anytime should agree on a project. If no majority is there, the company will be dysfunctional.
3) One a particular project, P1 - they randomly voted to have a majority of d1,d2,d3 to be decision makers in the project. but d4 and d5 are fully aware of what's going on (so that they can step in anytime).
4) Now (God forbid), d3 passes away after a few months, again everyone agrees that the majority will be formed using d1,d2,d4. d5 is still aware of what's going on. NOte that we only have 4 directors left.
5) Disaster strikes again. d5 leaves the company for another competitor. But that doesn't change anything because the company is still functional with a 3-member BoD.
6) At any point of another disaster strikes the BoD and any of the directors become "unavailable" - company is dysfunctional i.e. we have lost the quorum forming criterion.
Zookeeper uses ceil(N/2) - 1 formula to get the maximum number of failures allowed for an Ensemble and maintain a stable quorum. In this case, the minimum recommended ensemble nodes are 3 (tolerates 1 failure maximum).
When you want to have high availability in zookeeper server you use multiple zookeeper servers to create an ensemble. Basically zookeeper has master-slave architecture. In an ensemble there will be one master and rest will be the slaves. If the master fails one of the slaves will act as a master.
The sequence in which a master is assigned is called as quorum. When you create an ensemble, zookeeper internally creates a sequence ID for the slave severs. When the main master fails it will check the next sequence ID to create a new master.
This concept of quorum also used while creating nodes in zookeeper.
ensemble: Numbers of nodes in the group.
Quorum: Number of required nodes to take the action.
Example: you have 5 nodes.
ensemble is 5. But according to majority rule Quorum should be 3. If we write no 3 nodes successfully, then we send success response to the client. Apache Zookeeper Quorum
I am trying to understand Zookeeper using this book - Zookeeper By Flavio Junqueira, Benjamin Reed, it is mentioned that we need to select a majority of servers for quorum as stated here:
Say that we use four servers for an ensemble. A majority of servers is
comprised of three servers. However, this system will only tolerate a
single crash, because a double crash makes the system lose majority.
Consequently, with four servers, we can only tolerate a single crash,
but quorums now are larger, which implies that we need more
acknowledgments for each request. The bottom line is that we should
always shoot for an odd number of servers.
Please help me in understanding this.
How do we select the majority of servers for a given ensemble?
Why does this statement say quorums now are larger and why do we need more acknowledgments for each request?
It just means that more servers should be up than down where each server in the ensemble should be accounted for, or that more servers have acknowledged message receipt than those that have not. With 4 servers you need 3 servers to be up to satisfy that condition, with 3, only 2. In each instance you can only tolerate the failure of one server for the cluster to still be up. The 4 node cluster is worse because you now have an extra server that is essentially not making your cluster any more fault tolerant than just a 3 node one.
Additionally, if you had 3 nodes, you would require just 2 acknowledgements to meet the quorum requirement. With 4, you need 3 acks. That would lead to a slower cluster. That's what the ' Consequently, with four servers...' statement means.
Suppose i have set w=majority as write concern and a node fails during a write operation,
will the majority be changed according to the currently alive nodes?
i.e., Suppose there are 3 nodes. Now the majority is 2. And if a node fails during a write operation, will the majority be decreased or will it remain same and wait for the node to come up?
The majority of a replica set is determined based on replica set configuration, not its current running state.
In other words, if you have a three node replica set configured, then majority is always two. If one node is offline, two is still the majority. If two nodes are offline, two is still majority and cannot be satisfied until one of the offline nodes comes back online.
I've been wondering why ZooKeeper needs a majority of the machines in the ensemble to work at all. Lets say we have a very simple ensemble of 3 machines - A,B,C.
When A fails, new leader is elected - fine, everything works. When another one dies, lets say B, service is unavailable. Does it make sense? Why machine C cannot handle everything alone, until A and B are up again?
Since one machine is enough to do all the work (for example single machine ensemble works fine)...
Is there any particular reason why ZooKeeper is designed in this way? Is there a way to configure ZooKeeper that, for example ensemble is available always when at least one of N is up?
Edit:
Maybe there is a way to apply a custom algorithm of leader selection? Or define a size of quorum?
Thanks in advance.
Zookeeper is intended to distribute things reliably. If the network of systems becomes segmented, then you don't want the two halves operating independently and potentially getting out of sync, because when the failure is resolved, it won't know what to do. If you have it refuse to operate when it's got less than a majority, then you can be assured that when a failure is resolved, everything will come right back up without further intervention.
The reason to get a majority vote is to avoid a problem called "split-brain".
Basically in a network failure you don't want the two parts of the system to continue as usual. you want one to continue and the other to understand that it is not part of the cluster.
There are two main ways to achieve that one is to hold a shared resource, for instance a shared disk where the leader holds a lock, if you can see the lock you are part of the cluster if you don't you're out. If you are holding the lock you're the leader and if you don't your not. The problem with this approach is that you need that shared resource.
The other way to prevent a split-brain is majority count, if you get enough votes you are the leader. This still works with two nodes (for a quorum of 3) where the leader says it is the leader and the other node acting as a "witness" also agrees. This method is preferable as it can work in a shared nothing architecture and indeed that is what Zookeeper uses
As Michael mentioned, a node cannot know if the reason it doesn't see the other nodes in the cluster is because these nodes are down or there's a network problem - the safe bet is to say there's no quorum.
Let’s look at an example that shows how things can go wrong if the quorum (majority of running servers) is too small.
Say we have five servers and a quorum can be any set of two servers. Now say that servers s1 and s2 acknowledge that they have replicated a request to create a znode /z. The service returns to the client saying that the znode has been created. Now suppose servers s1 and s2 are partitioned away from the other servers and from clients for an arbitrarily long time, before they have a chance to replicate the new znode to the other servers. The service in this state is able to make progress because there are three servers available and it really needs only two according to our assumptions, but these three servers have never seen the new znode /z. Consequently, the request to create /z is not durable.
This is an example of the split-brain scenario. To avoid this problem, in this example the size of the quorum must be at least three, which is a majority out of the five servers in the ensemble. To make progress, the ensemble needs at least three servers available. To confirm that a request to update the state has completed successfully, this ensemble also requires that at least three servers acknowledge that they have replicated it.