Why are only n failed servers allowed in multi-node ZooKeeper ensembles? - apache-zookeeper

I am reading through the theory for Apache Kafka and came across the ZooKeeper quorum allowance. I want to know why only n failed servers are allowed while keeping a quorum. If we are using 5 servers, why not allow 3 servers to fail and still not let ZooKeeper go down? We would be left with 2 servers, which is the same as using a 3-server configuration and allowing one to fail. Another question: if we allow 1 server to fail in a 3-server configuration, isn't the odd-number rule violated? Or is the odd-number rule for the general case, with an output randomly selected from the two in case of a clash?

ZooKeeper won't "go down" with an even number of servers. It will just be unable to make progress, because a quorum, a majority of floor(N/2) + 1 servers, cannot be reached amongst the remaining servers. E.g. with 3 servers, 2 in agreement (not merely two in total, giving different information to each other) are needed for a quorum; with 5, 3 are needed.
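To make the arithmetic concrete, here is a minimal Python sketch of that majority check (the function names are mine, for illustration):

    def quorum_size(ensemble_size: int) -> int:
        """Majority needed for a quorum: floor(N/2) + 1."""
        return ensemble_size // 2 + 1

    def has_quorum(alive: int, ensemble_size: int) -> bool:
        """True if the surviving servers can still form a majority."""
        return alive >= quorum_size(ensemble_size)

    print(has_quorum(alive=2, ensemble_size=3))  # True:  2 >= 2
    print(has_quorum(alive=2, ensemble_size=5))  # False: 2 < 3, so losing 3 of 5 loses quorum

Note that the majority is computed against the configured ensemble size, not against whichever servers happen to be alive, which is why 2 survivors out of 5 are not equivalent to 2 survivors out of 3.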

Related

MongoDB - cross data center primary election DRP / Geographically Distributed Replica Sets

Working with mongo distributed over 3 data centers
for this example the data center names are A, B, C
when everything is going well, all user traffic is pointed to A
so the mongo primary is on A; the mongo setup is:
3 servers in A (with high priority)
1 server in B (with low priority)
1 server in C (priority 0)
the problem is supporting mongo writes when 2 scenarios happen:
no network between A-B-C (network tunnel is down)
data center A is on fire :), let's say the data center isn't working; at this point all user traffic is pointed to B and a primary election in B is expected.
scenario 1 isn't a problem: with no datacenter network tunnel, A still has a majority of replicas and high priority, so everything keeps working.
scenario 2 won't work, because when A stops working, all 3 replicas (on A) are unreachable, and no new primary will be elected in B or C because the majority of replicas is down.
how can I set up my replica set so it supports both scenarios?
This is not possible: you can't have an 'available' system in the face of both total network partitions and the failure of a DC with a majority-election approach such as MongoDB's. Either the majority is in one DC, in which case it survives partitions but not that DC going down, or the majority requires 2 DCs to be up, which survives one DC going down but not a full network failure.
Your options:
Accept the partition problem and change the setup to 2-2-1 (see the sketch after this answer). Unreliable tunnels should be solvable; if the entire network of a DC goes down, you're at scenario 2.
Accept the DC problem and stick to your configuration. The most likely problems are probably large-scale network issues and massive power outages, not fire.
Use a database that supports other types of fault-tolerance. That, however, is not a panacea since this entails other tradeoffs that must be well understood.
To keep the system up when DC A goes down also requires application servers in DC B or C, which is a tricky problem in its own right. If you use a more partition-tolerant database, for instance, you could easily have a 'split-brain' problem where application servers in different DCs accept different, conflicting writes. Such problems can only be solved at the application level.
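For illustration, here is a minimal sketch of the 2-2-1 layout from the first option above, initiated via PyMongo's replSetInitiate command; the hostnames, ports, and priorities are assumptions, not a definitive deployment:

    from pymongo import MongoClient

    # Hypothetical layout: 2 members in DC A, 2 in DC B, 1 tie-breaker in DC C.
    config = {
        "_id": "rs0",
        "members": [
            {"_id": 0, "host": "a1.example.com:27017", "priority": 2},
            {"_id": 1, "host": "a2.example.com:27017", "priority": 2},
            {"_id": 2, "host": "b1.example.com:27017", "priority": 1},
            {"_id": 3, "host": "b2.example.com:27017", "priority": 1},
            # Priority 0: votes in elections but never becomes primary.
            {"_id": 4, "host": "c1.example.com:27017", "priority": 0},
        ],
    }

    client = MongoClient("a1.example.com", 27017, directConnection=True)
    client.admin.command("replSetInitiate", config)

With this layout any single DC can go down and the remaining three voters still form a majority of five; the trade-off, as noted above, is that a clean three-way partition leaves no side with a majority.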

Difference between ensemble and quorum in zookeeper

I am new to zookeeper. I have configured it on a single machine. But I came across the words "ensemble" and "quorum" in the documentation of zookeeper.
Can anyone please tell me the difference between these?
Ensemble
Quorum
This answer is for those who still have doubts about Ensemble and Quorum. An Ensemble is nothing but a cluster of ZooKeeper servers, while a Quorum defines the rule for forming a healthy Ensemble, via the formula Q = 2N + 1, where Q is the number of nodes required to form a healthy Ensemble that can tolerate N failed nodes. You will see how this formula works in the following example.
Before I start with an example, I want to define 2 things -
Cluster: a group of connected nodes/servers (from now on, nodes) with one node as Leader/Master and the rest as Followers/Slaves.
Healthy Ensemble: a cluster with only one active Leader at any given point in time, hence fault tolerant.
Let me explain with the example commonly used when defining Ensemble and Quorum.
Let's say you have 1 zookeeper node. Nothing to discuss here, as we need more than 1 node to form a cluster.
Now take 2 nodes. There is no problem forming a cluster, but there is a problem forming a healthy Ensemble: say the connection between these 2 nodes is lost; then each node will think the other is down, so both try to act as Leader, which leads to inconsistency since they can't communicate with each other. That means a cluster of 2 nodes can't afford even a single failure, so what is the use of such a cluster? It's not that you can't make a cluster of 2 nodes; it's just the same as having a single node, as neither tolerates even a single failure. Hope this is clear.
Now take 3 nodes. There is no problem forming a cluster or a healthy Ensemble, as this can tolerate 1 failure according to the formula above: 3 = 2N + 1 => N = (3 - 1)/2 = 1. So when the next failure occurs (either a connection or node failure), no node can be elected Leader, hence the Ensemble won't serve any write/update/delete requests, and the state remains consistent across the zookeeper cluster nodes. A Leader election can only happen while a majority of nodes is available and connected, where the majority m = ⌊n/2⌋ + 1 and n is the configured ensemble size. So here the 1st election happened with 3 nodes (as it's a 3-node cluster). After the 1st failure, the remaining 2 nodes can still conduct an election, as they meet the majority m = ⌊3/2⌋ + 1 = 2. After the 2nd failure there is no majority: only one node is available, but the required majority is still m = ⌊3/2⌋ + 1 = 2.
Now take 4 nodes. There is no problem forming a cluster or a healthy Ensemble, but having 4 nodes is the same as 3 nodes, because both tolerate only 1 failure. Let's derive it from the Quorum formula: 4 = 2N + 1 => N = ⌊(4 - 1)/2⌋ = ⌊1.5⌋ = 1.
Now take 5 nodes. There is no problem forming a cluster or a healthy Ensemble, as this can tolerate 2 failures according to the formula above: 5 = 2N + 1 => N = (5 - 1)/2 = 2.
Now take 6 nodes. There is no problem forming a cluster or a healthy Ensemble, but having 6 nodes is the same as 5 nodes, because both tolerate only 2 failures. Let's derive it from the Quorum formula: 6 = 2N + 1 => N = ⌊(6 - 1)/2⌋ = ⌊2.5⌋ = 2.
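The whole walkthrough above condenses to a few lines of Python using the formulas just derived (a sketch, for illustration):

    # Majority = floor(Q/2) + 1; tolerated failures N = floor((Q - 1) / 2).
    for q in range(1, 7):
        majority = q // 2 + 1
        tolerated = (q - 1) // 2
        print(f"ensemble={q}  majority={majority}  tolerated={tolerated}")

    # ensemble=1  majority=1  tolerated=0
    # ensemble=2  majority=2  tolerated=0
    # ensemble=3  majority=2  tolerated=1
    # ensemble=4  majority=3  tolerated=1
    # ensemble=5  majority=3  tolerated=2
    # ensemble=6  majority=4  tolerated=2

The output makes the even/odd pattern explicit: 4 nodes tolerate no more failures than 3, and 6 no more than 5.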
Conclusion:
To form a Quorum we need at least 3 nodes - a 2-node cluster can't handle even a single failure
It's good to form an Ensemble with an odd number of nodes - n nodes (n even) tolerate the same number of failures as n - 1 (odd) nodes
It's not good to have too many nodes, as each extra node adds latency. The suggested production cluster size is 5 - if one server is down for maintenance, it can still handle one more failure.
Ensemble is an array of nodes (or servers, if you like) that form your distributed computing ecosystem.
Quorum is when things get interesting. On a specific assignment/job, the Quorum ensures that a healthy leader-follower majority can be maintained. In other words, it is a rule ensuring that a majority vote is obtained before proceeding with an activity (e.g. commit/update/delete). In a replication strategy, a quorum is a must-have.
Lets try and use non-technical examples:
1) In your company - there is a board formed by 5 directors (ensemble).
|d1, d2, d3, d4, d5|----- BoD
2) Each director has an equal say in each decision, but a majority of 3 directors must agree on a project at any time. If there is no majority, the company is dysfunctional.
3) On a particular project, P1, they voted to have d1, d2, d3 as the majority of decision makers on the project, but d4 and d5 are fully aware of what's going on (so that they can step in at any time).
4) Now (God forbid) d3 passes away after a few months; again everyone agrees that the majority will be formed using d1, d2, d4, and d5 is still aware of what's going on. Note that we only have 4 directors left.
5) Disaster strikes again. d5 leaves the company for a competitor. But that doesn't change anything, because the company is still functional with a 3-member BoD.
6) If yet another disaster strikes the BoD and any of the remaining directors becomes "unavailable", the company is dysfunctional, i.e. we have lost the quorum-forming criterion.
ZooKeeper uses the formula ceil(N/2) - 1 to get the maximum number of failures allowed for an Ensemble while maintaining a stable quorum. Accordingly, the minimum recommended ensemble size is 3 (tolerating at most 1 failure).
When you want high availability from ZooKeeper you use multiple ZooKeeper servers to create an ensemble. ZooKeeper has a leader-follower architecture: in an ensemble there will be one leader and the rest will be followers, and if the leader fails one of the followers is elected as the new leader.
The quorum is the minimum number of servers that must be up and in agreement for that election (and for committing writes) to succeed. Each server in the ensemble has an ID, but the new leader is chosen by a majority vote among the reachable servers, not simply by picking the next ID in sequence.
This concept of quorum is also used while creating nodes (znodes) in zookeeper: a write is only acknowledged once a majority of the servers have accepted it.
ensemble: number of nodes in the group.
Quorum: number of nodes required to take an action.
Example: you have 5 nodes.
The ensemble is 5, but according to the majority rule the Quorum is 3. If a write succeeds on 3 nodes, we send a success response to the client. Apache Zookeeper Quorum

What should be the majority in an ensemble for Zookeeper

I am trying to understand ZooKeeper using this book - ZooKeeper by Flavio Junqueira and Benjamin Reed. It mentions that we need to select a majority of servers for a quorum, as stated here:
Say that we use four servers for an ensemble. A majority of servers is
comprised of three servers. However, this system will only tolerate a
single crash, because a double crash makes the system lose majority.
Consequently, with four servers, we can only tolerate a single crash,
but quorums now are larger, which implies that we need more
acknowledgments for each request. The bottom line is that we should
always shoot for an odd number of servers.
Please help me in understanding this.
How do we select the majority of servers for a given ensemble?
Why does this statement say quorums now are larger and why do we need more acknowledgments for each request?
It just means that more servers should be up than down, counting every server in the ensemble - or, for writes, that more servers have acknowledged receipt of a message than have not. With 4 servers you need 3 servers up to satisfy that condition; with 3, only 2. In each case you can tolerate the failure of only one server for the cluster to stay up. The 4-node cluster is worse because the extra server doesn't make your cluster any more fault tolerant than a 3-node one.
Additionally, with 3 nodes you require just 2 acknowledgements to meet the quorum requirement; with 4, you need 3 acks. That leads to a slower cluster. That's what the 'Consequently, with four servers...' statement means.

optimal number of mongodb entities (administration/devops)

I intend to check mongodb performance by running an application on 8 servers.
-1. here http://docs.mongodb.org/manual/tutorial/deploy-shard-cluster/ I read,
In production deployments, you must deploy exactly three config server
instances, each running on different servers to assure good uptime and
data safety. In test environments, you can run all three instances on
a single server.
What if I want to make optimal use of the resources of 8 servers (+ 1 dedicated server for the application)? Do I start 1 config server instance per server?
-2. I see here http://docs.mongodb.org/manual/core/replication-introduction/
that a replica set with 3 mongod instances (each mongod instance on a different server) is the way to go. Is this the optimal scenario when it comes to having 8 servers?
-3. How many replica sets would I use when I have 8 servers? 1 per server (8 servers == 8 replica sets == 3 mongod instances per server from different replica sets)?
-4. Is there any best practices documentation regarding optimization of this type?
Kind Regards,
Despot
What if I want to make optimal use of the resources of 8 servers (+ 1 dedicated server for the application)?
That's not an optimal way to plan; there is no way you know that you NEED 7 shards for your data.
Do I start 1 config server instance per server?
No, you are hardcoded to three.
Is this the optimal scenario when it comes to having 8 servers?
No, it is the minimum; you would ideally want more members, especially one bridging partitions, while ensuring that an odd number of nodes sits on one side of any partition so a majority can still be formed.
Normally your replica set would also contain at least one extra member designated for backups, typically using a slaveDelay of maybe a day; see the sketch below.
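As a sketch of such a delayed backup member (the hostname is hypothetical, and note that MongoDB 5.0 renamed slaveDelay to secondaryDelaySecs):

    # One member entry inside a replSetInitiate/replSetReconfig config document:
    # hidden and priority-0 so it never serves application reads or becomes
    # primary, and delayed 24h so it acts as a rolling backup.
    backup_member = {
        "_id": 3,
        "host": "backup.example.com:27017",  # hypothetical host
        "priority": 0,
        "hidden": True,
        "slaveDelay": 86400,  # seconds; secondaryDelaySecs on MongoDB >= 5.0
    }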
How many replica sets would I use when I have 8 servers?
Assuming (guessing) you want to use 7 shards you would have 7 replica sets, one per shard.
3 mongod instances per server from different replica sets
That would be a bad idea. You do not want to place replica set members on the same server as each other; you might as well be using no replication.
I would seriously plan more and check whether you really need 7 shards; I highly doubt it.
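If you do end up sharding, each shard is registered with the mongos router as a whole replica set; a minimal PyMongo sketch with hypothetical host and set names:

    from pymongo import MongoClient

    # Connect to a mongos router and register one shard, itself a
    # 3-member replica set named "shard1".
    mongos = MongoClient("mongos.example.com", 27017)
    mongos.admin.command(
        "addShard",
        "shard1/s1a.example.com:27017,s1b.example.com:27017,s1c.example.com:27017",
    )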

ZooKeeper reliability - three versus five nodes

From the ZooKeeper FAQ:
Reliability:
A single ZooKeeper server (standalone) is essentially a coordinator with
no reliability (a single serving node failure brings down the ZK service).
A 3 server ensemble (you need to jump to 3 and not 2 because ZK works
based on simple majority voting) allows for a single server to fail and
the service will still be available.
So if you want reliability go with at least 3. We typically recommend
having 5 servers in "online" production serving environments. This allows
you to take 1 server out of service (say planned maintenance) and still
be able to sustain an unexpected outage of one of the remaining servers
w/o interruption of the service.
With a 3-server ensemble, if one server is taken out of rotation and one server has an unexpected outage, then there is still one remaining server that should ensure no interruption of service. Then why the need for 5 servers? Or is it more than just interruption of service that is being considered?
Update:
Thanks to sbridges for pointing out that it has to do with maintaining a quorum. And the way that ZK defines a quorum is floor(N/2) + 1, where N is the original number in the ensemble (and not just the currently available set).
Now, a google search for ZK quorum finds this in the HBase book chapter on ZK:
In ZooKeeper, an even number of peers is supported, but it is normally
not used because an even sized ensemble requires, proportionally, more
peers to form a quorum than an odd sized ensemble requires. For
example, an ensemble with 4 peers requires 3 to form a quorum, while
an ensemble with 5 also requires 3 to form a quorum. Thus, an ensemble
of 5 allows 2 peers to fail and still maintain quorum, and thus is more
fault tolerant than the ensemble of 4, which allows only 1 down peer.
And this paraphrasing of Wikipedia in Edward J. Yoon's blog:
Ordinarily, this is a majority of the people expected to be there,
although many bodies may have a lower or higher quorum.
Zookeeper requires that you have a quorum of servers up, where the quorum size is floor(N/2) + 1. For a 3-server ensemble, that means 2 servers must be up at any time; for a 5-server ensemble, 3 servers must be up at any time.
Basically, Zookeeper will work just fine as long as the active Zookeepers form a MAJORITY compared to the failed Zookeepers.
Also, with an even ensemble size, i.e. 2, 4, 6, etc., a split can leave failed = active, which is why even sizes are not recommended.
Both 3 and 4 handle only 1 failure, so why would we want to use 4 Zookeepers instead of 3?