Factors and Conditions that Affect Election in MongoDB

While reading the documentation I came across the lines below:
Network Partitions
Network partitions affect the formation of a majority for an election. If a primary steps down and neither portion of the replica set has a majority the set will not elect a new primary. The replica set becomes read-only.
To avoid this situation, place a majority of instances in one data center and a minority of instances in any other data centers combined.
I don't understand the bold line. Can someone explain what it means?

For reference, the OP is referring to the Network Partitions section of the Replica Set Elections docs.
Suppose you have three datacenters, A, B, and C. Each datacenter has some nodes of your MongoDB replica set rs. rs has a total of 5 nodes. Due to a combination tornado / hurricane / shark attack causing a network partition, each datacenter becomes disconnected from the others: A can't talk to B, B can't talk to C, A can't talk to C, etc. If you have a majority (3) of the members of rs in A, the replica set continues to be healthy, since the three members in A can elect one of their own as primary. The application will still be able to write to rs even while B and C are flooded / ensharked / torn apart by wind. If you split up the members of rs more evenly between the datacenters, say with 2 in A, 2 in B, and 1 in C, the network partition would put rs in an unhealthy state where no primary could be elected. rs will be read-only and will not accept any writes until connectivity from A to at least one of B or C is restored, or connectivity is restored between B and C.
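To make the majority arithmetic concrete, here is a minimal sketch in plain Python (nothing MongoDB-specific; the node counts are just the ones from this example):

```python
def can_elect_primary(visible_members: int, total_voting_members: int) -> bool:
    """A side of a partition can elect a primary only if it still sees
    a strict majority of the replica set's voting members."""
    return visible_members > total_voting_members // 2

TOTAL = 5  # rs has 5 voting members in this example

# 3 members in A, 1 in B, 1 in C: A keeps a majority, B and C go read-only.
print(can_elect_primary(3, TOTAL))  # True
print(can_elect_primary(1, TOTAL))  # False

# 2 in A, 2 in B, 1 in C: no datacenter can elect a primary on its own.
print([can_elect_primary(n, TOTAL) for n in (2, 2, 1)])  # [False, False, False]
```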

Related

Number of arbiters in replication set

In the MongoDB tutorial on deploying a geographically distributed replica set it is said that
Ensure that a majority of the voting members are within a primary facility, “Site A”. This includes priority 0 members and arbiters.
I am confused by the arbiters there, since elsewhere in the documentation I found that
There should only be at most one arbiter configured in any replica set.
So how many arbiters at most can there be in a replica set? And if more than one arbiter is allowed, what is the point of having more than one arbiter in a replica set?
Introduction
The fact that "arbiters" is written in the plural in the first sentence is a matter of style, not a technical requirement.
You really should have at most 1 arbiter. Iirc, you technically could have more, but to be honest with you, I never tried it. But let's assume you could for the sake of the explanation below.
You seem to be a bit unsure here, but correctly assume that it does not make any sense to have more than one arbiter.
Recap: What are arbiters there for?
An arbiter exists to provide a quorum in elections.
Take a replica set with two data bearing nodes. That setup will run as expected as long as both instances are up – they form a quorum of 2 votes out of the 2 original members of the replica set. If one machine goes down, however, we only have 1 vote out of the 2 originally present, which is not a qualified majority, and the data bearing node still running will subsequently revert to secondary state, making writes impossible.
To prevent that, an arbiter is added to the mix. An arbiter does nothing more than track which of the available data bearing nodes has the most current data set and vote for that member in case of an election. So with our replica set of two data bearing nodes, in order to get a qualified majority of votes in case 1 of the nodes forming the replica set goes down, we only need 1 arbiter, since 2/3 votes provides a qualified majority.
Arbiters beyond 2 data bearing nodes
If we had a replica set with 3 data bearing nodes, we would not need an arbiter, since we have 3 voting members, and if 1 member goes down, the others still form a qualified majority needed to hold an election.
A bit more abstractly, we can find out whether we need an arbiter by putting the number of votes present in a replica set into the following "formula":
needArbiter = originalVotes - floor(originalVotes/2) <= originalVotes / 2
If we put in an additional arbiter, the number of votes would be 4: 3 data bearing nodes and 1 arbiter. One node goes down, no problem. A second node goes down, and the replica set will revert to secondary state. Now let's assume one of the two nodes that went down is the arbiter – we would be in secondary state even though the two remaining data bearing nodes, on their own (2 of 3 votes), could have provided a quorum. We'd have to pay for and maintain an additional arbiter without gaining anything from it. So in order to provide a qualified majority again, we would need to add yet another arbiter (making 2 now), without any benefit other than the fact that two arbiters can go down. You basically would need additional arbiters to prevent situations in which the existence of the arbiter you did not need in the first place becomes a problem.
Now let's assume we have 4 data bearing nodes. Since they cannot form a qualified majority when 2 of them go down, that is pretty much the same situation as a replica set with 3 data bearing nodes, just more expensive. So in order to allow 2 nodes of the replica set to be down at the same time, we simply add an arbiter. Do more arbiters make sense? No, even less than with a replica set with two or 3 data bearing nodes, since the probability that 2 data bearing nodes and the arbiter will fail at the same time is very low. And you would need an uneven number of arbiters anyway to keep the total number of votes odd.
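As a minimal sketch, the "formula" above and the resulting fault tolerance can be checked with a few lines of plain Python (just the vote arithmetic, nothing MongoDB-specific):

```python
import math

def need_arbiter(original_votes: int) -> bool:
    """needArbiter = originalVotes - floor(originalVotes/2) <= originalVotes / 2
    In effect: an arbiter is only needed when the vote count is even."""
    return original_votes - math.floor(original_votes / 2) <= original_votes / 2

def fault_tolerance(votes: int) -> int:
    """How many voting members can fail while a strict majority remains."""
    majority = votes // 2 + 1
    return votes - majority

for votes in range(2, 8):
    print(f"{votes} votes: need arbiter = {need_arbiter(votes)}, "
          f"tolerates {fault_tolerance(votes)} failure(s)")
# 3 and 4 votes both tolerate 1 failure, 5 and 6 both tolerate 2:
# adding a single extra vote to an odd-sized set buys nothing.
```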
Conclusion
Imho, with 4 data bearing nodes, the arbiter reaches its limit of usefulness. If you need a high replication factor, the percentage of money saved by using an arbiter instead of a data bearing node becomes smaller and smaller. Remember, the next step would be 6 data bearing nodes plus an arbiter, so the cost you save is less than 1/6 of your overall costs.
So, more generally speaking, the more data bearing nodes you have (the higher your "replication factor" in Mongo terms), the less reasonable it becomes to have additional arbiters, both from the technical point of view (the probability of a majority of nodes failing at the same time becomes lower and lower) and from the business point of view (with a high replication factor, the money saved with an arbiter in comparison to the overall costs becomes absurdly small).
Mnemonic:
The lowest uneven number is 1.
I have a scenario where I think having more than 1 Arbiter makes sense.
Problem
I have 3 data bearing nodes in a replicaset. Now I want to distribute my replicaset geographically so that I can mitigate the risk of a datacenter outage.
3 Node Replicaset does not solve the problem
Primary Datacenter => 2 Data bearing Nodes
Backup Datacenter => 1 Data bearing Node
If the primary datacenter goes down, two of the three nodes in the replica set become unavailable, and the data bearing node in the backup datacenter cannot become primary since a majority is not available. So the 3 node configuration does not solve the problem of a datacenter outage.
5 Node replicaset
Primary Datacenter => 2 Data bearing Nodes
Backup Datacenter => 1 Data bearing Node
Third Datacenter => 2 Arbiters
In this configuration I am able to sustain outage of any of the three datacenters and still be able to operate.
Obviously, a more ideal configuration would be to have 4 data bearing nodes and 1 arbiter. It would give me redundancy in the backup datacenter as well. However, since a data bearing node is a much more expensive proposition than an arbiter, going with 3 data bearing nodes and 2 arbiters makes more sense, and I am happy to forgo the redundancy in the backup datacenter in favor of the cost saving.
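To make the layout concrete, here is a minimal sketch of such a 5 member set as a Python dict mirroring a replica set config document (hostnames and the set name "rs0" are made-up placeholders):

```python
# Hypothetical layout: 3 data bearing nodes + 2 arbiters across three datacenters.
rs_config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "primary-dc-1.example.net:27017"},   # primary DC, data bearing
        {"_id": 1, "host": "primary-dc-2.example.net:27017"},   # primary DC, data bearing
        {"_id": 2, "host": "backup-dc-1.example.net:27017"},    # backup DC, data bearing
        {"_id": 3, "host": "third-dc-arb1.example.net:27017", "arbiterOnly": True},
        {"_id": 4, "host": "third-dc-arb2.example.net:27017", "arbiterOnly": True},
    ],
}
# With 5 voting members, losing any single datacenter leaves at least 3 votes,
# so a primary can still be elected.
```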
For our special case it makes sense to have 2 arbiters. Let me explain: we have 3 data centers, but 1 of these 3 data centers is not suitable to host data bearing members. That's why we host 2 arbiters for each replica set in this data center. The 3 data bearing members of the replSet are hosted in the two other data centers (we want 3 instead of 2 data bearing members for resilience reasons). If 1 of the 3 data centers goes down or is not reachable due to a network partition, the replSet is still able to elect a primary, so it is still readable and writable. This wouldn't be possible with only 1 or 0 arbiters. Hence, 2 arbiters may make sense.
Let's see how it might look. Here are 2 replSets, each with 3 data bearing members and 2 arbiters in 3 data centers, where DC3 is the restricted data center:
| |DC1 |DC2 |DC3 |
|----|-----|-----|-----|
|rs1 |m1,m2|m3 |a1,a2|
|rs2 |m1 |m2,m3|a1,a2|
If one data center goes down, which replSet member would become primary?
DC1 goes down:
rs1: m3
rs2: m2 or m3
DC2 goes down:
rs1: m1 or m2
rs2: m1
DC3 goes down:
rs1: m1, m2 or m3
rs2: m1, m2 or m3
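As a quick check, a small plain-Python sketch of the table above (a1/a2 are the arbiters, which still count as votes) confirms that every single data center outage leaves both replica sets with a voting majority:

```python
# Members per replica set, keyed by hosting data center (from the table above).
layout = {
    "rs1": {"DC1": ["m1", "m2"], "DC2": ["m3"], "DC3": ["a1", "a2"]},
    "rs2": {"DC1": ["m1"], "DC2": ["m2", "m3"], "DC3": ["a1", "a2"]},
}

for down_dc in ("DC1", "DC2", "DC3"):
    for rs, dcs in layout.items():
        total = sum(len(members) for members in dcs.values())
        remaining = sum(len(members) for dc, members in dcs.items() if dc != down_dc)
        print(f"{down_dc} down: {rs} has {remaining}/{total} votes "
              f"-> majority: {remaining > total // 2}")
```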

MongoDB - cross data center primary election DRP / Geographically Distributed Replica Sets

Working with mongo distributed over 3 data centers.
For this example the data center names are A, B, and C.
When everything is going well, all user traffic is pointed to A,
so the mongo primary is on A. The mongo setup is:
3 servers in A (with high priority)
1 server in B (with low priority)
1 server in C (priority 0)
The problem is supporting mongo writes when either of 2 scenarios happens:
no network between A, B and C (the network tunnel is down)
data center A is on fire :), let's say the data center isn't working; at this point all user traffic is pointed to B and a primary election in B is expected.
Scenario 1 isn't a problem: with no network tunnel between the datacenters, A still has a majority of replicas and high priority, so everything keeps working.
Scenario 2 won't work, because when A stops working, all 3 replicas (on A) aren't reachable, so no new primary will be elected in B or C, because the majority of replicas is down.
How can I set up my replica set so it supports both scenarios?
This is not possible: you can't have an 'available' system both in case of a total network partition and in case of a DC failure with the majority election approach used by MongoDB. Either the majority is in one DC, in which case the set will survive partitions but not that DC going down, or the majority requires 2 DCs to be up, which survives one DC going down but not a full network failure.
Your options:
Accept the partition problem and change the setup to 2-2-1 (a possible 2-2-1 member configuration is sketched at the end of this answer). Unreliable tunnels should be solvable; if the entire network of a DC goes down you're at scenario 2.
Accept the DC problem and stick to your configuration. The most likely problems are probably large-scale network issues and massive power outages, not fire.
Use a database that supports other types of fault-tolerance. That, however, is not a panacea since this entails other tradeoffs that must be well understood.
To keep the system up when DC A goes down also requires application servers in DC B or C, which is a tricky problem in its own right. If you use a more partition tolerant database, for instance, you could easily have a 'split brain' problem where application servers in different DCs accept different, but conflicting, writes. Such problems can only be solved at the application level.
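For illustration, here is a minimal sketch of what the 2-2-1 member distribution from the first option could look like, written as a Python dict mirroring a replica set config document (hostnames, the set name "rs0" and the exact priority values are made-up placeholders):

```python
# Hypothetical 2-2-1 layout across datacenters A, B and C.
# No single DC holds a majority on its own, so losing any one DC still
# leaves 3 of 5 votes; the trade-off is that a total partition of all
# three DCs leaves no side with a majority (the "partition problem" above).
rs_config_221 = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "a1.example.net:27017", "priority": 2},  # DC A
        {"_id": 1, "host": "a2.example.net:27017", "priority": 2},  # DC A
        {"_id": 2, "host": "b1.example.net:27017", "priority": 1},  # DC B
        {"_id": 3, "host": "b2.example.net:27017", "priority": 1},  # DC B
        {"_id": 4, "host": "c1.example.net:27017", "priority": 0},  # DC C, never primary
    ],
}
```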

Difference between ensemble and quorum in zookeeper

I am new to zookeeper. I have configured it on a single machine. But I came across the words "ensemble" and "quorum" in the documentation of zookeeper.
Can anyone please tell me the difference between these?
Ensemble
Quorum
This answer is for those who still have doubts about Ensemble and Quorum. An Ensemble is nothing but a cluster of Zookeeper servers, whereas Quorum defines the rule for forming a healthy Ensemble. It is defined by the formula Q = 2N+1, where Q is the number of nodes required to form a healthy Ensemble that can tolerate N failed nodes. You will understand this formula in the following example.
Before I start with an example, I want to define 2 things-
Cluster: a group of connected nodes/servers (from now on I will just say "node") with one node as Leader/Master and the rest as Followers/Slaves.
Healthy Ensemble: A cluster with only one active Leader at any given point of time, hence fault tolerant.
Let me explain with an example which is commonly used when defining Ensemble and Quorum.
Let's say you have 1 zookeeper node. There is nothing to worry about here, as we need more than 1 node to form a cluster.
Now take 2 nodes. There is no problem forming a cluster, but there is a problem forming a healthy Ensemble, because: say the connection between these 2 nodes is lost; then both nodes will think the other node is down, so both of them try to act as Leader, which leads to inconsistency as they can't communicate with each other. This means a cluster of 2 nodes can't afford even a single failure, so what is the use of such a cluster? They are not saying you can't make a cluster of 2 nodes; all they are saying is that it is the same as having a single node, as neither allows even a single failure. Hope this is clear.
Now take 3 nodes. There is no problem forming a cluster or a healthy Ensemble, as this can allow 1 failure according to the formula above: 3 = 2N+1 => N = (3-1)/2 = 1. So when the next failure occurs (either a connection or a node failure), no node will be elected as Leader, hence the Ensemble won't serve any write/update/delete requests, and hence the state of the client cluster remains consistent across the zookeeper cluster nodes. Leader election won't happen until a majority of nodes is available and connected, where the majority m = ⌊n/2⌋+1 and n stands for the number of nodes available when the previous election happened. So here, the 1st election happened with 3 nodes (as it's a 3 node cluster). Then there was a 1st failure, so the remaining 2 nodes could conduct an election, as they had the majority m = ⌊3/2⌋+1 = 2. Then the 2nd failure happened; now there is no majority, as only one node is available for election, but the majority required is m = ⌊2/2⌋+1 = 2.
Now take 4 nodes. There is no problem forming a cluster or a healthy Ensemble, but having 4 nodes is the same as 3 nodes, because both allow only 1 failure. Let's derive it from the Quorum formula: 4 = 2N+1 => N = ⌊(4-1)/2⌋ = ⌊1.5⌋ = 1.
Now take 5 nodes. There is no problem forming a cluster or a healthy Ensemble, as this can allow 2 failures according to the formula above: 5 = 2N+1 => N = (5-1)/2 = 2.
Now take 6 nodes. There is no problem forming a cluster or a healthy Ensemble, but having 6 nodes is the same as 5 nodes, because both allow only 2 failures. Let's derive it from the Quorum formula: 6 = 2N+1 => N = ⌊(6-1)/2⌋ = ⌊2.5⌋ = 2.
Conclusion:
To form a Quorum we need at least 3 nodes, as a 2 node cluster can't handle even a single failure.
It's good to form an Ensemble with an odd number of nodes, as n nodes (n even) tend to allow the same number of failures as n-1 nodes (odd).
It's not good to have too many nodes, as they add latency. The suggested production cluster size is 5: if one server is down for maintenance, the ensemble can still handle one more failure.
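The arithmetic from the examples above can be summarized in a short plain-Python sketch (nothing ZooKeeper-specific, just the majority math):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest majority of the ensemble: m = floor(n/2) + 1."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size: int) -> int:
    """N in Q = 2N + 1: how many nodes may fail while a quorum remains."""
    return (ensemble_size - 1) // 2

for n in range(1, 8):
    print(f"ensemble={n}  quorum={quorum_size(n)}  "
          f"tolerates {tolerated_failures(n)} failure(s)")
# Ensembles of 3 and 4 both tolerate 1 failure, 5 and 6 both tolerate 2,
# which is why odd ensemble sizes are recommended.
```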
Ensemble is an array of nodes (or servers, if you like) that form your Distributed Computer Ecosystem.
Quorum is when things get interesting. On a specific assignment/job, Quorum ensures that a healthy leader-follower majority can be maintained. In other words, it is a rule which ensures that a majority vote can be obtained before proceeding with an activity (e.g. commit/update/delete). In a replication strategy, a quorum is a must-have.
Lets try and use non-technical examples:
1) In your company - there is a board formed by 5 directors (ensemble).
|d1, d2, d3, d4, d5|----- BoD
2) Each director has an equal say in each decision, but a majority of 3 directors must agree on a project at any time. If no majority is there, the company will be dysfunctional.
3) On a particular project, P1, they randomly voted to have a majority of d1, d2, d3 be the decision makers in the project, but d4 and d5 are fully aware of what's going on (so that they can step in anytime).
4) Now (God forbid), d3 passes away after a few months; again everyone agrees that the majority will be formed using d1, d2, d4. d5 is still aware of what's going on. Note that we only have 4 directors left.
5) Disaster strikes again. d5 leaves the company for a competitor. But that doesn't change anything, because the company is still functional with a 3-member BoD.
6) If at any point another disaster strikes the BoD and any of the directors becomes "unavailable", the company is dysfunctional, i.e. we have lost the quorum forming criterion.
Zookeeper uses the formula ceil(N/2) - 1 to get the maximum number of failures allowed for an Ensemble while maintaining a stable quorum. In this case, the minimum recommended number of ensemble nodes is 3 (tolerating at most 1 failure).
When you want to have high availability of the zookeeper service, you use multiple zookeeper servers to create an ensemble. Basically, zookeeper has a master-slave architecture. In an ensemble there will be one master and the rest will be slaves. If the master fails, one of the slaves will act as the master.
The sequence in which a master is assigned is called the quorum. When you create an ensemble, zookeeper internally creates a sequence ID for the slave servers. When the main master fails, it checks the next sequence ID to create a new master.
This concept of quorum is also used while creating nodes in zookeeper.
Ensemble: the number of nodes in the group.
Quorum: the number of nodes required to take an action.
Example: you have 5 nodes.
The ensemble is 5, but according to the majority rule the Quorum should be 3. If we write to 3 nodes successfully, then we send a success response to the client. Apache Zookeeper Quorum

Why do we need an 'arbiter' in MongoDB replication?

Assume we set up MongoDB replication without an arbiter. If the primary is unavailable, the replica set will elect a secondary to be primary. So I think there is a kind of implicit arbiter, since the replica set will elect a primary automatically.
So I am wondering why we need a dedicated arbiter node. Thanks!
I created a spreadsheet to better illustrate the effect of Arbiter nodes in a Replica Set.
It basically comes down to these points:
With an RS of 2 data nodes, losing 1 server brings you below your voting minimum (which is "greater than N/2"). An arbiter solves this.
With an RS of even numbered data nodes, adding an Arbiter increases your fault tolerance by 1 without making it possible to have 2 voting clusters due to a split.
With an RS of odd numbered data nodes, adding an Arbiter would allow a split to create 2 isolated clusters with "greater than N/2" votes and therefore a split brain scenario.
Elections are explained [in poor] detail here. In that document it states that an RS can have 50 members (even number) and 7 voting members. I emphasize "states" because it does not explain how it works. To me it seems that if you have a split happen with 4 members (all voting) on one side and 46 members (3 voting) on the other, you'd rather have the 46 elect a primary and the 4 to be a read-only cluster. But, that's exactly what "limited voting" prevents. In that situation you will actually have a 4 member cluster with a primary and a 46 member cluster that is read only. Explaining how that makes sense is out of the scope of this question and beyond my knowledge.
It's necessary to have an arbiter in a replica set for the reasons below:
A replica set is more reliable if it has an odd number of members. In case there is an even number of members, it's better to add an arbiter to the replica set.
Arbiters do not hold data; they are there just to vote in elections when there is a node failure.
An arbiter is a lightweight process and does not consume much hardware resource.
Arbiters only exchange user credential data with the rest of the replica set, and that traffic is encrypted.
Votes during elections, heartbeats, and configuration data are not encrypted when communicated between replica set members.
It is better to run the arbiter on a separate machine rather than alongside one of the replica set members, to retain high availability.
Hope this helps !!!
This really comes down to the CAP theorem, whereby it is stated that if there is an equal number of servers on either side of the partition, the database cannot maintain CAP (Consistency, Availability, and Partition tolerance). An Arbiter is specifically designed to create an "imbalance" or majority on one side so that a primary can be elected in this case.
If you get an even number of nodes on either side MongoDB will not elect a primary and your set will not accept writes.
Edit
By either side I mean, for example, 2 on one side and 2 on the other. My English wasn't easy to understand there.
So really what I mean is both sides.
Edit
Wikipedia presents quite a good case for explaining CAP: http://en.wikipedia.org/wiki/CAP_theorem
Arbiters are an optional mechanism to allow voting to succeed when you have an even number of mongods deployed in a replica set. Arbiters are lightweight, meant to be deployed on a server that is NOT a dedicated mongo replica, i.e. the server's primary role is some other task, like a redis server. Since they're light they won't interfere (noticeably) with the system's resources.
From the docs :
An arbiter does not have a copy of data set and cannot become a primary. Replica sets may have arbiters to add a vote in elections for primary. Arbiters allow replica sets to have an uneven number of members, without the overhead of a member that replicates data.
http://docs.mongodb.org/manual/core/replica-set-arbiter/
http://docs.mongodb.org/manual/core/replica-set-elections/#replica-set-elections

MongoDB replication topograph and majority write for a two data centre setup

I have a fairly low concurrency, latency sensitive application where the data has to be written across two data centres.
The database will have four physical hosts for redundancy; a primary and secondary in the main "A" data centre and a pair of hot standby hosts in the secondary "B" data centre. Call them AP, AS, BP, BS.
We want automated failover between AP and AS in the main data centre but manual promotion of hosts in the secondary data centre if we have to swap data centres.
We have additional hardware to run an arbiter node in each data centre to ensure that master election is based on three hosts within a given data centre. The arbiter in the B data centre will be offline. As we don't want the B side hosts to be promoted without manual intervention, we can set their priority to zero.
We would like to achieve safe writes from java clients running in the A data centre that are confirmed to be on three of the four nodes: AP, AS, and BP.
Is a vanilla four node setup okay? If we lose the A data centre, servers BS and BP might not both be up to date with the same data. If we turn on the B side arbiter and increase the priorities on the B side to prefer BP as master, can we expect BS and BP to come up to date with the latest data?
Or, to get data into AP, AS and BP, should we set up a three data node replica set of AP, AS and BP that the clients will write to with a "w=3" error check, and somehow chain BS off the back of BP? [edit: No, specifying w=3 is dangerous when nodes are taken down for maintenance or if connectivity to the B side is lost.]
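For reference, a minimal sketch of what a "w=3" write looks like from a driver. The question mentions Java clients, but this uses Python/pymongo with made-up hostnames, database and collection names, and a wtimeout so the write fails fast instead of blocking when three members are not reachable (the risk noted in the edit above):

```python
from pymongo import MongoClient, WriteConcern
from pymongo.errors import WTimeoutError

# Made-up hosts matching the AP/AS/BP/BS naming in the question.
client = MongoClient("mongodb://ap,as,bp,bs/?replicaSet=rs0")

# Require acknowledgement from 3 members, but give up after 5 seconds.
orders = client["appdb"].get_collection(
    "orders", write_concern=WriteConcern(w=3, wtimeout=5000)
)

try:
    orders.insert_one({"item": "widget", "qty": 1})
except WTimeoutError:
    # The write may still be applied on some members; the write concern
    # simply was not satisfied within the timeout.
    print("write not acknowledged by 3 members in time")
```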
In a replica set the primary is selected via an election. This election follows certain rules and nodes have certain requirements in order to be "electable".
One of those requirements is that they must be most up-to-date to be elected as a primary.
What this means in your case is that if BP has some write that BS does not have, then BS cannot be elected as a primary. BP would have to be elected as the primary and BS will be syncing off of it (and will eventually catch up and get that write).
In reality, since you will be doing the fail-over manually, it's not really possible to end up in a scenario where BS doesn't have all the writes BP has, because while you are logging in and getting the new arbiter set up, BS will get all the writes from BP that it didn't have (a secondary does not have to sync off of the primary; it can sync off of another secondary that is ahead of it).