Why does a MongoDB replica set need an odd number of voting members?

I find the replica set requirements a bit confusing, and I'm probably missing something obvious (like under which conditions elections happen).
I understand that in normal operation you need quorum: a vote takes place, and to get a majority you need an odd number of machines.
But since we use a replica set for failover, if the primary dies we are left with an even number of voting members, which, based on my limited experience, lengthens the time to elect a new primary.
Also, according to the documentation, adding a voting member doesn't start an election, so wouldn't it make more sense to start (boot) your replica set with an even number of nodes?
Say we start with 4 machines in the replica set and one machine dies: there is a re-election among the 3 remaining machines, with a fast quorum. We then add a machine back, which triggers no re-election, and we are back to normal operating conditions.
Can someone shed some light on this?

TL;DR: With single-master systems, an even split makes it impossible to determine which side still has a majority, taking both sides down.
Let N be a cluster of four machines:
One machine dies, the others resume operation. Good.
Two machines die, we're offline because we no longer get a majority. Bad.
Let M be a cluster of three machines:
One machine dies, the others resume operation. Good.
Two machines die, we're offline because we no longer get a majority. Bad.
=> Same result at 3/4 of the cost.
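The arithmetic behind this is simple enough to spell out. Here is a minimal sketch (plain Python, nothing MongoDB-specific) of how many member failures a cluster of a given size can absorb:

```python
# Majority of n voters is floor(n/2) + 1, so a cluster can survive
# n - majority(n) failures before it can no longer elect a primary.

def majority(n: int) -> int:
    return n // 2 + 1

for n in range(2, 8):
    print(f"{n} members: majority = {majority(n)}, "
          f"tolerates {n - majority(n)} failure(s)")

# 3 and 4 members both tolerate exactly 1 failure, and 5 and 6 both
# tolerate 2: the extra even member adds cost but no fault tolerance.
```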
Now, let's add an assumption or two:
We're also going to operate some kind of server application that uses the database
The network can be partitioned
Let's say you have two datacenters, one with two database instances and the backend server machines. If the connection to the backup center (which has one MongoDB instance) fails, you're still online.
Now if you added a second MongoDB instance at the backup data center, a network partition would, despite the seemingly higher redundancy, yield lower availability: the two instances in the main datacenter would no longer constitute a majority of the four, so neither side could continue to operate.
=> Less availability at higher cost. But that doesn't answer the question yet.
Let's say you're really worried about availability: you have two data centers, with backend servers in both datacenters, anycast IPs, the whole deal. Now the network between the two DCs is partitioned, but some clients connect to DC A while others reach DC B. How do you now determine which datacenter may accept writes? It's not possible - this is why the odd number is necessary.
You don't actually need anycast IPs, BGP or any other fancy stuff for the problem to become real: any application that writes (a worker, a stale request, anything) would force you to merge divergent writes later, which is a completely different concurrency scheme.
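You can see the same rule from the application side. Below is a hedged sketch with pymongo (the host names, database and collection are made up for illustration): a write issued with majority write concern on the minority side of a partition times out instead of silently diverging.

```python
from pymongo import MongoClient, WriteConcern
from pymongo.errors import WTimeoutError

# Hypothetical replica set "rs0" spread over two datacenters.
client = MongoClient(
    "mongodb://dc-a-1:27017,dc-a-2:27017,dc-b-1:27017/?replicaSet=rs0"
)
orders = client.shop.get_collection(
    "orders",
    # Wait for a majority of voting members; give up after 5 seconds.
    write_concern=WriteConcern(w="majority", wtimeout=5000),
)

try:
    orders.insert_one({"sku": "A-42", "qty": 1})
except WTimeoutError:
    # On the minority side of a partition the write cannot reach a
    # majority, so it fails instead of creating a divergent copy.
    print("no majority reachable - write not acknowledged")
```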

Related

MongoDB Replica Set - 5 data centers - are two arbiters possible?

Is it possible to deploy a MongoDB replica set with two arbiters? The documentation states that a replica set should contain a maximum of one arbiter, but it doesn't specify whether this is a hard limitation or just a recommendation.
The planned deployment would span five data centers: three of them would hold the actual data and run on high-performance hardware, while the last two would hold no data and only run arbiters.
The goal is to achieve high-availability and allow the system to operate even with the loss of any two data centers.
The other option is a 4+1 setup, which would of course cost more, but would that really provide any benefit over 3+2 in this case?
More than one arbiter is not recommended, because arbiters count as voting nodes. Consider what happens when two of the data-bearing nodes go offline, leaving you with a primary and two arbiters. This is not ideal because:
Majority write concern waits until writes propagate to a majority of the voting nodes. Since arbiters cannot acknowledge writes, such writes will hang. This is especially a problem if the replica set is part of a sharded cluster, since chunk moves require majority write concern.
Having two arbiters and a single active data-bearing node means that you no longer have high availability. If the primary is subsequently corrupted, no other node has a copy of the data.
Losing two datacenters typically means you have a more pressing issue than the database not allowing writes (e.g. the reliability of your hosting company). If a hosting company lets two datacenters stay offline for an extended period of time, you have to wonder what the chances are of them corrupting your data.
If you envision that two of your data-bearing nodes can be offline at the same time (due to maintenance, disasters, etc.), then the best thing to do is to have a replica set with five data-bearing nodes.
If five data-bearing nodes is not ideal for your situation, I would recommend going with only one arbiter (the 4+1 topology you mentioned).
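To put rough numbers on the 3+2 vs 4+1 comparison, here is a small sketch (plain Python, nothing MongoDB-specific); the key fact it encodes is that arbiters vote but cannot acknowledge writes:

```python
def after_losing(data: int, arbiters: int, data_down: int) -> None:
    voters = data + arbiters
    need = voters // 2 + 1                  # votes to keep/elect a primary
    data_up = data - data_down
    has_primary = data_up >= 1 and (data_up + arbiters) >= need
    print(f"{data} data + {arbiters} arbiter(s), {data_down} data down: "
          f"primary={has_primary}, copies of the data={data_up}")

after_losing(3, 2, 2)  # primary survives, but only 1 copy of the data left
after_losing(4, 1, 2)  # primary survives with 2 copies still replicating

# In both cases w:"majority" (3 of 5 votes) cannot be acknowledged while
# two data nodes are down, but 4+1 keeps a redundant copy of the data.
```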

Requires simple explanation on Arbiter's role in a given mongoDB replica set

I came across the MongoDB official site explaining why a replica set should have an odd number of members. I also encountered the term Arbiter on the same site; based on my understanding, an arbiter cannot be elected primary but does participate in elections (from https://docs.mongodb.com/manual/core/replica-set-arbiter/).
There is also a post related to arbiters, Why do we need an 'arbiter' in MongoDB replication?, which relates it to the CAP theorem and makes things even more complicated.
First of all, why do we need an odd number of members? Also, can someone explain what this Arbiter is and what its role in a given replica set is, in simple layman's English?
Thanks in advance.
In short: it is to stop the two normal nodes of the replica set getting into a split-brain situation if they lose contact with each other.
MongoDB replica sets are designed so that, if one or more members go down or lose contact, the remaining members are able to keep going as long as they have a majority between them. The majority clause is important: without it, you might have a situation where the network is split in two, the nodes on each side of the partition each think they're still carrying on the replica set, and you end up with different sets of data.
So to avoid the split-brain problem, the nodes of a replica set will not continue if they can't command an absolute majority. Take, for example, a replica set with just two nodes. If they lose communication with each other, the outcome is symmetrical:
Each one will reason the same way:
realise it has lost communication with the other
assess whether it is possible to keep the replica set going
realise that 1 node (out of 2) does not constitute a majority
revert to Secondary mode
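That assessment boils down to a single comparison. A toy sketch (plain Python) of the rule each node applies:

```python
# A node may keep (or take) the primary role only if the members it can
# still see, itself included, form a strict majority of the whole set.

def can_act_as_primary(visible_members: int, set_size: int) -> bool:
    return visible_members > set_size // 2

print(can_act_as_primary(1, 2))  # False: each isolated node steps down
print(can_act_as_primary(2, 3))  # True: a node plus an arbiter carries on
```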
The difference an Arbiter makes
If there is a third node, then even if the two main nodes lose contact with each other then there will still be one of them in contact with the arbiter. This allows the two main nodes to make different decisions, and keep the replica set going while avoiding the split-brain problem.
Consider the following example: a 3-node replica set with data-bearing nodes A and B plus an arbiter. Whichever way the network partition goes, one data-bearing node will still be in contact with the arbiter; suppose, for example, that node A is cut off from both node B and the arbiter.
Node A will:
realise it can contact neither node B nor the arbiter
assess whether it is possible to keep the replica set going
realise that 1 node (out of 3) does not constitute a majority
revert to Secondary mode
Whereas node B is able to react differently:
realise it cannot contact node A, but still has contact with the arbiter
assess whether it is possible to keep the replica set going
realise that 2 nodes (out of 3) do constitute a majority
take over as Primary
This also illustrates how you should deploy an arbiter to get that benefit:
try to put the arbiter on a system independent of both the data-bearing nodes, to maximise the chance of it still being able to communicate with either throughout network problems
it doesn't need to store data, so you don't need high-spec hardware
Just 1 arbiter is enough to break the deadlock; you don't get any benefit from multiple arbiters
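For reference, this is roughly what such a topology looks like as a replica set configuration. A hedged sketch using pymongo (host names are placeholders; rs.initiate() in the mongo shell wraps the same replSetInitiate command):

```python
from pymongo import MongoClient

config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "data-a:27017"},
        {"_id": 1, "host": "data-b:27017"},
        # Votes but stores no data, so a small machine is fine.
        {"_id": 2, "host": "arbiter-1:27017", "arbiterOnly": True},
    ],
}

# Run once against a fresh member to initiate the set.
client = MongoClient("data-a", 27017, directConnection=True)
client.admin.command("replSetInitiate", config)
```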
Take the example of a 2-member replica set: in the event of a network partition, i.e. the 2 members losing touch with each other, who gets to become the primary? There would be a tie and a need for a tie-breaker. That would not be the case with a 3-member replica set: the group that contains two nodes wins, and one of them becomes primary. That is the basis of the requirement for an odd number of nodes in a replica set. As for an arbiter, it happens to be lightweight, so one can save money by running it on a smaller machine: we do not expect it to hold any data, we just need it to be present to vote for a primary.

Mongodb ReplicaSet

You have just been hired at a new company with an existing MongoDB deployment. They are running a single replica set with two members. When you ask why, they explain that this ensures that the data will be durable in the face of the failure of either server. They also explain that should they use a readPreference of "primaryPreferred", that the application can read from the one remaining server during server maintenance.
You are concerned about two things, however. First, a server is brought down for maintenance once a month. When this is done, the replica set primary steps down, and the set cannot accept writes. You would like to ensure availability of writes during server maintenance.
Second, you also want to ensure that all writes can be replicated during server maintenance.
Which of the following options will allow you to ensure that a primary is available during server maintenance, and that any writes it receives will replicate during this time?
Check all that apply.
Add two arbiters.
Add another data bearing node.
Add two data bearing members plus one arbiter.
Add an arbiter.
Increase the priority of the first server from one to two.
This question is about majority and replication. A majority means more than half of the voting members (e.g. 2 out of 3). Replication requires at least two data-bearing servers to be available, so that data can be copied between them.
So we have the following picture:
Add two arbiters. (majority 3/4)
Add another data bearing node. (majority 2/3)
Add two data bearing members plus one arbiter. (majority 3/5)
Add an arbiter. (majority 2/3)
Increase the priority of the first server from one to two. (nothing for the majority)
Now, you have to figure out the correct majority and think about which of the remaining variants will cover the replication.
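If you want to check that reasoning mechanically, here is a small sketch (plain Python) that tests each option against both requirements with one data-bearing server down for maintenance:

```python
# For each option: (a) can a primary still be elected, and (b) can
# writes still replicate, i.e. are two data-bearing copies available?

def check(name: str, data: int, arbiters: int) -> None:
    voters = data + arbiters
    need = voters // 2 + 1                  # votes needed for a primary
    data_up = data - 1                      # one data node in maintenance
    elects = (data_up + arbiters) >= need
    replicates = data_up >= 2
    print(f"{name}: primary={elects}, replication={replicates}")

check("add two arbiters", 2, 2)
check("add another data-bearing node", 3, 0)
check("add two data-bearing members plus an arbiter", 4, 1)
check("add an arbiter", 2, 1)
```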

Why ZooKeeper needs majority to run?

I've been wondering why ZooKeeper needs a majority of the machines in the ensemble to work at all. Let's say we have a very simple ensemble of 3 machines: A, B and C.
When A fails, a new leader is elected - fine, everything works. When another one dies, let's say B, the service becomes unavailable. Does that make sense? Why can't machine C handle everything alone until A and B are up again?
After all, one machine is enough to do all the work (for example, a single-machine ensemble works fine)...
Is there any particular reason why ZooKeeper is designed this way? Is there a way to configure ZooKeeper so that, for example, the ensemble is available whenever at least one of N machines is up?
Edit:
Maybe there is a way to apply a custom leader-election algorithm? Or to define the quorum size?
Thanks in advance.
ZooKeeper is intended to distribute things reliably. If the network of systems becomes segmented, you don't want the two halves operating independently and potentially getting out of sync, because when the failure is resolved there is no way to know which half is right. If you have it refuse to operate when it has less than a majority, then you can be assured that when a failure is resolved, everything will come right back up without further intervention.
The reason to get a majority vote is to avoid a problem called "split-brain".
Basically, in a network failure you don't want the two parts of the system to continue as usual. You want one to continue and the other to understand that it is not part of the cluster.
There are two main ways to achieve that. One is to hold a shared resource, for instance a shared disk where the leader holds a lock: if you can see the lock, you are part of the cluster; if you can't, you're out. If you hold the lock, you're the leader; if you don't, you're not. The problem with this approach is that you need that shared resource.
The other way to prevent a split-brain is majority count: if you get enough votes, you are the leader. This still works with two nodes up (out of an ensemble of 3), where the leader says it is the leader and the other node, acting as a "witness", agrees. This method is preferable as it works in a shared-nothing architecture, and indeed that is what ZooKeeper uses.
As Michael mentioned, a node cannot know if the reason it doesn't see the other nodes in the cluster is because these nodes are down or there's a network problem - the safe bet is to say there's no quorum.
Let’s look at an example that shows how things can go wrong if the quorum (majority of running servers) is too small.
Say we have five servers and a quorum can be any set of two servers. Now say that servers s1 and s2 acknowledge that they have replicated a request to create a znode /z. The service returns to the client saying that the znode has been created. Now suppose servers s1 and s2 are partitioned away from the other servers and from clients for an arbitrarily long time, before they have a chance to replicate the new znode to the other servers. The service in this state is able to make progress because there are three servers available and it really needs only two according to our assumptions, but these three servers have never seen the new znode /z. Consequently, the request to create /z is not durable.
This is an example of the split-brain scenario. To avoid this problem, in this example the size of the quorum must be at least three, which is a majority out of the five servers in the ensemble. To make progress, the ensemble needs at least three servers available. To confirm that a request to update the state has completed successfully, this ensemble also requires that at least three servers acknowledge that they have replicated it.
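The underlying rule is the pigeonhole principle: two quorums of size q out of n servers are guaranteed to share a member only if 2q > n, which is exactly the majority requirement. A small sketch (plain Python) checking this for the five-server example:

```python
from itertools import combinations

# With n servers, disjoint quorums of size q exist iff 2q <= n.
# Disjoint quorums mean two sides can commit conflicting state.
def disjoint_quorums_exist(n: int, q: int) -> bool:
    servers = range(n)
    return any(
        set(a).isdisjoint(b)
        for a in combinations(servers, q)
        for b in combinations(servers, q)
    )

print(disjoint_quorums_exist(5, 2))  # True: {s1,s2} and {s3,s4} never meet
print(disjoint_quorums_exist(5, 3))  # False: any two majorities overlap
```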

mongoDB replication+sharding on 2 servers reasonable?

Consider the following setup:
There are 2 physical servers which are set up as a regular MongoDB replica set (including an arbiter process, so automatic failover will work correctly).
Now, as far as I understand, most actual work will be done on the primary server, while the slave will mostly just do work to keep its dataset in sync.
Would it be reasonable to introduce sharding into this setup by setting up a second replica set on the same 2 servers, so that each of them has one mongod process running as a primary and one process running as a secondary?
The expected result would be that both servers share the workload of actual queries/inserts while both are up. If one server fails, the whole setup should fail over elegantly and continue running until the other server is restored.
Are there any downsides to this setup, apart from the overall overhead in setup and number of processes (mongos/config servers/arbiters)?
That would definitely work. I asked a question in the #mongodb IRC channel a while ago as to whether it was a bad idea to run multiple mongod processes on a single machine. The answer was "as long as you have the RAM/CPU/bandwidth, go nuts".
It's worth noting that if you're looking for high-performance reads, and don't mind writes being a bit slower, you could:
Do your writes in "safe mode", where the write doesn't return until it's been propagated to N servers (in this case, where N is the number of servers in the replica set, so all of them)
Set the driver-appropriate flag in your connection code to allow reading from slaves.
This would get you a clustered setup similar to MySQL - write once on the master, but any of the slaves is eligible for a read. In a circumstance where you have many more reads than writes (say, an order of magnitude), this may be higher performance, but I don't know how it'd behave when a node goes down (since writes may stall trying to write to 3 nodes, but only 2 are up, etc - that would need testing).
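"Safe mode" is the old getLastError-era terminology; in current drivers the same ideas are expressed as a write concern plus a read preference. A hedged sketch with pymongo (host names are made up; note that arbiters cannot acknowledge writes, so N should count only the data-bearing members, here 2):

```python
from pymongo import MongoClient

client = MongoClient(
    "mongodb://node-a:27017,node-b:27017/?replicaSet=rs0",
    w=2,                                  # both data-bearing members must ack
    wTimeoutMS=5000,                      # fail fast instead of stalling
    readPreference="secondaryPreferred",  # read from a secondary when possible
)

# Writes block until both data nodes have the document; reads are
# served by a secondary while one is available.
client.shop.products.insert_one({"sku": "A-42", "qty": 1})
print(client.shop.products.find_one({"sku": "A-42"}))
```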
One thing to note is that while both machines are up, your queries are being split between them. When one goes down, all queries will go to the remaining machine thus doubling the demands placed on it. You'd have to make sure your machines could withstand a sudden doubling of queries.
In that situation, I'd reconsider sharding in the first place, and just make it an un-sharded replica set of 2 machines (+1 arbiter).
You are missing one crucial detail: in a sharded setup with only two physical nodes, if one dies, all your data is gone. This is because you don't have any redundancy below the sharding layer (the recommended way is for each shard to be composed of a replica set).
What you said about the replica set however is true: you can run it on two shared-nothing nodes and have an additional arbiter. However, the recommended setup would be 3 nodes: one primary and two secondaries.
http://www.markus-gattol.name/ws/mongodb.html#do_i_need_an_arbiter