What should be the majority in an ensemble for Zookeeper - apache-zookeeper

I am trying to understand Zookeeper using this book - Zookeeper By Flavio Junqueira, Benjamin Reed, it is mentioned that we need to select a majority of servers for quorum as stated here:
Say that we use four servers for an ensemble. A majority of servers is
comprised of three servers. However, this system will only tolerate a
single crash, because a double crash makes the system lose majority.
Consequently, with four servers, we can only tolerate a single crash,
but quorums now are larger, which implies that we need more
acknowledgments for each request. The bottom line is that we should
always shoot for an odd number of servers.
Please help me in understanding this.
How do we select the majority of servers for a given ensemble?
Why does this statement say quorums now are larger and why do we need more acknowledgments for each request?

It just means that more servers should be up than down where each server in the ensemble should be accounted for, or that more servers have acknowledged message receipt than those that have not. With 4 servers you need 3 servers to be up to satisfy that condition, with 3, only 2. In each instance you can only tolerate the failure of one server for the cluster to still be up. The 4 node cluster is worse because you now have an extra server that is essentially not making your cluster any more fault tolerant than just a 3 node one.
Additionally, if you had 3 nodes, you would require just 2 acknowledgements to meet the quorum requirement. With 4, you need 3 acks. That would lead to a slower cluster. That's what the ' Consequently, with four servers...' statement means.

Related

Why only n failed servers are allowed in multi-node Zookeper ensembles?

I am reading through the theory for Apache Kafka and came across Zookeeper quorum allowance. I wanted to know why only n failed servers are allowed to keep a quorum? If we are using 5 servers, then why not allow 3 servers to fail and still not let Zookeeper go down? We are left with 2 servers here, which is the same if we use 3 server configuration and allow one to fail? Another question, if we allow 1 to fail in 3 server configuration, then isn't the odd number rule voilated? Or this odd number rule is for general case and we randomly select an output from the two in case of a clash?
Zookeeper won't "go down" with an even number. It'll just be confused because quorum, majority rule of (N+1) / 2 servers, cannot be reached amongst the remaining servers. e.g. With 3 brokers, 2 in agreement (not two in total, giving different information to each other) are needed for a quorum, or with 5, then 3 are needed

ActiveMQ Artemis cluster failover questions

I have a question in regards to Apache Artemis clustering with message grouping. This is also done in Kubernetes.
The current setup I have is 4 master nodes and 1 slave node. Node 0 is dedicated as LOCAL to handle message grouping and node 1 is the dedicated backup to node 0. Nodes 2-4 are REMOTE master nodes without backup nodes.
I've noticed that clients connected to nodes 2-4 is not failing over to the 3 other master nodes available when the connected Artemis node goes down, essentially not discovering the other nodes. Even after the original node comes back up, the client continues to fail to establish a connection. I've seen from a separate Stack Overflow post that master-to-master failover is not supported. Does this mean for every master node I need to create a slave node as well to handle the failover? Would this cause a two instance point of failure instead of however many nodes are within the cluster?
On a separate basic test using a cluster of two nodes with one master and one slave, I've observed that when I bring down the master node clients are connected to, the client doesn't failover to the slave node. Any ideas why?
As you note in your question, failover is only supported between a live and a backup. Therefore, if you wanted failover for clients which were connected to nodes 2-4 then those nodes would need backups. This is described in more detail in the ActiveMQ Artemis documentation.
It's worth noting that clustering and message grouping, while technically possible, is a somewhat odd pairing. Clustering is a way to improve overall message throughput using horizontal scaling. However, message grouping naturally serializes message consumption for each group (to maintain message order) which then decreases overall message throughput (perhaps severely depending on the use-case). A single ActiveMQ Artemis node can potentially handle millions of messages per second. It may be that you don't need the increased message throughput of a cluster since you're grouping messages.
I've often seen users simply assume they need a cluster to deal with their expected load without actually conducting any performance benchmarking. This can potentially lead to higher costs for development, testing, administration, and (especially) hardware, and in some use-cases it can actually yield worse performance. Please ensure you've thoroughly benchmarked your application and broker architecture to confirm the proposed design.

Why does a mongodb replica set need an odd number of voting members?

If find the replica set requirement a bit confusing, and I'm probably missing something obvious (like under which condition there are elections).
I understand that in normal operations you need quorum, and a voting takes place and to get a majority you need and odd numbers of machines.
But since we use a replica set for failover, if the master dies, then we are left with an even number of voting members, which based on my limited experience lengthen the time to elect a primary.
Also according to the documentation, the addition of a voting member doesn't start an election, it would seem that starting (booting) you replica set with an even number of nodes would make more sense?
So if we start say with 4 machines in the replica set, and one machine dies, there is a re-election with 3 machines, fast quorum. We add a machine back to get back to our normal operation state, no re-election and we are back to our normal operation conditions.
Can someone shed a light on this?
TL;DR: With single master systems, even partitions make it impossible to determine which remainder still has a majority, taking both systems down.
Let N be a cluster of four machines:
One machine dies, the others resume operation. Good.
Two machines die, we're offline because we no longer get a majority. Bad.
Let M be a cluster of three machines:
One machine dies, the others resume operation. Good.
Two machines die, we're offline because we no longer get a majority. Bad.
=> Same result at 3/4 of the cost.
Now, let's add an assumption or two:
We're also going to operate some kind of server application that uses the database
The network can be partitioned
Let's say you have two datacenters, one with two database instances and the backend server machines. If the connection to the backup center (which has one MongoDB instance) fails, you're still online.
Now if you added a second MongoDB instance at the backup data center, a network partition would, despite seemingly higher redundancy, yield lower availability since we'd lose the majority in case of a network partition and can't continue to operate.
=> Less availability at higher cost. But that doesn't answer the question yet.
Let's say you're really worried about availability: You have two data centers, with backend servers in both datacenters, anycast IPs, the whole deal. Now the network between the two DCs is partitioned, but some clients connect to DC A while other reach DC B. How do you now determine which datacenter may accept writes? It's not possible - this is why the odd number is necessary.
You don't actually need Anycast IPs, BGP or any fancy stuff for the problem to become real, any writing application (like a worker, a stale request, anything) would require later merging different writes, which is a completely different concurrency scheme.

Why ZooKeeper needs majority to run?

I've been wondering why ZooKeeper needs a majority of the machines in the ensemble to work at all. Lets say we have a very simple ensemble of 3 machines - A,B,C.
When A fails, new leader is elected - fine, everything works. When another one dies, lets say B, service is unavailable. Does it make sense? Why machine C cannot handle everything alone, until A and B are up again?
Since one machine is enough to do all the work (for example single machine ensemble works fine)...
Is there any particular reason why ZooKeeper is designed in this way? Is there a way to configure ZooKeeper that, for example ensemble is available always when at least one of N is up?
Edit:
Maybe there is a way to apply a custom algorithm of leader selection? Or define a size of quorum?
Thanks in advance.
Zookeeper is intended to distribute things reliably. If the network of systems becomes segmented, then you don't want the two halves operating independently and potentially getting out of sync, because when the failure is resolved, it won't know what to do. If you have it refuse to operate when it's got less than a majority, then you can be assured that when a failure is resolved, everything will come right back up without further intervention.
The reason to get a majority vote is to avoid a problem called "split-brain".
Basically in a network failure you don't want the two parts of the system to continue as usual. you want one to continue and the other to understand that it is not part of the cluster.
There are two main ways to achieve that one is to hold a shared resource, for instance a shared disk where the leader holds a lock, if you can see the lock you are part of the cluster if you don't you're out. If you are holding the lock you're the leader and if you don't your not. The problem with this approach is that you need that shared resource.
The other way to prevent a split-brain is majority count, if you get enough votes you are the leader. This still works with two nodes (for a quorum of 3) where the leader says it is the leader and the other node acting as a "witness" also agrees. This method is preferable as it can work in a shared nothing architecture and indeed that is what Zookeeper uses
As Michael mentioned, a node cannot know if the reason it doesn't see the other nodes in the cluster is because these nodes are down or there's a network problem - the safe bet is to say there's no quorum.
Let’s look at an example that shows how things can go wrong if the quorum (majority of running servers) is too small.
Say we have five servers and a quorum can be any set of two servers. Now say that servers s1 and s2 acknowledge that they have replicated a request to create a znode /z. The service returns to the client saying that the znode has been created. Now suppose servers s1 and s2 are partitioned away from the other servers and from clients for an arbitrarily long time, before they have a chance to replicate the new znode to the other servers. The service in this state is able to make progress because there are three servers available and it really needs only two according to our assumptions, but these three servers have never seen the new znode /z. Consequently, the request to create /z is not durable.
This is an example of the split-brain scenario. To avoid this problem, in this example the size of the quorum must be at least three, which is a majority out of the five servers in the ensemble. To make progress, the ensemble needs at least three servers available. To confirm that a request to update the state has completed successfully, this ensemble also requires that at least three servers acknowledge that they have replicated it.

ZooKeeper reliability - three versus five nodes

From the ZooKeeper FAQ:
Reliability:
A single ZooKeeper server (standalone) is essentially a coordinator with
no reliability (a single serving node failure brings down the ZK service).
A 3 server ensemble (you need to jump to 3 and not 2 because ZK works
based on simple majority voting) allows for a single server to fail and
the service will still be available.
So if you want reliability go with at least 3. We typically recommend
having 5 servers in "online" production serving environments. This allows
you to take 1 server out of service (say planned maintenance) and still
be able to sustain an unexpected outage of one of the remaining servers
w/o interruption of the service.
With a 3-server ensemble, if one server is taken out of rotation and one server has an unexpected outage, then there is still one remaining server that should ensure no interruption of service. Then why the need for 5 servers? Or is it more than just interruption of service that is being considered?
Update:
Thanks to #sbridges for pointing out that it has to do with maintaining a quorum. And the way that ZK defines a quorum is ceil(N/2) where N is the original number in the ensemble (and not just the currently available set).
Now, a google search for ZK quorum finds this in the HBase book chapter on ZK:
In ZooKeeper, an even number of peers is supported, but it is normally
not used because an even sized ensemble requires, proportionally, more
peers to form a quorum than an odd sized ensemble requires. For
example, an ensemble with 4 peers requires 3 to form a quorum, while
an ensemble with 5 also requires 3 to form a quorum. Thus, an ensemble
of 5 allows 2 peers to fail and still maintain quorum, and thus is more
fault tolerant than the ensemble of 4, which allows only 1 down peer.
And this paraphrasing of Wikipedia in Edward J. Yoon's blog:
Ordinarily, this is a majority of the people expected to be there,
although many bodies may have a lower or higher quorum.
Zookeeper requires that you have a quorum of servers up, where quorum is ceil(N/2). For a 3 server ensemble, that means 2 servers must be up at any time, for a 5 server ensemble, 3 servers need to be up at any time.
Basically, Zookeeper will work just fine as long as Active Zookeepers are in MAJORITY compared to failed Zookeepers.
Also, in case of even quorum size i.e 2,4,6 etc. Failed = Active, because of that its not recommended.
Both 3 and 4 will handle only 1 faliures then why whould we want to used 4 Zookeepers instead of 3.