A puzzle on the Paxos algorithm for consensus - distributed-computing

Say 5 nodes (with ids 1..5) make up a distributed system. During the process of reaching consensus, node 4 suggests a proposal with proposal number 103. In similar fashion, node 3 suggests a proposal with proposal number 111. Node 2, being a malicious node, immediately makes another proposal with proposal number 117. What will be the final state of the system after consensus is reached (consider Paxos as the underlying consensus algorithm)?
Proposal 103 from node 4 will be considered finally
Blocking proposal 117 and accepting proposal 111 finally
Proposal 117 from node 2 will be considered finally
Nodes 2, 3, and 5 will wait for some time and one of them will become a proposer
In my opinion it is none of the above, but I still need some input to confirm whether I am missing anything in my analysis of the Paxos algorithm.
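For context on how Paxos treats those competing numbers, here is a minimal sketch (in Python, with hypothetical names) of an acceptor's Phase 1 promise rule. Note that classic Paxos assumes non-Byzantine participants, so a truly malicious node is outside its fault model, and a higher proposal number only wins the right to propose; it does not by itself decide the chosen value.

```python
# Minimal sketch of a Paxos acceptor's Phase 1 (prepare/promise) rule.
# All names are illustrative, not from any particular implementation.

class Acceptor:
    def __init__(self):
        self.promised_n = 0        # highest proposal number promised so far
        self.accepted_n = None     # number under which a value was accepted, if any
        self.accepted_value = None

    def on_prepare(self, n):
        """Promise to ignore proposals numbered below n, if n is the highest seen."""
        if n > self.promised_n:
            self.promised_n = n
            # The promise reports any previously accepted (n, value) pair,
            # which forces a later proposer to re-propose that value.
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", self.promised_n, None)

a = Acceptor()
print(a.on_prepare(103))  # promised
print(a.on_prepare(111))  # promised, supersedes 103
print(a.on_prepare(117))  # promised too: a higher number alone doesn't fix the value
```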

Related

Legal Hierarchical Quorums in Zookeeper

I am trying to understand hierarchical quorums in Zookeeper. I may be misreading the example shown in the documentation (here). Are votes [from at least two servers from each of two different groups] enough to form a legal quorum?
In my opinion, the example here does not gain a majority of all the weight; it only gains 4 ballots, whereas a simple majority over all 9 servers would require at least 5 ballots (9/2 + 1, rounded down).
I also read the source code. The implementation is shown from line 352 to line 371. Zookeeper only checks whether each counted group has an in-group majority and whether the number of such groups is larger than half of the total number of groups.
Maybe I have found the answer.
A different construction that uses weights and is useful in wide-area deployments (co-locations) is a hierarchical one. With this construction, we split the servers into disjoint groups and assign weights to processes. To form a quorum, we have to get a hold of enough servers from a majority of groups G, such that for each group g in G, the sum of votes from g is larger than half of the sum of weights in g. Interestingly, this construction enables smaller quorums. If we have, for example, 9 servers, we split them into 3 groups, and assign a weight of 1 to each server, then we are able to form quorums of size 4.
Note that two subsets of processes composed each of a majority of servers from each of a majority of groups necessarily have a non-empty intersection. It is reasonable to expect that a majority of co-locations will have a majority of servers available with high probability.
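To make the quoted rule concrete, here is a rough Python sketch of the hierarchical quorum predicate (the function and data layout are my own illustration, not Zookeeper's actual code): a vote set is a quorum if, in more than half of the groups, the voters hold more than half of that group's weight.

```python
# Sketch of Zookeeper-style hierarchical quorum checking (illustrative only).
# groups: {group_id: {server_id: weight}}; votes: set of server ids that voted.

def is_quorum(groups, votes):
    groups_with_majority = 0
    for members in groups.values():
        total = sum(members.values())
        voted = sum(w for sid, w in members.items() if sid in votes)
        if voted > total / 2:          # majority of weight inside this group
            groups_with_majority += 1
    return groups_with_majority > len(groups) / 2   # majority of groups

# 9 servers in 3 groups, weight 1 each: 4 votes can already form a quorum.
groups = {g: {g * 3 + i: 1 for i in range(3)} for g in range(3)}
print(is_quorum(groups, {0, 1, 3, 4}))  # True: groups 0 and 1 each have 2 of 3
print(is_quorum(groups, {0, 3, 6, 7}))  # False: only group 2 has an in-group majority
```

This also shows why the documentation's example is legal despite holding fewer than 5 of 9 ballots: intersection is guaranteed group-wise, not by a global majority.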

In Paxos, why can't we use random backoff to avoid collision?

I understand that the heart of the Paxos consensus algorithm is that any two "majorities" in a given set of nodes must intersect, so if a proposer gets accepted by a majority, there cannot be another majority that accepts a different value, given that any acceptor can only accept a single value.
So the simplest "happy path" of a consensus algorithm is just for any proposer to ping a majority of acceptors and see if it can get them to accept its value, and if so, we're done.
The collision comes when concurrent proposers lead to a case where no majority of nodes agrees on a value. This can be demonstrated with the simplest case of 3 nodes, where every node tries to get 2 nodes to accept its value but, due to concurrency, every node ends up getting only itself to "accept" the value, and therefore no majority agrees on anything.
The Paxos algorithm goes on to introduce a 2-phase protocol to solve this problem.
But why can't we just back off for a random amount of time and retry, until eventually one proposer succeeds in grabbing a majority? This can be expected to succeed eventually, since every proposer backs off for a random amount of time whenever it fails to grab a majority.
I understand that this is not going to be ideal in terms of performance. But let's set performance aside and look only at correctness. Is there anything I'm missing here? Is this a correct (basic) consensus algorithm at all?
The designer of Paxos is a mathematician first, and he leaves the engineering to others.
As such, Paxos is designed for the general case to prove consensus is always safe, irrespective of any message delays or colliding back-offs.
And now the sad part. The FLP impossibility result is a proof that any system with this guarantee may run into an infinite loop.
Raft is also designed with this guarantee and thus has the same mathematical flaw.
But, the author of Raft also made design choices to specialize Paxos so that an engineer could read the description and make a well-functioning system.
One of these design choices is the well-used trick of exponential random backoff to get around the FLP result in a practical way. This trick does not take away the mathematical possibility of an infinite loop, but does make its likelihood extremely, ridiculously, very small.
You can tack on this trick to Paxos itself, and get the same benefit (and as a professional Paxos maintainer, believe me we do), but then it is not Pure Paxos.
Just to reiterate, the Paxos protocol was designed to be in its most basic form SO THAT mathematicians could prove broad statements about consensus. Any practical implementation details are left to the engineers.
Here is a case where a liveness issue in Raft caused a 6-hour outage: https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/.
Note 1: Yes, I said that the Raft author specialized Paxos. Raft can be mapped onto the more general Vertical Paxos model, which in turn can be mapped onto the Paxos model. As can any system that implements consensus.
Note 2: I have worked with Lamport a few times. He is well aware of these engineering tricks, and he assumes everyone else is, too. Thus he focuses on the math of the problem in his papers, and not the engineering.
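To make the backoff trick from this answer concrete, here is a hedged sketch of exponential backoff with jitter wrapped around one protocol round; `try_propose` is a hypothetical callable standing in for a full prepare/accept attempt.

```python
import random
import time

# Illustrative exponential backoff with jitter around a hypothetical
# try_propose() that returns True once a round of the protocol succeeds.
def propose_with_backoff(try_propose, base=0.01, cap=1.0):
    delay = base
    while not try_propose():
        time.sleep(random.uniform(0, delay))  # random jitter de-synchronizes rivals
        delay = min(delay * 2, cap)           # exponential growth, bounded by cap
    # FLP still permits an infinite run of collisions, but the probability
    # of k consecutive collisions shrinks geometrically with k.
```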
The logic you are describing is how leader election is implemented in Raft:
when there is no leader (or the leader goes offline) every node will have a random delay
after the random delay, the node will contact every other node and propose "let me be the leader"
if the node gets the majority of votes, then the node considers itself the leader: which is equivalent to saying "the cluster got the consensus on who is the leader"
if the node did not get the majority, then after a timeout and a random delay, the node will attempt again
Raft also has a concept of a term but, at a high level, the randomized waits are the feature which helps to reach consensus faster.
Answering your question "why can't we...": we can, it will just be a different protocol.
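A rough sketch of that election loop, ignoring terms and log-recency checks (`request_vote` is a hypothetical RPC, not the real Raft message):

```python
import random
import time

# Simplified sketch of Raft-style leader election with randomized timeouts.
# request_vote(peer) is a hypothetical RPC returning True if the peer votes for us.
def run_for_leader(node_id, peers, request_vote):
    while True:
        time.sleep(random.uniform(0.15, 0.30))           # randomized election timeout
        votes = 1 + sum(request_vote(p) for p in peers)  # vote for self, then ask peers
        if votes > (len(peers) + 1) // 2:                # strict majority of the cluster
            return node_id       # "the cluster got the consensus on who is the leader"
        # Lost or split vote: loop, re-randomize the delay, and try again.
```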

In Paxos, what happens if a proposer is down after its proposal is rejected?

In this figure, the proposal of X is rejected.
At the end of the timeline, S1 and S2 accept X while S3, S4 and S5 accept Y. Proposer X is now supposed to re-send the proposal with value Y.
But what happens if proposer X goes down at that time? How do S1 and S2 eventually learn the value Y?
Thanks in advance!
It is a little hard to answer this from the fragment of a diagram that you've shared since it is not clear what exactly it means. It would be helpful if you could link to the source of that diagram so we can see more of the context of your question. The rest of this answer is based on a guess as to its meaning.
There are three distinct roles in Paxos, commonly known as proposer, acceptor and learner, and I think it aids understanding to divide things into these three roles. The diagram you've shared looks like it is illustrating a set of five acceptors and the messages that they have sent as part of the basic Synod algorithm (a.k.a. single-instance Paxos). In general there's no relationship between the sets of learners and acceptors in a system: there might be a single learner, or there might be thousands, and I think it helps to separate these concepts out. Since S1 and S2 are acceptors, not learners, it doesn't make sense to ask about them learning a value. It is, however, valid to ask about how to deal with a learner that didn't learn a value.
In practical systems there is usually also another role of leader which takes responsibility for pushing the system forward using timeouts and retries and fault detectors and so on, to ensure that all learners eventually learn the chosen value or die trying, but this is outside the scope of the basic algorithm that seems to be illustrated here. In other words, this algorithm guarantees safety ("nothing bad happens") but does not guarantee liveness ("something good happens"). It is acceptable here if some of the learners never learn the chosen value.
The leader can do various things to ensure that all learners eventually learn the chosen value. One of the simplest strategies is to get the learned value from any learner and broadcast it to the other learners, which is efficient and works as long as there is at least one running learner that's successfully learned the chosen value. If there is no such learner, the leader can trigger another round of the algorithm, which will normally result in the chosen value being learned. If it doesn't then its only option is to retry, and keep retrying until eventually one of these rounds succeeds.
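A minimal sketch of that catch-up strategy under the stated assumptions (learner objects with a `learned_value` field and the `run_paxos_round` trigger are hypothetical):

```python
# Sketch of the leader's catch-up strategy described above (names hypothetical).
def ensure_all_learn(learners, run_paxos_round):
    while True:
        known = [l.learned_value for l in learners if l.learned_value is not None]
        if known:
            for l in learners:
                l.learned_value = known[0]  # broadcast the chosen value to everyone
            return known[0]
        run_paxos_round()  # no learner knows the value: run another round and retry
```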
In this figure, the proposal of X is rejected.
My reading of the diagram is that it is an "accept request" that is rejected. Page 5, paragraph 1 of Paxos Made Simple describes this message type.
Proposer X is now supposed to re-send the proposal with value Y.
The diagram does not indicate that. Only if Y was seen in response to the blue initial propose messages would the blue proposer have to choose Y. Yet the blue proposer chose X as the value in its "accept request". If it is properly following Paxos it could not have "seen Y" in response to its initial propose message; if it had seen it then it must have chosen it, and so it wouldn't have sent X.
In order to really know what is happening you would need to know what responses were seen by each proposer. We cannot see from the diagram what values, if any, were returned in response to the first three blue propose messages. We don't see in the diagram whether X was previously accepted at any node. We don't know if the blue proposer was "free to choose" its own X or had to use an X that was already accepted at one or more nodes.
But what happens if proposer X goes down at that time?
If the blue proposer dies then this is not a problem. The green proposer has successfully fixed the value Y at a majority of the nodes.
How do S1 and S2 eventually learn the value Y?
The more interesting scenario is what happens if the green proposer dies. The green proposer may have sent its accept request messages containing Y and immediately died. As three of the messages are successful, the value Y has been fixed, yet the original proposer may not be alive to see the accept response messages. For any further progress to be made, a new proposer needs to send a new propose message. As three of the nodes will reply with Y, the new proposer will choose Y as the value of its accept request message. This will be sent to all nodes and, if all messages get through and no other proposer interrupts, then S1 and S2 will become consistent.
The essence of the algorithm is collaboration. If a proposer dies, the next proposer will collaborate and choose the value accepted under the highest proposal number, if any exists.
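That collaboration rule is small enough to sketch directly; the `Promise` shape here is illustrative, but the selection rule (adopt the value accepted under the highest proposal number, else propose your own) is the core of Phase 1:

```python
from collections import namedtuple

# A Phase 1 promise reply may carry a previously accepted (number, value) pair.
Promise = namedtuple("Promise", ["accepted_n", "accepted_value"])

def choose_value(promises, my_value):
    """Pick the Phase 2 value from a quorum of Phase 1 promise replies."""
    accepted = [p for p in promises if p.accepted_n is not None]
    if not accepted:
        return my_value  # free to choose our own value
    return max(accepted, key=lambda p: p.accepted_n).accepted_value

# Three of five acceptors already accepted Y under proposal number 7:
replies = [Promise(7, "Y"), Promise(7, "Y"), Promise(None, None)]
print(choose_value(replies, "X"))  # -> "Y": the new proposer must carry Y forward
```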

Consensus algorithm: what will happen if an odd cluster becomes even because of a node failure?

A consensus algorithm (e.g. Raft) requires that the cluster contain an odd number of nodes to avoid the split-brain problem.
Say I have a cluster of 5 nodes; what will happen if only one node fails? The cluster now has 4 nodes, which breaks the odd-number rule; will the cluster continue to behave correctly?
One solution is to drop one more node so that the cluster contains only 3 nodes, but what if the previously failed node comes back? Then the cluster has 4 nodes again, and we have to bring the dropped node back in order to keep the cluster odd.
Do implementations of consensus algorithms handle this problem automatically, or do I have to handle it in my application code (for example, by dropping a node)?
Yes, the cluster will continue to work normally. A cluster of N nodes, where N is odd (N = 2k + 1), can handle k node failures. As long as a majority of nodes is alive, it works normally. If one node fails and we still have a majority, everything is fine. Only when you lose a majority of nodes do you have a problem.
There is no reason to force the cluster to have an odd number of nodes, and implementations don't treat this as a problem, so they don't handle it (by dropping nodes).
You can run a consensus algorithm on an even number of nodes, but it usually makes more sense to have it odd.
A 3-node cluster can handle 1 node failure (the majority is 2 nodes).
A 4-node cluster can handle 1 node failure (the majority is 3 nodes).
A 5-node cluster can handle 2 node failures (the majority is 3 nodes).
A 6-node cluster can handle 2 node failures (the majority is 4 nodes).
I hope this makes it clearer why it makes more sense for the cluster size to be an odd number: it can handle the same number of node failures with fewer nodes in the cluster.
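The arithmetic behind that list, as a quick sketch:

```python
# Majority size and fault tolerance for a cluster of n nodes.
def majority(n):
    return n // 2 + 1       # smallest strict majority

def tolerated_failures(n):
    return n - majority(n)  # nodes that can fail while a majority survives

for n in range(3, 7):
    print(f"{n}-node cluster: majority={majority(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
# 3 -> majority 2, tolerates 1;  4 -> majority 3, tolerates 1
# 5 -> majority 3, tolerates 2;  6 -> majority 4, tolerates 2
```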

Hierarchical quorums in Zookeeper

I am trying to understand hierarchical quorums in Zookeeper. The documentation here gives an example but I am still not quite sure I understand it. My question is: if I have a two-node Zookeeper cluster (I know it is not recommended, but let's consider it for the sake of this example),
server.1 and
server.2,
can I have hierarchical quorums as follows:
group.1=1:2
weight.1=2
weight.2=2
With the above configuration:
Even if one node goes down, do I still have enough votes to maintain a quorum? Is this a correct statement?
What is the Zookeeper quorum value here (2, for two nodes, or 3, for 4 votes)?
In a second example, say I have:
group.1=1:2
weight.1=2
weight.2=1
In this case, if server.2 goes down, should I still have sufficient votes (2) to maintain a quorum?
As far as I understand from the documentation, when we assign weights to nodes the quorum requirement is no longer a simple count of nodes: the documentation's rule is that the sum of votes from a group must be strictly larger than half of the sum of weights in that group. Hence:
In the first example you do not keep a quorum if one node goes down. Both nodes have an equal weight of 2, so a single surviving node holds only 50 percent of the total weight, which is not strictly more than half; the quorum is not achieved.
Since the total weight is 4, a quorum needs strictly more than 2, i.e., at least 3; as each node contributes only 2, both nodes need to be active to meet the quorum.
In the second example the total weight is 3, so a quorum needs strictly more than 1.5. If server.2 (weight 1) goes down, server.1 alone contributes 2, which is enough; the quorum is reached with the single node that has weight 2.
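Applying the majority-of-weight rule to both configurations, as a quick sketch (the helper is illustrative, not Zookeeper's API):

```python
# Weighted quorum check for a single group (illustrative).
def has_quorum(weights, alive):
    total = sum(weights.values())
    voting = sum(w for sid, w in weights.items() if sid in alive)
    return voting > total / 2  # strictly more than half the total weight

# First example: weight.1=2, weight.2=2 (total 4, need more than 2)
print(has_quorum({1: 2, 2: 2}, {1}))  # False: 2 is not > 2, both nodes are needed
# Second example: weight.1=2, weight.2=1 (total 3, need more than 1.5)
print(has_quorum({1: 2, 2: 1}, {1}))  # True: server.1 alone suffices
```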