Why does the Raft protocol reject RequestVote with a lower term?

In Raft, every node rejects any request whose term number is lower than its own. But why do we need this for the RequestVote RPC? If the Leader Completeness Property holds, then a node could safely vote for this candidate, right? So why reject the request? I can't come up with an example where, without this additional check on receiving RequestVote, we lose consistency, safety, etc.
Can someone help, please?

A candidate with a lower term in its RequestVote RPC may have a log that is not up to date with the other nodes, because a leader could have been elected in a higher term and already replicated an entry to a majority of servers.
Even if this candidate were elected leader, the rules of Raft would prevent it from doing anything, for safety reasons: its RPCs would be rejected by the other servers due to its lower term.
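The term check itself is only a few lines. Here is a minimal, hypothetical sketch (all names are invented for illustration, not taken from any real implementation) of how a node might apply it when handling RequestVote:

```python
# Toy sketch of the term check a Raft node applies to every RPC,
# including RequestVote. Node/RequestVote are illustrative names.
from dataclasses import dataclass

@dataclass
class RequestVote:
    term: int            # candidate's term
    candidate_id: int
    last_log_index: int
    last_log_term: int

class Node:
    def __init__(self, current_term: int):
        self.current_term = current_term
        self.voted_for = None

    def handle_request_vote(self, req: RequestVote) -> tuple[int, bool]:
        # Reply false if term < currentTerm (§5.1): the candidate is acting
        # on stale state, so return the newer term instead of a vote.
        if req.term < self.current_term:
            return self.current_term, False
        if req.term > self.current_term:
            self.current_term = req.term
            self.voted_for = None
        # (The §5.4.1 log up-to-dateness check is omitted for brevity.)
        if self.voted_for in (None, req.candidate_id):
            self.voted_for = req.candidate_id
            return self.current_term, True
        return self.current_term, False
```

The reply carrying `current_term` is what lets the stale candidate step back up to the cluster's real term before trying again.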

I think there are two reasons:
The term in Raft is monotonically increasing. Any request from a previous term was made based on stale state; it will simply be rejected, with the reply carrying the current term and leader information.
Any elected leader must have all entries committed before the election. A candidate from a previous term is unlikely to have all committed entries, since it cannot have any entries committed in the current term.

Related

When does the Raft consensus algorithm apply the commit log to the leader and followers?

In Raft, I understand that a leader receives a request and federates it out to its peer list to commit to their respective logs.
My question is, is there a distinction between committing the action to the log and actually applying the action? If the answer is yes, then at what point does the action get applied?
My understanding is that once the leader hears from the majority - "hey, I wrote this to my log" - the leader applies the change, then federates an "Apply" command to the peers that wrote the change to their respective logs, and then the ack is sent to the client.
I would say there is a distinction between committing an entry and applying it to the state machine. Once an entry is committed (i.e. the commitIndex is >= the entry index) it can be applied at any time. In practice, you want the leader to apply committed entries as soon as possible to reduce latency, so entries will usually be applied to an in-memory state machine immediately.
In the case of in-memory state machines the distinction is not very obvious. But it’s the other use cases for Raft that do necessitate this distinction. For example, the distinction becomes particularly important with persistent state machines. If the state machine is persisting changes to e.g. an underlying database, it’s critical that each entry only be applied to the state machine once so that the underlying store does not go back in time when the node replays entries to the state machine when recovering from a failure. To make persistent state machines idempotent, the state machine on each node needs to persist the entries that have been applied on that node as part of the persistent state. In this case, the process of applying entries is indeed a distinction, and a critical one.
State machine replication is also only one use case for Raft. There are others as well. It’s perfectly valid, for example, to use the protocol for simple log replication, in which case entries wouldn’t be applied at all.
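The commit/apply distinction and the idempotency point above can be sketched in a few lines. This is an illustrative toy (all names invented): entries up to `commitIndex` are safe to apply, and a persisted `lastApplied` marker makes replay after a restart a no-op:

```python
# Toy sketch: applying committed entries exactly once. In a real persistent
# state machine, last_applied would be persisted atomically with the data.
class StateMachine:
    def __init__(self):
        self.kv = {}
        self.last_applied = 0   # index of the last entry applied

    def apply(self, index: int, command: tuple):
        if index <= self.last_applied:
            return              # already applied; skip on replay
        key, value = command
        self.kv[key] = value
        self.last_applied = index

def apply_committed(sm: StateMachine, log: list, commit_index: int):
    # Apply every committed-but-unapplied entry, in log order.
    # The log is 1-indexed, as in the Raft paper.
    for i in range(sm.last_applied + 1, commit_index + 1):
        sm.apply(i, log[i - 1])

log = [("x", 1), ("y", 2), ("x", 3)]
sm = StateMachine()
apply_committed(sm, log, commit_index=2)   # only entries 1-2 are committed
# Replaying the same prefix (e.g. after a crash) changes nothing:
apply_committed(sm, log, commit_index=2)
```

Without the `last_applied` guard, replaying the log on recovery would re-execute side effects against the underlying store, which is exactly the "going back in time" problem described above.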

Crash Fault Tolerance via Heartbeat

I get the concept of Crash Fault Tolerance (CFT) in theory. CFT is used to guarantee that the system keeps running even if the leader server crashes.
I need to implement a distributed system (a chat application) and also need to implement crash fault tolerance. For this I have to use a so-called "heartbeat" to check whether the leader server is still "living".
My question is if someone could show me a good code example to implement such a heartbeat?
Whether a heartbeat mechanism is applicable depends on the size of the cluster and the typical use case / deployment scenario you have in hand.
Many consensus-based algorithms rely on heartbeats to decide on the status of the leader server.
The Raft algorithm is a good reference: heartbeats are sent from the leader server to the followers, and you can also use its leader-election mechanism in case the leader crashes.
For large clusters, a heartbeat mechanism alone might not scale, so failure detectors combined with gossip-based protocols for propagation are preferred.
A few references:
https://raft.github.io/ ,
https://github.com/topics/gossip-protocol
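There is no single canonical heartbeat implementation, but a minimal sketch of the idea might look like this. It uses threads and an in-process "network" (a shared timestamp); the names, intervals, and structure are all invented for the example, and a real chat system would send the heartbeat over sockets or RPC instead:

```python
# Toy heartbeat: a leader thread periodically "pings" a follower, and the
# follower declares the leader dead if no ping arrives within a timeout.
import threading
import time

HEARTBEAT_INTERVAL = 0.05   # leader sends every 50 ms
ELECTION_TIMEOUT = 0.2      # follower suspects the leader after 200 ms

class Follower:
    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.leader_alive = True

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def check_leader(self):
        # Called periodically; in Raft, a failed check would trigger
        # an election rather than just setting a flag.
        if time.monotonic() - self.last_heartbeat > ELECTION_TIMEOUT:
            self.leader_alive = False

def leader_loop(follower: Follower, stop: threading.Event):
    while not stop.is_set():
        follower.on_heartbeat()     # stands in for an AppendEntries RPC
        time.sleep(HEARTBEAT_INTERVAL)

follower = Follower()
stop = threading.Event()
t = threading.Thread(target=leader_loop, args=(follower, stop))
t.start()
time.sleep(0.3)
follower.check_leader()
alive_while_running = follower.leader_alive     # heartbeats still arriving
stop.set()
t.join()                                        # "crash" the leader
time.sleep(0.3)
follower.check_leader()
alive_after_crash = follower.leader_alive       # timeout expired
```

Note that the election timeout must be comfortably larger than the heartbeat interval, with some randomisation added per follower in Raft so that followers don't all start elections at once.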

How do replicas coming back online in PAXOS or RAFT catch up?

In consensus algorithms such as Paxos and Raft, a value is proposed, and if a quorum agrees, it's written durably to the data store. What happens to the participants that were unavailable at the time of the quorum? How do they eventually catch up? This seems to be left as an exercise for the reader wherever I look.
Take a look at the Raft protocol. It's simply built into the algorithm. The leader tracks, for each follower, the highest replicated index (matchIndex) and the next index to send (nextIndex), and always sends entries to each follower starting at that follower's nextIndex. So there is no special case needed for catching up a follower that was missing when an entry was committed: when the follower restarts, the leader simply resumes sending it entries from that follower's nextIndex until their logs match. Thus the node is caught up.
This is mentioned in Paxos Made Simple:
Because of message loss, a value could be chosen with no learner ever finding out. The learner could ask the acceptors what proposals they have accepted, but failure of an acceptor could make it impossible to know whether or not a majority had accepted a particular proposal. In that case, learners will find out what value is chosen only when a new proposal is chosen. If a learner needs to know whether a value has been chosen, it can have a proposer issue a proposal, using the algorithm described above.
And also in Raft paper:
The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower.
If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any).
If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully.
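The decrement-and-retry loop from the quoted passage can be sketched as follows. This is a toy illustration: the "RPC" is simulated as a local consistency check against the follower's log, each entry is a (term, command) pair, and all names are invented:

```python
# Toy sketch of nextIndex back-off. Logs are 1-indexed as in the paper.
def append_entries(follower_log, leader_log, next_index):
    """Simulated AppendEntries: succeeds only if the follower's log
    matches the leader's at the entry just before next_index."""
    prev = next_index - 1
    if prev > 0 and (len(follower_log) < prev or
                     follower_log[prev - 1] != leader_log[prev - 1]):
        return False                # consistency check failed: reject
    # On success, remove conflicting entries and append the leader's suffix.
    del follower_log[prev:]
    follower_log.extend(leader_log[prev:])
    return True

def catch_up(follower_log, leader_log):
    next_index = len(leader_log) + 1    # optimistic initial value
    while not append_entries(follower_log, leader_log, next_index):
        next_index -= 1                 # rejected: back off and retry
    return follower_log

leader = [(1, "a"), (1, "b"), (2, "c"), (3, "d")]
stale = [(1, "a"), (2, "x")]            # diverged at index 2
catch_up(stale, leader)                 # stale now matches the leader
```

Real implementations often speed this up by having the rejecting follower hint at the conflicting term's first index, so the leader can skip back more than one entry per round trip.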
With the original Paxos papers, it is indeed left as an exercise for the reader. In practice, with Paxos you can send additional messages such as negative acknowledgements to propagate more information around the cluster as a performance optimisation. That can be used to let a node know that it is behind due to lost messages. Once a node knows that it is behind it needs to catch up which can be done with additional message types. That is described as Retransmission in the Trex multi-paxos engine that I wrote to demystify Paxos.
The Google Chubby Paxos paper, Paxos Made Live, criticises Paxos for leaving a lot up to the people doing the implementation. Lamport trained as a mathematician and was attempting to mathematically prove that you couldn't have consensus over lossy networks when he found the solution. The original papers very much supply a proof that it is possible, rather than explaining how to build practical systems with it. Modern papers usually describe an application of some new techniques backed up by experimental results; they also supply a formal proof, but IMHO most people skip over it and take it on trust. The unapproachable way that Paxos was introduced means that many people quote the original papers but fail to see that they describe leader election and multi-Paxos. Unfortunately, Paxos is still taught in a theoretical manner rather than in a modern style, which leads people to think that it is hard and to miss the essence of it.
I argue that Paxos is simple, but that reasoning about failures in a distributed system, and testing to find any bugs, is hard. Everything that is left to the reader in the original papers doesn't affect correctness, but it does affect latency, throughput and the complexity of the code. Once you understand what makes Paxos correct (it is mechanically simple), it becomes straightforward to write the rest of what is needed in a way that doesn't violate consistency when you optimise the code for your use case and workload.
For example, Corfu and CURP give blisteringly high performance: one uses Paxos only for metadata, while the other only needs to run Paxos when there are concurrent writes to the same keys. Those solutions don't directly compete with Raft or multi-Paxos, as they target specific high-performance scenarios (e.g. key-value stores). Yet they demonstrate that, for practical applications, there is a huge amount of optimisation you can do, if your particular workload allows it, while still using Paxos for some part of the overall solution.

How can a node with a complete log be elected if another becomes a candidate first?

I've been watching the Raft Algorithm video at https://youtu.be/vYp4LYbnnW8?t=3244, but am not clear about one circumstance.
In the leader election for term 4, if node s1 broadcasts RequestVote before s3 does, then nodes s2, s4 and s5 would vote for it, while s3 wouldn't. If node s3 then broadcasts RequestVote to the others, how can it get their votes?
One possible way to handle the situation that I can figure out is:
1, if node s1 receives the rejection from s3, and finds out that s3's log is newer than its own, it does not set itself as leader even though it has received a majority of votes
2, as for the other nodes, they remember which leader they voted for; if a new vote request comes in (with a bigger <lastTerm, lastIndex>), they vote for the node with the bigger <lastTerm, lastIndex>
In both scenarios, node s3 eventually gets all the others' votes and sets itself as leader. I'm not sure if my guess is correct.
(Before I comment, be aware that there is NO possible way for entry #9 to be committed. There is no indication of which log entries are committed, but this discussion works with any of #s 1-8 as being committed.)
In short, s3 does not become the leader, s1 does because it gets a majority of the votes. If your concern is that entry #9 will be lost, that is true, but it wasn't committed anyway.
From §5.3:
In Raft, the leader handles inconsistencies by forcing the followers' logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader's log.
To comment on your handling of the situation:
1, if node s1 receives the rejection from s3, and found out s3's log is newer than itself, and do not set itself as leader even though it receives majority of votes
It could do this, but it would make failover take longer, because s3 would have to try again with a different timeout, and you could run into a situation where s1 always broadcasts RequestVote before s3 does. But again, it is always safe to delete the excess entries that s3 has.
The last paragraph of §5.3 talks about how this easy, timeout-based election process was used instead of ranking the nodes and selecting the best. I agree with the outcome. Simpler protocols are more robust.
2, As to other nodes, they remember the leader information they voted, if a new vote request comes (with bigger <lastTerm, lastIndex>), they vote for the node with bigger <lastTerm, lastIndex>.
This is strictly forbidden because it destroys leader election. That is, if you have this in place you will very often elect multiple leaders. This is bad. I cannot stress enough how bad this is. Bad, bad, bad.
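For reference, the check that voters already apply in this scenario is the §5.4.1 "at least as up-to-date" comparison. It is why s3 refuses its vote to s1 while s2, s4 and s5 (whose logs s1's log is at least as up-to-date as) still grant theirs. A minimal sketch, with invented parameter names:

```python
# §5.4.1 up-to-dateness check, as applied by a voter: compare the
# candidate's last log entry against the voter's own last entry.
def candidate_up_to_date(cand_last_term, cand_last_index,
                         my_last_term, my_last_index):
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term   # higher last term wins
    return cand_last_index >= my_last_index    # same term: longer log wins
```

Note that this check is about the candidate's *log*, not its current term; combined with one-vote-per-term, it guarantees that any elected leader holds all committed entries, without needing the extra machinery from your two proposals.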

Why is Chubby lockserver not multi-master?

As I understand it, in Chubby at any given time there are 5 servers. One is the master and handles coordination of writes to the quorum; the other 4 servers are read-only and forward writes to the master. Writes use Paxos to maintain consistency.
Can someone explain to me why there is a distinction between the master and the 4 replicas? Why isn't Chubby multi-master? This question could also apply to Zookeeper.
Having a single master is more efficient because nodes don't have to deal with as much contention.
Both Chubby and Zookeeper implement a distributed state-machine where the point of the system is to decide a total ordering on transitions from one state to the next. It can take a lot of messages (theoretically infinite messages) to resolve a contention when multiple nodes are proposing a transition at the same time.
Paxos (and thus Chubby) uses an optimization called a "distinguished leader", where the replicas forward writes to the distinguished leader to reduce contention. (I am assuming Chubby replicas do this. If not, they could, but the designers merely push that responsibility to the client.) Zookeeper does this too.
Both Chubby and Zookeeper actually do handle multiple leaders because they must deal with a leader that doesn't know it has died and then comes back from the dead. For Chubby, this is the whole point of using Paxos: eventually one of the leaders will win. (Well theoretically it may not, but we engineers do practical things like randomized backoff to make that probability tolerably small.) Zookeeper, on the other hand, assigns a non-decreasing id to each leader; and any non-current leader's transitions are rejected. This means that when a leader dies, Zookeeper has to pause and go through a reconfiguration phase before accepting new transitions.
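The Zookeeper-style fencing described above can be sketched as follows. This is an in-process toy with invented names, not Zookeeper's actual protocol (ZAB), but it shows the core idea: transitions are stamped with the leader's epoch, and replicas reject anything stamped with an older one:

```python
# Toy epoch fencing: a replica tracks the highest leader epoch it has
# seen and rejects transitions from any older ("back from the dead") leader.
class Replica:
    def __init__(self):
        self.current_epoch = 0
        self.state = []

    def accept_leader(self, epoch: int) -> bool:
        if epoch < self.current_epoch:
            return False            # stale leader: reject
        self.current_epoch = epoch
        return True

    def apply_transition(self, epoch: int, op) -> bool:
        if epoch < self.current_epoch:
            return False            # fenced: a newer leader took over
        self.state.append(op)
        return True

r = Replica()
r.accept_leader(1)
r.apply_transition(1, "put x=1")            # accepted under epoch 1
r.accept_leader(2)                          # new leader elected
old_ok = r.apply_transition(1, "put x=2")   # zombie leader: rejected
```

Because the epoch is non-decreasing, a dead-but-returned leader can never get another transition accepted; at worst it wastes one round trip discovering that it has been fenced off.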