What happens to uncommitted previous term log entries in Raft algorithm? - consensus

There are a number of questions here on StackOverflow around Figure 8, discussed in section 5.4.2 in the original Raft paper:
Figure 8
What neither the paper nor any of the answers makes clear is the exact fate of that problematic entry (2, 3). My question is two-fold:
What exactly happens to entry at index 2 during term 3 (2, 3), made by S5? The figure mentions that S5 will not become a leader because majority will reject its RequestVotes. Does that mean that upon receiving AppendEntries RPC, S5 will then overwrite its entry (2, 3) with (2, 2) and (3, 4) as per current leader in (e)?
If S5 is forced to overwrite this entry, and it is never committed, what response should the client that has sent (1, 3) receive? Do the clients receive acknowledgements for uncommitted entries as if they were already applied to a state-machine?

The figure mentions that S5 will not become a leader because majority
will reject its RequestVotes
As in (e) in the Raft paper, S5 will not become a leader because its log is not at least as up-to-date as the logs of the majority (S1, S2, S3).
Does that mean that upon receiving AppendEntries RPC, S5 will then
overwrite its entry (2, 3) with (2, 2) and (3, 4) as per current
leader in (e)?
Yes, the logs of S5 will be overwritten by the logs of the current leader. Quoted from raft paper:
If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any).
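The overwrite described in this quote can be sketched as follows. This is a minimal illustration, not code from the paper: the names `Entry`, `append_entries`, and `replicate` are hypothetical, indices are 1-based as in the paper, and the logs model only the terms shown in Figure 8(e), where S1 holds terms 1, 2, 4 and S5 holds terms 1, 3.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    term: int
    command: str


def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Follower side of AppendEntries (1-based indices, 0 = before the log)."""
    # Consistency check: the follower must hold prev_index with prev_term.
    if prev_index > len(log):
        return False
    if prev_index > 0 and log[prev_index - 1].term != prev_term:
        return False
    for offset, entry in enumerate(entries):
        idx = prev_index + 1 + offset
        if idx <= len(log):
            if log[idx - 1].term != entry.term:
                # Conflict: drop this entry and everything after it.
                del log[idx - 1:]
                log.append(entry)
            # Same index and term: entry already present, nothing to do.
        else:
            log.append(entry)
    return True


def replicate(leader_log: List[Entry], follower_log: List[Entry]) -> int:
    """Leader side: decrement next_index until the follower accepts.

    Returns the number of AppendEntries RPCs that were needed.
    """
    next_index = len(leader_log) + 1
    rpcs = 0
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1].term if prev_index > 0 else 0
        rpcs += 1
        if append_entries(follower_log, prev_index, prev_term,
                          leader_log[prev_index:]):
            return rpcs
        next_index -= 1


s1 = [Entry(1, "a"), Entry(2, "b"), Entry(4, "c")]  # leader in (e)
s5 = [Entry(1, "a"), Entry(3, "x")]                 # holds the (2, 3) entry
rpc_count = replicate(s1, s5)
```

In this sketch the first two attempts are rejected and the third one replaces S5's term-3 entry with the leader's entries from terms 2 and 4.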
Do the clients receive acknowledgements for uncommitted entries as if
they were already applied to a state-machine?
No, clients receive acknowledgements only for committed entries, that is, once the entry has been safely replicated. See this quote from the Raft paper:
When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that
execution to the client.
There is also the case where the leader has replicated the log entry but crashes before responding to the client, or the response is lost on the network. The client then has to retry, which can cause the command to be executed multiple times. Quoted from the Raft paper:
However, as described so far Raft can execute a command multiple times: for example, if the leader crashes after committing the log entry but before responding to the client, the client will retry the command with a new leader, causing it to be executed a second time. The solution is for clients to assign unique serial numbers to every command. Then, the state machine tracks the latest
serial number processed for each client, along with the associated response. If it receives a command whose serial number has already been executed, it responds immediately without re-executing the request
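The serial-number scheme in this quote might look like the sketch below. The class and method names are hypothetical, the "state machine" is just a counter, and the sketch assumes each client has at most one outstanding command, so only the latest serial number can be retried.

```python
from typing import Dict, Tuple


class DedupStateMachine:
    """Toy state machine (a counter) that executes each command once."""

    def __init__(self) -> None:
        self.counter = 0
        # client_id -> (latest serial number processed, its response)
        self.sessions: Dict[str, Tuple[int, int]] = {}

    def apply(self, client_id: str, serial: int, increment: int) -> int:
        last = self.sessions.get(client_id)
        if last is not None:
            last_serial, last_response = last
            if serial == last_serial:
                # Retry of an already executed command: answer from the cache.
                return last_response
            if serial < last_serial:
                # Assumption of this sketch: older responses are not kept.
                raise ValueError("stale serial number")
        self.counter += increment
        self.sessions[client_id] = (serial, self.counter)
        return self.counter


sm = DedupStateMachine()
first = sm.apply("client-1", 1, 5)
retry = sm.apply("client-1", 1, 5)   # same serial, e.g. after a leader crash
second = sm.apply("client-1", 2, 1)
```

The retry returns the cached response and leaves the counter untouched, so the command takes effect only once.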


How does raft preserve safety when a leader commits a log entry and crashes before informing followers of this commitment?

In my understanding, a leader sends AppendEntries RPC to the followers, and if majority of followers return success, the leader will commit this entry. It will commit this entry by applying it to its own state machine, and it will also return to the client to let the client know that the command is successful.
However, at this time, this commitment is not known to the followers yet. It will inform the followers in the next AppendEntries (or heartbeat) RPC calls.
In the simplest case, if the leader crashes after the commitment and before the next AppendEntries, raft will use the "only most up to date follower can win" strategy to ensure that the next leader must contain this log entry (although not committed), and the new leader will commit this entry and send AppendEntries to other followers. In this way, the log entry is safely kept.
However, consider the following complicated scenario (extracted from the PhD thesis "CONSENSUS: BRIDGING THEORY AND PRACTICE", page 23).
At this point, the log entry from term 2 has been replicated on a
majority of the servers, but it is not committed. If S1 crashes as in
(d1), S5 could be elected leader (with votes from S2, S3, and S4) and
overwrite the entry with its own entry from term 3.
What if, at this point, the entry is committed on server S1 but not yet committed on the other servers? If S1 then crashes as in (d1), will this log entry be overwritten by S5?
In my understanding, a committed entry (applied to state machine and possibly informed the client about the result) shall never be overwritten?
Did I misunderstand anything of the raft protocol?
There are more conditions in Raft to commit an entry.
On page 4 of this paper (The 1-page summary of raft) it says
If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm: set commitIndex = N (§5.3, §5.4).
In other words, not only does the entry have to be replicated to a majority, its term has to be from the current term. This is why in a practical system a new leader will propose a no-op so that it can advance this commitIndex.
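The commit rule quoted above can be sketched like this. The function name and argument layout are hypothetical: `log_terms[i - 1]` is the term of the entry at 1-based index `i`, and `match_index` covers the followers only, with the leader counting itself.

```python
from typing import Dict, List


def advance_commit_index(log_terms: List[int], match_index: Dict[str, int],
                         commit_index: int, current_term: int) -> int:
    """Return the new commitIndex for a leader, following the quoted rule."""
    cluster_size = len(match_index) + 1  # followers plus the leader
    for n in range(len(log_terms), commit_index, -1):
        replicas = 1 + sum(1 for m in match_index.values() if m >= n)
        on_majority = replicas * 2 > cluster_size
        from_current_term = log_terms[n - 1] == current_term
        if on_majority and from_current_term:
            return n
    return commit_index


leader_log = [1, 2, 4]  # S1's terms in the scenario, leader in term 4

# Term-2 entry is on a majority, but no term-4 entry is: nothing commits.
before = advance_commit_index(
    leader_log, {"S2": 2, "S3": 2, "S4": 1, "S5": 1},
    commit_index=1, current_term=4)

# Once the term-4 entry reaches a majority, everything up to it commits.
after = advance_commit_index(
    leader_log, {"S2": 3, "S3": 3, "S4": 1, "S5": 1},
    commit_index=1, current_term=4)
```

Here `before` stays at 1 even though index 2 is stored on three of five servers, while `after` jumps straight to 3, which commits the term-2 entry indirectly.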
So now we know the leader won't commit until then, but what if it commits and doesn't get to tell the followers?
Later in section 5.4.1 of the same paper it says (emphasis mine):
Raft...guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election....Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster
in order to be elected, which means that every committed entry must be present in at least one of those servers.
In short, the new leader by definition must have the entries that the old leader considered committed.
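The voting restriction behind this guarantee is the "at least as up-to-date" comparison from section 5.4.1: compare the terms of the last entries first, then the log lengths. A minimal sketch, with a hypothetical function name and the last entries taken from the (e) scenario of the figure discussed above:

```python
from typing import Dict, Tuple


def candidate_is_up_to_date(cand_last_term: int, cand_last_index: int,
                            voter_last_term: int, voter_last_index: int) -> bool:
    """True if the candidate's log is at least as up-to-date as the voter's."""
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index


# (last term, last index) of each server's log.
last_entry: Dict[str, Tuple[int, int]] = {
    "S1": (4, 3), "S2": (4, 3), "S3": (4, 3), "S4": (1, 1), "S5": (3, 2),
}

cand_term, cand_index = last_entry["S5"]
votes_for_s5 = sum(
    1 for term, index in last_entry.values()
    if candidate_is_up_to_date(cand_term, cand_index, term, index)
)
```

S5 gets only its own vote and S4's, two out of five, so it cannot win and cannot overwrite the committed entries.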

How to handle reordered RPC in raft

When implementing the Raft algorithm, I found there is a situation that I think may or may not do harm to the cluster.
It is reasonable to assume that some AppendEntries RPCs from the Leader are received out of order (because of network delay or other reasons). Suppose the Leader sends a heartbeat AppendEntries RPC to peer A with prev_log_index = 1, then sends another AppendEntries RPC carrying entry 2, and then crashes (I force this to happen immediately with a callback in my test). If the two RPCs are handled in the order in which they were sent, entry 2 is inserted successfully. However, if the heartbeat RPC is delayed, peer A first inserts entry 2 and responds to the Leader. When the delayed heartbeat then arrives, peer A erases entry 2, because the entry conflicts with the Leader's prev_log_index = 1. So peer A erases a log entry by mistake.
To dig a little deeper: if the Leader doesn't crash immediately, will it fix this? I think that if peer A responds to the delayed heartbeat correctly, the Leader will find out and fix it up in some later RPC.
However, what if peer A's response for entry 2 leads to commit_index advancing? In that case peer A has voted to advance commit_index to 2 even though it no longer has entry 2, so there may not actually be enough votes for the advance. If the Leader crashes now, a node with a shorter log can be elected Leader. I did encounter this situation during my testing.
My question is:
Is my reasoning correct?
If reordered RPCs are a real problem, how should I solve it? Is indexing and caching all RPCs, and forcing them to be handled one by one, a good solution? I found it hard to implement in gRPC.
Raft assumes an ordered stream protocol such as TCP. That is, if a message arrives out of order then it is buffered until its predecessor arrives. (This behavior is why TCP exists: each individual packet can take a separate route between servers, so there is a high chance of out-of-order messages, and most applications prefer the peace of mind of a strict ordering.)
Other protocols, such as plain old Paxos, can work with out-of-order messages, but are typically much slower than Raft.
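The buffering behaviour described in this answer (hold a message until its predecessor has arrived) can be sketched at the application level as below. This is an illustration only: the `ReorderBuffer` class and the per-sender sequence numbers are assumptions of the sketch, not part of Raft or of gRPC.

```python
from typing import Any, Callable, Dict


class ReorderBuffer:
    """Delivers messages from one sender strictly in sequence-number order."""

    def __init__(self, deliver: Callable[[Any], None], first_seq: int = 0) -> None:
        self._deliver = deliver
        self._next = first_seq
        self._pending: Dict[int, Any] = {}

    def receive(self, seq: int, message: Any) -> None:
        if seq < self._next:
            return  # duplicate of something already delivered
        self._pending[seq] = message
        # Flush every message whose predecessors have all arrived.
        while self._next in self._pending:
            self._deliver(self._pending.pop(self._next))
            self._next += 1


handled = []
buf = ReorderBuffer(handled.append)
buf.receive(1, "AppendEntries(entry 2)")   # arrives first, is held back
held_back = list(handled)
buf.receive(0, "heartbeat(prev_log_index=1)")
```

After the delayed heartbeat arrives, both messages are handled in the order in which they were sent.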

How ZooKeeper guarantees "Single System Image"?

In the Consistency Guarantees section of ZooKeeper Programmer's Guide, it states that ZooKeeper will give "Single System Image" guarantees:
A client will see the same view of the service regardless of the server that it connects to.
According to the ZAB protocol, the leader can commit a transaction only once more than half of the followers have acknowledged the proposal. So it's likely that not all the followers are in the same state.
If the followers are not in the same state, how can ZooKeeper guarantee "Single System Image"?
ZooKeeper’s atomic broadcast protocol: Theory and practice
Single System Image
Leader only waits for responses from a quorum of the followers to acknowledge to commit a transaction. That doesn't mean that some of the followers need not acknowledge the transaction or can "say no".
Eventually, as the rest of the followers process the commit message from the leader or catch up as part of synchronization, they will have the same state as the master (with some delay). (This is not to be confused with eventual consistency.)
How far a follower's state can lag depends on the configuration items syncLimit & tickTime (https://zookeeper.apache.org/doc/current/zookeeperAdmin.html)
A follower can at most be behind by syncLimit * tickTime time units before it gets dropped.
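As a worked example of that bound: tickTime is given in milliseconds and syncLimit in ticks, so the maximum lag is their product. The values below (tickTime=2000, syncLimit=5) are only example settings chosen for illustration, and the function name is hypothetical.

```python
def max_follower_lag_ms(tick_time_ms: int, sync_limit_ticks: int) -> int:
    """Upper bound on how far a follower may fall behind before being dropped."""
    return tick_time_ms * sync_limit_ticks


lag_ms = max_follower_lag_ms(tick_time_ms=2000, sync_limit_ticks=5)
print(f"follower may lag by at most {lag_ms} ms")
```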
The document is a little misleading, I have made a pr.
see https://github.com/apache/zookeeper/pull/931.
In fact, zookeeper client keeps a zxid, so it will not connect to older follower if it has read some data from a newer server.
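The zxid check mentioned here can be pictured with the sketch below. The class and method names are hypothetical and do not mirror the real ZooKeeper client code; the sketch only shows the idea of refusing a server whose state is older than what the client has already seen.

```python
class ClientSession:
    """Tracks the newest zxid the client has observed."""

    def __init__(self) -> None:
        self.last_seen_zxid = 0

    def observe(self, zxid: int) -> None:
        # Called for every response; the view only ever moves forward.
        self.last_seen_zxid = max(self.last_seen_zxid, zxid)

    def may_connect_to(self, server_last_zxid: int) -> bool:
        # A server that is behind the client's view would show stale data.
        return server_last_zxid >= self.last_seen_zxid


session = ClientSession()
session.observe(42)                      # read served by an up-to-date server
stale_ok = session.may_connect_to(40)    # lagging follower
fresh_ok = session.may_connect_to(42)
```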
All reads and writes go to a majority of the nodes before being considered successful, so a read that follows a write cannot miss that write: at least one node in the read's majority knows about it. (Otherwise the two majorities would be disjoint, which would require (n/2+1) + (n/2+1) ≤ n, and that is false.) It doesn't matter if many nodes (at most all but one) have an outdated view of the world, since at least one of them knows it all.
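The overlap argument can be checked by brute force for small clusters. This sketch (the function name is hypothetical) enumerates every pair of majorities of size n/2+1, using integer division, and confirms that they always share at least one node.

```python
from itertools import combinations


def majorities_always_intersect(n: int) -> bool:
    """True if every two majority quorums of an n-node cluster overlap."""
    quorum = n // 2 + 1
    nodes = range(n)
    return all(
        set(a) & set(b)
        for a in combinations(nodes, quorum)
        for b in combinations(nodes, quorum)
    )


results = {n: majorities_always_intersect(n) for n in range(1, 8)}
```

The same fact follows from arithmetic: two quorums together contain 2 * (n // 2 + 1) members, which is more than n, so they cannot be disjoint.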
If enough nodes crash or the network becomes partitioned so that no group of nodes that are able to talk to each other are in a majority, Zab stops handling requests. If your acknowledged update gets accepted by a set of nodes that disappear and never come back online, your cluster will lose some data (but only when you ask it to move on, and leave its dead nodes behind).
Handling more than two requests is done by handling them two at a time, until there's only one state left.

How are out-of-order and wait-free writes handled?

As stated in Guarantees:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Let's assume a client makes 2 updates (update1 and update2) in a very short time window (I understand ZooKeeper is good at read-dominated applications). So my questions are:
Is it possible that update2 is received before update1, so that ZooKeeper gives update1 a later stamp than update2? I assume yes, given the nature of network connections. If so, the client loses update2 and is left with update1. Is there any way ZooKeeper can ACK the client with a stamp or other data that lets the client determine whether update2 was really received after update1? Basically, ZooKeeper would tell the client what it saw on the server side, which gives the client some information to act on if that is not what it wanted.
What if there is a leader failure after receiving and confirming update1 and before receiving update2? I assume such writes are persisted somewhere in disk/DB etc. When the new leader comes back will it catch up first, meaning conduct update1, before confirming update2 back to client?
Just curious: since ZooKeeper claims it supports wait-free writes, does that mean there is a message queue built into ZooKeeper to hold incoming writes? Otherwise, if the leader has to make sure the update is propagated to all other followers, the client is actually blocked during this replication process. I am guessing that's part of the reason ZooKeeper does not support write-heavy applications.
For the first two questions, I think you can find details in Zookeeper's paper.
It's quite normal for different operations from the same client to reach a ZooKeeper node out of order at the packet level. But ZooKeeper uses TCP to ensure that sequential network packets are received in order.
The leader must write an operation to its write-ahead log before it can confirm it. From there the cases split along two dimensions. First, consider whether the leader recovers before the followers notice its failure. If it does, nothing bad happens: operations sent during the outage are lost, and the client resends them. If it does not, consider whether the leader had sent a proposal before it failed. If it failed before proposing, the client learns of the failure. If it had proposed, at least one node in the cluster must hold the newest transactions, and that node becomes the new leader in the next election round. When the original leader recovers, it realizes it is no longer the leader (every ZooKeeper transaction carries a 64-bit transaction id, whose high 32 bits are the epoch and whose low 32 bits are the proposal id). It then contacts the new leader and catches up (sometimes it must truncate its local transaction log first).
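The transaction id layout mentioned in this answer (epoch in the high 32 bits, proposal id in the low 32 bits) can be illustrated with plain bit operations. The helper names are hypothetical.

```python
def make_zxid(epoch: int, counter: int) -> int:
    """Pack an epoch and a per-epoch proposal counter into one 64-bit id."""
    return (epoch << 32) | (counter & 0xFFFFFFFF)


def zxid_epoch(zxid: int) -> int:
    return zxid >> 32


def zxid_counter(zxid: int) -> int:
    return zxid & 0xFFFFFFFF


old_leader_txn = make_zxid(epoch=1, counter=0xFFFFFFFF)
new_leader_txn = make_zxid(epoch=2, counter=0)
```

Because the epoch occupies the high bits, any transaction from a newer epoch compares greater than every transaction from an older one, which is how a recovering leader can tell that it has been superseded.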
I don't know the details since I haven't read ZooKeeper's source code. But the leader only needs acknowledgements from more than half of the followers before it responds to clients. ZooKeeper provides both blocking and non-blocking APIs, and you can choose whichever you like.

What to do if the leader fails in Multi-Paxos for master-slave systems?

In section 3, named Implementing a State Machine, of Lamport's paper Paxos Made Simple, Multi-Paxos is described. Multi-Paxos is used in Google Paxos Made Live. (Multi-Paxos is used in Apache ZooKeeper). In Multi-Paxos, gaps can appear:
In general, suppose a leader can get α commands ahead--that is, it can propose commands i + 1 through i + α commands after commands 1 through i are chosen. A gap of up to α - 1 commands could then arise.
Now consider the following scenario:
The whole system uses master-slave architecture. Only the master serves client commands. Master and slaves reach consensus on the sequence of commands via Multi-Paxos. The master is the leader in Multi-Paxos instances. Assume now the master and two of its slaves have the states (commands have been chosen) shown in the following figure:
Note that there is more than one gap in the master state. Due to asynchrony, the two slaves lag behind. At this time, the master fails.
What should the slaves do after they have detected the failure of the master (for example, by heartbeat mechanism)?
In particular, how should they handle the gaps and the commands that are missing relative to the old master?
Update about Zab:
As @sbridges has pointed out, ZooKeeper uses Zab instead of Paxos. To quote,
Zab is primarily designed for primary-backup (i.e., master-slave) systems, like ZooKeeper, rather than for state machine replication.
It seems that Zab is closely related to my problems listed above. According to the short overview paper of Zab, Zab protocol consists of two modes: recovery and broadcast. In recovery mode, two specific guarantees are made: never forgetting committed messages and letting go of messages that are skipped. My confusion about Zab is:
In recovery mode does Zab also suffer from the gaps problem? If so, what does Zab do?
The gaps should be the Paxos instances that have not reached agreement. In the paper Paxos Made Simple, a gap is filled by proposing a special “no-op” command that leaves the state unchanged.
If you care about the order of chosen values across Paxos instances, you'd better use Zab instead, because Paxos does not preserve causal order. https://cwiki.apache.org/confluence/display/ZOOKEEPER/PaxosRun
The missing commands should be the Paxos instances that have reached agreement but have not been learned by the learner. Such a value is immutable because it has been accepted by a quorum of acceptors. When you run a Paxos instance for that instance id, the same value will be recovered in phase 1b and chosen again.
When the slaves/followers detect a failure of the Leader, or the Leader loses the support of a quorum of slaves/followers, they should elect a new leader.
In ZooKeeper, there should be no gaps, as the follower communicates with the leader over TCP, which keeps FIFO order.
In recovery mode, after the leader is elected, the follower synchronizes with the leader first, and applies the modifications to its state until NEWLEADER is received.
In broadcast mode, the follower queues each PROPOSAL in pendingTxns and waits for the COMMITs in the same order. If the zxid of a COMMIT does not match the zxid at the head of pendingTxns, the follower exits.
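The broadcast-mode behaviour described above can be sketched as a FIFO queue. This is a simplified model, not ZooKeeper's actual follower code: the class is hypothetical, only the name `pendingTxns` comes from the answer (spelled `pending_txns` here), and "the follower exits" is modelled as raising an exception.

```python
from collections import deque
from typing import Deque, List, Tuple


class FollowerModel:
    def __init__(self) -> None:
        self.pending_txns: Deque[Tuple[int, str]] = deque()
        self.applied: List[Tuple[int, str]] = []

    def on_proposal(self, zxid: int, txn: str) -> None:
        # Proposals are queued in arrival order (FIFO over TCP).
        self.pending_txns.append((zxid, txn))

    def on_commit(self, zxid: int) -> None:
        # A COMMIT must refer to the oldest pending proposal.
        if not self.pending_txns or self.pending_txns[0][0] != zxid:
            raise RuntimeError(f"COMMIT {zxid} does not match head of queue")
        self.applied.append(self.pending_txns.popleft())


follower = FollowerModel()
follower.on_proposal(1, "create /a")
follower.on_proposal(2, "create /b")
follower.on_commit(1)

try:
    follower.on_commit(3)   # skips zxid 2, so the follower gives up
    out_of_order_detected = False
except RuntimeError:
    out_of_order_detected = True
```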
Multi-Paxos is used in Apache ZooKeeper
Zookeeper uses zab, not paxos. See this link for the difference.
In particular, each ZooKeeper node in an ensemble commits updates in the same order as every other node:
Unlike client requests, state updates must be applied in the exact
original generation order of the primary, starting from the original
initial state of the primary. If a primary fails, a new primary that
executes recovery cannot arbitrarily reorder uncommitted state
updates, or apply them starting from a different initial state.
Specifically, the ZAB paper says that a newly elected leader undertakes discovery to learn the next epoch number to set and which follower has the most up-to-date commit history. Each follower sends an ACK-E message stating the max contiguous zxid it has seen. The paper then says that the leader undertakes a synchronisation phase in which it transmits to followers the state they have missed. It notes that an interesting optimisation is to elect only a leader that has the most up-to-date commit history.
With Paxos you don't have to allow gaps. If you do allow gaps then the paper Paxos Made Simple explains how to resolve them from page 9. A new leader knows the last committed value it saw and possibly some committed values above it. Starting from the lowest gap it knows about, it probes the slots by running phase 1 for each of them. If a slot already holds a value, it runs phase 2 to fix that value; if it is free to choose, it sets a no-op value. Eventually it reaches a slot number for which no values have been proposed, and from there it runs as normal.
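The probing procedure in this paragraph can be sketched as a small model. It is heavily simplified: phase 1 responses are passed in as data, phase 2 is assumed to succeed, and all names (`NOOP`, `value_to_propose`, `fill_gaps`) are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

NOOP = "no-op"
# What one acceptor reports in phase 1 for a slot:
# (ballot number, value) if it has accepted something, otherwise None.
Accepted = Optional[Tuple[int, str]]


def value_to_propose(promises: List[Accepted]) -> str:
    """Pick the value accepted under the highest ballot, else a no-op."""
    accepted = [p for p in promises if p is not None]
    if not accepted:
        return NOOP
    return max(accepted, key=lambda p: p[0])[1]


def fill_gaps(known: Dict[int, str],
              phase1: Dict[int, List[Accepted]],
              highest_slot: int) -> Dict[int, str]:
    """Fill every unknown slot up to highest_slot (phase 2 assumed to succeed)."""
    result = dict(known)
    for slot in range(1, highest_slot + 1):
        if slot not in result:
            result[slot] = value_to_propose(phase1.get(slot, []))
    return result


known = {1: "cmd-a", 2: "cmd-b", 4: "cmd-d"}   # new leader's view, gaps at 3 and 5
phase1 = {
    3: [None, (7, "cmd-c"), (5, "cmd-old")],   # some acceptors accepted values
    5: [None, None, None],                     # nobody accepted anything
}
log = fill_gaps(known, phase1, highest_slot=5)
```

Slot 3 is fixed to the value seen under the highest ballot and slot 5 becomes a no-op, after which the log has no holes and normal operation can resume.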
In answer to your questions:
What should the slaves do after they have detected the failure of the master (for example, by heartbeat mechanism)?
They should attempt to lead after a randomised delay to try to reduce the risk of two candidates proposing at the same time which would waste messages and disk flushes as only one can lead. Randomised leader timeout is well covered in the Raft paper; the same approach can be used for Paxos.
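A randomised delay of this kind might be implemented as below. The 150 to 300 ms range is the one suggested in the Raft paper; the function name and parameters are hypothetical, and a real deployment would tune the range to its network.

```python
import random


def election_timeout_ms(base_ms: float = 150.0, spread_ms: float = 150.0,
                        rng: random.Random = random.Random()) -> float:
    """Pick a timeout uniformly from [base_ms, base_ms + spread_ms]."""
    return base_ms + rng.uniform(0.0, spread_ms)


seeded = random.Random(0)
timeouts = [election_timeout_ms(rng=seeded) for _ in range(5)]
```

Because each node draws its own timeout, one of them usually times out first and starts leading before the others compete.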
In particular, how to handle with the gaps and the missing commands with respect to that of the old master?
The new leader should probe each gap and fix it to either the highest value proposed for that slot or, if there is none, a no-op. Once it has filled in the gaps, it can lead as normal.
The answer of @Hailin explains the gap problem as follows:
"In zookeeper, there should be no gaps as the follower communicates with leader by TCP which keeps FIFO"
To supplement:
In the paper A simple totally ordered broadcast protocol, it mentions that ZooKeeper requires the prefix property:
"If $m$ is the last message delivered for a leader $L$, any message proposed before $m$ by $L$ must also be delivered."
This property mainly relies on the TCP mechanism used in Zab. In Zab Wiki, it mentions that the implementation of Zab must follow the following assumption (besides others):
Servers must process packets in the order that they are received. Since TCP maintains ordering when sending packets, this means that packets will be processed in the order defined by the sender.