How does Raft preserve safety when a leader commits a log entry and crashes before informing followers of this commitment? - distributed-computing

In my understanding, a leader sends AppendEntries RPCs to the followers, and if a majority of followers return success, the leader commits the entry. It commits the entry by applying it to its own state machine, and it also replies to the client to let the client know that the command succeeded.
However, at this point the commitment is not yet known to the followers. The leader will inform them in the next AppendEntries (or heartbeat) RPC calls.
In the simplest case, if the leader crashes after the commitment and before the next AppendEntries, Raft uses the "only the most up-to-date follower can win" election rule to ensure that the next leader must contain this log entry (even though it does not yet know the entry is committed), and the new leader will commit this entry and send AppendEntries to the other followers. In this way, the log entry is safely kept.
However, consider the following more complicated scenario (extracted from the PhD thesis "Consensus: Bridging Theory and Practice", page 23).
At this point, the log entry from term 2 has been replicated on a majority of the servers, but it is not committed. If S1 crashes as in (d1), S5 could be elected leader (with votes from S2, S3, and S4) and overwrite the entry with its own entry from term 3.
What if, at this point, the entry is committed on server S1 but not yet committed on the other servers? If S1 then crashes as in (d1), will this log entry be overwritten by S5?
In my understanding, a committed entry (applied to the state machine, with the result possibly already reported to the client) should never be overwritten, right?
Did I misunderstand anything about the Raft protocol?
Thanks.

There are more conditions in Raft to commit an entry.
On page 4 of this paper (the one-page summary of Raft) it says:
Leaders:
...
If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm set commitIndex = N (§5.3, §5.4).
In other words, it is not enough for the entry to be replicated to a majority; it must also belong to the leader's current term. This is why, in a practical system, a new leader proposes a no-op entry so that it can advance commitIndex.
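As a rough sketch of that rule (in Go, with invented names; no real Raft library is assumed):

```go
// A minimal sketch of the commit rule quoted above, assuming a leader that
// tracks matchIndex per follower. All names here are illustrative.
package sketch

type entry struct {
	Term    int
	Command []byte
}

type leader struct {
	currentTerm int
	commitIndex int
	log         []entry        // index 0 is a sentinel, so entries start at 1
	matchIndex  map[string]int // highest replicated log index per follower
	clusterSize int
}

// maybeAdvanceCommitIndex finds the largest N > commitIndex such that a
// majority of servers hold log[N] and log[N].Term == currentTerm.
func (l *leader) maybeAdvanceCommitIndex() {
	for n := len(l.log) - 1; n > l.commitIndex; n-- {
		if l.log[n].Term != l.currentTerm {
			// Terms in a log never decrease, so no lower entry can be from
			// the current term either; older-term entries are never counted
			// directly, only committed once a current-term entry above them is.
			break
		}
		count := 1 // the leader's own copy
		for _, m := range l.matchIndex {
			if m >= n {
				count++
			}
		}
		if count > l.clusterSize/2 {
			l.commitIndex = n
			return
		}
	}
}
```

Note the term check: an entry replicated to a majority but stamped with an older term is still not counted, which is exactly what defends against the Figure 8 scenario.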
So now we know the leader won't commit until those conditions hold. But what if it commits the entry and then crashes before telling the followers about the commit?
Later in section 5.4.1 of the same paper it says (emphasis mine):
Raft...guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election....Raft uses the voting process to prevent a candidate from winning an election unless its log contains all committed entries. A candidate must contact a majority of the cluster in order to be elected, which means that every committed entry must be present in at least one of those servers.
In short, the new leader by definition must have the entries that the old leader considered committed.

Related

What happens to uncommitted previous term log entries in Raft algorithm?

There are a number of questions here on StackOverflow around Figure 8, discussed in section 5.4.2 in the original Raft paper:
[Figure 8 from the Raft paper]
What has not been made clear by the paper or by any of the answers is the exact fate of that problematic entry (2, 3). My question is two-fold:
What exactly happens to the entry at index 2 during term 3, i.e. (2, 3), made by S5? The figure mentions that S5 will not become a leader because majority will reject its RequestVotes. Does that mean that upon receiving AppendEntries RPC, S5 will then overwrite its entry (2, 3) with (2, 2) and (3, 4) as per current leader in (e)?
If S5 is forced to overwrite this entry, and it is never committed, what response should the client that sent (2, 3) receive? Do the clients receive acknowledgements for uncommitted entries as if they were already applied to a state-machine?
The figure mentions that S5 will not become a leader because majority will reject its RequestVotes
As shown in (e) in the Raft paper, S5 cannot become a leader because its log is not at least as up-to-date as the logs of a majority (S1, S2, S3).
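For reference, the voter-side comparison from §5.4.1 can be sketched like this (illustrative Go, invented names):

```go
// A sketch of the "at least as up-to-date" check that voters apply when
// handling RequestVote, per §5.4.1 of the Raft paper.
package sketch

// candidateLogIsUpToDate reports whether a candidate's log (described by the
// term and index of its last entry) is at least as up-to-date as the voter's.
func candidateLogIsUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
	if candLastTerm != myLastTerm {
		return candLastTerm > myLastTerm // the log ending in the later term wins
	}
	return candLastIndex >= myLastIndex // same last term: the longer log wins
}
```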
Does that mean that upon receiving AppendEntries RPC, S5 will then overwrite its entry (2, 3) with (2, 2) and (3, 4) as per current leader in (e)?
Yes, the conflicting entries in S5's log will be overwritten by the logs of the current leader. Quoting the Raft paper:
If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any).
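A minimal sketch of that retry loop might look like this (Go; sendAppendEntries is a hypothetical stub, not a real API):

```go
// An illustrative sketch of the leader-side backoff described in the quote.
package sketch

type entry struct {
	Term    int
	Command []byte
}

// sendAppendEntries returns true if the follower's log matched at
// prevIndex/prevTerm and it appended the entries; the body is elided here.
func sendAppendEntries(follower string, prevIndex, prevTerm int, entries []entry) bool {
	return true // network call elided in this sketch
}

// replicateTo keeps decrementing nextIndex on rejection until the consistency
// check passes; log[0] is assumed to be a sentinel so real entries start at 1.
func replicateTo(follower string, nextIndex map[string]int, log []entry) {
	for {
		ni := nextIndex[follower]
		if sendAppendEntries(follower, ni-1, log[ni-1].Term, log[ni:]) {
			nextIndex[follower] = len(log) // logs now match through the end
			return
		}
		nextIndex[follower] = ni - 1 // back up one entry and retry
	}
}
```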
Do the clients receive acknowledgements for uncommitted entries as if they were already applied to a state-machine?
No, the clients only receive acknowledgements for committed entries, i.e. once the entry has been safely replicated. Here is a quote from the Raft paper:
When the entry has been safely replicated (as described below), the leader applies the entry to its state machine and returns the result of that execution to the client.
There is also the case where the leader has replicated the log entry but crashes before responding to the client, or where the response is lost on the network; the client then needs to retry, causing the command to be executed multiple times. Quoting the Raft paper:
However, as described so far Raft can execute a command multiple times: for example, if the leader crashes after committing the log entry but before responding to the client, the client will retry the command with a new leader, causing it to be executed a second time. The solution is for clients to assign unique serial numbers to every command. Then, the state machine tracks the latest serial number processed for each client, along with the associated response. If it receives a command whose serial number has already been executed, it responds immediately without re-executing the request.
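That bookkeeping is small enough to sketch (Go; the per-client map layout is an assumption, not a prescribed design):

```go
// A minimal sketch of the serial-number deduplication described in the quote.
package sketch

type stateMachine struct {
	lastSerial map[string]int64  // highest serial number applied per client
	lastReply  map[string]string // cached response for that serial number
}

// apply runs cmd exactly once per (clientID, serial) pair: a retried command
// with an already-seen serial gets the cached reply without re-execution.
func (sm *stateMachine) apply(clientID string, serial int64, cmd func() string) string {
	if serial <= sm.lastSerial[clientID] {
		return sm.lastReply[clientID] // duplicate: answer without re-executing
	}
	reply := cmd()
	sm.lastSerial[clientID] = serial
	sm.lastReply[clientID] = reply
	return reply
}
```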

Raft leaders commit a no-op entry at the beginning of their term

Recently I have read a paper on the Raft consensus algorithm. A newly elected leader does not know the current commit index.
How does a no-op solve this problem?
In Raft a newly elected leader (which implies it received a majority of votes in the cluster, which in turn means its log is at least as up-to-date as the logs of the nodes that granted it their votes) is not allowed to directly commit (I) entries from previous terms, i.e. from previous leaders.
However, it can do so implicitly. If it appends a new command to the log and replicates that command to other nodes, it can consider the command committed as soon as a majority responds with success. This implicitly commits all the previous commands as well, which can then be passed to the state machine if that hasn't already been done.
So if you add a no-op entry to the log, you are able to implicitly commit previous commands and thus figure out the current commitIndex.
(I): marking a command as safe to pass to the state machine, which happens as soon as the command is replicated on a majority of the nodes in the cluster.
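A sketch of the trick (invented Go names):

```go
// A sketch, with made-up names, of the no-op trick described above: on
// election, append an empty entry in the new term; committing it implicitly
// commits everything before it.
package sketch

type entry struct {
	Term    int
	Command []byte
}

type leader struct {
	currentTerm int
	log         []entry
}

// onElected appends a no-op in the leader's own term. Once a majority
// acknowledges it, the commit rule (log[N].term == currentTerm) lets the
// leader advance commitIndex past every entry from previous terms as well.
func (l *leader) onElected() {
	l.log = append(l.log, entry{Term: l.currentTerm, Command: nil})
}
```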

How does ZooKeeper guarantee "Single System Image"?

In the Consistency Guarantees section of ZooKeeper Programmer's Guide, it states that ZooKeeper will give "Single System Image" guarantees:
A client will see the same view of the service regardless of the server that it connects to.
According to the ZAB protocol, the leader can commit a transaction only if more than half of the followers acknowledge the proposal. So it is likely that not all the followers are in the same state.
If the followers are not all in the same state, how can ZooKeeper guarantee a "Single System Image"?
References:
ZooKeeper’s atomic broadcast protocol: Theory and practice
Single System Image
The leader only waits for responses from a quorum of the followers before committing a transaction. That doesn't mean the remaining followers skip the transaction or can "say no" to it.
Eventually, as the rest of the followers process the commit message from the leader, or catch up as part of synchronization, they will reach the same state as the leader (with some delay). (This is not to be confused with eventual consistency.)
How far behind a follower's state may lag depends on the configuration items syncLimit and tickTime (https://zookeeper.apache.org/doc/current/zookeeperAdmin.html).
A follower can be at most syncLimit * tickTime time units behind before it gets dropped.
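For example, with tickTime=2000 (milliseconds) and syncLimit=2, values like those in the sample replicated configuration in the ZooKeeper docs, a follower could be at most 2 × 2000 ms = 4 seconds behind before the leader drops it.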
The document is a little misleading; I have made a PR, see https://github.com/apache/zookeeper/pull/931.
In fact, the ZooKeeper client keeps track of the last zxid it has seen, so it will not connect to an older follower once it has read data from a newer server.
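Sketched in Go (illustrative only; the real client enforces this check during its connection handshake):

```go
// An illustrative sketch (not the real ZooKeeper client code) of the zxid
// check described above: remember the highest zxid seen, and never use a
// server whose state is older than that.
package sketch

type zkClient struct {
	lastSeenZxid int64
}

// usable reports whether a server is safe to (re)connect to: its latest
// zxid must be at least what this client has already observed.
func (c *zkClient) usable(serverLastZxid int64) bool {
	return serverLastZxid >= c.lastSeenZxid
}

// onReply advances the client's watermark after every read or write reply.
func (c *zkClient) onReply(zxid int64) {
	if zxid > c.lastSeenZxid {
		c.lastSeenZxid = zxid
	}
}
```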
All reads and writes go to a majority of the nodes before being considered successful, so there's no way for a read following a write to miss that previous write: at least one node in the read majority knows about it. (Otherwise the two majorities would be disjoint, which would require (n/2+1) + (n/2+1) ≤ n, which is impossible.) It doesn't matter if many nodes (at most all but one) have an outdated view of the world, since at least one of them knows it all.
If enough nodes crash, or the network becomes partitioned such that no group of nodes able to talk to each other forms a majority, Zab stops handling requests. If your acknowledged update was accepted only by a set of nodes that disappear and never come back online, your cluster will lose some data (but only when you ask it to move on and leave its dead nodes behind).
Handling more than two requests is done by handling them two at a time, until there is only one state left.

Paxos algorithm in the context of distributed database transactions

I had some confusion about Paxos, specifically in the context of database transactions:
In the paper "Paxos Made Simple", it says in the second phase that the proposer needs to choose the value of the highest-numbered proposal that any of the acceptors has accepted before (if no such value exists, the proposer is free to choose the value it originally proposed).
Questions:
On one hand, I understand it does so to maintain the consensus.
But on the other hand, I had confusion about what the value actually is - what's the point of "having to send acceptors the value that has been accepted before"?
In the context of database transactions, what if it needs to commit a new value? Does it need to start a new instance of Paxos?
If the answer to the above question is "yes", then how do the acceptors reset their state? (In my understanding, if they don't reset their state, the proposer is forced to send one of the old values that has been accepted before, rather than being free to commit whatever the new value is.)
There are different kinds of Paxos in the "Paxos Made Simple" paper. One is Paxos (plain Paxos, single-decree Paxos, the Synod protocol), another is Multi-Paxos. From an engineer's point of view, the first is a distributed write-once register and the second is a distributed append-only log.
Answers:
In the context of Paxos, the actual value is the value that was successfully written to the write-once register, which happens when a majority of the acceptors accept a value of the same round. The paper shows that any newly chosen value will always be the same as a previously chosen one (if one was chosen). So to read the actual value we should initiate a new round and return the value that is successfully written.
In the context of Multi-Paxos, the actual value is the latest value added to the log.
With Multi-Paxos we just add a new value to the log. To read the current value we read the log and return the latest version. At a low level, Multi-Paxos is an array of Paxos registers. To write a new value we put it, along with the position of the current value, into a free register, and then fill the earlier free registers with no-ops. When two registers contain two different next values for the same previous value, we choose the register with the lowest position in the array.
It is possible and trivial with Multi-Paxos: we just start a new round of Paxos over a free register. Although plain Paxos doesn't cover it, we can "extend" it and turn it into a distributed variable instead of a distributed register. I described this idea and the proof in the "A memo on how Paxos works" post.
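The write-once register behaviour hinges on the proposer-side selection rule, which can be sketched like this (invented Go types):

```go
// A sketch of the value-selection rule discussed above: after phase 1, adopt
// the value of the highest-round accepted proposal reported by the quorum;
// only if none exists is the proposer free to use its own value.
package sketch

type promise struct {
	acceptedRound int    // -1 if this acceptor has accepted nothing yet
	acceptedValue string // meaningful only when acceptedRound >= 0
}

func chooseValue(quorum []promise, myValue string) string {
	best := promise{acceptedRound: -1}
	for _, p := range quorum {
		if p.acceptedRound > best.acceptedRound {
			best = p
		}
	}
	if best.acceptedRound >= 0 {
		return best.acceptedValue // bound by a possibly-chosen earlier value
	}
	return myValue // the register is still unwritten: free choice
}
```

In this light, "sending acceptors the value that has been accepted before" is what makes the register write-once: once any value may have been chosen, every later proposer is bound to re-propose it.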
Rather than answering your questions directly, I'll try explaining how one might go about implementing a database transaction with Paxos, perhaps that will help clear things up.
The first thing to notice is that there are two "values" in question here. First is the database value, the application-level data that is being modified. Second is the 'Commit'/'Abort' decision. For Paxos-based transactions, the consensus "value" is the 'Commit'/'Abort' decision.
An important point to keep in mind about database transactions with respect to Paxos consensus is that Paxos does not guarantee that all of the peers involved in the transaction will actually see the consensus decision. When this is needed, as it usually is with databases, it's left to the application to ensure that this happens. This means that the state stored by some peers can lag behind others and any database application built on top of Paxos will need some mechanism for handling this. This can be very complicated and is all application-specific so I'm going to ignore all that completely and focus on ensuring that a simple majority of all database replicas agree on the value of revision 2 of the database key FOO which, of course, is initially set to BAR.
The first step is to send the new value for FOO, let's say that's BAZ, and its expected current revision, 1, along with the Paxos Prepare message. When the database replicas receive this message, they'll first look up their local copy of FOO and check whether its current revision matches the expected revision included with the Prepare message. If they match, the database replica will bundle a "Vote Commit" flag along with the Promise message it sends in response to the Prepare. If they don't match, "Vote Abort" is sent instead. (The revision check protects against the case where the value was modified since the last time the application read it; allowing overwrites in that case could corrupt application state.)
Once the transaction driver receives a quorum of Promise messages along with their associated "Vote Commit"/"Vote Abort" values, it must choose to propose either "Commit" or "Abort". The first step in choosing this value is to follow the Paxos requirement of checking the Promise messages to see whether any database replicant (the Acceptor in Paxos terms) has already accepted a "Commit"/"Abort" decision. If any of them has, then the transaction driver must choose the "Commit"/"Abort" value associated with the highest previously accepted proposal ID. If none of them has, it must decide on its own. This is done by looking at the "Vote Commit"/"Vote Abort" values bundled with the Promises. If a quorum of "Vote Commit"s are present, the transaction driver may propose "Commit"; otherwise it must propose "Abort".
From that point on, it's all standard Paxos messages that get exchanged back and forth until consensus is reached on the 'Commit'/'Abort' decision. Assuming 'Commit' is chosen, the database replicants will update the value and revision associated with FOO to BAZ and 2, respectively.
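Putting the driver's decision step into a sketch (all names invented; a hedged illustration of the procedure described above, not a definitive implementation):

```go
// The driver first honors the Paxos rule (re-propose the highest previously
// accepted Commit/Abort decision) and only decides from the bundled votes
// when no acceptor has accepted anything yet.
package sketch

type txnPromise struct {
	acceptedID int  // proposal ID of an accepted decision, -1 if none
	decision   bool // true = Commit, false = Abort (valid if acceptedID >= 0)
	voteCommit bool // this replica's revision-check vote
}

func decide(quorum []txnPromise) bool {
	bestID, bestDecision := -1, false
	commitVotes := 0
	for _, p := range quorum {
		if p.acceptedID > bestID {
			bestID, bestDecision = p.acceptedID, p.decision
		}
		if p.voteCommit {
			commitVotes++
		}
	}
	if bestID >= 0 {
		return bestDecision // Paxos obliges us to re-propose this decision
	}
	return commitVotes == len(quorum) // Commit only with a quorum of Vote Commits
}
```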
I wrote a long blog post, with links to source code, on the topic of doing transaction log replication with Paxos as described in the Paxos Made Simple paper. Here I give short answers to your questions; the blog post and source code show the complete picture.
On one hand, I understand it does so to maintain the consensus. But on the other hand, I had confusion about what the value actually is - what's the point of "having to send acceptors the value that has been accepted before"?
The value is the command the client is trying to run on the cluster. During an outage the client value transmitted to all nodes by the last leader may have only reached one node in the surviving majority. The new leader may not be that node. The new leader discovers the client value from at least one surviving node and then it transmits it to all the nodes in the surviving majority. In this manner, the new leader collaborates with the dead leader to complete any client work it may have had in progress.
In the context of database transactions, what if it needs to commit a new value? Does it need to start a new instance of Paxos?
It cannot choose any new commands from clients until it has rebuilt the history of values chosen by the last leader. The blog post describes this as a "leader takeover phase", where after a crash of the old leader the new leader tries to bring all nodes fully up to date.
In effect, whatever the last leader transmitted that reached a majority of nodes is chosen; the new leader cannot change this history. During the takeover phase it is simply synchronising nodes to bring them all up to date. Only when the new leader has finished this phase, and is known to be fully up to date with all chosen values, can it process new client commands (i.e. process any new work).
If the answer to the above question is "yes", then how do the acceptors reset their state?
They don't. There is a gap between a value being chosen and any node learning that the value has been chosen. In the context of a database, you cannot "commit" the value (apply it to the data store) until you have "learnt" the chosen value. Paxos ensures that a chosen value won't ever be undone. So don't commit the value until you learn that it has been chosen. The blog post gives more details on this.
This is from my experience of implementing Raft and reading the ZAB paper, which are two prevalent incarnations of Paxos-style consensus. I haven't really gotten into plain Paxos or Multi-Paxos.
When a client sends a command to any node in the cluster, that node redirects it to the leader; the leader then sends the entry to each node in the quorum, and once all of those nodes confirm it, the leader commits it to its own log.

What to do if the leader fails in Multi-Paxos for master-slave systems?

Background:
In section 3, named Implementing a State Machine, of Lamport's paper Paxos Made Simple, Multi-Paxos is described. Multi-Paxos is used in Google Paxos Made Live. (Multi-Paxos is used in Apache ZooKeeper). In Multi-Paxos, gaps can appear:
In general, suppose a leader can get α commands ahead--that is, it can propose commands i + 1 through i + α after commands 1 through i are chosen. A gap of up to α - 1 commands could then arise.
Now consider the following scenario:
The whole system uses master-slave architecture. Only the master serves client commands. Master and slaves reach consensus on the sequence of commands via Multi-Paxos. The master is the leader in Multi-Paxos instances. Assume now the master and two of its slaves have the states (commands have been chosen) shown in the following figure:
[Figure: the log states of the master and the two slaves, with gaps in the master's chosen commands]
Note that there is more than one gap in the master's state. Due to asynchrony, the two slaves lag behind. At this point, the master fails.
Problem:
What should the slaves do after they have detected the failure of the master (for example, by heartbeat mechanism)?
In particular, how should they handle the gaps and the commands missing relative to the old master?
Update about Zab:
As @sbridges has pointed out, ZooKeeper uses Zab instead of Paxos. To quote,
Zab is primarily designed for primary-backup (i.e., master-slave) systems, like ZooKeeper, rather than for state machine replication.
It seems that Zab is closely related to the problems listed above. According to the short overview paper of Zab, the Zab protocol consists of two modes: recovery and broadcast. In recovery mode, two specific guarantees are made: committed messages are never forgotten, and messages that were skipped are let go. My confusion about Zab is:
Does Zab also suffer from the gaps problem in recovery mode? If so, what does Zab do about it?
A gap corresponds to a Paxos instance that has not reached agreement. In the paper Paxos Made Simple, a gap is filled by proposing a special "no-op" command that leaves the state unchanged.
If you care about the order of chosen values across Paxos instances, you'd better use Zab instead, because Paxos does not preserve causal order. https://cwiki.apache.org/confluence/display/ZOOKEEPER/PaxosRun
A missing command corresponds to a Paxos instance that has reached agreement but whose value has not yet been learned by the learner. The value is immutable because it has been accepted by a quorum of acceptors. When you run a new Paxos round for that instance id, the same value will be recovered in phase 1b and chosen again.
When the slaves/followers detect a failure of the leader, or the leader loses the support of a quorum of slaves/followers, they should elect a new leader.
In ZooKeeper there should be no gaps, as the follower communicates with the leader over TCP, which preserves FIFO ordering.
In recovery mode, after the leader is elected, the follower first synchronizes with the leader and applies the modifications to its state until NEWLEADER is received.
In broadcast mode, the follower queues each PROPOSAL in pendingTxns and waits for the COMMITs in the same order. If the zxid of a COMMIT does not match the zxid at the head of pendingTxns, the follower will exit.
https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab1.0
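The PROPOSAL/COMMIT ordering check described above can be sketched as follows (illustrative Go, not actual ZooKeeper code):

```go
// PROPOSALs queue up in pendingTxns, and each COMMIT must match the zxid at
// the head of the queue or the follower gives up, mirroring the behaviour
// described above.
package sketch

import "fmt"

type follower struct {
	pendingTxns []int64 // zxids of proposals awaiting COMMIT, in arrival order
}

func (f *follower) onProposal(zxid int64) {
	f.pendingTxns = append(f.pendingTxns, zxid)
}

func (f *follower) onCommit(zxid int64) error {
	if len(f.pendingTxns) == 0 || f.pendingTxns[0] != zxid {
		return fmt.Errorf("COMMIT %d out of order: shutting down", zxid)
	}
	f.pendingTxns = f.pendingTxns[1:] // apply the committed transaction here
	return nil
}
```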
Multi-Paxos is used in Apache ZooKeeper
ZooKeeper uses Zab, not Paxos. See this link for the difference.
In particular, each ZooKeeper node in an ensemble commits updates in the same order as every other node:
Unlike client requests, state updates must be applied in the exact original generation order of the primary, starting from the original initial state of the primary. If a primary fails, a new primary that executes recovery cannot arbitrarily reorder uncommitted state updates, or apply them starting from a different initial state.
Specifically, the ZAB paper says that a newly elected leader undertakes discovery to learn the next epoch number to set and to find out who has the most up-to-date commit history. Each follower sends an ACK-E message stating the max contiguous zxid it has seen. The leader then undertakes a synchronisation phase where it transmits to the followers the state they have missed. The paper notes that an interesting optimisation is to only elect a leader that has the most up-to-date commit history.
With Paxos you don't have to allow gaps. If you do allow gaps, then the paper Paxos Made Simple explains how to resolve them on page 9. A new leader knows the last committed value it saw, and possibly some committed values above it. It probes the slots from the lowest gap it knows about by running phase 1 on those slots. If there are values in those slots, it runs phase 2 to fix those values; if it is free to set a value, it sets a no-op value. Eventually it gets to a slot number for which no value has been proposed, and it runs as normal.
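A sketch of that recovery loop (Go; runPhase1 and runPhase2 are hypothetical stand-ins for full Paxos rounds on one slot):

```go
// Probe every slot from the lowest known gap up to the first free slot,
// re-proposing any existing value and plugging empty slots with no-ops.
package sketch

// runPhase1 probes a slot and returns the highest-numbered value already
// accepted there, or nil if the slot is empty. Body elided in this sketch.
func runPhase1(slot int) *string { return nil }

// runPhase2 proposes (and, on quorum acceptance, fixes) a value in a slot.
func runPhase2(slot int, value string) {}

func fillGaps(lowestGap, firstFreeSlot int) {
	for slot := lowestGap; slot < firstFreeSlot; slot++ {
		if prior := runPhase1(slot); prior != nil {
			runPhase2(slot, *prior) // an accepted value must be preserved
		} else {
			runPhase2(slot, "no-op") // harmless filler for the hole
		}
	}
	// From firstFreeSlot onward the leader proposes client commands as normal.
}
```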
In answer to your questions:
What should the slaves do after they have detected the failure of the master (for example, by heartbeat mechanism)?
They should attempt to lead after a randomised delay, to reduce the risk of two candidates proposing at the same time, which would waste messages and disk flushes, as only one can lead. Randomised leader timeout is well covered in the Raft paper; the same approach can be used for Paxos.
In particular, how should they handle the gaps and the commands missing relative to the old master?
The new leader should probe the gaps and fix each slot to either the highest value previously proposed for that slot, or else a no-op. Once it has filled in the gaps, it can lead as normal.
The answer by @Hailin explains the gap problem as follows:
"In ZooKeeper there should be no gaps, as the follower communicates with the leader over TCP, which preserves FIFO ordering."
To supplement:
In the paper A simple totally ordered broadcast protocol, it is mentioned that ZooKeeper requires the prefix property:
"If $m$ is the last message delivered for a leader $L$, any message proposed before $m$ by $L$ must also be delivered."
This property mainly relies on the TCP mechanism used in Zab. In the Zab wiki, it is mentioned that the implementation of Zab must follow the following assumption (among others):
Servers must process packets in the order that they are received. Since TCP maintains ordering when sending packets, this means that packets will be processed in the order defined by the sender.