How do replicas coming back online in PAXOS or RAFT catch up? - distributed-computing

In consensus algorithms such as Paxos and Raft, a value is proposed, and if a quorum agrees, it's written durably to the data store. What happens to the participants that were unavailable at the time of the quorum? How do they eventually catch up? This seems to be left as an exercise for the reader wherever I look.

Take a look at the Raft protocol; catch-up is simply built into the algorithm. The leader tracks, for each follower, the highest log index known to be replicated on that follower (matchIndex) and the index of the next entry to send to it (nextIndex), and it always sends entries to each follower starting at that follower's nextIndex. Because of this, no special case is needed to catch up a follower that was missing when an entry was committed: when the follower restarts, the leader simply resumes sending it entries from its nextIndex, backing up as needed until their logs match. Thus the node is caught up.
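To make that concrete, here is a minimal sketch in Go of the bookkeeping described above. The type and method names (Leader, replicateTo, sendAppendEntries) are illustrative, not taken from any particular Raft implementation, and the RPC itself is stubbed out.

```go
package raft

// Entry is a single log entry created in a particular term.
type Entry struct {
	Term    int
	Command []byte
}

// Leader holds the per-follower bookkeeping described above.
type Leader struct {
	log        []Entry        // the leader's log (0-indexed here for simplicity)
	nextIndex  map[string]int // index of the next entry to send to each follower
	matchIndex map[string]int // highest index known to be replicated on each follower
}

// sendAppendEntries stands in for the real AppendEntries RPC; it returns false
// when the follower rejects the entries because its log diverges at prevIndex.
func (l *Leader) sendAppendEntries(follower string, prevIndex int, entries []Entry) bool {
	// ... network call and prevLogIndex/prevLogTerm consistency check omitted ...
	return true
}

// replicateTo sends entries to one follower starting at its nextIndex. A
// follower that was offline simply has a low nextIndex, so the same loop that
// drives normal replication also catches it up -- no special case required.
func (l *Leader) replicateTo(follower string) {
	for {
		next := l.nextIndex[follower]
		if l.sendAppendEntries(follower, next-1, l.log[next:]) {
			l.nextIndex[follower] = len(l.log)
			l.matchIndex[follower] = len(l.log) - 1
			return
		}
		// Consistency check failed: back up one entry and retry until the logs match.
		if l.nextIndex[follower] > 0 {
			l.nextIndex[follower]--
		}
	}
}
```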

This is mentioned in Paxos Made Simple:
Because of message loss, a value could be chosen with no learner ever finding out. The learner could ask the acceptors what proposals they have accepted, but failure of an acceptor could make it impossible to know whether or not a majority had accepted a particular proposal. In that case, learners will find out what value is chosen only when a new proposal is chosen. If a learner needs to know whether a value has been chosen, it can have a proposer issue a proposal, using the algorithm described above.
And also in the Raft paper:
The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower.
If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any).
If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully.
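To connect those quotes, here is a follower-side sketch of that consistency check in Go, reusing the Entry type from the sketch above; the field and function names are illustrative, not taken from the paper or from any implementation.

```go
// Follower keeps only the state needed to illustrate the consistency check.
type Follower struct {
	currentTerm int
	log         []Entry
}

// appendEntries returns false when the check fails; the leader then decrements
// nextIndex for this follower and retries, as the paper describes.
func (f *Follower) appendEntries(term, prevLogIndex, prevLogTerm int, entries []Entry) bool {
	if term < f.currentTerm {
		return false // request from a stale leader
	}
	// Reject unless our log contains an entry at prevLogIndex with prevLogTerm.
	if prevLogIndex >= len(f.log) {
		return false
	}
	if prevLogIndex >= 0 && f.log[prevLogIndex].Term != prevLogTerm {
		return false
	}
	// Drop any conflicting entries after prevLogIndex and append the leader's.
	f.log = append(f.log[:prevLogIndex+1], entries...)
	return true
}
```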

With the original Paxos papers, it is indeed left as an exercise for the reader. In practice, with Paxos you can send additional messages, such as negative acknowledgements, to propagate more information around the cluster as a performance optimisation. That can be used to let a node know that it is behind due to lost messages. Once a node knows that it is behind, it needs to catch up, which can be done with additional message types. That is described as Retransmission in the Trex multi-paxos engine that I wrote to demystify Paxos.
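As an illustration of that kind of catch-up exchange, here is a hedged sketch in Go. The message names (Catchup, CatchupResponse) and the slot-based layout are my own illustration of the idea, not the exact types used by Trex.

```go
package paxos

// Chosen is a value that the cluster has chosen for a given log slot.
type Chosen struct {
	Slot  int
	Value []byte
}

// Catchup is a negative-acknowledgement style request: a node that has noticed
// (for example from a message carrying a higher slot number) that it is behind
// asks a peer for everything chosen above its highest known slot.
type Catchup struct {
	HighestKnown int
}

// CatchupResponse carries the chosen values the lagging node is missing.
type CatchupResponse struct {
	Values []Chosen
}

// handleCatchup is what an up-to-date peer runs: return every value chosen at
// a slot above the requester's highest known slot.
func handleCatchup(req Catchup, chosenLog []Chosen) CatchupResponse {
	var resp CatchupResponse
	for _, c := range chosenLog {
		if c.Slot > req.HighestKnown {
			resp.Values = append(resp.Values, c)
		}
	}
	return resp
}
```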
The Google Chubby Paxos paper Paxos Made Live criticises Paxos for leaving a lot up to the people doing the implementation. Lamport trained as a mathematician and was attempting to mathematically prove that you couldn't have consensus over lossy networks when he found the solution. The original papers very much supply a proof that consensus is possible rather than explain how to build practical systems with it. Modern papers usually describe an application of some new technique backed up by experimental results; they also supply a formal proof, but IMHO most people skip over it and take it on trust. The unapproachable way that Paxos was introduced means that many people who quote the original papers have failed to see that they describe leader election and multi-Paxos. Unfortunately, Paxos is still taught in a theoretical manner rather than in a modern style, which leads people to think that it is hard and to miss the essence of it.
I argue that Paxos is simple, but that reasoning about failures in a distributed system and testing to find any bugs is hard. Everything that is left to the reader in the original papers doesn't affect correctness, but it does affect latency, throughput and the complexity of the code. Once you understand what makes Paxos correct, which is mechanically simple, it becomes straightforward to write the rest of what is needed in a way that doesn't violate consistency when you optimise the code for your use case and workload.
For example, Corfu and CURP give blisteringly high performance: one uses Paxos only for metadata, the other only needs to do Paxos when there are concurrent writes to the same keys. Those solutions don't directly compete with Raft or Multi-Paxos, as they solve specific high-performance scenarios (e.g., k-v stores). Yet they demonstrate that for practical applications there is a huge amount of optimisation you can make, if your particular workload will let you, while still using Paxos for some part of the overall solution.

Related

Why does the RAFT protocol reject RequestVote with a lower term?

In Raft every node rejects any request with a term number lower than its own. But why do we need this for the RequestVote RPC? If the Leader Completeness Property holds, then the node can vote for this candidate, right? So why reject the request? I can't come up with an example where, without this additional check when receiving RequestVote, we can lose consistency, safety, etc.
Maybe someone can help, please?
A candidate with a lower term in its RequestVote RPC has a log that may not be up to date with the other nodes, because a leader could have been elected in a higher term and already replicated an entry to a majority of servers.
If this candidate were elected leader, the rules of Raft would prevent it from doing anything, for safety reasons: its RPCs would be rejected by the other servers due to the lower term.
I think there are these reasons.
The term in Raft is monotonically increasing. Any request from a previous term was made based on stale state and will simply be rejected, with the current term and leader information returned to the sender.
Any elected leader must have all committed entries from before its election, and a candidate from a previous term is unlikely to have them, since it cannot have any entries committed in the current term.
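Here is a sketch in Go of the term check under discussion; the field and helper names are illustrative, and the log up-to-date comparison is deliberately elided to keep the focus on the term rule.

```go
package raft

type VoteRequest struct {
	Term        int
	CandidateID string
}

type VoteReply struct {
	Term        int
	VoteGranted bool
}

type Node struct {
	currentTerm int
	votedFor    string
}

// handleRequestVote rejects any candidate whose term is lower than ours: such
// a candidate is acting on stale state, and even if it somehow won, every
// server already at a higher term would reject its AppendEntries anyway.
func (n *Node) handleRequestVote(req VoteRequest) VoteReply {
	if req.Term < n.currentTerm {
		// Reply with our term so the candidate updates itself and steps down.
		return VoteReply{Term: n.currentTerm, VoteGranted: false}
	}
	if req.Term > n.currentTerm {
		n.currentTerm = req.Term
		n.votedFor = ""
	}
	// A real implementation would also require the candidate's log to be at
	// least as up to date as ours (the election restriction); omitted here.
	if n.votedFor == "" || n.votedFor == req.CandidateID {
		n.votedFor = req.CandidateID
		return VoteReply{Term: n.currentTerm, VoteGranted: true}
	}
	return VoteReply{Term: n.currentTerm, VoteGranted: false}
}
```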

Does paxos provide true linearizable consistency or not?

I think I might be confusing concepts here, but it seems to me like Paxos would provide linearizable consistency for systems that implement it.
I know Cassandra uses it. I'm not 100% clear on how, but assuming a leader is elected and that single leader does all the writes, then communication is synchronous and real-time linearizability is achieved, right?
But consensus algorithms like Paxos are generally considered partially synchronous because there is a quorum (not 100% of node communication). Does this also mean it's not truly linearizable?
Maybe because there is only a quorum, a node could fall out of sync, and that would break linearizability?
A linearizable system does not need to be synchronous. Linearizability is a safety property: it says "nothing bad happens" but it doesn't affect linearizability if nothing good happens either. Any reads or writes that do not return (or that return an error) can be ignored when checking for linearizability. This means it's perfectly possible for a system to be linearizable even if one or more of the nodes are faulty or partitioned or running slowly.
Paxos is commonly used to implement a replicated state machine: a system that executes a sequence of operations on multiple nodes at once. Since the operations are deterministic and the nodes all agree on the operations to run and the sequence in which to run them, the nodes all converge to the same state (eventually).
You can implement a linearizable system using Paxos by having the operations in the sequence be writes and reads using the fact that the operations are placed in a totally-ordered sequence (i.e. linearized) by the Paxos protocol.
It's important to put the reads in the sequence as well as the writes. Imagine instead you only used Paxos to agree on the writes, and served reads directly from a node's local state. If the node serving the reads is partitioned from the other nodes then it would serve stale reads, violating linearizability. Each read must involve a quorum of nodes to ensure that the returned value is fresh, which means (effectively) putting the read into the sequence alongside the writes.
(There's some tricks you can play to make reads a bit more efficient than writes, given that reads commute with each other and don't need to be persisted to disk, but you can't escape the need to contact a quorum of nodes for both read and write operations)

Is a replication log necessary to achieve linearizability in a distributed store

The Raft algorithm used by etcd and the ZAB algorithm used by ZooKeeper both use a replicated log to update a state machine.
I was wondering if it's possible to design a similar system by simply using leader election and versioned values, and why those systems decided to use a replication log.
In my example, we have the following setup:
machine A (Leader), contain version 1
machine B (Follower), contain version 1
machine C (Follower), contain version 1
And the write would go like this:
Machine A receives a write request and stores the pending write V2
Machine A sends a prepare request to Machine B and Machine C
The followers (Machine B and Machine C) send an acknowledgement to the leader (Machine A)
After the leader (Machine A) receives acknowledgements from a quorum of machines, it knows V2 is now committed and sends a success response to the client
The leader (Machine A) sends a finalize request to the followers (Machine B and Machine C) to inform them that V2 is committed and V1 can be discarded.
For this system to work, on leader change, after acquiring the leader lease, the new leader has to get the latest data version by reading from a quorum of nodes before accepting requests.
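To make the proposal concrete, here is a sketch in Go of the five steps and the leader-change rule. The interface and names (Prepare, Finalize, onElected) are purely illustrative, and leader election, retries and failure handling are left out, exactly as they are in the description above.

```go
package kv

// Versioned is the single replicated value with a monotonically increasing version.
type Versioned struct {
	Version int
	Data    []byte
}

type follower interface {
	Prepare(v Versioned) bool // step 2/3: store the pending write, acknowledge on success
	Finalize(version int)     // step 5: commit notification, older versions may be discarded
	Read() Versioned          // used by a new leader to find the latest version
}

type leader struct {
	current   Versioned
	followers []follower
}

// write implements steps 1-5 of the proposal: the write succeeds once a quorum
// (including the leader itself) has acknowledged the pending version.
func (l *leader) write(data []byte) bool {
	pending := Versioned{Version: l.current.Version + 1, Data: data}
	acks := 1 // the leader counts as one acknowledgement
	for _, f := range l.followers {
		if f.Prepare(pending) {
			acks++
		}
	}
	if acks <= (len(l.followers)+1)/2 {
		return false // no quorum, the write is not committed
	}
	l.current = pending
	for _, f := range l.followers {
		f.Finalize(pending.Version)
	}
	return true
}

// onElected is the leader-change rule: adopt the highest version seen before
// serving requests. This sketch asks every follower, which subsumes the
// requirement to hear from at least a quorum.
func (l *leader) onElected() {
	for _, f := range l.followers {
		if v := f.Read(); v.Version > l.current.Version {
			l.current = v
		}
	}
}
```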
The Raft algorithm in etcd and the ZAB algorithm in ZooKeeper both use a replicated log to update a state machine.
I was wondering if it's possible to design a similar system by simply using leader election and versioned values.
Yes, it's possible to achieve consensus/linearizability without log replication. Originally the consensus problem was solved in the Paxos Made Simple paper by Leslie Lamport (2001). He described two algorithms: Single Decree Paxos, to build a distributed linearizable write-once register, and Multi-Paxos, to make a distributed state machine on top of an append-only log (an ordered array of write-once registers).
An append-only log is a much more powerful abstraction than a write-once register, so it isn't surprising that people chose logs over registers. Besides, until Vertical Paxos (2009) was published, log replication was the only consensus protocol capable of cluster membership change, which is vital for many tasks: if you can't replace failed nodes, then eventually your cluster becomes unavailable.
Although Vertical Paxos is a good paper, I found Raft's idea of cluster membership change via joint consensus much easier to understand, so I wrote a post on how to adapt Raft's approach to Single Decree Paxos.
With time, the "write-once" nature of Single Decree Paxos was also resolved, turning write-once registers into distributed linearizable variables, a quite powerful abstraction suitable for many use cases. In the wild I saw that approach in the Treode database. If you're interested, I blogged about this improved SDP in the How Paxos Works post.
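To make the "linearizable variable" abstraction a bit more concrete, here is the kind of interface such a register might expose; this is purely illustrative and is not Treode's or anyone else's actual API.

```go
package register

// Register is a single replicated, linearizable variable: the alternative to
// an append-only log described above. Every operation goes through a consensus
// round, which is what makes reads and conditional writes linearizable.
type Register interface {
	// Read returns the current value together with the version it was written at.
	Read() (value []byte, version uint64, err error)

	// CompareAndSwap installs next only if the register still holds the value
	// written at expectedVersion, returning the new version on success.
	CompareAndSwap(expectedVersion uint64, next []byte) (newVersion uint64, err error)
}
```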
So now that we have an alternative to logs, it makes sense to consider it, because log-based replication is complex and has intrinsic limitations:
with logs you need to care about log compaction and garbage collection
the size of the log is limited by the storage of a single node
protocols for splitting a log and migrating to a new cluster are not well known
And why those system decided to use a replication log.
The log-based approach is older than the alternative, so it has had more time to gain popularity.
About your example
It's hard to evaluate, because you didn't describe how leader election happens, how conflicts between leaders are resolved, what the strategy for handling failures is, and how to change the membership of the cluster.
I believe if you describe them carefully you'll get a variant of Paxos.
Your example makes sense. However, have you considered every possible failure scenario? In step 2, Machine B could receive the message minutes before or after Machine C (or vice versa) due to network partitions or faulty routers. In step 3, the acknowledgements could be lost, delayed, or re-transmitted numerous times. The leader could also fail and come back up once, twice, or potentially several times all within the same consensus round. And in step 5, the messages could be lost, duplicated, or Machine A & C could receive the notification while B misses it....
Conceptual simplicity, also known as "reducing the potential points of failure", is key to distributed systems. Anything can happen, and will happen in realistic environments. Primitives, such as replicated logs based on consensus protocols proven to be correct in any environment, are a solid foundation upon which to build higher levels of abstraction. It's certainly true that better performance or latency or your "metric of interest" can be achieved by a custom-built algorithm but ensuring correctness for such an algorithm is a major time investment.
Replicated logs are simple, easily understood, predictable, and fall neatly into the domain of established consensus protocols (paxos, paxos-variants, & raft). That's why they're popular. It's not because they're the best for any particular application, rather they're understood and reliable.
For related references, you may be interested in Understanding Paxos and Consensus in the Cloud: Paxos Systems Demystified

Why can't CP systems also be CAP?

My understanding of the CAP acronym is as follows:
Consistent: every read gets the most recent write
Available: every node is available
Partition Tolerant: the system can continue upholding the A and C promises when the network connection between nodes goes down
Assuming my understanding is more or less on track, then something is bothering me.
AFAIK, availability is achieved via any of the following techniques:
Load balancing
Replication to a disaster recovery system
So if I have a system that I already know is CP, why can't I "make it full CAP" by applying one of these techniques to make it available as well? I'm sure I'm missing something important here, just not sure what.
It's partition tolerance that you got wrong.
As long as there isn't any partitioning happening, systems can be consistent and available. There are CA systems which say "we don't care about partitions". You can run them inside racks with server hardware and make partitioning extremely unlikely. The problem is: what if partitions occur?
The system can either choose to
continue providing the service, hoping that the other server is down rather than also providing the same service and serving diverging data - choosing availability (AP)
stop providing the service, because it can't guarantee consistency anymore, since it doesn't know whether the other server is down or in fact up and running with just the communication between the two broken - choosing consistency (CP)
The idea of the CAP theorem is that you cannot provide both availability and consistency once partitioning occurs: you can either go for availability and hope for the best, or play it safe and be unavailable but consistent.
Here are 2 great posts, which should make it clear:
You Can’t Sacrifice Partition Tolerance shows the idea that every truly distributed system needs to deal with partitioning now and then, and hence CA systems will break instantly at the first occurrence of a partition
CAP Twelve Years Later: How the "Rules" Have Changed is slightly more up to date and treats the CAP theorem more flexibly: developers can choose how applications behave during partitioning and can sacrifice a bit of consistency to gain some availability, ...
So to finally answer your question: if you take a CP system and replicate it more, you either run into the overhead of messages sent between the nodes to keep the system consistent, or, in case a substantial part of the nodes fails or network partitioning occurs without any part having a clear majority, it won't be able to continue operation, as it can no longer guarantee consistency. But yes, these lines are getting more blurred now, and I think the references I've provided will give you a much better understanding.

Why is Chubby lockserver not multi-master?

As I understand Chubby, at any given time there are five Chubby servers. One is the master and handles coordination of writes to the quorum, and the other four servers are read-only and forward writes to the master. Writes use Paxos to maintain consistency.
Can someone explain to me why there is a distinction between the master and the four replicas? Why isn't Chubby multi-master? This question could also apply to Zookeeper.
Having a single master is more efficient because nodes don't have to deal with as much contention.
Both Chubby and Zookeeper implement a distributed state machine where the point of the system is to decide a total ordering on transitions from one state to the next. It can take a lot of messages (theoretically an unbounded number) to resolve contention when multiple nodes propose a transition at the same time.
Paxos (and thus Chubby) uses an optimization called a "distinguished leader" where the replicas forward writes to the distinguished leader to reduce contention. (I am assuming Chubby replicas do this. If not they could, but the designers merely push that responsibility to the client.) Zookeeper does this too.
Both Chubby and Zookeeper actually do handle multiple leaders because they must deal with a leader that doesn't know it has died and then comes back from the dead. For Chubby, this is the whole point of using Paxos: eventually one of the leaders will win. (Well theoretically it may not, but we engineers do practical things like randomized backoff to make that probability tolerably small.) Zookeeper, on the other hand, assigns a non-decreasing id to each leader; and any non-current leader's transitions are rejected. This means that when a leader dies, Zookeeper has to pause and go through a reconfiguration phase before accepting new transitions.
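A sketch of that last point, the non-decreasing leader id, in Go; the types below are illustrative rather than ZooKeeper's actual ones (ZooKeeper encodes the leader epoch in the high bits of the zxid).

```go
package fencing

// Transition carries the epoch of the leader that proposed it.
type Transition struct {
	LeaderEpoch uint64
	Data        []byte
}

// Replica rejects anything proposed under an older epoch, which is how a
// leader that "comes back from the dead" is fenced off.
type Replica struct {
	acceptedEpoch uint64
	log           []Transition
}

func (r *Replica) Accept(t Transition) bool {
	if t.LeaderEpoch < r.acceptedEpoch {
		return false // proposed by a leader that has since been superseded
	}
	r.acceptedEpoch = t.LeaderEpoch
	r.log = append(r.log, t)
	return true
}
```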