Why is Chubby lockserver not multi-master? - apache-zookeeper

As I understand it, at any given time there are 5 Chubby servers. One is the master and coordinates writes to the quorum; the other 4 servers are read-only and forward write requests to the master. Writes use Paxos to maintain consistency.
Can someone explain to me why there is a distinction between the master and the 4 replicas? Why isn't Chubby multi-master? This question could also apply to Zookeeper.

Having a single master is more efficient because nodes don't have to deal with as much contention.
Both Chubby and Zookeeper implement a distributed state machine where the point of the system is to decide a total ordering on transitions from one state to the next. It can take a lot of messages (theoretically an unbounded number) to resolve contention when multiple nodes propose a transition at the same time.
Paxos (and thus Chubby) uses an optimization called a "distinguished leader" where the replicas forward writes to the distinguished leader to reduce contention. (I am assuming Chubby replicas do this; if not, they could, but then the designers merely push that responsibility onto the client.) Zookeeper does this too.
Both Chubby and Zookeeper actually do handle multiple leaders, because they must deal with a leader that doesn't know it has been deposed and then comes back from the dead. For Chubby, this is the whole point of using Paxos: eventually one of the leaders will win. (Well, theoretically it may not, but we engineers do practical things like randomized backoff to make that probability tolerably small.) Zookeeper, on the other hand, assigns a monotonically increasing epoch id to each leader, and any non-current leader's transitions are rejected. This means that when a leader dies, Zookeeper has to pause and go through a reconfiguration phase before accepting new transitions.
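A minimal sketch of that epoch-fencing idea, in Go (hypothetical types and names, not ZooKeeper's actual code): each leader is stamped with a monotonically increasing epoch, and a replica rejects any proposal carrying an epoch older than the newest it has seen.

    package main

    import "fmt"

    // Proposal is stamped with the epoch of the leader that issued it.
    type Proposal struct {
        Epoch int64
        Op    string
    }

    // Replica remembers the highest leader epoch it has seen.
    type Replica struct {
        currentEpoch int64
        log          []Proposal
    }

    // Apply rejects proposals from any leader older than the newest one seen,
    // which fences off a leader that has come back from the dead.
    func (r *Replica) Apply(p Proposal) error {
        if p.Epoch < r.currentEpoch {
            return fmt.Errorf("stale leader: epoch %d < %d", p.Epoch, r.currentEpoch)
        }
        r.currentEpoch = p.Epoch // adopt the newer epoch
        r.log = append(r.log, p)
        return nil
    }

    func main() {
        r := &Replica{}
        fmt.Println(r.Apply(Proposal{Epoch: 1, Op: "set x=1"})) // <nil>
        fmt.Println(r.Apply(Proposal{Epoch: 2, Op: "set x=2"})) // <nil>, new leader
        fmt.Println(r.Apply(Proposal{Epoch: 1, Op: "set x=3"})) // stale leader error
    }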

Related

Does paxos provide true linearizable consistency or not?

I think I might be confusing concepts here, but it seems to me like paxos would provide linearizable consistency for systems that implement it.
I know Cassandra uses it. I'm not 100% clear on how, but assuming a leader is elected and that single leader does all the writes, then communication is synchronous and real-time linearizability is achieved, right?
But consensus algorithms like paxos are generally considered partially synchronous because there is a quorum (not 100% of nodes communicating). Does this also mean it's not truly linearizable?
Maybe because only a quorum is required, a node could fall out of sync, and that would break linearizability?
A linearizable system does not need to be synchronous. Linearizability is a safety property: it says "nothing bad happens" but it doesn't affect linearizability if nothing good happens either. Any reads or writes that do not return (or that return an error) can be ignored when checking for linearizability. This means it's perfectly possible for a system to be linearizable even if one or more of the nodes are faulty or partitioned or running slowly.
Paxos is commonly used to implement a replicated state machine: a system that executes a sequence of operations on multiple nodes at once. Since the operations are deterministic and the nodes all agree on the operations to run and the sequence in which to run them, the nodes all converge to the same state (eventually).
You can implement a linearizable system using Paxos by making the operations in the sequence the reads and the writes, using the fact that the Paxos protocol places the operations into a totally-ordered sequence (i.e. linearizes them).
It's important to put the reads in the sequence as well as the writes. Imagine instead you only used Paxos to agree on the writes, and served reads directly from a node's local state. If the node serving the reads is partitioned from the other nodes then it would serve stale reads, violating linearizability. Each read must involve a quorum of nodes to ensure that the returned value is fresh, which means (effectively) putting the read into the sequence alongside the writes.
(There are some tricks you can play to make reads a bit more efficient than writes, given that reads commute with each other and don't need to be persisted to disk, but you can't escape the need to contact a quorum of nodes for both read and write operations.)
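As a toy illustration, in Go (a single-process stand-in, with the Paxos agreement abstracted away): here is what it means to put reads into the same totally-ordered sequence as writes. Every operation occupies one slot in the order, so a read reflects every write before it and can never return a stale value.

    package main

    import "fmt"

    // Op is one entry in the totally-ordered sequence that, in a real system,
    // Paxos would agree on. Reads go into the sequence alongside writes.
    type Op struct {
        IsRead bool
        Key    string
        Value  string // ignored for reads
    }

    // apply runs the agreed sequence against the state machine in order.
    // Because each read occupies a slot in the same order, it observes
    // exactly the writes that precede it: linearizability by construction.
    func apply(seq []Op) []string {
        state := map[string]string{}
        var results []string
        for _, op := range seq {
            if op.IsRead {
                results = append(results, op.Key+"="+state[op.Key])
            } else {
                state[op.Key] = op.Value
            }
        }
        return results
    }

    func main() {
        seq := []Op{
            {Key: "x", Value: "1"},
            {IsRead: true, Key: "x"}, // sees x=1, never a stale value
            {Key: "x", Value: "2"},
            {IsRead: true, Key: "x"}, // sees x=2
        }
        fmt.Println(apply(seq)) // [x=1 x=2]
    }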

How do replicas coming back online in Paxos or Raft catch up?

In consensus algorithms such as Paxos and Raft, a value is proposed, and if a quorum agrees, it's written durably to the data store. What happens to the participants that were unavailable at the time of the quorum? How do they eventually catch up? This seems to be left as an exercise for the reader wherever I look.
Take a look at the Raft protocol. It's simply built into the algorithm. If the leader tracks the highest replicated index (matchIndex) and the nextIndex to be sent to each follower, and the leader always sends entries to each follower starting at that follower's nextIndex, there is no special case needed to handle catching up a follower that was missing when the entry was committed. By its nature, when the follower restarts, the leader will always begin sending entries to that follower starting at that follower's nextIndex. Thus the node is caught up.
This is mentioned in Paxos Made Simple:
Because of message loss, a value could be chosen with no learner ever finding out. The learner could ask the acceptors what proposals they have accepted, but failure of an acceptor could make it impossible to know whether or not a majority had accepted a particular proposal. In that case, learners will find out what value is chosen only when a new proposal is chosen. If a learner needs to know whether a value has been chosen, it can have a proposer issue a proposal, using the algorithm described above.
And also in the Raft paper:
The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower.
If a follower’s log is inconsistent with the leader’s, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log (if any).
If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully.
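A sketch of that backtracking loop, in Go (hypothetical structures, not etcd's implementation): on each rejected AppendEntries, the leader decrements the follower's nextIndex and retries until the consistency check passes, after which the follower's log is overwritten from the leader's.

    package main

    import "fmt"

    type Entry struct {
        Term int
        Cmd  string
    }

    // follower accepts new entries only if the entry before them matches its log.
    type follower struct{ log []Entry }

    func (f *follower) appendEntries(prevIndex, prevTerm int, entries []Entry) bool {
        if prevIndex >= 0 {
            if prevIndex >= len(f.log) || f.log[prevIndex].Term != prevTerm {
                return false // consistency check failed; leader must back up
            }
        }
        f.log = append(f.log[:prevIndex+1], entries...) // overwrite any conflicts
        return true
    }

    // catchUp is the leader's retry loop: decrement nextIndex on each rejection.
    func catchUp(leaderLog []Entry, f *follower) {
        nextIndex := len(leaderLog)
        for {
            prev := nextIndex - 1
            prevTerm := -1
            if prev >= 0 {
                prevTerm = leaderLog[prev].Term
            }
            if f.appendEntries(prev, prevTerm, leaderLog[nextIndex:]) {
                return // logs now match; the follower is caught up
            }
            nextIndex--
        }
    }

    func main() {
        leaderLog := []Entry{{1, "a"}, {1, "b"}, {2, "c"}, {3, "d"}}
        f := &follower{log: []Entry{{1, "a"}}} // this follower missed entries
        catchUp(leaderLog, f)
        fmt.Println(f.log) // [{1 a} {1 b} {2 c} {3 d}]
    }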
With the original Paxos papers, it is indeed left as an exercise for the reader. In practice, with Paxos you can send additional messages such as negative acknowledgements to propagate more information around the cluster as a performance optimisation. That can be used to let a node know that it is behind due to lost messages. Once a node knows that it is behind it needs to catch up which can be done with additional message types. That is described as Retransmission in the Trex multi-paxos engine that I wrote to demystify Paxos.
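A rough sketch of that idea, in Go (hypothetical message shape, not Trex's actual wire protocol): a lagging node reports the highest slot it has contiguously applied, and the leader retransmits every chosen value above it.

    package main

    import "fmt"

    // Nack tells the leader the highest slot this node has contiguously applied.
    type Nack struct{ HighestApplied int }

    // retransmit answers a NACK with every chosen value above that slot,
    // letting a lagging node catch up.
    func retransmit(chosen []string, n Nack) []string {
        if n.HighestApplied+1 >= len(chosen) {
            return nil // nothing is missing
        }
        return chosen[n.HighestApplied+1:]
    }

    func main() {
        chosen := []string{"v0", "v1", "v2", "v3"}
        // A node that applied slots 0..1 but then saw slot 3 knows it has a gap.
        fmt.Println(retransmit(chosen, Nack{HighestApplied: 1})) // [v2 v3]
    }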
The Google Chubby Paxos paper, Paxos Made Live, criticises Paxos for leaving a lot up to the people doing the implementation. Lamport trained as a mathematician and was attempting to mathematically prove that you couldn't have consensus over lossy networks when he found the solution. The original papers very much supply a proof that consensus is possible, rather than explaining how to build practical systems with it. Modern papers usually describe an application of some new technique backed up by experimental results; while they also supply a formal proof, IMHO most people skip over it and take it on trust. The unapproachable way that Paxos was introduced means that many people who quote the original papers have failed to see that they describe leader election and multi-Paxos. Unfortunately, Paxos is still taught in a theoretical manner, not in a modern style, which leads people to think that it is hard and to miss the essence of it.
I argue that Paxos is simple, but that reasoning about failures in a distributed system and testing to find any bugs is hard. Everything that is left to the reader in the original papers doesn't affect correctness, but it does affect latency, throughput and the complexity of the code. Once you understand what makes Paxos correct, since it is mechanically simple, it becomes straightforward to write the rest of what is needed in a way that doesn't violate consistency when you optimise the code for your use case and workload.
For example, Corfu and CURP give blisteringly high performance: one uses Paxos only for metadata, while the other only needs to run Paxos when there are concurrent writes to the same keys. Those solutions don't directly compete with Raft or Multi-Paxos, as they solve for specific high-performance scenarios (e.g., k-v stores). Yet they demonstrate that for practical applications there is a huge amount of optimisation you can do, if your particular workload will let you, while still using Paxos for some part of the overall solution.

How to avoid loss of internal state of a master during fail-over to new master during a network partition

I was trying to implement a simple system with a single master node and multiple backup nodes, to learn about distributed and fault-tolerant architecture.
Currently this is what my system looks like:
1. There are N different nodes, each one identical, with 1 master node running a simple webserver.
2. All nodes communicate with each other using a simple heartbeat protocol, and each maintains global state (the count of available nodes, who the master is, and the uptime and downtime of each node).
3. If any node does not hear from the master for some set time, it raises an alarm. If consensus is reached that the master is down, a new master is elected.
4. If the network of nodes gets partitioned and the master is in the minor partition, it stops serving requests and takes itself down after a set period of time. The minor group cannot elect a master (some minimum number of nodes is required to make the decision).
5. A new master gets elected in the major partition after a set time of not hearing from the old master.
Now I am stuck with a problem: in step 4 above, there is a time gap where the old master is still serving requests while the new master is being elected in the major partition.
This can cause inconsistent data across the system if some client decides to write new data to the old master. How do we avoid this issue? I would be glad if someone points me in the right direction.
Rather than accepting writes to the minority master, what you want is to simply reject writes to the old master in that case, and you can do so by attempting to verify its mastership with a majority of the cluster on each write. If the master is on the minority side of a partition, it will no longer be able to contact a majority of the cluster and so will not be able to acknowledge clients’ requests. This brief period of unavailability is preferable to losing acknowledged writes in quorum based systems.
You should read the Raft paper. You're slowly moving towards an implementation of the Raft protocol, and it will probably answer many of the questions you might have along the way.
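A sketch of that quorum check, in Go (hypothetical interfaces; real systems fold the check into the replication round or use leases, since a standalone check can race with a concurrent election): before acknowledging a write, the master confirms with a majority of the cluster that it is still the master, so an old master on the minority side can no longer acknowledge anything.

    package main

    import (
        "errors"
        "fmt"
    )

    // Peer reports whether it still recognises the given node as master.
    type Peer interface {
        StillMaster(masterID string) bool
    }

    type peer struct{ master string }

    func (p peer) StillMaster(id string) bool { return p.master == id }

    // write is acknowledged only after a majority (including the master itself)
    // confirms this node's mastership; a partitioned old master fails here.
    func write(masterID string, peers []Peer, apply func()) error {
        votes := 1 // the master counts itself
        for _, p := range peers {
            if p.StillMaster(masterID) {
                votes++
            }
        }
        if votes <= (len(peers)+1)/2 {
            return errors.New("lost quorum: refusing to acknowledge write")
        }
        apply()
        return nil
    }

    func main() {
        // Cluster of 5: old master m1 is partitioned from three peers that
        // already follow m2, so only one peer still answers for m1.
        peers := []Peer{peer{"m1"}, peer{"m2"}, peer{"m2"}, peer{"m2"}}
        err := write("m1", peers, func() { fmt.Println("applied") })
        fmt.Println(err) // lost quorum: refusing to acknowledge write
    }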

Is a replication log necessary to achieve linearizability in a distributed store

The Raft algorithm used by etcd and the ZAB algorithm used by Zookeeper both use a replication log to update a state machine.
I was wondering if it's possible to design a similar system by simply using leader election and versioned values, and why those systems decided to use a replication log.
In my example, we have the following setup:
machine A (Leader), contains version 1
machine B (Follower), contains version 1
machine C (Follower), contains version 1
And the write would go like this:
1. Machine A receives a write request and stores the pending write V2.
2. Machine A sends a prepare request to Machine B and Machine C.
3. The followers (Machine B and Machine C) send an acknowledgement to the leader (Machine A).
4. After the leader (Machine A) receives acknowledgements from a quorum of machines, it knows V2 is now committed and sends a success response to the client.
5. The leader (Machine A) sends a finalize request to the followers (Machine B and Machine C) to inform them that V2 is committed and V1 can be discarded.
For this system to work, on a leader change, after acquiring the leader lease the new leader has to get the latest data version by reading from a quorum of nodes before accepting requests.
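A sketch of that write path, in Go (hypothetical names, steps 1-5 only): it deliberately ignores leader changes, retransmissions and partial failures, which is exactly where the hard part of the design lives, as the answers below point out.

    package main

    import "fmt"

    type versioned struct {
        version int
        value   string
    }

    type node struct {
        committed versioned
        pending   *versioned
    }

    func (n *node) prepare(v versioned) bool { n.pending = &v; return true } // ack

    func (n *node) finalize() {
        if n.pending != nil {
            n.committed = *n.pending // V2 is committed; V1 is discarded
            n.pending = nil
        }
    }

    // leaderWrite implements steps 1-5: store the pending write, send prepare
    // to the followers, count acks, and finalize once a majority acknowledged.
    func leaderWrite(leader *node, followers []*node, v versioned) bool {
        leader.prepare(v)
        acks := 1 // the leader's own copy counts
        for _, f := range followers {
            if f.prepare(v) {
                acks++
            }
        }
        if acks <= (len(followers)+1)/2 {
            return false // no quorum, write not committed
        }
        leader.finalize()
        for _, f := range followers {
            f.finalize()
        }
        return true
    }

    func main() {
        a, b, c := &node{}, &node{}, &node{}
        ok := leaderWrite(a, []*node{b, c}, versioned{version: 2, value: "V2"})
        fmt.Println(ok, a.committed, b.committed, c.committed)
    }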
The Raft algorithm in etcd and the ZAB algorithm in Zookeeper both use a replication log to update a state machine.
I was wondering if it's possible to design a similar system by simply using leader election and versioned values.
Yes, it's possible to achieve consensus/linearizability without log replication. Originally the consensus problem was solved by Leslie Lamport in "The Part-Time Parliament" (1998), later restated in the Paxos Made Simple paper (2001). He described two algorithms: Single Decree Paxos, to build a distributed linearizable write-once register, and Multi-Paxos, to make a distributed state machine on top of an append-only log (an ordered array of write-once registers).
An append-only log is a much more powerful abstraction than a write-once register, therefore it isn't surprising that people chose logs over registers. Besides, until Vertical Paxos (2009) was published, log replication was the only consensus protocol capable of cluster membership change, which is vital for many tasks: if you can't replace failed nodes, then eventually your cluster becomes unavailable.
Although Vertical Paxos is a good paper, I found Raft's idea of cluster membership change via joint consensus much easier to understand, so I wrote a post on how to adapt Raft's approach for Single Decree Paxos.
With time, the "write-once" nature of Single Decree Paxos was also resolved, turning write-once registers into distributed linearizable variables, a quite powerful abstraction suitable for many use cases. In the wild I saw that approach in the Treode database. If you're interested, I blogged about this improved SDP in the How Paxos Works post.
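For a feel of how small the register-based building block is, here is a sketch of the Single Decree Paxos acceptor in Go (my own hypothetical rendering; the proposer side, which must re-propose the highest accepted value it hears about in Prepare responses, is omitted):

    package main

    import "fmt"

    // Acceptor is the persistent half of Single Decree Paxos: it promises
    // ballots and accepts at most one value per ballot.
    type Acceptor struct {
        promised int    // highest ballot promised
        accepted int    // ballot of the accepted value, 0 if none
        value    string // the accepted value
    }

    // Prepare: promise to ignore ballots at or below b, and report anything
    // already accepted so the proposer must carry that value forward.
    func (a *Acceptor) Prepare(b int) (ok bool, acceptedBallot int, value string) {
        if b <= a.promised {
            return false, 0, ""
        }
        a.promised = b
        return true, a.accepted, a.value
    }

    // Accept: accept value v at ballot b unless a higher ballot was promised.
    func (a *Acceptor) Accept(b int, v string) bool {
        if b < a.promised {
            return false
        }
        a.promised, a.accepted, a.value = b, b, v
        return true
    }

    func main() {
        acc := &Acceptor{}
        ok, _, _ := acc.Prepare(1)
        fmt.Println(ok, acc.Accept(1, "x")) // true true: "x" accepted at ballot 1
        ok, prev, v := acc.Prepare(2)
        fmt.Println(ok, prev, v) // true 1 x: a later proposer must re-propose "x"
    }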
So now that we have an alternative to logs, it makes sense to consider it, because log-based replication is complex and has intrinsic limitations:
with logs you need to care about log compaction and garbage collection
the size of the log is limited by the storage of a single node
protocols for splitting a log and migrating to a new cluster are not well known
And why those system decided to use a replication log.
The log-based approach is older than the alternative, so it has had more time to gain popularity.
About your example
It's hard to evaluate, because you didn't describe how the leader election happens, how conflicts between leaders are resolved, what the strategy is for handling failures, and how to change the membership of the cluster.
I believe if you describe them carefully you'll get a variant of Paxos.
Your example makes sense. However, have you considered every possible failure scenario? In step 2, Machine B could receive the message minutes before or after Machine C (or vice versa) due to network partitions or faulty routers. In step 3, the acknowledgements could be lost, delayed, or retransmitted numerous times. The leader could also fail and come back up once, twice, or potentially several times, all within the same consensus round. And in step 5, the messages could be lost or duplicated, or Machines A and C could receive the notification while B misses it...
Conceptual simplicity, also known as "reducing the potential points of failure", is key to distributed systems. Anything can happen, and will happen in realistic environments. Primitives, such as replicated logs based on consensus protocols proven to be correct in any environment, are a solid foundation upon which to build higher levels of abstraction. It's certainly true that better performance or latency or your "metric of interest" can be achieved by a custom-built algorithm but ensuring correctness for such an algorithm is a major time investment.
Replicated logs are simple, easily understood, predictable, and fall neatly into the domain of established consensus protocols (Paxos, Paxos variants, and Raft). That's why they're popular. It's not that they're the best for any particular application; rather, they're understood and reliable.
For related references, you may be interested in Understanding Paxos and Consensus in the Cloud: Paxos Systems Demystified.

Paxos and Discovery

Suppose I throw some machines into an elastic cluster and want to run some consensus algorithm on them (say, Paxos). Suppose they know the initial size of the network, say, 8 machines.
So, they'll run a consensus algorithm, and the quorum is 5.
Now, consider these cases:
1. I see that CPU usage is too low, and I reduce the cluster size by half, to 4 machines.
2. There is a partition split, and each side gets 4 machines.
If I take the current cluster size to compute quorums, I'm subject to partition splits, since for the underlying cluster, situations (1) and (2) look exactly the same. However, if I use a fixed number, I'm not able to scale down the cluster (and I'm subject to inconsistencies due to partitions if I scale it up).
I have a third alternative: informing all the machines of the new cluster size when scaling. But there's a possibility of a partition happening right before a scale-up, for instance, with that partition not receiving the new size and still having enough machines for a quorum under the old size.
Is Paxos (or any other safe consensus algorithm) unusable in an elastic environment?
Quorum-based consensus protocols fundamentally require quorums in order to operate. Both Multi-Paxos and Raft can be used in environments with dynamically changing cluster and quorum sizes, but it must be done in a controlled manner that always maintains a consistent quorum. If, for example, you were currently using a cluster size of 8 and wanted to reduce that cluster to a size of 4, you could do so. However, that decision to reduce the cluster size to 4 must be a consensual decision that's agreed upon by the original 8.
Your question is a little unclear but it sounded like you were asking if you could safely reduce your cluster size to 4 as a recovery mechanism in the event that some kind of network partition renders your original cluster of 8 inoperable. The answer to that is effectively no since the decision to do so could not be consensual and attempting to go behind the back of the consensus algorithm is virtually guaranteed to result in inconsistencies. How would the new set of 4 be defined? How would you guarantee that all peers reached the same conclusion? How do you ensure they all make the same decision at the same time?
You could, of course, make all of these decisions manually and force the system to recover by shutting the consensus service down on each system and reconfiguring their quorum definition by hand. Assuming you don't screw up (which is an overwhelmingly big assumption for any real-world deployment), this would be safe. A better approach, though, would be either to design the system so that one or two network partitions won't halt it (lots of sites), or to use an eventual consistency model that gracefully handles the occasional network partition. There's no magic bullet for getting around CAP restrictions.
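A sketch of the controlled path, in Go (hypothetical shape; Raft's joint consensus and Vertical Paxos elaborate the full story): the new configuration is itself a value agreed under the old configuration's quorum, so there is never a moment when two disjoint quorums can both commit.

    package main

    import "fmt"

    type config struct{ members []string }

    func (c config) quorum() int { return len(c.members)/2 + 1 }

    // proposeConfigChange commits the new membership as an ordinary consensus
    // decision under the OLD quorum; only then does the new quorum take effect.
    func proposeConfigChange(old, next config, acks int) (config, error) {
        if acks < old.quorum() {
            return old, fmt.Errorf("need %d acks from old config, got %d", old.quorum(), acks)
        }
        return next, nil // the new config is now the agreed one
    }

    func main() {
        old := config{members: []string{"a", "b", "c", "d", "e", "f", "g", "h"}}
        next := config{members: []string{"a", "b", "c", "d"}}

        // 5 of the original 8 agree: the shrink to 4 is committed safely.
        cur, err := proposeConfigChange(old, next, 5)
        fmt.Println(cur.members, err) // [a b c d] <nil>

        // A minority partition of 4 cannot unilaterally shrink the cluster.
        _, err = proposeConfigChange(old, next, 4)
        fmt.Println(err) // need 5 acks from old config, got 4
    }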
Paxos and friends can scale in an elastic way (somewhat). Instead of changing the quorum size, though, just add learners. Learners are nodes that don't participate in consensus, but get all the decisions. Just like acceptors, learners accept reads and forward writes to the leader.
There are two styles of learner. The first listens to all events from the acceptors; in the second, the leader forwards all committed transitions to the learners.
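A sketch of the second style, in Go (hypothetical types): the leader streams committed transitions to learners, which apply them in order and can serve reads without ever being counted in any quorum.

    package main

    import "fmt"

    // Learner applies committed transitions in order but never votes, so adding
    // learners scales read capacity without changing the quorum size.
    type Learner struct {
        state map[string]string
        next  int // the next slot this learner expects
    }

    func NewLearner() *Learner { return &Learner{state: map[string]string{}} }

    // Deliver is called by the leader for each committed (slot, key, value).
    func (l *Learner) Deliver(slot int, key, value string) {
        if slot != l.next {
            return // gap: a real learner would request retransmission here
        }
        l.state[key] = value
        l.next++
    }

    func (l *Learner) Read(key string) string { return l.state[key] }

    func main() {
        l := NewLearner()
        l.Deliver(0, "x", "1")
        l.Deliver(1, "x", "2")
        fmt.Println(l.Read("x")) // 2, without l ever joining a quorum
    }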