Paxos and Discovery - distributed-computing

Paxos and Discovery - distributed-computing

Suppose I throw some machines in an elastic cluster and want to run some consensus algorithm in they (say, Paxos). Suppose they know the initial size of the network, say, 8 machines.
So, they'll run a consensus algorithm, and the quorum is 5.
Now, consider these cases:
I see that CPU is too low, and I reduce the cluster size in half, to 4 machines.
There is a partition split, and each split gets 4 machines.
If I take the current cluster size to get quorums, I'm subject to partition splits. Since for the underlying cluster, situations (1) and (2) look exactly the same. However, if I use a fixed number, I'm not able to scale down the cluster (and I'm subject to inconsistencies due to partition if I scale it up).
I have a third alternative, that of informing all the machines the size of the cluster when scaling, but there's a possibility of a partition happening right before a scale up, for instance, and that partition not receiving the new size and having enough quorum for a consensus using the old size.
Is Paxos (and any other safe consensus algorithms) unusable in an elastic environment?

Quorum-based consensus protocols fundamentally require quorums in order to operate. Both Multi-Paxos and Raft can be used in environments with dynamically changing cluster and quorum sizes but it must be done in a controlled manner that always maintains a consistent quorum. If, for example, you were currently using a cluster size of 8 and wanted to reduce that cluster to a size of 4. You could do so. However, that decision to reduce the cluster size to 4 must be a consensual decision that's agreed upon by the original 8.
Your question is a little unclear but it sounded like you were asking if you could safely reduce your cluster size to 4 as a recovery mechanism in the event that some kind of network partition renders your original cluster of 8 inoperable. The answer to that is effectively no since the decision to do so could not be consensual and attempting to go behind the back of the consensus algorithm is virtually guaranteed to result in inconsistencies. How would the new set of 4 be defined? How would you guarantee that all peers reached the same conclusion? How do you ensure they all make the same decision at the same time?
You could, of course, make all of these decisions manually and force the system to recover by shutting the consensus service down on each system and reconfiguring their quorum definition by hand. Assuming you don't screw up (which is an overwhelmingly big assumption for any real-world deployment) this would be safe. A better approach though would be to design the system such that one or two network partitions either won't halt the system (lots of sites) or use an eventual consistency model that gracefully handles the occasional network partitions. There's no magic bullet for getting around CAP restrictions.

Paxos and friends can scale in an elastic way (somewhat). Instead of changing the quorum size, though, just add learners. Learners are nodes that don't participate in consensus, but get all the decisions. Just like acceptors, learners accept reads and forward writes to the leader.
There are two styles of learner. The first listens to all events from acceptors; in the second the leader forwards all committed transitions to the learners

Related

How CA distributed system according to Cap Theorem can exist

How can a distributed system be consistent and available (CA)?
Because I would argue when a network partition occurs, CA cannot be possible in a way where every node of the network, even the partioned nodes that users are connected to, continue to be available and answer with consistent data.

It can't.
As often mentioned, the CAP theorem in its original form is a little misleading. It can be restated as
in the presence of the network partition, a distributed system is either available or consistent
so you are right. Generally, systems cannot be classified as CA, CP or AP only, since partition tolerance is a property of the system, which describes what to choose in case of a network partition. So it is possible that a system can behave according to AP sometimes, and CP other times (however it is not common).
Another interesting part is that RDBMS databases are often at the CA side of the triangle. This is only the case in a single node setup. Even with master (write) - slave (read) setup, the system is not CA (or if it is termed "CA" for some reason, and cannot recover from network partitions, then a split-bran scenario may happen, a new master is elected for the partition, and chaos ensues, possibly breaking the consistency of the system).
Useful read: https://codahale.com/you-cant-sacrifice-partition-tolerance/.

It can, but it won't.
The CAP theorem reasons about guarantees when one or more nodes get isolated from the rest of the cluster. In such cases a node has three options which result in the three known CAP trade-offs: i) it keeps responding to any received requests AP; ii) it no longer responds to received requests until it is again able to reach the others CP; iii) it shuts down before receiving any requests to eliminate the partition along with it CA.
In other words you can achieve CA by having your nodes shutting down instead of tolerating the partition but bear in mind that partitions are likely to keep happening hence this will converge to the scenario in which you have a single node in your cluster and I assume this is the opposite of what you want, i.e. having a cluster with multiple nodes is kind of the whole point.
Therefore in practice you end up choosing between CP and CA. See this answer for more illustrative examples.

Dr. Stonebraker says: The guidance from the CAP theorem is that you
must choose either A or C, when a network partition is present. As is
obvious in the real world, it is possible to achieve both C and A in
this failure mode.
See this for thoughts on why CA can exist:
CA is a specification of the operating range: you specify that the
system does not work well under partition or, more precisely, that
partitions are outside the operating range of the system.
My background is far from these theoretical considerations and I must say it is highly confusing. I am researching distributed Blockchain systems and I don't see why those "generalized" definitions of C, A, P must always apply. If let's say 5% of nodes fail or are otherwise partitioned, the consensus still functions. If an end user is connected to a partitioned node, the node could let the user know it lost connection. I don't even see how any major Blockchain network is CP without defining conditions such as "if a certain amount of nodes fail or get partitioned, the consensus halts".

Is a replication log necessary to achieve linearizability in distributed store

The Raft algorithm used by etcd and ZAB algorithm by Zookeeper are both using replication log to update a state machine.
I was wondering if it's possible to design a similar system by simply using leader election and versioned values. And why those system decided to use a replication log.
I my example if we have the following setup
machine A (Leader), contain version 1
machine B (Follower), contain version 1
machine C (Follower), contain version 1
And the write would go like this:
Machine A receive Write request and store pending write V2
Machine A send prepare request to Machine B and Machine C
Followers (Machine B and Machine C) send Acknowledge to leader (Machine A)
After Leader (machine A) receive Acknowledge from quorum of machine, it know V2 is now commited and send success response to client
Leader (machine a) send finalize request to Follower (machine A and Machine B) to inform them that V2 is commited and V1 could be discarded.
For this system to work, On leader change after acquiring leader Lease the leader machine have to get the latest data version by reading from a quorum of node before accepting Request.

The raft algorithm in ETCD and ZAB algorithm in Zookeeper are both using replication log to update a state machine.
I was wondering if it's possible to design a similar system by simply using leader election and versioned values.
Yes, it's possible to achieve consensus/linearizability without log replication. Originally the consensus problem was solved in the Paxos Made Simple paper by Leslie Lamport (1998). He described two algorithms: Single Decree Paxos to to build a distributed linearizable write-once register and Multi-Paxos to make a distributed state machine on top of append only log (an ordered array of write-once registers).
Append only logs is much more powerful abstraction than write-once registers therefore it isn't surprising that people chose logs over registers. Besides, until Vertical Paxos (2009) was published, log replication was the only consensus protocol capable of cluster membership change; what is vital for multiple tasks: if you can't replace failed nodes then eventually your cluster becomes unavailable.
Yet Vertical Paxos is a good paper, it was much easier for me to understand the Raft's idea of cluster membership via the joint consensus, so I wrote a post on how to adapt the Raft's way for Single Decree Paxos.
With time the "write-once" nature of the Single Decree Paxos was also resolved turning write-once registers into distributed linearizable variables, a quite powerful abstraction suitable for the many use cases. In the wild I saw that approach in the Treode database. If you got interested I blogged about this improved SDP in the How Paxos Works post.
So now when we have an alternative to logs it makes sense to consider it because log based replication is complex and has intrinsic limitations:
with logs you need to care about log compaction and garbage collection
size of the log is limited by the size of one node
protocols for splitting a log and migration to a new cluster are not well-known
And why those system decided to use a replication log.
The log-based approach is older that the alternative, so it has more time to gain popularity.
About your example
It's hard to evaluate it, because you didn't describe how the leader election happens and the conflicts between leaders are resolved, what is the strategy to handle failures and how to change membership of the cluster.
I believe if you describe them carefully you'll get a variant of Paxos.

Your example makes sense. However, have you considered every possible failure scenario? In step 2, Machine B could receive the message minutes before or after Machine C (or vice versa) due to network partitions or faulty routers. In step 3, the acknowledgements could be lost, delayed, or re-transmitted numerous times. The leader could also fail and come back up once, twice, or potentially several times all within the same consensus round. And in step 5, the messages could be lost, duplicated, or Machine A & C could receive the notification while B misses it....
Conceptual simplicity, also known as "reducing the potential points of failure", is key to distributed systems. Anything can happen, and will happen in realistic environments. Primitives, such as replicated logs based on consensus protocols proven to be correct in any environment, are a solid foundation upon which to build higher levels of abstraction. It's certainly true that better performance or latency or your "metric of interest" can be achieved by a custom-built algorithm but ensuring correctness for such an algorithm is a major time investment.
Replicated logs are simple, easily understood, predictable, and fall neatly into the domain of established consensus protocols (paxos, paxos-variants, & raft). That's why they're popular. It's not because they're the best for any particular application, rather they're understood and reliable.
For related references, you may be interested in Understanding Paxos and Consensus in the Cloud: Paxos Systems Demystified

Why can't CP systems also be CAP?

My understanding of the CAP acronym is as follows:
Consistent: every read gets the most recent write
Available: every node is available
Partion Tolerant: the system can continue upholding A and C promises when the network connection between nodes goes down
Assuming my understanding is more or less on track, then something is bother me.
AFAIK, availability is achieved via any of the following techniques:
Load balancing
Replication to a disaster recovery system
So if I have a system that I already know is CP, why can't I "make it full CAP" by applying one of these techniques to make it available as well? I'm sure I'm missing something important here, just not sure what.

It's the partition tolerance, that you got wrong.
As long as there isn't any partitioning happening, systems can be consistent and available. There are CA systems which say, we don't care about partitions. You can have them running inside racks with server hardware and make partitioning extremely unlikely. The problem is, what if partitions occur?
The system can either choose to
continue providing the service, hoping the other server is down rather than providing the same service and serving different data - choosing availability (AP)
stop providing the service, because it couldn't guarantee consistency anymore, since it doesn't know if the other server is down or in fact up and running and just the communication between these two broke off - choosing consistency (CP)
The idea of the CAP theorem is that you cannot provide both Availability AND Consistency, once partitioning occurs, you can either go for availability and hope for the best, or play it safe and be unavailable, but consistent.
Here are 2 great posts, which should make it clear:
You Can’t Sacrifice Partition Tolerance shows the idea, that every truly distributed system needs to deal with partitioning now and than and hence CA systems will break instantly at the first occurrence of a partition
CAP Twelve Years Later: How the "Rules" Have Changed is slightly more up to date and shows the CAP theorem more flexible, where developers can choose how applications behave during partitioning and can sacrifice a bit of consistency to gain some availability, ...
So to finally answer your question, if you take a CP system and replicate it more often, you might either run into overhead of messages sent between the nodes of the system to keep it consistent, or - in case a substantial part of the nodes fails or network partitioning occurs without any part having a clear majority, it won't be able to continue operation as it wouldn't be able to guarantee consistency anymore. But yes, these lines are getting more blurred now and I think the references I've provided will give you a much better understanding.

Why does a mongodb replica set need an odd number of voting members?

If find the replica set requirement a bit confusing, and I'm probably missing something obvious (like under which condition there are elections).
I understand that in normal operations you need quorum, and a voting takes place and to get a majority you need and odd numbers of machines.
But since we use a replica set for failover, if the master dies, then we are left with an even number of voting members, which based on my limited experience lengthen the time to elect a primary.
Also according to the documentation, the addition of a voting member doesn't start an election, it would seem that starting (booting) you replica set with an even number of nodes would make more sense?
So if we start say with 4 machines in the replica set, and one machine dies, there is a re-election with 3 machines, fast quorum. We add a machine back to get back to our normal operation state, no re-election and we are back to our normal operation conditions.
Can someone shed a light on this?

TL;DR: With single master systems, even partitions make it impossible to determine which remainder still has a majority, taking both systems down.
Let N be a cluster of four machines:
One machine dies, the others resume operation. Good.
Two machines die, we're offline because we no longer get a majority. Bad.
Let M be a cluster of three machines:
One machine dies, the others resume operation. Good.
Two machines die, we're offline because we no longer get a majority. Bad.
=> Same result at 3/4 of the cost.
Now, let's add an assumption or two:
We're also going to operate some kind of server application that uses the database
The network can be partitioned
Let's say you have two datacenters, one with two database instances and the backend server machines. If the connection to the backup center (which has one MongoDB instance) fails, you're still online.
Now if you added a second MongoDB instance at the backup data center, a network partition would, despite seemingly higher redundancy, yield lower availability since we'd lose the majority in case of a network partition and can't continue to operate.
=> Less availability at higher cost. But that doesn't answer the question yet.
Let's say you're really worried about availability: You have two data centers, with backend servers in both datacenters, anycast IPs, the whole deal. Now the network between the two DCs is partitioned, but some clients connect to DC A while other reach DC B. How do you now determine which datacenter may accept writes? It's not possible - this is why the odd number is necessary.
You don't actually need Anycast IPs, BGP or any fancy stuff for the problem to become real, any writing application (like a worker, a stale request, anything) would require later merging different writes, which is a completely different concurrency scheme.

avoiding overuse of consensus protocols in a distributed system

I'm new to distributed systems, and I'm reading about "simple Paxos". It creates a lot of chatter and I'm thinking about performance implications.
Let's say you're building a globally-distributed database, with several small-ish clusters located in different locations. It seems important to minimize the amount of cross-site communication.
What are the decisions you definitely need to use consensus for? The only one I thought of for sure was deciding whether to add or remove a node (or set of nodes?) from the network. It seems like this is necessary for vector clocks to work. Another I was less sure about was deciding on an ordering for writes to the same location, but should this be done by a leader which is elected via Paxos?
It would be nice to avoid having all nodes in the system making decisions together. Could a few nodes at each local cluster participate in cross-cluster decisions, and all local nodes communicate using a local Paxos to determine local answers to cross-site questions? The latency would be the same assuming the network is not saturated, but the cross-site network traffic would be much lighter.
Let's say you can split your database's tables along rows, and assign each subset of rows to a subset of nodes. Is it normal to elect a set of nodes to contain each subset of the data using Paxos across all machines in the system, and then only run Paxos between those nodes for all operations dealing with that subset of data?
And a catch-all: are there any other design-related or algorithmic optimizations people are doing to address this?

Good questions, and good insights!
It creates a lot of chatter and I'm thinking about performance implications.
Let's say you're building a globally-distributed database, with several small-ish clusters located in different locations. It seems important to minimize the amount of cross-site communication.
What are the decisions you definitely need to use consensus for? The only one I thought of for sure was deciding whether to add or remove a node (or set of nodes?) from the network. It seems like this is necessary for vector clocks to work. Another I was less sure about was deciding on an ordering for writes to the same location, but should this be done by a leader which is elected via Paxos?
Yes, performance is a problem that my team had seen in practice as well. We maintain a consistent database & distributed lock manager; and orignally used Paxos for all writes, some reads and cluster membership updates.
Here are some of the optimizations we did:
As much as possible, nodes sent the transitions to a Distinguished Proposer/Learner (elected via Paxos), which
decided on write ordering, and
batched transitions while waiting for the response from the prior instance. (But batching too much also caused problems.)
We had considered using multi-paxos but we ended up doing something cooler (see below).
With these optimizations, we were still hurting for performance, so we split our server into three layers. The bottom layer is Paxos; it does what you suggest; viz. merely decides the node membership of the middle layer. The middle layer is a custom-in-house-high-speed chain consensus protocol, which does consensus & ordering for the DB. (BTW, chain-consensus can be viewed as Vertical Paxos.) The top layer now just maintains the database/locks & client connections. This design has lead to several orders of magnitude latency and throughput improvement.
It would be nice to avoid having all nodes in the system making decisions together. Could a few nodes at each local cluster participate in cross-cluster decisions, and all local nodes communicate using a local Paxos to determine local answers to cross-site questions? The latency would be the same assuming the network is not saturated, but the cross-site network traffic would be much lighter.
Let's say you can split your database's tables along rows, and assign each subset of rows to a subset of nodes. Is it normal to elect a set of nodes to contain each subset of the data using Paxos across all machines in the system, and then only run Paxos between those nodes for all operations dealing with that subset of data?
These two together remind me of the Google Spanner paper. If you skip over the parts about time, it's essentially doing 2PC globally and Paxos on the shards. (IIRC.)