Is there a relationship between CRDTs and the RAFT protocol - or are they orthogonal? - distributed-computing

Take the use case of a multi-player networked game. Instantly you have the problem of replicating and reconciling shared state across the network.
There appear to be multiple tools aimed at aspects of this problem, and two in particular seem to overlap:
Conflict-free Replicated Data Types (CRDTs) - used for replicating shared state across nodes and merging concurrent updates without coordination.
The Raft consensus algorithm - used for electing a transactional leader in a distributed network to help achieve consensus.
My question is: Is there a relationship between CRDTs and the RAFT protocol - or are they orthogonal?

In distributed-systems terms, the two are quite different and serve very different use cases. Both aim to keep replicated state consistent, but CRDTs provide strong eventual consistency without sacrificing availability, while Raft provides strong consistency at the expense of availability. In the face of a network partition, CRDTs remain available, but a Raft cluster can become partially or fully unavailable: Raft is a consensus algorithm that relies on a majority of the cluster being able to communicate with each other in order to make progress.
There are also differences in the kind of state each can manage. CRDTs represent a limited, well-defined set of data types, while Raft and other consensus algorithms can model a much wider range of data structures and algorithms. Raft is typically used to implement a replicated state machine: commands are logged, replicated through the Raft algorithm, and then applied to the state machine. State machines can model data structures like maps and sets, or control concurrency by modeling things like locks, leader election, and semaphores.
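To make the CRDT side concrete, a grow-only counter (G-Counter) is one of the classic CRDT types: each replica increments only its own slot, and merging two replicas is an element-wise max, so merges commute and require no coordination. A minimal sketch (illustrative only, not tied to any particular library):

```python
# Minimal G-Counter CRDT sketch: each replica only increments its own entry,
# and merge takes the element-wise max, so merges are commutative,
# associative, and idempotent -- replicas converge without coordination.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

# Two replicas update independently (e.g. during a partition) and later merge.
a, b = GCounter("A"), GCounter("B")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```

Because merge is commutative, associative, and idempotent, replicas can exchange state in any order and still converge; that is exactly the property Raft doesn't need, since it funnels every change through one agreed log.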
You also have to look at your system in terms of scalability, and Raft and CRDTs differ significantly here as well. Raft is a leader-based system: it elects a single node as the leader, and all state changes to a Raft-replicated state machine go through that single leader and are synchronously replicated to a majority of followers before being applied to the state machine. CRDTs, by contrast, are significantly more scalable, since they are not limited by a single node.
Ultimately, the difference between Raft and CRDTs is the difference between consistency and performance. Raft is designed to create a consistent view of a single system, with a focus on safety over performance. Typically, consensus algorithms like Raft are used for things like configuration management and service discovery. CRDTs are designed to be fast and as consistent as possible without sacrificing availability. Typically, CRDTs are used for storage in systems that prioritize availability and are less critical.

Related

What is meant by Distributed System?

I am reading about distributed systems and getting confused about what it really means.
I understand at a high level that it means a set of different machines that work together to achieve a single goal.
But this definition seems too broad and loose. I would like to give some points to explain the reasons for my confusion:
I see a lot of people referring to microservices as a distributed system, where functionalities like Order, Payment, etc. are distributed into different services, whereas others refer to multiple instances of the Order service trying to serve customers, possibly using some consensus algorithm to agree on shared state (e.g. the current inventory level).
When talking about distributed databases, I see a lot of people talk about different nodes that each store/serve part of the data, e.g. records with primary keys from 'A-C' on the first node, 'D-F' on the second, and so on. At a high level this looks like sharding.
When talking about distributed rate limiting, some refer to multiple application nodes (so-called distributed application nodes) using a single rate limiter, while others mean that the rate limiter itself has multiple nodes with a shared cache (like Redis).
It feels like people use "distributed system" to refer to microservices architecture, horizontal scaling, partitioning (sharding), and anything in between.
I am reading about distributed systems and getting confused about what it really means.
As commented by @ReinhardMänner, a good general definition of a distributed system (DS) is at https://en.wikipedia.org/wiki/Distributed_computing
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal.
Anything that fits the above definition can be referred to as a DS. All the examples mentioned, such as microservices, distributed databases, etc., are specific applications of the concept or implementation details.
The statement "X is a distributed system" does not inherently imply any of those details; they must be specified explicitly for each DS. For example, a distributed database does not necessarily use sharding.
I'll also draw from Wikipedia, but I think that the second part of the quote is more important:
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal. Three significant challenges of distributed systems are: maintaining concurrency of components, overcoming the lack of a global clock, and managing the independent failure of components. When a component of one system fails, the entire system does not fail.
A system that constantly has to overcome these problems, even if all services are on the same node, or if they communicate via pipes/streams/files, is effectively a distributed system.
Now, trying to clear up your confusion:
Horizontal scaling existed for monoliths before microservices. Horizontal scaling is basically achieved by dividing compute resources.
Dividing compute requires dealing with synchronization, node failures, and multiple clocks, but it is still cheaper than scaling vertically. That's where you might turn to consensus: implement it in the application, use a dedicated service such as ZooKeeper, or abuse a DB table for the purpose.
Monoliths present 2 problems that microservices solve: address-space dependency (i.e. someone's component may crash the whole process and thus your component) and long startup times.
While microservices solve these problems, these problems aren't what makes them into a "distributed system". It doesn't matter if the different processes/nodes run the same software (monolith) or not (microservices), it matters that they are different processes that can't easily communicate directly (e.g. via function calls that promise not to fail).
In databases, scaling horizontally is also cheaper than scaling vertically. The two components of horizontal DB scaling are division of compute - effectively, a distributed system - and division of storage - sharding - as you mentioned, e.g. A-C, D-F, etc.
Sharding of storage does not define distributed systems - a single compute node can handle multiple storage nodes. It's just that it's much more useful for a database that divides compute to also shard its storage, so you often see them together.
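A toy illustration of that distinction (shard boundaries and node names are made up for the example): routing by key range is pure storage division, and says nothing about how, or whether, the compute nodes coordinate.

```python
# Tiny range-sharding sketch: routing by primary key is storage division only;
# it does not by itself make the compute nodes a coordinated distributed system.
SHARDS = [("A", "C", "node-1"), ("D", "F", "node-2"), ("G", "Z", "node-3")]

def node_for(key):
    first = key[0].upper()
    for lo, hi, node in SHARDS:
        if lo <= first <= hi:
            return node
    raise KeyError(key)

print(node_for("Carol"))   # node-1
print(node_for("Dave"))    # node-2
```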
Distributed rate limiting falls under "maintaining concurrency of components". If every node does its own rate limiting, and they don't communicate, then the system-wide rate cannot be enforced. If they wait for each other to coordinate enforcement, they aren't concurrent.
Usually the solution is "approximate" rate limiting where components synchronize "occasionally".
If your components can't easily (i.e. without latency) agree on a global rate limit, that's usually because they can't easily agree on a global anything. In that case, you're effectively dealing with a distributed system, even if all the components are just threads in the same process.
(that could happen e.g. if you plan to scale out but haven't done so yet, so you don't allow your threads to communicate directly.)
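As a rough sketch of what "approximate rate limiting with occasional synchronization" can look like (the limit, the sync interval, and the in-memory SharedStore below are all stand-ins invented for the example; a real deployment might use a Redis counter):

```python
import time

# Sketch of "approximate" distributed rate limiting: each node admits requests
# against a slightly stale view of the cluster-wide count and only reconciles
# with a shared store every SYNC_INTERVAL seconds, so the global limit is
# enforced approximately rather than exactly.
GLOBAL_LIMIT = 100      # requests per window (assumed value for the sketch)
SYNC_INTERVAL = 0.5     # seconds between synchronizations (assumed)

class SharedStore:
    """Stand-in for a shared counter, e.g. a key in Redis."""
    def __init__(self):
        self.total = 0
    def add(self, n):
        self.total += n
        return self.total

class Node:
    def __init__(self, store):
        self.store = store
        self.local = 0               # requests admitted since the last sync
        self.known_total = 0         # cluster-wide count as of the last sync
        self.last_sync = time.monotonic()

    def allow(self):
        now = time.monotonic()
        if now - self.last_sync >= SYNC_INTERVAL:
            # occasional synchronization: fold the local count into the store
            self.known_total = self.store.add(self.local)
            self.local = 0
            self.last_sync = now
        if self.known_total + self.local >= GLOBAL_LIMIT:
            return False
        self.local += 1
        return True
```

Between syncs each node can over-admit against its stale view of the total, which is exactly the trade-off: you give up precise global agreement to avoid coordinating on every single request.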

Does paxos provide true linearizable consistency or not?

I think I might be confusing concepts here, but it seems to me like paxos would provide linearizable consistency for systems that implement it.
I know Cassandra uses it. I'm not 100% clear on how, but assuming a leader is elected and that single leader does all the writes, then communication is synchronous and real-time, and linearizability is achieved, right?
But consensus algorithms like Paxos are generally considered partially synchronous because there is a quorum (not 100% of nodes communicating) - does this also mean it's not truly linearizable?
Maybe because there is only a quorum, a node could fall out of sync, and that would break linearizability?
A linearizable system does not need to be synchronous. Linearizability is a safety property: it says "nothing bad happens" but it doesn't affect linearizability if nothing good happens either. Any reads or writes that do not return (or that return an error) can be ignored when checking for linearizability. This means it's perfectly possible for a system to be linearizable even if one or more of the nodes are faulty or partitioned or running slowly.
Paxos is commonly used to implement a replicated state machine: a system that executes a sequence of operations on multiple nodes at once. Since the operations are deterministic and the nodes all agree on the operations to run and the sequence in which to run them, the nodes all converge to the same state (eventually).
You can implement a linearizable system using Paxos by having the operations in the sequence be both writes and reads, using the fact that the operations are placed in a totally-ordered sequence (i.e. linearized) by the Paxos protocol.
It's important to put the reads in the sequence as well as the writes. Imagine instead you only used Paxos to agree on the writes, and served reads directly from a node's local state. If the node serving the reads is partitioned from the other nodes then it would serve stale reads, violating linearizability. Each read must involve a quorum of nodes to ensure that the returned value is fresh, which means (effectively) putting the read into the sequence alongside the writes.
(There are some tricks you can play to make reads a bit more efficient than writes, given that reads commute with each other and don't need to be persisted to disk, but you can't escape the need to contact a quorum of nodes for both read and write operations.)
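To make the "put the reads in the sequence too" point concrete, here is a toy replicated state machine that applies an already-agreed sequence of operations; the consensus layer that produces the ordered log is assumed and omitted:

```python
# Toy replicated state machine: apply a totally ordered sequence of operations
# (as decided by Paxos/Raft -- the consensus layer itself is omitted here).
# Because reads are placed in the same sequence as writes, a read can never
# return a value older than a write that was ordered before it.
def apply_log(ordered_log):
    state = {}
    results = []
    for op in ordered_log:
        if op["type"] == "write":
            state[op["key"]] = op["value"]
            results.append(None)
        elif op["type"] == "read":
            results.append(state.get(op["key"]))
    return results

log = [
    {"type": "write", "key": "x", "value": 1},
    {"type": "read",  "key": "x"},              # sees 1 on every replica
    {"type": "write", "key": "x", "value": 2},
    {"type": "read",  "key": "x"},              # sees 2 on every replica
]
print(apply_log(log))   # [None, 1, None, 2] on every node that applies the log
```

Every replica that applies this log returns the same answers for the reads, which is what serving reads from a possibly partitioned node's local state cannot guarantee.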

Is a replication log necessary to achieve linearizability in distributed store

The Raft algorithm used by etcd and the ZAB algorithm used by ZooKeeper both use a replication log to update a state machine.
I was wondering if it's possible to design a similar system by simply using leader election and versioned values, and why those systems decided to use a replication log.
In my example, we have the following setup:
Machine A (Leader), contains version 1
Machine B (Follower), contains version 1
Machine C (Follower), contains version 1
And the write would go like this:
1. Machine A receives a write request and stores the pending write V2.
2. Machine A sends a prepare request to Machine B and Machine C.
3. The followers (Machine B and Machine C) send an acknowledgement to the leader (Machine A).
4. After the leader (Machine A) receives acknowledgements from a quorum of machines, it knows V2 is now committed and sends a success response to the client.
5. The leader (Machine A) sends a finalize request to the followers (Machine B and Machine C) to inform them that V2 is committed and V1 can be discarded.
For this system to work, on a leader change, after acquiring the leader lease, the new leader has to read the latest data version from a quorum of nodes before accepting requests.
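Here is a minimal, happy-path sketch of the scheme described above (the class and method names are invented for illustration, everything is in-memory, and failure handling, leases, and leader changes are deliberately left out):

```python
# Sketch of the proposed versioned-value replication, without a log:
# the leader stages a pending version, waits for acks from a quorum,
# then commits and tells the followers to finalize.
class Replica:
    def __init__(self, name):
        self.name = name
        self.committed = (1, None)   # (version, value)
        self.pending = None

    def prepare(self, version, value):
        self.pending = (version, value)
        return True                   # ack

    def finalize(self, version):
        if self.pending and self.pending[0] == version:
            self.committed = self.pending
            self.pending = None

def leader_write(leader, followers, value):
    version = leader.committed[0] + 1
    leader.prepare(version, value)
    acks = 1 + sum(1 for f in followers if f.prepare(version, value))
    if acks * 2 > 1 + len(followers):          # quorum of the full cluster
        for r in [leader, *followers]:
            r.finalize(version)
        return True                             # respond success to the client
    return False

a, b, c = Replica("A"), Replica("B"), Replica("C")
assert leader_write(a, [b, c], "V2")
assert a.committed == b.committed == c.committed == (2, "V2")
```

The answers below focus on exactly the parts this sketch skips: lost or delayed acks, leaders failing mid-round, and cluster membership changes.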
The Raft algorithm in etcd and the ZAB algorithm in ZooKeeper both use a replication log to update a state machine.
I was wondering if it's possible to design a similar system by simply using leader election and versioned values.
Yes, it's possible to achieve consensus/linearizability without log replication. The consensus problem was originally solved in Leslie Lamport's Paxos papers (The Part-Time Parliament, 1998, restated in Paxos Made Simple, 2001). He described two algorithms: Single Decree Paxos, to build a distributed linearizable write-once register, and Multi-Paxos, to make a distributed state machine on top of an append-only log (an ordered array of write-once registers).
An append-only log is a much more powerful abstraction than a write-once register, so it isn't surprising that people chose logs over registers. Besides, until Vertical Paxos (2009) was published, log replication was the only consensus protocol capable of cluster membership change, which is vital for many tasks: if you can't replace failed nodes, then eventually your cluster becomes unavailable.
Even though Vertical Paxos is a good paper, I found Raft's idea of cluster membership change via joint consensus much easier to understand, so I wrote a post on how to adapt Raft's approach to Single Decree Paxos.
Over time, the "write-once" nature of Single Decree Paxos was also resolved, turning write-once registers into distributed linearizable variables, a quite powerful abstraction suitable for many use cases. In the wild I've seen that approach in the Treode database. If you're interested, I blogged about this improved SDP in the How Paxos Works post.
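For a feel of how small the write-once-register building block is, here is a sketch of Single Decree Paxos with in-memory acceptors and a single proposer round (my own field names; no failures or retries are simulated):

```python
# Sketch of Single Decree Paxos: acceptors together implement a write-once
# register. A proposer runs phase 1 (prepare) and phase 2 (accept) against a
# majority; once a majority has accepted some value, that value is chosen and
# can never change.
class Acceptor:
    def __init__(self):
        self.promised = -1            # highest ballot promised
        self.accepted = (-1, None)    # (ballot, value) last accepted

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, None

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(acceptors, ballot, value):
    # Phase 1: get promises from a majority and adopt any already-accepted value.
    promises = [a.prepare(ballot) for a in acceptors]
    granted = [acc for ok, acc in promises if ok]
    if len(granted) * 2 <= len(acceptors):
        return None                   # no majority; retry with a higher ballot
    highest = max(granted, key=lambda acc: acc[0])
    if highest[1] is not None:
        value = highest[1]            # the register is effectively write-once
    # Phase 2: ask the acceptors to accept the (possibly adopted) value.
    acks = sum(1 for a in acceptors if a.accept(ballot, value))
    return value if acks * 2 > len(acceptors) else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, ballot=1, value="v1"))   # "v1" is chosen
print(propose(acceptors, ballot=2, value="v2"))   # still "v1": write-once
```

Once a majority has accepted a value, any later proposer's phase 1 will discover it and re-propose it, which is why the register behaves as write-once.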
So now that we have an alternative to logs, it makes sense to consider it, because log-based replication is complex and has intrinsic limitations:
with logs you need to care about log compaction and garbage collection
the size of the log is limited by the storage of a single node
protocols for splitting a log and migration to a new cluster are not well-known
And why those systems decided to use a replication log.
The log-based approach is older than the alternative, so it has had more time to gain popularity.
About your example
It's hard to evaluate, because you didn't describe how leader election happens, how conflicts between leaders are resolved, what the strategy for handling failures is, or how cluster membership changes.
I believe if you describe them carefully you'll get a variant of Paxos.
Your example makes sense. However, have you considered every possible failure scenario? In step 2, Machine B could receive the message minutes before or after Machine C (or vice versa) due to network partitions or faulty routers. In step 3, the acknowledgements could be lost, delayed, or re-transmitted numerous times. The leader could also fail and come back up once, twice, or potentially several times, all within the same consensus round. And in step 5, the messages could be lost or duplicated, or Machines A and C could receive the notification while B misses it...
Conceptual simplicity, also known as "reducing the potential points of failure", is key to distributed systems. Anything can happen, and will happen in realistic environments. Primitives, such as replicated logs based on consensus protocols proven to be correct in any environment, are a solid foundation upon which to build higher levels of abstraction. It's certainly true that better performance or latency or your "metric of interest" can be achieved by a custom-built algorithm but ensuring correctness for such an algorithm is a major time investment.
Replicated logs are simple, easily understood, predictable, and fall neatly into the domain of established consensus protocols (paxos, paxos-variants, & raft). That's why they're popular. It's not because they're the best for any particular application, rather they're understood and reliable.
For related references, you may be interested in Understanding Paxos and Consensus in the Cloud: Paxos Systems Demystified

Paxos and Discovery

Suppose I throw some machines into an elastic cluster and want to run some consensus algorithm among them (say, Paxos). Suppose they know the initial size of the network, say, 8 machines.
So, they'll run a consensus algorithm, and the quorum is 5.
Now, consider these cases:
I see that CPU is too low, and I reduce the cluster size in half, to 4 machines.
There is a partition split, and each split gets 4 machines.
If I use the current cluster size to compute quorums, I'm subject to partition splits, since for the underlying cluster, situations (1) and (2) look exactly the same. However, if I use a fixed number, I'm not able to scale the cluster down (and I'm subject to inconsistencies due to partitions if I scale it up).
I have a third alternative: informing all the machines of the cluster size when scaling. But a partition could happen right before a scale-up, for instance, with that partition not receiving the new size and still having a quorum for consensus using the old size.
Is Paxos (and any other safe consensus algorithms) unusable in an elastic environment?
Quorum-based consensus protocols fundamentally require quorums in order to operate. Both Multi-Paxos and Raft can be used in environments with dynamically changing cluster and quorum sizes, but it must be done in a controlled manner that always maintains a consistent quorum. If, for example, you were currently using a cluster size of 8 and wanted to reduce that cluster to a size of 4, you could do so. However, that decision to reduce the cluster size to 4 must be a consensual decision agreed upon by the original 8.
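In other words, the new membership has to be decided under the old configuration's quorum before anyone starts counting votes against the new one. A purely schematic sketch, where the commit callback is assumed to run one instance of the existing consensus protocol over a given member set:

```python
# Schematic only: shrinking the cluster from 8 to 4 by committing the new
# member set through the OLD configuration's quorum, so the change itself is
# a consensual decision of the original 8. The `commit` callback is assumed
# to run one consensus instance over `members`, returning True only if
# `needed` of them accept the proposed value.
OLD_MEMBERS = ["n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8"]
NEW_MEMBERS = ["n1", "n2", "n3", "n4"]

def quorum(members):
    return len(members) // 2 + 1

def reconfigure(commit, old_members, new_members):
    agreed = commit({"type": "config", "members": new_members},
                    members=old_members,
                    needed=quorum(old_members))
    if not agreed:
        raise RuntimeError("the old quorum did not agree to the new membership")
    # Only after this point may later consensus instances use the new,
    # smaller quorum; changing it unilaterally risks split brain.
    return new_members
```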
Your question is a little unclear but it sounded like you were asking if you could safely reduce your cluster size to 4 as a recovery mechanism in the event that some kind of network partition renders your original cluster of 8 inoperable. The answer to that is effectively no since the decision to do so could not be consensual and attempting to go behind the back of the consensus algorithm is virtually guaranteed to result in inconsistencies. How would the new set of 4 be defined? How would you guarantee that all peers reached the same conclusion? How do you ensure they all make the same decision at the same time?
You could, of course, make all of these decisions manually and force the system to recover by shutting the consensus service down on each system and reconfiguring their quorum definition by hand. Assuming you don't screw up (which is an overwhelmingly big assumption for any real-world deployment) this would be safe. A better approach though would be to design the system such that one or two network partitions either won't halt the system (lots of sites) or use an eventual consistency model that gracefully handles the occasional network partitions. There's no magic bullet for getting around CAP restrictions.
Paxos and friends can scale in an elastic way (somewhat). Instead of changing the quorum size, though, just add learners. Learners are nodes that don't participate in consensus, but get all the decisions. Just like acceptors, learners accept reads and forward writes to the leader.
There are two styles of learner: the first listens to all events from the acceptors; in the second, the leader forwards all committed transitions to the learners.
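A rough sketch of the second style (class and method names are invented for illustration; the consensus round at the leader is assumed rather than shown):

```python
# Sketch of a learner in the "leader forwards committed transitions" style:
# the learner holds no vote in consensus, it just applies the decided
# transitions in order, serves reads locally, and forwards writes to the leader.
class Learner:
    def __init__(self, leader):
        self.leader = leader
        self.state = {}

    def on_commit(self, key, value):
        # called by the leader for every committed transition, in order
        self.state[key] = value

    def read(self, key):
        return self.state.get(key)          # possibly slightly stale

    def write(self, key, value):
        self.leader.submit(key, value)      # consensus happens at the leader

class Leader:
    def __init__(self):
        self.learners = []
        self.state = {}

    def submit(self, key, value):
        # in a real system this would first go through the acceptors
        self.state[key] = value
        for l in self.learners:
            l.on_commit(key, value)

leader = Leader()
learner = Learner(leader)
leader.learners.append(learner)
learner.write("x", 1)
print(learner.read("x"))   # 1, once the commit has been forwarded
```

Adding or removing learners doesn't change the quorum, so read capacity can scale elastically without touching the consensus configuration.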

avoiding overuse of consensus protocols in a distributed system

I'm new to distributed systems, and I'm reading about "simple Paxos". It creates a lot of chatter and I'm thinking about performance implications.
Let's say you're building a globally-distributed database, with several small-ish clusters located in different locations. It seems important to minimize the amount of cross-site communication.
What are the decisions you definitely need to use consensus for? The only one I thought of for sure was deciding whether to add or remove a node (or set of nodes?) from the network. It seems like this is necessary for vector clocks to work. Another I was less sure about was deciding on an ordering for writes to the same location, but should this be done by a leader which is elected via Paxos?
It would be nice to avoid having all nodes in the system making decisions together. Could a few nodes at each local cluster participate in cross-cluster decisions, and all local nodes communicate using a local Paxos to determine local answers to cross-site questions? The latency would be the same assuming the network is not saturated, but the cross-site network traffic would be much lighter.
Let's say you can split your database's tables along rows, and assign each subset of rows to a subset of nodes. Is it normal to elect a set of nodes to contain each subset of the data using Paxos across all machines in the system, and then only run Paxos between those nodes for all operations dealing with that subset of data?
And a catch-all: are there any other design-related or algorithmic optimizations people are doing to address this?
Good questions, and good insights!
It creates a lot of chatter and I'm thinking about performance implications.
Let's say you're building a globally-distributed database, with several small-ish clusters located in different locations. It seems important to minimize the amount of cross-site communication.
What are the decisions you definitely need to use consensus for? The only one I thought of for sure was deciding whether to add or remove a node (or set of nodes?) from the network. It seems like this is necessary for vector clocks to work. Another I was less sure about was deciding on an ordering for writes to the same location, but should this be done by a leader which is elected via Paxos?
Yes, performance is a problem that my team has seen in practice as well. We maintain a consistent database & distributed lock manager, and originally used Paxos for all writes, some reads, and cluster membership updates.
Here are some of the optimizations we did:
As much as possible, nodes sent the transitions to a Distinguished Proposer/Learner (elected via Paxos), which
decided on write ordering, and
batched transitions while waiting for the response from the prior instance (a sketch of this batching follows below). (But batching too much also caused problems.)
We had considered using multi-paxos but we ended up doing something cooler (see below).
With these optimizations, we were still hurting for performance, so we split our server into three layers. The bottom layer is Paxos; it does what you suggest, viz. merely decides the node membership of the middle layer. The middle layer is a custom in-house high-speed chain consensus protocol, which does consensus & ordering for the DB. (BTW, chain consensus can be viewed as Vertical Paxos.) The top layer now just maintains the database/locks & client connections. This design has led to several orders of magnitude of latency and throughput improvement.
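To illustrate the batching optimization mentioned in the list above (a schematic only; the real proposer is asynchronous, fault-tolerant, and elected via Paxos, none of which is shown):

```python
# Schematic of the batching optimization: while one consensus instance is in
# flight, the distinguished proposer queues incoming transitions and proposes
# them all as a single batch when the previous instance completes.
class BatchingProposer:
    def __init__(self, propose_batch, max_batch=64):
        self.propose_batch = propose_batch   # assumed: runs one consensus round
        self.max_batch = max_batch           # cap, since huge batches hurt too
        self.queue = []
        self.in_flight = False

    def submit(self, transition):
        self.queue.append(transition)
        self._maybe_propose()

    def _maybe_propose(self):
        if self.in_flight or not self.queue:
            return
        batch, self.queue = self.queue[:self.max_batch], self.queue[self.max_batch:]
        self.in_flight = True
        self.propose_batch(batch, on_done=self._on_done)

    def _on_done(self):
        # the previous instance finished; propose whatever accumulated meanwhile
        self.in_flight = False
        self._maybe_propose()
```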
It would be nice to avoid having all nodes in the system making decisions together. Could a few nodes at each local cluster participate in cross-cluster decisions, and all local nodes communicate using a local Paxos to determine local answers to cross-site questions? The latency would be the same assuming the network is not saturated, but the cross-site network traffic would be much lighter.
Let's say you can split your database's tables along rows, and assign each subset of rows to a subset of nodes. Is it normal to elect a set of nodes to contain each subset of the data using Paxos across all machines in the system, and then only run Paxos between those nodes for all operations dealing with that subset of data?
These two together remind me of the Google Spanner paper. If you skip over the parts about time, it's essentially doing 2PC globally and Paxos on the shards. (IIRC.)