What is CRDT in Distributed Systems? - distributed-computing

I am a newbie in Distributed systems and I am trying to get an insight on the concept of CRDT.
I realize that the acronym has three expansions:
Conflict-free Replicated Data Type
Convergent Replicated Data Type
Commutative Replicated Data Type
Can anyone give an example where we use CRDT in distributed systems?
Thanks a lot in advance.

CRDTs are inspired by the work of Marc Shapiro. In distributed computing, a conflict-free replicated data type (abbreviated CRDT) is a type of specially-designed data structure used to achieve strong eventual consistency (SEC) and monotonicity (absence of rollbacks). There are two alternative routes to ensuring SEC: operation-based CRDTs and state-based CRDTs.
CRDTs on different replicas can diverge from one another, but in the end they can always be safely merged, yielding an eventually consistent value. In other words, CRDTs have a merge method that is idempotent, commutative and associative.
The two alternatives are equivalent, as one can emulate the other, but operation-based CRDTs require additional guarantees from the communication middleware. CRDTs are used to replicate data across multiple computers in a network, executing updates without the need for remote synchronization. This would lead to merge conflicts in systems using conventional eventual consistency technology, but CRDTs are designed such that conflicts are mathematically impossible. Under the constraints of the CAP theorem they provide the strongest consistency guarantees for available/partition-tolerant (AP) settings.
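To make those merge properties concrete, here is a minimal sketch (my own illustration, not code from any of the systems mentioned below) of a state-based grow-only counter (G-Counter), whose merge is an element-wise max and is therefore idempotent, commutative and associative:

```python
# Minimal state-based G-Counter sketch (illustrative, not production code).
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}          # replica_id -> count observed so far

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max: idempotent, commutative and associative,
        # so replicas can merge in any order, any number of times.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

# Two replicas diverge, then merge to the same value.
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```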
Some examples where they are used:
Riak is the most popular open-source implementation of CRDTs and is used by Bet365 and League of Legends. Below are some useful links about these deployments.
1- Bet365 (Uses Erlang and Riak)
http://www.erlang-factory.com/static/upload/media/1434558446558020erlanguserconference2015bet365michaelowen.pdf
2- League of Legends uses the Riak CRDT implementation for its in-game chat system (which handles 7.5 million concurrent users and 11,000 messages per second)
3- Roshi implemented by SoundCloud that supports a LWW time-stamped Set:
-Blog post: https://developers.soundcloud.com/blog/roshi-a-crdt-system-for-timestamped-events

CRDTs use math to enforce consistency across a distributed cluster, without having to worry about consensus and the associated latency/unavailability.
The set of values that a CRDT can take at any time forms a semi-lattice (specifically a join semi-lattice), which is a POSET (partially ordered set) with a least-upper-bound (LUB) function.
In simple terms, a POSET is a collection of items in which not all are comparable. E.g. in an array of pairs {(2,4), (4,5), (2,1), (6,3)}, (2,4) < (4,5), but (2,4) can't be compared with (6,3) (since one element is larger and the other smaller). Now, a semi-lattice is a POSET in which, given any two elements, even if you can't compare them, you can find an element greater than both (the LUB).
Another condition is that updates to this data type need to be increasing: CRDTs have monotonically increasing state, and clients never observe a state rollback.
This excellent article uses the array I used above as an example. For a CRDT maintaining those values, if two replicas are trying to achieve consensus between (4,5) and (6,3), they can pick the LUB = (6,5) as the consensus value and set both replicas to it. Since the values are increasing, this is a good value to settle on.
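As a quick illustration (my own sketch, using the pairs from the example above), the LUB under element-wise ordering is just the element-wise maximum:

```python
def lub(p, q):
    """Least upper bound of two pairs under element-wise ordering."""
    return (max(p[0], q[0]), max(p[1], q[1]))

print(lub((4, 5), (6, 3)))   # (6, 5): greater than or equal to both inputs
```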
There are two ways for CRDT replicas to keep in sync with each other: they can periodically transfer their full state (convergent replicated data type), or they can transfer updates (deltas) as they happen (commutative replicated data type). The former takes more bandwidth.
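A rough sketch of the difference (illustrative only, not from any particular library): state-based sync ships the whole state and merges it, while operation-based sync ships just the delta and relies on the messaging layer to deliver it to every replica:

```python
# Illustrative sketch of the two sync styles for a simple per-replica counter,
# where each counter is a dict mapping replica_id -> count.

# State-based (convergent): ship the full dict; the receiver merges with
# an element-wise max, which is idempotent, commutative and associative.
def merge_state(local, remote):
    for rid, n in remote.items():
        local[rid] = max(local.get(rid, 0), n)

# Operation-based (commutative): ship only the delta; the receiver applies it.
# Increments commute, so out-of-order delivery is fine, but each operation
# must be delivered to every replica exactly once (or be made idempotent).
def apply_increment(local, rid, delta):
    local[rid] = local.get(rid, 0) + delta
```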
SoundCloud's Roshi is a good example (though it seems to be no longer in development). It stores data associated with a timestamp, where the timestamp is obviously incrementing. Any update coming in with a timestamp less than or equal to the one stored is discarded, which ensures idempotency (repeated writes are OK) and commutativity (out-of-order writes are OK: commutativity means a ∘ b = b ∘ a, which in this case means applying update1 followed by update2 gives the same result as update2 followed by update1).
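A minimal sketch of that last-write-wins rule (my own illustration, not Roshi's actual code):

```python
class LWWRegister:
    """Last-write-wins register: keeps the value with the highest timestamp."""
    def __init__(self):
        self.timestamp = 0
        self.value = None

    def update(self, timestamp, value):
        # Discard anything not strictly newer: repeated writes (idempotency)
        # and out-of-order writes (commutativity) leave the state unchanged.
        if timestamp > self.timestamp:
            self.timestamp, self.value = timestamp, value

r = LWWRegister()
r.update(2, "b")
r.update(1, "a")   # older write arrives late and is ignored
r.update(2, "b")   # duplicate write is ignored
assert (r.timestamp, r.value) == (2, "b")
```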
Writes are sent to all clusters, and if certain nodes fail to respond due to an issue like slowness or a partition, they're expected to catch up later via read-repair, which ensures that the values converge. The convergence can be achieved via the two protocols I mentioned above: propagating state or propagating updates to other replicas. I believe Roshi does the former. As part of the read-repair, replicas exchange state, and because the data adheres to the semi-lattice property, they converge.
PS: Systems using CRDTs are eventually consistent, i.e. they choose AP (high availability and partition tolerance) under the CAP theorem.
Another excellent read on the subject.

Those three expansions of the acronym all mean basically the same thing.
A CRDT is convergent if the same operations applied in a different sequence produce (converge to) the same result. That is, the operations can be commuted - it's a commutative RDT. The reason the operations can be applied in a different sequence and still produce the same result is that the operations are conflict-free.
So CRDT means the same thing, whichever of the three expansions you use - though personally I prefer "Convergent".

Related

What is meant by Distributed System?

I am reading about distributed systems and getting confused about what it really means.
I understand at a high level that it means a set of different machines that work together to achieve a single goal.
But this definition seems too broad and loose. I would like to give some points to explain the reasons for my confusion:
I see a lot of people referring to microservices as a distributed system, where functionalities like Order, Payment, etc. are distributed across different services, whereas others refer to multiple instances of the Order service trying to serve customers and possibly using some consensus algorithm to agree on shared state (e.g. the current inventory level).
When talking about distributed databases, I see a lot of people talk about different nodes, each used to store/serve a part of the data, e.g. records with primary keys 'A-C' on the first node, 'D-F' on the second node, etc. At a high level this looks like sharding.
When talking about distributed rate limiting, some refer to multiple application nodes (so-called distributed application nodes) using a single rate limiter, while others mention that the rate limiter itself has multiple nodes with a shared cache (like Redis).
It feels like people use "distributed system" to refer to microservices architecture, horizontal scaling, partitioning (sharding) and anything in between.
I am reading about distributed systems and getting confused about what it really means.
As commented by @ReinhardMänner, a good general definition of a distributed system (DS) is at https://en.wikipedia.org/wiki/Distributed_computing
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal.
Anything that fits the above definition can be referred to as a DS. All the mentioned examples, such as microservices, distributed databases, etc., are specific applications of the concept or implementation details.
The statement "X is a distributed system" does not inherently imply any of those details; they must be explicitly specified for each DS, e.g. a distributed database does not necessarily mean the use of sharding.
I'll also draw from Wikipedia, but I think that the second part of the quote is more important:
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal. Three significant challenges of distributed systems are: maintaining concurrency of components, overcoming the lack of a global clock, and managing the independent failure of components. When a component of one system fails, the entire system does not fail.
A system that constantly has to overcome these problems, even if all services are on the same node, or if they communicate via pipes/streams/files, is effectively a distributed system.
Now, trying to clear up your confusion:
Horizontal scaling was there with monoliths before microservices. Horizontal scaling is basically achieved by division of compute resources.
Division of compute requires dealing with synchronization, node failures, and multiple clocks, but that is still cheaper than scaling vertically. That's where you might turn to consensus: implementing it in the application, using a dedicated service such as ZooKeeper, or abusing a DB table for that purpose.
Monoliths present 2 problems that microservices solve: address-space dependency (i.e. someone's component may crash the whole process and thus your component) and long startup times.
While microservices solve these problems, these problems aren't what makes them into a "distributed system". It doesn't matter if the different processes/nodes run the same software (monolith) or not (microservices), it matters that they are different processes that can't easily communicate directly (e.g. via function calls that promise not to fail).
In databases, scaling horizontally is also cheaper than scaling vertically. The two components of horizontal DB scaling are division of compute - effectively, a distributed system - and division of storage - sharding - as you mentioned, e.g. A-C, D-F, etc.
Sharding of storage does not define distributed systems - a single compute node can handle multiple storage nodes. It's just that it's much more useful for a database that divides compute to also shard its storage, so you often see them together.
Distributed rate limiting falls under "maintaining concurrency of components". If every node does its own rate limiting, and they don't communicate, then the system-wide rate cannot be enforced. If they wait for each other to coordinate enforcement, they aren't concurrent.
Usually the solution is "approximate" rate limiting, where components synchronize only "occasionally" (sketched below).
If your components can't easily (= with no latency) agree on a global rate limit, that's usually because they can't easily agree on a global anything. In that case, you're effectively dealing with a distributed system, even if all components are just threads in the same process.
(that could happen e.g. if you plan to scale out but haven't done so yet, so you don't allow your threads to communicate directly.)
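A hedged sketch of what that "approximate" rate limiting with occasional synchronization might look like (the class and parameter names here are made up for illustration): each node enforces its share of the global limit locally and only periodically reconciles with shared state:

```python
import time

class ApproximateRateLimiter:
    """Each node enforces global_limit / node_count locally and only
    occasionally consults shared state (e.g. Redis) to correct drift."""
    def __init__(self, global_limit_per_sec, node_count, sync_interval=1.0):
        self.local_limit = global_limit_per_sec / node_count
        self.sync_interval = sync_interval
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self):
        now = time.monotonic()
        if now - self.window_start >= self.sync_interval:
            # In a real system, this is where the node would exchange its
            # count with the shared store and adjust self.local_limit.
            self.window_start, self.count = now, 0
        if self.count < self.local_limit:
            self.count += 1
            return True
        return False
```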

CAP Theorem - What are the reasons for partitioning in first place?

There are plenty of good stackoverflow Q&A on the CAP theorem as CP vs AP etc.
In a nutshell the theorem states:
"In the presence of a partition you must sacrifice availability or consistency"
Lets imagine we speak about storage, databases in particular.
What are the technical reasons to partition in the first place?
(I'll try to take some guesses below):
OS can handle only so many ports/system handles.
Single "N Petabyte" Hard discs do not exist, you need more, until you run out of SATA/PCI ports.
Bringing the data closer to the consumer.
Single Database size is limited to size X.
Please note that there is a difference in meaning between "partitioning" as per CAP and "partitioning" as per physical database design.
"partitioning" as per CAP refers to what happens when a node in a distributed system becomes unavailable/unreachable, thus refers to phenomena that happen "at run time".
"partitioning" as per physical database design refers to the design decision to distribute the physical records representing the rows of one single table across various distinct physical stores, but 'distinct' might even then still mean only 'various distinct segments of one single store'. Anyhow, it thus refers to things that happen at design time.
In particular, it means that if you do "partitioning" as per physical database design, this does not necessarily lead to the existence of a "distributed system" in the sense of CAP. In particular, when "partitioning" as per physical database design, you do not necessarily create a system with various distinct runtime components operating "independently" : if you partition a table, you'll still typically have only one single DBMS that you are communicating with, thus only one sole runtime component.
Also in particular, if you "partition" as per physical database design, it is wrong to conclude that because of CAP theorem, consistency must necessarily have been sacrificed.

High availability options for Drools Fusion?

I have been digging and it seems:
1) There is no native/built-in failover solution for Drools Fusion 6
2) There is support for persistent sessions, but it appears they are limited to save-all/retrieve-all, i.e. no ability to efficiently add and remove single events the way Hibernate would add/remove a single record from a DB. This would be expensive for a large, long-running data set (STREAM mode)
3) Persistent sessions are a partial solution, and I am unclear how we would even operate a cold/warm/hot standby
On the other hand, Storm and Trident handle all aspects of failover but have limited support for CEP. I am debating using a custom solution with Storm and Storm tick tuples, but hate to reinvent the wheel.
I think in Storm Trident the state has to be relatively simple so it can fit into a key-value(s) pair, and the value cannot be too large - such as a count, a sum, or some simple aggregation per key. Most people seem to use some time-based key and total things up with Trident. If there is complex state and multiple keys, Storm Trident seems to fall down and cannot guarantee full consistency across all state. Complex event processing keeps rich state such as intermediate pattern matches, derived indexes or data windows for many queries and many contexts. All of that doesn't map well to Trident. Depending on your requirements, Trident may be good enough.

avoiding overuse of consensus protocols in a distributed system

I'm new to distributed systems, and I'm reading about "simple Paxos". It creates a lot of chatter and I'm thinking about performance implications.
Let's say you're building a globally-distributed database, with several small-ish clusters located in different locations. It seems important to minimize the amount of cross-site communication.
What are the decisions you definitely need to use consensus for? The only one I thought of for sure was deciding whether to add or remove a node (or set of nodes?) from the network. It seems like this is necessary for vector clocks to work. Another I was less sure about was deciding on an ordering for writes to the same location, but should this be done by a leader which is elected via Paxos?
It would be nice to avoid having all nodes in the system making decisions together. Could a few nodes at each local cluster participate in cross-cluster decisions, and all local nodes communicate using a local Paxos to determine local answers to cross-site questions? The latency would be the same assuming the network is not saturated, but the cross-site network traffic would be much lighter.
Let's say you can split your database's tables along rows, and assign each subset of rows to a subset of nodes. Is it normal to elect a set of nodes to contain each subset of the data using Paxos across all machines in the system, and then only run Paxos between those nodes for all operations dealing with that subset of data?
And a catch-all: are there any other design-related or algorithmic optimizations people are doing to address this?
Good questions, and good insights!
It creates a lot of chatter and I'm thinking about performance implications.
Let's say you're building a globally-distributed database, with several small-ish clusters located in different locations. It seems important to minimize the amount of cross-site communication.
What are the decisions you definitely need to use consensus for? The only one I thought of for sure was deciding whether to add or remove a node (or set of nodes?) from the network. It seems like this is necessary for vector clocks to work. Another I was less sure about was deciding on an ordering for writes to the same location, but should this be done by a leader which is elected via Paxos?
Yes, performance is a problem that my team has seen in practice as well. We maintain a consistent database & distributed lock manager, and originally used Paxos for all writes, some reads and cluster membership updates.
Here are some of the optimizations we did:
As much as possible, nodes sent the transitions to a Distinguished Proposer/Learner (elected via Paxos), which
decided on write ordering, and
batched transitions while waiting for the response from the prior instance. (But batching too much also caused problems.)
We had considered using multi-paxos but we ended up doing something cooler (see below).
With these optimizations, we were still hurting for performance, so we split our server into three layers. The bottom layer is Paxos; it does what you suggest, viz. merely decides the node membership of the middle layer. The middle layer is a custom, in-house, high-speed chain consensus protocol, which does consensus & ordering for the DB. (BTW, chain consensus can be viewed as Vertical Paxos.) The top layer now just maintains the database/locks & client connections. This design has led to several orders of magnitude improvement in latency and throughput.
It would be nice to avoid having all nodes in the system making decisions together. Could a few nodes at each local cluster participate in cross-cluster decisions, and all local nodes communicate using a local Paxos to determine local answers to cross-site questions? The latency would be the same assuming the network is not saturated, but the cross-site network traffic would be much lighter.
Let's say you can split your database's tables along rows, and assign each subset of rows to a subset of nodes. Is it normal to elect a set of nodes to contain each subset of the data using Paxos across all machines in the system, and then only run Paxos between those nodes for all operations dealing with that subset of data?
These two together remind me of the Google Spanner paper. If you skip over the parts about time, it's essentially doing 2PC globally and Paxos on the shards. (IIRC.)

Consistent hashing versus distributed locks for handling race conditions

In a distributed system where workload is distributed to multiple nodes, two of the ways of dealing with race conditions, where multiple requests operate on the same data concurrently, are consistent hashing and distributed locks. Consistent hashing would ensure that all requests to operate on one set of data are sent to the same worker, and distributed locks would ensure that only one worker could operate on any set of data at a time.
My question is what are the pros and cons of either approach and which might be favorable?
Consistent hashing is much easier to implement than a distributed lock. The problem is that a particular distribution of inputs could be routed to only a subset of the nodes, resulting in some workers working harder than others. A distributed lock is harder to implement and requires several communication messages (or some shared data), but it won't result in a bias of node allocations.
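For illustration, here is a minimal consistent-hash ring sketch (my own, the names are made up and it is not taken from any specific library) showing how every request for the same key lands on the same worker, which is what removes the need for a lock:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Maps keys to nodes; all requests for a given key hit the same node."""
    def __init__(self, nodes, vnodes=100):
        # Place several virtual points per node on the ring to smooth the load.
        self.ring = sorted(
            (self._hash(f"{node}-{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first point at or after the key's hash.
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["worker-1", "worker-2", "worker-3"])
assert ring.node_for("order:42") == ring.node_for("order:42")  # stable routing
```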