avoiding overuse of consensus protocols in a distributed system - distributed-computing

I'm new to distributed systems, and I'm reading about "simple Paxos". It creates a lot of chatter and I'm thinking about performance implications.
Let's say you're building a globally-distributed database, with several small-ish clusters located in different locations. It seems important to minimize the amount of cross-site communication.
What are the decisions you definitely need to use consensus for? The only one I thought of for sure was deciding whether to add or remove a node (or set of nodes?) from the network. It seems like this is necessary for vector clocks to work. Another I was less sure about was deciding on an ordering for writes to the same location, but should this be done by a leader which is elected via Paxos?
It would be nice to avoid having all nodes in the system making decisions together. Could a few nodes at each local cluster participate in cross-cluster decisions, and all local nodes communicate using a local Paxos to determine local answers to cross-site questions? The latency would be the same assuming the network is not saturated, but the cross-site network traffic would be much lighter.
Let's say you can split your database's tables along rows, and assign each subset of rows to a subset of nodes. Is it normal to elect a set of nodes to contain each subset of the data using Paxos across all machines in the system, and then only run Paxos between those nodes for all operations dealing with that subset of data?
And a catch-all: are there any other design-related or algorithmic optimizations people are doing to address this?

Good questions, and good insights!
It creates a lot of chatter and I'm thinking about performance implications.
Let's say you're building a globally-distributed database, with several small-ish clusters located in different locations. It seems important to minimize the amount of cross-site communication.
What are the decisions you definitely need to use consensus for? The only one I thought of for sure was deciding whether to add or remove a node (or set of nodes?) from the network. It seems like this is necessary for vector clocks to work. Another I was less sure about was deciding on an ordering for writes to the same location, but should this be done by a leader which is elected via Paxos?
Yes, performance is a problem that my team had seen in practice as well. We maintain a consistent database & distributed lock manager; and orignally used Paxos for all writes, some reads and cluster membership updates.
Here are some of the optimizations we did:
As much as possible, nodes sent the transitions to a Distinguished Proposer/Learner (elected via Paxos), which
decided on write ordering, and
batched transitions while waiting for the response from the prior instance. (But batching too much also caused problems.)
We had considered using multi-paxos but we ended up doing something cooler (see below).
With these optimizations, we were still hurting for performance, so we split our server into three layers. The bottom layer is Paxos; it does what you suggest; viz. merely decides the node membership of the middle layer. The middle layer is a custom-in-house-high-speed chain consensus protocol, which does consensus & ordering for the DB. (BTW, chain-consensus can be viewed as Vertical Paxos.) The top layer now just maintains the database/locks & client connections. This design has lead to several orders of magnitude latency and throughput improvement.
It would be nice to avoid having all nodes in the system making decisions together. Could a few nodes at each local cluster participate in cross-cluster decisions, and all local nodes communicate using a local Paxos to determine local answers to cross-site questions? The latency would be the same assuming the network is not saturated, but the cross-site network traffic would be much lighter.
Let's say you can split your database's tables along rows, and assign each subset of rows to a subset of nodes. Is it normal to elect a set of nodes to contain each subset of the data using Paxos across all machines in the system, and then only run Paxos between those nodes for all operations dealing with that subset of data?
These two together remind me of the Google Spanner paper. If you skip over the parts about time, it's essentially doing 2PC globally and Paxos on the shards. (IIRC.)

Related

What is meant by Distributed System?

I am reading about distributed systems and getting confused with what is really means?
I understand on high level, it means that set of different machines that work together to achieve a single goal.
But this definition seems too broad and loose. I would like to give some points to explain the reasons for my confusion:
I see lot of people referring the micro-services as distributed system where the functionalities like Order, Payment etc are distributed in different services, where as some other refer to multiple instances of Order service which possibly trying to serve customers and possibly use some consensus algorithm to come to consensus on shared state (eg. current Inventory level).
When talking about distributed database, I see lot of people talk about different nodes which possibly use to store/serve a part of user request like records with primary key from 'A-C' in first node 'D-F' in second node etc. On high level it looks like sharding.
When talking about distributed rate limiting. Some refer to multiple application nodes (so called distributed application nodes) using a single rate limiter, some other mention that the rate limiter itself has multiple nodes with a shared cache (like redis).
It feels that people use distributed systems to mention about microservices architecture, horizontal scaling, partitioning (sharding) and anything in between.
I am reading about distributed systems and getting confused with what is really means?
As commented by #ReinhardMänner, the good general term definition of distributed system (DS) is at https://en.wikipedia.org/wiki/Distributed_computing
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal.
Anything that fits above definition can be referred as DS. All mentioned examples such as micro-services, distributed databases, etc. are specific applications of the concept or implementation details.
The statement "X being a distributed system" does not inherently imply any of such details and for each DS must be explicitly specified, eg. distributed database does not necessarily meaning usage of sharding.
I'll also draw from Wikipedia, but I think that the second part of the quote is more important:
A distributed system is a system whose components are located on
different networked computers, which communicate and coordinate their
actions by passing messages to one another from any system. The
components interact with one another in order to achieve a common
goal. Three significant challenges of distributed systems are:
maintaining concurrency of components, overcoming the lack of a global clock, and managing the independent failure of components. When
a component of one system fails, the entire system does not fail.
A system that constantly has to overcome these problems, even if all services are on the same node, or if they communicate via pipes/streams/files, is effectively a distributed system.
Now, trying to clear up your confusion:
Horizontal scaling was there with monoliths before microservices. Horizontal scaling is basically achieved by division of compute resources.
Division of compute requires dealing with synchronization, node failure, multiple clocks. But that is still cheaper than scaling vertically. That's where you might turn to consensus by implementing consensus in the application, or using a dedicated service e.g. Zookeeper, or abusing a DB table for that purpose.
Monoliths present 2 problems that microservices solve: address-space dependency (i.e. someone's component may crash the whole process and thus your component) and long startup times.
While microservices solve these problems, these problems aren't what makes them into a "distributed system". It doesn't matter if the different processes/nodes run the same software (monolith) or not (microservices), it matters that they are different processes that can't easily communicate directly (e.g. via function calls that promise not to fail).
In databases, scaling horizontally is also cheaper than scaling vertically, The two components of horizontal DB scaling are division of compute - effectively, a distributed system - and division of storage - sharding - as you mentioned, e.g. A-C, D-F etc..
Sharding of storage does not define distributed systems - a single compute node can handle multiple storage nodes. It's just that it's much more useful for a database that divides compute to also shard its storage, so you often see them together.
Distributed rate limiting falls under "maintaining concurrency of components". If every node does its own rate limiting, and they don't communicate, then the system-wide rate cannot be enforced. If they wait for each other to coordinate enforcement, they aren't concurrent.
Usually the solution is "approximate" rate limiting where components synchronize "occasionally".
If your components can't easily (= no latency) agree on a global rate limit, that's usually because they can't easily agree on a global anything. In that case, you're effectively dealing with a distributed system, even if all components just threads in the same process.
(that could happen e.g. if you plan to scale out but haven't done so yet, so you don't allow your threads to communicate directly.)

Are all distributed database designed to process data in parallel?

I am learning about the characteristics of distributed database and I came across this website that describes some of the advantages of distributed database:
https://www.atlantic.net/cloud-hosting/about-distributed-databases-and-distributed-data-systems/
According to that site, the advantages of distributed database are listed below:
Reliability – Building an infrastructure is similar to investing: diversify to reduce your chances of loss. Specifically, if a failure occurs in one area of the distribution, the entire database does not experience a setback.
Security – You can give permissions to single sections of the overall database, for better internal and external protection.
Cost-effective – Bandwidth prices go down because users are accessing remote data less frequently.
Local access – Similarly to #1 above, if there is a failure in the umbrella network, you can still get access to your portion of the database.
Growth – If you add a new location to your business, it’s simple to create an additional node within the database, making distribution highly scalable.
Speed & resource efficiency – Most requests and other interactivity with the database are performed at a local level, also decreasing remote traffic.
Responsibility & containment – Because any glitches or failures occur locally, the issue is contained and can potentially be handled by the IT staff designated to handle that piece of the company.
However, parallelism (I mean not concurrent write, but processing data in parallel in each node) is not on the list. This makes me wonder: are all distributed databases (i.e. Mongo DB, Cassandra, HBase) designed to process data in parallel? If this is false, which distributed databases support parallel processing and which ones don't?
To find out what I mean by Parallel Processing (not concurrent write), please see: https://softwareengineering.stackexchange.com/questions/190719/the-difference-between-concurrent-and-parallel-execution

Paxos and Discovery

Suppose I throw some machines in an elastic cluster and want to run some consensus algorithm in they (say, Paxos). Suppose they know the initial size of the network, say, 8 machines.
So, they'll run a consensus algorithm, and the quorum is 5.
Now, consider these cases:
I see that CPU is too low, and I reduce the cluster size in half, to 4 machines.
There is a partition split, and each split gets 4 machines.
If I take the current cluster size to get quorums, I'm subject to partition splits. Since for the underlying cluster, situations (1) and (2) look exactly the same. However, if I use a fixed number, I'm not able to scale down the cluster (and I'm subject to inconsistencies due to partition if I scale it up).
I have a third alternative, that of informing all the machines the size of the cluster when scaling, but there's a possibility of a partition happening right before a scale up, for instance, and that partition not receiving the new size and having enough quorum for a consensus using the old size.
Is Paxos (and any other safe consensus algorithms) unusable in an elastic environment?
Quorum-based consensus protocols fundamentally require quorums in order to operate. Both Multi-Paxos and Raft can be used in environments with dynamically changing cluster and quorum sizes but it must be done in a controlled manner that always maintains a consistent quorum. If, for example, you were currently using a cluster size of 8 and wanted to reduce that cluster to a size of 4. You could do so. However, that decision to reduce the cluster size to 4 must be a consensual decision that's agreed upon by the original 8.
Your question is a little unclear but it sounded like you were asking if you could safely reduce your cluster size to 4 as a recovery mechanism in the event that some kind of network partition renders your original cluster of 8 inoperable. The answer to that is effectively no since the decision to do so could not be consensual and attempting to go behind the back of the consensus algorithm is virtually guaranteed to result in inconsistencies. How would the new set of 4 be defined? How would you guarantee that all peers reached the same conclusion? How do you ensure they all make the same decision at the same time?
You could, of course, make all of these decisions manually and force the system to recover by shutting the consensus service down on each system and reconfiguring their quorum definition by hand. Assuming you don't screw up (which is an overwhelmingly big assumption for any real-world deployment) this would be safe. A better approach though would be to design the system such that one or two network partitions either won't halt the system (lots of sites) or use an eventual consistency model that gracefully handles the occasional network partitions. There's no magic bullet for getting around CAP restrictions.
Paxos and friends can scale in an elastic way (somewhat). Instead of changing the quorum size, though, just add learners. Learners are nodes that don't participate in consensus, but get all the decisions. Just like acceptors, learners accept reads and forward writes to the leader.
There are two styles of learner. The first listens to all events from acceptors; in the second the leader forwards all committed transitions to the learners

Why can't CP systems also be CAP?

My understanding of the CAP acronym is as follows:
Consistent: every read gets the most recent write
Available: every node is available
Partion Tolerant: the system can continue upholding A and C promises when the network connection between nodes goes down
Assuming my understanding is more or less on track, then something is bother me.
AFAIK, availability is achieved via any of the following techniques:
Load balancing
Replication to a disaster recovery system
So if I have a system that I already know is CP, why can't I "make it full CAP" by applying one of these techniques to make it available as well? I'm sure I'm missing something important here, just not sure what.
It's the partition tolerance, that you got wrong.
As long as there isn't any partitioning happening, systems can be consistent and available. There are CA systems which say, we don't care about partitions. You can have them running inside racks with server hardware and make partitioning extremely unlikely. The problem is, what if partitions occur?
The system can either choose to
continue providing the service, hoping the other server is down rather than providing the same service and serving different data - choosing availability (AP)
stop providing the service, because it couldn't guarantee consistency anymore, since it doesn't know if the other server is down or in fact up and running and just the communication between these two broke off - choosing consistency (CP)
The idea of the CAP theorem is that you cannot provide both Availability AND Consistency, once partitioning occurs, you can either go for availability and hope for the best, or play it safe and be unavailable, but consistent.
Here are 2 great posts, which should make it clear:
You Can’t Sacrifice Partition Tolerance shows the idea, that every truly distributed system needs to deal with partitioning now and than and hence CA systems will break instantly at the first occurrence of a partition
CAP Twelve Years Later: How the "Rules" Have Changed is slightly more up to date and shows the CAP theorem more flexible, where developers can choose how applications behave during partitioning and can sacrifice a bit of consistency to gain some availability, ...
So to finally answer your question, if you take a CP system and replicate it more often, you might either run into overhead of messages sent between the nodes of the system to keep it consistent, or - in case a substantial part of the nodes fails or network partitioning occurs without any part having a clear majority, it won't be able to continue operation as it wouldn't be able to guarantee consistency anymore. But yes, these lines are getting more blurred now and I think the references I've provided will give you a much better understanding.

StarCounter and CAP

I have been reading about a database named Starcounter. It makes a claim that it can handle loads that a "NoSql"-database only can handle without dropping consistency. As far as I understand the CAP-theorem, if you keep consistency, you lose availability or partition tolerance. So what trick makes StarCounter work?
I can imagine that StarCounter is fast, but the claim that NoSql needs to drop consistency to keep up seems a little bit strange to me. Can anyone please explain?
Thanks in advance
Roland
The short answer
The CAP theorem (aka Brewers theorem) cannot be beaten for a single piece of information (like a consistent database). If you have a horizontally scaled database, you won't get consistency and performance. This conclusion comes from the laws of physics and can be deducted from Brewers theorem and Einsteins theories of relativity. You need to scale-in/up, not out. Not very "cloudy", but as the enemies of Galileo would probably confess if they were alive today, nature does a poor job at honouring human fashion.
Scaling consistent data
I'm sure there are other approaches, but Starcounter works by hosting the database image in RAM. Instead of moving database data to the application code, parts of the application code is moved to the database. Only data in the final response gets moved from the original place in RAM memory (where the data was in the first place). This makes most of the data stay put even if there are millions of requests processed every second. The downside is that the database needs to know the programming language of your application logic. The upside, however, is obvious if you have ever tried to serve millions of HTTP requests/sec, each requiring extensive database access.
A more thourough answer
The question is a good one. It is no wonder you find it strange as it was only a few years back that CAP was proven (turned into a theorem). Many developers are as disappointed as a kid would be when theoretical physicist tells him to stop looking for the perpetual motion machine because it cannot work. We still want the scale-out consistent database, don't we?
The CAP theorem
The CAP theorem gives that any piece of information cannot have consistency (C), availability (A) and partition tolerance (P). It applies to a unit of information (such as a database). You can of course have independent pieces of information that operates differently. One piece could be AP, another could be CA and a third could be CP. You just cant have the same information being CAP.
The problem with the impossibility of the 'P' in a consistent and available database results in how a scaled-out database MUST do signalling between the nodes. The conclusion must be, that even in a hundred years from now, CAP gives that a single piece of consistent data will have to live on hardware interconnected using hard wires or light beams.
The problem with the P in CAP
The problem lies in performance if you apply horizontal scaling to an available consistent database. A good performance was the very reason to do horizontally scaling in the first place, this is a very bad thing. As every node needs communicate with the other nodes whenever there is database access to achieve consistency, and given the fact that signalling is ultimately limited by the speed of light, you are left with sad but true fact that database scientist (as well as CPU scientists) are not just being stubborn for failing to see scale-out as a a magical silver bullet. It will not happen because it cannot happen (however, parts of your database could be placed in a AP set, so remember, we are talking about consistent data here). Adding the theories of Einstein to the CAP theorem and the small box wins of the cloudy data-center for consistent data.
Perpetual machines and CAP
The state of things in the database community is a little bit like the state of perpetual motion machines when horse and carriage was the way to get to work. Without any theoretical evidence against it, the patent offices granted hundreds of patents for impossible perpetual machines. Today, we may laugh at this, but we have a similar situations in the database industry with consistent scale-out databases. When you hear somebody claim that they have a scale-out ACID database, be cautious. It was only after the dot com crash mathematicians at MIT proved Brewer right at the CAP theorem was officially born, so the hunt for the impossible has unfortunately not died off just yet. You can compare this, if you want, to the way laggards kept trying to invent the perpetual machine for years after modern theoretical physics should reasonably have put a stop to it. Old habits die hard (my apologies to anyone on Stack overflow still making drawings of bearings and arms moving ad finitum on their own accord - I don't mean to be offensive).
CAP and performance
All is not lost however. Not all pieces of information needs to be consistent. Not all pieces needs to scale-out. You just have the accept Brewers theorem and make the best out of it.
For applications such as Facebook, consistency is dropped. This is okay as data is entered once and then is manipulated by a single users. Still we can experience the side effects in everyday Facebook usage such as things popping in and out of existence for a while.
However, in most business applications, data needs to be correct. The sum of all accounts in your bookkeeping needs to amount to zero. Your stock inventory must equal to 8 if you sold 2 out of 10 items even if there are multiple users buying from the same stock.
The problem with scaling out available data is that you have to make do without partition tolerance. This fancy word simply means that you have to signal between the nodes in your cloud at all times. And as it takes light a few nanoseconds to travel a single meter, this becomes impossible without making your scale-out result in less performance rather than more performance. Of course, this is only true for consistent data. The implications of this has been known by the engineers of Intel, AMD, Oracle et. al for a long time. It is not their scientist haven't heard of scale-out. It is just that they have come to accept the world as Einstein described it.
Some comfort in the gloom
If you do the math, you find that a single PC has instructions to spare on each human being living on Earth for each second it is running (google on 'modern CPU' and 'MIPS'). If you do some more math, like taking the total turnover of Amazon.com (you can find it at wwww.nasdaq.com) divided by the price of an average book, you will find that the total number of sales transactions can fit in RAM of a single modern PC. The cool thing is that the number of items, customers, orders, products etc. occupies the same amount of space in 2012 as it did in 1950. Images, video and audio has increased in size, but numeric and textual information does not grow per item. Sure the number of transactions grows, but not in the same phase as computer power grows. So the logical solution is to scale out read-only and AP data and "scale-in/up" business data.
"Scale-in" instead of "scale-out"
Database engines and business logic running in a VM (like the Java VM or the .NET CLR) typically use fairly effective machine code. This means that moving memory is the overshadowing bottleneck of total throughput for a consistent database. This is often referred to as the memory wall (wikipedia has some useful information).
The trick is to transfer code to the database image instead of data from the database image to the code (if using a MVC or a MVVM pattern). This means that the consuming code executes in the same address space as the database image and that data is never moved (and the disk is merely securing transactions and images). Data can stay in the original database image and does not have to be copied into the memory of the application. Instead of treating the database as a RAM database, the database is treated as primary memory. Everything stays put.
Only data that is part of the final user response is moved out of the database image. For a large scale applications with hundreds of millions of simultaneous users this typically amounts to only a few million requests per second, something that a single PC has no problem with handling given that the HTTP packaging is done on gateway servers. Fortunately, such servers scales out beautifully as they don't need to share data.
As it turns out, the disk is fast at sequential writes so a raided disk can persist terabytes or changes every minute.
Horizontal scaling in Starcounter
Normally you do not scale a Starcounter node. It scales-in rather than out. This works well for a few million simultaneous users. To go above that, you need to add more Starcounter nodes. They can be used to partition data (but then you lose consistency and Starcounter is not designed for partitioning so it is less elegant than solutions such as Volt DB). So a better alternative is to use the additional Starcounter nodes as gateway servers. These servers simple accumulates all incoming HTTP requests for a millisecond at a time. This might sound like a short amount of time, but it is enough to accumulate thousands of request if you decided you need to scale Starcounter. The batch of requests are then sent to the ZLATAN node (Zero LATency Atomicity Node) a thousand times a second. Each such batch can contain thousands of requests. In this way, a few hundred million user sessions can be served by a single ZLATAN node. Although you can have several ZLATAN nodes, there is only one active ZLATAN node at a time. This is how the CAP theorem is honored. To go above that, you need to consider the same tradeoff as Facebook and others.
Another important note is that the ZLATAN node does not serve applications with data. Instead, the applications controller code is run by the ZLATAN node. The cost of serializing/deserializing and sending data to an application is far greater than to process the controller logic cycles. I.e. the code is sent to the database instead of the other way around (a traditional approach is that the applications asks for data or sends data).
Making the "shared-everything" node faster by doing less
The use of the database as a "heap" for the programming language instead of a remote system for serialization and deserialization is a trick that Starcounter calls VMDBMS. If the database is in RAM, you should not move data from one place in RAM to another place in RAM which is the case with most RAM databases.
There is no 'trick'. Starcounter is talking about speed, while CAP/NoSQL are talking about scalability. There is a trade-off between features+scalability vs speed.
Sometimes it's OK to ignore scalability if you can prove there are bottlenecks elsewhere. For instance, a new startup shouldn't worry about their website scaling to a million users, they should worry about getting their first hundred users. (Does anyone remember how often Twitter was down in the early days?) Starcounter can be useful if their transaction rate is much greater than your web page hit rate.
On the other hand, I don't trust anyone who lumps all "NoSQL" Databases together. The various NoSQL databases are more different than alike. They have radically different architectures and properties. Some of them scale to thousands of nodes, some of them don't scale beyond one node. Sometimes adding scalability slows you down. Sometimes removing features speeds you up.
http://strata.oreilly.com/2010/12/strata-gems-mysql-handlersocket.html