CAP Theorem - What are the reasons for partitioning in the first place? - distributed-computing

There are plenty of good Stack Overflow Q&As on the CAP theorem, CP vs. AP, etc.
In a nutshell, the theorem states:
"In the presence of a partition you must sacrifice availability or consistency"
Let's imagine we are talking about storage, databases in particular.
What are the technical reasons to partition in the first place?
(I'll take some guesses below:)
The OS can handle only so many ports/file handles.
Single "N petabyte" hard disks do not exist; you need several, until you run out of SATA/PCIe ports.
Bringing the data closer to the consumer.
A single database is limited to some size X.

Please note that there is a difference in meaning between "partitioning" as per CAP and "partitioning" as per physical database design.
"Partitioning" in the CAP sense refers to what happens when a node in a distributed system becomes unavailable/unreachable; it describes phenomena that happen at run time.
"Partitioning" in the physical-database-design sense refers to the design decision to distribute the physical records representing the rows of one single table across various distinct physical stores, where "distinct" may still mean only distinct segments of one single store. It thus refers to things that happen at design time.
In particular, "partitioning" as per physical database design does not necessarily give rise to a "distributed system" in the CAP sense: you do not necessarily create a system with various distinct runtime components operating independently. If you partition a table, you typically still communicate with one single DBMS, i.e. one sole runtime component.
It is therefore wrong to conclude that, because of the CAP theorem, "partitioning" as per physical database design necessarily sacrifices consistency.

Related

What is meant by Distributed System?

I am reading about distributed systems and getting confused about what it really means.
I understand at a high level that it means a set of different machines that work together to achieve a single goal.
But this definition seems too broad and loose. I would like to give some points to explain the reasons for my confusion:
I see a lot of people referring to microservices as a distributed system, where functionalities like Order, Payment, etc. are distributed across different services, whereas others refer to multiple instances of an Order service that try to serve customers and possibly use some consensus algorithm to agree on shared state (e.g. the current inventory level).
When talking about distributed databases, I see a lot of people talk about different nodes, each of which stores/serves a part of the user requests, e.g. records with primary keys from 'A-C' on the first node, 'D-F' on the second node, etc. At a high level it looks like sharding.
When talking about distributed rate limiting, some refer to multiple application nodes (so-called distributed application nodes) using a single rate limiter, while others mention that the rate limiter itself has multiple nodes with a shared cache (like Redis).
It feels like people use "distributed systems" to refer to microservices architecture, horizontal scaling, partitioning (sharding), and anything in between.
I am reading about distributed systems and getting confused about what it really means.
As commented by @ReinhardMänner, a good general definition of a distributed system (DS) is at https://en.wikipedia.org/wiki/Distributed_computing
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal.
Anything that fits the above definition can be referred to as a DS. All the mentioned examples, such as microservices, distributed databases, etc., are specific applications of the concept or implementation details.
The statement "X is a distributed system" does not inherently imply any of these details; they must be explicitly specified for each DS. E.g., a distributed database does not necessarily mean sharding is used.
I'll also draw from Wikipedia, but I think that the second part of the quote is more important:
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal. Three significant challenges of distributed systems are: maintaining concurrency of components, overcoming the lack of a global clock, and managing the independent failure of components. When a component of one system fails, the entire system does not fail.
A system that constantly has to overcome these problems, even if all services are on the same node, or if they communicate via pipes/streams/files, is effectively a distributed system.
Now, trying to clear up your confusion:
Horizontal scaling existed with monoliths before microservices. Horizontal scaling is basically achieved by division of compute resources.
Division of compute requires dealing with synchronization, node failure, and multiple clocks. But that is still cheaper than scaling vertically. That's where you might turn to consensus: by implementing it in the application, by using a dedicated service such as ZooKeeper, or by abusing a DB table for that purpose (a sketch of the last option follows below).
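To make the last option concrete, here is a minimal sketch of coordinating via a DB table: lease-based leader election, where a node leads only as long as it keeps refreshing its row. SQLite is used just to keep the example self-contained; the table and column names are made up.

```python
import sqlite3
import time
import uuid

# Hypothetical sketch: leader election by "abusing a DB table".
# A node leads as long as it keeps its lease row fresh; another node
# may take over only after the lease expires. All names are made up.
LEASE_SECONDS = 10
NODE_ID = str(uuid.uuid4())

db = sqlite3.connect("coordination.db", isolation_level=None)  # autocommit mode
db.execute("""CREATE TABLE IF NOT EXISTS leader (
                  resource   TEXT PRIMARY KEY,
                  holder     TEXT,
                  expires_at REAL)""")

def try_acquire(resource):
    now = time.time()
    db.execute("BEGIN IMMEDIATE")  # write lock, so the check-and-set is atomic
    try:
        row = db.execute("SELECT holder, expires_at FROM leader WHERE resource = ?",
                         (resource,)).fetchone()
        if row is None:
            db.execute("INSERT INTO leader VALUES (?, ?, ?)",
                       (resource, NODE_ID, now + LEASE_SECONDS))
        elif row[0] == NODE_ID or row[1] < now:  # we own it, or the lease lapsed
            db.execute("UPDATE leader SET holder = ?, expires_at = ? WHERE resource = ?",
                       (NODE_ID, now + LEASE_SECONDS, resource))
        else:
            return False  # someone else holds a live lease
        return True
    finally:
        db.execute("COMMIT")

if try_acquire("nightly-batch-job"):
    print("this node leads; renew the lease well before it expires")
```

This buys you coordination without a dedicated consensus service, at the cost of making the database a single point of failure for coordination.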
Monoliths present 2 problems that microservices solve: address-space dependency (i.e. someone's component may crash the whole process and thus your component) and long startup times.
While microservices solve these problems, these problems aren't what makes them a "distributed system". It doesn't matter whether the different processes/nodes run the same software (monolith) or not (microservices); what matters is that they are different processes that can't easily communicate directly (e.g. via function calls that promise not to fail).
In databases, scaling horizontally is also cheaper than scaling vertically. The two components of horizontal DB scaling are division of compute - effectively, a distributed system - and division of storage - sharding - as you mentioned, e.g. A-C, D-F, etc.
Sharding of storage does not define distributed systems - a single compute node can handle multiple storage nodes. It's just that it's much more useful for a database that divides compute to also shard its storage, so you often see them together.
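To make the storage-division point concrete, here is a minimal sketch of range-based shard routing, like the "A-C", "D-F" example above; the node names and range boundaries are made up.

```python
import bisect

# Hypothetical sketch: route a record to a storage node by the first
# letter of its primary key, as in the "A-C", "D-F" example above.
# Node names and range boundaries are made up for illustration.
RANGES = [("C", "node-1"), ("F", "node-2"), ("Z", "node-3")]  # inclusive upper bounds

def shard_for(primary_key):
    first_letter = primary_key[0].upper()
    i = bisect.bisect_left([upper for upper, _ in RANGES], first_letter)
    return RANGES[i][1]

assert shard_for("alice") == "node-1"  # 'A' falls in A-C
assert shard_for("dave") == "node-2"   # 'D' falls in D-F
```

Note that this router can run on a single compute node in front of many storage nodes, which is exactly why sharding alone does not make a distributed system.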
Distributed rate limiting falls under "maintaining concurrency of components". If every node does its own rate limiting, and they don't communicate, then the system-wide rate cannot be enforced. If they wait for each other to coordinate enforcement, they aren't concurrent.
Usually the solution is "approximate" rate limiting, where components synchronize "occasionally" (sketched below).
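A minimal sketch of such approximate rate limiting, under assumptions of my own: each node counts requests locally and only pushes its count to a shared store every few milliseconds; a plain dict stands in for something like Redis.

```python
import time

# Hypothetical sketch of "approximate" distributed rate limiting:
# each node spends against a locally cached view of the global count
# and reconciles with the shared counter only "occasionally".
# A plain dict stands in for a shared cache such as Redis.
SHARED = {"used": 0}
GLOBAL_LIMIT = 1000   # allowed requests per window, across all nodes
SYNC_EVERY = 0.050    # seconds between syncs - the "occasionally"

class ApproximateLimiter:
    def __init__(self):
        self.local_used = 0                  # requests not yet reported
        self.snapshot = 0                    # last known global count
        self.last_sync = time.monotonic()

    def allow(self):
        now = time.monotonic()
        if now - self.last_sync >= SYNC_EVERY:
            SHARED["used"] += self.local_used    # push local spend
            self.local_used = 0
            self.snapshot = SHARED["used"]       # refresh global view
            self.last_sync = now
        if self.snapshot + self.local_used >= GLOBAL_LIMIT:
            return False   # can over- or under-count between syncs,
                           # hence only an approximate limit
        self.local_used += 1
        return True
```

The gap between syncs is exactly where the system trades consistency of the counter for concurrency of the nodes.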
If your components can't easily (= with no latency) agree on a global rate limit, that's usually because they can't easily agree on a global anything. In that case, you're effectively dealing with a distributed system, even if all the components are just threads in the same process.
(That could happen e.g. if you plan to scale out but haven't done so yet, so you don't allow your threads to communicate directly.)

Is it still possible for a transaction that involves read and write operations against the replicated database to commit?

I have a question whose answer I couldn't find, although I read the whole book.
Consider a distributed system in which the database is replicated over five servers. At one point, the network between the replicated servers isolates three of the servers from the remaining two. Is it still possible for a transaction that involves read and write operations against the replicated database to commit? Motivate your answer.
I would appreciate it if you could answer this question.
This is a rather general question. Rephrased, you are asking what happens when a split brain occurs in a distributed database.
And the answer is: it depends ;-)
Really. It depends on what type of distributed database you work with. From a high-level perspective, it depends on which CAP trade-off the database chose - is it a CA or a CP system?
(I use this differentiation for the sake of brevity; see https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html)
If it's a CA system, then you can certainly read and write to any partition. When the split brain is resolved, the database has some recovery tooling that mends the partitions back into a consistent state. Or the database may leave this responsibility to the user, who will get a set of possible values and has to decide which one is correct.
If it's a CP system, then we can say it depends on whether you work with the partition that consists of 3 of the 5 servers (i.e. you work with the majority). Then you may read and write. If your client connects to the minority partition (2 of the 5 servers), then you are probably not permitted to read or write. But it depends on the consistency level you require: if you require linearizability, a client in the minority partition can do neither reads nor writes.
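A minimal sketch of the majority rule just described, for the 5-server example; the quorum size follows the usual majority formula, the rest is illustrative.

```python
# Hypothetical sketch of the majority rule above: with N = 5 replicas,
# only the side of the split that still reaches a majority (3 of 5)
# may serve linearizable reads and accept writes.
N = 5
QUORUM = N // 2 + 1  # = 3

def can_serve(reachable_replicas):
    return reachable_replicas >= QUORUM

print(can_serve(3))  # True:  the 3-server partition keeps committing
print(can_serve(2))  # False: the 2-server partition must refuse
```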

Where would a scaled relational DB fall in the CAP theorem?

If you have a scaled SQL Server setup with one DB for writes and multiple DBs for reads, wouldn't there be a delay before data is replicated from the write DB to the other read databases? In which case, isn't the data inconsistent?
So where would a scaled relational DB fall in the CAP theorem?
Update:
In relational DBs, consistency means there won't be partial updates. For example, if someone transfers money from one account to another and the whole thing is part of one transaction, it won't happen that money is taken out of one account but doesn't show up in the other.
In the CAP theorem, consistency means that all components see the same data. That consistency is different from consistency in ACID.
From what I know, relational DBs like SQL Server are supposed to be CA (consistent and available). This makes sense if there is just one database, because everyone would see the same data. But what if the SQL Server is scaled out with multiple databases? In that case, would all databases still see the same data? If not, would it still be consistent (in the CAP sense)?
My feeling is that a scaled relational DB is AP (available and partition tolerant) and not CA (consistent and available).
I've read different definitions of consistency in regards to the CAP theorem.
Some definitions of consistency say that once some data is persisted in a system, all reads will read the most recently written data. In this definition, a replicated database (you call this "scaled" but I wouldn't use that term) has a risk of returning inconsistent data, if the replication is asynchronous.
To mitigate this risk, some systems make sure replication is synchronous, or as close to synchronous as they can implement. Galera, for example, sends transaction write sets to its replicas synchronously. If you try to read from the replica, and it detects that there are write sets pending but not yet applied, it can block your read until it has caught up with the pending write sets (this behavior is configurable). So you'll never read data that is out of date.
The cost of maintaining perfectly consistent reads over distributed systems in this manner is usually more expensive than users want. It will become a performance bottleneck in a system that has a high rate of updates. So for practical reasons, most projects accept that "replication lag" is a necessary compromise.
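A minimal sketch of the blocking behavior described for Galera above: the replica refuses to answer a read until its applier has caught up with everything it knows is pending. Plain integers stand in for real replication positions (GTIDs/sequence numbers); the class and method names are made up.

```python
import time

# Hypothetical sketch of a Galera-style consistent read: block the
# read until all pending write sets have been applied locally.
# Plain integers stand in for real replication positions (GTIDs).
class Replica:
    def __init__(self):
        self.received_seqno = 0   # last write set received from the cluster
        self.applied_seqno = 0    # last write set applied to local storage
        self.data = {}

    def consistent_read(self, key, timeout=1.0):
        deadline = time.monotonic() + timeout
        while self.applied_seqno < self.received_seqno:  # write sets pending
            if time.monotonic() > deadline:
                raise TimeoutError("replica lagging; refusing stale read")
            time.sleep(0.001)  # wait for the background applier to catch up
        return self.data.get(key)  # reflects all writes the replica knows of
```

The wait in the loop is exactly the performance bottleneck mentioned above: under a high update rate, reads spend their time waiting for replication instead of returning.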
Other definitions of consistency are closer to atomicity, i.e. transactions will not be persisted in a partially-complete state. So all constraints will be satisfied when you read the data, whether you read the data before or after the transaction is applied. In this definition, it's quite easy to imagine the replica database instance remaining consistent, if it applies updates using the same transaction semantics used on the master. If you read data from the replica, you might read data that hasn't yet had the latest updates applied, but it will never be in an inconsistent state with respect to constraints.
There is no such thing as a scaled RDBMS. We do have "RDBMS clusters with shared storage": there you can keep adding nodes to achieve high availability of the RDBMS.
In other words:
If by "scaled RDBMS" you mean a "distributed RDBMS" - it doesn't exist. You can have an RDBMS on only one node. If you add another node, that will be "another" RDBMS, and it will NOT coalesce with the first one to give you a single view (unlike a typical NoSQL database). You can, however, happily keep adding storage nodes behind the RDBMS.

Why can't CP systems also be CAP?

My understanding of the CAP acronym is as follows:
Consistent: every read gets the most recent write
Available: every node is available
Partition tolerant: the system can continue upholding the A and C promises when the network connection between nodes goes down
Assuming my understanding is more or less on track, something is bothering me.
AFAIK, availability is achieved via any of the following techniques:
Load balancing
Replication to a disaster recovery system
So if I have a system that I already know is CP, why can't I "make it full CAP" by applying one of these techniques to make it available as well? I'm sure I'm missing something important here, just not sure what.
It's partition tolerance that you got wrong.
As long as no partitioning happens, systems can be consistent and available. There are CA systems which say: we don't care about partitions. You can run them inside racks of server hardware and make partitioning extremely unlikely. The problem is: what if partitions do occur?
The system can either choose to
continue providing the service, hoping the other server is down rather than up and serving different data - choosing availability (AP), or
stop providing the service, because it cannot guarantee consistency anymore, since it doesn't know whether the other server is down or in fact up and running with just the communication between the two broken off - choosing consistency (CP).
The idea of the CAP theorem is that you cannot provide both availability AND consistency once partitioning occurs: you can either go for availability and hope for the best, or play it safe and be unavailable, but consistent.
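That fork in the road fits in a few lines of code. A hedged sketch, with made-up names: during a suspected partition an AP node answers from whatever it has, while a CP node refuses unless it can still see a majority.

```python
# Hypothetical sketch of the AP-vs-CP choice once a partition occurs.
class Node:
    def __init__(self, mode, cluster_size, peers_reachable):
        self.mode = mode                        # "AP" or "CP"
        self.cluster_size = cluster_size
        self.peers_reachable = peers_reachable  # peers we can still talk to
        self.store = {}

    def read(self, key):
        majority = self.cluster_size // 2 + 1
        if self.peers_reachable + 1 >= majority:  # +1 counts ourselves
            return self.store.get(key)            # no partition on our side
        if self.mode == "AP":
            return self.store.get(key)  # stay available, risk stale data
        raise RuntimeError("CP: consistency not guaranteed, going unavailable")
```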
Here are 2 great posts, which should make it clear:
You Can't Sacrifice Partition Tolerance shows the idea that every truly distributed system needs to deal with partitioning now and then, and hence CA systems will break instantly at the first occurrence of a partition.
CAP Twelve Years Later: How the "Rules" Have Changed is slightly more up to date and presents the CAP theorem more flexibly: developers can choose how applications behave during partitioning and can sacrifice a bit of consistency to gain some availability.
So, to finally answer your question: if you take a CP system and replicate it more, you might either run into the overhead of messages sent between the nodes of the system to keep it consistent, or - in case a substantial part of the nodes fails, or network partitioning occurs without any part having a clear majority - it won't be able to continue operating, as it can't guarantee consistency anymore. But yes, these lines are getting blurrier, and I think the references I've provided will give you a much better understanding.

StarCounter and CAP

I have been reading about a database named Starcounter. It claims that it can handle loads that otherwise only a "NoSQL" database can handle, without dropping consistency. As far as I understand the CAP theorem, if you keep consistency, you lose availability or partition tolerance. So what trick makes Starcounter work?
I can imagine that StarCounter is fast, but the claim that NoSql needs to drop consistency to keep up seems a little bit strange to me. Can anyone please explain?
Thanks in advance
Roland
The short answer
The CAP theorem (a.k.a. Brewer's theorem) cannot be beaten for a single piece of information (like a consistent database). If you have a horizontally scaled database, you won't get both consistency and performance. This conclusion comes from the laws of physics and can be deduced from Brewer's theorem and Einstein's theories of relativity. You need to scale in/up, not out. Not very "cloudy", but as the enemies of Galileo would probably confess if they were alive today, nature does a poor job at honouring human fashion.
Scaling consistent data
I'm sure there are other approaches, but Starcounter works by hosting the database image in RAM. Instead of moving database data to the application code, parts of the application code are moved to the database. Only data in the final response gets moved from its original place in RAM (where the data was in the first place). This lets most of the data stay put even when there are millions of requests processed every second. The downside is that the database needs to know the programming language of your application logic. The upside, however, is obvious if you have ever tried to serve millions of HTTP requests/sec, each requiring extensive database access.
A more thorough answer
The question is a good one. It is no wonder you find it strange, as it was only a few years back that CAP was proven (turned into a theorem). Many developers are as disappointed as a kid would be when a theoretical physicist tells him to stop looking for the perpetual motion machine because it cannot work. We still want the scale-out consistent database, don't we?
The CAP theorem
The CAP theorem says that any single piece of information cannot have consistency (C), availability (A), and partition tolerance (P) all at once. It applies to a unit of information (such as a database). You can of course have independent pieces of information that operate differently. One piece could be AP, another could be CA, and a third could be CP. You just can't have the same information being CAP.
The impossibility of the "P" in a consistent and available database comes down to how a scaled-out database MUST signal between the nodes. The conclusion must be that, even a hundred years from now, CAP dictates that a single piece of consistent data will have to live on hardware interconnected by hard wires or light beams.
The problem with the P in CAP
The problem lies in performance if you apply horizontal scaling to an available, consistent database. Since good performance was the very reason to scale horizontally in the first place, this is a very bad thing. As every node needs to communicate with the other nodes whenever there is database access in order to achieve consistency, and given the fact that signalling is ultimately limited by the speed of light, you are left with the sad but true fact that database scientists (as well as CPU scientists) are not just being stubborn when they fail to see scale-out as a magical silver bullet. It will not happen because it cannot happen (however, parts of your database could be placed in an AP set, so remember, we are talking about consistent data here). Add Einstein's theories to the CAP theorem, and the small box beats the cloudy data center for consistent data.
Perpetual machines and CAP
The state of things in the database community is a little bit like the state of perpetual motion machines when horse and carriage was the way to get to work. Without any theoretical evidence against them, the patent offices granted hundreds of patents for impossible perpetual machines. Today we may laugh at this, but we have a similar situation in the database industry with consistent scale-out databases. When you hear somebody claim that they have a scale-out ACID database, be cautious. It was only after the dot-com crash that mathematicians at MIT proved Brewer right, and the CAP theorem was officially born, so the hunt for the impossible has unfortunately not died off just yet. You can compare this, if you like, to the way laggards kept trying to invent the perpetual motion machine for years after modern theoretical physics should reasonably have put a stop to it. Old habits die hard (my apologies to anyone on Stack Overflow still making drawings of bearings and arms moving ad infinitum of their own accord - I don't mean to be offensive).
CAP and performance
All is not lost, however. Not all pieces of information need to be consistent. Not all pieces need to scale out. You just have to accept Brewer's theorem and make the best of it.
For applications such as Facebook, consistency is dropped. This is okay, as data is entered once and then manipulated by a single user. Still, we can experience the side effects in everyday Facebook usage, such as things popping in and out of existence for a while.
However, in most business applications, data needs to be correct. The sum of all accounts in your bookkeeping needs to be zero. Your stock inventory must equal 8 if you sold 2 out of 10 items, even if there are multiple users buying from the same stock.
The problem with scaling out available data is that you have to make do without partition tolerance. This fancy word simply means that you have to signal between the nodes in your cloud at all times. And as it takes light a few nanoseconds to travel a single meter, this becomes impossible without your scale-out resulting in less performance rather than more. Of course, this is only true for consistent data. The implications of this have been known by the engineers of Intel, AMD, Oracle et al. for a long time. It is not that their scientists haven't heard of scale-out. It is just that they have come to accept the world as Einstein described it.
Some comfort in the gloom
If you do the math, you find that a single PC has instructions to spare for every human being living on Earth for each second it is running (google "modern CPU" and "MIPS"). If you do some more math, like taking the total turnover of Amazon.com (you can find it at www.nasdaq.com) divided by the price of an average book, you will find that the total number of sales transactions can fit in the RAM of a single modern PC. The cool thing is that the number of items, customers, orders, products, etc. occupies the same amount of space in 2012 as it did in 1950. Images, video and audio have increased in size, but numeric and textual information does not grow per item. Sure, the number of transactions grows, but not at the same pace that computer power grows. So the logical solution is to scale out read-only and AP data, and to "scale in/up" business data.
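The back-of-envelope argument, written out with round numbers that are mine, not the author's; plug in real figures from nasdaq.com if you want to check it:

```python
# Back-of-envelope version of the argument above. The figures are
# assumed round numbers for illustration, not real Amazon data.
yearly_revenue_usd = 50e9        # assumed turnover, order of magnitude
avg_item_price_usd = 25.0        # assumed price of an "average book"
bytes_per_transaction = 100      # assumed numeric/textual record size

transactions = yearly_revenue_usd / avg_item_price_usd       # 2e9 sales
ram_needed_gb = transactions * bytes_per_transaction / 1e9   # ~200 GB

print(f"{transactions:.0e} transactions -> ~{ram_needed_gb:.0f} GB of RAM")
# A single commodity server can carry a few hundred GB of RAM, so a
# year of bare sales records does fit on one machine.
```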
"Scale-in" instead of "scale-out"
Database engines and business logic running in a VM (like the Java VM or the .NET CLR) typically use fairly efficient machine code. This means that moving memory is the overshadowing bottleneck of total throughput for a consistent database. This is often referred to as the memory wall (Wikipedia has some useful information).
The trick is to transfer code to the database image instead of data from the database image to the code (if using an MVC or MVVM pattern). This means that the consuming code executes in the same address space as the database image and that data is never moved (the disk merely secures transactions and images). Data can stay in the original database image and does not have to be copied into the memory of the application. Instead of treating the database as a RAM database, the database is treated as primary memory. Everything stays put.
Only data that is part of the final user response is moved out of the database image. For a large-scale application with hundreds of millions of simultaneous users, this typically amounts to only a few million requests per second, something that a single PC has no problem handling, given that the HTTP packaging is done on gateway servers. Fortunately, such servers scale out beautifully, as they don't need to share data.
As it turns out, the disk is fast at sequential writes, so a RAIDed disk can persist terabytes of changes every minute.
Horizontal scaling in Starcounter
Normally you do not scale a Starcounter node. It scales in rather than out. This works well for a few million simultaneous users. To go above that, you need to add more Starcounter nodes. They can be used to partition data (but then you lose consistency, and Starcounter is not designed for partitioning, so it is less elegant than solutions such as VoltDB). So a better alternative is to use the additional Starcounter nodes as gateway servers. These servers simply accumulate all incoming HTTP requests for a millisecond at a time. That might sound like a short amount of time, but it is enough to accumulate thousands of requests if you have decided you need to scale Starcounter. The batches of requests are then sent to the ZLATAN node (Zero LATency Atomicity Node) a thousand times a second. Each such batch can contain thousands of requests. In this way, a few hundred million user sessions can be served by a single ZLATAN node. Although you can have several ZLATAN nodes, only one ZLATAN node is active at a time. This is how the CAP theorem is honored. To go above that, you need to consider the same trade-off as Facebook and others.
Another important note is that the ZLATAN node does not serve applications with data. Instead, the application's controller code is run by the ZLATAN node. The cost of serializing/deserializing and sending data to an application is far greater than the cost of processing the controller logic. I.e., the code is sent to the database instead of the other way around (the traditional approach is that the application asks for data or sends data).
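A minimal sketch of the gateway-side micro-batching described above: accumulate requests for about a millisecond, then forward the whole batch to the single writer node. The "ZLATAN" terminology follows the answer; the transport and all other names are mocked.

```python
import queue
import threading
import time

# Hypothetical sketch of millisecond micro-batching on a gateway:
# requests pile up for ~1 ms, then the whole batch is forwarded to
# the single writer ("ZLATAN") node. Transport is mocked with print.
BATCH_WINDOW = 0.001  # seconds
incoming = queue.Queue()

def send_to_zlatan(batch):
    print(f"forwarding batch of {len(batch)} requests")  # mock transport

def gateway_loop():
    while True:
        deadline = time.monotonic() + BATCH_WINDOW
        batch = []
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(incoming.get(timeout=remaining))
            except queue.Empty:
                break
        if batch:
            send_to_zlatan(batch)  # one hop carries many requests

threading.Thread(target=gateway_loop, daemon=True).start()
for i in range(5000):
    incoming.put({"request_id": i})
time.sleep(0.01)  # let a few batches flush before the demo exits
```

The design choice here is latency for throughput: each request waits up to a millisecond, but the single writer sees a thousand batches per second instead of millions of individual requests.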
Making the "shared-everything" node faster by doing less
Using the database as a "heap" for the programming language, instead of as a remote system for serialization and deserialization, is a trick that Starcounter calls VMDBMS. If the database is in RAM, you should not move data from one place in RAM to another place in RAM, which is what happens with most RAM databases.
There is no 'trick'. Starcounter is talking about speed, while CAP/NoSQL are talking about scalability. There is a trade-off between features+scalability vs speed.
Sometimes it's OK to ignore scalability if you can prove there are bottlenecks elsewhere. For instance, a new startup shouldn't worry about their website scaling to a million users, they should worry about getting their first hundred users. (Does anyone remember how often Twitter was down in the early days?) Starcounter can be useful if their transaction rate is much greater than your web page hit rate.
On the other hand, I don't trust anyone who lumps all "NoSQL" Databases together. The various NoSQL databases are more different than alike. They have radically different architectures and properties. Some of them scale to thousands of nodes, some of them don't scale beyond one node. Sometimes adding scalability slows you down. Sometimes removing features speeds you up.
http://strata.oreilly.com/2010/12/strata-gems-mysql-handlersocket.html