If there is a database system that can handle both ACID transactions and non-ACID transactions, would that invalidate the CAP theorem, which states that you have to make a trade-off between A (availability) and C (consistency) in the presence of partitions?
When using a SQL journal in an event-sourced system, is it OK to update the journal and one or more projections in the same transaction? Is it an anti-pattern?
The pro I can think of is that the consistency of the view is immediate, but what are the cons? Performance?
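As a minimal sketch of the setup being asked about, the snippet below appends an event to a journal table and updates one projection inside a single SQLite transaction. The table names (`journal`, `balance_view`), the event shape, and the helper names are assumptions made up for illustration; they are not taken from any particular event-sourcing library.

```python
import json
import sqlite3


def open_store():
    """Create an in-memory store with a journal and one projection (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE journal ("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, stream TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE balance_view (stream TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    return conn


def append_and_project(conn, stream, amount):
    """Append a 'deposited' event and update the projection in ONE transaction."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute(
            "INSERT INTO journal (stream, payload) VALUES (?, ?)",
            (stream, json.dumps({"deposited": amount})),
        )
        # Make sure a projection row exists, then fold the event into it.
        conn.execute(
            "INSERT OR IGNORE INTO balance_view (stream, balance) VALUES (?, 0)",
            (stream,),
        )
        conn.execute(
            "UPDATE balance_view SET balance = balance + ? WHERE stream = ?",
            (amount, stream),
        )
```

If the projection update fails, the journal insert is rolled back with it, so the view can never lag behind the journal. The price is that every writer waits on the same transaction.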
Scalability is the biggest price.
Interestingly, you say it's 'immediate', but that isn't strictly true.
There is still a delay while the transaction completes and depending on how long that takes or how many transactions per second are being made, you run the risk of unnecessary concurrency conflicts.
You still have to pay the "eventual consistency" price, you just pay it in a blocking way (nothing wrong with that if it's a conscious choice).
Hope that helps.
If you have scaled SQL Server with one DB for writes and multiple DBs for reads, wouldn't there be a delay while data is replicated from the write DB to the read databases? In that case, isn't the data inconsistent?
So where would a scaled relational DB fall in the CAP theorem?
Update:
In relational DBs, consistency means there won't be partial updates. For example, if someone transfers money from one account to another and the whole thing is part of one transaction, it won't happen that money is taken out of one account but doesn't show up in the other.
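A small sketch of that transfer example, using Python's built-in `sqlite3` module. The `accounts` table, the `CHECK` constraint, and the function names are assumptions for illustration. The credit is deliberately applied before the debit so that a failed debit shows the earlier credit being rolled back.

```python
import sqlite3


def make_bank():
    """Create an in-memory bank with two accounts (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id TEXT PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commit on success, rollback if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            # Violates the CHECK constraint if src would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

A transfer that would overdraw the source account returns `False` and leaves both balances untouched, which is the "no partial updates" guarantee described above.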
In the CAP theorem, consistency means all the components see the same data. That consistency is different from consistency in ACID.
From what I know, relational DBs like SQL Server are supposed to be CA (consistent and available). This would make sense if there is just one database, because everyone would see the same data. But what if SQL Server is scaled out to multiple databases? In that case, would all databases still see the same data? If not, would it be consistent (in the CAP sense)?
My feeling is a scaled relational DB is AP (Available and partition tolerant) and not CA (Consistent and available).
I've read different definitions of consistency in regards to the CAP theorem.
Some definitions of consistency say that once some data is persisted in a system, all reads will read the most recently written data. In this definition, a replicated database (you call this "scaled" but I wouldn't use that term) has a risk of returning inconsistent data, if the replication is asynchronous.
To mitigate this risk, some systems make sure replication is synchronous, or as close to synchronous as they can implement. Galera, for example, sends transaction write sets to its replicas synchronously. If you try to read from the replica, and it detects that there are write sets pending but not yet applied, it can block your read until it has caught up with the pending write sets (this behavior is configurable). So you'll never read data that is out of date.
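The behavior described above can be illustrated with a toy model. This is not Galera's implementation or API: the `Primary` and `Replica` classes and the `sync_wait` flag are hypothetical names, and "blocking" is simulated by applying the pending write sets before answering.

```python
from collections import deque


class Replica:
    """Toy replica: write sets arrive immediately but are applied later."""

    def __init__(self):
        self.data = {}
        self.pending = deque()  # write sets received but not yet applied

    def receive(self, write_set):
        self.pending.append(write_set)

    def apply_one(self):
        if self.pending:
            key, value = self.pending.popleft()
            self.data[key] = value

    def read(self, key, sync_wait=True):
        if sync_wait:
            # Stand-in for blocking the read: catch up on everything pending first.
            while self.pending:
                self.apply_one()
        return self.data.get(key)


class Primary:
    """Toy primary that ships each write set to every replica on write."""

    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas

    def write(self, key, value):
        self.data[key] = value
        for replica in self.replicas:
            replica.receive((key, value))
```

With `sync_wait=False` a read can return a value older than the last write on the primary; with `sync_wait=True` it cannot, at the cost of waiting for the replica to catch up.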
Maintaining perfectly consistent reads over distributed systems in this manner usually costs more than users are willing to pay. It becomes a performance bottleneck in a system that has a high rate of updates. So for practical reasons, most projects accept that "replication lag" is a necessary compromise.
Other definitions of consistency are closer to atomicity, i.e. transactions will not be persisted in a partially-complete state. So all constraints will be satisfied when you read the data, whether you read the data before or after the transaction is applied. In this definition, it's quite easy to imagine the replica database instance remaining consistent, if it applies updates using the same transaction semantics used on the master. If you read data from the replica, you might read data that hasn't yet had the latest updates applied, but it will never be in an inconsistent state with respect to constraints.
There is nothing called a scaled RDBMS. We do have "RDBMS clusters with shared storage": here you can keep on adding nodes to achieve high availability of the RDBMS.
In other words:
If you meant a "Distributed RDBMS" by "Scaled RDBMS", it doesn't exist. You can have an RDBMS on only one node. If you add another node, then that will be "another" RDBMS and it would NOT coalesce with the first one to give you a single view (unlike a typical NoSQL database). You can, however, happily keep on adding storage nodes behind the RDBMS.
If I understand the CAP Theorem correctly, availability means that the cluster continues to operate even if a node goes down.
I've seen a lot of people (http://blog.nahurst.com/tag/guide) list RDBMS as CA, but I do not understand how an RDBMS is available, since if a node goes down, the cluster must go down to maintain consistency.
My only possible answer to this has been that most RDBMS are a single node, so there is no "non-failing" node. But, this seems to be a technicality, not true 'availability' and definitely not high availability.
Thank you.
First of all, let me clarify that consistency in an RDBMS is different from consistency in distributed systems. In an RDBMS (a single system), consistency refers to transactional consistency, whereas in distributed systems consistency means that the view from anywhere in the system (a read from any node) is the same. So a single-node RDBMS cannot be discussed with regard to the CAP theorem. It is like comparing apples to oranges.
An RDBMS with master-slave replication can be compared to distributed systems. Here the RDBMS can be configured as CA, CP or AP. MySQL, for example, provides a way to configure the system so that if there is a quorum loss (not enough secondaries available for commit log replication), the cluster is not available (CP system). MySQL also provides a configuration that allows the cluster to operate as long as the master is available (CA system), with the potential for data loss. SQL Server AlwaysOn is an AP system, because commit log replication is asynchronous (even on sync replicas).
So RDBMS can be any of CA, CP or AP in a distributed world.
I believe you are misunderstanding the relation between CAP-Availability and nodes being up or down. Availability is about providing an answer to every received query. When a node is down it cannot receive queries, so if you bring down part of the cluster, or all of it, the CAP-Availability property still holds. Although this may sound counterintuitive at first glance, by shutting down nodes you are holding on to CAP-Availability and dropping CAP-Partition tolerance instead. I've recently posted an answer whose examples provide some clarification.
In a nutshell: a partition occurs that isolates node N. If N receives a request it can either: i) answer, which grants availability but drops consistency because N is out of sync; or ii) not answer, to avoid replying with an out-of-date result, thereby dropping availability because a request was received but no reply was issued for it.
Alternatively, we can shut down N as soon as it becomes disconnected from the rest of the cluster, which allows us to keep C and A but drop P, because: i) N will not receive any requests; ii) all received requests will be served by the fully connected and consistent cluster, hence they will all be answered with consistent values; iii) the cluster is not partition tolerant because it does not tolerate partitions; instead, it shuts down partitioned nodes.
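The three choices for the isolated node N can be laid out as a toy simulation. Everything here (the `Policy` enum, the `Node` class, the `simulate` function and its return strings) is a hypothetical illustration of the argument, not a model of any real database.

```python
from enum import Enum


class Policy(Enum):
    ANSWER_STALE = "keep A, drop C"
    REFUSE = "keep C, drop A"
    SHUT_DOWN = "keep C and A, drop P"


class Node:
    """Toy node N that was last in sync when the value was 'v1'."""

    def __init__(self, policy):
        self.policy = policy
        self.value = "v1"
        self.partitioned = False
        self.up = True

    def partition(self):
        self.partitioned = True
        if self.policy is Policy.SHUT_DOWN:
            self.up = False  # a node that is down receives no requests

    def handle_read(self):
        if self.partitioned and self.policy is Policy.REFUSE:
            return None  # request received, but no reply is issued
        return self.value


def simulate(policy):
    """Partition N, let the rest of the cluster move on to 'v2', then query N."""
    node = Node(policy)
    node.partition()
    cluster_value = "v2"  # written on the connected side after the partition
    if not node.up:
        return "request not received by N"
    reply = node.handle_read()
    if reply is None:
        return "received, no reply"
    return "stale reply" if reply != cluster_value else "fresh reply"
```

For example, `simulate(Policy.ANSWER_STALE)` yields a stale reply, `simulate(Policy.REFUSE)` yields a received request with no reply, and `simulate(Policy.SHUT_DOWN)` means N never receives the request at all.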
In the CAP theorem, P stands for partition tolerance, which is the ability of the system to handle partitions (partitions are isolated clusters, caused by network failure or any other reason).
In a distributed network, to handle a partition the system has to pick either consistency or availability.
In the case of an RDBMS there is no chance of partitions (assuming it is not distributed, which is the normal case), so it will always be CA.
The Mongo documentation says single document writes are atomic, but in another place it mentions that interleaved transactions may read uncommitted data, even before the writer thread has returned.
I understand that other transactions can read uncommitted data because the write may not yet be committed to the journal.
But how can threads read data while the writer thread has not returned? Is it for cases when the write concern is not the default?
Thanks
Ankur
Ok with the reference I can now get a context and tell you what this is about.
Mongo documentation says single document write are atomic
Yep
it mentions interleaved transactions may read uncommitted
In fact, any read may get uncommitted data. This is because MongoDB will write to the fsync queue BEFORE it writes to disk.
MongoDB can read from this fsync queue before it goes to disk, and quoting the page:
Other database systems refer to these isolation semantics as read uncommitted.
Mainly ACID databases do.
But how can threads read data while the writer thread has not returned.
Thanks to MongoDB's concurrency rules: http://docs.mongodb.org/manual/faq/concurrency/#does-a-read-or-write-operation-ever-yield-the-lock
In short: the write will not hold an exclusive lock for its entire duration; instead it can yield (due to various rules) to a read, allowing data to be returned halfway through a write.
This is also why you must sometimes be careful about multi-document updates while other threads of your application are reading data: a reader may get one half of the data that is up to date and another half that is not.
In section 4.5 Combining Sharding and Replication of the NoSQL Distilled book, the following assertion is made:
"Using peer-to-peer replication and sharding is a common strategy for column-family databases."
The statement leaves out other types of cluster-ready databases, namely key-value and document stores. Why is this the case? Are those databases well suited for sharding, but not in conjunction with peer-to-peer replication? Is master-slave replication a better approach in those cases?
Peer-to-peer replication has more to do with the consistency model. You're making a tradeoff between fault tolerance and consistency, where a peer-to-peer model chooses the former and a master-slave model the latter. It is possible to achieve consistency through means such as using quorum reads and writes, so you can often achieve both in practice--even though the database isn't technically consistent.
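A rough sketch of the quorum idea mentioned above: with N replicas, a write acknowledged by W of them, and a read that consults R of them, the read set is guaranteed to overlap the write set whenever R + W > N. The `QuorumStore` class is a hypothetical toy; it always writes to the first W replicas and reads from the last R, which is the worst case for overlap.

```python
class QuorumStore:
    """Toy quorum store: each replica holds a (version, value) pair."""

    def __init__(self, n, w, r):
        self.n, self.w, self.r = n, w, r
        self.replicas = [(0, None)] * n
        self.version = 0

    def overlaps(self):
        """True when every read quorum must intersect every write quorum."""
        return self.r + self.w > self.n

    def write(self, value):
        self.version += 1
        for i in range(self.w):  # only the first W replicas acknowledge
            self.replicas[i] = (self.version, value)

    def read(self):
        contacted = self.replicas[self.n - self.r:]  # the last R replicas
        # Return the value carrying the highest version among those contacted.
        return max(contacted, key=lambda pair: pair[0])[1]
```

With `QuorumStore(3, 2, 2)` a read after `write("a")` returns `"a"`, because the read and write sets share one replica. With `QuorumStore(3, 2, 1)` the same read can miss the write entirely and return the old value.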
There are certainly examples of non-CF stores that use peer-to-peer replication, such as Couchbase, a document store, and Riak, a KV store. These databases perform very well and use auto-sharding in some form.