I am currently writing my bachelor thesis.
For that reason I am looking into eventual consistency in theory and how Cassandra applies this theory.
To understand my problem, consider the following definitions of consistency (as far as I understood):
Causal Consistency:
A system provides causal consistency if memory operations that
potentially are causally related are seen by every node of the system
in the same order. (wikipedia)
So if a process A writes data X to the DB, and a process B then reads X and overwrites it with Y, causal consistency is ensured if every replica (node) sees A's write of X before B's write of Y.
Read-your-write Consistency:
This is a special case of Causal Consistency in which the read and the write are performed by the same process A. This type of consistency ensures that A never sees an older version of a data object after modifying it.
Session Consistency:
In this case a process A accesses the DB within a session. As long as this session exists, the system guarantees Read-your-write Consistency.
Monotonic Read Consistency:
If a process has read a particular version of a data object, the system guarantees that no subsequent read by that process returns an older version.
Monotonic Write Consistency:
In this case the write operations to the DB are serialised, with their order determined by which process wrote first.
These are some consistency options from theory, one or more of which are implemented in NoSQL systems. Please correct me if I have misunderstood something.
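To make the Monotonic Read definition above concrete, here is a minimal client-side sketch. It is only an illustration: the `Versioned` record, its `version` field, and the `MonotonicReader` class are hypothetical names, and real systems enforce this guarantee in different ways.

```python
from dataclasses import dataclass


@dataclass
class Versioned:
    version: int  # hypothetical per-object version, e.g. a write timestamp
    value: str


class MonotonicReader:
    """Client-side guard: never hand back a version older than one already seen."""

    def __init__(self):
        self._latest = {}

    def accept(self, key, candidate):
        seen = self._latest.get(key)
        if seen is not None and candidate.version < seen.version:
            # The answering replica is behind: keep the newer value we already have.
            return seen
        self._latest[key] = candidate
        return candidate


reader = MonotonicReader()
first = reader.accept("x", Versioned(2, "Y"))   # an up-to-date replica answers
second = reader.accept("x", Versioned(1, "X"))  # a lagging replica answers later
```

The second read still yields the value with version 2, which is what monotonic reads require.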
My question is: which type of consistency does Cassandra provide?
And how are these types of consistency related to the rules "R+W>N" and "R+W<=N",
where
R=read replica count
W=write replica count
N=replication factor
I'd really appreciate a quick answer. Thank You!!!
Consistency levels in Cassandra can be set on any read or write query. This allows application developers to tune consistency on a per-query basis depending on their requirements for response time versus data accuracy. Cassandra offers a number of consistency levels for both reads and writes.
You should first understand QUORUM.
QUORUM is a good middle-ground ensuring strong consistency, yet still tolerating some level of failure.
A quorum is calculated as (rounded down to a whole number):
(replication_factor / 2) + 1
For example, with a replication factor of 3, a quorum is 2 (can tolerate 1 replica down). With a replication factor of 6, a quorum is 4 (can tolerate 2 replicas down).
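The quorum formula above can be checked with a few lines of Python. This is just a sketch of the arithmetic; the function names are made up for illustration and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    # (replication_factor / 2) + 1, rounded down to a whole number
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    # Replicas that may be down while a quorum can still be reached
    return replication_factor - quorum(replication_factor)


for rf in (3, 6):
    print(rf, quorum(rf), tolerated_failures(rf))
```

For replication factors 3 and 6 this gives quorums of 2 and 4, with 1 and 2 tolerated failures respectively, matching the examples above.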
For your question the explanation is given below
(nodes_read + nodes_written) > replication_factor
R + W > N
For example, if your application is using the QUORUM consistency level for both write and read operations and you are using a replication factor of 3, then this ensures that 2 nodes are always written and 2 nodes are always read. The combination of nodes written and read (4) being greater than the replication factor (3) ensures strong read consistency.
You can read more about quorum and consistency in detail at the link posted below.
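The reason R + W > N gives strong read consistency is that every set of replicas read must then share at least one replica with every set of replicas written. A small brute-force sketch (the function name `always_overlap` is just for illustration) checks this for small clusters:

```python
from itertools import combinations


def always_overlap(n: int, r: int, w: int) -> bool:
    """True if every choice of r read replicas intersects every choice of w written replicas."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


print(always_overlap(3, 2, 2))  # QUORUM reads and writes with N = 3
print(always_overlap(3, 1, 1))  # ONE reads and writes with N = 3
```

With N = 3, R = W = 2 always overlaps, while R = W = 1 does not: a read may hit a replica that the latest write has not reached yet.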
http://www.datastax.com/docs/1.1/dml/data_consistency
Two points I don’t understand about RDBMS being CA in CAP Theorem :
1) It says RDBMS is not Partition Tolerant but how is RDBMS any less Partition Tolerant than other technologies like MongoDB or Cassandra? Is there a RDBMS setup where we give up CA to make it AP or CP?
2) How is it CAP-Available? Is it through master-slave setup? As in when the master dies, slave takes over writes?
I’m a novice at DB architecture and CAP theorem so please bear with me.
It is very easy to misunderstand the CAP properties, hence I'm providing some illustrations to make it easier.
Consistency: A query Q will produce the same answer A regardless of which node handles the request. In order to guarantee full consistency we need to ensure that all nodes agree on the same value at all times. Not to be confused with eventual consistency, in which the network moves towards having all data consistent but there are periods of time in which it is not.
Availability: If the distributed system receives query Q it will always produce an answer for that query. This should not be confused with "high availability": it is not about having the capacity to process a higher throughput of queries, it is about not refusing to answer.
Partition Tolerance: The system continues to function despite the existence of a partition. This is not about having mechanisms to "fix" the partition, it is about tolerating the partition, i.e. continuing despite the partition.
Note that the following examples do not cover all possible scenarios. Consider the following caption:
An example for CP:
The system is partition tolerant because its nodes keep accepting requests despite the partition; it is consistent because the only nodes providing answers are those that maintain a connection to the master node that handles all the write requests; it is not available because the nodes in the other partition do not provide an answer to the queries they receive.
Examples for AP:
Either because (respectively) the slave nodes reply to requests regardless of whether they are able to reach the master, or because the slave nodes in the other partition elect a new master, or because we have a masterless cluster, availability is achieved because all queries get an answer. Consistency is dropped because both partitions are replying while potentially yielding different states.
Examples for CA:
If we disconnect nodes when a partition occurs, we can ensure that we have at most one partition, which ultimately means that the network is not partitioned anymore, or simply that there is no service at all. This is the opposite of partition tolerance, because the system is avoiding the partition instead of functioning despite it. Consistency and availability hold in these partially or fully disconnected systems because all working nodes (if any) have the same state and all received queries (if any) will get an answer; nodes that are shut down do not receive queries.
To answer the questions:
Under default configurations, databases such as Cassandra and MongoDB are partition tolerant because they do not shut down nodes to cope with partitions, whereas RDBMSs such as MySQL do.
Availability has very little to do with master/slave setup, e.g. Cassandra is masterless and very available because it doesn't really matter which node dies. As for availability in a master/slave setup, there is no reason to stop responding to all queries when master is dead, but you may need to suspend write operations while electing a new one.
A lot of databases now have different configurations, and depending on the settings you choose, a database can be CA, CP, AP, etc., but it cannot achieve all three at the same time. Some databases make an effort to support all three but still prioritize them in a certain way.
For example, MySQL can be CP or CA depending on the configuration. By default, it is CA because it follows a master-slave paradigm in which data is replicated to the slaves. Partition tolerance is sacrificed in the event that a set of the slaves loses the connection to the master and therefore decides to elect a new master, creating two masters with their own sets of slaves.
However, MySQL also has another configuration, which is a clustered configuration. It prioritizes CP over availability, e.g. the cluster will shut down if there are not enough live nodes to serve all the data.
There are probably more configurations for MySQL that make it satisfy other CAP theorem combinations, but overall I just wanted to say that it depends on what your system requires. Some databases are better suited to one configuration than another, so it is best to check what kinds of problems may occur when using a certain configuration.
As for implementing the CAP theorem, I would advise taking a further look into different databases and how they implement the priorities for the CAP theorem. There are just too many different ways of implementing them, e.g. generally the master-slave model is used for CA systems, the hash ring for AP systems, etc.
CAP theorem is problematic and it applies only to distributed database systems. When you have distributed databases then network partition and node crashes can happen. And when network partition happens you must have partition tolerance (the P of your CAP).
So to answer your question number 1) It’s either CP or AP. It can be configured as Will mentioned.
More about why partition tolerance is a must:
https://codahale.com/you-cant-sacrifice-partition-tolerance/
More about problems around CAP theorem:
https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html
I agree that RDBMS can have all the properties of CAP. I have started studying noSQL DBs and had prior experience with IBM DB2.
Here is how IBM DB2 satisfies all the 3 CAP properties
C : Consistency : Every relational database satisfies this due to the transactional nature of RDBMS.
A : Availability : Availability means that when a query is made for a data that exists, it should be returned. Again, a relational database is designed to do this easily.
P : Partition Tolerance : This is the most interesting one. From DB2 stand point, in the application that I was working on, we had 2 databases spread across different data centres. One was the primary and communicated with the secondary via heartbeats. Each of these primary and secondary databases, had 12 physical instances where data was distributed on the basis of some predefined logic. If the primary goes down, the secondary detects this and takes the place of primary. Since the primary and secondary were always maintained in sync, data remains consistent as well.
This is how I think that RDBMS satisfies all 3 properties of CAP Theorem.
I may be wrong, and open to discussion on this.
If I understand the CAP Theorem correctly, availability means that the cluster continues to operate even if a node goes down.
I've seen a lot of people (http://blog.nahurst.com/tag/guide) list RDBMS as CA, but I do not understand how an RDBMS is available, since if a node goes down, the cluster must go down to maintain consistency.
My only possible answer to this has been that most RDBMS are a single node, so there is no "non-failing" node. But, this seems to be a technicality, not true 'availability' and definitely not high availability.
Thank you.
First of all, let me clarify that consistency in an RDBMS is different from consistency in distributed systems. In an RDBMS (single system) consistency refers to transactional consistency, whereas in distributed systems consistency means that the view from anywhere in the system (a read from any node) is consistent. So a single-node RDBMS cannot be discussed with regard to the CAP theorem. It is like comparing apples to oranges.
RDBMS with master-slave can be compared to distributed systems. Here RDBMS can be configured to CA/CP or AP. MySQL for example, provides a way to configure the system in a way that if there is a quorum loss (not enough secondary available for commit log replication), the cluster is not available (CP system). MySQL also provides a configuration to allow the cluster to operate as long as master is available (CA system) with the potential of data loss. SQL Server AlwaysOn is an AP system, because commit log replication is asynchronous (even on sync replicas).
So RDBMS can be any of CA, CP or AP in a distributed world.
I believe you are misunderstanding the relation between CAP-Availability and node-UP/DOWN. Availability is about providing an answer to every received query - when a node is down it cannot receive queries, therefore if you bring down parts of or the entire cluster, the CAP-Availability property holds. Although this may sound counter intuitive at first glance, by shutting down nodes you are holding on to CAP-Availability and dropping CAP-Partition tolerance instead. I've recently posted an answer whose examples provide some clarification.
In a nutshell: A partition occurs that isolates node N. If N receives a request it can either: i) answer which grants availability but drops consistency because N is out of sync; ii) do not answer to avoid replying with an out-of-date result, thereby dropping availability because we received a request but issued no reply for it.
Alternatively we can shut down N as soon as it becomes disconnected from the rest of the cluster, which allows us to keep C and A but drop P, because: i) N will not receive any requests; ii) all received requests will be performed against the fully connected and consistent cluster, hence they will all be answered with consistent values; iii) the cluster is not partition tolerant because it does not tolerate partitions; instead it shuts down partitioned nodes.
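The three options for the isolated node N can be written down as a toy model. This is only a sketch of the reasoning above, not of how any real database is implemented; the `Policy` enum and `isolated_node_read` function are hypothetical names.

```python
from enum import Enum


class Policy(Enum):
    ANSWER_ANYWAY = "keeps A and P, drops C"
    REFUSE = "keeps C and P, drops A"
    SHUT_DOWN = "keeps C and A, drops P"


def isolated_node_read(policy, stale_value, current_value):
    """Outcome of a read aimed at N while N is cut off from the rest of the cluster."""
    if policy is Policy.ANSWER_ANYWAY:
        # N replies with whatever it has, which may be out of date.
        return {"received_by_n": True, "reply": stale_value}
    if policy is Policy.REFUSE:
        # N received the request but issues no reply.
        return {"received_by_n": True, "reply": None}
    # SHUT_DOWN: N is off, so the request is served by the connected part of the cluster.
    return {"received_by_n": False, "reply": current_value}


for policy in Policy:
    print(policy.name, isolated_node_read(policy, "old", "new"))
```

Only the first policy can return the stale value, only the second leaves a received request unanswered, and the third never lets N receive a request at all.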
In the CAP theorem, P stands for partition tolerance, which is the ability of the system to handle partitions (partitions are isolated clusters, caused by network failure or any other reason).
In a distributed network, to handle a partition the system has to pick either consistency or availability.
In the case of an RDBMS there is no chance of partitions (assuming it is not distributed, which is the normal case), so it will always be CA.
I have an application for which I am tasked with designing a mongo backed data storage.
The application goals are to provide the latest data ( no stale data ) with the fastest load times.
The data size is in the order of a few millions with the application being write heavy.
In choosing what the read strategy is given a 3-node replica set ( 1 primary, 1 secondary, 1 arbiter ), I came across two different strategies to determine where to source the reads from -
Read from the secondary to reduce load on the primary, with writeConcern = REPLICA_SAFE, thus ensuring the writes are done on both the primary and the secondary. Set the read preference to secondaryPreferred.
Always read from the primary, but ensure the data is in the primary before reading. So set writeConcern = SAFE. The read preference is the default: primaryPreferred.
What are the things to be considered before choosing one of the options?
According to the documentation REPLICA_SAFE is a deprecated term and should be replaced with REPLICA_ACKNOWLEDGED. The other problem here is that the w value here appears to be 2 from this constant.
This is a problem for your configuration, as you have your Primary and only one Secondary, combined with an arbiter. In the event of a node going down, or being otherwise unreachable, with the level set as this it is looking to acknowledge all writes from 2 nodes where there will not be 2 nodes available. You can leave write operations hanging in this way.
The better case for your configuration would be MAJORITY, as no matter the number of nodes it will ensure writes to the Primary and the "majority" of the secondaries. But in your case any write concern condition involving more than the PRIMARY will block on all writes, if one of your nodes is down or unavailable, as you would have to have at least two more secondary nodes available so that there would still be a "majority" of nodes to acknowledge the write. Or drop the ARBITER and have two SECONDARY nodes.
So you will have to stick to the default w=1 where all writes are acknowledged to the PRIMARY unless you can deal with writes failing when your one SECONDARY goes down.
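The blocking behaviour described above can be sketched as a small check. This is a deliberately simplified model and not MongoDB's actual implementation: it assumes that "majority" means more than half of the voting members and that only data-bearing members (not the arbiter) can acknowledge a write. The function name is made up for illustration.

```python
def can_acknowledge(w, data_bearing_up, voting_members):
    """Can a write with write concern w be acknowledged right now?

    w is an int or the string "majority". Arbiters vote but hold no data,
    so they count towards voting_members and never towards data_bearing_up.
    """
    needed = voting_members // 2 + 1 if w == "majority" else w
    return data_bearing_up >= needed


# 1 primary + 1 secondary + 1 arbiter, with the secondary down:
print(can_acknowledge(1, data_bearing_up=1, voting_members=3))
print(can_acknowledge(2, data_bearing_up=1, voting_members=3))
print(can_acknowledge("majority", data_bearing_up=1, voting_members=3))
```

Under these assumptions only w=1 can still be acknowledged once the single secondary is unavailable.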
You can set the read preference to secondaryPreferred as long as you accept that you can "possibly" be reading stale data, or not the latest representation of your data, as the only real guarantee is of a write to the Primary node. The general replication considerations remain, in that the nodes should be somewhat equal in processing capability or this can lead to lag or general performance degradation as a result of your query operations.
Remember that replication is implemented for redundancy and is not a system for improving performance. If you are looking for performance then perhaps look into scaling up your system hardware or implement sharding to distribute the load.
My question may sound too general, but I'm ready to give any missing data.
We make something like a social network. In order to make read performance better and to ease the life of master instance, we've set
readPreference=secondaryPreferred
in our replicaSet. But with this, there's no guarantee that the data is written to secondary instances before you read from there, so we had to set
w=3
option.
So far, everything seems to be working but measurements on my local replicaSet show the following insert statistics.
Inserting 300 objects:
w=1 - 0.10s
w=3 - 1.31s
Insertion 5000 objects:
w=1 - 0.6s
w=3 - 14.6s
The question is: is this difference expected, or am I doing something wrong?
The difference in performance is expected because w=3 means that you want to wait for acknowledgement that data was successfully replicated to at least two of your secondaries in addition to the acknowledgement from your primary (w=1).
For clarity, w = 1 simply means that you want an acknowledgement from the primary that an operation was completed. Any errors that occur, such as duplicate key errors or network errors, would be reported back as part of the acknowledgement.
http://docs.mongodb.org/manual/reference/write-concern/
Refer to the link above, and you can see there are lower write concerns that let you trade safety for lower latency.
If you want higher level of durability or safety, then you might use j=1 to wait for an acknowledgement that your operation was written to the journal (allowing recovery from a failure). w > N increases safety by waiting for acknowledgement from > N replica members to ensure that your operation was successfully replicated to other members. So to be clear, w > 1 isn't necessary to instruct the driver to write to the replicas. If you decided to use w=N, be aware that you can get yourself in a bad situation if replica set members fail and fall below N. w = majority is a more flexible option.
Lastly, you may want to re-evaluate why you're reading from the secondaries. Secondaries are eventually consistent as MongoDB uses asynchronous replication. If you're expecting consistent reads, then it makes more sense to read from the primary. If your reason to read from the secondary is scaling, you should consider sharding, as this is the primary mechanism for scale-out. Distributing load on secondaries rarely improves scalability. Operations are replicated over to the replicas, so you're not gaining much from a lower write load. Sometimes it makes sense for distributing different types of workloads (which may lead to better memory utilization). For instance, running a MR job on a secondary might make sense. Replica sets are primarily for high availability: fault tolerance through automatic fail-over and handling of network partition issues.
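As a rough illustration of the settings discussed in this answer, the write concern and read preference can be given as connection string options (`w`, `journal`, `readPreference`, `replicaSet`). The sketch below only builds such a string; the host names and the replica set name `rs0` are placeholders, and `build_uri` is a hypothetical helper rather than a driver function.

```python
from urllib.parse import urlencode


def build_uri(hosts, replica_set, w="majority", journal=True, read_preference="primary"):
    # Option names follow the MongoDB connection string format;
    # all values passed in here are illustrative placeholders.
    options = {
        "replicaSet": replica_set,
        "w": w,
        "journal": str(journal).lower(),
        "readPreference": read_preference,
    }
    return "mongodb://" + ",".join(hosts) + "/?" + urlencode(options)


uri = build_uri(["db1.example.com:27017", "db2.example.com:27017"], "rs0")
print(uri)
```

Passing `w=1, journal=False` instead would give the lower-latency, lower-safety setting measured in the question.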
I am currently thinking about how to implement authentication for a web application with a NoSQL solution. The problem I encounter is that most NoSQL solutions (e.g. Cassandra, MongoDB) may have delayed writes. For example, we write on node A, but it is not guaranteed that the write appears on node B at the same time. This follows logically from the approaches behind the NoSQL solutions.
Now one idea would be that you do no secondary reads (so everything goes over a master). This would probably work in MongoDB (where you actually have a master) but not in Cassandra (where all nodes are equal). But our application runs at several independent points all over the world, so we need multi master capability.
At the moment I am not aware of a solution with Cassandra where I could update data and be sure that subsequent reads (to all of the nodes) do have the change. So how could one build an authentication on top of those NoSQL solutions where the authentication request (read) could appear on several nodes in parallel?
Thanks for your help!
With respect to Apache Cassandra:
http://wiki.apache.org/cassandra/API#ConsistencyLevel
The ConsistencyLevel is an enum that controls both read and write behavior based on the ReplicationFactor in your schema definition. The different consistency levels have different meanings, depending on whether you're doing a write or a read operation. Note that if W + R > ReplicationFactor, where W is the number of nodes to block for on write, and R the number to block for on reads, you will have strongly consistent behavior; that is, readers will always see the most recent write. Of these, the most interesting is to do QUORUM reads and writes, which gives you consistency while still allowing availability in the face of node failures up to half of ReplicationFactor. Of course, if latency is more important than consistency then you can use lower values for either or both.
This is managed on the application side. To your question specifically, it comes down to how you design your Cassandra implementation, replication factor across the Cassandra nodes and how your application behaves on read/writes.
Write
ANY: Ensure that the write has been written to at least 1 node, including HintedHandoff recipients.
ONE: Ensure that the write has been written to at least 1 replica's commit log and memory table before responding to the client.
QUORUM: Ensure that the write has been written to N / 2 + 1 replicas before responding to the client.
LOCAL_QUORUM: Ensure that the write has been written to ReplicationFactor / 2 + 1 nodes within the local datacenter (requires NetworkTopologyStrategy).
EACH_QUORUM: Ensure that the write has been written to ReplicationFactor / 2 + 1 nodes in each datacenter (requires NetworkTopologyStrategy).
ALL: Ensure that the write is written to all N replicas before responding to the client. Any unresponsive replicas will fail the operation.
Read
ANY: Not supported. You probably want ONE instead.
ONE: Will return the record returned by the first replica to respond. A consistency check is always done in a background thread to fix any consistency issues when ConsistencyLevel.ONE is used. This means subsequent calls will have correct data even if the initial read gets an older value. (This is called ReadRepair)
QUORUM: Will query all replicas and return the record with the most recent timestamp once it has at least a majority of replicas (N / 2 + 1) reported. Again, the remaining replicas will be checked in the background.
LOCAL_QUORUM: Returns the record with the most recent timestamp once a majority of replicas within the local datacenter have replied.
EACH_QUORUM: Returns the record with the most recent timestamp once a majority of replicas within each datacenter have replied.
ALL: Will query all replicas and return the record with the most recent timestamp once all replicas have replied. Any unresponsive replicas will fail the operation.
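For the authentication use case in the question, the levels listed above can be combined with the W + R > ReplicationFactor rule to see which read/write pairs guarantee that a read sees the latest write. The sketch below covers only the single-datacenter levels ONE, QUORUM and ALL, and the helper names are made up for illustration (they are not part of the Cassandra API).

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for a single-datacenter level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"level not covered by this sketch: {level}")


def strongly_consistent(read_level: str, write_level: str, replication_factor: int) -> bool:
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor


for read_level in ("ONE", "QUORUM", "ALL"):
    for write_level in ("ONE", "QUORUM", "ALL"):
        print(read_level, write_level, strongly_consistent(read_level, write_level, 3))
```

With a replication factor of 3, QUORUM/QUORUM and any pair involving ALL satisfy the rule, while ONE/ONE and ONE/QUORUM do not.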