NoSQL: What does it mean for MongoDB or BigTable to not always be "Available"?

Reading Nathan Hurst's Visual Guide to NoSQL Systems, I see he includes the CAP triangle:
Consistency
Availability
Partition Tolerance
He lists SQL Server as a CA system and MongoDB as a CP system.
These definitions come from UC Berkeley professor Eric Brewer and his talk at PODC 2000 (Principles of Distributed Computing):
Availability
Availability means just that - the service is available
(to operate fully or not as above). When you buy the book you want to
get a response, not some browser message about the web site being
uncommunicative. Gilbert & Lynch in their proof of CAP Theorem make
the good point that availability most often deserts you when you need
it most - sites tend to go down at busy periods precisely because they
are busy. A service that's available but not being accessed is of no
benefit to anyone.
What does it mean, in the context of MongoDB, or BigTable, for the system to not be "available"?
Do you go to connect (e.g. over TCP/IP) and the server does not respond? Do you attempt to execute a query, but the query never returns, or returns an error?
What does it mean to not be available?

Availability in this case means that in the event of a network partition, the server that a client connects to may not be able to guarantee the level of consistency that the client expects (or that the system is configured to provide).
Assume that you have 3 nodes, A, B, and C, in a hypothetical distributed system. A, B, and C are each running in their own rack of servers, with 2 switches between them:
[Node A] <- Switch #1 -> [Node B] <- Switch #2 -> [ Node C ]
Now assume that said system is set up so that it is GUARANTEED that any write will go to at least 2 nodes before it is considered committed. Now, let's assume that switch #2 gets unplugged, and some client is connected to node C:
[Node A] <- Switch #1 -> [Node B] [ Node C ] <-- Some client
That client will not be able to issue Consistent writes, because the distributed system is currently in a partitioned state (namely, Node C cannot contact enough other nodes to guarantee the 2-node consistency required).
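To make the scenario concrete, here is a minimal pymongo sketch of a client demanding 2-node acknowledgement; the hostnames, replica set name, and db/collection names are made up for illustration, not part of the original answer.

```python
# Hedged sketch: a client of node C requiring 2-node acknowledgement.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern
from pymongo.errors import WTimeoutError

client = MongoClient("mongodb://nodeC.example:27017/?replicaSet=rs0")
orders = client.shop.get_collection(
    "orders", write_concern=WriteConcern(w=2, wtimeout=5000))

try:
    orders.insert_one({"item": "book", "qty": 1})
except WTimeoutError:
    # With switch #2 unplugged, node C cannot reach a second node, so the
    # write can never satisfy w=2; the client gets an error instead of a
    # silent, possibly inconsistent success.
    print("write could not reach 2 nodes; consistency cannot be guaranteed")
```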
I'd add to this that some NoSQL databases allow very dynamic selection of CAP attributes. Cassandra, for instance, allows clients to specify, on a per-write basis, the number of servers that a write must reach before it is committed. Writes going to a single server are "AP"; writes going to a quorum of (or all) servers are more "CA".
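As a hedged sketch of that per-write choice, using the DataStax Python driver (the keyspace and table names are invented for illustration):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("shop")  # keyspace is made up

# Leaning "AP": acknowledged once a single replica has the write.
fast_write = SimpleStatement(
    "INSERT INTO purchases (id, item) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.ONE)

# Leaning "CA": acknowledged only after a quorum of replicas has it.
safe_write = SimpleStatement(
    "INSERT INTO purchases (id, item) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)

session.execute(fast_write, (1, "book"))
session.execute(safe_write, (2, "lamp"))
```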
EDIT - from the comments below:
In MongoDB you can only have master/slave configuration within a replica set. What this means is that the choice of AP vs CP is made by the client at query time. The client can specify slaveOk, which will read from an arbitrarily selected slave (which may have stale data): mongodb.org/display/DOCS/…. If the client is not OK with stale data, don't specify slaveOk and the query will go to the master. If the client cannot reach the master, then you'll get an error. I'm not sure exactly what that error will be.

The CAP theorem applies to distributed computer systems. MongoDB supports two distinct forms of distributed computing: sharding for horizontal scaling and replica sets for failover/high availability. The two can be used together or independently. I think the CAP theorem applies slightly differently to the two forms:
Sharding level - MongoDB stores data on at most one authoritative shard.
Strong Consistency: A piece of data exists on at most one shard. Incorrect/stale data does not exist.
Strong Partition-tolerance: Even if the network is partitioned, requests never return incorrect/stale data. Shards continue working independently of other shards.
Weak Availability: Reads/writes of data on a downed shard will fail.
Replica set level - MongoDB replicates data within a shard, ensuring consistency via a single, authoritative primary node.
Strong Consistency: All reads/writes handled by the primary node.
Strong Partition-tolerance: If enough nodes become unreachable, a new primary is elected. The election process ensures there is always at most one primary node.
Weak Availability: Reads/writes will fail when no primary exists, even though the data could be accessed via secondary nodes.
The slaveOK/ReadPreference.SECONDARY option sacrifices some consistency (stale data can be read) for increased performance and availability.
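For illustration, a short pymongo sketch of that trade-off (the hostname and db/collection names are placeholders; ReadPreference.SECONDARY is the modern equivalent of the old slaveOk flag):

```python
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://nodeA.example:27017/?replicaSet=rs0")

# Consistent: reads go to the primary (pymongo's default).
fresh = client.shop.orders.find_one({"item": "book"})

# Available but possibly stale: reads may hit an arbitrary secondary.
stale_ok = client.shop.get_collection(
    "orders", read_preference=ReadPreference.SECONDARY)
maybe_stale = stale_ok.find_one({"item": "book"})
```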

Sharding with replication
I have a multi-tenant database with 3 tables (store, products, purchases) across 5 server nodes. Suppose I have 3 stores in my store table and I am going to shard it by storeId.
I need the data for all shards (1, 2, 3) available on nodes 1 and 2. But node 3 would contain only the shard for store #1, node 4 only the shard for store #2, and node 5 only the shard for store #3. It is like sharding with 3 replicas.
Is this possible at all? Which database engines can be used for this purpose (preferably SQL DBs)? Do you have any experience with this?
Regards
I have a feeling you have not adequately explained why you are trying this strange topology.
Anyway, I will point out several things relating to MySQL/MariaDB.
A Galera cluster already embodies multiple nodes (minimum of 3), but does not directly support "sharding". You can have multiple Galera clusters, one per "shard".
As with my comment about Galera, other forms of MySQL/MariaDB can have replication between nodes of each shard.
If you are thinking of having a server with all data, but replicating only parts to read-only replicas, there are settings such as replicate-do-db/replicate-ignore-db. I emphasize "read-only" because changes to these pseudo-shards cannot easily be sent back to the Primary server. (However, see "multi-source replication".)
Sharding is used primarily when there is simply too much traffic to handle on a single server. Are you saying that the 3 tenants cannot coexist because of excessive writes? (Excessive reads can be handled by replication.)
A tentative solution:
Have all data on all servers. Use the same Galera cluster for all nodes.
Advantage: When "most" or all of the network is working all data is quickly replicated bidirectionally.
Potential disadvantage: If half or more of the nodes go down, you have to manually step in to get the cluster going again.
Likely solution for the 'disadvantage': "Weight" the nodes differently. Give a high weight to the 3 in HQ; give a much smaller (but non-zero) weight to each branch node. That way, most of the branches could go offline without losing the system as a whole.
But... I fear that an offline branch node will automatically become readonly.
Another plan:
Switch to NDB. The network is allowed to be fragile; consistency is maintained by "eventual consistency" instead of the "[virtually] synchronous replication" of Galera+InnoDB.
NDB allows you to write immediately on any node. The write is then sent to the other nodes; if there is a conflict, one of the values is declared the "winner". You choose the algorithm for determining the winner; an easy-to-understand one is "whichever write was 'first'".
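As a toy sketch of the "pick a winner" idea (this is not NDB's actual mechanism; the timestamp-plus-tie-breaker rule is an assumption for illustration):

```python
# Toy conflict resolver: each write is (timestamp, node_id, value).
# "Whichever write was 'first'" wins; node_id breaks exact timestamp
# ties deterministically so every node picks the same winner.
def resolve(write_a, write_b):
    return min(write_a, write_b)

winner = resolve((1700000000.12, "node-1", "x=1"),
                 (1700000000.34, "node-2", "x=2"))
print(winner)  # -> the node-1 write, since it happened first
```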

How is the MongoDB primary selected when the environment is restarted?

I have set up a MongoDB cluster with 3 servers, say A, B, and C. My application writes a lot of data to the db, hence the oplog size is always very high.
When the environment is restarted, Server A is primary (we always run rs.add(a), rs.add(B), rs.add(C) on A). Then Server C becomes primary for some reason, maybe due to a server reboot or a temporary connection loss. The application then writes data to Server C, while Server A is still trying to stay in sync with Server C.
Now the environment is restarted; my understanding is that Server C should be chosen as primary. However, Server A is being chosen as primary.
Could anyone please explain why Server A is chosen as primary?
Assuming that you are using majority write concern for data that you actually care about, Server C would not have acknowledged the writes from the application until at least one of the secondary nodes also wrote the data.
A 3-node replica set needs to have at least 2 nodes online in order to elect a primary, and the election process will favor the node(s) with the most recent timestamp.
If nodes A and B came online slightly faster than C, they might complete an election before C is available.
In this situation, if you were using majority write concern there would be no data loss as either A or B would also contain everything that had been written to and acknowledged by C.
However, if you were using write concern w:1 or w:0, there might be writes that C acknowledged that neither A nor B contain. When A and B come online, they would still elect a primary, because neither knows anything about the additional data on C. When C then becomes available, if there have been any new writes in the meantime, C will have to roll back its extra writes in order to rejoin the replica set.
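A minimal pymongo sketch of the safer option described above (hosts and db/collection names are placeholders): with w="majority", a write is only acknowledged once a majority of nodes have it, so it would survive the election scenario above.

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient(
    "mongodb://A.example,B.example,C.example/?replicaSet=rs0")

# Acknowledged only after a majority (2 of 3) of nodes have the write,
# so an election cannot pick a primary that is missing it.
events = client.app.get_collection(
    "events", write_concern=WriteConcern(w="majority"))
events.insert_one({"type": "order", "id": 42})
```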
As to why Server A became primary in your cluster, check the mongod logs. In modern versions of MongoDB, election information is logged, so you will be able to determine which nodes participated in the election and which nodes voted for Server A.
If you want a specific server to be preferred for the primary role, you can set member priorities accordingly. Absent priorities, MongoDB is free to have any of the members of the replica set as the primary.
Setting priorities does not guarantee that the server you want is ALWAYS going to be the primary: if this server gets restarted, some other node will become primary, and when this server comes back up it will not be eligible to be primary again until it has synced with the current primary.
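For reference, a hedged sketch of setting member priorities through pymongo (the member index, priority value, and hostname are assumptions; the same thing is usually done from the mongo shell with rs.reconfig()):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://A.example:27017/?replicaSet=rs0")

# Fetch the current replica set config, bump one member's priority,
# and submit the reconfiguration (which requires a version bump).
cfg = client.admin.command("replSetGetConfig")["config"]
cfg["members"][0]["priority"] = 2      # prefer server A (member 0)
cfg["version"] += 1
client.admin.command({"replSetReconfig": cfg})
```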

Why isn't RDBMS Partition Tolerant in CAP Theorem and why is it Available?

Two points I don't understand about RDBMS being CA in the CAP theorem:
1) It says an RDBMS is not partition tolerant, but how is an RDBMS any less partition tolerant than other technologies like MongoDB or Cassandra? Is there an RDBMS setup where we give up CA to make it AP or CP?
2) How is it CAP-available? Is it through a master-slave setup? As in, when the master dies, the slave takes over writes?
I’m a novice at DB architecture and CAP theorem so please bear with me.
It is very easy to misunderstand the CAP properties, so here are some examples to make each one easier to grasp.
Consistency: A query Q will produce the same answer A regardless of the node that handles the request. In order to guarantee full consistency we need to ensure that all nodes agree on the same value at all times. Not to be confused with eventual consistency, in which the network moves towards having all data consistent but there are periods of time in which it is not.
Availability: If the distributed system receives query Q, it will always produce an answer for that query. This should not be confused with "high availability"; this is not about having the capacity to process a higher throughput of queries, it is about not refusing to answer.
Partition Tolerance: The system continues to function despite the existence of a partition. This is not about having mechanisms to "fix" the partition, it is about tolerating the partition, i.e. continuing despite the partition.
Note that the following examples do not cover all possible scenarios.
An example for CP:
The system is partition tolerant because its nodes keep accepting requests despite the partition; it is consistent because the only nodes providing answers are those that maintain a connection to the master node that handles all the write requests; it is not available because the nodes in the other partition do not provide an answer to the queries they receive.
Examples for AP:
Either because (respectively) we have the slave nodes replying to requests regardless of whether they are able to reach the master, or because the slave nodes in the other partition elect a new master, or because we have a masterless cluster, availability is achieved because all questions get an answer. Consistency is dropped because both partitions keep replying while potentially yielding different states.
Examples for CA:
If we disconnect nodes when a partition occurs, we can ensure that we have at most one partition, which ultimately means that the network is not partitioned anymore, or simply that there is no service at all. This is the opposite of partition tolerance, because the system is avoiding the partition instead of functioning despite it. Consistency and availability hold in these partially or fully disconnected systems because all working nodes (if any) have the same state and all received queries (if any) will get an answer; shut-down nodes do not receive queries.
To answer the questions:
Under default configurations, databases such as Cassandra and MongoDB are partition tolerant because they do not shut down nodes to cope with partitions, whereas RDBMS such as MySQL do.
Availability has very little to do with a master/slave setup; e.g. Cassandra is masterless and very available because it doesn't really matter which node dies. As for availability in a master/slave setup, there is no reason to stop responding to all queries when the master is dead, but you may need to suspend write operations while electing a new one.
A lot of databases now actually have different configurations, and depending on the settings you choose, a system can be CA, CP, AP, etc., but cannot achieve all three at the same time. Some databases actually make an effort to support all three, but still prioritize them in a certain way.
For example, MySQL can be CP or CA depending on the configuration. By default, it is CA because it follows a master-slave paradigm in which data is replicated to the slaves. Partition tolerance is sacrificed in the event that a set of slaves loses the connection to the master and therefore decides to elect a new master, creating two masters with their own sets of slaves.
However, MySQL also has a clustered configuration. It prioritizes CP over availability; e.g. the cluster will shut down if there are not enough live nodes to serve all the data.
There are probably more configurations for MySQL that satisfy other CAP combinations, but overall I just wanted to say that it depends on what your system requires. Sometimes databases are better suited to one configuration than another, so it's best to consider what kinds of problems may occur when using a certain configuration.
As for implementing the CAP theorem, I would advise taking a further look into different databases and how they implement the priorities of the CAP theorem. There are just too many different ways of implementing them; e.g. generally, the master-slave model is used for CA systems, the hash ring for AP systems, etc.
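To illustrate the "hash ring" part, here is a minimal consistent-hashing sketch in Python; the virtual-node count and hash choice are arbitrary assumptions, and real systems such as Cassandra use more elaborate schemes.

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: a key goes to the first virtual
    node found clockwise from the key's position on the ring."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []                      # sorted (hash, node) pairs
        for node in nodes:
            for i in range(vnodes):
                bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        idx = bisect.bisect(self.ring, (self._hash(key), ""))
        return self.ring[idx % len(self.ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # e.g. "node-b"
```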
The CAP theorem is problematic, and it applies only to distributed database systems. When you have distributed databases, network partitions and node crashes can happen. And when a network partition happens, you must have partition tolerance (the P of CAP).
So, to answer question 1): it's either CP or AP. It can be configured either way, as Will mentioned.
More about why partition tolerance is a must:
https://codahale.com/you-cant-sacrifice-partition-tolerance/
More about problems around CAP theorem:
https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html
I agree that an RDBMS can have all the properties of CAP. I have started studying NoSQL DBs and had prior experience with IBM DB2.
Here is how IBM DB2 satisfies all 3 CAP properties:
C : Consistency : Every relational database satisfies this due to the transactional nature of RDBMS.
A : Availability : Availability means that when a query is made for data that exists, a response should be returned. Again, a relational database is designed to do this easily.
P : Partition Tolerance : This is the most interesting one. From a DB2 standpoint, in the application that I was working on, we had 2 databases spread across different data centres. One was the primary and communicated with the secondary via heartbeats. Each of these primary and secondary databases had 12 physical instances where data was distributed on the basis of some predefined logic. If the primary goes down, the secondary detects this and takes its place. Since the primary and secondary were always kept in sync, data remains consistent as well.
This is how I think that RDBMS satisfies all 3 properties of CAP Theorem.
I may be wrong, and I am open to discussion on this.

Why are RDBMS considered Available (CA) for CAP Theorem

If I understand the CAP Theorem correctly, availability means that the cluster continues to operate even if a node goes down.
I've seen a lot of people (http://blog.nahurst.com/tag/guide) list RDBMS as CA, but I do not understand how an RDBMS is available, since if a node goes down, the cluster must go down to maintain consistency.
My only possible answer to this has been that most RDBMS are a single node, so there is no "non-failing" node. But, this seems to be a technicality, not true 'availability' and definitely not high availability.
Thank you.
First of all, let me clarify that consistency in an RDBMS is different from consistency in distributed systems. In an RDBMS (single system), consistency refers to transactional consistency, whereas in distributed systems consistency means that the view from anywhere in the system (a read from any node) is consistent. So a single-node RDBMS cannot be discussed with regard to the CAP theorem. It is like comparing apples to oranges.
An RDBMS with master-slave replication can be compared to distributed systems. Here an RDBMS can be configured as CA, CP, or AP. MySQL, for example, provides a way to configure the system so that if there is a quorum loss (not enough secondaries available for commit-log replication), the cluster is not available (a CP system). MySQL also provides a configuration that allows the cluster to operate as long as the master is available (a CA system), with the potential for data loss. SQL Server AlwaysOn is an AP system, because commit-log replication is asynchronous (even on sync replicas).
So RDBMS can be any of CA, CP or AP in a distributed world.
I believe you are misunderstanding the relation between CAP-Availability and node-UP/DOWN. Availability is about providing an answer to every received query - when a node is down it cannot receive queries, therefore if you bring down parts of or the entire cluster, the CAP-Availability property holds. Although this may sound counter intuitive at first glance, by shutting down nodes you are holding on to CAP-Availability and dropping CAP-Partition tolerance instead. I've recently posted an answer whose examples provide some clarification.
In a nutshell: A partition occurs that isolates node N. If N receives a request, it can either: i) answer, which grants availability but drops consistency because N is out of sync; or ii) not answer, to avoid replying with an out-of-date result, thereby dropping availability because we received a request but issued no reply for it.
Alternatively, we can shut down N as soon as it becomes disconnected from the rest of the cluster, which allows us to keep C and A but drop P, because: i) N will not receive any requests; ii) all received requests will be served by the fully connected and consistent cluster, hence they will all be answered with consistent values; iii) the cluster is not partition tolerant because it does not tolerate partitions; instead, it shuts down partitioned nodes.
In the CAP theorem, P is for partition tolerance, which is the ability of the system to handle partitions (isolated sub-clusters caused by network failure or any other reason).
In a distributed network, to handle a partition the system has to pick either consistency or availability.
In the case of an RDBMS there is no chance of partitions (assuming it is not distributed, which is the normal case), so those will always be CA.

High Availability In MongoDB

Everybody says that MongoDB is CP in the CAP theorem! But by using master-slave replication it has high availability too (if a primary fails, the remaining members will automatically try to elect a new primary). My question is: in which situations (and how) can it have AP (with eventual consistency)?
Actually, there is a two-part answer:
Sharding level: There's only one authoritative shard per data segment (C); shards work independently (P); if a shard is not available, its data isn't available either (A).
Replica set level: There's only one authoritative master node (C); if required, a new master will be elected (P); if there's no primary (during the voting phase, which should only last a few seconds, but that's enough), you cannot access the data on that node (A). If you enable reading from secondaries (eventual consistency), you can read data from secondaries during the voting phase, but still not write new data. Thus it's a CP system.
In general, you don't lose the third characteristic entirely, but trade it for additional latency / overhead or for not having it for a short amount of time.
It is not possible to have C, A, and P all together; that is the theorem. You can't have them all; you can pick only two.
See:
Where does mongodb stand in the CAP theorem?