High Availability In MongoDB - mongodb

Everybody say that mongoDB is CP in CAP Theorem! But with using master-slave replication, It has high availability too (If a primary fails, the remaining members will automatically try to elect a new primary). My question is, In which situations (and how) It can have AP (with Eventual Consistency)?

Actually, there is a two-part answer:
Sharding level: There's only one authoritative shard per data segment (C), shards work independently (P), if a shard is not available its data isn't available as well (A)
Replica set level: There's only one authoritative master node (C), if required a new master will be selected (P), if there's no primary (during the voting phase which should only last a few seconds, but that's enough) you cannot access the data on that node. If you enable reading from secondaries (eventual consistency), you can read data from secondaries during the voting phase, but still not write new data. Thus it's a CP system.
In general, you don't lose the third characteristic entirely, but trade it for additional latency / overhead or for not having it for a short amount of time.

it is not quite possible to have C,A,P all together, that is the theorem, you cant have them all, you can take only two.
See :
Where does mongodb stand in the CAP theorem?

Related

Why isn't RDBMS Partition Tolerant in CAP Theorem and why is it Available?

Two points I don’t understand about RDBMS being CA in CAP Theorem :
1) It says RDBMS is not Partition Tolerant but how is RDBMS any less Partition Tolerant than other technologies like MongoDB or Cassandra? Is there a RDBMS setup where we give up CA to make it AP or CP?
2) How is it CAP-Available? Is it through master-slave setup? As in when the master dies, slave takes over writes?
I’m a novice at DB architecture and CAP theorem so please bear with me.
It is very easy to misunderstand the CAP properties, hence I'm providing some illustrations to make it easier.
Consistency: A query Q will produce the same answer A regardless the node that handles the request. In order to guarantee full consistency we need to ensure that all nodes agree on the same value at all times. Not to be confused with eventual consistency in which the network moves towards having all data consistent but there are periods of time in which it is not.
Availability: If the distributed system receives query Q it will always produce an answer for that query. This should not be confused with "high-availability", this is not about having the capacity to process a higher troughput of queries, it is about not refusing to answer.
Partition Tolerance: The system continues to function despite the existence of a partition. This is not about having mechanisms to "fix" the partition, it is about tolerating the partition, i.e. continuing despite the partition.
Note that the following examples do not cover all possible scenarios. Consider the following caption:
An example for CP:
The system is partition tolerant because its nodes keep accepting requests despite the partition; it is consistent because the only nodes providing answers are those that maintain a connection to the master node that handles all the write requests; it is not available because the nodes in the other partition do not provide an answer to the queries they receive.
Examples for AP:
Either because (respectively) we have the slave nodes replying to requests regardless whether they able to reach master or because the slave nodes in the other partition elect a new master, or because we have a masterless cluster, availability is achieved because all questions are getting an answer - consistency is dropped because both partitions are replying while potentially yielding different states.
Examples for CA:
If we disconnect nodes when a partition occurs, we can ensure that we have at most one partition which ultimately means that the network is not partitioned anymore, or simply there is no service at all. This is the opposite of partition tolerance, because the system is avoiding the partition instead of functioning despite it. Consistency and availability holds in these partially or fully disconnected systems because all working nodes (if any) have the same state and all received queries (if any) will get an answer - shutdown nodes do not receive queries.
To answer the questions:
Under default configurations, databases such as Cassandra and MongoDB are partition tolerant because they do not shutdown nodes to cope with partitions, whereas RDBMS such as MySQL do.
Availability has very little to do with master/slave setup, e.g. Cassandra is masterless and very available because it doesn't really matter which node dies. As for availability in a master/slave setup, there is no reason to stop responding to all queries when master is dead, but you may need to suspend write operations while electing a new one.
A lot of databases now actually have different configurations and depending on the settings you set, it can be either CA, CP, AP, etc but can not achieve all three at the same time. Some databases actually make an effort to support all three but still prioritizes them in a certain way.
For example, MySQL can be CP and CA depending on the configurations. By default, it is CA because it follows a master slave paradigm which data is replicated to the slaves. Partition tolerance is sacrificed in the event that a set of the slaves loses the connection to the master and therefore decides to elect a new master creating two masters with their own set of slaves.
However, MySQL also has another configuration which is a clustered configuration. It prioritizes CP over availability eg. the cluster will shutdown if there are not enough live nodes to serve all the data.
There are probably more configurations for MySQL that makes it satisfy other CAP theorem combinations but overall, I just wanted say that it depends on what your system requires. Sometimes databases are better for one configuration vs another so its best to see what kinds of problems that may also occur in using a certain configuration.
As for implementing the CAP theorem, I would advise taking a further look into different databases and how they implement the priorities for the CAP theorem. There are just too many different ways of implementing them eg. generally, the master slave model is used for CA systems, the hash ring for AP systems, etc.
CAP theorem is problematic and it applies only to distributed database systems. When you have distributed databases then network partition and node crashes can happen. And when network partition happens you must have partition tolerance (the P of your CAP).
So to answer your question number 1) It’s either CP or AP. It can be configured as Will mentioned.
More about why partition tolerance is a must:
https://codahale.com/you-cant-sacrifice-partition-tolerance/
More about problems around CAP theorem:
https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html
I agree that RDBMS can have all the properties of CAP. I have started studying noSQL DBs and had prior experience with IBM DB2.
Here is how IBM DB2 satisfies all the 3 CAP properties
C : Consistency : Every relational database satisfies this due to the transactional nature of RDBMS.
A : Availability : Availability means that when a query is made for a data that exists, it should be returned. Again, a relational database is designed to do this easily.
P : Partition Tolerance : This is the most interesting one. From DB2 stand point, in the application that I was working on, we had 2 databases spread across different data centres. One was the primary and communicated with the secondary via heartbeats. Each of these primary and secondary databases, had 12 physical instances where data was distributed on the basis of some predefined logic. If the primary goes down, the secondary detects this and takes the place of primary. Since the primary and secondary were always maintained in sync, data remains consistent as well.
This is how I think that RDBMS satisfies all 3 properties of CAP Theorem.
I may be wrong, and open to discussion on this.

Can someone give me detailed technical reasons why writing to a secondary in MongoDB replica set is not allowed

I know we can't write to a secondary in MongoDB. But I can't find any technical reason why. In my case, I don't really care if there is a slight delay but write to a secondary might be faster. Please provide some reference if you can. Thanks!!
The reason why you can not write to a secondary is the way replication works:
Secondaries connect to a special collection on the primary, called oplog. This oplog contains operations which were run through the query optimizer. Basically, the oplog is a capped collection, and the secondaries use a tailable cursor to access it's entries and processes it from the oldest to the newest.
When a election takes place because the primary goes down / steps down, the secondary with the most recent oplog entry is elected primary. The secondaries connect to the new primary, query for the oplog entries they haven't processed yet and the cluster is in sync.
This procedure is pretty straight forward. Now imagine one could write to a secondary. All nodes in the cluster would have to have a tailable cursor on all other nodes of the cluster, and maintaining a consistent state in case of one machine failing becomes a very complicated and in case of a failure even race condition dependent thing. Effectively, there could be no guarantee even for eventual consistency any more. It would be a more or less a gamble.
That being said: A replica set is not for load balancing. A replica sets purpose is to enhance the availability and durability of the data. Because reading from a secondary is a non-risky thing, MongoDB made it possible, according to their dogma of offering the maximum of possible features without compromising scalability (which would be severely hampered if one could write to secondaries).
But MongoDB does provide a load balancing feature: sharding. Choosing the right shard key, you can distribute read and write load over (almost) as many shards as you want. Not to mention that you can provide a lot more of the precious RAM for a reasonable price when sharding.
There is a one liner answer:
Multi-master replication is a hairball.
If you was allowed to write to secondaries MongoDB would have to use milti-master replication to ge this working: http://en.wikipedia.org/wiki/Multi-master_replication where essentially evey node copies to each other the OPs (operations) they have received and somehow do it without losing data.
This form of replication has many obsticles to overcome.
One would be throughput; remember that OPs need to transfer across the entire network so it is possible you might actually lose throughput while adding consistentcy problems. So getting better throughput would be a problem. It is much having a secondary, taking all of the primaries OPs and then its own for replication outbound and then asking it to do yet another job.
Adding consistentcy over a distributed set like this would also be hazardous, one main question that bugs MongoDB when asking if a member is down or is: "Is it really down or just unavailable?". It is almost impossible to ensure true consistentcy in a distributed set like this, at the very least tricky.
Those are just two problems immediately.
Essentially, to sum up, MongoDB does not yet possess mlti-master replication. It could in the future but I would not be jumping for joy if it does, I will most likely ignore such a feature, normal replication and sharding in both ACID and non-ACID databases causes enough blood pressure.

Why is MongoDB supposed to Consistent & Partition tolerant but not Available [duplicate]

Everywhere I look, I see that MongoDB is CP.
But when I dig in I see it is eventually consistent.
Is it CP when you use safe=true? If so, does that mean that when I write with safe=true, all replicas will be updated before getting the result?
MongoDB is strongly consistent by default - if you do a write and then do a read, assuming the write was successful you will always be able to read the result of the write you just read. This is because MongoDB is a single-master system and all reads go to the primary by default. If you optionally enable reading from the secondaries then MongoDB becomes eventually consistent where it's possible to read out-of-date results.
MongoDB also gets high-availability through automatic failover in replica sets: http://www.mongodb.org/display/DOCS/Replica+Sets
I agree with Luccas post. You can't just say that MongoDB is CP/AP/CA, because it actually is a trade-off between C, A and P, depending on both database/driver configuration and type of disaster: here's a visual recap, and below a more detailed explanation.
Scenario
Main Focus
Description
No partition
CA
The system is available and provides strong consistency
partition, majority connected
AP
Not synchronized writes from the old primary are ignored
partition, majority not connected
CP
only read access is provided to avoid separated and inconsistent systems
Consistency:
MongoDB is strongly consistent when you use a single connection or the correct Write/Read Concern Level (Which will cost you execution speed). As soon as you don't meet those conditions (especially when you are reading from a secondary-replica) MongoDB becomes Eventually Consistent.
Availability:
MongoDB gets high availability through Replica-Sets. As soon as the primary goes down or gets unavailable else, then the secondaries will determine a new primary to become available again. There is an disadvantage to this: Every write that was performed by the old primary, but not synchronized to the secondaries will be rolled back and saved to a rollback-file, as soon as it reconnects to the set(the old primary is a secondary now). So in this case some consistency is sacrificed for the sake of availability.
Partition Tolerance:
Through the use of said Replica-Sets MongoDB also achieves the partition tolerance: As long as more than half of the servers of a Replica-Set is connected to each other, a new primary can be chosen. Why? To ensure two separated networks can not both choose a new primary. When not enough secondaries are connected to each other you can still read from them (but consistency is not ensured), but not write. The set is practically unavailable for the sake of consistency.
As a brilliant new article showed up and also some awesome experiments by Kyle in this field, you should be careful when labeling MongoDB, and other databases, as C or A.
Of course CAP helps to track down without much words what the database prevails about it, but people often forget that C in CAP means atomic consistency (linearizability), for example. And this caused me lots of pain to understand when trying to classify. So, besides MongoDB give strong consistency, that doesn't mean that is C. In this way, if one make this classifications, I recommend to also give more depth in how it actually works to not leave doubts.
Yes, it is CP when using safe=true. This simply means, the data made it to the masters disk.
If you want to make sure it also arrived on some replica, look into the 'w=N' parameter where N is the number of replicas the data has to be saved on.
see this and this for more information.
MongoDB selects Consistency over Availability whenever there is a Partition. What it means is that when there's a partition(P) it chooses Consistency(C) over Availability(A).
To understand this, Let's understand how MongoDB does replica set works. A Replica Set has a single Primary node. The only "safe" way to commit data is to write to that node and then wait for that data to commit to a majority of nodes in the set. (you will see that flag for w=majority when sending writes)
Partition can occur in two scenarios as follows :
When Primary node goes down: system becomes unavailable until a new
primary is selected.
When Primary node looses connection from too many
Secondary nodes: system becomes unavailable. Other secondaries will try to
elect a new Primary and current primary will step down.
Basically, whenever a partition happens and MongoDB needs to decide what to do, it will choose Consistency over Availability. It will stop accepting writes to the system until it believes that it can safely complete those writes.
Mongodb never allows write to secondary. It allows optional reads from secondary but not writes. So if your primary goes down, you can't write till a secondary becomes primary again. That is how, you sacrifice High Availability in CAP theorem. By keeping your reads only from primary you can have strong consistency.
I'm not sure about P for Mongo. Imagine situation:
Your replica gets split into two partitions.
Writes continue to both sides as new masters were elected
Partition is resolved - all servers are now connected again
What happens is that new master is elected - the one that has highest oplog, but the data from the other master gets reverted to the common state before partition and it is dumped to a file for manual recovery
all secondaries catch up with the new master
The problem here is that the dump file size is limited and if you had a partition for a long time you can loose your data forever.
You can say that it's unlikely to happen - yes, unless in the cloud where it is more common than one may think.
This example is why I would be very careful before assigning any letter to any database. There's so many scenarios and implementations are not perfect.
If anyone knows if this scenario has been addressed in later releases of Mongo please comment! (I haven't been following everything that was happening for some time..)
Mongodb gives up availability. When we talk about availability in the context of the CAP theorem, it is about avoiding single points of failure that can go down. In mongodb. there is a primary router host. and if that goes down,there is gonna be some downtime in the time that it takes for it to elect a new replacement server to take its place. In practical, that is gonna happen very qucikly. we do have a couple of hot standbys sitting there ready to go. So as soon as the system detects that primary routing host went down, it is gonna switch over to a new one pretty much right away. Technically speaking it is still single point of failure. There is still a chance of downtime when that happens.
There is a config server, that is the primary and we have an app server, that is primary at any given time. even though we have multiple backups, there is gonna be a brief period of downtime if any of those servers go down. the system has to first detect that there was an outage and then remaining servers need to reelect a new primary host to take its place. that might take a few seconds and this is enough to say that mongodb is trading off the availability

mongo - read Preference design strategy

I have an application for which I am tasked with designing a mongo backed data storage.
The application goals are to provide the latest data ( no stale data ) with the fastest load times.
The data size is in the order of a few millions with the application being write heavy.
In choosing what the read strategy is given a 3-node replica set ( 1 primary, 1 secondary, 1 arbiter ), I came across two different strategies to determine where to source the reads from -
Read from the secondary to reduce load on primary. With the writeConcern = REPLICA_SAFE, thus ensuring the writes are done on both primary and the secondary. Set the read preference. to secondaryPreferred.
Always read from primary. but ensure the data is in primary before reading. So set writeConcern= SAFE . The read preference is default - primaryPreferred .
What are the things to be considered before choosing one of the options.
According to the documentation REPLICA_SAFE is a deprecated term and should be replaced with REPLICA_ACKNOWLEDGED. The other problem here is that the w value here appears to be 2 from this constant.
This is a problem for your configuration, as you have your Primary and only one Secondary, combined with an arbiter. In the event of a node going down, or being otherwise unreachable, with the level set as this it is looking to acknowledge all writes from 2 nodes where there will not be 2 nodes available. You can leave write operations hanging in this way.
The better case for your configuration would be MAJORITY, as no matter the number of nodes it will ensure writes to the Primary and the "majority" of the secondaries. But in your case any write concern condition involving more than the PRIMARY will block on all writes, if one of your nodes is down or unavailable, as you would have to have at least two more secondary nodes available so that there would still be a "majority" of nodes to acknowledge the write. Or drop the ARBITER and have two SECONDARY nodes.
So you will have to stick to the default w=1 where all writes are acknowledged to the PRIMARY unless you can deal with writes failing when your one SECONDARY goes down.
You can set the read preference to secondaryPreferred as long as you accept that you can ""possibly" be reading stale or not the latest representation of your data as the only real guarantee is of a write to the Primary node. The general replication considerations remain, in that the nodes should be somewhat equal in processing capability or this can lead to lag or general performance degradation as a result of your query operations.
Remember that replication is implemented for redundancy and is not a system for improving performance. If you are looking for performance then perhaps look into scaling up your system hardware or implement sharding to distribute the load.

Where does mongodb stand in the CAP theorem?

Everywhere I look, I see that MongoDB is CP.
But when I dig in I see it is eventually consistent.
Is it CP when you use safe=true? If so, does that mean that when I write with safe=true, all replicas will be updated before getting the result?
MongoDB is strongly consistent by default - if you do a write and then do a read, assuming the write was successful you will always be able to read the result of the write you just read. This is because MongoDB is a single-master system and all reads go to the primary by default. If you optionally enable reading from the secondaries then MongoDB becomes eventually consistent where it's possible to read out-of-date results.
MongoDB also gets high-availability through automatic failover in replica sets: http://www.mongodb.org/display/DOCS/Replica+Sets
I agree with Luccas post. You can't just say that MongoDB is CP/AP/CA, because it actually is a trade-off between C, A and P, depending on both database/driver configuration and type of disaster: here's a visual recap, and below a more detailed explanation.
Scenario
Main Focus
Description
No partition
CA
The system is available and provides strong consistency
partition, majority connected
AP
Not synchronized writes from the old primary are ignored
partition, majority not connected
CP
only read access is provided to avoid separated and inconsistent systems
Consistency:
MongoDB is strongly consistent when you use a single connection or the correct Write/Read Concern Level (Which will cost you execution speed). As soon as you don't meet those conditions (especially when you are reading from a secondary-replica) MongoDB becomes Eventually Consistent.
Availability:
MongoDB gets high availability through Replica-Sets. As soon as the primary goes down or gets unavailable else, then the secondaries will determine a new primary to become available again. There is an disadvantage to this: Every write that was performed by the old primary, but not synchronized to the secondaries will be rolled back and saved to a rollback-file, as soon as it reconnects to the set(the old primary is a secondary now). So in this case some consistency is sacrificed for the sake of availability.
Partition Tolerance:
Through the use of said Replica-Sets MongoDB also achieves the partition tolerance: As long as more than half of the servers of a Replica-Set is connected to each other, a new primary can be chosen. Why? To ensure two separated networks can not both choose a new primary. When not enough secondaries are connected to each other you can still read from them (but consistency is not ensured), but not write. The set is practically unavailable for the sake of consistency.
As a brilliant new article showed up and also some awesome experiments by Kyle in this field, you should be careful when labeling MongoDB, and other databases, as C or A.
Of course CAP helps to track down without much words what the database prevails about it, but people often forget that C in CAP means atomic consistency (linearizability), for example. And this caused me lots of pain to understand when trying to classify. So, besides MongoDB give strong consistency, that doesn't mean that is C. In this way, if one make this classifications, I recommend to also give more depth in how it actually works to not leave doubts.
Yes, it is CP when using safe=true. This simply means, the data made it to the masters disk.
If you want to make sure it also arrived on some replica, look into the 'w=N' parameter where N is the number of replicas the data has to be saved on.
see this and this for more information.
MongoDB selects Consistency over Availability whenever there is a Partition. What it means is that when there's a partition(P) it chooses Consistency(C) over Availability(A).
To understand this, Let's understand how MongoDB does replica set works. A Replica Set has a single Primary node. The only "safe" way to commit data is to write to that node and then wait for that data to commit to a majority of nodes in the set. (you will see that flag for w=majority when sending writes)
Partition can occur in two scenarios as follows :
When Primary node goes down: system becomes unavailable until a new
primary is selected.
When Primary node looses connection from too many
Secondary nodes: system becomes unavailable. Other secondaries will try to
elect a new Primary and current primary will step down.
Basically, whenever a partition happens and MongoDB needs to decide what to do, it will choose Consistency over Availability. It will stop accepting writes to the system until it believes that it can safely complete those writes.
Mongodb never allows write to secondary. It allows optional reads from secondary but not writes. So if your primary goes down, you can't write till a secondary becomes primary again. That is how, you sacrifice High Availability in CAP theorem. By keeping your reads only from primary you can have strong consistency.
I'm not sure about P for Mongo. Imagine situation:
Your replica gets split into two partitions.
Writes continue to both sides as new masters were elected
Partition is resolved - all servers are now connected again
What happens is that new master is elected - the one that has highest oplog, but the data from the other master gets reverted to the common state before partition and it is dumped to a file for manual recovery
all secondaries catch up with the new master
The problem here is that the dump file size is limited and if you had a partition for a long time you can loose your data forever.
You can say that it's unlikely to happen - yes, unless in the cloud where it is more common than one may think.
This example is why I would be very careful before assigning any letter to any database. There's so many scenarios and implementations are not perfect.
If anyone knows if this scenario has been addressed in later releases of Mongo please comment! (I haven't been following everything that was happening for some time..)
Mongodb gives up availability. When we talk about availability in the context of the CAP theorem, it is about avoiding single points of failure that can go down. In mongodb. there is a primary router host. and if that goes down,there is gonna be some downtime in the time that it takes for it to elect a new replacement server to take its place. In practical, that is gonna happen very qucikly. we do have a couple of hot standbys sitting there ready to go. So as soon as the system detects that primary routing host went down, it is gonna switch over to a new one pretty much right away. Technically speaking it is still single point of failure. There is still a chance of downtime when that happens.
There is a config server, that is the primary and we have an app server, that is primary at any given time. even though we have multiple backups, there is gonna be a brief period of downtime if any of those servers go down. the system has to first detect that there was an outage and then remaining servers need to reelect a new primary host to take its place. that might take a few seconds and this is enough to say that mongodb is trading off the availability