Why is MongoDB supposed to Consistent & Partition tolerant but not Available [duplicate]

Why is MongoDB supposed to Consistent & Partition tolerant but not Available [duplicate] - mongodb

Everywhere I look, I see that MongoDB is CP.
But when I dig in I see it is eventually consistent.
Is it CP when you use safe=true? If so, does that mean that when I write with safe=true, all replicas will be updated before getting the result?

MongoDB is strongly consistent by default - if you do a write and then do a read, assuming the write was successful you will always be able to read the result of the write you just read. This is because MongoDB is a single-master system and all reads go to the primary by default. If you optionally enable reading from the secondaries then MongoDB becomes eventually consistent where it's possible to read out-of-date results.
MongoDB also gets high-availability through automatic failover in replica sets: http://www.mongodb.org/display/DOCS/Replica+Sets

I agree with Luccas post. You can't just say that MongoDB is CP/AP/CA, because it actually is a trade-off between C, A and P, depending on both database/driver configuration and type of disaster: here's a visual recap, and below a more detailed explanation.
Scenario
Main Focus
Description
No partition
CA
The system is available and provides strong consistency
partition, majority connected
AP
Not synchronized writes from the old primary are ignored
partition, majority not connected
CP
only read access is provided to avoid separated and inconsistent systems
Consistency:
MongoDB is strongly consistent when you use a single connection or the correct Write/Read Concern Level (Which will cost you execution speed). As soon as you don't meet those conditions (especially when you are reading from a secondary-replica) MongoDB becomes Eventually Consistent.
Availability:
MongoDB gets high availability through Replica-Sets. As soon as the primary goes down or gets unavailable else, then the secondaries will determine a new primary to become available again. There is an disadvantage to this: Every write that was performed by the old primary, but not synchronized to the secondaries will be rolled back and saved to a rollback-file, as soon as it reconnects to the set(the old primary is a secondary now). So in this case some consistency is sacrificed for the sake of availability.
Partition Tolerance:
Through the use of said Replica-Sets MongoDB also achieves the partition tolerance: As long as more than half of the servers of a Replica-Set is connected to each other, a new primary can be chosen. Why? To ensure two separated networks can not both choose a new primary. When not enough secondaries are connected to each other you can still read from them (but consistency is not ensured), but not write. The set is practically unavailable for the sake of consistency.

As a brilliant new article showed up and also some awesome experiments by Kyle in this field, you should be careful when labeling MongoDB, and other databases, as C or A.
Of course CAP helps to track down without much words what the database prevails about it, but people often forget that C in CAP means atomic consistency (linearizability), for example. And this caused me lots of pain to understand when trying to classify. So, besides MongoDB give strong consistency, that doesn't mean that is C. In this way, if one make this classifications, I recommend to also give more depth in how it actually works to not leave doubts.

Yes, it is CP when using safe=true. This simply means, the data made it to the masters disk.
If you want to make sure it also arrived on some replica, look into the 'w=N' parameter where N is the number of replicas the data has to be saved on.
see this and this for more information.

MongoDB selects Consistency over Availability whenever there is a Partition. What it means is that when there's a partition(P) it chooses Consistency(C) over Availability(A).
To understand this, Let's understand how MongoDB does replica set works. A Replica Set has a single Primary node. The only "safe" way to commit data is to write to that node and then wait for that data to commit to a majority of nodes in the set. (you will see that flag for w=majority when sending writes)
Partition can occur in two scenarios as follows :
When Primary node goes down: system becomes unavailable until a new
primary is selected.
When Primary node looses connection from too many
Secondary nodes: system becomes unavailable. Other secondaries will try to
elect a new Primary and current primary will step down.
Basically, whenever a partition happens and MongoDB needs to decide what to do, it will choose Consistency over Availability. It will stop accepting writes to the system until it believes that it can safely complete those writes.

Mongodb never allows write to secondary. It allows optional reads from secondary but not writes. So if your primary goes down, you can't write till a secondary becomes primary again. That is how, you sacrifice High Availability in CAP theorem. By keeping your reads only from primary you can have strong consistency.

I'm not sure about P for Mongo. Imagine situation:
Your replica gets split into two partitions.
Writes continue to both sides as new masters were elected
Partition is resolved - all servers are now connected again
What happens is that new master is elected - the one that has highest oplog, but the data from the other master gets reverted to the common state before partition and it is dumped to a file for manual recovery
all secondaries catch up with the new master
The problem here is that the dump file size is limited and if you had a partition for a long time you can loose your data forever.
You can say that it's unlikely to happen - yes, unless in the cloud where it is more common than one may think.
This example is why I would be very careful before assigning any letter to any database. There's so many scenarios and implementations are not perfect.
If anyone knows if this scenario has been addressed in later releases of Mongo please comment! (I haven't been following everything that was happening for some time..)

Mongodb gives up availability. When we talk about availability in the context of the CAP theorem, it is about avoiding single points of failure that can go down. In mongodb. there is a primary router host. and if that goes down,there is gonna be some downtime in the time that it takes for it to elect a new replacement server to take its place. In practical, that is gonna happen very qucikly. we do have a couple of hot standbys sitting there ready to go. So as soon as the system detects that primary routing host went down, it is gonna switch over to a new one pretty much right away. Technically speaking it is still single point of failure. There is still a chance of downtime when that happens.
There is a config server, that is the primary and we have an app server, that is primary at any given time. even though we have multiple backups, there is gonna be a brief period of downtime if any of those servers go down. the system has to first detect that there was an outage and then remaining servers need to reelect a new primary host to take its place. that might take a few seconds and this is enough to say that mongodb is trading off the availability

Related

Can Cassandra or ScyllaDB give incomplete data while reading with PySpark if either clusters are left un-repaired forever?

I use both Cassandra and ScyllaDB 3-node clusters and use PySpark to read data. I was wondering if any of them are not repaired forever, is there any challenge while reading data from either if there are inconsistencies in nodes. Will the correct data be read and if yes, then why do we need to repair them?

Yes you can get incorrect data if reapir is not done. It also depends on with what consistency you are reading or writing. Generally in production systems writes are done with (Local_one/Local_quorum) and read with Local_quorum.
If you are writing with weak consistency level, then repair becomes important as some of the nodes might not have got the mutations and while reading those nodes may get selected.
For example if you write with consistency level ONE on a table TABLE1 with a replication of 3. Now it may happen your write was written to NodeA only and NodeB and NodeC might have missed the mutation. Now if you are reading with Consistency level LOCAL_QUORUM, it may happen that NodeB and 'NodeC' get selected and they do not return the written data.
Repair is an important maintenance task for Cassandra which should be done periodically and continuously to keep data in healthy state.

As others have noted in other answers, different consistency levels make repair more or less important for different reasons. So I'll focus on the consistency level that you said in a comment you are using: LOCAL_ONE for reading and LOCAL_QUORUM for writing:
Successfully writing with LOCAL_QUORUM only guarantees that two replicas have been written. If the third replica is temporarily down, and will later come up - at that point one third of the read requests for this data, reads done from only one node (this is what LOCAL_ONE means) will miss the new data! Moreover, there isn't even a guarantee of so-called monotonic consistency - you can get new data in one read (from one node), and the old data in a later read (from another node).
However, it isn't completely accurate that only a repair can fix this problem. Another feature - enabled by default on both Cassandra and Scylla - is called Hinted Handoff - where when a node is down for relatively short time (up to three hours, but also depending on the amount of traffic in that period), other nodes which tried to send it updates remember those updates - and retry the send when the dead node comes back up. If you are faced only with such relatively short downtimes, repair isn't necessary and Hinted Handoff is actually enough.
That being said, Hinted Handoff isn't guaranteed perfect and might miss some inconsistencies. E.g., the node wishing to save a hint might itself be rebooted before it managed to save the hint, or replaced after saving it. So this mechanism isn't completely foolproof.
By the way, there another thing you need to be aware of: If you ever intend to do a repair (e.g., perhaps after some node was down for too long for Hinted Handoff to have worked, or perhaps because a QUORUM read causes a read repair), you must do it at least once every gc_grace_seconds (this defaults to 10 days).
The reason for this statement is the risk of data resurrection by repair which is too infrequent. The thing is, after gc_grace_seconds, the tombstones marking deleted items are removed forever ("garbage collected"). At that point, if you do a repair and one of the nodes happens to have an old version of this data (prior to the delete), the old data will be "resurrected" - copied to all replicas.

In addition to Manish's great answer, I'll just add that read operations run consistency levels higher than *_ONE have a (small...10% default) chance to invoke a read repair. I have seen that applications running at a higher consistency level for reads, will have less issues with inconsistent replicas.
Although, writing at *_QUORUM should ensure that the majority (quorum) of replicas are indeed consistent. Once it's written successfully, data should not "go bad" over time.
That all being said, running periodic (weekly) repairs is a good idea. I highly recommend using Cassandra Reaper to manage repairs, especially if you have multiple clusters.

Why MongoDB is Consistent not available and Cassandra is Available not consistent?

Mongo
From this resource I understand why mongo is not A(Highly Available) based on below statement
MongoDB supports a “single master” model. This means you have a master
node and a number of slave nodes. In case the master goes down, one of
the slaves is elected as master. This process happens automatically
but it takes time, usually 10-40 seconds. During this time of new
leader election, your replica set is down and cannot take writes
Is it for the same reason Mongo is said to be Consistent(as write did not happen so returning the latest data in system ) but not Available(not available for writes) ?
Till re-election happens and write operation is in pending, can slave return perform the read operation ? Also does user re-initiate the write operation again once master is selected ?
But i do not understand from another angle why Mongo is highly consistent
As said on Where does mongodb stand in the CAP theorem?,
Mongo is consistent when all reads go to the primary by default.
But that is not true. If under Master/slave model , all reads will go to primary what is the use of slaves then ? It further says If you optionally enable reading from the secondaries then MongoDB becomes eventually consistent where it's possible to read out-of-date results. It means mongo may not be be
consistent with master/slaves(provided i do not configure write to all nodes before return). It does not makes sense to me to say mongo is consistent if all
read and writes go to primary. In that case every other DB also(like cassandra) will be consistent . Is n't it ?
Cassandra
From this resource I understand why Cassandra is A(Highly Available ) based on below statement
Cassandra supports a “multiple master” model. The loss of a single
node does not affect the ability of the cluster to take writes – so
you can achieve 100% uptime for writes
But I do not understand why cassandra is not Consistent ? Is it because node not available for write(as coordinated node is not able to connect) is available for read which can return stale data ?

Go through: MongoDB, Cassandra, and RDBMS in CAP, for better understanding of the topic.
A brief definition of Consistency and availability.
Consistency simply means, when you write a piece of data in a system/distributed system, the same data you should get when you read it from any node of the system.
Availability means, the system should always be available for read/write operation.
Note: Most systems are not, only available or only consistent, they always offer a bit of both
With the above definition let's see where MongoDB and Cassandra fall in CAP.
MongoDB
As you said MongoDB is highly consistent when reads and write go to the same node(the default case). Further, you can choose in MongoDB to read from other secondary nodes instead of reading from only leader/primary.
Now, when you try to read data from secondary, your consistency will completely depend on, how you want to read data:
You could ask data which is up to maximum, say 5 seconds stale or,
You could just say, return data from majority of nodes for your select statement.
Same way when you write from your client into Mongo leader, you can say, a write is successful if the data is replicated to or stored on majority of servers.
Clearly, from above, we can say MongoDb can be highly consistent or eventually consistent based on how you read/write your data.
Now, what about availability? MongoDB is mostly always available, but, the only time when the leader is down, MongoDB can't accept writes, until it figures out the new leader. Hence, not highly available
So, MongoDB is categorized under CP.
What about Cassandra?
In Cassandra, there is no leader and any nodes can accept write, so the Cassandra cluster is always available for writes and reads even if some nodes go down.
What about consistency in Cassandra?
Same as MongoDB Cassandra can be eventually consistent or highly consistent based on how you read/write data.
You can give consistency levels in your read/write operations, For example:
read/write data from one node
read/write data from majority/quorum of nodes and more
Let's say you give a consistency level of one in your read/write operation. So, your write is successful as soon as data is written to one replica. Now, if your read request happens to go to the other replica where the data is not updated yet(could be due to high network latency or any other reason), you will end up reading the old data.
So, Cassandra is highly available but has configurable consistency levels and hence not always consistent.
In conclusion, in their default behavior, MongoDB falls under CP and Cassandra in AP.

Consistency in the CAP paradigm also includes "eventual consistency" which MongoDB supports. In a contrast to ACID systems, the read in CAP systems does not guarantee a safe return.
In simple words, this means that your Master could have an updated value, but if you do read from Slave, it does not necessarily return the updated value, and that it's okay to no have this updated value by design.
The concept of eventual consistency is explained in an excellent answer here.
By architecture, Cassandra is supposed to be consistent; it offers a special implementation of eventual consistency called the 'tunable consistency' which would meant that the client application may choose the method of handling this- it even offers multi data centre consistency support at low levels!
Most issues from row wise inconsistency in Cassandra comes from the fact that Cassandra uses client timestamps to determine which value is the most recent, and not the server side ones, which may be tad bit confusing to understand at first.
I hope this helps!

You have only to understand the "point-in-time": As you only write to mongodb master, even if slave is not updated, it is consistent, as it has all the data generated util the sync moment.
That is not true for cassandra. As cassandra uses a master-less model, there's no garantee that other nodes has all the data. At a certain time, a node can have certain recent data, and not having older data from nodes not yet synced. Cassandra will only be consistent if you stop write to all nodes and put them online. As soon the sync finished you have a consistent data.

Can someone give me detailed technical reasons why writing to a secondary in MongoDB replica set is not allowed

I know we can't write to a secondary in MongoDB. But I can't find any technical reason why. In my case, I don't really care if there is a slight delay but write to a secondary might be faster. Please provide some reference if you can. Thanks!!

The reason why you can not write to a secondary is the way replication works:
Secondaries connect to a special collection on the primary, called oplog. This oplog contains operations which were run through the query optimizer. Basically, the oplog is a capped collection, and the secondaries use a tailable cursor to access it's entries and processes it from the oldest to the newest.
When a election takes place because the primary goes down / steps down, the secondary with the most recent oplog entry is elected primary. The secondaries connect to the new primary, query for the oplog entries they haven't processed yet and the cluster is in sync.
This procedure is pretty straight forward. Now imagine one could write to a secondary. All nodes in the cluster would have to have a tailable cursor on all other nodes of the cluster, and maintaining a consistent state in case of one machine failing becomes a very complicated and in case of a failure even race condition dependent thing. Effectively, there could be no guarantee even for eventual consistency any more. It would be a more or less a gamble.
That being said: A replica set is not for load balancing. A replica sets purpose is to enhance the availability and durability of the data. Because reading from a secondary is a non-risky thing, MongoDB made it possible, according to their dogma of offering the maximum of possible features without compromising scalability (which would be severely hampered if one could write to secondaries).
But MongoDB does provide a load balancing feature: sharding. Choosing the right shard key, you can distribute read and write load over (almost) as many shards as you want. Not to mention that you can provide a lot more of the precious RAM for a reasonable price when sharding.

There is a one liner answer:
Multi-master replication is a hairball.
If you was allowed to write to secondaries MongoDB would have to use milti-master replication to ge this working: http://en.wikipedia.org/wiki/Multi-master_replication where essentially evey node copies to each other the OPs (operations) they have received and somehow do it without losing data.
This form of replication has many obsticles to overcome.
One would be throughput; remember that OPs need to transfer across the entire network so it is possible you might actually lose throughput while adding consistentcy problems. So getting better throughput would be a problem. It is much having a secondary, taking all of the primaries OPs and then its own for replication outbound and then asking it to do yet another job.
Adding consistentcy over a distributed set like this would also be hazardous, one main question that bugs MongoDB when asking if a member is down or is: "Is it really down or just unavailable?". It is almost impossible to ensure true consistentcy in a distributed set like this, at the very least tricky.
Those are just two problems immediately.
Essentially, to sum up, MongoDB does not yet possess mlti-master replication. It could in the future but I would not be jumping for joy if it does, I will most likely ignore such a feature, normal replication and sharding in both ACID and non-ACID databases causes enough blood pressure.

Where does mongodb stand in the CAP theorem?

Everywhere I look, I see that MongoDB is CP.
But when I dig in I see it is eventually consistent.
Is it CP when you use safe=true? If so, does that mean that when I write with safe=true, all replicas will be updated before getting the result?

MongoDB is strongly consistent by default - if you do a write and then do a read, assuming the write was successful you will always be able to read the result of the write you just read. This is because MongoDB is a single-master system and all reads go to the primary by default. If you optionally enable reading from the secondaries then MongoDB becomes eventually consistent where it's possible to read out-of-date results.
MongoDB also gets high-availability through automatic failover in replica sets: http://www.mongodb.org/display/DOCS/Replica+Sets

As a brilliant new article showed up and also some awesome experiments by Kyle in this field, you should be careful when labeling MongoDB, and other databases, as C or A.
Of course CAP helps to track down without much words what the database prevails about it, but people often forget that C in CAP means atomic consistency (linearizability), for example. And this caused me lots of pain to understand when trying to classify. So, besides MongoDB give strong consistency, that doesn't mean that is C. In this way, if one make this classifications, I recommend to also give more depth in how it actually works to not leave doubts.

Yes, it is CP when using safe=true. This simply means, the data made it to the masters disk.
If you want to make sure it also arrived on some replica, look into the 'w=N' parameter where N is the number of replicas the data has to be saved on.
see this and this for more information.

MongoDB selects Consistency over Availability whenever there is a Partition. What it means is that when there's a partition(P) it chooses Consistency(C) over Availability(A).
To understand this, Let's understand how MongoDB does replica set works. A Replica Set has a single Primary node. The only "safe" way to commit data is to write to that node and then wait for that data to commit to a majority of nodes in the set. (you will see that flag for w=majority when sending writes)
Partition can occur in two scenarios as follows :
When Primary node goes down: system becomes unavailable until a new
primary is selected.
When Primary node looses connection from too many
Secondary nodes: system becomes unavailable. Other secondaries will try to
elect a new Primary and current primary will step down.
Basically, whenever a partition happens and MongoDB needs to decide what to do, it will choose Consistency over Availability. It will stop accepting writes to the system until it believes that it can safely complete those writes.

Mongodb never allows write to secondary. It allows optional reads from secondary but not writes. So if your primary goes down, you can't write till a secondary becomes primary again. That is how, you sacrifice High Availability in CAP theorem. By keeping your reads only from primary you can have strong consistency.

I'm not sure about P for Mongo. Imagine situation:
Your replica gets split into two partitions.
Writes continue to both sides as new masters were elected
Partition is resolved - all servers are now connected again
What happens is that new master is elected - the one that has highest oplog, but the data from the other master gets reverted to the common state before partition and it is dumped to a file for manual recovery
all secondaries catch up with the new master
The problem here is that the dump file size is limited and if you had a partition for a long time you can loose your data forever.
You can say that it's unlikely to happen - yes, unless in the cloud where it is more common than one may think.
This example is why I would be very careful before assigning any letter to any database. There's so many scenarios and implementations are not perfect.
If anyone knows if this scenario has been addressed in later releases of Mongo please comment! (I haven't been following everything that was happening for some time..)

Mongodb gives up availability. When we talk about availability in the context of the CAP theorem, it is about avoiding single points of failure that can go down. In mongodb. there is a primary router host. and if that goes down,there is gonna be some downtime in the time that it takes for it to elect a new replacement server to take its place. In practical, that is gonna happen very qucikly. we do have a couple of hot standbys sitting there ready to go. So as soon as the system detects that primary routing host went down, it is gonna switch over to a new one pretty much right away. Technically speaking it is still single point of failure. There is still a chance of downtime when that happens.
There is a config server, that is the primary and we have an app server, that is primary at any given time. even though we have multiple backups, there is gonna be a brief period of downtime if any of those servers go down. the system has to first detect that there was an outage and then remaining servers need to reelect a new primary host to take its place. that might take a few seconds and this is enough to say that mongodb is trading off the availability

How safe is MongoDB's safe mode on inserts?

I am working on a project which has some important data in it. This means we cannot to lose any of it if the light or server goes down. We are using MongoDB for the database. I'd like to be sure that my data is in the database after the insert and rollback the whole batch if one element was not inserted. I know it is the philosophy behind Mongo that we do not need transactions but how can I make sure that my data is really safely stored after insert rather than sent to some "black hole".
Should I make a search?
Should I use some specific mongoDB commands?
Should I use sharding even if one server is enough for satisfying
the speed and by the way it doesn't guarantee anything if the light
goes down?
What is the best solution?

Your best bet is to use Write Concerns - these allow you to tell MongoDB how important a piece of data is. The quickest Write Concern is also the least safe - the data is not flushed to disk until the next scheduled flush. The safest will confirm that the data has been written to disk on a number of machines before returning.
The write concern you are looking for is FSYNC_SAFE (at least that is what it is called from the point of view of the Java driver) or REPLICAS_SAFE which confirms that your data has been replicated.
Bear in mind that MongoDB does not have transactions in the traditional sense - your rollback will have to be rolled by hand as you can't tell the Mongo database to do this for you.
The other thing you need to do is either use the relatively new --journal option (which uses a Write Ahead Log), or use replica sets to share your data across many machines in order to maximise data integrity in the event of a crash/power loss.
Sharding is not so much a protection against hardware failure as a method for sharing the load when dealing with particularly large datasets - sharding shouldn't be confused with replica sets which is a way of writing data to more than one disk on more than one machine.
Therefore, if your data is valuable enough, you should definitely be using replica sets, perhaps even siting slaves in other data centres/availability zones/racks/etc in order to provide the resilience you require.
There is/will be (can't remember offhand whether this has been implemented yet) a way to specify the priority of individual nodes in a replica set such that if the master goes down the new master that is elected is one in the same data centre if such a machine is available (ie to stop a slave on the other side of the country from becoming master unless it really is the only other option).

I received a really nice answer from a person called GVP on google groups. I will quote it(basically it adds up to Rich's answer):
I'd like to be sure that my data is in the database after the
insert and rollback the whole batch if one element was not inserted.
This is a complex topic and there are several trade-offs you have to
consider here.
Should I use sharding?
Sharding is for scaling writes. For data safety, you want to look a
replica sets.
Should I use some specific mongoDB commands?
First thing to consider is "safe" mode or "getLastError()" as
indicated by Andreas. If you issue a "safe" write, you know that the
database has received the insert and applied the write. However,
MongoDB only flushes to disk every 60 seconds, so the server can fail
without the data on disk.
Second thing to consider is "journaling"
(v1.8+). With journaling turned on, data is flushed to the journal
every 100ms. So you have a smaller window of time before failure. The
drivers have an "fsync" option (check that name) that goes one step
further than "safe", it waits for acknowledgement that the data has
be flushed to the disk (i.e. the journal file). However, this only
covers one server. What happens if the hard drive on the server just
dies? Well you need a second copy.
Third thing to consider is
replication. The drivers support a "W" parameter that says "replicate
this data to N nodes" before returning. If the write does not reach
"N" nodes before a certain timeout, then the write fails (exception
is thrown). However, you have to configure "W" correctly based on the
number of nodes in your replica set. Again, because a hard drive
could fail, even with journaling, you'll want to look at replication.
Then there's replication across data centers which is too long to get
into here. The last thing to consider is your requirement to "roll
back". From my understanding, MongoDB does not have this "roll back"
capacity. If you're doing a batch insert the best you'll get is an
indication of which elements failed.
Here's a link to the PHP driver on this one: http://it.php.net/manual/en/mongocollection.batchinsert.php You'll have to check the details on replication and the W parameter. I believe the same limitations apply here.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse