MongoDB: Write concern w:[number>1] - why? - mongodb

I'm trying to understand the advantages of using write concern where is greater than 1.
I understand the use cases for w:1, and w:primary. I'm trying to understand why someone would use other values. Let's use a 6-node (+arbiter) replica set as an example.
If it's guaranteed writes with {w:majority, j:true} will survive the
failure of the primary, what the advantage of the w:5?
Does w:6 help to achieve linearizable access to secondaries at the
expense of availability as a failure of any node will prevent writes?
This is not documented as such.
Why would someone ever use w:2 or w:3? It doesn't guarantee the
write will survive a failure of the primary.

Write concern is a mechanism for tuning availability, durability and performance.
In a w:1 environment, you essentially expect to lose data if the primary experiences any issue whatsoever.
In a w:2 environment, you still may lose data but you expect less loss because the data will be replicated. If you lose two nodes, and you are unlucky with where the second write went, you can lose the data.
In a w:3 environment, there is still potential for data loss but it is less than in the w:2 environment.
w:4 is majority write, a sensible default for most applications.
w:5 provides one "unnecessary" write. This means your writes will take longer than they would with w:majority but if you have elections, secondary nodes will generally be more up-to-date with the primary, so you would reduce election times.
w:6 writes to every node. If you are doing secondary reads and your nodes are geographically distributed, this is a way of getting the data to all of the nodes quicker at the expense of potential write unavailability and obviously longer writes.
Does w:6 help to achieve linearizable access to secondaries at the expense of availability as a failure of any node will prevent writes?
No, this is not one of the benefits you get from w=# of nodes.

Related

Does paxos provide true linearizable consistency or not?

I think I might be confusing concepts here, but it seems to me like paxos would provide linearizable consistency for systems that implement it.
I know Cassandra uses it. I'm not 100% clear on how but assuming a leader is elected and that single leader does all the writes then communication is synchronous and real-time linearizability is achieved right?
But consensus algorithms like paxos are generally considered partially synchronous because there is a quorum (not 100% of node communication)- does this also mean it's not truly linearizable as well?
maybe because there is only a quorum a node could fall out of sync and that would break linearization?
A linearizable system does not need to be synchronous. Linearizability is a safety property: it says "nothing bad happens" but it doesn't affect linearizability if nothing good happens either. Any reads or writes that do not return (or that return an error) can be ignored when checking for linearizability. This means it's perfectly possible for a system to be linearizable even if one or more of the nodes are faulty or partitioned or running slowly.
Paxos is commonly used to implement a replicated state machine: a system that executes a sequence of operations on multiple nodes at once. Since the operations are deterministic and the nodes all agree on the operations to run and the sequence in which to run them, the nodes all converge to the same state (eventually).
You can implement a linearizable system using Paxos by having the operations in the sequence be writes and reads using the fact that the operations are placed in a totally-ordered sequence (i.e. linearized) by the Paxos protocol.
It's important to put the reads in the sequence as well as the writes. Imagine instead you only used Paxos to agree on the writes, and served reads directly from a node's local state. If the node serving the reads is partitioned from the other nodes then it would serve stale reads, violating linearizability. Each read must involve a quorum of nodes to ensure that the returned value is fresh, which means (effectively) putting the read into the sequence alongside the writes.
(There's some tricks you can play to make reads a bit more efficient than writes, given that reads commute with each other and don't need to be persisted to disk, but you can't escape the need to contact a quorum of nodes for both read and write operations)

Can someone give me detailed technical reasons why writing to a secondary in MongoDB replica set is not allowed

I know we can't write to a secondary in MongoDB. But I can't find any technical reason why. In my case, I don't really care if there is a slight delay but write to a secondary might be faster. Please provide some reference if you can. Thanks!!
The reason why you can not write to a secondary is the way replication works:
Secondaries connect to a special collection on the primary, called oplog. This oplog contains operations which were run through the query optimizer. Basically, the oplog is a capped collection, and the secondaries use a tailable cursor to access it's entries and processes it from the oldest to the newest.
When a election takes place because the primary goes down / steps down, the secondary with the most recent oplog entry is elected primary. The secondaries connect to the new primary, query for the oplog entries they haven't processed yet and the cluster is in sync.
This procedure is pretty straight forward. Now imagine one could write to a secondary. All nodes in the cluster would have to have a tailable cursor on all other nodes of the cluster, and maintaining a consistent state in case of one machine failing becomes a very complicated and in case of a failure even race condition dependent thing. Effectively, there could be no guarantee even for eventual consistency any more. It would be a more or less a gamble.
That being said: A replica set is not for load balancing. A replica sets purpose is to enhance the availability and durability of the data. Because reading from a secondary is a non-risky thing, MongoDB made it possible, according to their dogma of offering the maximum of possible features without compromising scalability (which would be severely hampered if one could write to secondaries).
But MongoDB does provide a load balancing feature: sharding. Choosing the right shard key, you can distribute read and write load over (almost) as many shards as you want. Not to mention that you can provide a lot more of the precious RAM for a reasonable price when sharding.
There is a one liner answer:
Multi-master replication is a hairball.
If you was allowed to write to secondaries MongoDB would have to use milti-master replication to ge this working: http://en.wikipedia.org/wiki/Multi-master_replication where essentially evey node copies to each other the OPs (operations) they have received and somehow do it without losing data.
This form of replication has many obsticles to overcome.
One would be throughput; remember that OPs need to transfer across the entire network so it is possible you might actually lose throughput while adding consistentcy problems. So getting better throughput would be a problem. It is much having a secondary, taking all of the primaries OPs and then its own for replication outbound and then asking it to do yet another job.
Adding consistentcy over a distributed set like this would also be hazardous, one main question that bugs MongoDB when asking if a member is down or is: "Is it really down or just unavailable?". It is almost impossible to ensure true consistentcy in a distributed set like this, at the very least tricky.
Those are just two problems immediately.
Essentially, to sum up, MongoDB does not yet possess mlti-master replication. It could in the future but I would not be jumping for joy if it does, I will most likely ignore such a feature, normal replication and sharding in both ACID and non-ACID databases causes enough blood pressure.

Why is MongoDB supposed to Consistent & Partition tolerant but not Available [duplicate]

Everywhere I look, I see that MongoDB is CP.
But when I dig in I see it is eventually consistent.
Is it CP when you use safe=true? If so, does that mean that when I write with safe=true, all replicas will be updated before getting the result?
MongoDB is strongly consistent by default - if you do a write and then do a read, assuming the write was successful you will always be able to read the result of the write you just read. This is because MongoDB is a single-master system and all reads go to the primary by default. If you optionally enable reading from the secondaries then MongoDB becomes eventually consistent where it's possible to read out-of-date results.
MongoDB also gets high-availability through automatic failover in replica sets: http://www.mongodb.org/display/DOCS/Replica+Sets
I agree with Luccas post. You can't just say that MongoDB is CP/AP/CA, because it actually is a trade-off between C, A and P, depending on both database/driver configuration and type of disaster: here's a visual recap, and below a more detailed explanation.
Scenario
Main Focus
Description
No partition
CA
The system is available and provides strong consistency
partition, majority connected
AP
Not synchronized writes from the old primary are ignored
partition, majority not connected
CP
only read access is provided to avoid separated and inconsistent systems
Consistency:
MongoDB is strongly consistent when you use a single connection or the correct Write/Read Concern Level (Which will cost you execution speed). As soon as you don't meet those conditions (especially when you are reading from a secondary-replica) MongoDB becomes Eventually Consistent.
Availability:
MongoDB gets high availability through Replica-Sets. As soon as the primary goes down or gets unavailable else, then the secondaries will determine a new primary to become available again. There is an disadvantage to this: Every write that was performed by the old primary, but not synchronized to the secondaries will be rolled back and saved to a rollback-file, as soon as it reconnects to the set(the old primary is a secondary now). So in this case some consistency is sacrificed for the sake of availability.
Partition Tolerance:
Through the use of said Replica-Sets MongoDB also achieves the partition tolerance: As long as more than half of the servers of a Replica-Set is connected to each other, a new primary can be chosen. Why? To ensure two separated networks can not both choose a new primary. When not enough secondaries are connected to each other you can still read from them (but consistency is not ensured), but not write. The set is practically unavailable for the sake of consistency.
As a brilliant new article showed up and also some awesome experiments by Kyle in this field, you should be careful when labeling MongoDB, and other databases, as C or A.
Of course CAP helps to track down without much words what the database prevails about it, but people often forget that C in CAP means atomic consistency (linearizability), for example. And this caused me lots of pain to understand when trying to classify. So, besides MongoDB give strong consistency, that doesn't mean that is C. In this way, if one make this classifications, I recommend to also give more depth in how it actually works to not leave doubts.
Yes, it is CP when using safe=true. This simply means, the data made it to the masters disk.
If you want to make sure it also arrived on some replica, look into the 'w=N' parameter where N is the number of replicas the data has to be saved on.
see this and this for more information.
MongoDB selects Consistency over Availability whenever there is a Partition. What it means is that when there's a partition(P) it chooses Consistency(C) over Availability(A).
To understand this, Let's understand how MongoDB does replica set works. A Replica Set has a single Primary node. The only "safe" way to commit data is to write to that node and then wait for that data to commit to a majority of nodes in the set. (you will see that flag for w=majority when sending writes)
Partition can occur in two scenarios as follows :
When Primary node goes down: system becomes unavailable until a new
primary is selected.
When Primary node looses connection from too many
Secondary nodes: system becomes unavailable. Other secondaries will try to
elect a new Primary and current primary will step down.
Basically, whenever a partition happens and MongoDB needs to decide what to do, it will choose Consistency over Availability. It will stop accepting writes to the system until it believes that it can safely complete those writes.
Mongodb never allows write to secondary. It allows optional reads from secondary but not writes. So if your primary goes down, you can't write till a secondary becomes primary again. That is how, you sacrifice High Availability in CAP theorem. By keeping your reads only from primary you can have strong consistency.
I'm not sure about P for Mongo. Imagine situation:
Your replica gets split into two partitions.
Writes continue to both sides as new masters were elected
Partition is resolved - all servers are now connected again
What happens is that new master is elected - the one that has highest oplog, but the data from the other master gets reverted to the common state before partition and it is dumped to a file for manual recovery
all secondaries catch up with the new master
The problem here is that the dump file size is limited and if you had a partition for a long time you can loose your data forever.
You can say that it's unlikely to happen - yes, unless in the cloud where it is more common than one may think.
This example is why I would be very careful before assigning any letter to any database. There's so many scenarios and implementations are not perfect.
If anyone knows if this scenario has been addressed in later releases of Mongo please comment! (I haven't been following everything that was happening for some time..)
Mongodb gives up availability. When we talk about availability in the context of the CAP theorem, it is about avoiding single points of failure that can go down. In mongodb. there is a primary router host. and if that goes down,there is gonna be some downtime in the time that it takes for it to elect a new replacement server to take its place. In practical, that is gonna happen very qucikly. we do have a couple of hot standbys sitting there ready to go. So as soon as the system detects that primary routing host went down, it is gonna switch over to a new one pretty much right away. Technically speaking it is still single point of failure. There is still a chance of downtime when that happens.
There is a config server, that is the primary and we have an app server, that is primary at any given time. even though we have multiple backups, there is gonna be a brief period of downtime if any of those servers go down. the system has to first detect that there was an outage and then remaining servers need to reelect a new primary host to take its place. that might take a few seconds and this is enough to say that mongodb is trading off the availability

MongoDB writeConcern for production

My question may sound too general, but I'm ready to give any missing data.
We make something like a social network. In order to make read performance better and to ease the life of master instance, we've set
readPreference=secondaryPreferred
in our replicaSet. But with this, there's no guarantee that the data is written to secondary instances before you read from there, so we had to set
w=3
option.
So far, everything seems to be working but measurements on my local replicaSet show the following insert statistics.
Inserting 300 objects:
w=1 - 0.10s
w=3 - 1.31s
Insertion 5000 objects:
w=1 - 0.6s
w=3 - 14.6s
The question is, is this difference expected, or I'm doing something wrong?
The difference in performance is expected because w=3 means that you want to wait for acknowledgement that data was successfully replicated to at least two of your secondaries in addition to the acknowledgement from your primary (w=1).
For clarity, w = 1 simply means that you want an acknowledgement from the primary that an operation was completed. Any errors such as duplicate key errors or network errors that occur would be reported back as part of the acknowledgement if occurred.
http://docs.mongodb.org/manual/reference/write-concern/
Refer to the link above, and you can see there are lower write concerns that let you trade safety for lower latency.
If you want higher level of durability or safety, then you might use j=1 to wait for an acknowledgement that your operation was written to the journal (allowing recovery from a failure). w > N increases safety by waiting for acknowledgement from > N replica members to ensure that your operation was successfully replicated to other members. So to be clear, w > 1 isn't necessary to instruct the driver to write to the replicas. If you decided to use w=N, be aware that you can get yourself in a bad situation if replica set members fail and fall below N. w = majority is a more flexible option.
Lastly, you may want to re-evaluate why you're reading from the secondaries. Secondaries are eventually consistent as MongoDB uses asynchronous replication. If you're expecting consistent reads, then it makes more sense to read from the primary. If your reason to read from the secondary is for scaling, you should consider sharding as this is the primary mechanism for scale-out. Distributing load on secondaries rarely improve scalability. Operations are replicated over to replicas, so you're not gaining much from a lower write load. Sometimes it makes sense to for distributing different types of workloads (may lead to better memory utilization). For instance, running a MR job on a secondary might make sense. Replica sets are primarily for high availability-- fault tolerance providing automatic fail-overs and network partition issues.

Where does mongodb stand in the CAP theorem?

Everywhere I look, I see that MongoDB is CP.
But when I dig in I see it is eventually consistent.
Is it CP when you use safe=true? If so, does that mean that when I write with safe=true, all replicas will be updated before getting the result?
MongoDB is strongly consistent by default - if you do a write and then do a read, assuming the write was successful you will always be able to read the result of the write you just read. This is because MongoDB is a single-master system and all reads go to the primary by default. If you optionally enable reading from the secondaries then MongoDB becomes eventually consistent where it's possible to read out-of-date results.
MongoDB also gets high-availability through automatic failover in replica sets: http://www.mongodb.org/display/DOCS/Replica+Sets
I agree with Luccas post. You can't just say that MongoDB is CP/AP/CA, because it actually is a trade-off between C, A and P, depending on both database/driver configuration and type of disaster: here's a visual recap, and below a more detailed explanation.
Scenario
Main Focus
Description
No partition
CA
The system is available and provides strong consistency
partition, majority connected
AP
Not synchronized writes from the old primary are ignored
partition, majority not connected
CP
only read access is provided to avoid separated and inconsistent systems
Consistency:
MongoDB is strongly consistent when you use a single connection or the correct Write/Read Concern Level (Which will cost you execution speed). As soon as you don't meet those conditions (especially when you are reading from a secondary-replica) MongoDB becomes Eventually Consistent.
Availability:
MongoDB gets high availability through Replica-Sets. As soon as the primary goes down or gets unavailable else, then the secondaries will determine a new primary to become available again. There is an disadvantage to this: Every write that was performed by the old primary, but not synchronized to the secondaries will be rolled back and saved to a rollback-file, as soon as it reconnects to the set(the old primary is a secondary now). So in this case some consistency is sacrificed for the sake of availability.
Partition Tolerance:
Through the use of said Replica-Sets MongoDB also achieves the partition tolerance: As long as more than half of the servers of a Replica-Set is connected to each other, a new primary can be chosen. Why? To ensure two separated networks can not both choose a new primary. When not enough secondaries are connected to each other you can still read from them (but consistency is not ensured), but not write. The set is practically unavailable for the sake of consistency.
As a brilliant new article showed up and also some awesome experiments by Kyle in this field, you should be careful when labeling MongoDB, and other databases, as C or A.
Of course CAP helps to track down without much words what the database prevails about it, but people often forget that C in CAP means atomic consistency (linearizability), for example. And this caused me lots of pain to understand when trying to classify. So, besides MongoDB give strong consistency, that doesn't mean that is C. In this way, if one make this classifications, I recommend to also give more depth in how it actually works to not leave doubts.
Yes, it is CP when using safe=true. This simply means, the data made it to the masters disk.
If you want to make sure it also arrived on some replica, look into the 'w=N' parameter where N is the number of replicas the data has to be saved on.
see this and this for more information.
MongoDB selects Consistency over Availability whenever there is a Partition. What it means is that when there's a partition(P) it chooses Consistency(C) over Availability(A).
To understand this, Let's understand how MongoDB does replica set works. A Replica Set has a single Primary node. The only "safe" way to commit data is to write to that node and then wait for that data to commit to a majority of nodes in the set. (you will see that flag for w=majority when sending writes)
Partition can occur in two scenarios as follows :
When Primary node goes down: system becomes unavailable until a new
primary is selected.
When Primary node looses connection from too many
Secondary nodes: system becomes unavailable. Other secondaries will try to
elect a new Primary and current primary will step down.
Basically, whenever a partition happens and MongoDB needs to decide what to do, it will choose Consistency over Availability. It will stop accepting writes to the system until it believes that it can safely complete those writes.
Mongodb never allows write to secondary. It allows optional reads from secondary but not writes. So if your primary goes down, you can't write till a secondary becomes primary again. That is how, you sacrifice High Availability in CAP theorem. By keeping your reads only from primary you can have strong consistency.
I'm not sure about P for Mongo. Imagine situation:
Your replica gets split into two partitions.
Writes continue to both sides as new masters were elected
Partition is resolved - all servers are now connected again
What happens is that new master is elected - the one that has highest oplog, but the data from the other master gets reverted to the common state before partition and it is dumped to a file for manual recovery
all secondaries catch up with the new master
The problem here is that the dump file size is limited and if you had a partition for a long time you can loose your data forever.
You can say that it's unlikely to happen - yes, unless in the cloud where it is more common than one may think.
This example is why I would be very careful before assigning any letter to any database. There's so many scenarios and implementations are not perfect.
If anyone knows if this scenario has been addressed in later releases of Mongo please comment! (I haven't been following everything that was happening for some time..)
Mongodb gives up availability. When we talk about availability in the context of the CAP theorem, it is about avoiding single points of failure that can go down. In mongodb. there is a primary router host. and if that goes down,there is gonna be some downtime in the time that it takes for it to elect a new replacement server to take its place. In practical, that is gonna happen very qucikly. we do have a couple of hot standbys sitting there ready to go. So as soon as the system detects that primary routing host went down, it is gonna switch over to a new one pretty much right away. Technically speaking it is still single point of failure. There is still a chance of downtime when that happens.
There is a config server, that is the primary and we have an app server, that is primary at any given time. even though we have multiple backups, there is gonna be a brief period of downtime if any of those servers go down. the system has to first detect that there was an outage and then remaining servers need to reelect a new primary host to take its place. that might take a few seconds and this is enough to say that mongodb is trading off the availability