My question may sound too general, but I'm ready to give any missing data.
We're building something like a social network. To improve read performance and to ease the load on the primary, we've set
readPreference=secondaryPreferred
in our replicaSet. But with this, there's no guarantee that the data is written to secondary instances before you read from there, so we had to set
w=3
option.
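For reference, this is roughly how that configuration looks from a driver. The sketch below uses PyMongo and assumes a hypothetical three-member local replica set named rs0; hosts, database, and collection names are illustrative.

    # Sketch (PyMongo): secondaryPreferred reads plus w=3 writes on a hypothetical
    # three-member replica set "rs0"; adjust hosts and names to your deployment.
    from pymongo import MongoClient

    client = MongoClient(
        "mongodb://localhost:27017,localhost:27018,localhost:27019/"
        "?replicaSet=rs0&readPreference=secondaryPreferred&w=3"
    )
    posts = client.social.posts                            # illustrative names
    posts.insert_one({"user": "alice", "text": "hello"})   # acked by 3 members
    doc = posts.find_one({"user": "alice"})                # may be served by a secondary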
So far everything seems to be working, but measurements on my local replica set show the following insert statistics.
Inserting 300 objects:
w=1 - 0.10s
w=3 - 1.31s
Inserting 5000 objects:
w=1 - 0.6s
w=3 - 14.6s
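For reference, the measurement can be reproduced along these lines (a PyMongo sketch; the collection name and document shape are made up):

    # Sketch: comparing per-insert latency under w=1 vs w=3 (illustrative only).
    import time
    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
    db = client.test

    def timed_inserts(w, n):
        coll = db.get_collection("bench", write_concern=WriteConcern(w=w))
        coll.drop()
        start = time.time()
        for i in range(n):
            coll.insert_one({"i": i})      # each insert waits for w acknowledgements
        return time.time() - start

    for n in (300, 5000):
        print(n, "docs:", "w=1:", timed_inserts(1, n), "w=3:", timed_inserts(3, n))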
The question is: is this difference expected, or am I doing something wrong?
The difference in performance is expected because w=3 means that you want to wait for acknowledgement that data was successfully replicated to at least two of your secondaries in addition to the acknowledgement from your primary (w=1).
For clarity, w=1 simply means that you want an acknowledgement from the primary that an operation was completed. Any errors that occur, such as duplicate key errors or network errors, are reported back as part of that acknowledgement.
http://docs.mongodb.org/manual/reference/write-concern/
Refer to the link above, and you can see there are lower write concerns that let you trade safety for lower latency.
If you want a higher level of durability or safety, you can use j=1 to wait for an acknowledgement that your operation was written to the journal (allowing recovery from a failure). Setting w greater than 1 increases safety by waiting for acknowledgement from additional replica set members, ensuring that your operation was successfully replicated to them. To be clear, w > 1 isn't necessary to instruct the driver to write to the replicas; replication happens regardless. If you decide to use w=N, be aware that you can get yourself into a bad situation if replica set members fail and the number of available members falls below N. w=majority is a more flexible option.
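As an illustration of those trade-offs (a PyMongo sketch with made-up collection names, not prescriptive), the write concern can be chosen per collection or per operation:

    # Sketch: different write-concern levels per collection (illustrative names).
    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
    db = client.social

    fast      = db.get_collection("events", write_concern=WriteConcern(w=1))
    journaled = db.get_collection("orders", write_concern=WriteConcern(w=1, j=True))
    majority  = db.get_collection("users",  write_concern=WriteConcern(w="majority",
                                                                       wtimeout=5000))
    fast.insert_one({"type": "page_view"})   # ack from the primary only
    journaled.insert_one({"total": 10})      # ack after the primary journals the write
    majority.insert_one({"name": "alice"})   # ack after a majority holds the write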
Lastly, you may want to re-evaluate why you're reading from the secondaries. Secondaries are eventually consistent because MongoDB uses asynchronous replication. If you expect consistent reads, it makes more sense to read from the primary. If your reason for reading from the secondaries is scaling, consider sharding instead, as that is the primary mechanism for scale-out. Distributing load across secondaries rarely improves scalability: operations are replicated to the secondaries anyway, so you gain little from a lower write load on them. It can make sense for distributing different types of workloads (and may lead to better memory utilization); for instance, running a map-reduce job on a secondary might be reasonable. Replica sets are primarily for high availability: fault tolerance through automatic failover and handling of network partitions.
Related
I'm trying to understand the advantages of using a write concern where w is greater than 1.
I understand the use cases for w:1, and w:primary. I'm trying to understand why someone would use other values. Let's use a 6-node (+arbiter) replica set as an example.
If it's guaranteed that writes with {w: majority, j: true} will survive the failure of the primary, what is the advantage of w:5?
Does w:6 help to achieve linearizable access to secondaries at the expense of availability, since a failure of any node will prevent writes? This is not documented as such.
Why would someone ever use w:2 or w:3? It doesn't guarantee the write will survive a failure of the primary.
Write concern is a mechanism for tuning availability, durability and performance.
In a w:1 environment, you essentially expect to lose data if the primary experiences any issue whatsoever.
In a w:2 environment, you may still lose data, but you expect less loss because the data will be replicated to at least one secondary. If you lose two nodes and you are unlucky about which secondary received the second copy, you can still lose the data.
In a w:3 environment, there is still potential for data loss but it is less than in the w:2 environment.
w:4 is majority write, a sensible default for most applications.
w:5 provides one "unnecessary" write. This means your writes will take longer than they would with w:majority but if you have elections, secondary nodes will generally be more up-to-date with the primary, so you would reduce election times.
w:6 writes to every node. If you are doing secondary reads and your nodes are geographically distributed, this is a way of getting the data to all of the nodes quicker at the expense of potential write unavailability and obviously longer writes.
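To make that concrete, here is a hedged sketch (PyMongo, illustrative names) of requesting a numeric write concern on such a 6-node set, with a wtimeout so a missing member cannot block the client forever:

    # Sketch: numeric write concern on the hypothetical 6-node (+ arbiter) set.
    from pymongo import MongoClient
    from pymongo.errors import WTimeoutError
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
    coll = client.test.get_collection(
        "docs", write_concern=WriteConcern(w=5, wtimeout=2000)  # wait for 5 members
    )
    try:
        coll.insert_one({"x": 1})
    except WTimeoutError:
        # The write is not rolled back; only the acknowledgement timed out because
        # fewer than 5 members caught up within 2 seconds.
        pass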
Does w:6 help to achieve linearizable access to secondaries at the expense of availability as a failure of any node will prevent writes?
No, this is not one of the benefits you get from w=# of nodes.
In general, MongoDB replicates from a primary to secondaries asynchronously, by shipping the oplog from the primary to the secondaries, based on the number of write operations, time, and other factors.
When describing WriteConcern options, MongoDB documentation states "...primary waits until the required number of secondaries acknowledge the write before returning write concern acknowledgment". This seems to suggest that a WriteConcern other than "w:1" would replicate to at least some of the members of the replica set in a blocking manner, potentially avoiding log shipping.
The basic question I'm trying to answer is this: if every write uses a WriteConcern of "majority", would MongoDB ever have to use log shipment? In other words, does using a WriteConcern of "majority" also control replication timing?
I would like to better understand how MongoDB handles WriteConcern of "majority". A few obvious options:
Primary sends write requests to every Secondary, and blocks the thread until majority respond with acknowledgment
or
Primary pre-selects Secondaries first and sends requests to only those secondaries, blocking the thread until all chosen secondaries respond with acknowledgment
or
Something much smarter than either of these options
If Option 1 is used, in most cases (assuming equidistant placement of secondaries) all secondaries will have received the write operation by the time the write completes, and there is a high probability (although not a guarantee) that all secondaries will have applied it. If true, this behavior enables use cases where writes need to be reflected on secondaries quicker than the typical asynchronous replication process allows.
Obviously WriteConcern of "majority" will incur performance penalty, but this may be acceptable for specific use cases where read operations may target Secondaries (e.g. ReadPreference of "nearest") and desire more recent data.
if every write is using WriteConcern of "majority", would MongoDB ever have to use log shipment?
Replication in MongoDB uses what is termed the oplog. This is a record of all operations on the primary (the only node that accepts writes).
Instead of the primary pushing the oplog to the secondaries, the secondaries pull the oplog from the primary. If replication chaining is allowed (the default), a secondary can also pull the oplog from another secondary. So scenarios 1 and 2 you posted are not how MongoDB replication actually works, as of MongoDB 4.0.
The details of the replication process are described in the MongoDB GitHub wiki page: Replication Internals.
To quote the relevant parts regarding your question:
If a command includes a write concern, the command will just block in its own thread until the oplog entries it generates have been replicated to the requested number of nodes. The primary keeps track of how up-to-date the secondaries are to know when to return. A write concern can specify a number of nodes to wait for, or majority.
In other words, the secondaries continually report back to the primary how far along they have applied the oplog to their own data sets. Since the primary knows the timestamp at which the write took place, once a secondary has applied that timestamp, the primary can tell that the write has propagated to that secondary. To satisfy the write concern, the primary simply waits until the required number of secondaries have applied the write's timestamp.
Note that only the thread specifying the write concern is waiting for this acknowledgment. All other threads are not blocked due to this waiting at all.
Regarding your other question:
Obviously WriteConcern of "majority" will incur performance penalty, but this may be acceptable for specific use cases where read operations may target Secondaries (e.g. ReadPreference of "nearest") and desire more recent data.
To achieve what you described, you need a combination of read and write concerns. See Causal Consistency and Read and Write Concerns for more details on this subject.
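A hedged sketch of that combination with PyMongo (names are illustrative): a causally consistent session plus majority read/write concerns lets a client read its own write, even from a secondary.

    # Sketch: read-your-own-write from a secondary via a causally consistent session.
    from pymongo import MongoClient, ReadPreference
    from pymongo.read_concern import ReadConcern
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
    coll = client.test.get_collection(
        "items",
        read_preference=ReadPreference.SECONDARY_PREFERRED,
        read_concern=ReadConcern("majority"),
        write_concern=WriteConcern(w="majority"),
    )
    with client.start_session(causal_consistency=True) as session:
        coll.insert_one({"_id": 1, "status": "new"}, session=session)
        # This read is causally ordered after the insert, so even a secondary
        # will return the document once it is majority-committed.
        doc = coll.find_one({"_id": 1}, session=session)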
Write majority is typically used for:
Ensuring that the write will not be rolled back in the event of the primary failure.
Ensuring that the application is not writing so fast that the provisioned hardware of the replica set cannot cope with the traffic; i.e. it can act as a backpressure mechanism.
In combination with read concern, provide the client with differing levels of consistency guarantees.
These points assume that the write majority was acknowledged and the acknowledgment was received by the client. There are multiple different failure scenarios that are possible (as expected with a distributed system that needs to cope with an unreliable network), but those are beyond the scope of this discussion.
I know we can't write to a secondary in MongoDB, but I can't find any technical reason why. In my case, I don't really care if there is a slight delay, but writing to a secondary might be faster. Please provide some reference if you can. Thanks!!
The reason why you cannot write to a secondary is the way replication works:
Secondaries connect to a special collection on the primary, called the oplog. This oplog contains operations which were run through the query optimizer. The oplog is a capped collection, and the secondaries use a tailable cursor to access its entries and process them from oldest to newest.
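Purely for illustration, the same mechanism is visible from a driver: you can open a tailable cursor on the oplog yourself (a read-only PyMongo sketch, assuming a replica set named rs0):

    # Sketch: tailing the oplog the way a secondary conceptually does (read-only).
    from pymongo import CursorType, MongoClient

    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
    oplog = client.local["oplog.rs"]              # capped collection of replicated ops

    last = oplog.find().sort("$natural", -1).limit(1).next()   # most recent entry
    cursor = oplog.find({"ts": {"$gt": last["ts"]}},
                        cursor_type=CursorType.TAILABLE_AWAIT)
    for op in cursor:                             # blocks, waiting for new operations
        print(op["op"], op.get("ns"), op["ts"])   # op type, namespace, timestamp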
When an election takes place because the primary goes down or steps down, the secondary with the most recent oplog entry is elected primary. The secondaries connect to the new primary, query for the oplog entries they haven't processed yet, and the cluster is back in sync.
This procedure is pretty straightforward. Now imagine one could write to a secondary. Every node in the cluster would have to keep a tailable cursor on every other node, and maintaining a consistent state when a machine fails would become very complicated, and in failure cases even dependent on race conditions. Effectively, not even eventual consistency could be guaranteed any more; it would be more or less a gamble.
That being said: a replica set is not for load balancing. A replica set's purpose is to enhance the availability and durability of the data. Because reading from a secondary is relatively low-risk, MongoDB makes it possible, in line with its philosophy of offering as many features as possible without compromising scalability (which would be severely hampered if one could write to secondaries).
But MongoDB does provide a load-balancing feature: sharding. By choosing the right shard key, you can distribute read and write load over (almost) as many shards as you want. Not to mention that sharding lets you provision a lot more of that precious RAM at a reasonable price.
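For completeness, a rough sketch of what that looks like from a driver (PyMongo against a mongos router; the database, collection, and shard key are made up):

    # Sketch: sharding a collection to scale writes (run against a mongos router).
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-host:27017")   # hypothetical mongos address
    client.admin.command("enableSharding", "social")
    client.admin.command("shardCollection", "social.posts",
                         key={"user_id": "hashed"})
    # Writes for different user_id values now land on different shards.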
There is a one-liner answer:
Multi-master replication is a hairball.
If you were allowed to write to secondaries, MongoDB would have to use multi-master replication to get this working: http://en.wikipedia.org/wiki/Multi-master_replication where essentially every node copies the ops (operations) it has received to every other node, and somehow does so without losing data.
This form of replication has many obstacles to overcome.
One is throughput; remember that ops need to travel across the entire network, so it is possible you would actually lose throughput while adding consistency problems. Getting better throughput would be a problem. It is much like having a secondary take all of the primary's ops plus its own for outbound replication, and then asking it to do yet another job.
Adding consistency across a distributed set like this would also be hazardous. One question that already troubles MongoDB when deciding whether a member is down is: "Is it really down, or just unreachable?" It is almost impossible to ensure true consistency in a distributed set like this, or at the very least very tricky.
Those are just two immediate problems.
Essentially, to sum up, MongoDB does not yet have multi-master replication. It might in the future, but I would not be jumping for joy if it does; I will most likely ignore such a feature. Normal replication and sharding, in both ACID and non-ACID databases, cause enough blood pressure as it is.
Everywhere I look, I see that MongoDB is CP.
But when I dig in I see it is eventually consistent.
Is it CP when you use safe=true? If so, does that mean that when I write with safe=true, all replicas will be updated before getting the result?
MongoDB is strongly consistent by default: if you do a write and then do a read, assuming the write was successful, you will always be able to read the result of the write you just made. This is because MongoDB is a single-master system and all reads go to the primary by default. If you optionally enable reading from the secondaries, MongoDB becomes eventually consistent, and it is possible to read out-of-date results.
MongoDB also gets high-availability through automatic failover in replica sets: http://www.mongodb.org/display/DOCS/Replica+Sets
I agree with Luccas' post. You can't just say that MongoDB is CP/AP/CA, because it actually is a trade-off between C, A and P, depending on both the database/driver configuration and the type of disaster. Here's a visual recap, and below a more detailed explanation.
Scenario | Main focus | Description
No partition | CA | The system is available and provides strong consistency
Partition, majority connected | AP | Non-synchronized writes from the old primary are ignored
Partition, majority not connected | CP | Only read access is provided, to avoid separated and inconsistent systems
Consistency:
MongoDB is strongly consistent when you use a single connection or the correct write/read concern level (which will cost you execution speed); see the sketch after this section. As soon as you don't meet those conditions (especially when you are reading from a secondary replica), MongoDB becomes eventually consistent.
Availability:
MongoDB gets high availability through replica sets. As soon as the primary goes down or becomes otherwise unavailable, the secondaries will elect a new primary to become available again. There is a disadvantage to this: every write that was performed by the old primary but not yet synchronized to the secondaries will be rolled back and saved to a rollback file as soon as the old primary reconnects to the set (it is a secondary now). So in this case some consistency is sacrificed for the sake of availability.
Partition Tolerance:
Through the use of said replica sets, MongoDB also achieves partition tolerance: as long as more than half of the servers of a replica set are connected to each other, a new primary can be chosen. Why? To ensure that two separated networks cannot both choose a new primary. When not enough secondaries are connected to each other, you can still read from them (but consistency is not ensured) but not write. The set is practically unavailable for the sake of consistency.
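As a hedged sketch of the consistency side (PyMongo, illustrative names): combining a majority write concern with a linearizable read concern gives single-document reads that reflect all acknowledged writes, at the cost of speed and of always reading from the primary.

    # Sketch: trading speed for consistency with write/read concern levels.
    from pymongo import MongoClient
    from pymongo.read_concern import ReadConcern
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
    accounts = client.bank.get_collection(
        "accounts",
        write_concern=WriteConcern(w="majority"),
        read_concern=ReadConcern("linearizable"),   # primary-only, single-document reads
    )
    accounts.insert_one({"_id": 42, "balance": 100})
    doc = accounts.find_one({"_id": 42})   # slower, but sees every majority-acknowledged write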
Since a brilliant new article and some awesome experiments by Kyle in this field have appeared, you should be careful when labeling MongoDB, and other databases, as C or A.
Of course CAP helps to summarize in a few words what a database favors, but people often forget that the C in CAP means atomic consistency (linearizability), for example. This caused me a lot of pain to understand when trying to classify. So, even though MongoDB gives strong consistency, that doesn't mean that it is C. If you make these classifications, I recommend also going into more depth about how it actually works, so as not to leave doubts.
Yes, it is CP when using safe=true. This simply means that the data made it to the master's disk.
If you want to make sure it also arrived on some replica, look into the 'w=N' parameter where N is the number of replicas the data has to be saved on.
see this and this for more information.
MongoDB selects Consistency over Availability whenever there is a Partition: when a partition (P) occurs, it chooses Consistency (C) over Availability (A).
To understand this, let's look at how a MongoDB replica set works. A replica set has a single primary node. The only "safe" way to commit data is to write to that node and then wait for that data to commit to a majority of nodes in the set (you will see the w=majority flag when sending writes).
A partition can occur in two scenarios:
When the primary node goes down: the system becomes unavailable until a new primary is elected.
When the primary node loses its connection to too many secondary nodes: the system becomes unavailable. The other secondaries will try to elect a new primary, and the current primary will step down.
Basically, whenever a partition happens and MongoDB needs to decide what to do, it will choose Consistency over Availability. It will stop accepting writes to the system until it believes that it can safely complete those writes.
MongoDB never allows writes to a secondary. It allows optional reads from secondaries, but not writes. So if your primary goes down, you can't write until a secondary becomes primary again. That is how you sacrifice high availability in the CAP sense. By keeping your reads on the primary only, you can have strong consistency.
I'm not sure about P for Mongo. Imagine this situation:
Your replica set gets split into two partitions.
Writes continue on both sides, as new masters were elected.
The partition is resolved and all servers are connected again.
A new master is elected (the one with the most recent oplog), but the data from the other master is reverted to the common state before the partition and dumped to a file for manual recovery.
All secondaries catch up with the new master.
The problem here is that the dump file size is limited, and if you had a partition for a long time, you can lose your data forever.
You could say that it's unlikely to happen; yes, unless you're in the cloud, where it is more common than one might think.
This example is why I would be very careful before assigning any letter to any database. There are so many scenarios, and implementations are not perfect.
If anyone knows if this scenario has been addressed in later releases of Mongo please comment! (I haven't been following everything that was happening for some time..)
MongoDB gives up availability. When we talk about availability in the context of the CAP theorem, it is about avoiding single points of failure that can go down. In MongoDB there is a primary host, and if it goes down, there will be some downtime while a replacement server is elected to take its place. In practice that happens very quickly: there are hot standbys sitting there ready to go, so as soon as the system detects that the primary went down, it switches over to a new one almost right away. Technically speaking, it is still a single point of failure, and there is still a chance of downtime when that happens.
There is also a config server with a primary, and an application-facing server that is primary, at any given time. Even though we have multiple backups, there will be a brief period of downtime if any of those servers go down: the system first has to detect that there was an outage, and then the remaining servers need to elect a new primary to take its place. That might take a few seconds, and this is enough to say that MongoDB trades off availability.