Is it possible to set the WriteConcern to something like 'all', meaning the insert/update will only return when all "currently functional" (at the time of the operation) replica members acknowledge the operation?
As:
the 'majority' setting leaves some members unaccounted for;
if we specify a numeric value equal to the total number of members, the insert/update may block indefinitely when any replica member goes down for any reason;
if we use tag sets, as outlined in the official docs, we still need to supply a numeric value for each tag, and if we set that value to the total member count and any member goes down, the result is the same as in the second point.
What we have in mind is a setting for WriteConcern that is, dynamically, the total number of replica members at the time of the insert/update.
Thanks in advance!
Is it possible to set the WriteConcern to something like 'all', meaning the insert/update will only return when all "currently functional" (at the time of the operation) replica members acknowledge the operation?
There is a logical contradiction in suggesting your use case requires strong consistency except when that isn't possible. There are expected scenarios such as replication lag, maintenance, or failure (and subsequent recovery) where one or more of your replica set secondaries may be online but lagging behind the current primary.
If your use case requires strong consistency then you should always read from the primary instead of secondaries, and use a write concern of majority / replica_safe to ensure data has replicated sufficiently for high availability in the event of failover.
The default read preference is to direct reads to the primary for consistency. Secondaries in MongoDB replica sets are generally intended to support high availability rather than read scaling (with a few exceptions such as distribution across multiple data centres). For a lengthier explanation see: Can I use more replica nodes to scale?.
the 'majority' setting leaves some members unaccounted for.
The majority write concern matches up with the majority required to elect a primary in the event of a replica set election. The replica set election mechanics include ensuring that a new primary is up-to-date with the most recent operation available in the replica set (of the nodes that are participating in the election).
if we specify a numeric value equal to the total number of members, the insert/update may block indefinitely when any replica member goes down for any reason.
That is the expected default behaviour; however, there is a wtimeout write concern option which sets a time limit (in milliseconds) so a write will not block indefinitely waiting for the write concern to be satisfied.
The caveats on using timeouts are very important, and provide even less certainty of outcome:
wtimeout causes write operations to return with an error after the specified limit, even if the required write concern will eventually succeed. When these write operations return, MongoDB does not undo successful data modifications performed before the write concern exceeded the wtimeout time limit.
The write concern timeout is not directly related to the current health of the replica set members (i.e. whether they are online or offline and might be able to acknowledge a write concern) or the eventual outcome -- it's just a hard stop on how long your application will wait for a response before returning.
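To illustrate, here is a minimal sketch using the MongoDB Node.js driver (assuming a recent driver version; the connection string, database, and collection names are placeholders):

import { MongoClient } from 'mongodb';

async function insertWithTimeout(): Promise<void> {
  // Placeholder connection string for a 3-member replica set.
  const client = new MongoClient('mongodb://host1,host2,host3/?replicaSet=rs0');
  try {
    await client.connect();
    const orders = client.db('shop').collection('orders');
    // Wait for a majority acknowledgement, but give up waiting after 5 seconds.
    await orders.insertOne(
      { sku: 'A-100', qty: 2 },
      { writeConcern: { w: 'majority', wtimeoutMS: 5000 } }
    );
  } catch (err) {
    // A wtimeout surfaces as an error here, but the document may still have
    // been written and may eventually replicate to the requested nodes.
    console.error('write not acknowledged in time:', err);
  } finally {
    await client.close();
  }
}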
if we use tag sets, as outlined in the official docs, we still need to supply a numeric value for each tag, and if we set that value to the total member count and any member goes down, the result is the same as in the second point.
Correct.
In general, MongoDB replicates from a Primary to Secondaries asynchronously, based on the number of write operations, time, and other factors, by shipping the oplog from the primary to the secondaries.
When describing WriteConcern options, MongoDB documentation states "...primary waits until the required number of secondaries acknowledge the write before returning write concern acknowledgment". This seems to suggest that a WriteConcern other than "w:1" would replicate to at least some of the members of the replica set in a blocking manner, potentially avoiding log shipping.
The basic question I'm trying to answer is this: if every write uses a WriteConcern of "majority", would MongoDB ever have to use log shipment? In other words, does using a WriteConcern of "majority" also control replication timing?
I would like to better understand how MongoDB handles WriteConcern of "majority". A few obvious options:
Primary sends write requests to every Secondary, and blocks the thread until a majority responds with acknowledgment
or
Primary pre-selects Secondaries first and sends requests to only those secondaries, blocking the thread until all chosen secondaries respond with acknowledgment
or
Something much smarter than either of these options
If Option 1 is used, in most cases (assuming equidistant placement of secondaries) all secondaries will have received the write operation by the time the write completes, and there's a high probability (although not a guarantee) that all secondaries will have applied it. If true, this behavior enables use cases where writes need to be reflected on Secondaries more quickly than the typical asynchronous replication process allows.
Obviously WriteConcern of "majority" will incur performance penalty, but this may be acceptable for specific use cases where read operations may target Secondaries (e.g. ReadPreference of "nearest") and desire more recent data.
if every write is using WriteConcern of "majority", would MongoDB ever have to use log shipment?
Replication in MongoDB uses what is termed the oplog. This is a record of all operations on the primary (the only node that accepts writes).
Instead of the primary pushing the oplog to the secondaries, the secondaries pull the oplog from the primary. If replication chaining is allowed (the default), a secondary can also pull the oplog from another secondary. So scenarios 1 and 2 that you posted are not how MongoDB replication works as of MongoDB 4.0.
The details of the replication process are described in the MongoDB GitHub wiki page: Replication Internals.
To quote the relevant parts regarding your question:
If a command includes a write concern, the command will just block in its own thread until the oplog entries it generates have been replicated to the requested number of nodes. The primary keeps track of how up-to-date the secondaries are to know when to return. A write concern can specify a number of nodes to wait for, or majority.
In other words, the secondaries continually report back to the primary how far along they have applied the oplog into their own datasets. Since the primary knows the timestamp at which the write took place, once a secondary has applied up to that timestamp, the primary can tell that the write has propagated to that secondary. To satisfy the write concern, the primary simply waits until the required number of secondaries have applied the write timestamp.
Note that only the thread specifying the write concern waits for this acknowledgment. No other threads are blocked by this waiting.
Regarding your other question:
Obviously WriteConcern of "majority" will incur performance penalty, but this may be acceptable for specific use cases where read operations may target Secondaries (e.g. ReadPreference of "nearest") and desire more recent data.
To achieve what you described, you need a combination of read and write concerns. See Causal Consistency and Read and Write Concerns for more details on this subject.
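As a hedged sketch of that combination with the Node.js driver (database and collection names are invented for illustration): a w:"majority" write paired with a readConcern:"majority" read from the nearest member. Note that without a causally consistent session this still does not guarantee that this particular read observes the write.

import { MongoClient } from 'mongodb';

async function majorityRoundTrip(client: MongoClient): Promise<void> {
  const metrics = client.db('app').collection('metrics');
  // Returns once a majority of data-bearing members have applied the write.
  await metrics.insertOne(
    { name: 'cpu', value: 0.42 },
    { writeConcern: { w: 'majority' } }
  );
  // Reads only majority-committed data, from whichever member is nearest;
  // pair this with a causally consistent session for read-your-own-writes.
  const doc = await metrics.findOne(
    { name: 'cpu' },
    { readConcern: { level: 'majority' }, readPreference: 'nearest' }
  );
  console.log(doc);
}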
Write majority is typically used for:
Ensuring that the write will not be rolled back in the event of a primary failure.
Ensuring that the application is not writing so fast that the provisioned hardware of the replica set cannot cope with the traffic; i.e. it can act as a backpressure mechanism.
In combination with read concern, provide the client with differing levels of consistency guarantees.
These points assume that the write majority was acknowledged and the acknowledgment was received by the client. There are multiple different failure scenarios that are possible (as expected of a distributed system that must cope with an unreliable network), but those are beyond the scope of this discussion.
My application is essentially a bunch of microservices deployed across Node.js instances. One service might write some data while a different service will read those updates. (specific example, I'm processing data that is inbound to my solution using a processing pipeline. Stage 1 does something, stage 2 does something else to the same data, etc. It's a fairly common pattern)
So, I have a large data set (~250GB now, and I've read that once a DB gets much larger than this size, it is impossible to introduce sharding to a database, at least, not without some major hoop jumping). I want to have a highly available DB, so I'm planning on a replica set with at least one secondary and an arbiter.
I am still researching my 'sharding' options, but I think that I can shard my data by the 'client' that it belongs to and so I think it makes sense for me to have 3 shards.
First question, if I am correct, if I have 3 shards and my replica set is Primary/Secondary/Arbiter (with Arbiter running on the Primary), I will have 6 instances of MongoDB running. There will be three primaries and three secondaries (with the Arbiter running on each Primary). Is this correct?
Second question. I've read conflicting info about what 'majority' means... If I have a Primary and Secondary and I'm writing using the 'majority' write acknowledgement, what happens when either the Primary or Secondary goes down? If the Arbiter is still there, the election can happen and I'll still have a Primary. But, does Majority refer to members of the replication set? Or to Secondaries? So, if I only have a Primary and I try to write with 'majority' option, will I ever get an acknowledgement? If there is only a Primary, then 'majority' would mean a write to the Primary alone triggers the acknowledgement. Or, would this just block until my timeout was reached and then I would get an error?
Third question... I'm assuming that as long as I do writes with 'majority' acknowledgement and do reads from all the Primaries, I don't need to worry about causally consistent data? I've read that doing reads from 'Secondary' nodes is not worth the effort. If reading from a Secondary, you have to worry about 'eventual consistency' and since writes are getting synchronized, the Secondaries are essentially seeing the same amount of traffic that the Primaries are. So there isn't any benefit to reading from the Secondaries. If that is the case, I can do all reads from the Primaries (using 'majority' read concern) and be sure that I'm always getting consistent data and the sharding I'm doing is giving me some benefits from distributing the load across the shards. Is this correct?
Fourth (and last) question... When are causally consistent sessions worthwhile? If I understand correctly, and I'm not sure that I do, then I think it is when I have a case like a typical web app (not some distributed application, like my current one), where there is just one (or two) nodes doing the reading and writing. In that case, I would use causally consistent sessions and do my writes to the Primary and reads from the Secondary. But, in that case, what would the benefit of reading from the Secondaries be, anyway? What am I missing? What is the use case for causally consistent sessions?
if I have 3 shards and my replica set is Primary/Secondary/Arbiter (with Arbiter running on the Primary), I will have 6 instances of MongoDB running. There will be three primaries and three secondaries (with the Arbiter running on each Primary). Is this correct?
A replica set Arbiter is still an instance of mongod. It's just that an Arbiter does not have a copy of the data and cannot become a Primary. You should have 3 instances per shard, which means 9 instances in total.
Since you mentioned that you would like to have a highly available database deployment, please note that the minimum recommended replica set members for production deployment would be a Primary with two Secondaries.
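If it helps, here is a sketch of initiating such a Primary/Secondary/Secondary set via the Node.js driver (hostnames are placeholders; the replSetInitiate command must be issued against one member over a direct connection before the set exists):

import { MongoClient } from 'mongodb';

async function initiatePss(): Promise<void> {
  const client = new MongoClient('mongodb://db1:27017/?directConnection=true');
  await client.connect();
  await client.db('admin').command({
    replSetInitiate: {
      _id: 'rs0',
      members: [
        { _id: 0, host: 'db1:27017' }, // data-bearing, can become primary
        { _id: 1, host: 'db2:27017' }, // data-bearing secondary
        { _id: 2, host: 'db3:27017' }, // data-bearing secondary, not an arbiter
      ],
    },
  });
  await client.close();
}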
If I have a Primary and Secondary and I'm writing using the 'majority' write acknowledgement, what happens when either the Primary or Secondary goes down?
When either the Primary or the Secondary becomes unavailable, a w:majority write will either:
Wait indefinitely,
Wait until the unavailable node is restored, or
Fail with a timeout.
This is because an Arbiter carries no data and is unable to acknowledge writes, yet is still counted as a voting member. See also Write Concern for Replica Sets.
I can do all reads from the Primaries (using 'majority' read concern) and be sure that I'm always getting consistent data and the sharding I'm doing is giving me some benefits from distributing the load across the shards
Correct. MongoDB sharding exists to scale horizontally by distributing load across shards, while MongoDB replication provides high availability.
If you read only from the Primary and also specify readConcern:majority, the application will read data that has been acknowledged by the majority of the replica set members. This data is durable in the event of a partition (i.e. it will not be rolled back). See also Read Concern 'majority'.
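A small sketch of such a read with the Node.js driver (database and collection names are invented):

import { MongoClient } from 'mongodb';

async function readDurable(client: MongoClient) {
  const users = client.db('app').collection('users');
  return users.findOne(
    { name: 'alice' },
    {
      readPreference: 'primary',          // the default, spelled out here
      readConcern: { level: 'majority' }, // only majority-committed data
    }
  );
}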
What is the use case for causally consistent sessions?
Causal Consistency is used if the application requires an operation to be logically dependent on a preceding operation (causal). For example, a write operation that deletes all documents based on a specified condition and a subsequent read operation that verifies the delete operation have a causal relationship. This is especially important in a sharded cluster environment, where write operations may go to different replica sets.
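The delete-then-verify example above might look like this with the Node.js driver (a sketch; collection and field names are placeholders):

import { MongoClient } from 'mongodb';

async function deleteThenVerify(client: MongoClient): Promise<void> {
  const session = client.startSession({ causalConsistency: true });
  try {
    const items = client.db('app').collection('items');
    await items.deleteMany(
      { expired: true },
      { session, writeConcern: { w: 'majority' } }
    );
    // Causal consistency orders this read after the delete above, even if
    // the two operations are served by different members or shards.
    const leftovers = await items.countDocuments(
      { expired: true },
      { session, readConcern: { level: 'majority' } }
    );
    console.log(leftovers); // expected: 0
  } finally {
    await session.endSession();
  }
}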
I am reading about WriteConcern in the MongoDB docs but it's not clear to me. I have a question: what is it, and when must we use WithWriteConcern(WriteConcern.Acknowledged)?
What is the difference between:
WithWriteConcern(WriteConcern.Acknowledged).InsertOne()
and InsertOne(), and which one should we use?
Please explain simply.
sayres, the write concern is a specification of MongoDB for write operations that determines the acknowledgement you want after a write operation has taken place. MongoDB's default write concern acknowledges all writes: after every write, MongoDB returns an acknowledgement (in the form of a document) indicating that the write was successful. When asking for write acknowledgement, if none is returned (in the case of failover or a crash), the write isn't considered successful. This behavior is especially useful with replica sets, since you will have more than one mongod instance, and depending on your needs, maybe you don't want all instances to acknowledge the write, just a few, to speed up writes. Also, when specifying a write concern, you can request journaling, so an acknowledged write is durable on disk even if the acknowledging node crashes. More information here.
In your case, it depends on how many mongod instances you have (whether you run a replica set or just a single server). Since "always acknowledge" is the default, you may want to change it if you have to manage replica set operations and want to speed things up, or if you just don't care about write acknowledgement on a single instance (which is not so good, since it's a single server only).
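The question used the C# driver, but the same distinction can be sketched with the Node.js driver (collection names are invented); the semantics are the same:

import { MongoClient } from 'mongodb';

async function demo(client: MongoClient): Promise<void> {
  const logs = client.db('app').collection('logs');
  // Default behaviour is acknowledged (w:1), so these two calls are equivalent.
  await logs.insertOne({ msg: 'hello' });
  await logs.insertOne({ msg: 'hello again' }, { writeConcern: { w: 1 } });
  // Unacknowledged: returns without waiting for server confirmation.
  // Faster, but errors (e.g. duplicate keys) go unreported.
  await logs.insertOne({ msg: 'fire-and-forget' }, { writeConcern: { w: 0 } });
}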
I have a 3 node MongoDB (2.6.3) replica set and am testing various failure scenarios.
It was my understanding that if a majority of the replica nodes are not available then the replica set becomes read only. But what I am experiencing is if I shut down my 2 secondary nodes, the last remaining node (which was previously primary) becomes a secondary and I cannot even read from it.
From the docs:
Users may configure read preference on a per-connection basis to prefer that the read operations return on the secondary members. If clients configure the read preference to permit secondary reads, read operations can return from secondary members that have not replicated more recent updates or operations. When reading from a secondary, a query may return data that reflects a previous state.
It sounds like I can configure my client to allow reads from secondaries, but since it was the primary node that I left up, it should be up to date with all of the data. Does MongoDB make the last node secondary even if it is fully caught up with data?
As you've noted, once you've shut down the two secondaries, your primary steps down and becomes a secondary (this is the normal behaviour once a primary loses its connection to the majority of members).
The default read preference of a replica set is to read from the primary, but since your former primary is no longer primary, you "cannot even read from it", as you have encountered.
You can change the read preference at the driver, database, collection, or even operation level.
since it was the primary node that I left up, it should be up to date with all of the data. Does MongoDB make the last node secondary even if it is fully caught up with data?
As said, the primary becomes a secondary when it steps down; this has nothing to do with whether it is up to date. It refuses reads because of the default read preference: if you change your driver's read preference to secondary, nearest, or similar, you'll be able to continue reading even when only a single node (the former primary) remains.
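For example, a sketch with the Node.js driver (the host name is a placeholder): connecting to the surviving node with a secondary-tolerant read preference:

import { MongoClient } from 'mongodb';

async function readFromSurvivor(): Promise<void> {
  // "primaryPreferred" falls back to secondaries when no primary exists;
  // "secondary" or "nearest" would also work here.
  const client = new MongoClient(
    'mongodb://survivor-host/?replicaSet=rs0&readPreference=primaryPreferred'
  );
  await client.connect();
  const doc = await client.db('app').collection('items').findOne({});
  console.log(doc);
  await client.close();
}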
The current configuration of MongoDb is:
one primary(A), two secondaries(B and C), all part of one replica set
inserts to the primary are done with write-concern: majority
read preference is set to "nearest" when reading from the replica set
Scenario:
an insert is triggered, which means that it will be successful only after it propagates to the majority of the replica set members
the application cannot read from the primary until the write operation returns (reference)
since the write concern is set to "majority", the write operation will return only after it propagates to at least one secondary (B), given our setup of 3 members
this means that the secondary (B) is also locked for reading, as it is replicating (according to this)
The question is: since the application is set to read from the nearest instance, and let's say the nearest instance is the other secondary (C), if a read request comes through while the write operation is still in progress on the other 2 instances, would the read be allowed or blocked? If it is allowed, how can I prevent it?
Write concern doesn't really work that way. B and C will both process the write, and take the same db-level write lock while they do it, regardless of whether you send a getLastError with any write concern. While the lock is held on C, reads on C will block.
Write concern is really just for the client, it makes the client wait until a condition (in your case, a majority of the replicas have applied the write) is satisfied. It doesn't change how the secondaries prioritize the replication.
if a read request comes through while the write operation is still in progress on the other 2 instances, would the read be allowed or blocked
Well, you sort of figured it out yourself. You can read (stale) data from 'C' if it's in the nearest group.
how can I prevent it?
Read preference can be applied globally by your driver, or at the database, collection, or operation level (much the same applies to write concern). If a certain operation can't tolerate stale data, you can override the read preference for that specific query to primary after you have issued the insert (note that in that scenario the insert operation can be set with a write concern of {w:1}).
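A sketch of that per-operation override with the Node.js driver (names are invented): the client defaults to nearest, but the sensitive read is pinned to the primary:

import { MongoClient } from 'mongodb';

async function readAfterWrite(): Promise<void> {
  // Client-wide default: read from the nearest member.
  const client = new MongoClient(
    'mongodb://host1,host2,host3/?replicaSet=rs0&readPreference=nearest'
  );
  await client.connect();
  const orders = client.db('shop').collection('orders');
  // As noted above, the insert itself can use {w:1} in this scenario.
  await orders.insertOne({ status: 'new' }, { writeConcern: { w: 1 } });
  // Per-operation override: this query goes to the primary, so it cannot
  // return stale data from a lagging secondary.
  const doc = await orders.findOne({ status: 'new' }, { readPreference: 'primary' });
  console.log(doc);
  await client.close();
}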