I have read a lot of the MongoDB documentation, but I can't understand the difference between the readConcern and readPreference options.
For example: what is the result if I set 'majority' as my read concern and 'primary' as my read preference? These two options seem contradictory.
I know that at the query level I can only set readConcern, but at the client level I can also set readPreference.
In a replica set the primary MongoDB instance is the master instance that receives all write operations.
The primary read preference is the default mode and concerns MongoDB clients; it's a driver/client option. It means you read data from the master instance, where it is written first (before being replicated to the other replica set members).
If you use a mode other than the primary read preference, then you risk reading stale data.
Read concern is a query option for replica sets. By default the read concern is local: the query returns the most recent data available at the time of execution, but that data may not yet have been persisted to a majority of the replica set members and may be rolled back. The option can be set to majority, which makes the query return the most recent data that has been persisted to a majority of the replica set members and therefore will not be rolled back. However, you have to set this up properly (it works only with the WiredTiger engine, and there are some other requirements), and you might miss more recent data that has been written but not yet persisted to a majority of the replica set members.
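The local/majority distinction can be sketched as a toy model (illustrative only, not the actual server implementation): each member knows both the newest write it has applied locally and the newest write known to be majority-committed.

```javascript
// Toy model (illustrative only): one member's view of the data.
// lastApplied       = newest write this member has applied locally
// majorityCommitted = newest write known to be on a majority of members
const member = { lastApplied: 105, majorityCommitted: 100 };

function read(member, readConcern) {
  if (readConcern === "majority") {
    return member.majorityCommitted; // durable; cannot be rolled back
  }
  return member.lastApplied; // "local": newest, but may be rolled back
}

console.log(read(member, "local"));    // 105
console.log(read(member, "majority")); // 100
```

With "local" you see write 105, which could still be rolled back; with "majority" you only see write 100, which is safe but older.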
Let's assume you use the default options for read preference and read concern. Your MongoDB driver will then route the read request to the primary replica set member (the master instance), and that instance will return the most recent data available at that moment. That data might not have been persisted to the majority of the replica set members and might be rolled back.
Similarly, you can think of use cases for the other combinations of the read concern and read preference options:
local / primaryPreferred
local / secondary
local / secondaryPreferred
local / nearest
majority / primaryPreferred
majority / secondary
majority / secondaryPreferred
majority / nearest
The options are described in the MongoDB documentation. Different combinations make sense in different situations; I simply listed them all here for completeness. I'd interpret a combination as follows:
first, the request is routed according to the read preference option (a driver option);
second, the query is executed according to the read concern option (a query option).
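Both options can also be combined on the connection string; the parameter names readPreference and readConcernLevel are documented connection-string options (host names and database here are placeholders):

```javascript
// Hypothetical hosts and database name.
const uri =
  "mongodb://host1:27017,host2:27017,host3:27017/mydb" +
  "?replicaSet=rs0" +
  "&readPreference=secondaryPreferred" + // WHICH member the driver queries
  "&readConcernLevel=majority";          // WHAT data that member may return

console.log(uri);
```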
readConcern - is about how we want to read data from Mongo: with a replica set, readConcern majority only returns data that has been saved (persisted) to a majority of the replica set members, so we can be assured that the document will not be rolled back if there is an issue with replication.
readPreference - can help with load balancing: for example, a report-generator process can always read data from a secondary and leave the primary server free to serve data to online users.
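The routing idea can be sketched as a toy member-selection function (illustrative only; the real driver also weighs latency, tags, and staleness):

```javascript
// Toy model of replica-set members (hypothetical hosts).
const members = [
  { host: "db1", role: "primary" },
  { host: "db2", role: "secondary" },
  { host: "db3", role: "secondary" },
];

// Illustrative member selection for a few read-preference modes.
function selectMember(mode, members) {
  const primary = members.find((m) => m.role === "primary");
  const secondaries = members.filter((m) => m.role === "secondary");
  switch (mode) {
    case "primary":
      return primary || null;
    case "secondary":
      return secondaries[0] || null;
    case "primaryPreferred":
      return primary || secondaries[0] || null;
    case "secondaryPreferred":
      return secondaries[0] || primary || null;
    default:
      throw new Error("mode not covered in this sketch: " + mode);
  }
}

console.log(selectMember("secondaryPreferred", members).host); // "db2"
```

A report generator would use "secondary" or "secondaryPreferred" here, so its load lands on db2/db3 while db1 keeps serving online users.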
Related
In MongoDB 4.4.1 there is a mirrored reads configuration which allows the primary to forward read/update requests to the secondaries of the replica set.
How is it different from the secondaryPreferred readPreference when its sampling rate is set to 1.0?
What is the use-case of mirroredRead?
reference - https://docs.mongodb.com/manual/replication/#mirrored-reads-supported-operations
What is the use-case of mirroredRead?
This is described in the documentation you linked:
MongoDB provides mirrored reads to pre-warm the cache of electable secondary members
If you are not familiar with cache warming, there are many resources describing it, e.g. https://www.section.io/blog/what-is-cache-warming/.
A secondary read:
Is sent to the secondary, thus reducing the load on the primary
Can return stale data
A mirrored read:
Is sent to the primary
Always returns the most recent data
mirroredRead configuration which allows primary to forward read/update requests to secondary replicaset.
This is incorrect:
A mirrored read is not applicable to updates.
The read is not "forwarded". The primary responds to the read using its local data. Additionally, the primary sends a read request to one or more secondaries, but does not receive a result of this read at all (and does not "forward" the secondary read result back to the application).
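The flow just described can be sketched as a toy simulation (illustrative only): the primary answers the client itself and, with probability samplingRate, also fires a copy of the read at a secondary whose response is discarded.

```javascript
// Toy sketch of a mirrored read. The client always gets the primary's
// data; the secondary read is fire-and-forget (cache warming only).
function mirroredRead(query, primary, secondary, samplingRate, rng) {
  const result = primary.read(query); // client receives this
  if (rng() < samplingRate) {
    secondary.read(query); // result intentionally ignored
  }
  return result;
}

const primary = {
  data: "fresh", reads: 0,
  read(q) { this.reads++; return this.data; },
};
const secondary = {
  data: "stale", reads: 0,
  read(q) { this.reads++; return this.data; },
};

// With samplingRate 1.0 every read is mirrored, yet the client still
// always receives the primary's (fresh) answer.
const answer = mirroredRead("q", primary, secondary, 1.0, Math.random);
console.log(answer, secondary.reads); // "fresh" 1
```

Contrast this with secondaryPreferred, where the client's result itself comes from the secondary and may be stale.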
Let's suppose you always use the primary read preference and you have 2 members that are electable as primary.
Since all of your reads take place on the primary instance, its cache is heavily populated; since your other electable member doesn't receive any reads, its cache can be considered empty.
Using mirrored reads, the primary will send a portion (in your question, 100%) of read requests to that secondary as well, to make it familiar with the pattern of read queries and populate its cache.
Suddenly a disaster occurs and the current primary goes down. Your new primary now has a pre-warmed cache and can respond to queries as fast as the previous primary did, without the system taking a shock while its cache is populated.
Regarding the impact of the sampling rate, the MongoDB folks stated in their blog post introducing this feature that increasing the sampling rate increases the load on the replica set. My understanding is that you may already have queries with a read preference other than primary that keep your secondary instance busy; in this case, mirrored reads can impact the performance of that secondary. Hence, you may not want to repeat all primary reads on these secondaries (the repetition of the terms secondary and primary is mind blowing!).
The story with secondaryPreferred reads is different: there you query the secondaries for data, unless no secondary is available.
Let's say I have one primary (A) and two secondaries (B, C), and I write using write concern majority. Can someone please explain my doubts below?
Let's say a write was done using majority and it updated the data in A and B, but the write has not yet propagated to C. If a read for the same data comes in at this time using secondary or secondaryPreferred, will the query be served from B, which has the latest data, or can Mongo not guarantee this, so the read may return stale data from C?
Let's say a write was done using majority again: the write was done on A and is in progress on secondary B. If a read comes in at that time, will the read be blocked, or will it serve stale data from C?
Let's say I have taken out secondary C and the same case as above is in progress. Will the read from secondary B be blocked until the write completes on B, or will the read not be blocked, with stale data served from B?
Environment
Mongo Version - 3.0.9
Storage Engine - MMAPv1
MongoDB replication to secondaries is asynchronous. If the read concern is set to 'majority', you may read stale data; essentially, you are opting for eventual consistency.
If the read concern is set to 'local', you will get the latest data from the primary.
Please note that a readConcern level of 'majority' can be used with the WiredTiger storage engine only. WiredTiger doesn't use in-place updates (it writes new versions of the data instead), has no collection-level locks, and offers document-level concurrency.
Read concern = "majority"
The query returns the instance's most recent copy of data confirmed as written to a majority of members in the replica set.
To use a read concern level of "majority", you must use the WiredTiger storage engine and start the mongod instances with the --enableMajorityReadConcern command line option (or the replication.enableMajorityReadConcern setting if using a configuration file).
Question 1: Does Mongo guarantee that the read will be served from the secondary to which the data was written?
Answer 1: MongoDB doesn't guarantee this. The selection of the secondary depends on the following:
When you select a non-primary read preference, the driver determines which member to target based on various factors. Refer to this link.
Read preference mechanics member selection
Question 2: Will reading be blocked if a write on the same data is in progress?
Answer 2: Reading will not be blocked. However, you may read some stale data.
Reads may miss matching documents that are updated during the course of the read operation.
Concurrency locking what isolation guarantees does MongoDB provide
Here is my question:
I would like to use MongoDB's replica capabilities to provide a read-only replica set of data to be pushed to devices.
My problem right now is that I would like to know when certain documents have been inserted/updated AND replicated across all nodes.
As I am sending notifications on top of this, I would like to make sure the data is replicated before sending them.
You can do this by specifying tags for the relevant members and providing a custom write concern for the insert/update operations, so that they only return after the operation has completed and been replicated to the tagged nodes you care about.
You can read more about it here:
http://docs.mongodb.org/manual/core/replica-set-write-concern/#custom-write-concerns
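A sketch of what such a configuration might look like (the tag values "push-a"/"push-b" and the mode name "devicesReplicated" are hypothetical; note that in getLastErrorModes the number counts distinct values of the tag, which is why the two members carry different values):

```javascript
// Step 1: tag the members you care about in the replica-set config
// (this document would be applied with rs.reconfig in the shell).
const conf = {
  _id: "rs0",
  members: [
    { _id: 0, host: "db1:27017", tags: { use: "push-a" } },
    { _id: 1, host: "db2:27017", tags: { use: "push-b" } },
    { _id: 2, host: "db3:27017", tags: {} },
  ],
  settings: {
    // Step 2: a custom mode requiring replication to members covering
    // 2 distinct values of the "use" tag.
    getLastErrorModes: { devicesReplicated: { use: 2 } },
  },
};

// Step 3: writes using the custom mode only acknowledge once the
// operation has reached both tagged members.
const writeOptions = { writeConcern: { w: "devicesReplicated" } };
```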
First, to make sure your data is in sync across secondaries, you have to set appropriate write concerns when inserting/updating.
Write concern docs
Then, to see whether the data was inserted/updated, you have to monitor the primary's oplog.
Replica Oplog docs
Note: I believe you can say that your replica set is consistent when all the members have roughly the same oplog contents.
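One way to picture the monitoring step (a toy sketch, not a real driver call): compare each member's last-applied oplog timestamp, as something like rs.status() would report it, against the timestamp of the write you care about. Host names and timestamps here are made up.

```javascript
// Timestamp of the write we want confirmed everywhere (hypothetical).
const writeTs = 1700000105;

// Simplified stand-in for the per-member optimes reported by rs.status().
const status = {
  members: [
    { name: "db1:27017", stateStr: "PRIMARY",   optimeTs: 1700000110 },
    { name: "db2:27017", stateStr: "SECONDARY", optimeTs: 1700000110 },
    { name: "db3:27017", stateStr: "SECONDARY", optimeTs: 1700000099 },
  ],
};

// The write has replicated everywhere only if every member's
// last-applied optime is at or past the write's timestamp.
const replicatedEverywhere = status.members.every((m) => m.optimeTs >= writeTs);
console.log(replicatedEverywhere); // false: db3 is still behind
```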
Obviously, I know why to use a replica set in general.
But, I'm confused about the difference between connecting directly to the PRIMARY mongo instance and connecting to the replica set. Specifically, if I am connecting to Mongo from my node.js app using Mongoose, is there a compelling reason to use connectSet() instead of connect()? I would assume that the failover benefits would still be present with connect(), but perhaps this is where I am wrong...
The reason I ask is that, in mongoose, the connectSet() method seems to be less documented and well-used. Yet, I cannot imagine a scenario where you would NOT want to connect to the set, since it is recommended to always run Mongo on a 3x+ replica set...
If you connect only to the primary then you get failover (that is, if the primary fails, there will be a brief pause until a new master is elected). Replication within the replica set also makes backups easier. A downside is that all writes and reads go to the single primary (a MongoDB replica set only has one primary at a time), so it can become a bottleneck.
Allowing connections to slaves, on the other hand, lets you scale for reads (not for writes; those still have to go to the primary). Your throughput is no longer limited by the spec of the machine running the primary node but can be spread across the slaves. However, you now have a new problem: stale reads. That is, there is a chance that you will read stale data from a slave.
Now think hard about how your application behaves. Is it read-heavy? How much does it need to scale? Can it cope with stale data in some circumstances?
Incidentally, the point of a minimum of 3 members in the replica set is to offer resiliency and safe replication, not to provide multiple nodes to connect to. If you have 3 nodes and you lose one, you still have enough nodes to elect a new primary and replicate to a backup node.
I have decided to start developing a little web application in my spare time so I can learn about MongoDB. I was planning to get an Amazon AWS micro instance and start the development and the alpha stage there. However, I stumbled across a question here on Stack Overflow that concerned me:
But for durability, you need to use at least 2 MongoDB server instances as master/slave. Otherwise you can lose the last minute of your data.
Is that true? Can't I just have my box with everything installed on it (Apache, PHP, MongoDB) and rely on the data being stored correctly? At the very least, there must be a config option in MongoDB to make it behave reliably even when installed on a single box, isn't there?
The information you have on master/slave setups is outdated. Running single-server MongoDB with journaling is a durable data store, so for use cases where you don't need replica sets or if you're in development stage, then journaling will work well.
However if you're in production, we recommend using replica sets. For the bare minimum set up, you would ideally run three (or more) instances of mongod, a 'primary' which receives reads and writes, a 'secondary' to which the writes from the primary are replicated, and an arbiter, a single instance of mongod that allows a vote to take place should the primary become unavailable. This 'automatic failover' means that, should your primary be unable to receive writes from your application at a given time, the secondary will become the primary and take over receiving data from your app.
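The minimal setup described above (primary, secondary, arbiter) corresponds to a replica-set configuration document like the following sketch (host names are placeholders), which would be passed to rs.initiate() in the shell:

```javascript
// Hypothetical three-member replica set: two data-bearing members
// plus an arbiter that only votes in elections and holds no data.
const conf = {
  _id: "rs0",
  members: [
    { _id: 0, host: "db1:27017" },                    // primary candidate
    { _id: 1, host: "db2:27017" },                    // secondary
    { _id: 2, host: "db3:27017", arbiterOnly: true }, // arbiter
  ],
};
```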
You can read more about journaling here and replication here, and you should definitely familiarize yourself with the documentation in general in order to get a better sense of what MongoDB is all about.
Replication provides redundancy and increases data availability. With multiple copies of data on different database servers, replication protects a database from the loss of a single server. Replication also allows you to recover from hardware failure and service interruptions. With additional copies of the data, you can dedicate one to disaster recovery, reporting, or backup.
In some cases, you can use replication to increase read capacity. Clients have the ability to send read and write operations to different servers. You can also maintain copies in different data centers to increase the locality and availability of data for distributed applications.
Replication in MongoDB
A replica set is a group of mongod instances that host the same data set. One mongod, the primary, receives all write operations. All other instances, secondaries, apply operations from the primary so that they have the same data set.
The primary accepts all write operations from clients. A replica set can have only one primary. Because only one member can accept write operations, replica sets provide strict consistency. To support replication, the primary logs all changes to its data sets in its oplog. See primary for more information.