We have MongoDB 4.2.11 with 3 nodes in a replica set: Primary, Secondary, and Arbiter.
During a recent incident in production, the primary node went down, which caused the secondary node to become primary. However, within about 5 hours the oplog collection grew to almost 15 GB, leaving no free disk space on the current primary node. Eventually the MongoDB primary crashed because it ran out of disk space.
Question: Is there any way to limit oplog space in MongoDB 4.2.11, or is the only way forward to upgrade MongoDB to 4.4 or above?
Here is a link to the documentation, which indicates there is no way to enforce an oplog size limit in 4.2:
https://www.mongodb.com/docs/v4.2/tutorial/change-oplog-size/
Hope to hear some feedback or any old thread that addresses this. Thanks.
You can disable the majority commit point check so that the oplog will not grow above its configured limit. Set replication.enableMajorityReadConcern to false (true by default) in the config file, or pass --enableMajorityReadConcern=false on the command line.
Please read https://www.mongodb.com/docs/v4.2/reference/configuration-options/#replication.enableMajorityReadConcern and https://www.mongodb.com/docs/v4.2/reference/read-concern-majority/#disable-read-concern-majority to understand consequences. TL;DR version - it's not recommended if you value your data.
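For reference, here is what the config-file form of that setting looks like, as a mongod.conf YAML fragment (apply only after reading the linked warnings above; restart of the mongod is required for it to take effect):

```
replication:
  enableMajorityReadConcern: false
```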
Related
Our secondary instances are reporting a much higher disk write rate than the primary. Is this expected behavior in a replica set? Given that the oplog gets copied and replayed from the primary periodically, what's contributing to the additional writes on secondaries?
Version: 3.4.1
StorageEngine: WiredTiger
Primary: i3.8xlarge
Secondaries: i3en.3xlarge
Update:
1. The issue disappeared after a couple of days.
2. Secondaries are now showing disk writes comparable to the primary.
We are assuming a one-off issue caused this behavior. In the absence of enough historical data, we chose to pause the investigation, given that the problem has disappeared.
I am trying to improve the oplog window of my MongoDB server, because right now it covers fewer hours than I would like (I am not planning to increase the oplog file size for now). I found that there are many no-op records in the oplog collection - { "op": "n" } with the whole document in "o" - and they can take up roughly 20%-30% of the physical oplog size.
How can I find the reason for this? It does not seem right.
We are using MongoDB 3.6 + NodeJS 10 + Mongoose
P.S. It appears for many different collections and use cases, so it's hard to understand what application logic is behind all these items.
No-op writes are expected in a MongoDB 3.4+ replica set in order to support the Max Staleness specification that helps applications avoid reading from stale secondaries and provides a more accurate measure of replication lag. These no-op writes only happen when the primary is idle. The idle write interval is not currently configurable (as at MongoDB 4.2).
The Max Staleness specification includes an example scenario and more detailed rationale for why the Primary must write periodic no-ops as well as other design decisions.
A relevant excerpt from the design rationale:
An idle primary must execute a no-op every 10 seconds (idleWritePeriodMS) to keep secondaries' lastWriteDate values close to the primary's clock. The no-op also keeps opTimes close to the primary's, which helps mongos choose an up-to-date secondary to read from in a CSRS.
Monitoring software like MongoDB Cloud Manager that charts replication lag will also benefit when spurious lag spikes are solved.
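To quantify how much of your oplog those no-ops actually occupy, one approach is to sample entries from local.oplog.rs, annotate each with its size in bytes (e.g. via Object.bsonsize() in the mongo shell), and sum bytes by op type. A minimal sketch in plain JavaScript, assuming each sampled entry carries a precomputed byte size; the function name and entry shape are illustrative, not a MongoDB API:

```javascript
// Given sampled oplog entries annotated with their BSON size in bytes,
// compute the fraction of oplog bytes taken by no-ops ({ op: "n" }).
function noopByteFraction(entries) {
  let noopBytes = 0;
  let totalBytes = 0;
  for (const e of entries) {
    totalBytes += e.bytes;
    if (e.op === "n") noopBytes += e.bytes;
  }
  // Guard against an empty sample to avoid dividing by zero.
  return totalBytes === 0 ? 0 : noopBytes / totalBytes;
}
```

If the fraction is high mostly while the cluster is idle, the idle-primary no-op writes described above are the likely explanation rather than application logic.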
I have a MongoDB v2.4 replica set on AWS and have been monitoring my stats using MMS and dbStats(). Yesterday I saw an increase in both mapped and virtual memory usage, which correlated with an increased data fileSize and looked completely normal...except that the increase occurred on the secondaries a full two hours before it occurred on the primary (all of these members being located in the same data center).
I vaguely recall that not all members of a replica set will necessarily have the same organization of data in their data files, and I know that you can use the compact() command to defragment the files.
The only difference between the primary and the secondaries in this replica set is that, at one time, the primary was taken offline for roughly 20 minutes. It was then brought back online and re-elected as the primary.
My question is: Is there any reason to be alarmed that the primary seemed to lag behind the secondaries when increasing its mapped & virtual memory usage?
In my test environment:
node1: shard1 primary, shard2 primary
node2: shard1 secondary, shard2 secondary
node3: shard1 arbiter, shard2 arbiter
I wrote a multi-threaded client to write concurrently to the sharded replica sets. After 1 hour (the primary had 6 GB of data), I found the secondary's status was Recovering. The secondary log said: stale data from primary oplog.
So was the reason that my write requests were so frequent that the secondary could not replicate in time? Or is there another reason?
I'm puzzled. Thanks in advance.
This situation can happen if the oplog is not large enough to keep a record of all the operations occurring on the primary, or the secondary simply can't keep up with the primary. What happens in that case is that the position in the oplog the secondary has reached gets overwritten by new inserts from the primary. At that point the secondary will report its status as Recovering, and you will see an RS102 message in the log, indicating that it is too stale to catch up.
To fix the issue you would need to follow the steps outlined in the documentation.
In order to prevent the problem from happening in the future, you would need to tune the size of the OpLog, and make sure that the secondaries are of equivalent hardware configurations.
To help tune the OpLog you can look at the output of db.printReplicationInfo() which will tell you how much time you have in your OpLog. The documentation outlines how to resize the OpLog if it is too small.
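The arithmetic behind that tuning is straightforward and worth making explicit. A toy sketch in plain JavaScript (the function names are illustrative, not MongoDB APIs): the window db.printReplicationInfo() reports is the time span between the first and last oplog entries, and the oplog size needed for a target window follows from your write rate, assuming that rate is roughly steady:

```javascript
// Window covered by the oplog: span between the first and last oplog
// entry timestamps, given here in seconds since the epoch.
function oplogWindowHours(firstTsSec, lastTsSec) {
  return (lastTsSec - firstTsSec) / 3600;
}

// Rough oplog size needed to keep a desired window, assuming a steady
// write rate measured in MB of oplog generated per hour.
function requiredOplogSizeMB(writeRateMBPerHour, desiredWindowHours) {
  return writeRateMBPerHour * desiredWindowHours;
}
```

For example, if you generate about 100 MB of oplog per hour and want the window to cover a 24-hour maintenance outage, you would want an oplog of at least roughly 2400 MB, plus headroom for write-rate spikes.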
"MongoDB in Action" book says:
Imagine you issue a write to the primary node of a replica set. What happens next? First, the write is recorded and then added to the primary's oplog. Meanwhile, all secondaries have their own oplogs that replicate the primary's oplog. So when a given secondary node is ready to update itself, it does three things. First, it looks at the timestamp of the latest entry in its own oplog. Next, it queries the primary's oplog for all entries greater than that timestamp. Finally, it adds each of those entries to its own oplog and applies the entries to itself.
So does this mean the nodes must be time-synchronized, because the timestamps must be comparable across all nodes?
In general, yes, it is a very good idea to have your hosts synchronized (NTP is the usual solution). In fact, I have seen clock skew cause far worse issues than an out-of-sync oplog - consistent time across the database hosts in a cluster should be considered a must.
This is actually mentioned on the Production Notes page in the docs:
http://www.mongodb.org/display/DOCS/Production+Notes#ProductionNotes-Linux
See the note about minimizing clock skew.
Based on the excerpt you quoted, nodes base everything on the timestamp of the most recently received write, not on their own clocks. However, a problem can arise when the primary is stepped down and a secondary becomes the new primary: if the time is skewed greatly between hosts, replication may be delayed or other issues may surface.
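The three-step loop from the quoted excerpt can be sketched as a toy model in plain JavaScript (this models the logic only, not MongoDB's actual implementation). Note that the core comparison only orders oplog timestamps relative to each other, which is why the nodes' own wall clocks do not enter the replication logic itself:

```javascript
// Toy model of secondary sync. Oplogs are arrays of entries shaped like
// { ts: <number>, op: <string> }, sorted by ts; applyFn applies one
// entry to the secondary's data. All names here are illustrative.
function syncFromPrimary(secondaryOplog, primaryOplog, applyFn) {
  // 1. Look at the timestamp of the latest entry in our own oplog.
  const lastTs = secondaryOplog.length
    ? secondaryOplog[secondaryOplog.length - 1].ts
    : 0;
  // 2. Query the primary's oplog for all entries newer than that.
  const newer = primaryOplog.filter((e) => e.ts > lastTs);
  // 3. Append each entry to our own oplog and apply it locally.
  for (const e of newer) {
    secondaryOplog.push(e);
    applyFn(e);
  }
  return newer.length; // number of entries replicated this pass
}
```

Because step 2 compares timestamps generated on the primary against timestamps copied from that same primary, skewed clocks only become a correctness concern at failover, when a different node starts stamping the oplog.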