I understand that the oplog file will split multi updates into individual updates but what about batch inserts? Are those also split into individual inserts?
If I have a write intensive collection with batches of ~20K docs being inserted roughly every 30 seconds, do I / should I consider increasing my oplog size beyond the default? I have a 3 member replica set and mongod is running on a 64 bit Ubuntu server install with the Mongodb data sitting on a 100GB volume.
Here is some data which may or may not be helpful:
gs_rset:PRIMARY> db.getReplicationInfo()
{
"logSizeMB" : 4591.3134765625,
"usedMB" : 3434.63,
"timeDiff" : 68064,
"timeDiffHours" : 18.91,
"tFirst" : "Wed Oct 24 2012 22:35:10 GMT+0000 (UTC)",
"tLast" : "Thu Oct 25 2012 17:29:34 GMT+0000 (UTC)",
"now" : "Fri Oct 26 2012 19:42:19 GMT+0000 (UTC)"
}
gs_rset:PRIMARY> rs.status()
{
"set" : "gs_rset",
"date" : ISODate("2012-10-26T19:44:00Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "xxxx:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 77531,
"optime" : Timestamp(1351186174000, 1470),
"optimeDate" : ISODate("2012-10-25T17:29:34Z"),
"self" : true
},
{
"_id" : 1,
"name" : "xxxx:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 76112,
"optime" : Timestamp(1351186174000, 1470),
"optimeDate" : ISODate("2012-10-25T17:29:34Z"),
"lastHeartbeat" : ISODate("2012-10-26T19:44:00Z"),
"pingMs" : 1
},
{
"_id" : 2,
"name" : "xxxx:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 61301,
"optime" : Timestamp(1351186174000, 1470),
"optimeDate" : ISODate("2012-10-25T17:29:34Z"),
"lastHeartbeat" : ISODate("2012-10-26T19:43:59Z"),
"pingMs" : 1
}
],
"ok" : 1
}
gs_rset:PRIMARY> db.printCollectionStats()
dev_fbinsights
{
"ns" : "dev_stats.dev_fbinsights",
"count" : 6556181,
"size" : 3117699832,
"avgObjSize" : 475.53596095043747,
"storageSize" : 3918532608,
"numExtents" : 22,
"nindexes" : 2,
"lastExtentSize" : 1021419520,
"paddingFactor" : 1,
"systemFlags" : 0,
"userFlags" : 0,
"totalIndexSize" : 1150346848,
"indexSizes" : {
"_id_" : 212723168,
"fbfanpage_id_1_date_1_data.id_1" : 937623680
},
"ok" : 1
}
The larger the size of the current primary's oplog, the longer the window of time a replica set member will be able to remain offline without falling too far behind the primary. If it does fall too far behind, it will need a full resync.
The field timeDiffHours, as returned by db.getReplicationInfo(), reports how many hours' worth of data the oplog currently has recorded. Once the oplog has filled up and starts overwriting old entries, start monitoring this value, especially under heavy write load (when it will decrease). If you assume it will never drop below N hours, then N is the maximum number of hours you can tolerate a replica set member being temporarily offline (e.g. for regular maintenance, to make an offline backup, or in the event of hardware failure) without requiring a full resync. The member will then be able to catch up to the primary automatically after coming back online.
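If you want to watch that value over time, here is a minimal monitoring sketch (my own addition, not part of the original answer) you can run in the mongo shell against the primary; the 5-minute interval is an arbitrary choice:
while (true) {
    var info = db.getReplicationInfo();
    print(new Date().toISOString() + "  oplog window: " + info.timeDiffHours + " hours (" +
          info.usedMB + " MB used of " + info.logSizeMB + " MB)");
    sleep(5 * 60 * 1000); // sleep() takes milliseconds
}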
If you're not comfortable with how low N is, then you should increase the size of the oplog. It completely depends on how long your maintenance windows are, or how quickly you or your ops team can respond to disaster scenarios. Be liberal in how much disk space you allocate for it, unless you have a compelling need for that space.
I'm assuming here that you're keeping the size of the oplog constant over all replica set members, which is a reasonable thing to do. If not, then plan for the scenario where the replica set member with the smallest oplog gets elected primary.
(To answer your other question: just like multi-updates, batch inserts are also fanned out into individual insert operations in the oplog.)
Edit: Note that data imports and bulk inserts/updates will write data to the oplog significantly faster than your application might under typical heavy load. To reiterate: be conservative in your estimate of how much time it will take for the oplog to fill.
Related
I am new here and I hope I am posting my question on the right forum; I didn't see where to pick the right forum category for MongoDB.
I have 2 questions -
I am using MongoDB 2.6, and I am in the process of migrating 2 replica sets, RS0 & RS1, from a data center to AWS. I have 3 servers in each replica set, making a total of 6 servers. The approach I am using to migrate data to the new servers is to expand the replica sets onto the new hardware and then let them catch up completely before removing the nodes on the old hardware from the replica set.
Question-1> How do I validate the data on both replica sets (source & destination) to make sure the data is 100% in sync before I can remove the old nodes from the source? What are the proper commands I can use to check the number of collections and the document counts of all collections for all databases I am migrating?
Question-2> Correct me if I am wrong - my understanding is that when using replica sets, we have to keep an odd number of members within an RS. Right now I have 3 servers per RS, which is fine, but when I add a new member to my current RS, pointing to a new server, I will end up with 4 members - wouldn't that cause a problem? Should I add 2 members to my RS instead, so that I keep 5 members, which is an odd number?
Thank you so much in advance!
Question 1: use rs.status() on any of the replica set members; you can check the status of each member and the optime field (compare with the primary):
http://docs.mongodb.org/manual/reference/method/rs.status/
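For a rough data-level check (a sketch of my own, not an official procedure), you can run the same counting script against the primary and against each newly added member and compare the output; on a secondary you may need to run rs.slaveOk() first. Collection counts are only an approximation of "in sync"; the optime comparison above is the authoritative check.
db.adminCommand({ listDatabases: 1 }).databases.forEach(function (d) {
    var cur = db.getSiblingDB(d.name);
    cur.getCollectionNames().forEach(function (c) {
        // one line per collection: <db>.<collection>: <document count>
        print(d.name + "." + c + ": " + cur.getCollection(c).count());
    });
});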
Question 2: you need an odd number of members because only one member can be elected primary and each member casts one vote, so having an even number of members could lead, during the primary election, to an equal number of votes for two or more members. To keep an odd number of members you can set up an arbiter instance: http://docs.mongodb.org/master/tutorial/add-replica-set-arbiter/
#Stefano gave you the answer. I would just like to add a few points:
Question 1:
You can use rs.status() to check the replica set status. This sorts out the primary and secondaries clearly:
{
"set" : "replset",
"date" : ISODate("2015-11-19T15:22:32.597Z"),
"myState" : 1,
"term": NumberLong(1),
"heartbeatIntervalMillis" : NumberLong(2000),
"members" : [
{
"_id" : 0,
"name" : "m1.example.net:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 269,
"optime" : {
"ts" : Timestamp(1447946550, 1),
"t" : NumberLong(1)
},
"optimeDate" : ISODate("2015-11-19T15:22:30Z"),
"infoMessage" : "could not find member to sync from",
"electionTime" : Timestamp(1447946549, 1),
"electionDate" : ISODate("2015-11-19T15:22:29Z"),
"configVersion" : 1,
"self" : true
},
{
"_id" : 1,
"name" : "m2.example.net:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 13,
"optime" : {
"ts" : Timestamp(1447946539, 1),
"t" : NumberLong(-1)
},
"optimeDate" : ISODate("2015-11-19T15:22:19Z"),
"lastHeartbeat" : ISODate("2015-11-19T15:22:31.323Z"),
"lastHeartbeatRecv" : ISODate("2015-11-19T15:22:32.045Z"),
"pingMs" : NumberLong(0),
"configVersion" : 1
},
{
"_id" : 2,
"name" : "m3.example.net:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 13,
"optime" : {
"ts" : Timestamp(1447946539, 1),
"t" : NumberLong(-1)
},
"optimeDate" : ISODate("2015-11-19T15:22:19Z"),
"lastHeartbeat" : ISODate("2015-11-19T15:22:31.325Z"),
"lastHeartbeatRecv" : ISODate("2015-11-19T15:22:31.971Z"),
"pingMs" : NumberLong(0),
"configVersion" : 1
}
],
"ok" : 1
}
To check the slave (secondary) delay, run rs.printSlaveReplicationInfo():
source: localhost.localdomain:27070
syncedTo: Mon May 02 2016 12:34:36 GMT+0530 (IST)
0 secs (0 hrs) behind the primary
source: localhost.localdomain:27072
syncedTo: Mon May 02 2016 12:34:36 GMT+0530 (IST)
0 secs (0 hrs) behind the primary
source: localhost.localdomain:27073
syncedTo: Mon May 02 2016 12:34:36 GMT+0530 (IST)
0 secs (0 hrs) behind the primary
To see more detail about how much replication catch-up the oplog allows, run rs.printReplicationInfo():
configured oplog size: 700.0038909912109MB
log length start to end: 261920secs (72.76hrs)
oplog first event time: Fri Apr 29 2016 11:49:16 GMT+0530 (IST)
oplog last event time: Mon May 02 2016 12:34:36 GMT+0530 (IST)
now: Mon May 02 2016 12:49:37 GMT+0530 (IST)
Question 2:
An odd number of members makes it easier to reach a clear majority in elections. So if you have an even number of members in your replica set, you can add an arbiter. Arbiters are lightweight and do not hold data, and they can reside on any other currently running server.
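For example (the hostname below is a placeholder), once a mongod is running on the arbiter host, you add it from the primary with rs.addArb():
// Run on the current primary; arbiter1.example.net:27017 is a hypothetical host
// where a mongod has been started with its own (small) dbpath.
rs.addArb("arbiter1.example.net:27017")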
Hope this helps!
I'm having trouble diagnosing an issue where my Java application's requests to MongoDB are not getting routed to the Nearest replica, and I hope someone can help. Let me start by explaining my configuration.
The Configuration:
I am running a MongoDB instance in production that is a Sharded ReplicaSet. It is currently only a single shard (it hasn't gotten big enough yet to require a split). This single shard is backed by a 3-node replica set. 2 nodes of the replica set live in our primary data center. The 3rd node lives in our secondary datacenter, and is prohibited from becoming the Master node.
We run our production application simultaneously in both data centers, however the instance in our secondary data center operates in "read-only" mode and never writes data into MongoDB. It only serves client requests for reads of existing data. The objective of this configuration is to ensure that if our primary datacenter goes down, we can still serve client read traffic.
We don't want to waste all of this hardware in our secondary datacenter, so even in happy times we actively load balance a portion of our read-only traffic to the instance of our application running in the secondary datacenter. This application instance is configured with readPreference=NEAREST and is pointed at a mongos instance running on localhost (version 2.6.7). The mongos instance is obviously configured to point at our 3-node replica set.
From a mongos:
mongos> sh.status()
--- Sharding Status ---
sharding version: {
"_id" : 1,
"version" : 4,
"minCompatibleVersion" : 4,
"currentVersion" : 5,
"clusterId" : ObjectId("52a8932af72e9bf3caad17b5")
}
shards:
{ "_id" : "shard1", "host" : "shard1/failover1.com:27028,primary1.com:27028,primary2.com:27028" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "test", "partitioned" : false, "primary" : "shard1" }
{ "_id" : "MyApplicationData", "partitioned" : false, "primary" : "shard1" }
From the failover node of the replicaset:
shard1:SECONDARY> rs.status()
{
"set" : "shard1",
"date" : ISODate("2015-09-03T13:26:18Z"),
"myState" : 2,
"syncingTo" : "primary1.com:27028",
"members" : [
{
"_id" : 3,
"name" : "primary1.com:27028",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 674841,
"optime" : Timestamp(1441286776, 2),
"optimeDate" : ISODate("2015-09-03T13:26:16Z"),
"lastHeartbeat" : ISODate("2015-09-03T13:26:16Z"),
"lastHeartbeatRecv" : ISODate("2015-09-03T13:26:18Z"),
"pingMs" : 49,
"electionTime" : Timestamp(1433952764, 1),
"electionDate" : ISODate("2015-06-10T16:12:44Z")
},
{
"_id" : 4,
"name" : "primary2.com:27028",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 674846,
"optime" : Timestamp(1441286777, 4),
"optimeDate" : ISODate("2015-09-03T13:26:17Z"),
"lastHeartbeat" : ISODate("2015-09-03T13:26:18Z"),
"lastHeartbeatRecv" : ISODate("2015-09-03T13:26:18Z"),
"pingMs" : 53,
"syncingTo" : "primary1.com:27028"
},
{
"_id" : 5,
"name" : "failover1.com:27028",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 8629159,
"optime" : Timestamp(1441286778, 1),
"optimeDate" : ISODate("2015-09-03T13:26:18Z"),
"self" : true
}
],
"ok" : 1
}
shard1:SECONDARY> rs.conf()
{
"_id" : "shard1",
"version" : 15,
"members" : [
{
"_id" : 3,
"host" : "primary1.com:27028",
"tags" : {
"dc" : "primary"
}
},
{
"_id" : 4,
"host" : "primary2.com:27028",
"tags" : {
"dc" : "primary"
}
},
{
"_id" : 5,
"host" : "failover1.com:27028",
"priority" : 0,
"tags" : {
"dc" : "failover"
}
}
],
"settings" : {
"getLastErrorModes" : {"ACKNOWLEDGED" : {}}
}
}
The Problem:
The problem is that requests which hit this mongos in our secondary datacenter seem to be getting routed to a replica running in our primary datacenter, not the nearest node, which is running in the secondary datacenter. This incurs a significant amount of network latency and results in bad read performance.
My understanding is that the mongos is deciding which node in the replica set to route the request to, and it's supposed to honor the ReadPreference from my java driver's request. Is there a command I can run in the mongos shell to see the status of the replica set, including ping times to nodes? Or some way to see logging of incoming requests which indicates the node in the replicaSet that was chosen and why? Any advice at all on how to diagnose the root cause of my issue?
When configuring the read preference, with ReadPreference = NEAREST the system does not simply look for the minimum network latency: it may decide the primary is the nearest member if the network connection is healthy. However, the nearest read mode, when combined with a tag set, selects the matching member with the lowest network latency. Even then, nearest may resolve to either the primary or a secondary. The behaviour of mongos when read preferences are configured, in terms of network latency, is not clearly explained in the official docs.
http://docs.mongodb.org/manual/core/read-preference/#replica-set-read-preference-tag-sets
Hope this helps.
If I start mongos with the flag -vvvv (4x verbose), I am presented with request routing information in the log files, including the read preference used and the host to which requests were routed. For example:
2015-09-10T17:17:28.020+0000 [conn3] dbclient_rs say
using secondary or tagged node selection in shard1,
read pref is { pref: "nearest", tags: [ {} ] }
(primary : primary1.com:27028,
lastTagged : failover1.com:27028)
Despite the wording, when using nearest, the absolute fastest member isn't necessarily the one chosen. Instead, a random member is chosen from the pool of members whose latency falls within the calculated latency window.
The latency window is calculated by taking the fastest member's ping and adding replication.localPingThresholdMs, whose default is 15ms. You can read more about the algorithm here.
So what I do is combine nearest with tags, so that I can manually specify the member that I know is geographically closest.
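As a sketch of that approach in the mongo shell (the collection name is a placeholder, and the dc tags are the ones from the rs.conf() shown above; the Java driver has equivalent tag-based read preferences):
// Prefer members tagged dc:failover, i.e. the secondary in the local datacenter.
db.getMongo().setReadPref("nearest", [ { "dc": "failover" } ])
db.myCollection.find().readPref("nearest", [ { "dc": "failover" } ])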
I have set up a replica set across 3 mongo servers and imported 5 GB of data.
Now the status of the secondary servers is showing "RECOVERING".
Could you let me know what "RECOVERING" means and how to solve this issue?
The status is as below:
rs.status()
{
"set" : "kutendarep",
"date" : ISODate("2013-01-15T05:04:18Z"),
"myState" : 3,
"members" : [
{
"_id" : 0,
"name" : "10.1.4.138:27017",
"health" : 1,
"state" : 3,
"stateStr" : "RECOVERING",
"uptime" : 86295,
"optime" : Timestamp(1357901076000, 4),
"optimeDate" : ISODate("2013-01-11T10:44:36Z"),
"errmsg" : "still syncing, not yet to minValid optime 50f04941:2",
"self" : true
},
{
"_id" : 1,
"name" : "10.1.4.21:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 86293,
"optime" : Timestamp(1358160135000, 18058),
"optimeDate" : ISODate("2013-01-14T10:42:15Z"),
"lastHeartbeat" : ISODate("2013-01-15T05:04:18Z"),
"pingMs" : 0
},
{
"_id" : 2,
"name" : "10.1.4.88:27017",
"health" : 1,
"state" : 3,
"stateStr" : "RECOVERING",
"uptime" : 86291,
"optime" : Timestamp(1357900674000, 10),
"optimeDate" : ISODate("2013-01-11T10:37:54Z"),
"lastHeartbeat" : ISODate("2013-01-15T05:04:16Z"),
"pingMs" : 0,
"errmsg" : "still syncing, not yet to minValid optime 50f04941:2"
}
],
"ok" : 1
The message on the "RECOVERING" replica set nodes means that these are still performing the initial sync.
These nodes are not available for reads until they transition to the SECONDARY state.
There are several steps in the initial sync.
See here for more information about the replica set synchronization process:
http://docs.mongodb.org/manual/core/replica-set-sync/
Log in to the RECOVERING instance.
Check the RECOVERING instance's replication status with:
db.printReplicationInfo()
You will get a result like this:
oplog first event time: Tue Jul 30 2019 17:26:37 GMT+0000 (UTC)
oplog last event time: Wed Jul 31 2019 16:46:53 GMT+0000
now: Thu Aug 22 2019 07:36:38 GMT+0000 (UTC)
If you find a large difference between the oplog last event time and now, it means this particular instance is neither PRIMARY nor SECONDARY and is not an active member of the replica set.
There are two solutions for this.
First:
1. Log in to the RECOVERING instance.
2. Delete the data in the existing dbpath, which will be /data/db.
3. Restart this RECOVERING instance.
4. (optional) If you see the following error, remove that mongod.pid from the specified location:
Error starting mongod. /var/run/mongod/mongod.pid
5. Restart the instance.
6. Your recovering instance will now be in a running state and will show PRIMARY or SECONDARY in place of RECOVERING.
Second:
Copy the data files from another running instance into the RECOVERING instance and restart mongod.
I have 2 shards.
One is over standalone server and another over replicaset:
mongos> db.runCommand({listshards:1})
{
"shards" : [
{
"_id" : "shard0000",
"host" : "mongo3:10001"
},
{
"_id" : "set1",
"host" : "set1/mongo1:10001,mongo2:10001"
}
],
"ok" : 1
}
I've inserted about 30M records.
As far as I understand, MongoDB should balance the data equally between the shards, but that is not happening:
mongos> db.stats()
{
"raw" : {
"set1/mongo1:10001,mongo2:10001" : {
"db" : "my_ginger",
"collections" : 3,
"objects" : 5308714,
"avgObjSize" : 811.9953284354742,
"dataSize" : 4310650968,
"storageSize" : 4707774464,
"numExtents" : 23,
"indexes" : 2,
"indexSize" : 421252048,
"fileSize" : 10666115072,
"nsSizeMB" : 16,
"ok" : 1
},
"mongo3:10001" : {
"db" : "my_ginger",
"collections" : 6,
"objects" : 25162626,
"avgObjSize" : 1081.6777010475776,
"dataSize" : 27217851444,
"storageSize" : 28086624096,
"numExtents" : 38,
"indexes" : 6,
"indexSize" : 1903266512,
"fileSize" : 34276900864,
"nsSizeMB" : 16,
"ok" : 1
}
},
"objects" : 30471340,
"avgObjSize" : 1034.6936633571088,
"dataSize" : 31528502412,
"storageSize" : 32794398560,
"numExtents" : 61,
"indexes" : 8,
"indexSize" : 2324518560,
"fileSize" : 44943015936,
"ok" : 1
}
What am I doing wrong?
Thanks.
According to the sh.status() output in the comments, you have 164 chunks on shard0000 (the single host) and 85 on set1 (the replica set). There are a couple of common reasons this kind of imbalance can happen:
You picked a bad shard key (monotonically increasing or similar)
All your data was initially on a single shard and is being rebalanced
The balancer will continuously attempt to move chunks from the high shard to the low one, while at the same time moving the max chunk around (for people who pick the aforementioned monotonically increasing keys, this helps). However, there can only be one migration at a time, so this will take some time, especially if you continue writing to and reading from the shards at the same time. If things are really bad, and you did pick a poor shard key, this may persist for some time.
If all your data was on one shard first, and then you added another, then you have a similar problem - it will take a while for the chunk count to stabilise because half the data has to be moved from the original shard (in addition to its other activities) to balance things out. The balancer will pick low range chunks to move first in general, so if these are less likely to be in memory (back to the poor shard key again), then they will have to be paged in before they can be migrated.
To check the balancer is running:
http://docs.mongodb.org/manual/reference/method/sh.setBalancerState/#sh.getBalancerState
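From a mongos shell, that is:
sh.getBalancerState()   // true if the balancer is enabled at all
sh.isBalancerRunning()  // true if a balancing round is in progress right now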
Then, to see what it has been up to, connect to a mongos (last 10 operations):
use config
db.changelog.find().sort({$natural:-1}).limit(10).pretty()
Similarly, you will see messages in the primary logs of each shard relating to the migrations and how long they are taking, etc., if you want to see their performance.
I have set up a replica set with 4 servers.
For testing purposes, I wrote a script to fill my database with ~150 million rows of photos using GridFS. My photos are around ~15KB each. (Using GridFS for small files shouldn't be a problem?!)
After a few hours, there were around 50 million rows, but I had this message in the logs:
replSet error RS102 too stale to catch up, at least from 192.168.0.1:27017
And here is the replSet status :
rs.status();
{
"set" : "rsdb",
"date" : ISODate("2012-07-18T09:00:48Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "192.168.0.1:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"optime" : {
"t" : 1342601552000,
"i" : 245
},
"optimeDate" : ISODate("2012-07-18T08:52:32Z"),
"self" : true
},
{
"_id" : 1,
"name" : "192.168.0.2:27018",
"health" : 1,
"state" : 3,
"stateStr" : "RECOVERING",
"uptime" : 64770,
"optime" : {
"t" : 1342539026000,
"i" : 5188
},
"optimeDate" : ISODate("2012-07-17T15:30:26Z"),
"lastHeartbeat" : ISODate("2012-07-18T09:00:47Z"),
"pingMs" : 0,
"errmsg" : "error RS102 too stale to catch up"
},
{
"_id" : 2,
"name" : "192.168.0.3:27019",
"health" : 1,
"state" : 3,
"stateStr" : "RECOVERING",
"uptime" : 64735,
"optime" : {
"t" : 1342539026000,
"i" : 5188
},
"optimeDate" : ISODate("2012-07-17T15:30:26Z"),
"lastHeartbeat" : ISODate("2012-07-18T09:00:47Z"),
"pingMs" : 0,
"errmsg" : "error RS102 too stale to catch up"
},
{
"_id" : 3,
"name" : "192.168.0.4:27020",
"health" : 1,
"state" : 3,
"stateStr" : "RECOVERING",
"uptime" : 65075,
"optime" : {
"t" : 1342539085000,
"i" : 3838
},
"optimeDate" : ISODate("2012-07-17T15:31:25Z"),
"lastHeartbeat" : ISODate("2012-07-18T09:00:46Z"),
"pingMs" : 0,
"errmsg" : "error RS102 too stale to catch up"
}
],
"ok" : 1
The set is still accepting data, but as I have my 3 servers "DOWN", how should I proceed to repair this (something nicer than deleting the data and re-syncing, which will take ages, but will work)?
And especially :
Is this happening because of an overly aggressive script, meaning that it would almost never happen in production?
You don't need to repair, simply perform a full resync.
On the secondary, you can:
stop the failed mongod
delete all data in the dbpath (including subdirectories)
restart it and it will automatically resynchronize itself
Follow the instructions here.
What's happened in your case is that your secondaries have become stale, i.e. there is no common point between their oplogs and the oplog on the primary. Look at this document, which details the various statuses. The writes to the primary member have to be replicated to the secondaries, and your secondaries couldn't keep up until they eventually went stale. You will need to consider resizing your oplog.
Regarding oplog size, it depends on how much data you insert/update over time. I would choose a size which gives you many hours or even days of oplog.
Additionally, I'm not sure which O/S you are running. However, for 64-bit Linux, Solaris, and FreeBSD systems, MongoDB will allocate 5% of the available free disk space to the oplog. If this amount is smaller than a gigabyte, then MongoDB will allocate 1 gigabyte of space. For 64-bit OS X systems, MongoDB allocates 183 megabytes of space to the oplog and for 32-bit systems, MongoDB allocates about 48 megabytes of space to the oplog.
How big are your records and how many do you want to insert? It depends on whether this data insertion is typical of your workload or something abnormal that you were merely testing.
For example, at 2000 documents per second for documents of 1KB, that would net you 120MB per minute and your 5GB oplog would last about 40 minutes. This means if the secondary ever goes offline for 40 minutes or falls behind by more than that, then you are stale and have to do a full resync.
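The arithmetic behind that estimate, as a quick sketch you can run in the mongo shell or any JavaScript console (the numbers are just the example figures above; 1 KB is treated as 1000 bytes to keep the figures round):
var docsPerSec = 2000, docSizeKB = 1, oplogMB = 5000;      // roughly a 5GB oplog
var mbPerMin = docsPerSec * docSizeKB * 60 / 1000;         // 120 MB written per minute
print("oplog window ~ " + Math.round(oplogMB / mbPerMin) + " minutes");  // ~42 minutes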
I recommend reading the Replica Set Internals document here. You have 4 members in your replica set, which is not recommended. You should have an odd number for the voting election (of primary) process, so you either need to add an arbiter, another secondary or remove one of your secondaries.
Finally, here's a detailed document on RS administration.