RS102 MongoDB on ReplicaSet

I have set up a replica set with 4 servers.
For testing purposes, I wrote a script to fill my database with ~150 million photos using GridFS. Each photo is around 15 KB. (Using GridFS for small files shouldn't be a problem, should it?)
After a few hours, there were around 50 million rows, but I had this message in the logs:
replSet error RS102 too stale to catch up, at least from 192.168.0.1:27017
And here is the replSet status :
rs.status();
{
"set" : "rsdb",
"date" : ISODate("2012-07-18T09:00:48Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "192.168.0.1:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"optime" : {
"t" : 1342601552000,
"i" : 245
},
"optimeDate" : ISODate("2012-07-18T08:52:32Z"),
"self" : true
},
{
"_id" : 1,
"name" : "192.168.0.2:27018",
"health" : 1,
"state" : 3,
"stateStr" : "RECOVERING",
"uptime" : 64770,
"optime" : {
"t" : 1342539026000,
"i" : 5188
},
"optimeDate" : ISODate("2012-07-17T15:30:26Z"),
"lastHeartbeat" : ISODate("2012-07-18T09:00:47Z"),
"pingMs" : 0,
"errmsg" : "error RS102 too stale to catch up"
},
{
"_id" : 2,
"name" : "192.168.0.3:27019",
"health" : 1,
"state" : 3,
"stateStr" : "RECOVERING",
"uptime" : 64735,
"optime" : {
"t" : 1342539026000,
"i" : 5188
},
"optimeDate" : ISODate("2012-07-17T15:30:26Z"),
"lastHeartbeat" : ISODate("2012-07-18T09:00:47Z"),
"pingMs" : 0,
"errmsg" : "error RS102 too stale to catch up"
},
{
"_id" : 3,
"name" : "192.168.0.4:27020",
"health" : 1,
"state" : 3,
"stateStr" : "RECOVERING",
"uptime" : 65075,
"optime" : {
"t" : 1342539085000,
"i" : 3838
},
"optimeDate" : ISODate("2012-07-17T15:31:25Z"),
"lastHeartbeat" : ISODate("2012-07-18T09:00:46Z"),
"pingMs" : 0,
"errmsg" : "error RS102 too stale to catch up"
}
],
"ok" : 1
}
The set is still accepting data, but since my 3 secondaries are effectively "DOWN", how should I proceed to repair them (ideally something nicer than deleting the data and re-syncing, which will take ages but will work)?
And especially :
Is this caused by an overly aggressive script? In other words, would it almost never happen in production?

You don't need to repair; simply perform a full resync.
On the secondary, you can:
stop the failed mongod
delete all data in the dbpath (including subdirectories)
restart it and it will automatically resynchronize itself
Follow the instructions here.
What has happened in your case is that your secondaries have become stale, i.e. there is no longer a common point between their oplogs and the primary's oplog. Look at this document, which details the various statuses. Writes to the primary have to be replicated to the secondaries, and your secondaries could not keep up until they eventually went stale. You will need to consider resizing your oplog.
Regarding oplog size, it depends on how much data you insert/update over time. I would choose a size that gives you many hours or even days of oplog.
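To make the staleness condition concrete, here is a minimal sketch in plain JavaScript (runnable in Node). It uses the optimeDate values from the rs.status() output in the question. The helper names lagSeconds and isStale are illustrative, not part of any MongoDB API, and the timestamp of the primary's oldest oplog entry is a hypothetical value, since the question does not show it.

```javascript
// Seconds by which a member trails the primary, from their optimeDate values.
function lagSeconds(primaryOptimeDate, memberOptimeDate) {
  return (new Date(primaryOptimeDate) - new Date(memberOptimeDate)) / 1000;
}

// A member is stale when its last applied operation is older than the
// oldest entry still present in the primary's oplog.
function isStale(memberOptimeDate, primaryOldestOplogDate) {
  return new Date(memberOptimeDate) < new Date(primaryOldestOplogDate);
}

// Values copied from the rs.status() output in the question.
const primaryOptime = "2012-07-18T08:52:32Z";
const secondaryOptime = "2012-07-17T15:30:26Z";

const lag = lagSeconds(primaryOptime, secondaryOptime);
console.log((lag / 3600).toFixed(1) + " hours behind"); // 17.4 hours behind

// Hypothetical: assume the primary's oplog only reaches back to 06:00 on the 18th.
const assumedOldestOplogEntry = "2012-07-18T06:00:00Z";
console.log(isStale(secondaryOptime, assumedOldestOplogEntry)); // true
```

Under that assumption the secondary's last operation predates everything the primary still holds, so there is nothing left to resume from and only a full resync can bring it back.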
Additionally, I'm not sure which O/S you are running. However, for 64-bit Linux, Solaris, and FreeBSD systems, MongoDB will allocate 5% of the available free disk space to the oplog. If this amount is smaller than a gigabyte, then MongoDB will allocate 1 gigabyte of space. For 64-bit OS X systems, MongoDB allocates 183 megabytes of space to the oplog and for 32-bit systems, MongoDB allocates about 48 megabytes of space to the oplog.
How big are records and how many do you want? It depends on whether this data insertion is something typical or something abnormal that you were merely testing.
For example, at 2000 documents per second for documents of 1KB, that would net you 120MB per minute and your 5GB oplog would last about 40 minutes. This means if the secondary ever goes offline for 40 minutes or falls behind by more than that, then you are stale and have to do a full resync.
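The arithmetic above can be reproduced with a small sketch in plain JavaScript (runnable in Node). The function name oplogWindowMinutes is made up for illustration, decimal units are assumed (1 KB = 1000 bytes), and the per-entry overhead of the oplog is ignored, so treat the result as an upper bound.

```javascript
// Estimate how many minutes of writes an oplog of a given size can hold.
function oplogWindowMinutes(docsPerSecond, bytesPerDoc, oplogBytes) {
  const bytesPerMinute = docsPerSecond * bytesPerDoc * 60;
  return oplogBytes / bytesPerMinute;
}

const MB = 1000 * 1000;
const GB = 1000 * MB;

// 2000 docs/s at 1 KB each against a 5 GB oplog, as in the example above.
const minutes = oplogWindowMinutes(2000, 1000, 5 * GB);
console.log(minutes.toFixed(1)); // 41.7
```

Plugging in your own insert rate and document size gives a first estimate of how long a secondary can be offline before it goes stale.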
I recommend reading the Replica Set Internals document here. You have 4 members in your replica set, which is not recommended. You should have an odd number of voting members for primary elections, so you need to add an arbiter, add another secondary, or remove one of your secondaries.
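A short sketch in plain JavaScript shows why a fourth member buys nothing. It assumes every member has exactly one vote, and the function names are illustrative only.

```javascript
// Votes needed to elect a primary in a set with `votingMembers` voters.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// How many voting members can be lost while a primary can still be elected.
function faultTolerance(votingMembers) {
  return votingMembers - majority(votingMembers);
}

for (const n of [3, 4, 5]) {
  console.log(n + " members: majority " + majority(n) +
              ", tolerates " + faultTolerance(n) + " failure(s)");
}
// 3 members: majority 2, tolerates 1 failure(s)
// 4 members: majority 3, tolerates 1 failure(s)
// 5 members: majority 3, tolerates 2 failure(s)
```

Going from 3 to 4 members raises the required majority without raising the number of failures the set survives.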
Finally, here's a detailed document on RS administration.

Related

In mongodb 3.0 replication, how elections happen when a secondary goes down

Situation: I have a MongoDB replication set over two computers.
One computer is a server that holds the primary node and the arbiter. This server is a live server and is always on. Its local IP address, used for replication, is 192.168.0.4.
The second is a PC that hosts the secondary node and is on for only a few hours a day. Its local IP address, used for replication, is 192.168.0.5.
My expectation: I wanted the live server to be the main point of data interaction of my application, regardless of the state of the PC (whether it is reachable or not, since PC is secondary), so I wanted to make sure that server's node is always primary.
The following is the result of rs.config():
liveSet:PRIMARY> rs.config()
{
"_id" : "liveSet",
"version" : 2,
"members" : [
{
"_id" : 0,
"host" : "192.168.0.4:27017",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 10,
"tags" : {
},
"slaveDelay" : 0,
"votes" : 1
},
{
"_id" : 1,
"host" : "192.168.0.5:5051",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
},
"slaveDelay" : 0,
"votes" : 1
},
{
"_id" : 2,
"host" : "192.168.0.4:5052",
"arbiterOnly" : true,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
},
"slaveDelay" : 0,
"votes" : 1
}
],
"settings" : {
"chainingAllowed" : true,
"heartbeatTimeoutSecs" : 10,
"getLastErrorModes" : {
},
"getLastErrorDefaults" : {
"w" : 1,
"wtimeout" : 0
}
}
}
Also I have set the storage engine to be WiredTiger, if that matters.
What I actually get, and the problem: When I turn off the PC, or kill its mongod process, then the node on the server becomes secondary.
The following is the output of the server when I killed PC's mongod process, while connected to primary node's shell:
liveSet:PRIMARY>
2015-11-29T10:46:29.471+0430 I NETWORK Socket recv() errno:10053 An established connection was aborted by the software in your host machine. 127.0.0.1:27017
2015-11-29T10:46:29.473+0430 I NETWORK SocketException: remote: 127.0.0.1:27017 error: 9001 socket exception [RECV_ERROR] server [127.0.0.1:27017]
2015-11-29T10:46:29.475+0430 I NETWORK DBClientCursor::init call() failed
2015-11-29T10:46:29.479+0430 I NETWORK trying reconnect to 127.0.0.1:27017 (127.0.0.1) failed
2015-11-29T10:46:29.481+0430 I NETWORK reconnect 127.0.0.1:27017 (127.0.0.1) ok
liveSet:SECONDARY>
Two things confuse me:
Considering this part of MongoDB documentation:
Replica sets use elections to determine which set member will become primary. Elections occur after initiating a replica set, and also any time the primary becomes unavailable.
An election occurs when the primary is not available (or when the set is initiated, but that part does not concern our case). The primary was always available, so why does an election happen?
Considering this part of the same documentation:
If a majority of the replica set is inaccessible or unavailable, the replica set cannot accept writes and all remaining members become read-only.
Regarding the part "members become read-only": I have two nodes up versus one down, so this should not affect our replication either.
Now my question: How can I keep the node on the server as primary when the node on the PC is not reachable?
Update:
This is the output of rs.status().
Thanks to Wan Bachtiar, this now makes the behavior obvious: the arbiter was not reachable.
liveSet:PRIMARY> rs.status()
{
"set" : "liveSet",
"date" : ISODate("2015-11-30T04:33:03.864Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "192.168.0.4:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 1807553,
"optime" : Timestamp(1448796026, 1),
"optimeDate" : ISODate("2015-11-29T11:20:26Z"),
"electionTime" : Timestamp(1448857488, 1),
"electionDate" : ISODate("2015-11-30T04:24:48Z"),
"configVersion" : 2,
"self" : true
},
{
"_id" : 1,
"name" : "192.168.0.5:5051",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 496,
"optime" : Timestamp(1448796026, 1),
"optimeDate" : ISODate("2015-11-29T11:20:26Z"),
"lastHeartbeat" : ISODate("2015-11-30T04:33:03.708Z"),
"lastHeartbeatRecv" : ISODate("2015-11-30T04:33:02.451Z"),
"pingMs" : 1,
"configVersion" : 2
},
{
"_id" : 2,
"name" : "192.168.0.4:5052",
"health" : 0,
"state" : 8,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"lastHeartbeat" : ISODate("2015-11-30T04:33:00.008Z"),
"lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
"configVersion" : -1
}
],
"ok" : 1
}
liveSet:PRIMARY>

As stated in the documentation, if a majority of the replica set is inaccessible or unavailable, the replica set cannot accept writes and all remaining members become read-only.
In this case the primary has to step down if the arbiter and the secondary are not reachable. rs.status() should be able to determine the health of the replica members.
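The check can be sketched in plain JavaScript against objects shaped like the members array of rs.status(). This is a simplified illustration, not how mongod implements it: it assumes every member has one vote (as in the rs.config() output in the question), and the name hasMajority is made up.

```javascript
// True when the members reporting health 1 form a strict majority of the set.
// Assumes every member has one vote.
function hasMajority(members) {
  const healthy = members.filter((m) => m.health === 1).length;
  return healthy > members.length / 2;
}

// Health values as in the rs.status() output from the question's update:
// the arbiter is unreachable, the primary and the secondary are up.
const members = [
  { name: "192.168.0.4:27017", health: 1 },
  { name: "192.168.0.5:5051", health: 1 },
  { name: "192.168.0.4:5052", health: 0 },
];
console.log(hasMajority(members)); // true

// Turning off the PC leaves only one healthy member out of three.
members[1].health = 0;
console.log(hasMajority(members)); // false
```

With the arbiter already unreachable, losing the PC drops the set below a majority, which matches the step-down seen in the question.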
One thing you should also watch is the primary's oplog size. The size of the oplog determines how long a replica set member can be down and still be able to catch up when it comes back online. The bigger the oplog, the longer a member can be down, because the oplog can hold more operations. If the member does fall too far behind, you must resynchronise it by removing its data files and performing an initial sync.
See Check the size of the Oplog for more info.
Regards,
Wan.

Mongos routing with ReadPreference=NEAREST

I'm having trouble diagnosing an issue where my Java application's requests to MongoDB are not getting routed to the Nearest replica, and I hope someone can help. Let me start by explaining my configuration.
The Configuration:
I am running a MongoDB instance in production that is a Sharded ReplicaSet. It is currently only a single shard (it hasn't gotten big enough yet to require a split). This single shard is backed by a 3-node replica set. 2 nodes of the replica set live in our primary data center. The 3rd node lives in our secondary datacenter, and is prohibited from becoming the Master node.
We run our production application simultaneously in both data centers, however the instance in our secondary data center operates in "read-only" mode and never writes data into MongoDB. It only serves client requests for reads of existing data. The objective of this configuration is to ensure that if our primary datacenter goes down, we can still serve client read traffic.
We don't want to waste all of this hardware in our secondary datacenter, so even in happy times we actively load balance a portion of our read-only traffic to the instance of our application running in the secondary datacenter. This application instance is configured with readPreference=NEAREST and is pointed at a mongos instance running on localhost (version 2.6.7). The mongos instance is obviously configured to point at our 3-node replica set.
From a mongos:
mongos> sh.status()
--- Sharding Status ---
sharding version: {
"_id" : 1,
"version" : 4,
"minCompatibleVersion" : 4,
"currentVersion" : 5,
"clusterId" : ObjectId("52a8932af72e9bf3caad17b5")
}
shards:
{ "_id" : "shard1", "host" : "shard1/failover1.com:27028,primary1.com:27028,primary2.com:27028" }
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "test", "partitioned" : false, "primary" : "shard1" }
{ "_id" : "MyApplicationData", "partitioned" : false, "primary" : "shard1" }
From the failover node of the replicaset:
shard1:SECONDARY> rs.status()
{
"set" : "shard1",
"date" : ISODate("2015-09-03T13:26:18Z"),
"myState" : 2,
"syncingTo" : "primary1.com:27028",
"members" : [
{
"_id" : 3,
"name" : "primary1.com:27028",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 674841,
"optime" : Timestamp(1441286776, 2),
"optimeDate" : ISODate("2015-09-03T13:26:16Z"),
"lastHeartbeat" : ISODate("2015-09-03T13:26:16Z"),
"lastHeartbeatRecv" : ISODate("2015-09-03T13:26:18Z"),
"pingMs" : 49,
"electionTime" : Timestamp(1433952764, 1),
"electionDate" : ISODate("2015-06-10T16:12:44Z")
},
{
"_id" : 4,
"name" : "primary2.com:27028",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 674846,
"optime" : Timestamp(1441286777, 4),
"optimeDate" : ISODate("2015-09-03T13:26:17Z"),
"lastHeartbeat" : ISODate("2015-09-03T13:26:18Z"),
"lastHeartbeatRecv" : ISODate("2015-09-03T13:26:18Z"),
"pingMs" : 53,
"syncingTo" : "primary1.com:27028"
},
{
"_id" : 5,
"name" : "failover1.com:27028",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 8629159,
"optime" : Timestamp(1441286778, 1),
"optimeDate" : ISODate("2015-09-03T13:26:18Z"),
"self" : true
}
],
"ok" : 1
}
shard1:SECONDARY> rs.conf()
{
"_id" : "shard1",
"version" : 15,
"members" : [
{
"_id" : 3,
"host" : "primary1.com:27028",
"tags" : {
"dc" : "primary"
}
},
{
"_id" : 4,
"host" : "primary2.com:27028",
"tags" : {
"dc" : "primary"
}
},
{
"_id" : 5,
"host" : "failover1.com:27028",
"priority" : 0,
"tags" : {
"dc" : "failover"
}
}
],
"settings" : {
"getLastErrorModes" : {"ACKNOWLEDGED" : {}}
}
}
The Problem:
The problem is that requests which hit this mongos in our secondary datacenter seem to be getting routed to a replica running in our primary datacenter, not the nearest node, which is running in the secondary datacenter. This incurs a significant amount of network latency and results in bad read performance.
My understanding is that the mongos is deciding which node in the replica set to route the request to, and it's supposed to honor the ReadPreference from my java driver's request. Is there a command I can run in the mongos shell to see the status of the replica set, including ping times to nodes? Or some way to see logging of incoming requests which indicates the node in the replicaSet that was chosen and why? Any advice at all on how to diagnose the root cause of my issue?

With ReadPreference = NEAREST, the system does not strictly look for the minimum network latency; it may pick the primary as the nearest member if the network connection is good. However, the nearest read mode, when combined with a tag set, selects the matching member with the lowest network latency. Even then, the nearest member may be either the primary or a secondary. How mongos behaves with read preferences configured, particularly with respect to network latency, is not clearly explained in the official docs.
http://docs.mongodb.org/manual/core/read-preference/#replica-set-read-preference-tag-sets
hope this helps

If I start mongos with the flag -vvvv (4x verbose), then the log files contain request routing information, including the read preference used and the host to which requests were routed. For example:
2015-09-10T17:17:28.020+0000 [conn3] dbclient_rs say
using secondary or tagged node selection in shard1,
read pref is { pref: "nearest", tags: [ {} ] }
(primary : primary1.com:27028,
lastTagged : failover1.com:27028)

Despite the wording, when using nearest, the absolute fastest member isn't necessarily the one chosen. Instead, a random member is chosen out of a pool of members that have a latency lower than the calculated latency window.
The latency window is calculated by taking the fastest member's ping and adding replication.localPingThresholdMs, whose default is 15ms. You can read more about the algorithm here.
So what I do is I combine nearest with tags so that I can specify the member manually that I know is geographically closest.
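The selection described in this answer can be sketched in plain JavaScript. This is a simplified model and not the driver's or mongos's actual code: the function names are made up, the ping times are hypothetical, and only the hosts and dc tags come from the question.

```javascript
// Members eligible for a "nearest" read: optionally filtered by a tag set,
// then limited to those within `thresholdMs` of the fastest remaining member.
function eligibleMembers(members, thresholdMs, tags = {}) {
  const tagged = members.filter((m) =>
    Object.entries(tags).every(([k, v]) => m.tags[k] === v));
  if (tagged.length === 0) return [];
  const fastest = Math.min(...tagged.map((m) => m.pingMs));
  return tagged.filter((m) => m.pingMs <= fastest + thresholdMs);
}

// Pick one eligible member at random, as the answer describes.
function pickNearest(members, thresholdMs, tags = {}) {
  const pool = eligibleMembers(members, thresholdMs, tags);
  return pool.length ? pool[Math.floor(Math.random() * pool.length)] : null;
}

// Hosts and tags from the question; the ping times are hypothetical.
const members = [
  { host: "primary1.com:27028", pingMs: 5, tags: { dc: "primary" } },
  { host: "primary2.com:27028", pingMs: 12, tags: { dc: "primary" } },
  { host: "failover1.com:27028", pingMs: 1, tags: { dc: "failover" } },
];
const names = (list) => list.map((m) => m.host);

// With a 15 ms window, all three hosts qualify (1 ms fastest, cutoff 16 ms).
console.log(names(eligibleMembers(members, 15)));
// Adding the tag set narrows the pool to the failover host only.
console.log(names(eligibleMembers(members, 15, { dc: "failover" })));
```

With pings this close together, nearest alone can land on any of the three members, while the tag set pins reads to the local one.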

Why does a member of a MongoDB replica set keep RECOVERING?

I set up a replica set with three members and one of them is an arbiter.
Once, after I restarted a member, it stayed in RECOVERING for a long time and did not become SECONDARY again, even though the database was not large.
The status of the replica set looks like this:
rs:PRIMARY> rs.status()
{
"set" : "rs",
"date" : ISODate("2013-01-17T02:08:57Z"),
"myState" : 1,
"members" : [
{
"_id" : 1,
"name" : "192.168.1.52:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 67968,
"optime" : Timestamp(1358388479000, 1),
"optimeDate" : ISODate("2013-01-17T02:07:59Z"),
"self" : true
},
{
"_id" : 2,
"name" : "192.168.1.50:29017",
"health" : 1,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 107,
"lastHeartbeat" : ISODate("2013-01-17T02:08:56Z"),
"pingMs" : 0
},
{
"_id" : 3,
"name" : "192.168.1.50:27017",
"health" : 1,
"state" : 3,
"stateStr" : "RECOVERING",
"uptime" : 58,
"optime" : Timestamp(1358246732000, 100),
"optimeDate" : ISODate("2013-01-15T10:45:32Z"),
"lastHeartbeat" : ISODate("2013-01-17T02:08:55Z"),
"pingMs" : 0,
"errmsg" : "still syncing, not yet to minValid optime 50f6472f:5d"
}
],
"ok" : 1
}
How should I fix this problem?

I had the exact same issue: a secondary member of the replica set stuck in recovering mode.
Here is how to solve the issue:
stop the secondary mongod
delete all of the secondary's data files
start the secondary mongod
It will start in STARTUP2 mode and will replicate all data from the primary

I've fixed the issue by following the below procedure.
Step 1:
Log in to a different node and remove the problem node from the MongoDB replica set, e.g.
rs.remove("10.x.x.x:27017")
Step 2:
Stop the mongodb server on the issue node
systemctl stop mongodb.service
Step 3:
Create a new folder for the dbpath
mkdir /opt/mongodb/data/db1
Note : existing path was /opt/mongodb/data/db
Step 4:
Modify dbPath in /etc/mongod.conf or your MongoDB YAML config file
dbPath: /opt/mongodb/data/db1
Step 5:
Start the mongodb service
systemctl start mongodb.service
Step 6:
Take a backup of the existing folder and remove it
mkdir /opt/mongodb/data/backup
mv /opt/mongodb/data/db/* /opt/mongodb/data/backup
tar -czvf /opt/mongodb/data/backup.tar.gz /opt/mongodb/data/backup
rm -rf /opt/mongodb/data/db/

This will happen if replication has been broken for a while and the secondary no longer has enough data to resume replication.
You would have to re-sync the secondary, either by replicating the data from scratch or by copying it from another server, and then resume replication.
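The errmsg in the question ("not yet to minValid optime 50f6472f:5d") hints at how far the member still has to go. The sketch below, in plain JavaScript, decodes that value under the assumption that it is a hexadecimal seconds:increment pair. That format is inferred from the output in the question and is not confirmed by it, and decodeOptime is a made-up helper name.

```javascript
// Decode an optime printed as "<hex seconds>:<hex increment>".
// The format is an assumption based on the errmsg shown in the question.
function decodeOptime(text) {
  const [secondsHex, incrementHex] = text.split(":");
  const seconds = parseInt(secondsHex, 16);
  return {
    seconds,
    increment: parseInt(incrementHex, 16),
    date: new Date(seconds * 1000).toISOString(),
  };
}

const minValid = decodeOptime("50f6472f:5d");
console.log(minValid.date); // 2013-01-16T06:22:39.000Z
```

If that reading is right, the member's own optimeDate of 2013-01-15T10:45:32Z is about 19.6 hours short of the point it must reach before it can leave RECOVERING.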

Check mongodb documentation for this issue https://docs.mongodb.com/manual/tutorial/resync-replica-set-member/#replica-set-auto-resync-stale-member

Should I increase the size of my MongoDB oplog file?

I understand that the oplog file will split multi updates into individual updates but what about batch inserts? Are those also split into individual inserts?
If I have a write intensive collection with batches of ~20K docs being inserted roughly every 30 seconds, do I / should I consider increasing my oplog size beyond the default? I have a 3 member replica set and mongod is running on a 64 bit Ubuntu server install with the Mongodb data sitting on a 100GB volume.
Here is some data which may or may not be helpful:
gs_rset:PRIMARY> db.getReplicationInfo()
{
"logSizeMB" : 4591.3134765625,
"usedMB" : 3434.63,
"timeDiff" : 68064,
"timeDiffHours" : 18.91,
"tFirst" : "Wed Oct 24 2012 22:35:10 GMT+0000 (UTC)",
"tLast" : "Thu Oct 25 2012 17:29:34 GMT+0000 (UTC)",
"now" : "Fri Oct 26 2012 19:42:19 GMT+0000 (UTC)"
}
gs_rset:PRIMARY> rs.status()
{
"set" : "gs_rset",
"date" : ISODate("2012-10-26T19:44:00Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "xxxx:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 77531,
"optime" : Timestamp(1351186174000, 1470),
"optimeDate" : ISODate("2012-10-25T17:29:34Z"),
"self" : true
},
{
"_id" : 1,
"name" : "xxxx:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 76112,
"optime" : Timestamp(1351186174000, 1470),
"optimeDate" : ISODate("2012-10-25T17:29:34Z"),
"lastHeartbeat" : ISODate("2012-10-26T19:44:00Z"),
"pingMs" : 1
},
{
"_id" : 2,
"name" : "xxxx:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 61301,
"optime" : Timestamp(1351186174000, 1470),
"optimeDate" : ISODate("2012-10-25T17:29:34Z"),
"lastHeartbeat" : ISODate("2012-10-26T19:43:59Z"),
"pingMs" : 1
}
],
"ok" : 1
}
gs_rset:PRIMARY> db.printCollectionStats()
dev_fbinsights
{
"ns" : "dev_stats.dev_fbinsights",
"count" : 6556181,
"size" : 3117699832,
"avgObjSize" : 475.53596095043747,
"storageSize" : 3918532608,
"numExtents" : 22,
"nindexes" : 2,
"lastExtentSize" : 1021419520,
"paddingFactor" : 1,
"systemFlags" : 0,
"userFlags" : 0,
"totalIndexSize" : 1150346848,
"indexSizes" : {
"_id_" : 212723168,
"fbfanpage_id_1_date_1_data.id_1" : 937623680
},
"ok" : 1
}

The larger the size of the current primary's oplog, the longer the window of time a replica set member will be able to remain offline without falling too far behind the primary. If it does fall too far behind, it will need a full resync.
The field timeDiffHours as returned by db.getReplicationInfo() reports how many hours worth of data the oplog currently has recorded. After the oplog has filled up and starts overwriting old entries, then start to monitor this value. Do so especially under heavy write load (in which the value will decrease). If you then assume it will never drop below N hours, then N is the maximum number of hours that you can tolerate a replica set member being temporarily offline (e.g. for regular maintenance, or to make an offline backup, or in the event of hardware failure) without performing the full resync. The member would then be able to automatically catch up to the primary after coming back online.
If you're not comfortable with how low N is, then you should increase the size of the oplog. It depends entirely on how long your maintenance windows are, or how quickly you or your ops team can respond to disaster scenarios. Be liberal in how much disk space you allocate for it, unless you have a compelling need for that space.
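As a rough aid for that sizing decision, the sketch below (plain JavaScript, runnable in Node) derives an average write rate from the db.getReplicationInfo() fields shown in the question and scales it to a target window. The function names, the 48-hour target and the 2x safety factor are arbitrary examples. The average is only meaningful while the oplog reflects typical load.

```javascript
// Average oplog growth in MB per hour, from db.getReplicationInfo() fields.
function oplogMBPerHour(info) {
  return info.usedMB / (info.timeDiff / 3600);
}

// Oplog size in MB needed to cover `targetHours` at the observed write rate,
// multiplied by a safety factor to leave headroom for bulk loads.
function requiredOplogMB(info, targetHours, safetyFactor) {
  return oplogMBPerHour(info) * targetHours * safetyFactor;
}

// Values from the db.getReplicationInfo() output in the question.
const info = { logSizeMB: 4591.3134765625, usedMB: 3434.63, timeDiff: 68064 };

console.log(oplogMBPerHour(info).toFixed(1)); // 181.7
// 48 hours of coverage with a 2x margin (both figures are arbitrary examples).
console.log(Math.ceil(requiredOplogMB(info, 48, 2)));
```

Comparing the result with logSizeMB shows whether the current oplog already covers the window you want.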
I'm assuming here that you're keeping the size of the oplog constant over all replica set members, which is a reasonable thing to do. If not, then plan for the scenario where the replica set member with the smallest oplog gets elected primary.
(To answer your other question: similarly to multi-updates, batch inserts are also fanned out into multiple operations in the oplog)
Edit: Note that data imports and bulk inserts/updates will write data significantly faster to the oplog than your application might at a typical heavy load. To reiterate: be conservative in your estimation for how much time it will take for the oplog to fill.

Shards are not equally balanced in cluster

I have 2 shards.
One is on a standalone server and the other on a replica set:
mongos> db.runCommand({listshards:1})
{
"shards" : [
{
"_id" : "shard0000",
"host" : "mongo3:10001"
},
{
"_id" : "set1",
"host" : "set1/mongo1:10001,mongo2:10001"
}
],
"ok" : 1
}
I've inserted about 30M records.
As far as I understand, mongo should balance the data equally between the shards, but that is not happening:
mongos> db.stats()
{
"raw" : {
"set1/mongo1:10001,mongo2:10001" : {
"db" : "my_ginger",
"collections" : 3,
"objects" : 5308714,
"avgObjSize" : 811.9953284354742,
"dataSize" : 4310650968,
"storageSize" : 4707774464,
"numExtents" : 23,
"indexes" : 2,
"indexSize" : 421252048,
"fileSize" : 10666115072,
"nsSizeMB" : 16,
"ok" : 1
},
"mongo3:10001" : {
"db" : "my_ginger",
"collections" : 6,
"objects" : 25162626,
"avgObjSize" : 1081.6777010475776,
"dataSize" : 27217851444,
"storageSize" : 28086624096,
"numExtents" : 38,
"indexes" : 6,
"indexSize" : 1903266512,
"fileSize" : 34276900864,
"nsSizeMB" : 16,
"ok" : 1
}
},
"objects" : 30471340,
"avgObjSize" : 1034.6936633571088,
"dataSize" : 31528502412,
"storageSize" : 32794398560,
"numExtents" : 61,
"indexes" : 8,
"indexSize" : 2324518560,
"fileSize" : 44943015936,
"ok" : 1
}
What am I doing wrong?
Thanks.

According to the sh.status() output in the comments, you have 164 chunks on shard0000 (the single host) and 85 on set1 (the replica set). There are a couple of common reasons that this kind of imbalance can happen:
You picked a bad shard key (monotonically increasing or similar)
All your data was initially on a single shard and is being rebalanced
The balancer will continuously attempt to move chunks from the high shard to the low one while at the same time moving the max-chunk around (for people who pick the aforementioned monotonically increasing keys, this helps). However, there can only be one migration at a time, so this will take a while, especially if you continue writing to and reading from the shards at the same time. If things are really bad, and you did pick a poor shard key, then this may persist for some time.
If all your data was on one shard first, and then you added another, then you have a similar problem - it will take a while for the chunk count to stabilise because half the data has to be moved from the original shard (in addition to its other activities) to balance things out. The balancer will pick low range chunks to move first in general, so if these are less likely to be in memory (back to the poor shard key again), then they will have to be paged in before they can be migrated.
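To get a feel for how long "a while" might be, here is a back-of-the-envelope sketch in plain JavaScript. It assumes the balancer compares chunk counts only and moves one chunk per migration. The function names and the 60 seconds per migration are hypothetical; the real figure would have to be measured on your own cluster.

```javascript
// Number of chunk migrations needed to even out two shards, assuming the
// balancer only looks at chunk counts and moves one chunk per migration.
function migrationsToBalance(chunksA, chunksB) {
  return Math.floor(Math.abs(chunksA - chunksB) / 2);
}

// Rough wall-clock estimate; secondsPerMigration is a hypothetical figure
// that you would measure on your own cluster.
function estimatedHours(migrations, secondsPerMigration) {
  return (migrations * secondsPerMigration) / 3600;
}

// Chunk counts from the sh.status() output mentioned above.
const moves = migrationsToBalance(164, 85);
console.log(moves); // 39
console.log(estimatedHours(moves, 60).toFixed(2)); // 0.65
```

In practice the balancer may stop before the counts are exactly even, and migrations slow down under concurrent load, so this is only an order-of-magnitude estimate.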
To check the balancer is running:
http://docs.mongodb.org/manual/reference/method/sh.setBalancerState/#sh.getBalancerState
Then, to see what it has been up to, connect to a mongos (last 10 operations):
use config
db.changelog.find().sort({$natural:-1}).limit(10).pretty()
Similarly you will see messaging in the primary logs of each shard relating to the migrations, how long they are taking etc. if you want to see their performance.
