mongodb balancer does not move chunks because of obscure error

After carefully following this guide on how to migrate a 3-member replica set to a sharded cluster of 2 x 3-member replica sets, nothing happened after I sharded one of my collections. Looking into the logs, I got this:
2014-06-22T16:49:55.467+0000 [Balancer] distributed lock 'balancer/vt-mongo-6:27018:1403429707:1804289383' unlocked.
2014-06-22T16:50:01.830+0000 [Balancer] distributed lock 'balancer/vt-mongo-6:27018:1403429707:1804289383' acquired, ts : 53a70939025c137381ef2126
2014-06-22T16:50:01.945+0000 [Balancer] ns: database.emailevents going to move { _id: "database.emailevents-_id_MinKey", lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('53a6d810089f342ad32992d4'), ns: "database.emailevents", min: { _id: MinKey }, max: { _id: -9199836772564449863 }, shard: "replicaset2" } from: replicaset2 to: replicaset4 tag []
2014-06-22T16:50:01.945+0000 [Balancer] moving chunk ns: database.emailevents moving ( ns: database.emailevents, shard: replicaset2:replicaset2/vt-mongo-4:27017,vt-mongo-5:27017,vt-mongo-6:27017, lastmod: 1|0||000000000000000000000000, min: { _id: MinKey }, max: { _id: -9199836772564449863 }) replicaset2:database_replicaset2/vt-mongo-4:27017,vt-mongo-5:27017,vt-mongo-6:27017 -> replicaset4:replicaset4/vt-mongo-1:27017,vt-mongo-2:27017,vt-mongo-3:27017
2014-06-22T16:50:01.954+0000 [Balancer] moveChunk result: { errmsg: "exception: no primary shard configured for db: config", code: 8041, ok: 0.0 }
2014-06-22T16:50:01.955+0000 [Balancer] balancer move failed: { errmsg: "exception: no primary shard configured for db: config", code: 8041, ok: 0.0 } from: replicaset2 to: replicaset4 chunk: min: { _id: MinKey } max: { _id: -9199836772564449863 }
2014-06-22T16:50:02.166+0000 [Balancer] distributed lock 'balancer/vt-mongo-6:27018:1403429707:1804289383' unlocked.
Generally, this error:
result: { errmsg: "exception: no primary shard configured for db: config", code: 8041, ok: 0.0 }
is the result of the machines not being able to talk to one another, but that isn't the case here. I have manually checked, and all the machines can talk to each other.
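A few standard checks run from the mongos can help narrow this down. This is a minimal sketch using stock shell commands, nothing specific to this deployment:
// List the shards the cluster knows about and confirm both replica sets are registered:
db.adminCommand({ listShards: 1 })
// Show which shard is recorded as the primary shard for each database:
db.getSiblingDB("config").databases.find()
// Overall sharding status, including recent balancer errors:
sh.status()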

Related

MongoDB chunk split and jumbo chunks

Using a MongoDB default setup with a max chunk size of 64 MB and a max document size of 16 MB (https://docs.mongodb.org/manual/reference/limits/).
I had the issue of a chunk too big that needed to be split, but the split failed.
I don't understand why. Let's assume we have big documents: a chunk is created with one document (16 MB), then a new document comes (32 MB total), and so on, until we reach 64 MB, at which point the split process occurs. Why can it fail? Why not split the chunk into two halves of two 16 MB documents each?
Or, in the case the documents are like:
10 MB + 16 MB + 16 MB + 16 MB + 16 MB = 74 MB (needs to split)
It can do:
10 MB + 16 MB + 16 MB = 42 MB
16 MB + 16 MB = 32 MB
Why is this failing?
MongoDB log:
2016-04-06T12:10:53.424-0700 I SHARDING [Balancer] ns: reports.storePoints going to move { id: "reports.storePoints-storeId-9150629136201875191", ns: "reports.storePoints", min: { storeId: -9150629136201875191 }, max: { storeId: -9148794391694793543 }, version: Timestamp 666000|1, versionEpoch: ObjectId('54ac908475d02f0bdb362171'), lastmod: Timestamp 666000|1, lastmodEpoch: ObjectId('54ac908475d02f0bdb362171'), shard: "rs1" } from: rs1 to: rs2 tag []
2016-04-06T12:10:53.477-0700 I SHARDING [Balancer] moving chunk ns: reports.storePoints moving ( ns: reports.storePoints, shard: rs1:rs1/host1:27017,host2:27017, lastmod: 666|1||000000000000000000000000, min: { storeId: -9150629136201875191 }, max: { storeId: -9148794391694793543 }) rs1:rs1/host1:27017,host1:27017 -> rs2:rs2/host2:27017,host2:27017
2016-04-06T12:10:54.052-0700 I SHARDING [Balancer] moveChunk result: { chunkTooBig: true, estimatedChunkSize: 111379744, ok: 0.0, errmsg: "chunk too big to move", $gleStats: { lastOpTime: Timestamp 1455673163000|2, electionId: ObjectId('55b838ecf80a354067abdbbf') } }
2016-04-06T12:10:54.053-0700 I SHARDING [Balancer] balancer move failed: { chunkTooBig: true, estimatedChunkSize: 111379744, ok: 0.0, errmsg: "chunk too big to move", $gleStats: { lastOpTime: Timestamp 1455673163000|2, electionId: ObjectId('55b838ecf80a354067abdbbf') } } from: rs1 to: rs2 chunk: min: { storeId: -9150629136201875191 } max: { storeId: -9148794391694793543 }
2016-04-06T12:10:54.053-0700 I SHARDING [Balancer] performing a split because migrate failed for size reasons
2016-04-06T12:10:54.063-0700 I SHARDING [Balancer] split results: CannotSplit chunk not full enough to trigger auto-split
2016-04-06T12:10:54.063-0700 I SHARDING [Balancer] marking chunk as jumbo: ns: reports.storePoints, shard: rs1:rs1/host1:27017,host2:27017, lastmod: 666|1||000000000000000000000000, min: { storeId: -9150629136201875191 }, max: { storeId: -9148794391694793543 }
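Once a chunk is flagged as jumbo, the balancer skips it, so it may need manual attention. Below is a minimal mongo shell sketch of how to inspect the flagged chunk and attempt a manual split; the namespace and key values come from the log above, the explicit split point is an assumed value inside the chunk's range, and the split may still fail if no valid split point exists between the documents:
// List chunks flagged as jumbo for this collection:
db.getSiblingDB("config").chunks.find({ ns: "reports.storePoints", jumbo: true })
// Let MongoDB pick a split point inside the chunk containing this key:
sh.splitFind("reports.storePoints", { storeId: NumberLong("-9150629136201875191") })
// Or split at an explicit shard key value within the chunk's min/max range:
sh.splitAt("reports.storePoints", { storeId: NumberLong("-9149000000000000000") })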

Mongodb balancing very slow

We are experiencing very slow balancing in our cluster. From our logs, it seems that the migration barely makes progress:
2016-01-25T22:21:15.907-0600 I SHARDING [conn142] moveChunk data transfer progress: { active: true, ns: "music.fav_artist_score", from: "rs1/MONGODB01-SRV:27017,MONGODB05-SRV:27017", min: { _id.u: -9159729253516193447 }, max: { _id.u: -9157438072680830290 }, shardKeyPattern: { _id.u: "hashed" }, state: "clone", counts: { cloned: 128, clonedBytes: 12419, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
2016-01-25T22:21:16.932-0600 I SHARDING [conn142] moveChunk data transfer progress: { active: true, ns: "music.fav_artist_score", from: "rs1/MONGODB01-SRV:27017,MONGODB05-SRV:27017", min: { _id.u: -9159729253516193447 }, max: { _id.u: -9157438072680830290 }, shardKeyPattern: { _id.u: "hashed" }, state: "clone", counts: { cloned: 128, clonedBytes: 12419, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
2016-01-25T22:21:17.957-0600 I SHARDING [conn142] moveChunk data transfer progress: { active: true, ns: "music.fav_artist_score", from: "rs1/MONGODB01-SRV:27017,MONGODB05-SRV:27017", min: { _id.u: -9159729253516193447 }, max: { _id.u: -9157438072680830290 }, shardKeyPattern: { _id.u: "hashed" }, state: "clone", counts: { cloned: 128, clonedBytes: 12419, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Also, when we shard a new collection, it initially starts with only 8 chunks, all on the same primary replica set, and does not migrate chunks to the other shards.
Our configuration is 4 replica sets of (primary, secondary, arbiter) and 3 config servers in a replica set. Both sh.getBalancerState() and sh.isBalancerRunning() return true.
In MongoDB, sharding performance depends on the key chosen for sharding the database. Since your chunks always end up stored on a single node, it is highly probable that the shard key you have chosen is monotonically increasing. To avoid this issue, hash the key to allow proper balancing of chunks across all the shards. Use the following command for hashed sharding:
sh.shardCollection( "<your-db>.<your-collection>", { <shard-key>: "hashed" } )
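For illustration, a hypothetical end-to-end example reusing the namespace from the logs above, as if the collection were not yet sharded:
// Sharding must first be enabled on the database, then on the collection with a hashed key.
sh.enableSharding("music")
sh.shardCollection("music.fav_artist_score", { "_id.u": "hashed" })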

Mongo Auto Balancing Not Working

I'm running into an issue where one of my shards is constantly at 100% CPU usage while I'm storing files into my MongoDB (using GridFS). I have shut down writes to the DB and the usage does drop down to nearly 0%. However, the balancer is on and does not appear to be balancing anything. I have roughly 50% of my data on that one shard with nearly 100% CPU usage, and virtually all the others are at 7-8%.
Any ideas?
mongos> version()
3.0.6
Auto Balancing Enabled
Storage Engine: WiredTiger
I have this general architecture:
2 - routers
3 - config servers
8 - shards (2 shards per server - 4 servers)
No replica sets!
https://docs.mongodb.org/v3.0/core/sharded-cluster-architectures-production/
Log Details
Router 1 Log:
2016-01-15T16:15:21.714-0700 I NETWORK [conn3925104] end connection [IP]:[port] (63 connections now open)
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] Socket recv() timeout [IP]:[port]
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] SocketException: remote: [IP]:[port] error: 9001 socket exception [RECV_TIMEOUT] server [IP]:[port]
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] DBClientCursor::init call() failed
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] scoped connection to [IP]:[port],[IP]:[port],[IP]:[port] not being returned to the pool
2016-01-15T16:15:23.256-0700 W SHARDING [LockPinger] distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579303:1804289383' detected an exception while pinging. :: caused by :: SyncClusterConnection::update prepare failed: [IP]:[port] (IP) failed:10276 DBClientBase::findN: transport error: [IP]:[port] ns: admin.$cmd query: { getlasterror: 1, fsync: 1 }
2016-01-15T16:15:24.715-0700 I NETWORK [mongosMain] connection accepted from [IP]:[port] #3925105 (64 connections now open)
2016-01-15T16:15:24.715-0700 I NETWORK [conn3925105] end connection [IP]:[port] (63 connections now open)
2016-01-15T16:15:27.717-0700 I NETWORK [mongosMain] connection accepted from [IP]:[port] #3925106 (64 connections now open)
2016-01-15T16:15:27.718-0700 I NETWORK [conn3925106] end connection [IP]:[port](63 connections now open)
Router 2 Log:
2016-01-15T16:18:21.762-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e3d110ccb8e38549a9d
2016-01-15T16:18:24.316-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:18:24 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579454:1804289383', sleeping for 30000ms
2016-01-15T16:18:24.978-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:35.295-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e4a110ccb8e38549a9f
2016-01-15T16:18:38.507-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:48.838-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e58110ccb8e38549aa1
2016-01-15T16:18:52.038-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:54.660-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:18:54 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579454:1804289383', sleeping for 30000ms
2016-01-15T16:19:02.323-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e66110ccb8e38549aa3
2016-01-15T16:19:05.513-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
Problematic Shard Log:
2016-01-15T16:21:03.426-0700 W SHARDING [conn40] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 46 took 17364ms
2016-01-15T16:21:03.484-0700 I COMMAND [conn40] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 17421ms
2016-01-15T16:21:03.775-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:21:03 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1441718306:765353801', sleeping for 30000ms
2016-01-15T16:21:04.321-0700 I SHARDING [conn40] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
2016-01-15T16:21:08.243-0700 I SHARDING [conn46] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
2016-01-15T16:21:10.174-0700 W SHARDING [conn37] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 60 took 18516ms
2016-01-15T16:21:10.232-0700 I COMMAND [conn37] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 18574ms
2016-01-15T16:21:10.989-0700 W SHARDING [conn25] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 62 took 18187ms
2016-01-15T16:21:11.047-0700 I COMMAND [conn25] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 18246ms
2016-01-15T16:21:11.365-0700 I SHARDING [conn37] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
For the splitting error: upgrading to MongoDB v3.0.8+ resolved it.
Still having an issue with the balancing itself... The shard key is an MD5 checksum, so unless they all have very similar MD5s (not very likely) there is still investigating to do... We are using range-based partitioning.
There are multiple ways to check (a combined sketch follows this list):
db.printShardingStatus() - shows all sharded collections, whether the balancer is on, and which collection is currently being balanced and since when.
sh.status(true) - gives chunk-level details. Check whether any of your chunks have jumbo: true; a chunk marked as jumbo is skipped by the balancer and will not be moved until it can be split.
db.collection.stats() - gives collection statistics, including the data distribution per shard.
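A combined mongo shell sketch of the checks above (the Files database and fs.chunks collection are taken from the logs; adjust the names for your deployment):
// Sharding overview, balancer state, and chunk-level detail:
db.printShardingStatus()
sh.status(true)
// Per-shard data distribution for the GridFS chunks collection:
db.getSiblingDB("Files").getCollection("fs.chunks").stats()
// Chunks flagged as jumbo can also be listed directly from the config database:
db.getSiblingDB("config").chunks.find({ jumbo: true })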

how to enable sharding in test environment

How do I enable sharding in a test environment? Here is what I have done so far. I have one config server:
Config server1: Host-a:27019
One mongos instance on the same machine, on port 27017,
and two mongod shard instances:
Host-a:27020
host-b:27021
When I enable sharding on a collection, it gives me this error:
2016-01-12T10:31:07.522Z I SHARDING [Balancer] ns: feedproductsdata.merchantproducts going to move { _id: "feedproductsdata.merchantproducts-product_id_MinKey", ns: "feedproductsdata.merchantproducts", min: { product_id: MinKey }, max: { product_id: 0 }, version: Timestamp 1000|0, versionEpoch: ObjectId('5694d57ebe78315b68519c38'), lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('5694d57ebe78315b68519c38'), shard: "shard0001" } from: shard0001 to: shard0000 tag []
2016-01-12T10:31:07.523Z I SHARDING [Balancer] moving chunk ns: feedproductsdata.merchantproducts moving ( ns: feedproductsdata.merchantproducts, shard: shard0001:192.168.1.12:27021, lastmod: 1|0||000000000000000000000000, min: { product_id: MinKey }, max: { product_id: 0 }) shard0001:192.168.1.12:27021 -> shard0000:192.168.1.8:27020
2016-01-12T10:31:08.530Z I SHARDING [Balancer] moveChunk result: { errmsg: "exception: socket exception [CONNECT_ERROR] for cfg1.server.com:27019", code: 11002, ok: 0.0 }
2016-01-12T10:31:08.531Z I SHARDING [Balancer] balancer move failed: { errmsg: "exception: socket exception [CONNECT_ERROR] for cfg1.server.com:27019", code: 11002, ok: 0.0 } from: shard0001 to: shard0000 chunk: min: { product_id: MinKey } max: { product_id: 0 }
2016-01-12T10:31:08.604Z I SHARDING [Balancer] distributed lock 'balancer/Knowledgeops-PC:27017:1452594322:41' unlocked.
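For reference, a minimal mongo shell sketch of the usual sequence for wiring up such a test cluster, run against the mongos; hostnames, ports, and the shard key are taken from the setup and logs above, and in this setup the shards already appear as shard0000/shard0001 in the log:
// Register each shard with the cluster, then enable sharding on the database and collection.
sh.addShard("Host-a:27020")
sh.addShard("host-b:27021")
sh.enableSharding("feedproductsdata")
sh.shardCollection("feedproductsdata.merchantproducts", { product_id: 1 })
// Note: every shard host must be able to resolve and reach the config server
// (cfg1.server.com:27019 in the error above), otherwise chunk moves can fail with CONNECT_ERROR.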

server crashed when using mongodb mapreduce

I'm using a replica set with 3 members; this is the code for firing the mapReduce:
var db = new mongodb.Db('sns', replSet, { "readPreference": "secondaryPreferred", "safe": true });
....
collection.mapReduce(Account_Map, Account_Reduce, { out: { replace: 'log_account' }, query: queryObj }, function (err, resultCollection) { /* handle err / result */ });
Then my primary died and restarted, but was voted back in as a secondary, and a collection sns.tmp.mr.account_0 remains instead of log_account. I'm very new to MongoDB; I really want to figure out what the problem is.
2015-02-06T14:04:34.443+0800 [conn87299] build index on: sns.tmp.mr.account_0_inc properties: { v: 1, key: { 0: 1 }, name: "_temp_0", ns: "sns.tmp.mr.account_0_inc" }
2015-02-06T14:04:34.443+0800 [conn87299] added index to empty collection
2015-02-06T14:04:34.457+0800 [conn87299] build index on: sns.tmp.mr.account_0 properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "sns.tmp.mr.account_0" }
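The sns.tmp.mr.account_0 collection appears to be the intermediate output left behind when the map-reduce was interrupted by the primary stepping down. A minimal mongo shell sketch for inspecting and cleaning it up (names taken from the question; dropping these collections only removes the incomplete intermediate output, not your source data):
// List leftover temporary map-reduce collections in the sns database:
db.getSiblingDB("sns").getCollectionNames().filter(function (name) { return name.indexOf("tmp.mr.") === 0; })
// Drop the leftover intermediate collections once you are sure the job is no longer running:
db.getSiblingDB("sns").getCollection("tmp.mr.account_0").drop()
db.getSiblingDB("sns").getCollection("tmp.mr.account_0_inc").drop()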