MongoDB chunk split and jumbo chunks

I am using a MongoDB default setup with a max chunk size of 64 MB and a max document size of 16 MB (https://docs.mongodb.org/manual/reference/limits/).
I ran into a chunk that was too big and needed to be split, but the split failed.
I don't understand why. Let's assume we have big documents: a chunk is created with one document (16 MB), then a new document arrives (chunk now 32 MB), and so on until we reach 64 MB, at which point the split process occurs. Why can it fail? Why not split the chunk into two halves of two 16 MB documents each?
Or, in the case where the documents are like:
10 MB + 16 MB + 16 MB + 16 MB + 16 MB = 74 MB (needs to split)
it could do:
10 MB + 16 MB + 16 MB = 42 MB
16 MB + 16 MB = 32 MB
Why is this failing?
MongoDB log:
2016-04-06T12:10:53.424-0700 I SHARDING [Balancer] ns: reports.storePoints going to move { id: "reports.storePoints-storeId-9150629136201875191", ns: "reports.storePoints", min: { storeId: -9150629136201875191 }, max: { storeId: -9148794391694793543 }, version: Timestamp 666000|1, versionEpoch: ObjectId('54ac908475d02f0bdb362171'), lastmod: Timestamp 666000|1, lastmodEpoch: ObjectId('54ac908475d02f0bdb362171'), shard: "rs1" } from: rs1 to: rs2 tag []
2016-04-06T12:10:53.477-0700 I SHARDING [Balancer] moving chunk ns: reports.storePoints moving ( ns: reports.storePoints, shard: rs1:rs1/host1:27017,host2:27017, lastmod: 666|1||000000000000000000000000, min: { storeId: -9150629136201875191 }, max: { storeId: -9148794391694793543 }) rs1:rs1/host1:27017,host1:27017 -> rs2:rs2/host2:27017,host2:27017
2016-04-06T12:10:54.052-0700 I SHARDING [Balancer] moveChunk result: { chunkTooBig: true, estimatedChunkSize: 111379744, ok: 0.0, errmsg: "chunk too big to move", $gleStats: { lastOpTime: Timestamp 1455673163000|2, electionId: ObjectId('55b838ecf80a354067abdbbf') } }
2016-04-06T12:10:54.053-0700 I SHARDING [Balancer] balancer move failed: { chunkTooBig: true, estimatedChunkSize: 111379744, ok: 0.0, errmsg: "chunk too big to move", $gleStats: { lastOpTime: Timestamp 1455673163000|2, electionId: ObjectId('55b838ecf80a354067abdbbf') } } from: rs1 to: rs2 chunk: min: { storeId: -9150629136201875191 } max: { storeId: -9148794391694793543 }
2016-04-06T12:10:54.053-0700 I SHARDING [Balancer] performing a split because migrate failed for size reasons
2016-04-06T12:10:54.063-0700 I SHARDING [Balancer] split results: CannotSplit chunk not full enough to trigger auto-split
2016-04-06T12:10:54.063-0700 I SHARDING [Balancer] marking chunk as jumbo: ns: reports.storePoints, shard: rs1:rs1/host1:27017,host2:27017, lastmod: 666|1||000000000000000000000000, min: { storeId: -9150629136201875191 }, max: { storeId: -9148794391694793543 }
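Note: the split in the log fails with "chunk not full enough to trigger auto-split", and once a chunk cannot be split it gets flagged as jumbo and the balancer skips it. A minimal sketch of how such a chunk is usually split by hand from a mongos (the storeId used as the split point is a hypothetical value inside the chunk's range; a manual split only works if the range actually contains more than one distinct shard key value):
// Split the chunk that contains the given shard key value at exactly that value
sh.splitAt("reports.storePoints", { storeId: NumberLong("-9149700000000000000") })
// Or let MongoDB pick the median of the chunk matched by this query
sh.splitFind("reports.storePoints", { storeId: NumberLong("-9150629136201875191") })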

Related

MongoDB TTL index behaves unexpectedly

We currently have a TTL index on a timestamp in MongoDB.
Before, we had a TTL index with expireAfterSeconds: 604800 seconds -> 1 week of documents.
A few weeks ago we changed this to a TTL of 31536000 seconds -> 365 days.
But MongoDB still removes documents after 7 days instead of 365.
MongoDB version: 4.2.18
MongoDB is running on hosted Atlas / AWS
Indexes in the database:
[
  {
    v: 2,
    key: { _id: 1 },
    name: '_id_',
    ns: '<DB>.<COLLECTION>'
  },
  {
    v: 2,
    key: { id: 1, ts: 1 },
    name: 'id_1_ts_1',
    ns: '<DB>.<COLLECTION>'
  },
  {
    v: 2,
    key: { ts: 1 },
    name: 'ts_1',
    ns: '<DB>.<COLLECTION>',
    expireAfterSeconds: 31536000
  }
]
We have an environment variable to set the TTL index on each start of the application;
the code to set the TTL index looks like this:
try {
  await collection.createIndex(
    { ts: 1 },
    { expireAfterSeconds: ENV.<Number> }
  );
} catch (e) {
  // If a TTL index on { ts: 1 } already exists with a different
  // expireAfterSeconds value, createIndex fails with IndexOptionsConflict;
  // in that case drop the old index and recreate it with the new TTL.
  if ((e as any).codeName == "IndexOptionsConflict") {
    await collection.dropIndex("ts_1");
    await collection.createIndex(
      { ts: 1 },
      { expireAfterSeconds: ENV.<Number> }
    );
  }
}
As seen from the indexes, the TTL index should remove documents only after one year, right?
Why does it behave like this? Any insights?
I would look into the collMod command; it can change expireAfterSeconds on an existing TTL index in place.
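For example, a minimal sketch of changing the TTL without dropping the index, run against the database that owns the collection (the collection name placeholder is carried over from the question):
db.runCommand({
  collMod: "<COLLECTION>",
  index: {
    keyPattern: { ts: 1 },
    expireAfterSeconds: 31536000
  }
})
This avoids dropping and recreating the index; also keep in mind that the TTL monitor only runs roughly once every 60 seconds, so deletions are not instantaneous.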

Mongodb balancing very slow

We are experiencing very slow balancing in our cluster. In our logs, it looks like the migration barely makes progress:
2016-01-25T22:21:15.907-0600 I SHARDING [conn142] moveChunk data transfer progress: { active: true, ns: "music.fav_artist_score", from: "rs1/MONGODB01-SRV:27017,MONGODB05-SRV:27017", min: { _id.u: -9159729253516193447 }, max: { _id.u: -9157438072680830290 }, shardKeyPattern: { _id.u: "hashed" }, state: "clone", counts: { cloned: 128, clonedBytes: 12419, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
2016-01-25T22:21:16.932-0600 I SHARDING [conn142] moveChunk data transfer progress: { active: true, ns: "music.fav_artist_score", from: "rs1/MONGODB01-SRV:27017,MONGODB05-SRV:27017", min: { _id.u: -9159729253516193447 }, max: { _id.u: -9157438072680830290 }, shardKeyPattern: { _id.u: "hashed" }, state: "clone", counts: { cloned: 128, clonedBytes: 12419, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
2016-01-25T22:21:17.957-0600 I SHARDING [conn142] moveChunk data transfer progress: { active: true, ns: "music.fav_artist_score", from: "rs1/MONGODB01-SRV:27017,MONGODB05-SRV:27017", min: { _id.u: -9159729253516193447 }, max: { _id.u: -9157438072680830290 }, shardKeyPattern: { _id.u: "hashed" }, state: "clone", counts: { cloned: 128, clonedBytes: 12419, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Also, when we shard a new collection, it initially starts with only 8 chunks, all on the same primary replica set, and does not migrate chunks to the other shards.
Our configuration is 4 replica sets (each primary, secondary, arbiter) and 3 config servers in a replica set. Both sh.getBalancerState() and sh.isBalancerRunning() return true.
In MongoDB, sharding performance depends on the key chosen for sharding the database. Since your chunks always end up on a single node, it is highly probable that the shard key you have chosen is monotonically increasing. To avoid this issue, hash the key so chunks balance properly across all the shards. Use the following command for hashed sharding:
sh.shardCollection( "<your-db>.<your-collection>", { <shard-key>: "hashed" } )
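If the collection is still empty, one option is to also pre-split by requesting more initial chunks so data lands on all shards from the start. A rough sketch, assuming an empty collection and a hashed shard key (numInitialChunks is only honoured in that case; the names are placeholders as above):
db.adminCommand({
  shardCollection: "<your-db>.<your-collection>",
  key: { <shard-key>: "hashed" },
  numInitialChunks: 32
})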

Mongo Auto Balancing Not Working

I'm running into an issue where one of my shards is constantly at 100% CPU usage while I'm storing files into my MongoDB (using GridFS). I have shut down writing to the DB and the usage does drop to nearly 0%. However, the auto balancer is on and does not appear to be balancing anything. I have roughly 50% of my data on that one shard, with nearly 100% CPU usage, while virtually all the others are at 7-8%.
Any ideas?
mongos> version()
3.0.6
Auto Balancing Enabled
Storage Engine: WiredTiger
I have this general architecture:
2 - routers
3 - config server
8 - shards (2 shards per server - 4 servers)
No replica sets!
https://docs.mongodb.org/v3.0/core/sharded-cluster-architectures-production/
Log Details
Router 1 Log:
2016-01-15T16:15:21.714-0700 I NETWORK [conn3925104] end connection [IP]:[port] (63 connections now open)
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] Socket recv() timeout [IP]:[port]
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] SocketException: remote: [IP]:[port] error: 9001 socket exception [RECV_TIMEOUT] server [IP]:[port]
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] DBClientCursor::init call() failed
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] scoped connection to [IP]:[port],[IP]:[port],[IP]:[port] not being returned to the pool
2016-01-15T16:15:23.256-0700 W SHARDING [LockPinger] distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579303:1804289383' detected an exception while pinging. :: caused by :: SyncClusterConnection::update prepare failed: [IP]:[port] (IP) failed:10276 DBClientBase::findN: transport error: [IP]:[port] ns: admin.$cmd query: { getlasterror: 1, fsync: 1 }
2016-01-15T16:15:24.715-0700 I NETWORK [mongosMain] connection accepted from [IP]:[port] #3925105 (64 connections now open)
2016-01-15T16:15:24.715-0700 I NETWORK [conn3925105] end connection [IP]:[port] (63 connections now open)
2016-01-15T16:15:27.717-0700 I NETWORK [mongosMain] connection accepted from [IP]:[port] #3925106 (64 connections now open)
2016-01-15T16:15:27.718-0700 I NETWORK [conn3925106] end connection [IP]:[port](63 connections now open)
Router 2 Log:
2016-01-15T16:18:21.762-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e3d110ccb8e38549a9d
2016-01-15T16:18:24.316-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:18:24 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579454:1804289383', sleeping for 30000ms
2016-01-15T16:18:24.978-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:35.295-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e4a110ccb8e38549a9f
2016-01-15T16:18:38.507-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:48.838-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e58110ccb8e38549aa1
2016-01-15T16:18:52.038-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:54.660-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:18:54 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579454:1804289383', sleeping for 30000ms
2016-01-15T16:19:02.323-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e66110ccb8e38549aa3
2016-01-15T16:19:05.513-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
Problematic Shard Log:
2016-01-15T16:21:03.426-0700 W SHARDING [conn40] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 46 took 17364ms
2016-01-15T16:21:03.484-0700 I COMMAND [conn40] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 17421ms
2016-01-15T16:21:03.775-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:21:03 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1441718306:765353801', sleeping for 30000ms
2016-01-15T16:21:04.321-0700 I SHARDING [conn40] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
2016-01-15T16:21:08.243-0700 I SHARDING [conn46] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
2016-01-15T16:21:10.174-0700 W SHARDING [conn37] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 60 took 18516ms
2016-01-15T16:21:10.232-0700 I COMMAND [conn37] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 18574ms
2016-01-15T16:21:10.989-0700 W SHARDING [conn25] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 62 took 18187ms
2016-01-15T16:21:11.047-0700 I COMMAND [conn25] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 18246ms
2016-01-15T16:21:11.365-0700 I SHARDING [conn37] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
For the splitting error: upgrading to Mongo v3.0.8+ resolved it.
Still having an issue with the balancing itself... the shard key is an MD5 checksum, so unless they all have very similar MD5s (not very likely) there is still investigating to do... using range-based partitioning.
There are multiple ways to check:
db.printShardingStatus() - this lists all sharded collections, whether the auto balancer is on, and which collection is currently being balanced and since when.
sh.status(true) - this gives chunk-level details. Look at whether your chunk has jumbo: true. If a chunk is marked as jumbo it will not be split properly.
db.collection.stats() - this gives collection stats, including the distribution details for each shard.
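As a rough sketch of the last two checks, run from a mongos against the collection that shows up in the logs above (the namespace is taken from those logs):
// Chunk-level details: list any chunks flagged as jumbo for the collection
db.getSiblingDB("config").chunks.find({ ns: "Files.fs.chunks", jumbo: true }).forEach(printjson)
// Per-shard data and document distribution for the collection
db.getSiblingDB("Files").fs.chunks.getShardDistribution()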

How to enable sharding in a test environment

How do I enable sharding in a test environment? Here is what I have done so far. I have one config server:
Config server 1: Host-a:27019
One mongos instance on the same machine on port 27017
and two mongod shard instances:
Host-a:27020
host-b:27021
When I enable sharding on a collection, it gives me this error:
2016-01-12T10:31:07.522Z I SHARDING [Balancer] ns: feedproductsdata.merchantproducts going to move { _id: "feedproductsdata.merchantproducts-product_id_MinKey", ns: "feedproductsdata.merchantproducts", min: { product_id: MinKey }, max: { product_id: 0 }, version: Timestamp 1000|0, versionEpoch: ObjectId('5694d57ebe78315b68519c38'), lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('5694d57ebe78315b68519c38'), shard: "shard0001" } from: shard0001 to: shard0000 tag []
2016-01-12T10:31:07.523Z I SHARDING [Balancer] moving chunk ns: feedproductsdata.merchantproducts moving ( ns: feedproductsdata.merchantproducts, shard: shard0001:192.168.1.12:27021, lastmod: 1|0||000000000000000000000000, min: { product_id: MinKey }, max: { product_id: 0 }) shard0001:192.168.1.12:27021 -> shard0000:192.168.1.8:27020
2016-01-12T10:31:08.530Z I SHARDING [Balancer] moveChunk result: { errmsg: "exception: socket exception [CONNECT_ERROR] for cfg1.server.com:27019", code: 11002, ok: 0.0 }
2016-01-12T10:31:08.531Z I SHARDING [Balancer] balancer move failed: { errmsg: "exception: socket exception [CONNECT_ERROR] for cfg1.server.com:27019", code: 11002, ok: 0.0 } from: shard0001 to: shard0000 chunk: min: { product_id: MinKey } max: { product_id: 0 }
2016-01-12T10:31:08.604Z I SHARDING [Balancer] distributed lock 'balancer/Knowledgeops-PC:27017:1452594322:41' unlocked.
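For reference, a minimal sketch of how such a test cluster is usually wired together from the mongos, using the hostnames, database, collection, and shard key that appear in the question and its log:
// Run against the mongos on Host-a:27017
sh.addShard("Host-a:27020")
sh.addShard("host-b:27021")
// Enable sharding on the database, then shard the collection
sh.enableSharding("feedproductsdata")
sh.shardCollection("feedproductsdata.merchantproducts", { product_id: 1 })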

mongodb balancer does not move chunks because of obscure error

After carefully following this guide on how to migrate a 3-member replica set to a sharded cluster containing 2 x 3-member replica sets, nothing happened after I sharded one of my collections. Looking into the logs, I got this:
2014-06-22T16:49:55.467+0000 [Balancer] distributed lock 'balancer/vt-mongo-6:27018:1403429707:1804289383' unlocked.
2014-06-22T16:50:01.830+0000 [Balancer] distributed lock 'balancer/vt-mongo-6:27018:1403429707:1804289383' acquired, ts : 53a70939025c137381ef2126
2014-06-22T16:50:01.945+0000 [Balancer] ns: database.emailevents going to move { _id: "database.emailevents-_id_MinKey", lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('53a6d810089f342ad32992d4'), ns: "database.emailevents", min: { _id: MinKey }, max: { _id: -9199836772564449863 }, shard: "replicaset2" } from: replicaset2 to: replicaset4 tag []
2014-06-22T16:50:01.945+0000 [Balancer] moving chunk ns: database.emailevents moving ( ns: database.emailevents, shard: replicaset2:replicaset2/vt-mongo-4:27017,vt-mongo-5:27017,vt-mongo-6:27017, lastmod: 1|0||000000000000000000000000, min: { _id: MinKey }, max: { _id: -9199836772564449863 }) replicaset2:database_replicaset2/vt-mongo-4:27017,vt-mongo-5:27017,vt-mongo-6:27017 -> replicaset4:replicaset4/vt-mongo-1:27017,vt-mongo-2:27017,vt-mongo-3:27017
2014-06-22T16:50:01.954+0000 [Balancer] moveChunk result: { errmsg: "exception: no primary shard configured for db: config", code: 8041, ok: 0.0 }
2014-06-22T16:50:01.955+0000 [Balancer] balancer move failed: { errmsg: "exception: no primary shard configured for db: config", code: 8041, ok: 0.0 } from: replicaset2 to: replicaset4 chunk: min: { _id: MinKey } max: { _id: -9199836772564449863 }
2014-06-22T16:50:02.166+0000 [Balancer] distributed lock 'balancer/vt-mongo-6:27018:1403429707:1804289383' unlocked.
Generally, this error:
result: { errmsg: "exception: no primary shard configured for db: config", code: 8041, ok: 0.0 }
is the result of the machines not being able to talk to one another, but that isn't the case here. I have checked manually and all the machines can reach each other.
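One way to narrow this down is to inspect the cluster metadata directly from a mongos; a rough sketch (these are the standard config collections, nothing specific to this cluster is assumed):
// List the shards the cluster knows about
db.getSiblingDB("config").shards.find().forEach(printjson)
// Check which primary shard each database is assigned to
db.getSiblingDB("config").databases.find().forEach(printjson)
// Full sharding status, including per-collection chunk ranges
sh.status(true)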