I'm running into an issue where one of my shards is constantly at 100% CPU usage while I'm storing files into my MongoDB database (using GridFS). If I shut down writing to the DB, the usage drops to nearly 0%. However, the auto balancer is on and does not appear to be balancing anything. Roughly 50% of my data sits on that one shard with nearly 100% CPU usage, while virtually all the others sit at 7-8%.
Any ideas?
mongos> version()
3.0.6
Auto Balancing Enabled
Storage Engine: WiredTiger
I have this general architecture:
2 - routers
3 - config servers
8 - shards (2 shards per server - 4 servers)
No replica sets!
https://docs.mongodb.org/v3.0/core/sharded-cluster-architectures-production/
Log Details
Router 1 Log:
2016-01-15T16:15:21.714-0700 I NETWORK [conn3925104] end connection [IP]:[port] (63 connections now open)
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] Socket recv() timeout [IP]:[port]
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] SocketException: remote: [IP]:[port] error: 9001 socket exception [RECV_TIMEOUT] server [IP]:[port]
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] DBClientCursor::init call() failed
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] scoped connection to [IP]:[port],[IP]:[port],[IP]:[port] not being returned to the pool
2016-01-15T16:15:23.256-0700 W SHARDING [LockPinger] distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579303:1804289383' detected an exception while pinging. :: caused by :: SyncClusterConnection::update prepare failed: [IP]:[port] (IP) failed:10276 DBClientBase::findN: transport error: [IP]:[port] ns: admin.$cmd query: { getlasterror: 1, fsync: 1 }
2016-01-15T16:15:24.715-0700 I NETWORK [mongosMain] connection accepted from [IP]:[port] #3925105 (64 connections now open)
2016-01-15T16:15:24.715-0700 I NETWORK [conn3925105] end connection [IP]:[port] (63 connections now open)
2016-01-15T16:15:27.717-0700 I NETWORK [mongosMain] connection accepted from [IP]:[port] #3925106 (64 connections now open)
2016-01-15T16:15:27.718-0700 I NETWORK [conn3925106] end connection [IP]:[port] (63 connections now open)
Router 2 Log:
2016-01-15T16:18:21.762-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e3d110ccb8e38549a9d
2016-01-15T16:18:24.316-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:18:24 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579454:1804289383', sleeping for 30000ms
2016-01-15T16:18:24.978-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:35.295-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e4a110ccb8e38549a9f
2016-01-15T16:18:38.507-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:48.838-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e58110ccb8e38549aa1
2016-01-15T16:18:52.038-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:54.660-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:18:54 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579454:1804289383', sleeping for 30000ms
2016-01-15T16:19:02.323-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e66110ccb8e38549aa3
2016-01-15T16:19:05.513-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
Problematic Shard Log:
2016-01-15T16:21:03.426-0700 W SHARDING [conn40] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 46 took 17364ms
2016-01-15T16:21:03.484-0700 I COMMAND [conn40] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 17421ms
2016-01-15T16:21:03.775-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:21:03 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1441718306:765353801', sleeping for 30000ms
2016-01-15T16:21:04.321-0700 I SHARDING [conn40] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
2016-01-15T16:21:08.243-0700 I SHARDING [conn46] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
2016-01-15T16:21:10.174-0700 W SHARDING [conn37] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 60 took 18516ms
2016-01-15T16:21:10.232-0700 I COMMAND [conn37] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 18574ms
2016-01-15T16:21:10.989-0700 W SHARDING [conn25] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 62 took 18187ms
2016-01-15T16:21:11.047-0700 I COMMAND [conn25] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 18246ms
2016-01-15T16:21:11.365-0700 I SHARDING [conn37] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
For the splitting error: upgrading to MongoDB v3.0.8+ resolved it.
Still having an issue with the balancing itself... the shard key is an MD5 checksum, so unless the files all have very similar MD5s (not very likely), there is still investigating to do... using range-based partitioning.
There are multiple ways to check:
db.printShardingStatus() - lists every sharded collection, whether the auto balancer is on, and which collection is currently being balanced and since when.
sh.status(true) - gives chunk-level details. Check whether your chunk has jumbo: true; a chunk marked as jumbo will not be split properly.
db.collection.stats() - gives collection statistics, including the data distribution across each shard.
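One more way to spot jumbo chunks, sketched for the mongo shell (run it against a mongos; this reads the standard sharding metadata in the config database):

```javascript
// List chunks the balancer has flagged as jumbo
var cfg = db.getSiblingDB("config");
cfg.chunks.find({ jumbo: true }).forEach(printjson);
```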
I am trying to convert my standalone MongoDB instance to a single-node replica set, for the purpose of live migrating to Atlas.
I followed this procedure: https://docs.mongodb.com/manual/tutorial/convert-standalone-to-replica-set/
The steps I took were:
$sudo service mongodb stop
$sudo service mongod start
$mongo
>rs.initiate()
{
"info2" : "no configuration explicitly specified -- making one",
"me" : "staging3.domain.io:27017",
"info" : "Config now saved locally. Should come online in about a minute.",
"ok" : 1
}
singleNodeRepl:PRIMARY> rs.status()
{
"set" : "singleNodeRepl",
"date" : ISODate("2020-11-26T00:46:25Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "staging4.domain.io:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 1197,
"optime" : Timestamp(1606350415, 1),
"optimeDate" : ISODate("2020-11-26T00:26:55Z"),
"electionTime" : Timestamp(1606350415, 2),
"electionDate" : ISODate("2020-11-26T00:26:55Z"),
"self" : true
}
],
"ok" : 1
}
singleNodeRepl:PRIMARY> db.oplog.rs.find()
{ "ts" : Timestamp(1606350415, 1), "h" : NumberLong(0), "v" : 2, "op" : "n", "ns" : "", "o" : { "msg" : "initiating set" } }
At this point, it seems to have no issues.
However, my application is not able to work as it did before.
Would really appreciate any help in troubleshooting the issue.
Thank you.
EDIT:
As suggested, I included replSet in the config file instead of passing it as an argument.
This is my config file:
# mongod.conf
#where to log
logpath=/var/log/mongodb/mongod.log
logappend=true
# fork and run in background
fork=true
#port=27017
dbpath=/var/lib/mongo
# location of pidfile
pidfilepath=/var/run/mongodb/mongod.pid
# Listen to local interface only. Comment out to listen on all interfaces.
#bind_ip=127.0.0.1
# Disables write-ahead journaling
# nojournal=true
# Enables periodic logging of CPU utilization and I/O wait
#cpu=true
# Turn on/off security. Off is currently the default
#noauth=true
#auth=true
# Verbose logging output.
verbose=true
# Inspect all client data for validity on receipt (useful for
# developing drivers)
#objcheck=true
# Enable db quota management
#quota=true
# Set oplogging level where n is
# 0=off (default)
# 1=W
# 2=R
# 3=both
# 7=W+some reads
#diaglog=0
# Ignore query hints
#nohints=true
# Enable the HTTP interface (Defaults to port 28017).
#httpinterface=true
# Turns off server-side scripting. This will result in greatly limited
# functionality
#noscripting=true
# Turns off table scans. Any query that would do a table scan fails.
#notablescan=true
# Disable data file preallocation.
#noprealloc=true
# Specify .ns file size for new databases.
# nssize=<size>
# Replication Options
# in replicated mongo databases, specify the replica set name here
replSet=singleNodeRepl
# maximum size in megabytes for replication operation log
#oplogSize=1024
# path to a key file storing authentication info for connections
# between replica set members
#keyFile=/path/to/keyfile
And verbose log file:
It does look like everything is working fine. However, my application is not able to connect to the DB as it did before.
2020-11-26T00:26:55.852+0000 [conn1] replSet replSetInitiate admin command received from client
2020-11-26T00:26:55.853+0000 [conn1] replSet info initiate : no configuration specified. Using a default configuration for the set
2020-11-26T00:26:55.853+0000 [conn1] replSet created this configuration for initiation : { _id: "singleNodeRepl", members: [ { _id: 0, host: "staging4.domain.io:27017" } ] }
2020-11-26T00:26:55.853+0000 [conn1] replSet replSetInitiate config object parses ok, 1 members specified
2020-11-26T00:26:55.853+0000 [conn1] getMyAddrs(): [127.0.0.1] [10.20.26.228] [::1] [fe80::8ed:65ff:fe9e:15ab%eth0]
2020-11-26T00:26:55.853+0000 [conn1] getallIPs("staging4.domain.io"): [127.0.0.1]
2020-11-26T00:26:55.853+0000 [conn1] replSet replSetInitiate all members seem up
2020-11-26T00:26:55.853+0000 [conn1] ******
2020-11-26T00:26:55.853+0000 [conn1] creating replication oplog of size: 2570MB...
2020-11-26T00:26:55.853+0000 [conn1] create collection local.oplog.rs { size: 2695574937.6, capped: true, autoIndexId: false }
2020-11-26T00:26:55.853+0000 [conn1] Database::_addNamespaceToCatalog ns: local.oplog.rs
2020-11-26T00:26:55.866+0000 [conn1] ExtentManager::increaseStorageSize ns:local.oplog.rs desiredSize:2146426624 fromFreeList: 0 eloc: 1:2000
2020-11-26T00:26:55.876+0000 [conn1] ExtentManager::increaseStorageSize ns:local.oplog.rs desiredSize:549148160 fromFreeList: 0 eloc: 2:2000
2020-11-26T00:26:55.878+0000 [conn1] ******
2020-11-26T00:26:55.878+0000 [conn1] replSet info saving a newer config version to local.system.replset: { _id: "singleNodeRepl", version: 1, members: [ { _id: 0, host: "staging4.domain.io:27017" } ] }
2020-11-26T00:26:55.878+0000 [conn1] Database::_addNamespaceToCatalog ns: local.system.replset
2020-11-26T00:26:55.878+0000 [conn1] ExtentManager::increaseStorageSize ns:local.system.replset desiredSize:8192 fromFreeList: 0 eloc: 2:20bb8000
2020-11-26T00:26:55.878+0000 [conn1] Database::_addNamespaceToCatalog ns: local.system.replset.$_id_
2020-11-26T00:26:55.878+0000 [conn1] build index on: local.system.replset properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "local.system.replset" }
2020-11-26T00:26:55.878+0000 [conn1] local.system.replset: clearing plan cache - collection info cache reset
2020-11-26T00:26:55.878+0000 [conn1] allocating new extent
2020-11-26T00:26:55.878+0000 [conn1] ExtentManager::increaseStorageSize ns:local.system.replset.$_id_ desiredSize:131072 fromFreeList: 0 eloc: 2:20bba000
2020-11-26T00:26:55.878+0000 [conn1] added index to empty collection
2020-11-26T00:26:55.878+0000 [conn1] local.system.replset: clearing plan cache - collection info cache reset
2020-11-26T00:26:55.878+0000 [conn1] replSet saveConfigLocally done
2020-11-26T00:26:55.878+0000 [conn1] replSet replSetInitiate config now saved locally. Should come online in about a minute.
2020-11-26T00:26:55.878+0000 [conn1] command admin.$cmd command: replSetInitiate { replSetInitiate: undefined } keyUpdates:0 numYields:0 locks(micros) W:25362 reslen:206 25ms
2020-11-26T00:26:55.879+0000 [conn1] command test.$cmd command: isMaster { isMaster: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:270 0ms
2020-11-26T00:27:01.256+0000 [conn1] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:27:01.257+0000 [conn1] command test.$cmd command: isMaster { isMaster: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:367 0ms
2020-11-26T00:27:10.688+0000 [conn1] query local.system.replset planSummary: COLLSCAN ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:0 locks(micros) r:97 nreturned:1 reslen:126 0ms
2020-11-26T00:27:10.689+0000 [conn1] command test.$cmd command: isMaster { isMaster: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:367 0ms
2020-11-26T00:27:28.889+0000 [clientcursormon] connections:1
2020-11-26T00:27:33.333+0000 [conn1] end connection 127.0.0.1:50580 (0 connections now open)
2020-11-26T00:27:57.230+0000 [initandlisten] connection accepted from 127.0.0.1:50582 #2 (1 connection now open)
2020-11-26T00:27:57.230+0000 [conn2] command admin.$cmd command: whatsmyuri { whatsmyuri: 1 } ntoreturn:1 keyUpdates:0 numYields:0 reslen:62 0ms
2020-11-26T00:27:57.232+0000 [conn2] command admin.$cmd command: getLog { getLog: "startupWarnings" } keyUpdates:0 numYields:0 reslen:70 0ms
2020-11-26T00:27:57.233+0000 [conn2] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:28:00.237+0000 [conn2] command admin.$cmd command: serverStatus { serverStatus: 1.0 } keyUpdates:0 numYields:0 locks(micros) r:13 reslen:3402 0ms
2020-11-26T00:28:00.242+0000 [conn2] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:28:16.560+0000 [conn2] end connection 127.0.0.1:50582 (0 connections now open)
2020-11-26T00:32:28.904+0000 [clientcursormon] connections:0
2020-11-26T00:36:32.398+0000 [initandlisten] connection accepted from 127.0.0.1:50588 #3 (1 connection now open)
2020-11-26T00:36:32.398+0000 [conn3] command admin.$cmd command: whatsmyuri { whatsmyuri: 1 } ntoreturn:1 keyUpdates:0 numYields:0 reslen:62 0ms
2020-11-26T00:36:32.399+0000 [conn3] command admin.$cmd command: getLog { getLog: "startupWarnings" } keyUpdates:0 numYields:0 reslen:70 0ms
2020-11-26T00:36:32.400+0000 [conn3] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:36:34.603+0000 [conn3] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:36:37.326+0000 [conn3] query local.oplog.rs planSummary: COLLSCAN ntoreturn:0 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:0 locks(micros) r:66 nreturned:1 reslen:106 0ms
2020-11-26T00:36:37.328+0000 [conn3] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:37:28.832+0000 [initandlisten] connection accepted from 10.20.37.160:54484 #4 (2 connections now open)
2020-11-26T00:37:28.832+0000 [conn4] command admin.$cmd command: isMaster { isMaster: 1, compression: [], client: { driver: { name: "mongo-ruby-driver", version: "2.13.1" }, os: { type: "linux", name: "linux-gnu", architecture: "x86_64" }, platform: "mongoid-6.4.1, Ruby 2.6.5, x86_64-linux, x86_64-pc-linux-gnu" } } keyUpdates:0 numYields:0 reslen:367 0ms
2020-11-26T00:37:28.919+0000 [clientcursormon] connections:2
2020-11-26T00:37:33.568+0000 [initandlisten] connection accepted from 10.20.37.160:54492 #5 (3 connections now open)
2020-11-26T00:37:33.569+0000 [conn5] command admin.$cmd command: isMaster { isMaster: 1, compression: [], client: { driver: { name: "mongo-ruby-driver", version: "2.13.1" }, os: { type: "linux", name: "linux-gnu", architecture: "x86_64" }, platform: "mongoid-6.4.1, Ruby 2.6.5, x86_64-linux, x86_64-pc-linux-gnu" } } keyUpdates:0 numYields:0 reslen:367 0ms
2020-11-26T00:37:36.586+0000 [conn3] end connection 127.0.0.1:50588 (2 connections now open)
2020-11-26T00:39:35.621+0000 [initandlisten] connection accepted from 127.0.0.1:50592 #6 (3 connections now open)
2020-11-26T00:39:35.621+0000 [conn6] command admin.$cmd command: whatsmyuri { whatsmyuri: 1 } ntoreturn:1 keyUpdates:0 numYields:0 reslen:62 0ms
2020-11-26T00:39:35.622+0000 [conn6] command admin.$cmd command: getLog { getLog: "startupWarnings" } keyUpdates:0 numYields:0 reslen:70 0ms
2020-11-26T00:39:35.623+0000 [conn6] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:39:37.589+0000 [conn6] opening db: test
2020-11-26T00:39:37.589+0000 [conn6] query test.oplog.rs planSummary: EOF ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:0 keyUpdates:0 numYields:0 locks(micros) W:186 r:19 nreturned:0 reslen:20 0ms
2020-11-26T00:39:37.590+0000 [conn6] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:39:41.891+0000 [conn6] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:39:43.266+0000 [conn6] query local.oplog.rs planSummary: COLLSCAN ntoreturn:0 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:0 locks(micros) r:62 nreturned:1 reslen:106 0ms
2020-11-26T00:39:43.268+0000 [conn6] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:39:52.681+0000 [conn6] end connection 127.0.0.1:50592 (2 connections now open)
2020-11-26T00:42:28.934+0000 [clientcursormon] connections:2
You should not mix using a config file, i.e.
mongod --config /etc/mongod.conf
and command line options
mongod --replSet rs0 --bind_ip localhost
Most likely, you did not set the replica set name in /etc/mongod.conf:
replication:
replSetName: <string>
So, when you start your MongoDB with service mongodb start, you may get a different configuration.
Note: check the service file (on my Red Hat system at /etc/systemd/system/mongod.service), which may even point to a different .conf file.
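If it's unclear which options the running mongod actually loaded, you can ask the server itself from the mongo shell (a standard admin helper; the output includes both the argv it was started with and the parsed config):

```javascript
db.serverCmdLineOpts()
```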
I'm having trouble with the following findOneAndUpdate MongoDB query:
planSummary: IXSCAN { id: 1 } keysExamined:1 docsExamined:1 nMatched:1 nModified:1 keysInserted:1 keysDeleted:1 numYields:0 reslen:3044791
locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } } }
storage:{} protocol:op_query 135ms
writeConcern: { w: 0, j: false }
As you can see, it has an execution time of over 100 ms. The query part uses an index and takes less than 1 ms (per 'Explain query'), so it's the write part that is slow.
The Mongo instance is the master of a 3 member replica set. Write concern is set to 0 and journaling is disabled.
What could be the cause of the slow write? Could it be the update of indices?
MongoDB version 4.0
Driver: Node.js native mongodb version 3.2
Edit: I think it might be the length of the result. After querying a smaller document, the execution time is halved.
reslen:3044791
This was the source of the bad performance. Reducing it by adding a projection option to return only a specific field improved the execution time from ~90 ms on average to ~7 ms.
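A minimal sketch of that fix for the Node.js driver (3.x); the collection handle, filter, and field names here are invented for illustration:

```javascript
// Return only the `status` field after the update instead of the whole
// multi-megabyte document, keeping reslen small.
async function markProcessed(collection, docId) {
  return collection.findOneAndUpdate(
    { id: docId },                  // matched via the { id: 1 } index
    { $set: { status: 'processed' } },
    {
      projection: { status: 1 },    // only this field comes back
      returnOriginal: false,        // driver 3.x: return the updated doc
    }
  );
}
```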
Using a MongoDB default setup with a max chunk size of 64MB and a max document size of 16MB (https://docs.mongodb.org/manual/reference/limits/).
I had the issue of a chunk too big that needed to be split, but the split failed.
I don't understand why. Let's assume we have big documents: if a chunk is created with a document (16MB), then a new document arrives (32MB total), and so on until we reach 64MB, the split process occurs. Why can it fail? Why not split the chunk into two halves of two 16MB documents each?
Or, in the case where the documents are like:
10MB + 16MB + 16MB + 16MB + 16MB = 74MB (needs to split)
It can do:
10MB + 16MB + 16MB = 42MB
16MB + 16MB = 32MB
Why is this failing?
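The packing described above can be sketched in a few lines (a simplified model, not MongoDB's actual splitVector algorithm, which picks split points while scanning the shard key index):

```javascript
// Greedily pack document sizes (in MB) into sub-chunks under the chunk limit.
function splitChunk(docSizesMb, maxChunkMb) {
  const parts = [[]];
  let current = 0;
  for (const size of docSizesMb) {
    if (current + size > maxChunkMb && parts[parts.length - 1].length > 0) {
      parts.push([]);          // start a new sub-chunk
      current = 0;
    }
    parts[parts.length - 1].push(size);
    current += size;
  }
  return parts;
}

// The 74MB example splits fine by size alone: [[10,16,16,16],[16]]
console.log(splitChunk([10, 16, 16, 16, 16], 64));
```

The catch, and the likely answer here, is that MongoDB can only place a split point between distinct shard key values. If the oversized chunk covers only one key value (or too few documents per key to find a legal boundary), no valid split exists and the chunk is marked jumbo, which matches the "chunk not full enough to trigger auto-split" line in the log below.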
MongoDB log:
2016-04-06T12:10:53.424-0700 I SHARDING [Balancer] ns: reports.storePoints going to move { id: "reports.storePoints-storeId-9150629136201875191", ns: "reports.storePoints", min: { storeId: -9150629136201875191 }, max: { storeId: -9148794391694793543 }, version: Timestamp 666000|1, versionEpoch: ObjectId('54ac908475d02f0bdb362171'), lastmod: Timestamp 666000|1, lastmodEpoch: ObjectId('54ac908475d02f0bdb362171'), shard: "rs1" } from: rs1 to: rs2 tag []
2016-04-06T12:10:53.477-0700 I SHARDING [Balancer] moving chunk ns: reports.storePoints moving ( ns: reports.storePoints, shard: rs1:rs1/host1:27017,host2:27017, lastmod: 666|1||000000000000000000000000, min: { storeId: -9150629136201875191 }, max: { storeId: -9148794391694793543 }) rs1:rs1/host1:27017,host1:27017 -> rs2:rs2/host2:27017,host2:27017
2016-04-06T12:10:54.052-0700 I SHARDING [Balancer] moveChunk result: { chunkTooBig: true, estimatedChunkSize: 111379744, ok: 0.0, errmsg: "chunk too big to move", $gleStats: { lastOpTime: Timestamp 1455673163000|2, electionId: ObjectId('55b838ecf80a354067abdbbf') } }
2016-04-06T12:10:54.053-0700 I SHARDING [Balancer] balancer move failed: { chunkTooBig: true, estimatedChunkSize: 111379744, ok: 0.0, errmsg: "chunk too big to move", $gleStats: { lastOpTime: Timestamp 1455673163000|2, electionId: ObjectId('55b838ecf80a354067abdbbf') } } from: rs1 to: rs2 chunk: min: { storeId: -9150629136201875191 } max: { storeId: -9148794391694793543 }
2016-04-06T12:10:54.053-0700 I SHARDING [Balancer] performing a split because migrate failed for size reasons
2016-04-06T12:10:54.063-0700 I SHARDING [Balancer] split results: CannotSplit chunk not full enough to trigger auto-split
2016-04-06T12:10:54.063-0700 I SHARDING [Balancer] marking chunk as jumbo: ns: reports.storePoints, shard: rs1:rs1/host1:27017,host2:27017, lastmod: 666|1||000000000000000000000000, min: { storeId: -9150629136201875191 }, max: { storeId: -9148794391694793543 }
How do I enable sharding in a test environment? Here I am sharing what I have done so far. I have one config server:
Config server1: Host-a:27019
One mongos instance on the same machine on port 27017,
and two mongod shard instances:
Host-a:27020
host-b:27021
When I enable sharding on a collection, it gives me this error:
2016-01-12T10:31:07.522Z I SHARDING [Balancer] ns: feedproductsdata.merchantproducts going to move { _id: "feedproductsdata.merchantproducts-product_id_MinKey", ns: "feedproductsdata.merchantproducts", min: { product_id: MinKey }, max: { product_id: 0 }, version: Timestamp 1000|0, versionEpoch: ObjectId('5694d57ebe78315b68519c38'), lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('5694d57ebe78315b68519c38'), shard: "shard0001" } from: shard0001 to: shard0000 tag []
2016-01-12T10:31:07.523Z I SHARDING [Balancer] moving chunk ns: feedproductsdata.merchantproducts moving ( ns: feedproductsdata.merchantproducts, shard: shard0001:192.168.1.12:27021, lastmod: 1|0||000000000000000000000000, min: { product_id: MinKey }, max: { product_id: 0 }) shard0001:192.168.1.12:27021 -> shard0000:192.168.1.8:27020
2016-01-12T10:31:08.530Z I SHARDING [Balancer] moveChunk result: { errmsg: "exception: socket exception [CONNECT_ERROR] for cfg1.server.com:27019", code: 11002, ok: 0.0 }
2016-01-12T10:31:08.531Z I SHARDING [Balancer] balancer move failed: { errmsg: "exception: socket exception [CONNECT_ERROR] for cfg1.server.com:27019", code: 11002, ok: 0.0 } from: shard0001 to: shard0000 chunk: min: { product_id: MinKey } max: { product_id: 0 }
2016-01-12T10:31:08.604Z I SHARDING [Balancer] distributed lock 'balancer/Knowledgeops-PC:27017:1452594322:41' unlocked.
After carefully following this guide on how to migrate a 3-member replica set to a sharded cluster containing 2 x 3-member replica sets, nothing happened after I sharded one of my collections. Looking into the logs, I got this:
2014-06-22T16:49:55.467+0000 [Balancer] distributed lock 'balancer/vt-mongo-6:27018:1403429707:1804289383' unlocked.
2014-06-22T16:50:01.830+0000 [Balancer] distributed lock 'balancer/vt-mongo-6:27018:1403429707:1804289383' acquired, ts : 53a70939025c137381ef2126
2014-06-22T16:50:01.945+0000 [Balancer] ns: database.emailevents going to move { _id: "database.emailevents-_id_MinKey", lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('53a6d810089f342ad32992d4'), ns: "database.emailevents", min: { _id: MinKey }, max: { _id: -9199836772564449863 }, shard: "replicaset2" } from: replicaset2 to: replicaset4 tag []
2014-06-22T16:50:01.945+0000 [Balancer] moving chunk ns: database.emailevents moving ( ns: database.emailevents, shard: replicaset2:replicaset2/vt-mongo-4:27017,vt-mongo-5:27017,vt-mongo-6:27017, lastmod: 1|0||000000000000000000000000, min: { _id: MinKey }, max: { _id: -9199836772564449863 }) replicaset2:database_replicaset2/vt-mongo-4:27017,vt-mongo-5:27017,vt-mongo-6:27017 -> replicaset4:replicaset4/vt-mongo-1:27017,vt-mongo-2:27017,vt-mongo-3:27017
2014-06-22T16:50:01.954+0000 [Balancer] moveChunk result: { errmsg: "exception: no primary shard configured for db: config", code: 8041, ok: 0.0 }
2014-06-22T16:50:01.955+0000 [Balancer] balancer move failed: { errmsg: "exception: no primary shard configured for db: config", code: 8041, ok: 0.0 } from: replicaset2 to: replicaset4 chunk: min: { _id: MinKey } max: { _id: -9199836772564449863 }
2014-06-22T16:50:02.166+0000 [Balancer] distributed lock 'balancer/vt-mongo-6:27018:1403429707:1804289383' unlocked.
Generally, this error:
result: { errmsg: "exception: no primary shard configured for db: config", code: 8041, ok: 0.0 }
is the result of the machines not being able to talk to one another, but that isn't the case here. I have manually checked, and all the machines can reach each other.