Sudden MongoDB high connections/queues, DB completely freezes

The issue
We have a strange issue in our MongoDB setup. Sometimes we get peaks of high connections and high queues, and the mongod process stops responding if we let the queues and connections keep growing. We have to kill the process with SIGKILL (via htop) and restart it.
It looks like a system limit or MongoDB configuration setting is blocking MongoDB from operating, because hardware resources are fine. Versions of this issue have happened first on a standalone instance and then on a replica set on our production servers. Details below.
About the software environment
This is a standalone MongoDB instance (neither sharded nor a replica set) running on a dedicated machine, queried by other machines. I'm using mongodb-linux-x86_64-2.6.11 under Debian 7.7.
The machines querying Mongo use Django==1.7.4 and mongoengine==0.10.1 with pymongo==2.8.
In the Django settings.py file I connect to the database with the following lines:
from mongoengine import connect

connect(
    MONGO_DB,
    username=MONGO_USER,
    password=MONGO_PWD,
    host=MONGO_HOST,
    port=MONGO_PORT,
)
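For reference, mongoengine's connect() forwards extra keyword arguments to the underlying pymongo MongoClient, so the driver's pool size can be capped from the same place. A hedged sketch (max_pool_size is the pymongo 2.x spelling of the option; later drivers call it maxPoolSize):

from mongoengine import connect

connect(
    MONGO_DB,
    username=MONGO_USER,
    password=MONGO_PWD,
    host=MONGO_HOST,
    port=MONGO_PORT,
    max_pool_size=30,  # assumed pymongo 2.x option; caps connections per host
)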
MMS Stats
As you can see in the following image from the MMS service, we have peaks in connections and queues:
When this happens, our mongod process completely freezes. We must use SIGKILL to restart MongoDB, which is really bad.
The image shows 3 freeze events.
As the image also shows, when this happens we get a peak in Non-Mapped Virtual Memory too.
We also spotted an increase in the Btree chart around the 2nd and 3rd freezes.
We have checked the logs, but there is no suspicious query, and the opcounters don't skyrocket; it seems there are no more queries than usual.
Here is another screenshot of the same bug on another day/time:
In all cases, the lock on the DB does not increase significantly; it has a peak but does not even reach 4%:
The opcounter drops to zero. It seems that every op goes into the MongoDB queue, so clients open new connections to try to execute new requests, all of which end up in the queue as well.
Machine Resources
Regarding hardware, the machine is a Google Cloud Compute Engine instance with 4 Intel Xeon cores, 16 GB RAM, and a 100 GB SSD disk.
No noticeably high network/IO/CPU/RAM usage is detected; there are no peaks in resources even while the mongod process is frozen.
MySQL on another machine also gets affected
We also noticed that at the same time as this mongod peak in queues and connections, we get a spike in MySQL connections on another machine. When I kill the mongod process, all the MySQL connections are released too (without restarting MySQL).
ulimit
I increased the system limits to see if they were the cause of the issue, but it seems this did not fix the problem.
I set up everything as recommended in this MongoDB article. The spikes in connections continue. I'm trying to find a way to debug where these connections are coming from (see the polling sketch after the ulimit output below).
$ ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 60240
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 409600
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 60240
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
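To watch where the connections come from, one option is a small polling script. A minimal sketch, assuming pymongo 2.x (the driver version used above) and its Database.current_op() helper; host and port are placeholders:

import time
from pymongo import MongoClient

client = MongoClient("MONGO_HOST", 27000)  # placeholder address

while True:
    conns = client.admin.command("serverStatus")["connections"]
    print("current: %(current)d available: %(available)d" % conns)
    # include_all=True also lists idle connections; each entry's
    # "client" field holds the remote ip:port that opened it.
    for op in client.admin.current_op(True).get("inprog", []):
        if "client" in op:
            print("  conn from", op["client"], "active:", op.get("active"))
    time.sleep(1)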
db.currentOp
I just added a shell script that runs every second with the following:

var ops = db.currentOp().inprog;
if (ops !== undefined && ops.length > 0) {
    ops.forEach(function(op) {
        if (op.secs_running > 0) printjson(op);
    });
}

The log does not report any operation taking more than 1 second to execute. I suspected a process stuck on some long operation, but it seems that is not the case.
MongoDB Logs
Regarding the mongodb.log, here is the full MongoDB log around the problem.
The problem starts at log line 361. There the connections start to go up and no more queries get executed. I also can't open the mongo shell; it says:
[Wed Feb 10 15:46:01 UTC 2016] 2016-02-10T15:48:31.940+0000 DBClientCursor::init call() failed
2016-02-10T15:48:31.941+0000 Error: DBClientBase::findN: transport error: 127.0.0.1:27000 ns: admin.$cmd query: { whatsmyuri: 1 } at src/mongo/shell/mongo.js:148
Log extract
2016-02-10T15:41:39.930+0000 [initandlisten] connection accepted from 10.240.0.3:56611 #3665 (79 connections now open)
2016-02-10T15:41:39.930+0000 [conn3665] command admin.$cmd command: getnonce { getnonce: 1 } keyUpdates:0 numYields:0 reslen:65 0ms
2016-02-10T15:41:39.930+0000 [conn3665] command admin.$cmd command: ping { ping: 1 } keyUpdates:0 numYields:0 reslen:37 0ms
2016-02-10T15:41:39.992+0000 [conn3529] command db.$cmd command: count { count: "notification", fields: null, query: { read: false, recipient: 310 } } planSummary: IXSCAN { recipient: 1 } keyUpdates:0 numYields:0 locks(micros) r:215 reslen:48 0ms
2016-02-10T15:41:40.038+0000 [conn2303] query db.column query: { _id: ObjectId('56b395dfbe66324cbee550b8'), client_id: 20 } planSummary: IXSCAN { _id: 1 } ntoreturn:2 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:0 locks(micros) r:116 nreturned:1 reslen:470 0ms
2016-02-10T15:41:40.044+0000 [conn1871] update db.column query: { _id: ObjectId('56b395dfbe66324cbee550b8') } update: { $set: { last_request: new Date(1455118900040) } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 fastmod:1 keyUpdates:0 numYields:0 locks(micros) w:126 0ms
2016-02-10T15:41:40.044+0000 [conn1871] command db.$cmd command: update { update: "column", writeConcern: { w: 1 }, updates: [ { q: { _id: ObjectId('56b395dfbe66324cbee550b8') }, u: { $set: { last_request: new Date(1455118900040) } }, multi: false, upsert: true } ] } keyUpdates:0 numYields:0 reslen:55 0ms
2016-02-10T15:41:40.048+0000 [conn1875] query db.user query: { sn: "mobile", client_id: 20, uid: "56990023700" } planSummary: IXSCAN { client_id: 1, uid: 1, sn: 1 } ntoreturn:2 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:0 locks(micros) r:197 nreturned:1 reslen:303 0ms
2016-02-10T15:41:40.056+0000 [conn2303] Winning plan had zero results. Not caching. ns: db.case query: { sn: "mobile", client_id: 20, created: { $gt: new Date(1454295600000), $lt: new Date(1456800900000) }, deleted: false, establishment_users: { $all: [ ObjectId('5637640afefa2654b5d863e3') ] }, is_closed: true, updated_time: { $gt: new Date(1455045840000) } } sort: { updated_time: 1 } projection: {} skip: 0 limit: 15 winner score: 1.0003 winner summary: IXSCAN { client_id: 1, is_closed: 1, deleted: 1, updated_time: 1 }
2016-02-10T15:41:40.057+0000 [conn2303] query db.case query: { $query: { sn: "mobile", client_id: 20, created: { $gt: new Date(1454295600000), $lt: new Date(1456800900000) }, deleted: false, establishment_users: { $all: [ ObjectId('5637640afefa2654b5d863e3') ] }, is_closed: true, updated_time: { $gt: new Date(1455045840000) } }, $orderby: { updated_time: 1 } } planSummary: IXSCAN { client_id: 1, is_closed: 1, deleted: 1, updated_time: 1 } ntoreturn:15 ntoskip:0 nscanned:26 nscannedObjects:26 keyUpdates:0 numYields:0 locks(micros) r:5092 nreturned:0 reslen:20 5ms
2016-02-10T15:41:40.060+0000 [conn300] command db.$cmd command: count { count: "notification", fields: null, query: { read: false, recipient: 309 } } planSummary: IXSCAN { recipient: 1 } keyUpdates:0 numYields:0 locks(micros) r:63 reslen:48 0ms
2016-02-10T15:41:40.133+0000 [initandlisten] connection accepted from 127.0.0.1:43266 #3666 (80 connections now open)
2016-02-10T15:41:40.133+0000 [conn3666] command admin.$cmd command: whatsmyuri { whatsmyuri: 1 } ntoreturn:1 keyUpdates:0 numYields:0 reslen:62 0ms
2016-02-10T15:41:40.134+0000 [conn3666] command db.$cmd command: getnonce { getnonce: 1 } ntoreturn:1 keyUpdates:0 numYields:0 reslen:65 0ms
2016-02-10T15:41:40.134+0000 [conn3666] authenticate db: db { authenticate: 1, nonce: "xxx", user: "xxx", key: "xxx" }
2016-02-10T15:41:40.134+0000 [conn3666] command db.$cmd command: authenticate { authenticate: 1, nonce: "xxx", user: "xxx", key: "xxx" } ntoreturn:1 keyUpdates:0 numYields:0 reslen:82 0ms
2016-02-10T15:41:40.136+0000 [conn3666] end connection 127.0.0.1:43266 (79 connections now open)
2016-02-10T15:41:40.146+0000 [conn3051] command db.$cmd command: count { count: "notification", fields: null, query: { read: false, recipient: 301 } } planSummary: IXSCAN { recipient: 1 } keyUpdates:0 numYields:0 locks(micros) r:284 reslen:48 0ms
2016-02-10T15:41:40.526+0000 [conn3529] query db.column query: { _id: ObjectId('56a8d864be6632718f9fb087'), client_id: 1 } planSummary: IXSCAN { _id: 1 } ntoreturn:2 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:0 locks(micros) r:176 nreturned:1 reslen:440 0ms
2016-02-10T15:41:40.529+0000 [conn3529] update db.column query: { _id: ObjectId('56a8d864be6632718f9fb087') } update: { $set: { last_request: new Date(1455118900527) } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 fastmod:1 keyUpdates:0 numYields:0 locks(micros) w:61 0ms
2016-02-10T15:41:40.529+0000 [conn3529] command db.$cmd command: update { update: "column", writeConcern: { w: 1 }, updates: [ { q: { _id: ObjectId('56a8d864be6632718f9fb087') }, u: { $set: { last_request: new Date(1455118900527) } }, multi: false, upsert: true } ] } keyUpdates:0 numYields:0 reslen:55 0ms
2016-02-10T15:41:40.531+0000 [conn3529] query db.user query: { sn: "email", client_id: 1, uid: "asdasdasdasdas" } planSummary: IXSCAN { client_id: 1, uid: 1, sn: 1 } ntoreturn:2 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:0 locks(micros) r:278 nreturned:1 reslen:285 0ms
2016-02-10T15:41:40.546+0000 [conn3529] Winning plan had zero results. Not caching. ns: db.case query: { answered: true, sn: "email", client_id: 1, establishment_users: { $all: [ ObjectId('5669b930fefa2626db389c0e') ] }, deleted: false, is_closed: { $ne: true } } sort: { updated_time: -1 } projection: {} skip: 0 limit: 1 winner score: 1.0003 winner summary: IXSCAN { client_id: 1, establishment_users: 1, updated_time: 1 }
2016-02-10T15:41:40.547+0000 [conn3529] query db.case query: { $query: { answered: true, sn: "email", client_id: 1, establishment_users: { $all: [ ObjectId('5669b930fefa2626db389c0e') ] }, deleted: false, is_closed: { $ne: true } }, $orderby: { updated_time: -1 } } planSummary: IXSCAN { client_id: 1, establishment_users: 1, updated_time: 1 } ntoskip:0 nscanned:103 nscannedObjects:103 keyUpdates:0 numYields:0 locks(micros) r:9410 nreturned:0 reslen:20 9ms
2016-02-10T15:41:40.557+0000 [conn3529] Winning plan had zero results. Not caching. ns: db.case query: { answered: true, sn: "email", client_id: 1, establishment_users: { $all: [ ObjectId('5669b930fefa2626db389c0e') ] }, deleted: false, is_closed: { $ne: true } } sort: { updated_time: -1 } projection: {} skip: 0 limit: 15 winner score: 1.0003 winner summary: IXSCAN { client_id: 1, establishment_users: 1, updated_time: 1 }
2016-02-10T15:41:40.558+0000 [conn3529] query db.case query: { $query: { answered: true, sn: "email", client_id: 1, establishment_users: { $all: [ ObjectId('5669b930fefa2626db389c0e') ] }, deleted: false, is_closed: { $ne: true } }, $orderby: { updated_time: -1 } } planSummary: IXSCAN { client_id: 1, establishment_users: 1, updated_time: 1 } ntoreturn:15 ntoskip:0 nscanned:103 nscannedObjects:103 keyUpdates:0 numYields:0 locks(micros) r:7572 nreturned:0 reslen:20 7ms
2016-02-10T15:41:40.569+0000 [conn3028] command db.$cmd command: count { count: "notification", fields: null, query: { read: false, recipient: 145 } } planSummary: IXSCAN { recipient: 1 } keyUpdates:0 numYields:0 locks(micros) r:237 reslen:48 0ms
2016-02-10T15:41:40.774+0000 [conn3053] command db.$cmd command: count { count: "notification", fields: null, query: { read: false, recipient: 143 } } planSummary: IXSCAN { recipient: 1 } keyUpdates:0 numYields:0 locks(micros) r:372 reslen:48 0ms
2016-02-10T15:41:41.056+0000 [conn22] command admin.$cmd command: ping { ping: 1 } keyUpdates:0 numYields:0 reslen:37 0ms
#########################
HERE THE PROBLEM STARTS
#########################
2016-02-10T15:41:41.175+0000 [initandlisten] connection accepted from 127.0.0.1:43268 #3667 (80 connections now open)
2016-02-10T15:41:41.212+0000 [initandlisten] connection accepted from 10.240.0.6:46021 #3668 (81 connections now open)
2016-02-10T15:41:41.213+0000 [conn3668] command db.$cmd command: getnonce { getnonce: 1 } keyUpdates:0 numYields:0 reslen:65 0ms
2016-02-10T15:41:41.213+0000 [conn3668] authenticate db: db { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" }
2016-02-10T15:41:41.213+0000 [conn3668] command db.$cmd command: authenticate { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" } keyUpdates:0 numYields:0 reslen:82 0ms
2016-02-10T15:41:41.348+0000 [initandlisten] connection accepted from 10.240.0.6:46024 #3669 (82 connections now open)
2016-02-10T15:41:41.349+0000 [conn3669] command db.$cmd command: getnonce { getnonce: 1 } keyUpdates:0 numYields:0 reslen:65 0ms
2016-02-10T15:41:41.349+0000 [conn3669] authenticate db: db { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" }
2016-02-10T15:41:41.349+0000 [conn3669] command db.$cmd command: authenticate { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" } keyUpdates:0 numYields:0 reslen:82 0ms
2016-02-10T15:41:43.620+0000 [initandlisten] connection accepted from 10.240.0.6:46055 #3670 (83 connections now open)
2016-02-10T15:41:43.621+0000 [conn3670] command db.$cmd command: getnonce { getnonce: 1 } keyUpdates:0 numYields:0 reslen:65 0ms
2016-02-10T15:41:43.621+0000 [conn3670] authenticate db: db { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" }
2016-02-10T15:41:43.621+0000 [conn3670] command db.$cmd command: authenticate { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" } keyUpdates:0 numYields:0 reslen:82 0ms
2016-02-10T15:41:43.655+0000 [initandlisten] connection accepted from 10.240.0.6:46058 #3671 (84 connections now open)
2016-02-10T15:41:43.656+0000 [conn3671] command db.$cmd command: getnonce { getnonce: 1 } keyUpdates:0 numYields:0 reslen:65 0ms
2016-02-10T15:41:43.656+0000 [conn3671] authenticate db: db { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" }
2016-02-10T15:41:43.656+0000 [conn3671] command db.$cmd command: authenticate { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" } keyUpdates:0 numYields:0 reslen:82 0ms
2016-02-10T15:41:44.045+0000 [initandlisten] connection accepted from 10.240.0.6:46071 #3672 (85 connections now open)
2016-02-10T15:41:44.045+0000 [conn3672] command db.$cmd command: getnonce { getnonce: 1 } keyUpdates:0 numYields:0 reslen:65 0ms
2016-02-10T15:41:44.046+0000 [conn3672] authenticate db: db { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" }
2016-02-10T15:41:44.046+0000 [conn3672] command db.$cmd command: authenticate { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" } keyUpdates:0 numYields:0 reslen:82 0ms
2016-02-10T15:41:44.083+0000 [initandlisten] connection accepted from 10.240.0.6:46073 #3673 (86 connections now open)
2016-02-10T15:41:44.084+0000 [conn3673] command db.$cmd command: getnonce { getnonce: 1 } keyUpdates:0 numYields:0 reslen:65 0ms
2016-02-10T15:41:44.084+0000 [conn3673] authenticate db: db { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" }
2016-02-10T15:41:44.084+0000 [conn3673] command db.$cmd command: authenticate { authenticate: 1, user: "xxx", nonce: "xxx", key: "xxx" } keyUpdates:0 numYields:0 reslen:82 0ms
2016-02-10T15:41:44.182+0000 [initandlisten] connection accepted from 10.240.0.6:46076 #3674 (87 connections now open)
2016-02-10T15:41:44.182+0000 [conn3674] command db.$cmd command: getnonce { getnonce: 1 } keyUpdates:0 numYields:0 reslen:65 0ms
Collection Information
Currently our database contains 163 collections. The important ones are message, column and case; these are the ones that get heavy inserts, updates and queries. The rest are for analytics, many collections of about 100 records each:
{
"ns" : "db.message",
"count" : 2.96615e+06,
"size" : 3906258304.0000000000000000,
"avgObjSize" : 1316,
"storageSize" : 9305935856.0000000000000000,
"numExtents" : 25,
"nindexes" : 21,
"lastExtentSize" : 2.14643e+09,
"paddingFactor" : 1.0530000000000086,
"systemFlags" : 0,
"userFlags" : 1,
"totalIndexSize" : 7952525392.0000000000000000,
"indexSizes" : {
"_id_" : 1.63953e+08,
"client_id_1_sn_1_mid_1" : 3.16975e+08,
"client_id_1_created_1" : 1.89086e+08,
"client_id_1_recipients_1_created_1" : 4.3861e+08,
"client_id_1_author_1_created_1" : 2.29713e+08,
"client_id_1_kind_1_created_1" : 2.37088e+08,
"client_id_1_answered_1_created_1" : 1.90934e+08,
"client_id_1_is_mention_1_created_1" : 1.8674e+08,
"client_id_1_has_custom_data_1_created_1" : 1.9566e+08,
"client_id_1_assigned_1_created_1" : 1.86838e+08,
"client_id_1_published_1_created_1" : 1.94352e+08,
"client_id_1_sn_1_created_1" : 2.3681e+08,
"client_id_1_thread_root_1" : 1.88089e+08,
"client_id_1_case_id_1" : 1.89266e+08,
"client_id_1_sender_id_1" : 1.5182e+08,
"client_id_1_recipient_id_1" : 1.49711e+08,
"client_id_1_mid_1_sn_1" : 3.17662e+08,
"text_text_created_1" : 3320641520.0000000000000000,
"client_id_1_sn_1_kind_1_recipient_id_1_created_1" : 3.15226e+08,
"client_id_1_sn_1_thread_root_1_created_1" : 3.06526e+08,
"client_id_1_case_id_1_created_1" : 2.46825e+08
},
"ok" : 1.0000000000000000
}
{
"ns" : "db.case",
"count" : 497661,
"size" : 5.33111e+08,
"avgObjSize" : 1071,
"storageSize" : 6.29637e+08,
"numExtents" : 16,
"nindexes" : 34,
"lastExtentSize" : 1.68743e+08,
"paddingFactor" : 1.0000000000000000,
"systemFlags" : 0,
"userFlags" : 1,
"totalIndexSize" : 8.46012e+08,
"indexSizes" : {
"_id_" : 2.30073e+07,
"client_id_1" : 1.99985e+07,
"is_closed, deleted_1" : 1.31061e+07,
"is_closed_1" : 1.36948e+07,
"sn_1" : 2.1274e+07,
"deleted_1" : 1.39728e+07,
"created_1" : 1.97777e+07,
"current_assignment_1" : 4.20819e+07,
"assigned_1" : 1.33678e+07,
"commented_1" : 1.36049e+07,
"has_custom_data_1" : 1.42426e+07,
"sentiment_start_1" : 1.36049e+07,
"sentiment_finish_1" : 1.37275e+07,
"updated_time_1" : 2.02192e+07,
"identifier_1" : 1.73822e+07,
"important_1" : 1.38256e+07,
"answered_1" : 1.41772e+07,
"client_id_1_is_closed_1_deleted_1_updated_time_1" : 2.90248e+07,
"client_id_1_is_closed_1_updated_time_1" : 2.86569e+07,
"client_id_1_sn_1_updated_time_1" : 3.58436e+07,
"client_id_1_deleted_1_updated_time_1" : 2.8477e+07,
"client_id_1_updated_time_1" : 2.79619e+07,
"client_id_1_current_assignment_1_updated_time_1" : 5.6071e+07,
"client_id_1_assigned_1_updated_time_1" : 2.87713e+07,
"client_id_1_commented_1_updated_time_1" : 2.86896e+07,
"client_id_1_has_custom_data_1_updated_time_1" : 2.88286e+07,
"client_id_1_sentiment_start_1_updated_time_1" : 2.87223e+07,
"client_id_1_sentiment_finish_1_updated_time_1" : 2.88776e+07,
"client_id_1_identifier_1_updated_time_1" : 3.48216e+07,
"client_id_1_important_1_updated_time_1" : 2.88776e+07,
"client_id_1_answered_1_updated_time_1" : 2.85669e+07,
"client_id_1_establishment_users_1_updated_time_1" : 3.93838e+07,
"client_id_1_identifier_1" : 1.86413e+07,
"client_id_1_sn_1_users_1_updated_time_1" : 4.47309e+07
},
"ok" : 1.0000000000000000
}
{
"ns" : "db.column",
"count" : 438,
"size" : 218672,
"avgObjSize" : 499,
"storageSize" : 696320,
"numExtents" : 4,
"nindexes" : 2,
"lastExtentSize" : 524288,
"paddingFactor" : 1.0000000000000000,
"systemFlags" : 0,
"userFlags" : 1,
"totalIndexSize" : 65408,
"indexSizes" : {
"_id_" : 32704,
"client_id_1_owner_1" : 32704
},
"ok" : 1.0000000000000000
}
Mongostat
Here are some lines from running mongostat during normal operation:
insert query update delete getmore command flushes mapped vsize res faults locked db idx miss % qr|qw ar|aw netIn netOut conn time
*0 34 2 *0 0 10|0 0 32.6g 65.5g 1.18g 0 db:0.1% 0 0|0 0|0 4k 39k 87 20:44:44
2 31 13 *0 0 7|0 0 32.6g 65.5g 1.17g 3 db:0.8% 0 0|0 0|0 9k 36k 87 20:44:45
1 18 2 *0 0 5|0 0 32.6g 65.5g 1.12g 0 db:0.4% 0 0|0 0|0 3k 18k 87 20:44:46
5 200 57 *0 0 43|0 0 32.6g 65.5g 1.13g 12 db:2.3% 0 0|0 0|0 46k 225k 86 20:44:47
1 78 23 *0 0 5|0 0 32.6g 65.5g 1.01g 1 db:1.6% 0 0|0 0|0 18k 313k 86 20:44:48
*0 10 1 *0 0 5|0 0 32.6g 65.5g 1004m 0 db:0.2% 0 0|0 1|0 1k 8k 86 20:44:49
3 48 23 *0 0 11|0 0 32.6g 65.5g 1.05g 4 db:1.1% 0 0|0 0|0 16k 48k 86 20:44:50
2 38 13 *0 0 8|0 0 32.6g 65.5g 1.01g 8 db:0.9% 0 0|0 0|0 10k 76k 86 20:44:51
3 28 16 *0 0 9|0 0 32.6g 65.5g 1.01g 7 db:1.1% 0 0|0 1|0 11k 62k 86 20:44:52
*0 9 4 *0 0 8|0 0 32.6g 65.5g 1022m 1 db:0.4% 0 0|0 0|0 3k 6k 87 20:44:53
insert query update delete getmore command flushes mapped vsize res faults locked db idx miss % qr|qw ar|aw netIn netOut conn time
3 107 34 *0 0 6|0 0 32.6g 65.5g 1.02g 1 db:1.1% 0 0|0 0|0 23k 107k 87 20:44:54
4 65 37 *0 0 8|0 0 32.6g 65.5g 2.69g 57 db:6.2% 0 0|0 0|0 24k 126k 87 20:44:55
9 84 45 *0 0 8|0 0 32.6g 65.5g 2.63g 17 db:5.3% 0 0|0 1|0 32k 109k 87 20:44:56
4 84 47 *0 0 44|0 0 32.6g 65.5g 1.89g 10 db:5.9% 0 0|0 1|0 30k 146k 86 20:44:57
3 73 32 *0 0 9|0 0 32.6g 65.5g 2.58g 12 db:4.7% 0 0|0 0|0 20k 112k 86 20:44:58
2 165 48 *0 0 7|0 0 32.6g 65.5g 2.62g 7 db:1.3% 0 0|0 0|0 34k 147k 86 20:44:59
3 61 26 *0 0 12|0 0 32.6g 65.5g 2.2g 6 db:4.7% 0 0|0 1|0 19k 73k 86 20:45:00
3 252 64 *0 0 12|0 0 32.6g 65.5g 1.87g 85 db:3.2% 0 0|0 0|0 52k 328k 86 20:45:01
*0 189 40 *0 0 6|0 0 32.6g 65.5g 1.65g 0 db:1.6% 0 0|0 0|0 33k 145k 87 20:45:02
1 18 10 *0 0 5|0 0 32.6g 65.5g 1.55g 3 db:0.9% 0 0|0 0|0 6k 15k 87 20:45:03
insert query update delete getmore command flushes mapped vsize res faults locked db idx miss % qr|qw ar|aw netIn netOut conn time
1 50 11 *0 0 6|0 0 32.6g 65.5g 1.57g 6 db:0.8% 0 0|0 0|0 9k 63k 87 20:45:04
2 49 16 *0 0 6|0 0 32.6g 65.5g 1.56g 1 db:1.1% 0 0|0 0|0 12k 50k 87 20:45:05
1 35 11 *0 0 7|0 0 32.6g 65.5g 1.58g 1 db:0.9% 0 0|0 0|0 8k 41k 87 20:45:06
*0 18 2 *0 0 42|0 0 32.6g 65.5g 1.55g 0 db:0.4% 0 0|0 0|0 5k 19k 86 20:45:07
6 75 40 *0 0 11|0 0 32.6g 65.5g 1.56g 10 db:1.9% 0 0|0 0|0 27k 89k 86 20:45:08
6 60 35 *0 0 7|0 0 32.6g 65.5g 1.89g 5 db:1.5% 0 0|0 1|0 23k 101k 86 20:45:09
2 17 14 *0 0 7|0 0 32.6g 65.5g 1.9g 0 db:1.3% 0 0|0 1|0 8k 29k 86 20:45:10
2 35 7 *0 0 4|0 0 32.6g 65.5g 1.77g 1 db:1.3% 0 0|0 0|0 7k 60k 86 20:45:12
4 50 28 *0 0 10|0 0 32.6g 65.5g 1.75g 10 db:2.0% 0 0|0 0|0 19k 79k 87 20:45:13
*0 3 1 *0 0 5|0 0 32.6g 65.5g 1.63g 0 .:0.7% 0 0|0 0|0 1k 4k 87 20:45:14
insert query update delete getmore command flushes mapped vsize res faults locked db idx miss % qr|qw ar|aw netIn netOut conn time
5 77 35 *0 0 8|0 0 32.6g 65.5g 1.7g 13 db:3.0% 0 0|0 0|0 23k 124k 88 20:45:15
3 35 18 *0 0 7|0 0 32.6g 65.5g 1.7g 5 db:0.8% 0 0|0 0|0 12k 43k 87 20:45:16
1 18 5 *0 0 11|0 0 32.6g 65.5g 1.63g 2 db:0.9% 0 0|0 0|0 5k 35k 87 20:45:17
3 33 21 *0 0 5|0 0 32.6g 65.5g 1.64g 3 db:0.8% 0 0|0 0|0 13k 32k 87 20:45:18
*0 25 4 *0 0 42|0 0 32.6g 65.5g 1.64g 0 db:0.3% 0 0|0 0|0 5k 34k 86 20:45:19
1 25 5 *0 0 5|0 0 32.6g 65.5g 1.65g 3 db:0.2% 0 0|0 0|0 5k 24k 86 20:45:20
12 88 65 *0 0 7|0 0 32.6g 65.5g 1.7g 25 db:4.2% 0 0|0 0|0 42k 121k 86 20:45:21
2 53 17 *0 0 4|0 0 32.6g 65.5g 1.65g 2 db:1.5% 0 0|0 0|0 12k 82k 86 20:45:22
1 9 6 *0 0 7|0 0 32.6g 65.5g 1.64g 1 db:1.0% 0 0|0 0|0 4k 13k 86 20:45:23
*0 6 2 *0 0 7|0 0 32.6g 65.5g 1.63g 0 db:0.1% 0 0|0 0|0 1k 5k 87 20:45:24
Replica Set: Updated on May 15th 2016
We migrated our standalone instance to a replica set: 2 secondaries serving the reads and 1 primary doing the writes. All the machines in the replica set are snapshots of the original machine. With this new configuration the issue changed and it's harder to detect.
It happens less frequently, but instead of skyrocketing connections and queues, the whole replica set stops reading/writing, with no high connections, no queues, and no expensive operations at all. All requests to the DB just time out. To fix the issue, a SIGKILL must be sent to the mongod process on all 3 machines.

Hi, this is exactly the problem we faced too, and it's very difficult to tell the exact root cause; it took us a lot of back and forth with official MongoDB support to understand some common problems.
Most Mongo setups run on Unix, which limits the number of connections from the same user even though the server stats show plenty of connections available. You should raise this setting to the maximum possible value: https://www.mongodb.com/docs/manual/reference/ulimit/
Most of the time we connect using the default insert or insertMany, which has a write concern of 1. In a sharded cluster, saving to the primary node is fast, but it takes time to replicate to the other nodes, and many connections are left open in the meantime. If your cluster has two members in one region and one secondary in a DR region, network latency can come into play. It is better to go with a majority write concern to avoid issues:
https://www.mongodb.com/docs/manual/reference/write-concern/
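A minimal sketch of a majority write concern with a modern PyMongo (3.x or later; the 2.x driver used in the question above spells these options differently), with placeholder hosts and names:

from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
# with_options returns a copy of the collection bound to this write
# concern; inserts then block until a majority has acknowledged.
coll = client.mydb.events.with_options(
    write_concern=WriteConcern(w="majority", wtimeout=5000)
)
coll.insert_one({"msg": "acknowledged by a majority before returning"})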
There is a max connection pool property which, if not set, defaults to a pool of 100 connections, so your application will try to create up to 100 connections if it needs to write quickly. Limit the connection pool based on your application's needs. We have a very high volume, around 100,000 writes per minute, and even with multiple services a maximum of 20-30 connections is sufficient to store that much.
https://www.mongodb.com/docs/manual/reference/connection-string/
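For example, a minimal sketch of capping the pool through the connection string (PyMongo shown; the option names come from the URI reference above, and host/credentials are placeholders):

from pymongo import MongoClient

client = MongoClient(
    "mongodb://app:secret@host1:27017/mydb"
    "?maxPoolSize=30&waitQueueTimeoutMS=2000"
)
# maxPoolSize caps connections per server; waitQueueTimeoutMS makes a
# request fail fast instead of piling up when the pool is exhausted.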
We are still trying to make our sharded Mongo infrastructure stable, but the cause is not MongoDB itself; it's the overall infrastructure.

Related

Converting a standalone MongoDB instance to a single-node replica set

I am trying to convert my standalone MongoDB instance to a single-node replica set, for the purpose of live migrating to Atlas.
I followed this procedure: https://docs.mongodb.com/manual/tutorial/convert-standalone-to-replica-set/
The steps I took were:
$ sudo service mongodb stop
$ sudo service mongod start
$ mongo
> rs.initiate()
{
"info2" : "no configuration explicitly specified -- making one",
"me" : "staging3.domain.io:27017",
"info" : "Config now saved locally. Should come online in about a minute.",
"ok" : 1
}
singleNodeRepl:PRIMARY> rs.status()
{
"set" : "singleNodeRepl",
"date" : ISODate("2020-11-26T00:46:25Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "staging4.domain.io:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 1197,
"optime" : Timestamp(1606350415, 1),
"optimeDate" : ISODate("2020-11-26T00:26:55Z"),
"electionTime" : Timestamp(1606350415, 2),
"electionDate" : ISODate("2020-11-26T00:26:55Z"),
"self" : true
}
],
"ok" : 1
}
singleNodeRepl:PRIMARY> db.oplog.rs.find()
{ "ts" : Timestamp(1606350415, 1), "h" : NumberLong(0), "v" : 2, "op" : "n", "ns" : "", "o" : { "msg" : "initiating set" } }
At this point, it seems to have no issues.
However, my application is not able to work as it did before.
I would really appreciate any help troubleshooting the issue.
Thank you.
EDIT:
As suggested, I included replSet in the config file instead of passing it as an argument.
This is my config file:
# mongod.conf
#where to log
logpath=/var/log/mongodb/mongod.log
logappend=true
# fork and run in background
fork=true
#port=27017
dbpath=/var/lib/mongo
# location of pidfile
pidfilepath=/var/run/mongodb/mongod.pid
# Listen to local interface only. Comment out to listen on all interfaces.
#bind_ip=127.0.0.1
# Disables write-ahead journaling
# nojournal=true
# Enables periodic logging of CPU utilization and I/O wait
#cpu=true
# Turn on/off security. Off is currently the default
#noauth=true
#auth=true
# Verbose logging output.
verbose=true
# Inspect all client data for validity on receipt (useful for
# developing drivers)
#objcheck=true
# Enable db quota management
#quota=true
# Set oplogging level where n is
# 0=off (default)
# 1=W
# 2=R
# 3=both
# 7=W+some reads
#diaglog=0
# Ignore query hints
#nohints=true
# Enable the HTTP interface (Defaults to port 28017).
#httpinterface=true
# Turns off server-side scripting. This will result in greatly limited
# functionality
#noscripting=true
# Turns off table scans. Any query that would do a table scan fails.
#notablescan=true
# Disable data file preallocation.
#noprealloc=true
# Specify .ns file size for new databases.
# nssize=<size>
# Replication Options
# in replicated mongo databases, specify the replica set name here
replSet=singleNodeRepl
# maximum size in megabytes for replication operation log
#oplogSize=1024
# path to a key file storing authentication info for connections
# between replica set members
#keyFile=/path/to/keyfile
And the verbose log file:
It does look like everything is working fine. However, my application is not able to connect to the DB as it did before.
2020-11-26T00:26:55.852+0000 [conn1] replSet replSetInitiate admin command received from client
2020-11-26T00:26:55.853+0000 [conn1] replSet info initiate : no configuration specified. Using a default configuration for the set
2020-11-26T00:26:55.853+0000 [conn1] replSet created this configuration for initiation : { _id: "singleNodeRepl", members: [ { _id: 0, host: "staging4.domain.io:27017" } ] }
2020-11-26T00:26:55.853+0000 [conn1] replSet replSetInitiate config object parses ok, 1 members specified
2020-11-26T00:26:55.853+0000 [conn1] getMyAddrs(): [127.0.0.1] [10.20.26.228] [::1] [fe80::8ed:65ff:fe9e:15ab%eth0]
2020-11-26T00:26:55.853+0000 [conn1] getallIPs("staging4.domain.io"): [127.0.0.1]
2020-11-26T00:26:55.853+0000 [conn1] replSet replSetInitiate all members seem up
2020-11-26T00:26:55.853+0000 [conn1] ******
2020-11-26T00:26:55.853+0000 [conn1] creating replication oplog of size: 2570MB...
2020-11-26T00:26:55.853+0000 [conn1] create collection local.oplog.rs { size: 2695574937.6, capped: true, autoIndexId: false }
2020-11-26T00:26:55.853+0000 [conn1] Database::_addNamespaceToCatalog ns: local.oplog.rs
2020-11-26T00:26:55.866+0000 [conn1] ExtentManager::increaseStorageSize ns:local.oplog.rs desiredSize:2146426624 fromFreeList: 0 eloc: 1:2000
2020-11-26T00:26:55.876+0000 [conn1] ExtentManager::increaseStorageSize ns:local.oplog.rs desiredSize:549148160 fromFreeList: 0 eloc: 2:2000
2020-11-26T00:26:55.878+0000 [conn1] ******
2020-11-26T00:26:55.878+0000 [conn1] replSet info saving a newer config version to local.system.replset: { _id: "singleNodeRepl", version: 1, members: [ { _id: 0, host: "staging4.domain.io:27017" } ] }
2020-11-26T00:26:55.878+0000 [conn1] Database::_addNamespaceToCatalog ns: local.system.replset
2020-11-26T00:26:55.878+0000 [conn1] ExtentManager::increaseStorageSize ns:local.system.replset desiredSize:8192 fromFreeList: 0 eloc: 2:20bb8000
2020-11-26T00:26:55.878+0000 [conn1] Database::_addNamespaceToCatalog ns: local.system.replset.$_id_
2020-11-26T00:26:55.878+0000 [conn1] build index on: local.system.replset properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "local.system.replset" }
2020-11-26T00:26:55.878+0000 [conn1] local.system.replset: clearing plan cache - collection info cache reset
2020-11-26T00:26:55.878+0000 [conn1] allocating new extent
2020-11-26T00:26:55.878+0000 [conn1] ExtentManager::increaseStorageSize ns:local.system.replset.$_id_ desiredSize:131072 fromFreeList: 0 eloc: 2:20bba000
2020-11-26T00:26:55.878+0000 [conn1] added index to empty collection
2020-11-26T00:26:55.878+0000 [conn1] local.system.replset: clearing plan cache - collection info cache reset
2020-11-26T00:26:55.878+0000 [conn1] replSet saveConfigLocally done
2020-11-26T00:26:55.878+0000 [conn1] replSet replSetInitiate config now saved locally. Should come online in about a minute.
2020-11-26T00:26:55.878+0000 [conn1] command admin.$cmd command: replSetInitiate { replSetInitiate: undefined } keyUpdates:0 numYields:0 locks(micros) W:25362 reslen:206 25ms
2020-11-26T00:26:55.879+0000 [conn1] command test.$cmd command: isMaster { isMaster: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:270 0ms
2020-11-26T00:27:01.256+0000 [conn1] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:27:01.257+0000 [conn1] command test.$cmd command: isMaster { isMaster: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:367 0ms
2020-11-26T00:27:10.688+0000 [conn1] query local.system.replset planSummary: COLLSCAN ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:0 locks(micros) r:97 nreturned:1 reslen:126 0ms
2020-11-26T00:27:10.689+0000 [conn1] command test.$cmd command: isMaster { isMaster: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:367 0ms
2020-11-26T00:27:28.889+0000 [clientcursormon] connections:1
2020-11-26T00:27:33.333+0000 [conn1] end connection 127.0.0.1:50580 (0 connections now open)
2020-11-26T00:27:57.230+0000 [initandlisten] connection accepted from 127.0.0.1:50582 #2 (1 connection now open)
2020-11-26T00:27:57.230+0000 [conn2] command admin.$cmd command: whatsmyuri { whatsmyuri: 1 } ntoreturn:1 keyUpdates:0 numYields:0 reslen:62 0ms
2020-11-26T00:27:57.232+0000 [conn2] command admin.$cmd command: getLog { getLog: "startupWarnings" } keyUpdates:0 numYields:0 reslen:70 0ms
2020-11-26T00:27:57.233+0000 [conn2] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:28:00.237+0000 [conn2] command admin.$cmd command: serverStatus { serverStatus: 1.0 } keyUpdates:0 numYields:0 locks(micros) r:13 reslen:3402 0ms
2020-11-26T00:28:00.242+0000 [conn2] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:28:16.560+0000 [conn2] end connection 127.0.0.1:50582 (0 connections now open)
2020-11-26T00:32:28.904+0000 [clientcursormon] connections:0
2020-11-26T00:36:32.398+0000 [initandlisten] connection accepted from 127.0.0.1:50588 #3 (1 connection now open)
2020-11-26T00:36:32.398+0000 [conn3] command admin.$cmd command: whatsmyuri { whatsmyuri: 1 } ntoreturn:1 keyUpdates:0 numYields:0 reslen:62 0ms
2020-11-26T00:36:32.399+0000 [conn3] command admin.$cmd command: getLog { getLog: "startupWarnings" } keyUpdates:0 numYields:0 reslen:70 0ms
2020-11-26T00:36:32.400+0000 [conn3] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:36:34.603+0000 [conn3] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:36:37.326+0000 [conn3] query local.oplog.rs planSummary: COLLSCAN ntoreturn:0 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:0 locks(micros) r:66 nreturned:1 reslen:106 0ms
2020-11-26T00:36:37.328+0000 [conn3] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:37:28.832+0000 [initandlisten] connection accepted from 10.20.37.160:54484 #4 (2 connections now open)
2020-11-26T00:37:28.832+0000 [conn4] command admin.$cmd command: isMaster { isMaster: 1, compression: [], client: { driver: { name: "mongo-ruby-driver", version: "2.13.1" }, os: { type: "linux", name: "linux-gnu", architecture: "x86_64" }, platform: "mongoid-6.4.1, Ruby 2.6.5, x86_64-linux, x86_64-pc-linux-gnu" } } keyUpdates:0 numYields:0 reslen:367 0ms
2020-11-26T00:37:28.919+0000 [clientcursormon] connections:2
2020-11-26T00:37:33.568+0000 [initandlisten] connection accepted from 10.20.37.160:54492 #5 (3 connections now open)
2020-11-26T00:37:33.569+0000 [conn5] command admin.$cmd command: isMaster { isMaster: 1, compression: [], client: { driver: { name: "mongo-ruby-driver", version: "2.13.1" }, os: { type: "linux", name: "linux-gnu", architecture: "x86_64" }, platform: "mongoid-6.4.1, Ruby 2.6.5, x86_64-linux, x86_64-pc-linux-gnu" } } keyUpdates:0 numYields:0 reslen:367 0ms
2020-11-26T00:37:36.586+0000 [conn3] end connection 127.0.0.1:50588 (2 connections now open)
2020-11-26T00:39:35.621+0000 [initandlisten] connection accepted from 127.0.0.1:50592 #6 (3 connections now open)
2020-11-26T00:39:35.621+0000 [conn6] command admin.$cmd command: whatsmyuri { whatsmyuri: 1 } ntoreturn:1 keyUpdates:0 numYields:0 reslen:62 0ms
2020-11-26T00:39:35.622+0000 [conn6] command admin.$cmd command: getLog { getLog: "startupWarnings" } keyUpdates:0 numYields:0 reslen:70 0ms
2020-11-26T00:39:35.623+0000 [conn6] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:39:37.589+0000 [conn6] opening db: test
2020-11-26T00:39:37.589+0000 [conn6] query test.oplog.rs planSummary: EOF ntoreturn:0 ntoskip:0 nscanned:0 nscannedObjects:0 keyUpdates:0 numYields:0 locks(micros) W:186 r:19 nreturned:0 reslen:20 0ms
2020-11-26T00:39:37.590+0000 [conn6] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:39:41.891+0000 [conn6] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:39:43.266+0000 [conn6] query local.oplog.rs planSummary: COLLSCAN ntoreturn:0 ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 numYields:0 locks(micros) r:62 nreturned:1 reslen:106 0ms
2020-11-26T00:39:43.268+0000 [conn6] command admin.$cmd command: replSetGetStatus { replSetGetStatus: 1.0, forShell: 1.0 } keyUpdates:0 numYields:0 reslen:300 0ms
2020-11-26T00:39:52.681+0000 [conn6] end connection 127.0.0.1:50592 (2 connections now open)
2020-11-26T00:42:28.934+0000 [clientcursormon] connections:2
You should not mix using a config file, i.e.
mongod --config /etc/mongod.conf
with command-line options, i.e.
mongod --replSet rs0 --bind_ip localhost
Most likely you did not set the replica set name in /etc/mongod.conf:
replication:
  replSetName: <string>
So when you start MongoDB with service mongodb start, you may get a different configuration.
Note: check the service file (on my Red Hat system it is /etc/systemd/system/mongod.service), which may point to yet another .conf file.
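To verify which configuration the running mongod actually picked up, you can ask the server directly. A small sketch with PyMongo (any recent version; assumes the server is reachable on localhost):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
opts = client.admin.command("getCmdLineOpts")
# "parsed" shows the config file that was read and the replication
# section as the server understood it, e.g. the replSet name.
print(opts["parsed"])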

Slow mongodb FindOneAndUpdate

I'm having trouble with the following findOneAndUpdate MongoDB query:
planSummary: IXSCAN { id: 1 } keysExamined:1 docsExamined:1 nMatched:1 nModified:1 keysInserted:1 keysDeleted:1 numYields:0 reslen:3044791
locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } } }
storage:{} protocol:op_query 135ms
writeConcern: { w: 0, j: false }
As you can see, it has an execution time of over 100 ms. The query part uses an index and takes less than 1 ms (using 'Explain query'), so it's the write part that is slow.
The Mongo instance is the primary of a 3-member replica set. Write concern is set to 0 and journaling is disabled.
What could be the cause of the slow write? Could it be the update of the indexes?
MongoDB version 4.0
Driver: Node.js native mongodb version 3.2
Edit: I think it might be the length of the result. After querying a smaller document, the execution time is halved.
reslen:3044791
This was the source of the bad performance. Reducing it by adding a projection option to return only a specific field improved the execution from ~90 ms on average to ~7 ms.
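For reference, the same fix expressed in PyMongo syntax (the question uses the Node.js driver, where the option is likewise called projection; collection and field names here are placeholders):

from pymongo import MongoClient, ReturnDocument

coll = MongoClient().mydb.mycoll  # placeholder connection

doc = coll.find_one_and_update(
    {"id": 42},                    # matched via the { id: 1 } index
    {"$set": {"status": "done"}},
    projection={"status": 1},      # return only this field instead of
                                   # the whole ~3 MB document (reslen)
    return_document=ReturnDocument.AFTER,
)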

My mongorestore runs to infinity

I did a mongorestore of a gzipped mongodump:
mongorestore -v --drop --gzip --db bigdata /Volumes/Lacie2TB/backup/mongo20170909/bigdata/
But it kept going. I left it running, because I figured that if I 'just' closed it now, my (important) data would be corrupted. Check the percentages:
2017-09-10T14:45:58.385+0200 [########################] bigdata.logs.sets.log 851.8 GB/85.2 GB (999.4%)
2017-09-10T14:46:01.382+0200 [########################] bigdata.logs.sets.log 852.1 GB/85.2 GB (999.7%)
2017-09-10T14:46:04.381+0200 [########################] bigdata.logs.sets.log 852.4 GB/85.2 GB (1000.0%)
And it keeps going!
Note that the other collections have finished; only this one goes beyond 100%. I do not understand it.
This is Mongo 3.2.7 on Mac OS X.
There is obviously a problem with the amount of data imported, because there is not even that much disk space.
$ df -h
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk3 477Gi 262Gi 214Gi 56% 68749708 56193210 55% /
The amount of disk space used could be right, because the gzipped backup is about 200GB. I do not know if this would result in the same amount of data on the WiredTiger database with snappy compression.
However, the log keeps showing inserts:
2017-09-10T16:20:18.986+0200 I COMMAND [conn9] command bigdata.logs.sets.log command: insert { insert: "logs.sets.log", documents: 20, writeConcern: { getLastError: 1, w: 1 }, ordered: false } ninserted:20 keyUpdates:0 writeConflicts:0 numYields:0 reslen:40 locks:{ Global: { acquireCount: { r: 19, w: 19 } }, Database: { acquireCount: { w: 19 } }, Collection: { acquireCount: { w: 19 } } } protocol:op_query 245ms
2017-09-10T16:20:19.930+0200 I COMMAND [conn9] command bigdata.logs.sets.log command: insert { insert: "logs.sets.log", documents: 23, writeConcern: { getLastError: 1, w: 1 }, ordered: false } ninserted:23 keyUpdates:0 writeConflicts:0 numYields:0 reslen:40 locks:{ Global: { acquireCount: { r: 19, w: 19 } }, Database: { acquireCount: { w: 19 } }, Collection: { acquireCount: { w: 19 } } } protocol:op_query 190ms
update
Disk space is still being consumed. This is roughly 2 hours later, and roughly 30 GB later:
$ df -h
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk3 477Gi 290Gi 186Gi 61% 76211558 48731360 61% /
The question is: Is there a bug in the progress indicator, or is there some kind of loop that keeps inserting the same documents?
Update
It finished.
2017-09-10T19:35:52.268+0200 [########################] bigdata.logs.sets.log 1604.0 GB/85.2 GB (1881.8%)
2017-09-10T19:35:52.268+0200 restoring indexes for collection bigdata.logs.sets.log from metadata
2017-09-10T20:16:51.882+0200 finished restoring bigdata.logs.sets.log (3573548 documents)
2017-09-10T20:16:51.882+0200 done
1604.0 GB/85.2 GB (1881.8%)
Interesting. :)
It looks similar to this bug: https://jira.mongodb.org/browse/TOOLS-1579
There seems to be a fix backported to 3.5 and 3.4, but it may not have been backported to 3.2. I'm thinking the problem might have something to do with using gzip and/or snappy compression.

Mongo Auto Balancing Not Working

I'm running into an issue where one of my shards is constantly at 100% CPU usage while I'm storing files in my Mongo DB (using GridFS). I have shut down writing to the DB and the usage drops down to nearly 0%. However, the auto balancer is on and does not appear to be balancing anything. I have roughly 50% of my data on that one shard with nearly 100% CPU usage, while virtually all the others are at 7-8%.
Any ideas?
mongos> version()
3.0.6
Auto Balancing Enabled
Storage Engine: WiredTiger
I have this general architecture:
2 - routers
3 - config servers
8 - shards (2 shards per server - 4 servers)
No replica sets!
https://docs.mongodb.org/v3.0/core/sharded-cluster-architectures-production/
Log Details
Router 1 Log:
2016-01-15T16:15:21.714-0700 I NETWORK [conn3925104] end connection [IP]:[port] (63 connections now open)
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] Socket recv() timeout [IP]:[port]
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] SocketException: remote: [IP]:[port] error: 9001 socket exception [RECV_TIMEOUT] server [IP]:[port]
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] DBClientCursor::init call() failed
2016-01-15T16:15:23.256-0700 I NETWORK [LockPinger] scoped connection to [IP]:[port],[IP]:[port],[IP]:[port] not being returned to the pool
2016-01-15T16:15:23.256-0700 W SHARDING [LockPinger] distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579303:1804289383' detected an exception while pinging. :: caused by :: SyncClusterConnection::update prepare failed: [IP]:[port] (IP) failed:10276 DBClientBase::findN: transport error: [IP]:[port] ns: admin.$cmd query: { getlasterror: 1, fsync: 1 }
2016-01-15T16:15:24.715-0700 I NETWORK [mongosMain] connection accepted from [IP]:[port] #3925105 (64 connections now open)
2016-01-15T16:15:24.715-0700 I NETWORK [conn3925105] end connection [IP]:[port] (63 connections now open)
2016-01-15T16:15:27.717-0700 I NETWORK [mongosMain] connection accepted from [IP]:[port] #3925106 (64 connections now open)
2016-01-15T16:15:27.718-0700 I NETWORK [conn3925106] end connection [IP]:[port](63 connections now open)
Router 2 Log:
2016-01-15T16:18:21.762-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e3d110ccb8e38549a9d
2016-01-15T16:18:24.316-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:18:24 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579454:1804289383', sleeping for 30000ms
2016-01-15T16:18:24.978-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:35.295-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e4a110ccb8e38549a9f
2016-01-15T16:18:38.507-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:48.838-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e58110ccb8e38549aa1
2016-01-15T16:18:52.038-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
2016-01-15T16:18:54.660-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:18:54 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1442579454:1804289383', sleeping for 30000ms
2016-01-15T16:19:02.323-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' acquired, ts : 56997e66110ccb8e38549aa3
2016-01-15T16:19:05.513-0700 I SHARDING [Balancer] distributed lock 'balancer/[IP]:[port]:1442579454:1804289383' unlocked.
Problematic Shard Log:
2016-01-15T16:21:03.426-0700 W SHARDING [conn40] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 46 took 17364ms
2016-01-15T16:21:03.484-0700 I COMMAND [conn40] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 17421ms
2016-01-15T16:21:03.775-0700 I SHARDING [LockPinger] cluster [IP]:[port],[IP]:[port],[IP]:[port] pinged successfully at Fri Jan 15 16:21:03 2016 by distributed lock pinger '[IP]:[port],[IP]:[port],[IP]:[port]/[IP]:[port]:1441718306:765353801', sleeping for 30000ms
2016-01-15T16:21:04.321-0700 I SHARDING [conn40] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
2016-01-15T16:21:08.243-0700 I SHARDING [conn46] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
2016-01-15T16:21:10.174-0700 W SHARDING [conn37] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 60 took 18516ms
2016-01-15T16:21:10.232-0700 I COMMAND [conn37] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 18574ms
2016-01-15T16:21:10.989-0700 W SHARDING [conn25] Finding the split vector for Files.fs.chunks over { files_id: 1.0, n: 1.0 } keyCount: 137 numSplits: 200715 lookedAt: 62 took 18187ms
2016-01-15T16:21:11.047-0700 I COMMAND [conn25] command admin.$cmd command: splitVector { splitVector: "Files.fs.chunks", keyPattern: { files_id: 1.0, n: 1.0 }, min: { files_id: ObjectId('5650816c827928d710ef5ef9'), n: 1 }, max: { files_id: MaxKey, n: MaxKey }, maxChunkSizeBytes: 67108864, maxSplitPoints: 0, maxChunkObjects: 250000 } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:216396 reslen:8318989 locks:{ Global: { acquireCount: { r: 432794 } }, Database: { acquireCount: { r: 216397 } }, Collection: { acquireCount: { r: 216397 } } } 18246ms
2016-01-15T16:21:11.365-0700 I SHARDING [conn37] request split points lookup for chunk Files.fs.chunks { : ObjectId('5650816c827928d710ef5ef9'), : 1 } -->> { : MaxKey, : MaxKey }
For the splitting error: upgrading to Mongo v3.0.8+ resolved it.
Still having an issue with the balancing itself... The shard key is an MD5 checksum, so unless they all have very similar MD5s (not very likely) there is still investigating to do... using range-based partitioning.
There are multiple ways to check:
db.printShardingStatus() - this lists all sharded collections, whether the auto balancer is on, and which collection is currently being balanced and since when.
sh.status(true) - this gives chunk-level details. Check whether your chunks have jumbo: true; a chunk marked as jumbo will not be split properly.
db.collection.stats() - this gives collection stats, including the per-shard distribution details.
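If you prefer to check from code, the same information can be pulled from the config database on a mongos. A minimal sketch in Python (PyMongo; the host is a placeholder, and the ns-based chunk format below matches 3.0-era clusters):

from pymongo import MongoClient

mongos = MongoClient("mongos-host", 27017)  # placeholder address

# Count chunks per shard for the collection, and how many are jumbo;
# a heavily skewed count explains why one shard takes all the load.
pipeline = [
    {"$match": {"ns": "Files.fs.chunks"}},
    {"$group": {
        "_id": "$shard",
        "chunks": {"$sum": 1},
        "jumbo": {"$sum": {"$cond": [{"$eq": ["$jumbo", True]}, 1, 0]}},
    }},
]
for row in mongos.config.chunks.aggregate(pipeline):
    print(row)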

Mongodb crashed with Got signal: 11 (Segmentation fault)

My mongo server crashed with the following log. My mongo server is version 2.4.2 and my mongo Java client is 2.11.2. My environment is RHEL.
Please let me know what could be the problem. I see from other threads that older versions before 2.2.x had this problem, but mine is 2.4.2. Any help?
...
Thu Feb 20 13:45:56.924 [conn78956] run command admin.$cmd { ismaster: 1 }
Thu Feb 20 13:45:56.924 [conn78956] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 keyUpdates:0 reslen:263 0ms
Thu Feb 20 13:45:56.938 [conn78962] runQuery called admin.$cmd { ismaster: 1 }
Thu Feb 20 13:45:56.938 [conn78962] run command admin.$cmd { ismaster: 1 }
Thu Feb 20 13:45:56.938 [conn78962] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 keyUpdates:0 reslen:263 0ms
Thu Feb 20 13:45:56.938 [conn78965] runQuery called admin.$cmd { ismaster: 1 }
Thu Feb 20 13:45:56.938 [conn78965] run command admin.$cmd { ismaster: 1 }
Thu Feb 20 13:45:56.938 [conn78965] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 keyUpdates:0 reslen:263 0ms
Thu Feb 20 13:45:56.938 [conn78964] runQuery called admin.$cmd { ismaster: 1 }
Thu Feb 20 13:45:56.938 [conn78964] run command admin.$cmd { ismaster: 1 }
Thu Feb 20 13:45:56.938 [conn78964] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 keyUpdates:0 reslen:263 0ms
Thu Feb 20 13:45:56.941 [rsHealthPoll] replSet member 204.27.36.236:5000 is up
Thu Feb 20 13:45:56.941 [rsHealthPoll] replSet member 204.27.36.236:5000 is now in state SECONDARY
Thu Feb 20 13:45:56.941 [rsMgr] replSet warning caught unexpected exception in electSelf()
Thu Feb 20 13:45:56.941 Invalid access at address: 0 from thread:
Thu Feb 20 13:45:56.941 Got signal: 11 (Segmentation fault).
Thu Feb 20 13:45:56.941 [conn78959] runQuery called admin.$cmd { ismaster: 1 }
Thu Feb 20 13:45:56.941 [conn78959] run command admin.$cmd { ismaster: 1 }
Thu Feb 20 13:45:56.941 [conn78959] command admin.$cmd command: { ismaster: 1 } ntoreturn:1 keyUpdates:0 reslen:263 0ms
Thu Feb 20 13:45:56.943 Backtrace:
0xdced21 0x6cf749 0x6cfcd2 0x30e160f4c0
/home/myserver/mySer/db/mongodb/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdced21]
/home/myserver/mySer/db/mongodb/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6cf749]
/home/myserver/mySer/db/mongodb/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x262) [0x6cfcd2]
/lib64/libpthread.so.0() [0x30e160f4c0]