I was practicing some MapReduce inside my primary's mongo shell when it suddenly became a secondary. I SSHed into the two other VMs running the secondaries and discovered that their mongod processes had been rendered inoperable. I killed them, reissued mongod --config /etc/mongod.conf to start them again, and entered the mongo shell. After a few seconds the shells were interrupted with:
2014-09-14T22:29:54.142-0500 DBClientCursor::init call() failed
2014-09-14T22:29:54.143-0500 trying reconnect to 127.0.0.1:27017 (127.0.0.1) failed
2014-09-14T22:29:54.143-0500 warning: Failed to connect to 127.0.0.1:27017, reason: errno:111 Connection refused
2014-09-14T22:29:54.143-0500 reconnect 127.0.0.1:27017 (127.0.0.1) failed failed couldn't connect to server 127.0.0.1:27017 (127.0.0.1), connection attempt failed
>
This is from the logs of the two original secondaries in the replica set:
2014-09-14T22:09:21.879-0500 [rsBackgroundSync] replSet syncing to: vm-billing-001:27017
2014-09-14T22:09:21.880-0500 [rsSync] replSet still syncing, not yet to minValid optime 54165090:1
2014-09-14T22:09:21.882-0500 [rsBackgroundSync] replset setting syncSourceFeedback to vm-billing-001:27017
2014-09-14T22:09:21.886-0500 [rsSync] replSet SECONDARY
2014-09-14T22:09:21.886-0500 [repl writer worker 1] build index on: test.tmp.mr.CCS.nonconforming_1_inc properties: { v: 1, key: { 0: 1 }, name: "_temp_0", ns: "test.tmp.mr.CCS.nonconforming_1_inc" }
2014-09-14T22:09:21.887-0500 [repl writer worker 1] added index to empty collection
2014-09-14T22:09:21.887-0500 [repl writer worker 1] build index on: test.tmp.mr.CCS.nonconforming_1 properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.tmp.mr.CCS.nonconforming_1" }
2014-09-14T22:09:21.887-0500 [repl writer worker 1] added index to empty collection
2014-09-14T22:09:21.888-0500 [repl writer worker 1] build index on: test.tmp.mr.CCS.nonconforming_1 properties: { v: 1, unique: true, key: { id: 1.0 }, name: "id_1", ns: "test.tmp.mr.CCS.nonconforming_1" }
2014-09-14T22:09:21.888-0500 [repl writer worker 1] added index to empty collection
2014-09-14T22:09:21.891-0500 [repl writer worker 2] ERROR: writer worker caught exception: :: caused by :: 11000 insertDocument :: caused by :: 11000 E11000 duplicate key error index: cisco.tmp.mr.CCS.nonconforming_1.$id_1 dup key: { : null } on: { ts: Timestamp 1410748561000|46, h: 9014687153249982311, v: 2, op: "i", ns: "cisco.tmp.mr.CCS.nonconforming_1", o: { _id: 14, value: 1.0 } }
2014-09-14T22:09:21.891-0500 [repl writer worker 2] Fatal Assertion 16360
2014-09-14T22:09:21.891-0500 [repl writer worker 2]
I can issue mongo --host ... --port ... from both of the two VMs whose mongods won't start and connect to the original primary's mongod, but I do see some connection-refused notes in the error log above.
My original primary mongod can still be connected to in the mongo shell, but it is a primary. I can kill it and restart it, and it will start up as a secondary.
How can I roll back to the last known state and restart my replica set?
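For reference, the restart-and-check sequence I ran on each affected secondary was roughly the following (the config path is from my setup; the rs.status() call is only there to confirm each member's state):
kill <mongod-pid>
mongod --config /etc/mongod.conf
mongo
> rs.status().members.forEach(function (m) { print(m.name, m.stateStr); })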
My replica set has two nodes:
1: the master node
2: a slave node with priority:0, votes:0
The oplog size is 5000MB.
Run this for loop in the master's shell:
for (i=0;i<1000000;i++)
{
db.getSiblingDB("ff").c.insert(
{ a:i,
d:i+".#234"+(++i)+".234546"+(++i)+".568679"+(++i)+"31234."+(++i)+".12342354"+(++i)+"5346457."+(++i)+"33543465456."+(++i)+".6346456"+(++i)+"123235434."+(++i)+".2345345345"+(++i)
}
)
}
Kill the slave node while the for loop is running: kill -9 $(pidof slave_node)
Stop the for loop after a second; then restart the slave node.
Then run db.getSiblingDB("ff").c.count() to check data in both slave and master nodes, with the results:
master: 20w (i.e. roughly 200,000 documents)
slave: 15w (i.e. roughly 150,000 documents)
The slave node catches up with the primary, but a lot of data is lost on the slave.
Why is this?
Here is the slave node's log as it restarts after being killed:
2017-11-27T05:53:53.873+0000 I NETWORK [thread1] waiting for connections on port 28006
2017-11-27T05:53:53.876+0000 I REPL [replExecDBWorker-0] New replica set config in use: { _id: "cpconfig2", version: 2, protocolVersion: 1, members: [ { _id: 0, host: "127.0.0.1:28007", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "127.0.0.1:28006", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 0.0, tags: {}, slaveDelay: 0, votes: 0 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: 60000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('5a1ba5bbb0a652502a5f002a') } }
2017-11-27T05:53:53.876+0000 I REPL [replExecDBWorker-0] This node is 127.0.0.1:28006 in the config
2017-11-27T05:53:53.876+0000 I REPL [replExecDBWorker-0] transition to STARTUP2
2017-11-27T05:53:53.876+0000 I REPL [replExecDBWorker-0] Starting replication storage threads
2017-11-27T05:53:53.877+0000 I REPL [replExecDBWorker-0] Starting replication fetcher thread
2017-11-27T05:53:53.877+0000 I REPL [replExecDBWorker-0] Starting replication applier thread
2017-11-27T05:53:53.877+0000 I REPL [replExecDBWorker-0] Starting replication reporter thread
2017-11-27T05:53:53.877+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 127.0.0.1:28007
2017-11-27T05:53:53.877+0000 I REPL [rsSync] transition to RECOVERING
2017-11-27T05:53:53.878+0000 I REPL [rsSync] transition to SECONDARY
2017-11-27T05:53:53.879+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 127.0.0.1:28007, took 2ms (1 connections now open to 127.0.0.1:28007)
2017-11-27T05:53:53.879+0000 I REPL [ReplicationExecutor] Member 127.0.0.1:28007 is now in state PRIMARY
2017-11-27T05:53:54.011+0000 I FTDC [ftdc] Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost. OK
2017-11-27T05:53:54.645+0000 I NETWORK [thread1] connection accepted from 127.0.0.1:52404 #1 (1 connection now open)
2017-11-27T05:53:54.645+0000 I NETWORK [conn1] received client metadata from 127.0.0.1:52404 conn1: { driver: { name: "NetworkInterfaceASIO-Replication", version: "3.4.9" }, os: { type: "Linux", name: "PRETTY_NAME="Debian GNU/Linux 8 (jessie)"", architecture: "x86_64", version: "Kernel 3.10.0" } }
2017-11-27T05:53:59.878+0000 I REPL [rsBackgroundSync] sync source candidate: 127.0.0.1:28007
See the page Accuracy after Unexpected Shutdown for details and information on how to recover from this situation.
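A quick way to tell whether documents are actually missing or only the fast count metadata is stale after the kill -9 (collection name taken from the question; validate does a full collection scan, so it takes a while) is to compare:
db.getSiblingDB("ff").c.count()           // metadata-based fast count, can be wrong after an unclean shutdown
db.getSiblingDB("ff").c.find().itcount()  // iterates the documents, slower but accurate
db.getSiblingDB("ff").c.validate(true)    // full validation; also restores correct count/size statistics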
MongoDB: v3.4.9
The error is logged from this line of the MongoDB C++ source code: https://github.com/mongodb/mongo/blob/367d31e1da549c460ae710a8cc280f4c235ab24f/src/mongo/s/client/shard_registry.cpp#L384
Mongos throws an error when I add a new node to the sharded cluster, and none of the collections with sharding enabled can be queried (ExceededTimeLimit).
Can it be repaired?
Marking host config.app.com as failed :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 871 -- target:config.app.com db:config expDate:2017-10-21T13:16:38.250+0000 cmd:{ find: "shards", readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1508586527000|1, t: 24 } }, maxTimeMS: 30000 }
2017-10-21T13:16:38.250+0000 I SHARDING [shard registry reload] Operation timed out :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 871 -- target:config.app.com db:config expDate:2017-10-21T13:16:38.250+0000 cmd:{ find: "shards", readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1508586527000|1, t: 24 } }, maxTimeMS: 30000 }
2017-10-21T13:16:38.250+0000 I SHARDING [shard registry reload] Periodic reload of shard registry failed :: caused by :: 50 could not get updated shard list from config server due to Operation timed out, request was RemoteCommand 871 -- target:config.app.com db:config expDate:2017-10-21T13:16:38.250+0000 cmd:{ find: "shards", readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1508586527000|1, t: 24 } }, maxTimeMS: 30000 }; will retry after 30s
This happened to me, and after hours of debugging I found that my config server was started without the configsvr: true option in rs.initiate. So mongos was requesting data from my config server, but the config server didn't know how to respond. FWIW, I had
sharding:
  clusterRole: configsvr
in my conf file, but it looks like that wasn't picked up.
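For comparison, a minimal sketch of initiating the config server replica set with the option set explicitly (the set name and port are placeholders; the host name is taken from the log above), run in a shell connected to the mongod that was started with sharding.clusterRole: configsvr:
rs.initiate({
  _id: "configReplSet",    // must match replication.replSetName
  configsvr: true,         // the option that was missing in my case
  members: [ { _id: 0, host: "config.app.com:27019" } ]
})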
We are trying to move from Mongo 2.4.9 to 3.4. We have a lot of data, so we tried to set up replication, wait until the data was synced, and then swap the primary.
The configuration is done, but when replication is initiated the new server can't stabilize replication:
2017-07-07T12:07:22.492+0000 I REPL [replication-1] Starting initial sync (attempt 10 of 10)
2017-07-07T12:07:22.501+0000 I REPL [replication-1] sync source candidate: mongo-2.blabla.com:27017
2017-07-07T12:07:22.501+0000 I STORAGE [replication-1] dropAllDatabasesExceptLocal 1
2017-07-07T12:07:22.501+0000 I REPL [replication-1] ******
2017-07-07T12:07:22.501+0000 I REPL [replication-1] creating replication oplog of size: 6548MB...
2017-07-07T12:07:22.504+0000 I STORAGE [replication-1] WiredTigerRecordStoreThread local.oplog.rs already started
2017-07-07T12:07:22.505+0000 I STORAGE [replication-1] The size storer reports that the oplog contains 0 records totaling to 0 bytes
2017-07-07T12:07:22.505+0000 I STORAGE [replication-1] Scanning the oplog to determine where to place markers for truncation
2017-07-07T12:07:22.519+0000 I REPL [replication-1] ******
2017-07-07T12:07:22.521+0000 I REPL [replication-1] Initial sync attempt finishing up.
2017-07-07T12:07:22.521+0000 I REPL [replication-1] Initial Sync Attempt Statistics: { failedInitialSyncAttempts: 9, maxFailedInitialSyncAttempts: 10, initialSyncStart: new Date(1499429233163), initialSyncAttempts: [ { durationMillis: 0, status: "CommandNotFound: error while getting last oplog entry for begin timestamp: no such cmd: find", syncSource: "mongo-2.blabla.com:27017" }, { durationMillis: 0, status: "CommandNotFound: error while getting last oplog entry for begin timestamp: no such cmd: find", syncSource: "mongo-2.blabla.com:27017" }, { durationMillis: 0, status: "CommandNotFound: error while getting last oplog entry for begin timestamp: no such cmd: find", syncSource: "mongo-2.blabla.com:27017" }, { durationMillis: 0, status: "CommandNotFound: error while getting last oplog entry for begin timestamp: no such cmd: find", syncSource: "mongo-2.blabla.com:27017" }, { durationMillis: 0, status: "CommandNotFound: error while getting last oplog entry for begin timestamp: no such cmd: find", syncSource: "mongo-2.blabla.com:27017" }, { durationMillis: 0, status: "CommandNotFound: error while getting last oplog entry for begin timestamp: no such cmd: find", syncSource: "mongo-2.blabla.com:27017" }, { durationMillis: 0, status: "CommandNotFound: error while getting last oplog entry for begin timestamp: no such cmd: find", syncSource: "mongo-2.blabla.com:27017" }, { durationMillis: 0, status: "CommandNotFound: error while getting last oplog entry for begin timestamp: no such cmd: find", syncSource: "mongo-2.blabla.com:27017" }, { durationMillis: 0, status: "CommandNotFound: error while getting last oplog entry for begin timestamp: no such cmd: find", syncSource: "mongo-2.blabla.com:27017" } ] }
2017-07-07T12:07:22.521+0000 E REPL [replication-1] Initial sync attempt failed -- attempts left: 0 cause: CommandNotFound: error while getting last oplog entry for begin timestamp: no such cmd: find
2017-07-07T12:07:22.521+0000 F REPL [replication-1] The maximum number of retries have been exhausted for initial sync.
2017-07-07T12:07:22.522+0000 E REPL [replication-0] Initial sync failed, shutting down now. Restart the server to attempt a new initial sync.
2017-07-07T12:07:22.522+0000 I - [replication-0] Fatal assertion 40088 CommandNotFound: error while getting last oplog entry for begin timestamp: no such cmd: find at src/mongo/db/repl/replication_coordinator_impl.cpp 632
Please assist, guys: we have more than 100 GB of data, so a dump and restore would mean a lot of downtime.
Configurations:
3.4.5 new machine:
storage:
  dbPath: /mnt/dbpath
  journal:
    enabled: true
  engine: wiredTiger
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log
net:
  port: 27017
replication:
  replSetName: prodTest
2.4.9 old machine with data:
dbpath=/var/lib/mongodb
logpath=/var/log/mongodb/mongodb.log
logappend=true
port=27017
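For context on the error itself: a 3.4 node performs its initial sync by reading the sync source's oplog with the find command, which 2.4 does not implement, and mixing versions that far apart in one replica set is not supported, hence the stepwise path described below. A quick check against the old node confirms the gap (run in a shell connected to mongo-2.blabla.com:27017):
db.version()                                               // "2.4.9"
db.getSiblingDB("local").runCommand({ find: "oplog.rs" })  // fails with "no such cmd: find" on 2.4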
The task has been solved in the following way:
- create a replica set: master on v2.4, 3 slaves on v2.6
- stop the app, step down the master
- stop the new master, upgrade it to v3.0, start it, then upgrade the slaves sequentially to 3.2 (the slave db files were removed and the new version was started on the WiredTiger engine)
- step down the master, upgrade all slaves to 3.4
This process turned out to be very fast, because replica slave recovery of a 40 GB db takes around 30 minutes.
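For the step-down parts, a minimal mongo shell sketch (the 120-second value is just an illustrative window during which the node will not seek re-election) looks like:
// on the current primary: hand over primaryship before taking the node down
rs.stepDown(120)
// on any member: confirm the new primary and that the secondaries are keeping up
rs.status().members.forEach(function (m) { print(m.name, m.stateStr, m.optimeDate); })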
I am trying to dump a MongoDB collection to a file, and then use that to restore it to another MongoDB instance.
Dumping:
mongodump --host 127.0.0.1 --port 27017 --username vespauser --password <passwd> --collection vespastats --db vespa --out /archive/vespa-archive/vespa-db-backup_001
connected to: 127.0.0.1:27017
2015-04-21T16:24:07.070-0400 DATABASE: vespa to /archive/vespa-archive/vespa-db-backup_testing01/vespa
2015-04-21T16:24:07.141-0400 vespa.system.indexes to /archive/vespa-archive/vespa-db-backup_testing01/vespa/system.indexes.bson
2015-04-21T16:24:07.148-0400 4 documents
2015-04-21T16:24:07.149-0400 vespa.vespastats to /archive/vespa-archive/vespa-db-backup_testing01/vespa/vespastats.bson
2015-04-21T16:24:07.316-0400 59724 documents
2015-04-21T16:24:08.118-0400 Metadata for vespa.vespastats to /archive/vespa-archive/vespa-db-backup_testing01/vespa/vespastats.metadata.json
Restoring:
mongorestore -v --drop --host 127.0.0.1 --port 27017 --username admin --password <passwd> /archive/vespa-archive/vespa-db-backup_001
2015-04-21T16:31:11.962-0400 creating new connection to:127.0.0.1:27017
2015-04-21T16:31:11.963-0400 [ConnectBG] BackgroundJob starting: ConnectBG
2015-04-21T16:31:11.963-0400 connected to server 127.0.0.1:27017 (127.0.0.1)
2015-04-21T16:31:11.963-0400 connected connection!
connected to: 127.0.0.1:27017
2015-04-21T16:31:11.966-0400 /home/amurty/vespa-db/vespa-db-backup_testing01/vespa/vespastats.bson
2015-04-21T16:31:11.966-0400 going into namespace [vespa.vespastats]
2015-04-21T16:31:11.966-0400 dropping
file size: 88808161
59724 objects found
2015-04-21T16:31:13.730-0400 Creating index: { key: { _id: 1 }, name: "_id_", ns: "vespa.vespastats" }
2015-04-21T16:31:13.848-0400 Creating index: { key: { url: 1 }, name: "url_1", ns: "vespa.vespastats", background: true }
2015-04-21T16:31:13.858-0400 Creating index: { key: { r_tstpm: 1 }, name: "r_tstpm_1", ns: "vespa.vespastats", background: true }
2015-04-21T16:31:13.859-0400 Creating index: { key: { url: 1, r_tstpm: 1 }, name: "url_1_r_tstpm_1", ns: "vespa.vespastats", background: true }
From /var/log/mongodb/mongod.log:
2015-04-21T16:31:11.963-0400 [initandlisten] connection accepted from 127.0.0.1:58444 #23 (1 connection now open)
2015-04-21T16:31:11.964-0400 [conn23] authenticate db: admin { authenticate: 1, nonce: "xxx", user: "admin", key: "xxx" }
2015-04-21T16:31:11.968-0400 [conn23] CMD: drop vespa.vespastats
2015-04-21T16:31:13.757-0400 [conn23] allocating new ns file /var/lib/mongo/vespa.ns, filling with zeroes...
2015-04-21T16:31:13.838-0400 [FileAllocator] allocating new datafile /var/lib/mongo/vespa.0, filling with zeroes...
2015-04-21T16:31:13.846-0400 [FileAllocator] done allocating datafile /var/lib/mongo/vespa.0, size: 64MB, took 0.007 secs
2015-04-21T16:31:13.847-0400 [conn23] build index on: vespa.vespastats properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "vespa.vespastats" }
2015-04-21T16:31:13.848-0400 [conn23] added index to empty collection
2015-04-21T16:31:13.857-0400 [conn23] build index on: vespa.vespastats properties: { v: 1, key: { url: 1 }, name: "url_1", ns: "vespa.vespastats", background: true }
2015-04-21T16:31:13.857-0400 [conn23] added index to empty collection
2015-04-21T16:31:13.858-0400 [conn23] build index on: vespa.vespastats properties: { v: 1, key: { r_tstpm: 1 }, name: "r_tstpm_1", ns: "vespa.vespastats", background: true }
2015-04-21T16:31:13.859-0400 [conn23] added index to empty collection
2015-04-21T16:31:13.860-0400 [conn23] build index on: vespa.vespastats properties: { v: 1, key: { url: 1, r_tstpm: 1 }, name: "url_1_r_tstpm_1", ns: "vespa.vespastats", background: true }
2015-04-21T16:31:13.860-0400 [conn23] added index to empty collection
2015-04-21T16:31:13.862-0400 [conn23] end connection 127.0.0.1:58444 (0 connections now open)
Now when I log in to my new MongoDB instance and check the collection size, I get a big 0:
# mongo
MongoDB shell version: 2.6.9
connecting to: test
> use vespa
switched to db vespa
> db.auth('vespauser', '<paswd>')
1
> db.vespastats.find()
> db.vespastats.count()
0
>
The collection may or may not exist in the database currently in use, but the query does not return an error, just 0.
db.vespastats.find().count()
The issue should be that the data was restored into the database test. (The docs mention that the target database should be picked up automatically, but I was able to reproduce this behaviour.)
Therefore
use test
db.vespastats.find().count()
would have returned the actual documents in the collection vespastats.
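To see where the restore actually put the data without switching databases, a quick check from any db (both candidate database names come from this thread) is:
db.adminCommand({ listDatabases: 1 })         // or: show dbs
db.getSiblingDB("test").vespastats.count()
db.getSiblingDB("vespa").vespastats.count()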
The issue is caused by not specifying the db name when using the mongorestore binary. Per the mongorestore docs, mongorestore --nsInclude=vespa.vespastats is the updated form (even though -d still works).
To determine where the collection would land, I would run the restore twice and check show dbs in the mongo shell three times (before, in between, and after); the db size changes (not immediately though, as it may show 8 KB right after the restoration).
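A minimal sketch of the corrected restore, reusing the paths and credentials from the question (the --nsInclude form assumes a 3.4+ mongorestore; --authenticationDatabase may need to point at wherever the admin user is defined):
mongorestore --host 127.0.0.1 --port 27017 --username admin --password <passwd> --authenticationDatabase admin --nsInclude='vespa.vespastats' /archive/vespa-archive/vespa-db-backup_001
# with older tools, name the target db explicitly instead:
mongorestore --host 127.0.0.1 --port 27017 --username admin --password <passwd> --authenticationDatabase admin -d vespa /archive/vespa-archive/vespa-db-backup_001/vespa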
I have a 3-server replica set, and today I found some strange messages in mongodb.log which have never appeared before:
Fri Feb 20 10:17:12 [rsSync] Assertion: 10334:Invalid BSONObj size:
-286331154 (0xEEEEEEEE) first element: _id: ObjectId('54d96ab42d46d5c91edf6651') 0x588cb2 0x5077a1 0x8ba7f8
0x94572e 0x949e45 0x94c395 0x830411 0x82100b 0x821d85 0x8231d8
0x82439a 0x824820 0xaa4560 0x3ecf40683d 0x3ececd4f8d
/opt/mongo/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x112) [0x588cb2]
/opt/mongo/bin/mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x471)
[0x5077a1] /opt/mongo/bin/mongod(_ZNK5mongo7DiskLoc3objEv+0xa8)
[0x8ba7f8] /opt/mongo/bin/mongod [0x94572e]
/opt/mongo/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverE+0x2b85)
[0x949e45]
/opt/mongo/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugE+0x125)
[0x94c395]
/opt/mongo/bin/mongod(_ZN5mongo21applyOperation_inlockERKNS_7BSONObjEb+0xb81)
[0x830411]
/opt/mongo/bin/mongod(_ZN5mongo11ReplSetImpl9syncApplyERKNS_7BSONObjE+0x1fb)
[0x82100b]
/opt/mongo/bin/mongod(_ZN5mongo11ReplSetImpl8syncTailEv+0xce5)
[0x821d85]
/opt/mongo/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0xc8)
[0x8231d8]
/opt/mongo/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x4a)
[0x82439a] /opt/mongo/bin/mongod(_ZN5mongo15startSyncThreadEv+0xa0)
[0x824820] /opt/mongo/bin/mongod(thread_proxy+0x80) [0xaa4560]
/lib64/libpthread.so.0 [0x3ecf40683d] /lib64/libc.so.6(clone+0x6d)
[0x3ececd4f8d] Fri Feb 20 10:17:12 [rsSync] replSet syncTail: 10334
Invalid BSONObj size: -286331154 (0xEEEEEEEE) first element: _id:
ObjectId('54d96ab42d46d5c91edf6651'), syncing: { ts: Timestamp
1424343003000|76, h: -3506864587493877515, v: 2, op: "u", ns:
"gingko.docindex", o2: { _id: ObjectId('54c220444a9d689b8ca0bb59') },
o: { $set: { l: 0 } } }
When I found these, the Mongo version was 1.4.9; I then updated to 2.6.7 and restarted the replica set. After that, the error kept happening, and the mongod process exited.
2015-02-20T11:05:11.962+0800 [repl writer worker 1] Assertion:
10334:BSONObj size: -286331154 (0xEEEEEEEE) is invalid. Size must be
between 0 and 16793600(16MB) First element: _id:
ObjectId('54d96ab42d46d5c91edf6651') 2015-02-20T11:05:11.972+0800
[repl writer worker 1] gingko.docindex 0x11fd1b1 0x119efa9 0x1183b56
0x11840ac 0x77402b 0xf02038 0xd34f4d 0xc4d1e9 0xc58f46 0xc47be7
0xe5684b 0xeba03e 0xeba950 0x119284e 0x1241b49 0x3ecf40683d
0x3ececd4f8d mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11fd1b1]
mongod(_ZN5mongo10logContextEPKc+0x159) [0x119efa9]
mongod(_ZN5mongo11msgassertedEiPKc+0xe6) [0x1183b56] mongod
[0x11840ac] mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x41b)
[0x77402b] mongod(_ZNK5mongo7DiskLoc3objEv+0x68) [0xf02038]
mongod(_ZN5mongo12IDHackRunner7getNextEPNS_7BSONObjEPNS_7DiskLocE+0x33d)
[0xd34f4d]
mongod(_ZN5mongo6updateERKNS_13UpdateRequestEPNS_7OpDebugEPNS_12UpdateDriverEPNS_14CanonicalQueryE+0x9a9)
[0xc4d1e9] mongod(_ZN5mongo14UpdateExecutor7executeEv+0x66)
[0xc58f46]
mongod(_ZN5mongo6updateERKNS_13UpdateRequestEPNS_7OpDebugE+0x27)
[0xc47be7]
mongod(_ZN5mongo21applyOperation_inlockERKNS_7BSONObjEbb+0x199b)
[0xe5684b]
mongod(_ZN5mongo7replset8SyncTail9syncApplyERKNS_7BSONObjEb+0x4fe)
[0xeba03e]
mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x50)
[0xeba950] mongod(_ZN5mongo10threadpool6Worker4loopEv+0x19e)
[0x119284e] mongod [0x1241b49] /lib64/libpthread.so.0 [0x3ecf40683d]
/lib64/libc.so.6(clone+0x6d) [0x3ececd4f8d]
2015-02-20T11:05:11.972+0800 [repl writer worker 1] ERROR: writer
worker caught exception: :: caused by :: 10334 BSONObj size:
-286331154 (0xEEEEEEEE) is invalid. Size must be between 0 and 16793600(16MB) First element: _id:
ObjectId('54d96ab42d46d5c91edf6651') on: { ts: Timestamp
1424343003000|76, h: -3506864587493877515, v: 2, op: "u", ns:
"gingko.docindex", o2: { _id: ObjectId('54c220444a9d689b8ca0bb59') },
o: { $set: { l: 0 } } } 2015-02-20T11:05:11.972+0800 [repl writer
worker 1] Fatal Assertion 16360 2015-02-20T11:05:11.972+0800 [repl
writer worker 1]
***aborting after fassert() failure
the table "gingko.docindex" have 120 million records, read heavy, write not so heavy. by the way, another slave is all right when syncing, and the whole replset was built up 3 years ago.
i search a lot for this, but didn't get any similar problem, any one can give any advise? thanks a lot!
lvzheng