MongoDB storage engine change from MMAPv1 to WiredTiger: fassert() failure

I am upgrading my cluster to WiredTiger following this guide: https://docs.mongodb.org/manual/tutorial/change-replica-set-wiredtiger/
I have run into the following issue.
Environment details:
MongoDB 3.0.9 in a sharded cluster on Red Hat Enterprise Linux Server release 6.2 (Santiago). I have 4 shards, each of which is a replica set with 3 members. I recently upgraded all binaries from 2.4 to 3.0.9, so every server has updated binaries. I then tried converting each replica set to the WiredTiger storage engine, but I got the following error when upgrading the secondary on one member server (shard 1):
2016-02-09T12:36:39.366-0500 F REPL [rsSync] replication oplog stream went back in time. previous timestamp: 56b9c217:ab newest timestamp: 56b9b429:60. Op being applied: { ts: Timestamp 1455010857000|96, h: 2267356763748731326, v: 2, op: "d", ns: "General.Tickets", fromMigrate: true, b: true, o: { _id: ObjectId('566aec7bdfd4b700e73d64db') }
2016-02-09T12:36:39.366-0500 I - [rsSync] Fatal Assertion 18905
2016-02-09T12:36:39.366-0500 I - [rsSync]
***aborting after fassert() failure
This is an open bug with replication: https://jira.mongodb.org/browse/SERVER-17081
Everywhere else in the cluster the upgrade went flawlessly; however, I am now stuck with only the primary and one secondary on shard 1. I've attempted resyncing the broken member using both MMAPv1 and WiredTiger, but I keep getting the error above. Because of this, one shard is stuck on MMAPv1, and that shard happens to hold most of the data (700 GB).
I have also tried rebooting and re-installing the binaries, to no avail.
Any help is appreciated.

I solved this by dropping our giant collection. The rsSync thread must have been hitting some limit, since the giant collection had 4.5 billion documents.
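For reference, a minimal sketch of that workaround from the mongo shell, assuming the oversized collection is General.Tickets (the namespace from the oplog entry above, used here purely as an example) and that its data can be sacrificed or has already been dumped elsewhere:

// on the PRIMARY of the affected shard
use General
db.Tickets.stats()   // check document count and size before deciding
db.Tickets.drop()    // irreversible: take a dump/backup first if the data matters

// then wipe the broken member's dbpath and restart it with
// storageEngine=wiredTiger so it performs a fresh initial sync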

Related

MongoDB Resync Failure

We have a sharded cluster with 4 shards in a PSA (Primary-Secondary-Arbiter) architecture. The overall DB size is around 5 TB. The secondary of one shard failed, so we started a resync from the primary.
We are facing an issue when trying to resync data from the primary to the secondary.
MongoDB version: 4.0.18
Data size for that shard: 571 GB
Oplog size: default
Error Message:
2020-10-06T08:57:57.165+0530 I REPL [replication-339] We are too stale to use host:port as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1601947649, 446) is before their earliest timestamp: Timestamp(1601951946, 330) for 1min until: 2020-10-06T08:58:57.165+0530
You need to do an initial sync to the dead node. See https://docs.mongodb.com/manual/core/replica-set-sync/#initial-sync.
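As a rough sketch of what to check first (the 50 GB figure below is a placeholder, not a value from the question): the default oplog is often far too small to cover an initial sync of ~571 GB, so the member goes "too stale" again before it finishes. On 4.0 with WiredTiger the oplog can be resized online before re-running the initial sync:

// on the PRIMARY of that shard
rs.printReplicationInfo()   // configured oplog size and the time window it currently covers

// grow the oplog so the window outlasts the initial sync, e.g. to 50 GB (size is in MB)
db.adminCommand({ replSetResizeOplog: 1, size: 51200 })

// then stop the dead node, clear its dbpath, and start it again
// so it performs a fresh initial sync from the primary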

Data loss due to unexpected failover of MongoDB replica set

So I encountered the following issue recently:
I have a 5-member replica set (priorities in parentheses):
1 x primary (2)
2 x secondary (0.5)
1 x hidden backup (0)
1 x arbiter (0)
One of the secondary replicas with 0.5 priority (let's call it B) encountered some network issue and had intermittent connectivity with the rest of the replica set. However, despite having staler data and a lower priority than the existing primary (let's call it A), it assumed the primary role:
[ReplicationExecutor] VoteRequester: Got no vote from xxx because: candidate's data is staler than mine, resp:{ term: 29, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
[ReplicationExecutor] election succeeded, assuming primary role in term 29
[ReplicationExecutor] transition to PRIMARY
And for A, despite not having any connection issues with the rest of the replica set:
[ReplicationExecutor] stepping down from primary, because a new term has begun: 29
So Question 1 is, how could this have been possible given the circumstances?
Moving on, A (now a secondary) began rolling back data:
[rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 28, timestamp: xxx). source's GTE: (term: 29, timestamp: xxx) hashes: (xxx/xxx)
[rsBackgroundSync] beginning rollback
[rsBackgroundSync] rollback 0
[ReplicationExecutor] transition to ROLLBACK
This caused data that had already been written to be removed. So Question 2 is: how does an OplogStart go missing?
Last but not least, Question 3, how can this be prevented?
Thank you in advance!
Are you using version 3.2.x with protocolVersion=1? (You can check with the rs.conf() command.) There is a known "bug" in the voting behaviour.
You can prevent this bug by doing one (or both) of the following:
change protocolVersion to 0:
cfg = rs.conf();
cfg.protocolVersion=0;
rs.reconfig(cfg);
change all priorities to the same value (see the sketch below)
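A minimal sketch of that second option, assuming a set like the one above where the hidden member and the arbiter should stay at priority 0:

cfg = rs.conf();
cfg.members.forEach(function (m) {
    // equalise the electable members; leave priority-0 members (hidden, arbiter) alone
    if (m.priority > 0) { m.priority = 1; }
});
rs.reconfig(cfg);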
EDIT:
These are the tickets that explain it, more or less:
Ticket 1
Ticket 2

Invalid access on Mongo 3.0

The problem happened on our development site. The following query crashes the server with Invalid access at 0x20:
db['2015-04-13'].group({
    key: { id: 1 },
    cond: { created_at: { $gte: new Date('2015-04-13') } },
    reduce: function (curr, result) {},
    initial: {}
})
Traceback:
mongod(_ZN2v88internal2OS8AllocateEmPmb+0xD7) [0x11dbe57]
mongod(_ZN2v88internal28CreateTranscendentalFunctionENS0_19TranscendentalCache4TypeE+0x26) [0x12799f6]
mongod(_ZN2v88internal22init_fast_sin_functionEv+0xE) [0x11dca1e]
mongod(_ZN2v88internal14POSIXPostSetUpEv+0x9) [0x11dd009]
mongod(_ZN2v88internal2V828InitializeOncePerProcessImplEv+0x3E) [0x12551de]
mongod(_ZN2v88internal12CallOnceImplEPlPFvPvES2_+0x52) [0x11c2c12]
mongod(_ZN2v88internal2V810InitializeEPNS0_12DeserializerE+0x11) [0x1255911]
mongod(_ZN2v86LockerC1EPNS_7IsolateE+0x61) [0x12597c1]
So far I know:
The problem occurs only when mongod runs as its own user (mongod).
If mongod is started as root on the same data folder, the query passes and returns results. The number of documents in the collection is fairly small (around 20k), but there is a decent number of keys in each document: 50 on average and 300 at most, most of them strings with very few embedded BSON documents. The MongoDB version is 3.0.2; the query was issued both through a local client with the same version as the server and through a 2.4.0 Robomongo client on a remote machine, and the error appears in both cases.

mongo shell throwing error while running "show collections" command

My mongo shell starts without any error,
>use mydb also works properly (here the db name is mydb),
but when I run the show collections command, it shows the following error:
>show collections
Wed Oct 15 17:38:30 uncaught exception: error: {
"$err" : "file /var/lib/mongodb/mydb.6 open/create failed in createPrivateMap (look in log for more information)",
"code" : 13636
}
Here is the error log:
17:38:22 [initandlisten] connection accepted from 127.0.0.1:53178 #1
17:38:30 [conn1] ERROR: mmap private failed with out of memory. You are using a 32-bit build and probably need to upgrade to 64
17:38:30 [conn1] Assertion: 13636:file /var/lib/mongodb/mydb.6 open/create failed in createPrivateMap (look in log for more information)
17:38:30 [conn1] assertion 13636 file /var/lib/mongodb/mydb.6 open/create failed in createPrivateMap (look in log for more information) ns:mydb.system.namespaces query:{}
17:39:01 [clientcursormon] mem (MB) res:2 virt:90 mapped:0
Based on a solution given for another Stack Overflow question, "couldn't connect to server 127.0.0.1 shell/mongo.js", I tried the same steps in my case and the problem was solved for the time being. But the main issue is that whenever I shut down my machine and restart it, I get the same error again and have to repeat the same steps (as given in the linked answer) to get the mongo shell working, which ultimately leads to data loss within collections. Can anyone suggest what the reason could be? Is there some problem with my MongoDB installation? Please let me know if anyone has had a similar issue and successfully resolved it. Thanks.
I think there are two possible sources of the problem:
your computer doesn't have enough RAM available
as the log message says, you are using a 32-bit build of MongoDB and should move to a 64-bit build, because you have too much data to memory-map with 32 bits (more than about 2.5 GB; see the sketch below)
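A quick way to check both from the mongo shell against the affected instance (db.stats() may itself fail with the same mmap error, but the first two calls only read server metadata):

db.serverBuildInfo().bits          // 32 here would confirm a 32-bit build
db.hostInfo().system.memSizeMB     // RAM visible to the server
db.getSiblingDB("mydb").stats()    // dataSize of the database hitting the limit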

mongodb sharded collection query failed: setShardVersion failed host

I encountered a problem after adding a shard to a MongoDB cluster.
I did the following operations:
1. Deployed a MongoDB cluster with a primary shard named 'shard0002' (10.204.8.155:27010) for all databases.
2. For some reason I removed it, and after migration finished I added a new shard on a different host (10.204.8.100:27010), which was automatically named shard0002 too.
3. Then added another shard (the one removed in step 1), named 'shard0003'.
4. Executed a query on a sharded collection.
5. The following error appeared:
mongos> db.count.find()
error: {
"$err" : "setShardVersion failed host: 10.204.8.155:27010 { errmsg: \"exception: gotShardName different than what i had before before [shard0002] got [shard0003] \", code: 13298, ok: 0.0 }",
"code" : 10429
}
I tried to rename the shard, but it's not allowed:
mongos> use config
mongos> db.shards.update({_id: "shard0003"}, {$set: {_id: "shard_11"}})
Mod on _id not allowed
I have also tried to remove it; draining started, but the process seems to hang.
What can I do?
------------------------
Last update (24/02/2014 00:29)
I found the answer on Google: since mongod keeps its own cache of the shard configuration, just restart the sharded mongod process and the problem is fixed.
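For reference, a rough sketch of that restart done from the shell, assuming you can connect to the affected shard's mongod (e.g. 10.204.8.155:27010) directly rather than through the mongos; how you start it again depends on your init system:

// on the shard's mongod itself, ask it to shut down cleanly
db.adminCommand({ shutdown: 1 })
// then start the mongod process again with its usual options/config file;
// on startup the stale cached shard name is gone and the query works again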