MongoDB Resync Failure - mongodb

We have a sharded cluster with 4 shards in a PSA (Primary-Secondary-Arbiter) architecture. The overall DB size is around 5 TB. The secondary of one shard failed, so we started a resync from the primary.
We are facing an issue when trying to resync data from the primary to the secondary.
MongoDB version: 4.0.18
Data size for that shard: 571 GB
Oplog size: default
Error message:
2020-10-06T08:57:57.165+0530 I REPL [replication-339] We are too stale to use host:port as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1601947649, 446) is before their earliest timestamp: Timestamp(1601951946, 330) for 1min until: 2020-10-06T08:58:57.165+0530

You need to perform an initial sync on the dead node. See https://docs.mongodb.com/manual/core/replica-set-sync/#initial-sync.
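Since the shard holds roughly 571 GB and the oplog is the default size, it is also worth confirming that the primary's oplog window is long enough to cover the whole initial sync before retrying, otherwise the new member will go stale again. A minimal sketch in the mongo shell, run on the primary of the affected shard (the 50 GB figure is only illustrative, not from the original answer):
// Show how many hours of writes the oplog currently covers.
rs.printReplicationInfo()
// If that window is shorter than the expected initial-sync time for ~571 GB,
// enlarge the oplog before resyncing (WiredTiger only; size is in megabytes,
// and 51200 MB = 50 GB is just an illustrative value).
db.adminCommand({ replSetResizeOplog: 1, size: 51200 })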

Related

mongoimport loading only 1000 rows on sharding

I have a MongoDB sharding setup configured like this:
6 config servers
3 shard servers (each with a replica)
6 routers
For example:
s1->s2 (first shard with replica: primary s1, secondary s2)
s3->s4 (second shard with replica: primary s3, secondary s4)
s5->s6 (third shard with replica: primary s5, secondary s6)
Config servers and routers run on all servers, i.e. s1 to s6.
I am not able to import data into one of the empty sharded collections; the data is in CSV format.
I am running mongoimport in the background and the nohup output looks like this:
2017-01-10T17:13:18.444+0530 [........................] dbname.collectionname 364.0 KB/46.1 MB (0.8%)
mongoimport is stuck. How can I fix this?
I first tried to run mongoimport on s2 without success, then tried it on s1, also without success.
The following are the errors from the router and config server logs:
HostnameCanonicalizationWorker
[rsBackgroundSync] we are too stale to use **** as a sync source
REPL [ReplicationExecutor] could not find member to sync from
REPL [ReplicationExecutor] The liveness timeout does not match callback handle, so not resetting it.
REPL [rsBackgroundSync] too stale to catch up -- entering maintenance mode
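No answer is quoted for this one, but the log lines suggest the same root cause as above: the shard's secondary has fallen too far behind the primary's oplog and entered maintenance mode. A quick diagnostic sketch (not from the original thread; s1 is the placeholder used in the question) that can be run in the mongo shell on the shard's primary:
// An old optimeDate and a RECOVERING state for the secondary confirm the
// "too stale" condition reported in the logs.
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  " + m.optimeDate);
});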

Data loss due to unexpected failover of MongoDB replica set

So I encountered the following issue recently:
I have a 5-member replica set (priorities in parentheses):
1 x primary (2)
2 x secondary (0.5)
1 x hidden backup (0)
1 x arbiter (0)
One of the secondary replicas with 0.5 priority (let's call it B) encountered a network issue and had intermittent connectivity with the rest of the replica set. However, despite having staler data and a lower priority than the existing primary (let's call it A), it assumed the primary role:
[ReplicationExecutor] VoteRequester: Got no vote from xxx because: candidate's data is staler than mine, resp:{ term: 29, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
[ReplicationExecutor] election succeeded, assuming primary role in term 29
[ReplicationExecutor] transition to PRIMARY
And for A, despite not having any connection issues with the rest of the replica set:
[ReplicationExecutor] stepping down from primary, because a new term has begun: 29
So Question 1 is, how could this have been possible given the circumstances?
Moving on, A (now a secondary) began rolling back data:
[rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 28, timestamp: xxx). source's GTE: (term: 29, timestamp: xxx) hashes: (xxx/xxx)
[rsBackgroundSync] beginning rollback
[rsBackgroundSync] rollback 0
[ReplicationExecutor] transition to ROLLBACK
This caused data which was written to be removed. So Question 2 is: How does an OplogStart go missing?
Last but not least, Question 3, how can this be prevented?
Thank you in advance!
Are you using version 3.2.x with protocolVersion=1? (You can check with the rs.conf() command.) There is a known "bug" in the voting behavior.
You can prevent this bug in one (or both) of the following ways:
Change protocolVersion to 0:
cfg = rs.conf();
cfg.protocolVersion=0;
rs.reconfig(cfg);
Change all priorities to the same value, for example as in the sketch below.
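A sketch of that second option in the mongo shell, leaving the hidden backup and the arbiter at priority 0:
cfg = rs.conf();
// Give every electable, data-bearing member the same priority.
cfg.members.forEach(function (m) {
    if (!m.hidden && !m.arbiterOnly) { m.priority = 1; }
});
rs.reconfig(cfg);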
EDIT:
These are the tickets that explain it, more or less:
Ticket 1
Ticket 2

MongoDB storageEngine from MMAPv1 to wiredTiger fassert() failure

I am upgrading my cluster to WiredTiger using this guide: https://docs.mongodb.org/manual/tutorial/change-replica-set-wiredtiger/
I have been having the following issue.
Environment details:
MongoDB 3.0.9 in a sharded cluster on Red Hat Enterprise Linux Server release 6.2 (Santiago). I have 4 shards; each one is a replica set with 3 members. I recently upgraded all binaries from 2.4 to 3.0.9. Every server has updated binaries. I tried converting each replica set to the WiredTiger storage engine, but I got the following error when upgrading the secondary on one member server (shard 1):
2016-02-09T12:36:39.366-0500 F REPL [rsSync] replication oplog stream went back in time. previous timestamp: 56b9c217:ab newest timestamp: 56b9b429:60. Op being applied: { ts: Timestamp 1455010857000|96, h: 2267356763748731326, v: 2, op: "d", ns: "General.Tickets", fromMigrate: true, b: true, o: { _id: ObjectId('566aec7bdfd4b700e73d64db') }
2016-02-09T12:36:39.366-0500 I - [rsSync] Fatal Assertion 18905
2016-02-09T12:36:39.366-0500 I - [rsSync]
***aborting after fassert() failure
This is an open bug with replication: https://jira.mongodb.org/browse/SERVER-17081
For every other part of the cluster the upgrade went flawlessly; however, I am now stuck with only the primary and one secondary on shard 1. I've attempted resyncing the broken member using both MMAPv1 and WiredTiger, but I continually get the error above. Because of this, one shard is stuck on MMAPv1, and that shard happens to hold most of the data (700 GB).
I have also tried rebooting and re-installing the binaries, to no avail.
Any help is appreciated.
I solved this by dropping our giant collection. The rsSync must have been hitting some limit, since that collection had 4.5 billion documents.
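For what it's worth, here is a small sketch (not from the original answer) for spotting an oversized collection like that before attempting another resync; "General" is the database name taken from the log line above:
// Print the document count of every collection in the database from the log.
var d = db.getSiblingDB("General");
d.getCollectionNames().forEach(function (c) {
    print(c + ": " + d.getCollection(c).count());
});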

mongodb sharded collection query failed: setShardVersion failed host

I have encountered a problem after adding a shard to a MongoDB cluster.
I did the following operations:
1. Deployed a MongoDB cluster with a primary shard named 'shard0002' (10.204.8.155:27010) for all databases.
2. For some reason I removed it and, after the migration finished, added a new shard on a different host (10.204.8.100:27010), which was automatically named shard0002 too.
3. Then added another shard (the one removed in step 1), which was named 'shard0003'.
4. Executed a query on a sharded collection.
5. The following errors appeared:
mongos> db.count.find()
error: {
"$err" : "setShardVersion failed host: 10.204.8.155:27010 { errmsg: \"exception: gotShardName different than what i had before before [shard0002] got [shard0003] \", code: 13298, ok: 0.0 }",
"code" : 10429
}
I tried to rename the shard, but that is not allowed:
mongos> use config
mongos> db.shards.update({_id: "shard0003"}, {$set: {_id: "shard_11"}})
Mod on _id not allowed
I have also tried to remove it; draining started but the process seems to hang.
What can I do?
------------------------
Last update (24/02/2014 00:29)
I found the answer on Google: mongod keeps its own cache of the shard configuration, so just restart the sharded mongod process and the problem will be fixed.
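A sketch of that fix from the mongo shell: connect directly to the affected shard's mongod (10.204.8.155:27010 in the error above), shut it down cleanly, restart it with its usual options so it rereads its shard identity, and then rerun the query through mongos.
// Connected directly to the shard mongod, not through mongos:
db.getSiblingDB("admin").shutdownServer()
// Restart the mongod process with its normal configuration, then retry:
// mongos> db.count.find()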

Strange thing about mongodb-erlang driver when using replica set

My code is like this:
Replset = {<<"rs1">>, [{localhost, 27017}, {localhost, 27018}, {localhost, 27019}]},
Conn_Pool = resource_pool:new(mongo:rs_connect_factory(Replset), 10),
...
Conn = resource_pool:get(Conn_Pool),
case mongo:do(safe, master, Conn, ?DATABASE,
              fun() ->
                  mongo:insert(mytable, {'_id', 26, d, 11})
              end) of
...
27017 is the primary node, so of course I can insert the data successfully.
But when I put only one secondary node in the code instead of all of the replica set instances, Replset = {<<"rs1">>, [{localhost, 27019}]}, I can still insert the data.
I thought it would throw an exception or error, but it wrote the data successfully.
Why did that happen?
When you connect to a replica set, you specify the name of the replica set and some of the node names as seeds. The driver connects to the seed nodes in turn and discovers the real replica set membership/config/status via the 'db.isMaster()' command.
Since it discovers which node is the primary that way, it is then able to route all your write requests accordingly. The same technique is what enables it to fail over automatically to the newly elected primary when the original primary fails and a new one is elected.
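You can see the same discovery information the driver relies on by running the command yourself against any seed node; the output below is illustrative and trimmed:
// Run against any member, e.g. localhost:27019 (a secondary):
db.isMaster()
// {
//   "setName" : "rs1",
//   "ismaster" : false,
//   "secondary" : true,
//   "primary" : "localhost:27017",
//   "hosts" : [ "localhost:27017", "localhost:27018", "localhost:27019" ],
//   ...
// }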