Data loss due to unexpected failover of MongoDB replica set

So I encountered the following issue recently:
I have a 5-member replica set (priorities in parentheses):
1 x primary (2)
2 x secondary (0.5)
1 x hidden backup (0)
1 x arbiter (0)
One of the secondaries with priority 0.5 (let's call it B) had a network issue and intermittent connectivity with the rest of the replica set. However, despite having staler data and a lower priority than the existing primary (let's call it A), it assumed the primary role:
[ReplicationExecutor] VoteRequester: Got no vote from xxx because: candidate's data is staler than mine, resp:{ term: 29, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
[ReplicationExecutor] election succeeded, assuming primary role in term 29
[ReplicationExecutor] transition to PRIMARY
Meanwhile, on A, which had no connectivity issues with the rest of the replica set:
[ReplicationExecutor] stepping down from primary, because a new term has begun: 29
So, Question 1: how was this possible under these circumstances?
Moving on, A (now a secondary) began rolling back data:
[rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 28, timestamp: xxx). source's GTE: (term: 29, timestamp: xxx) hashes: (xxx/xxx)
[rsBackgroundSync] beginning rollback
[rsBackgroundSync] rollback 0
[ReplicationExecutor] transition to ROLLBACK
This removed data that had already been written. So, Question 2: how does an OplogStart go missing?
Last but not least, Question 3: how can this be prevented?
Thank you in advance!

Are you using version 3.2.x with protocolVersion=1 (you can check this with the rs.conf() command)? If so, there is a "bug" in voting.
You can work around this bug in either or both of these ways:
Change protocolVersion to 0:
cfg = rs.conf();
cfg.protocolVersion=0;
rs.reconfig(cfg);
Change all priorities to the same value.
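As a rough aid for the two workarounds above, here is a minimal sketch that inspects an rs.conf()-shaped document and reports whether protocolVersion is 1 and whether electable members have different priorities. The function name electionRiskWarnings is hypothetical. It assumes a missing protocolVersion field means protocol version 0 and a missing priority means the default of 1. It only flags the two conditions the answer describes; it does not detect the bug itself.

```javascript
// Hypothetical helper: flag the two conditions the answer associates with
// the 3.2.x voting "bug". `cfg` is assumed to be shaped like rs.conf().
function electionRiskWarnings(cfg) {
  var warnings = [];
  // Assumption: a missing protocolVersion means pv0. Number() is used so a
  // shell NumberLong compares like a plain number.
  var pv = cfg.protocolVersion === undefined ? 0 : Number(cfg.protocolVersion);
  if (pv === 1) {
    warnings.push("protocolVersion is 1");
  }
  // Collect the distinct priorities of electable members
  // (non-arbiters with priority > 0; priority defaults to 1).
  var distinct = [];
  cfg.members.forEach(function (m) {
    var p = m.priority === undefined ? 1 : m.priority;
    if (!m.arbiterOnly && p > 0 && distinct.indexOf(p) === -1) {
      distinct.push(p);
    }
  });
  if (distinct.length > 1) {
    warnings.push("electable members have different priorities: " + distinct.join(", "));
  }
  return warnings;
}

// In the mongo shell: electionRiskWarnings(rs.conf())
```

For the set in the question (priorities 2, 0.5, 0.5, 0, 0 under protocolVersion 1), both warnings would be reported.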
EDIT:
These tickets explain it, more or less:
Ticket 1
Ticket 2

Related

MongoDB Resync Failure

We have a sharded cluster with 4 shards, each a PSA (primary-secondary-arbiter) replica set. The overall database size is around 5 TB. The secondary of one shard failed, and we started resyncing it from the primary.
We are facing an issue when trying to resync data from the primary to the secondary.
MongoDB version: 4.0.18
Data size for that shard: 571 GB
Oplog size: default
Error Message:
2020-10-06T08:57:57.165+0530 I REPL [replication-339] We are too stale to use host:port as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1601947649, 446) is before their earliest timestamp: Timestamp(1601951946, 330) for 1min until: 2020-10-06T08:58:57.165+0530
You need to perform an initial sync of the dead node. See https://docs.mongodb.com/manual/core/replica-set-sync/#initial-sync.
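The error message itself shows how far behind the member was. Timestamp(seconds, increment) values are ordered by seconds and then by increment, and the first field is seconds since the Unix epoch. A minimal sketch, with hypothetical helper names, comparing the two timestamps from the log:

```javascript
// Compare two oplog timestamps given as {t: seconds, i: increment}.
// Returns -1, 0 or 1.
function compareTs(a, b) {
  if (a.t !== b.t) return a.t < b.t ? -1 : 1;
  if (a.i !== b.i) return a.i < b.i ? -1 : 1;
  return 0;
}

// The member is "too stale" when its last fetched op is older than the
// earliest entry still in the sync source's oplog. Returns how many seconds
// older it is, or 0 if it is not behind that point.
function stalenessGapSeconds(lastFetched, sourceEarliest) {
  return compareTs(lastFetched, sourceEarliest) < 0
    ? sourceEarliest.t - lastFetched.t
    : 0;
}

// Values copied from the log line above.
var lastFetched = { t: 1601947649, i: 446 };
var earliest = { t: 1601951946, i: 330 };
var gap = stalenessGapSeconds(lastFetched, earliest);
console.log(gap, (gap / 60).toFixed(1) + " min"); // prints: 4297 71.6 min
```

So the member's last fetched op was about 71.6 minutes older than the oldest entry left in the source's oplog, which is why it cannot catch up by tailing the oplog and needs the initial sync described above.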

How to log newly instantiated member status changes in a replica set

I'm trying to benchmark when a new mongo replica set member:
first communicates with my replica set
changes state from STARTUP2 -> RECOVERING
changes state from RECOVERING -> SECONDARY
Specifically, I'm looking for the exact time it took the new member to run its initial sync, or for timestamps I can parse for these events.
Is there a log that will give me this information?
I'm currently using the following script in the mongo shell:
var MAXSCRIPT_RUN_ITERATIONS = 1800; // one poll per second, so at least ~30 minutes
for (var i = 0; i < MAXSCRIPT_RUN_ITERATIONS; i++) {
    sleep(1000);
    var datetime = new Date().toLocaleString();
    var members = rs.status().members;
    // Change this index to the new member's position in rs.status().members
    var resyncmember = members[0];
    var resyncMemberState = resyncmember.stateStr;
    print("--------------");
    print("Member Count: " + members.length + " " + datetime);
    // Print the current state of every member
    for (var n = 0; n < members.length; n++) {
        var member = members[n];
        print("HOST: " + member.name + " State: " + member.stateStr);
    }
    // Stop polling once the new member reports SECONDARY
    if (resyncMemberState === "SECONDARY") {
        print("########################");
        print("resyncMember finished: " + datetime);
        print("########################");
        break;
    }
}
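The polling script prints states but does not record when they change. One option is to remember each member's last state between polls and timestamp every change, which gives the STARTUP2 -> RECOVERING -> SECONDARY durations directly. A minimal sketch, assuming member objects shaped like rs.status().members (name, stateStr); stateChanges is a hypothetical helper, not a shell built-in. A member seen for the first time is reported with from: null, so the first poll reports every member once, and accuracy is limited to the poll interval.

```javascript
// Hypothetical helper: compare the states remembered in `previous`
// (host name -> stateStr) with the current members and return one event per
// member whose state changed. `previous` is updated in place.
function stateChanges(previous, members, now) {
  var events = [];
  members.forEach(function (m) {
    var before = previous.hasOwnProperty(m.name) ? previous[m.name] : null;
    if (before !== m.stateStr) {
      events.push({ host: m.name, from: before, to: m.stateStr, at: now });
    }
    previous[m.name] = m.stateStr;
  });
  return events;
}

// Possible use inside the polling loop in the mongo shell:
//   var previous = {};  // declared once, before the loop
//   stateChanges(previous, rs.status().members, new Date()).forEach(function (e) {
//     print(e.at.toISOString() + " " + e.host + ": " + e.from + " -> " + e.to);
//   });
// Subtracting the `at` values of two events gives a duration in milliseconds.
```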
You don't state which version you're running, so I'll give information for 3.0; 3.2 should be the same.
Look through your mongodb log for entries like the following:
2016-02-25T14:59:43.684+0000 I REPL [rsSync] initial sync drop all databases
2016-02-25T14:59:43.684+0000 I REPL [rsSync] initial sync clone all databases
2016-02-25T14:59:43.688+0000 I REPL [rsSync] initial sync cloning db: admin
2016-02-25T14:59:43.833+0000 I REPL [rsSync] initial sync cloning db: db1
2016-02-26T10:31:33.763+0000 I REPL [rsSync] initial sync cloning db: test
2016-02-26T11:27:48.480+0000 I REPL [rsSync] initial sync data copy, starting syncup
2016-02-26T11:27:48.481+0000 I REPL [rsSync] oplog sync 1 of 3
2016-02-26T11:27:49.043+0000 I REPL [ReplicationExecutor] syncing from: xxxxxxxxxx:27017
2016-02-26T11:27:49.059+0000 I REPL [SyncSourceFeedback] replset setting syncSourceFeedback to xxxxxxxxxx:27017
2016-02-26T11:30:05.649+0000 I REPL [rsSync] oplog sync 2 of 3
2016-02-26T11:30:05.657+0000 I REPL [rsSync] initial sync building indexes
2016-02-26T11:30:05.657+0000 I REPL [rsSync] initial sync cloning indexes for : admin
2016-02-26T11:30:05.760+0000 I REPL [rsSync] initial sync cloning indexes for : db1
2016-02-26T11:43:37.262+0000 I REPL [rsSync] initial sync cloning indexes for : test
2016-02-26T11:43:48.271+0000 I REPL [rsSync] oplog sync 3 of 3
2016-02-26T11:43:48.319+0000 I REPL [rsSync] initial sync finishing up
2016-02-26T11:43:48.319+0000 I REPL [rsSync] replSet set minValid=56d03a74:1
2016-02-26T11:43:48.321+0000 I REPL [rsSync] initial sync done
2016-02-26T11:43:48.332+0000 I REPL [ReplicationExecutor] transition to RECOVERING
2016-02-26T11:43:48.348+0000 I REPL [ReplicationExecutor] transition to SECONDARY
As you can see, there is an entry for each stage of the initial sync process and for the transition to SECONDARY. You will see more lines than this; I've cut it down to show what you should look for.
You can view this in real time on Linux or macOS with a command similar to the following (run it before adding the new node, or you might miss some lines):
tail -f /path/to/mongodb.log | grep REPL
This watches your log file as entries are added and displays only the lines containing the string REPL. Remember to kill the tail command once you've gotten the information you need.
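Because each of these lines starts with a timestamp, the initial sync duration can also be computed after the fact from the saved log. A minimal sketch, assuming the 3.0-style format shown above (first field is an ISO-8601 timestamp with a +0000-style offset); markerTimes is a hypothetical helper. Applied to the excerpt above, "initial sync drop all databases" to "initial sync done" is 20 h 44 min 4.637 s.

```javascript
// Hypothetical helper: map each marker string to the Date of the first log
// line that contains it. Assumes the first space-separated field of each line
// is a timestamp such as "2016-02-25T14:59:43.684+0000".
function markerTimes(logText, markers) {
  var result = {};
  logText.split("\n").forEach(function (line) {
    var ts = line.split(" ")[0];
    markers.forEach(function (marker) {
      if (result[marker] === undefined && line.indexOf(marker) !== -1) {
        // Rewrite "+0000" as "+00:00" so Date receives valid ISO 8601.
        result[marker] = new Date(ts.replace(/([+-]\d{2})(\d{2})$/, "$1:$2"));
      }
    });
  });
  return result;
}

// Three lines from the log excerpt above.
var excerpt = [
  "2016-02-25T14:59:43.684+0000 I REPL [rsSync] initial sync drop all databases",
  "2016-02-26T11:43:48.321+0000 I REPL [rsSync] initial sync done",
  "2016-02-26T11:43:48.348+0000 I REPL [ReplicationExecutor] transition to SECONDARY"
].join("\n");

var t = markerTimes(excerpt, ["initial sync drop all databases", "initial sync done"]);
var ms = t["initial sync done"] - t["initial sync drop all databases"];
console.log(ms / 1000); // prints: 74644.637
```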

MongoDB storageEngine from MMAPv1 to wiredTiger fassert() failure

I am converting my cluster to WiredTiger using this guide: https://docs.mongodb.org/manual/tutorial/change-replica-set-wiredtiger/
I have been having the following issue:
Environment details:
MongoDB 3.0.9 in a sharded cluster on Red Hat Enterprise Linux Server release 6.2 (Santiago). I have 4 shards, each a replica set with 3 members. I recently upgraded all binaries from 2.4 to 3.0.9. With the updated binaries on every server, I tried converting each replica set to the WiredTiger storage engine, but I got the following error when converting the secondary on one member server (shard 1):
2016-02-09T12:36:39.366-0500 F REPL [rsSync] replication oplog stream went back in time. previous timestamp: 56b9c217:ab newest timestamp: 56b9b429:60. Op being applied: { ts: Timestamp 1455010857000|96, h: 2267356763748731326, v: 2, op: "d", ns: "General.Tickets", fromMigrate: true, b: true, o: { _id: ObjectId('566aec7bdfd4b700e73d64db') }
2016-02-09T12:36:39.366-0500 I - [rsSync] Fatal Assertion 18905
2016-02-09T12:36:39.366-0500 I - [rsSync]
***aborting after fassert() failure
This is an open bug with replication: https://jira.mongodb.org/browse/SERVER-17081
On every other part of the cluster the upgrade went flawlessly; however, I am now stuck with only the primary and one secondary on shard 1. I've tried resyncing the broken member with both MMAPv1 and WiredTiger, but I keep getting the error above. Because of this, one shard is stuck on MMAPv1, and that shard happens to hold most of the data (700 GB).
I have also tried rebooting and reinstalling the binaries, to no avail.
Any help is appreciated.
I solved this by dropping our giant collection. rsSync must have been hitting some limit, since that collection had 4.5 billion documents.
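The timestamps in the fassert message appear to be seconds:increment pairs written in hex. This reading is supported by the message itself: 56b9b429:60 decodes to 1455010857 seconds with increment 96, which matches the op being applied (Timestamp 1455010857000|96). A small sketch (parseHexTs is a hypothetical helper) shows the newest op is 3566 seconds, roughly an hour, older than the previous one, which is what "went back in time" refers to.

```javascript
// Hypothetical helper: decode a "seconds:increment" pair written in hex,
// e.g. "56b9c217:ab", into {t: seconds, i: increment}.
function parseHexTs(s) {
  var parts = s.split(":");
  return { t: parseInt(parts[0], 16), i: parseInt(parts[1], 16) };
}

// Values copied from the fassert message above.
var previous = parseHexTs("56b9c217:ab");
var newest = parseHexTs("56b9b429:60");

console.log(newest.t, newest.i);                      // prints: 1455010857 96
console.log(previous.t - newest.t);                   // prints: 3566
console.log(new Date(newest.t * 1000).toISOString()); // prints: 2016-02-09T09:40:57.000Z
```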

SocketException in Mongo

I just set up a replica set in MongoDB (production environment), and I'm now getting a lot of exceptions like the one below (clipped).
I ran serverStatus in the mongo shell on my primary node, and there are only about 300 open connections, so it's hardly under load.
Below are my connection option settings in my server code:
auto_connect_retry = false
connections_per_host = 10
threads_multiplier = 10
max_wait_time = 120000
connect_timeout = 10000
socket_timeout = 0
Do I have something mis-configured?
Sep 9, 2013 8:31:26 PM com.mongodb.DBPortPool gotError
WARNING: emptying DBPortPool to /10.0.8.10:27017 b/c of error
java.net.SocketException: Connection timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:146)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at org.bson.io.Bits.readFully(Bits.java:46)
at org.bson.io.Bits.readFully(Bits.java:33)
at org.bson.io.Bits.readFully(Bits.java:28)
at com.mongodb.Response.<init>(Response.java:40)
at com.mongodb.DBPort.go(DBPort.java:142)
at com.mongodb.DBPort.call(DBPort.java:92)
at com.mongodb.DBTCPConnector.innerCall(DBTCPConnector.java:244)
at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:216)
at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:288)
at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:273)
at com.mongodb.DBCollection.findOne(DBCollection.java:347)
at com.mongodb.DBCollection.findOne(DBCollection.java:332)
at com.mongodb.casbah.MongoCollectionBase$class.findOneByID(MongoCollection.scala:232)
at com.mongodb.casbah.MongoCollection.findOneByID(MongoCollection.scala:866)
at com.novus.salat.dao.SalatDAO.findOneById(SalatDAO.scala:353)
at com.novus.salat.dao.ModelCompanion$class.findOneById(ModelCompanion.scala:173)
In a replica set, a connection timeout is generally caused by one of the following:
1) Not all members can communicate with each other.
2) A client connects to the replica set to write, but cannot reach the primary because the primary is overloaded (or because of 1).
3) The members are not in sync, and one is lagging too far behind.
4) An election is in progress but has not completed for some reason.
Check that your replica set is consistent and all nodes are healthy by running rs.status() on the primary node. As suggested earlier, also check the primary's logs for more information.
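To act on this checklist, here is a minimal sketch that reads an rs.status()-shaped document and reports unreachable members (health other than 1), a missing PRIMARY, and secondaries whose optimeDate lags the primary by more than a threshold. replicaSetProblems and its maxLagSeconds parameter are hypothetical; pick a threshold that suits your workload. It covers causes 1, 3 and 4; overload (cause 2) needs serverStatus or the logs instead.

```javascript
// Hypothetical helper: summarize an rs.status()-shaped document using the
// members[].health, stateStr and optimeDate fields.
function replicaSetProblems(status, maxLagSeconds) {
  var problems = [];
  var primary = null;
  status.members.forEach(function (m) {
    if (m.health !== 1) problems.push(m.name + " is unreachable"); // cause 1
    if (m.stateStr === "PRIMARY") primary = m;
  });
  if (primary === null) {
    problems.push("no PRIMARY (election in progress or failing?)"); // cause 4
    return problems;
  }
  // Cause 3: compare each secondary's last applied op time with the primary's.
  status.members.forEach(function (m) {
    if (m.stateStr !== "SECONDARY") return;
    var lag = (primary.optimeDate - m.optimeDate) / 1000;
    if (lag > maxLagSeconds) problems.push(m.name + " lags primary by " + lag + " s");
  });
  return problems;
}

// In the mongo shell: replicaSetProblems(rs.status(), 30)
```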

Strange thing about mongodb-erlang driver when using replica set

My code is like this:
Replset = {<<"rs1">>, [{localhost, 27017}, {localhost, 27018}, {localhost, 27019}]},
Conn_Pool = resource_pool:new(mongo:rs_connect_factory(Replset), 10),
...
Conn = resource_pool:get(Conn_Pool),
case mongo:do(safe, master, Conn, ?DATABASE,
    fun() ->
        mongo:insert(mytable, {'_id', 26, d, 11})
    end)
...
27017 is the primary node, so of course I can insert the data successfully.
But when I list only one secondary node instead of all the replica set instances, Replset = {<<"rs1">>, [{localhost, 27019}]}, I can still insert the data.
I expected an exception or error, but the data was written successfully.
Why did that happen?
When you connect to a replica set, you specify the name of the replSet and some of the node names as seeds. The driver connects to the seed nodes in turn and discovers the real replica set membership, configuration, and status via the db.isMaster() command.
Since it discovers which node is the primary that way, it can then route all your write requests accordingly. The same technique is what lets it fail over automatically to the newly elected primary when the original primary fails.
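A minimal sketch of the discovery described above, assuming isMaster replies carry the setName, ismaster, hosts, and primary fields: start from the seed list, ask each host, and follow the hosts and primary it reports until one replies ismaster: true. discoverPrimary and the isMaster callback are hypothetical stand-ins for what a driver does over the network; real drivers also keep monitoring the set, which this does not.

```javascript
// Hypothetical sketch of seed-list discovery. `isMaster(host)` stands in for
// running the isMaster command against one host (null if unreachable).
function discoverPrimary(setName, seeds, isMaster) {
  var queue = seeds.slice();
  var seen = {};
  while (queue.length > 0) {
    var host = queue.shift();
    if (seen[host]) continue;
    seen[host] = true;
    var reply = isMaster(host);
    if (!reply || reply.setName !== setName) continue; // unreachable or wrong set
    if (reply.ismaster) return host;
    // A secondary reports the membership and, if known, the current primary.
    if (reply.primary) queue.unshift(reply.primary);
    (reply.hosts || []).forEach(function (h) { queue.push(h); });
  }
  return null;
}

// Canned replies mirroring the question: only the secondary 27019 is a seed.
var replies = {
  "localhost:27019": { setName: "rs1", ismaster: false, secondary: true,
    hosts: ["localhost:27017", "localhost:27018", "localhost:27019"],
    primary: "localhost:27017" },
  "localhost:27017": { setName: "rs1", ismaster: true,
    hosts: ["localhost:27017", "localhost:27018", "localhost:27019"] }
};
console.log(discoverPrimary("rs1", ["localhost:27019"], function (h) {
  return replies[h] || null;
})); // prints: localhost:27017
```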