rsBackgroundSync : Fatal Assertion 18752 and abort mongodb process - mongodb

My MongoDB version is 3.2.10.
The mongod process was terminated, and when I checked the log file I found this content:
2017-05-24T06:26:14.824+0000 I REPL [ReplicationExecutor] This node is in the config
2017-05-24T06:26:14.824+0000 I REPL [ReplicationExecutor] transition to STARTUP2
2017-05-24T06:26:14.824+0000 I REPL [ReplicationExecutor] Starting replication applier threads
2017-05-24T06:26:14.825+0000 I REPL [ReplicationExecutor] transition to RECOVERING
2017-05-24T06:26:14.827+0000 I REPL [ReplicationExecutor] transition to SECONDARY
2017-05-24T06:26:14.827+0000 I REPL [ReplicationExecutor] Member is now in state PRIMARY
2017-05-24T06:26:15.045+0000 I REPL [ReplicationExecutor] Member is now in state ARBITER
2017-05-24T06:26:16.358+0000 I NETWORK [initandlisten] connection accepted from 10.52.202.233:35445 #2 (2 connections now open)
2017-05-24T06:26:17.825+0000 I REPL [ReplicationExecutor] syncing from:
2017-05-24T06:26:17.829+0000 I REPL [SyncSourceFeedback] replset setting syncSourceFeedback to
2017-05-24T06:26:17.829+0000 I REPL [rsBackgroundSync] replSet our last op time fetched: May 24 01:02:13:2
2017-05-24T06:26:17.830+0000 I REPL [rsBackgroundSync] replset source's GTE: May 24 01:02:56:1
2017-05-24T06:26:17.830+0000 I REPL [rsBackgroundSync] beginning rollback
2017-05-24T06:26:17.830+0000 I REPL [rsBackgroundSync] rollback 0
2017-05-24T06:26:17.830+0000 I REPL [ReplicationExecutor] transition to ROLLBACK
2017-05-24T06:26:17.830+0000 I REPL [rsBackgroundSync] rollback 1
2017-05-24T06:26:17.830+0000 I REPL [rsBackgroundSync] rollback 2 FindCommonPoint
2017-05-24T06:26:17.831+0000 I REPL [rsBackgroundSync] replSet info rollback our last optime: May 24 01:02:13:2
2017-05-24T06:26:17.831+0000 I REPL [rsBackgroundSync] replSet info rollback their last optime: May 24 06:25:57:3
2017-05-24T06:26:17.831+0000 I REPL [rsBackgroundSync] replSet info rollback diff in end of log times: -19424 seconds
2017-05-24T06:26:19.583+0000 F REPL [rsBackgroundSync] warning: log line attempted (8477k) over max size (10k), printing beginning and end ... replSet error can't rollback this command yet: {...
2017-05-24T06:26:19.583+0000 I REPL [rsBackgroundSync] replSet cmdname=applyOps
2017-05-24T06:26:19.583+0000 E REPL [rsBackgroundSync] replica set fatal exception
2017-05-24T06:26:19.583+0000 I - [rsBackgroundSync] Fatal Assertion 18752
2017-05-24T06:26:19.583+0000 I - [rsBackgroundSync]
How can I bring it back, and what do "log line attempted (8477k) over max size (10k)" and "Fatal Assertion 18752" mean?
Currently, node2 has become the primary.
Thanks,
Hiko

Finally, I deleted all the data files and re-synced from another node. That resolved it.
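For anyone hitting the same assertion: after wiping the data directory and restarting the member, it goes through STARTUP2 and RECOVERING again and should end up back in SECONDARY. A minimal sketch for verifying that the resynced member has caught up, run from the mongo shell against any member of the set (member names in the output are whatever your hosts are called):

// Print the state of every member; the resynced node should report "SECONDARY"
rs.status().members.forEach(function (m) {
    print(m.name + " -> " + m.stateStr);
});

// Replication lag of each secondary relative to the primary
rs.printSlaveReplicationInfo();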

Related

What does "Fatal Assertion 18750" mean?

In my MongoDB replica set, I found that one of the secondary nodes is down, and when I checked the db.log, I found this:
I REPL [rsBackgroundSync] repl: old cursor isDead, will initiate a new one
I REPL [ReplicationExecutor] syncing from: primary-node-ip:portNum
I REPL [SyncSourceFeedback] replset setting syncSourceFeedback to primary-node-ip:portNum
I REPL [rsBackgroundSync] replSet our last op time fetched: Nov 25 05:41:01:85
I REPL [rsBackgroundSync] replset source's GTE: Nov 25 05:41:02:1
F REPL [rsBackgroundSync] replSet need to rollback, but in inconsistent state
I REPL [rsBackgroundSync] minvalid: 5a1891f0:187 our last optime: 5a1891ed:28
I - [rsBackgroundSync] Fatal Assertion 18750
I - [rsBackgroundSync]
***aborting after fassert() failure
I googled, but couldn't really find any page that explains this 18750 fatal assertion clearly.
The MongoDB version is 3.0.
You didn't say which MongoDB version you're using, but that particular assertion can be traced back to the MongoDB 3.0 series.
Particularly, the cause of the assertion is printed in the logs you posted:
F REPL [rsBackgroundSync] replSet need to rollback, but in inconsistent state
This message was printed by this part of the source code: https://github.com/mongodb/mongo/blob/v3.0/src/mongo/db/repl/rs_rollback.cpp#L837-L841
What that message means is that the node needs to perform a rollback, but it discovered that it is unable to do so because it is in an inconsistent state (i.e. no rollback can be performed).
One possible cause of this issue is an unreliable network connection between the replica set and the application, or between the replica set nodes themselves, although the exact cause may differ from one deployment to another.
Please see Rollbacks During Replica Set Failover for more information regarding rollbacks.
Unfortunately there is not much that can be done in this case other than resyncing the asserting node. Please see Resync a Member of a Replica Set for details on how to do so.
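Before resyncing, you can look at the same two values the assertion compares, i.e. the minValid optime versus the last optime this member actually applied. A rough sketch from the mongo shell on the asserting node (which has to be brought up first, e.g. standalone without the replSet option), assuming the local database layout of the 3.0 series:

// The optime the member must reach before its data is considered consistent
db.getSiblingDB("local").replset.minvalid.findOne();

// The newest oplog entry this member has actually applied
db.getSiblingDB("local").oplog.rs.find().sort({ $natural: -1 }).limit(1).next();

// Roughly: if minvalid is ahead of the last applied optime while a rollback is
// required, the node considers itself inconsistent and fasserts with 18750.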

Spark rdd.count() yields inconsistent results

I'm a bit baffled.
A simple rdd.count() gives different results when run multiple times.
Here is the code I run:
val inputRdd = sc.newAPIHadoopRDD(inputConfig,
  classOf[com.mongodb.hadoop.MongoInputFormat],
  classOf[Long],
  classOf[org.bson.BSONObject])
println(inputRdd.count())
It opens a connection to a MongoDB server and simply counts the objects.
Seems pretty straightforward to me.
According to MongoDB there are 3,349,495 entries.
Here is my Spark output; all runs used the same jar:
spark1 : 3.257.048
spark2 : 3.303.272
spark3 : 3.303.272
spark4 : 3.303.272
spark5 : 3.303.271
spark6 : 3.303.271
spark7 : 3.303.272
spark8 : 3.303.272
spark9 : 3.306.300
spark10: 3.303.272
spark11: 3.303.271
Spark and MongoDb are run on the same cluster.
We are running:
Spark version 1.5.0-cdh5.6.1
Scala version 2.10.4
MongoDb version 2.6.12
Unfortunately we cannot update these.
Is Spark non-deterministic?
Is there anyone who can enlighten me?
Thanks in advance
EDIT/ Further Info
I just noticed an error in our mongod.log.
Could this error cause the inconsistent behaviour?
[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds
[rsBackgroundSync] replSet syncing to: hadoop05:27017
[rsBackgroundSync] replSet not trying to sync from hadoop05:27017, it is vetoed for 600 more seconds
[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds
[rsBackgroundSync] replSet not trying to sync from hadoop05:27017, it is vetoed for 600 more seconds
[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds
[rsBackgroundSync] replSet error RS102 too stale to catch up, at least from hadoop05:27017
[rsBackgroundSync] replSet our last optime : Jul 2 10:19:44 57777920:111
[rsBackgroundSync] replSet oldest at hadoop05:27017 : Jul 5 15:17:58 577bb386:59
[rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
[rsBackgroundSync] replSet error RS102 too stale to catch up
As you already spotted, the problem does not appear to be with Spark (or Scala) but with MongoDB.
As such, the question regarding the differing counts seems to be resolved.
You will still want to troubleshoot the actual MongoDB error; the provided link can be a good starting point for that: http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
count returns an estimated count based on collection metadata. As such, the value returned can change even if the number of documents hasn't changed.
countDocuments was added in MongoDB 4.0 to provide an accurate count (and it also works in multi-document transactions).
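To illustrate the difference from the mongo shell (the collection name is hypothetical, and countDocuments() requires MongoDB 4.0 or newer):

// Estimated count taken from collection metadata; fast, but the metadata can be
// inaccurate, for example after an unclean shutdown
db.events.count();

// Accurate count computed by actually scanning the matching documents;
// also usable inside multi-document transactions
db.events.countDocuments({});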

MongoDB Replica Set election fails with "no replset config has been received" response from peers

I'm trying to establish a replica set with three members (one of which is an arbiter). For technical reasons the members must access each other through SSH tunnelling. I am reasonably confident this is configured correctly, as on all of the MongoDB hosts I am able to connect to the other nodes with mongo by providing the relevant --host and --port parameters. When I initiate the replica set on what I'd like to be the primary, the logs show the "initiator" successfully connecting to the other members:
REPL [ReplicationExecutor] transition to RECOVERING
REPL [ReplicationExecutor] transition to SECONDARY
REPL [ReplicationExecutor] Member 10.x.x.1:27017 is now in state STARTUP
REPL [ReplicationExecutor] Member 10.x.x.2:27017 is now in state STARTUP
REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
However, the other members refuse to vote because they don't have any configuration for the replica set:
REPL [ReplicationExecutor] VoteRequester: Got error processing response with status: BadValue: Unexpected field info in ReplSetRequestVotes, resp:{ info: "run rs.initiate(...) if not yet done for the set", ok: 0.0, errmsg: "no replset config has been received", code: 94 }
REPL [ReplicationExecutor] VoteRequester: Got no vote from 10.16.10.4:30000 because: , resp:{ info: "run rs.initiate(...) if not yet done for the set", ok: 0.0, errmsg: "no replset config has been received", code: 94 }
REPL [ReplicationExecutor] VoteRequester: Got error processing response with status: BadValue: Unexpected field info in ReplSetRequestVotes, resp:{ info: "run rs.initiate(...) if not yet done for the set", ok: 0.0, errmsg: "no replset config has been received", code: 94 }
REPL [ReplicationExecutor] VoteRequester: Got no vote from 10.16.10.4:27018 because: , resp:{ info: "run rs.initiate(...) if not yet done for the set", ok: 0.0, errmsg: "no replset config has been received", code: 94 }
REPL [ReplicationExecutor] not running for primary, we received insufficient votes
This process repeats every electionTimeoutMillis.
Running rs.status() on the initiator of the replica set gives a suspicious time for the last heartbeat received from each member:
> rs.status()
...
"lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z")
...
I'm not sure what's cause and effect here. Will a member of the replica set only receive the configuration after the "initiator" has received the heartbeat response? Is there a way to force the initiator to send the configuration to the other members?
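For reference, initiating the set with an explicit configuration document (rather than a bare rs.initiate()) ensures every member is listed under a host:port that all of the other members can actually reach; the addresses below are placeholders for the tunnelled endpoints:

rs.initiate({
    _id: "rs0",    // must match the replSetName each mongod was started with
    members: [
        { _id: 0, host: "10.x.x.1:27017" },
        { _id: 1, host: "10.x.x.2:27017" },
        { _id: 2, host: "10.x.x.3:27017", arbiterOnly: true }
    ]
});

// The configuration then propagates to the other members via heartbeats; once they
// have it, the "no replset config has been received" responses should disappear.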

MongoDB SECONDARY becoming RECOVERING at nighttime

I am running a conventional MongoDB Replica Set consisting of 3 members (member1 in datacenter A, member2 and member3 in datacenter B).
member1 is the current PRIMARY and I am adding members 2 and 3 via rs.add(). They perform their initial sync and become SECONDARY soon after. Everything is fine all day long, and the replication delay of both members is 0 seconds, until 2 AM at night.
Now: every night at 2 AM both members shift into the RECOVERING state and stop replicating altogether, which leads to a replication delay of hours by the time I look at rs.printSlaveReplicationInfo() in the morning. At around 2 AM there are no massive inserts or maintenance tasks that I know of.
I get the following log entries on the PRIMARY:
2015-10-09T01:59:38.914+0200 [initandlisten] connection accepted from 192.168.227.209:59905 #11954 (37 connections now open)
2015-10-09T01:59:55.751+0200 [conn11111] warning: Collection dropped or state deleted during yield of CollectionScan
2015-10-09T01:59:55.869+0200 [conn11111] warning: Collection dropped or state deleted during yield of CollectionScan
2015-10-09T01:59:55.870+0200 [conn11111] getmore local.oplog.rs cursorid:1155433944036 ntoreturn:0 keyUpdates:0 numYields:1 locks(micros) r:32168 nreturned:0 reslen:20 134ms
2015-10-09T01:59:55.872+0200 [conn11111] end connection 192.168.227.209:58972 (36 connections now open)
And, which is more interesting, I get the following log entries on both SECONDARYs:
2015-10-09T01:59:55.873+0200 [rsBackgroundSync] repl: old cursor isDead, will initiate a new one
2015-10-09T01:59:55.873+0200 [rsBackgroundSync] replSet syncing to: member1:27017
2015-10-09T01:59:56.065+0200 [rsBackgroundSync] replSet error RS102 too stale to catch up, at least from member1:27017
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet our last optime : Oct 9 01:59:23 5617035b:17f
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet oldest at member1:27017 : Oct 9 01:59:23 5617035b:1af
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet error RS102 too stale to catch up
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet RECOVERING
Also striking: the start of the oplog "resets" itself every night at around 2 AM:
configured oplog size: 990MB
log length start to end: 19485secs (5.41hrs)
oplog first event time: Fri Oct 09 2015 02:00:33 GMT+0200 (CEST)
oplog last event time: Fri Oct 09 2015 07:25:18 GMT+0200 (CEST)
now: Fri Oct 09 2015 07:25:26 GMT+0200 (CEST)
I am not sure whether this is somehow correlated with the issue. I am also wondering why such a small gap (Oct 9 01:59:23 5617035b:17f <-> Oct 9 01:59:23 5617035b:1af) is enough to make the members stale.
Could this be a server (VM host) time issue, or is it something completely different? (Why is the first oplog event "reset" every night instead of shifting to a timestamp like NOW minus 24 hrs?)
What can I do to investigate and avoid this?
Upping the oplog size should solve this (per our comments).
Some references for others who run into this issue:
Workloads that Might Require a Larger Oplog Size
Error: replSet error RS102 too stale to catch up (link1 & link2)
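To see how much headroom the oplog currently gives you (and to verify the effect of a resize), the mongo shell has a couple of helpers. A sketch; note that the online resize command only exists in MongoDB 3.6 and later, so on older versions the oplog has to be resized with the documented offline procedure:

// Configured oplog size and the time window it currently covers on this member
rs.printReplicationInfo();

// Lag of each secondary relative to the newest oplog entry on the primary
rs.printSlaveReplicationInfo();

// MongoDB 3.6+ only: grow the oplog online to roughly 16 GB (size is in MB)
// db.adminCommand({ replSetResizeOplog: 1, size: 16000 });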

Secondary keeps rolling back

In the last 7 days, our secondary servers went down three times with the following messages. What do these errors mean? Why does the node roll back? I have attached a screenshot of the oplog window and replication lag.
Around 4 AM the server went down. Around 3:50 the replication lag went up to 300 seconds, but that is just 5 minutes, and the node's oplog window is much larger than that.
We take backups from one of the secondaries using MMS; could this be the cause of the issue?
Mon May 19 03:50:27.146 [rsBackgroundSync] replSet syncing to: xxxx.prod.xxxx.net:17017
Mon May 19 03:50:27.231 [rsBackgroundSync] replSet our last op time fetched: May 19 03:50:16:152
Mon May 19 03:50:27.231 [rsBackgroundSync] replset source's GTE: May 19 03:50:16:153
Mon May 19 03:50:27.231 [rsBackgroundSync] replSet rollback 0
Mon May 19 03:50:27.231 [rsBackgroundSync] replSet ROLLBACK
Mon May 19 03:50:27.231 [rsBackgroundSync] replSet rollback 1
Mon May 19 03:50:27.231 [rsBackgroundSync] replSet rollback 2 FindCommonPoint
Mon May 19 03:50:27.232 [rsBackgroundSync] replSet info rollback our last optime: May 19 03:50:16:152
Mon May 19 03:50:27.232 [rsBackgroundSync] replSet info rollback their last optime: May 19 03:50:16:155
Mon May 19 03:50:27.232 [rsBackgroundSync] replSet info rollback diff in end of log times: 0 seconds
Mon May 19 03:50:27.691 [rsBackgroundSync] replSet rollback found matching events at Mar 13 06:12:22:11
Mon May 19 03:50:27.691 [rsBackgroundSync] replSet rollback findcommonpoint scanned : 222891
Mon May 19 03:50:27.691 [rsBackgroundSync] replSet replSet rollback 3 fixup
Mon May 19 03:50:30.065 [rsBackgroundSync] replSet rollback 3.5
Mon May 19 03:50:30.065 [rsBackgroundSync] replSet rollback 4 n:7018
Mon May 19 03:50:30.065 [rsBackgroundSync] replSet minvalid=May 19 03:50:16 5379e1e8:155
Mon May 19 03:50:30.065 [rsBackgroundSync] replSet rollback 4.6
Mon May 19 03:50:30.065 [rsBackgroundSync] replSet rollback 4.7
Mon May 19 03:50:30.443 [rsBackgroundSync] ERROR: rollback cannot find object by id
Mon May 19 03:50:30.444 [rsBackgroundSync] ERROR: rollback cannot find object by id
Mon May 19 03:50:30.444 [rsBackgroundSync] replSet rollback 5 d:4 u:7016
Mon May 19 03:50:30.460 [rsBackgroundSync] replSet rollback 6
We found that the oplog on the primary had somehow become corrupted. We discovered it by running the following queries:
db.oplog.rs.find().sort({$natural:1}).explain()
db.oplog.rs.find().sort({$natural:-1}).explain()
So we stepped down the primary and did a fresh sync.
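For completeness, the step-down itself is a single command on the primary; once another member has been elected, the old primary can be wiped and resynced. A minimal sketch (the 120-second step-down window is arbitrary):

// On the primary with the corrupted oplog: step down and refuse re-election
// for 120 seconds, forcing another member to take over
rs.stepDown(120);

// From any member: confirm that a new primary has been elected
rs.status().members.filter(function (m) { return m.stateStr === "PRIMARY"; });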