What does "Fatal Assertion 18750" mean? - MongoDB

In my MongoDB replica set, I found that one of the secondary nodes is down, and when I checked the db log, I found this:
I REPL [rsBackgroundSync] repl: old cursor isDead, will initiate a new one
I REPL [ReplicationExecutor] syncing from: primary-node-ip:portNum
I REPL [SyncSourceFeedback] replset setting syncSourceFeedback to primary-node-ip:portNum
I REPL [rsBackgroundSync] replSet our last op time fetched: Nov 25 05:41:01:85
I REPL [rsBackgroundSync] replset source's GTE: Nov 25 05:41:02:1
F REPL [rsBackgroundSync] replSet need to rollback, but in inconsistent state
I REPL [rsBackgroundSync] minvalid: 5a1891f0:187 our last optime: 5a1891ed:28
I - [rsBackgroundSync] Fatal Assertion 18750
I - [rsBackgroundSync]
***aborting after fassert() failure
I googled, but couldn't really find any page that explains this 18750 fatal assertion clearly.
The MongoDB version is 3.0.

That particular assertion can be traced back to the MongoDB 3.0 series, which matches the version you mention.
In particular, the cause of the assertion is printed in the logs you posted:
F REPL [rsBackgroundSync] replSet need to rollback, but in inconsistent state
This message was printed by this part of the source code: https://github.com/mongodb/mongo/blob/v3.0/src/mongo/db/repl/rs_rollback.cpp#L837-L841
What that message means is that the node needs to perform a rollback, but it discovered that it is unable to do so because it is in an inconsistent state (i.e. no rollback can be performed safely).
One possible cause of this issue is an unreliable network connection between the application and the replica set, or between the replica set nodes themselves, although the exact cause can differ from one deployment to another.
Please see Rollbacks During Replica Set Failover for more information regarding rollbacks.
Unfortunately there is not much that can be done in this case except resyncing the asserting node. Please see Resync a Member of a Replica Set for details on how to do so.
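In case it helps, here is a rough sketch of what that resync amounts to, assuming a standard replica set deployment; the shell call is only there to check member states, and the actual wipe-and-restart happens at the OS level on the affected node:
// From any healthy member, note which node is asserting and what state it is in:
rs.status().members.forEach(function (m) {
    print(m.name + " : " + m.stateStr);
});
// Then, on the affected node only (outside the shell):
//   1. stop the mongod process
//   2. empty the contents of its dbPath directory (keep a copy elsewhere if disk space allows)
//   3. start mongod again with the same replica set configuration
// On restart the member goes through STARTUP2 while it performs an initial sync
// from another member, and it should return to SECONDARY once the sync completes.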

Related

Find out when SECONDARY became PRIMARY

Let's say I have a three-server setup. Two servers store data, and one server is an arbiter.
Last week, my 'PRIMARY' server went down, and the 'SECONDARY' was promoted and things continued working as expected.
However, I'm now debugging another bug in my application that I think might be related to this change in the replication setup.
Is there any way I can find out (from the logs or whatnot) WHEN exactly the election occurred?
You can find the following lines in the logs of the new 'PRIMARY':
2018-08-02T03:56:49.817+0000 I REPL [ReplicationExecutor] Standing for election
2018-08-02T03:56:49.818+0000 I REPL [ReplicationExecutor] not electing self, ip-xxx-xxx-xxx-xxx.eu-west-1.compute.internal:27017 has same OpTime as us: { : Timestamp 1533182831000|1 }
2018-08-02T03:56:49.818+0000 I REPL [ReplicationExecutor] possible election tie; sleeping 445ms until 2018-08-02T03:56:50.263+0000
2018-08-02T03:56:50.263+0000 I REPL [ReplicationExecutor] Standing for election
2018-08-02T03:56:50.265+0000 I REPL [ReplicationExecutor] not electing self, ip-xxx-xxx-xxx-xxx.eu-west-1.compute.internal:27017 has same OpTime as us: { : Timestamp 1533182831000|1 }
2018-08-02T03:56:50.265+0000 I REPL [ReplicationExecutor] running for election; slept last election, so running regardless of possible tie
2018-08-02T03:56:50.265+0000 I REPL [ReplicationExecutor] received vote: 1 votes from ip-xxx-xxx-xxx-xxx.eu-west-1.compute.internal:27017
2018-08-02T03:56:50.265+0000 I REPL [ReplicationExecutor] election succeeded, assuming primary role
2018-08-02T03:56:50.265+0000 I REPL [ReplicationExecutor] transition to PRIMARY
You can see that the election took place at 3:56 AM UTC.
I advise using the less tool to search your log file:
less /var/log/mongodb/mongod.log
Then navigate to the end of the file with G and search backward with ? for 'Standing for election'.
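If the relevant log has already rotated away, rs.status() can give a similar answer from the shell; on reasonably recent versions the entry for the current primary carries electionTime/electionDate fields (treat the exact field names as an assumption for your particular version):
// Run from any member; prints when the current primary assumed its role.
rs.status().members
    .filter(function (m) { return m.stateStr === "PRIMARY"; })
    .forEach(function (m) {
        printjson({ name: m.name, electionDate: m.electionDate });
    });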

rsBackgroundSync: Fatal Assertion 18752 and aborted mongod process

My MongoDB version is 3.2.10.
The mongod process was terminated, and when I checked the log file I found this content:
2017-05-24T06:26:14.824+0000 I REPL [ReplicationExecutor] This node is in the config
2017-05-24T06:26:14.824+0000 I REPL [ReplicationExecutor] transition to STARTUP2
2017-05-24T06:26:14.824+0000 I REPL [ReplicationExecutor] Starting replication applier threads
2017-05-24T06:26:14.825+0000 I REPL [ReplicationExecutor] transition to RECOVERING
2017-05-24T06:26:14.827+0000 I REPL [ReplicationExecutor] transition to SECONDARY
2017-05-24T06:26:14.827+0000 I REPL [ReplicationExecutor] Member is now in state PRIMARY
2017-05-24T06:26:15.045+0000 I REPL [ReplicationExecutor] Member is now in state ARBITER
2017-05-24T06:26:16.358+0000 I NETWORK [initandlisten] connection accepted from 10.52.202.233:35445 #2 (2 connections now open)
2017-05-24T06:26:17.825+0000 I REPL [ReplicationExecutor] syncing from:
2017-05-24T06:26:17.829+0000 I REPL [SyncSourceFeedback] replset setting syncSourceFeedback to
2017-05-24T06:26:17.829+0000 I REPL [rsBackgroundSync] replSet our last op time fetched: May 24 01:02:13:2
2017-05-24T06:26:17.830+0000 I REPL [rsBackgroundSync] replset source's GTE: May 24 01:02:56:1
2017-05-24T06:26:17.830+0000 I REPL [rsBackgroundSync] beginning rollback
2017-05-24T06:26:17.830+0000 I REPL [rsBackgroundSync] rollback 0
2017-05-24T06:26:17.830+0000 I REPL [ReplicationExecutor] transition to ROLLBACK
2017-05-24T06:26:17.830+0000 I REPL [rsBackgroundSync] rollback 1
2017-05-24T06:26:17.830+0000 I REPL [rsBackgroundSync] rollback 2 FindCommonPoint
2017-05-24T06:26:17.831+0000 I REPL [rsBackgroundSync] replSet info rollback our last optime: May 24 01:02:13:2
2017-05-24T06:26:17.831+0000 I REPL [rsBackgroundSync] replSet info rollback their last optime: May 24 06:25:57:3
2017-05-24T06:26:17.831+0000 I REPL [rsBackgroundSync] replSet info rollback diff in end of log times: -19424 seconds
2017-05-24T06:26:19.583+0000 F REPL [rsBackgroundSync] warning: log line attempted (8477k) over max size (10k), printing beginning and end ... replSet error can't rollback this command yet: {...
2017-05-24T06:26:19.583+0000 I REPL [rsBackgroundSync] replSet cmdname=applyOps
2017-05-24T06:26:19.583+0000 E REPL [rsBackgroundSync] replica set fatal exception
2017-05-24T06:26:19.583+0000 I - [rsBackgroundSync] Fatal Assertion 18752
2017-05-24T06:26:19.583+0000 I - [rsBackgroundSync]
How can I bring it back, and what do "log line attempted (8477k) over max size (10k)" and "Fatal Assertion 18752" mean?
Currently, node2 has become the primary.
Thanks,
Hiko
In the end, I deleted all the data files and re-synced from another node. That fixed it.

Spark rdd.count() yields inconsistent results

I'm a bit baffled.
A simple rdd.count() gives different results when run multiple times.
Here is the code I run:
val inputRdd = sc.newAPIHadoopRDD(inputConfig,
  classOf[com.mongodb.hadoop.MongoInputFormat],
  classOf[Long],
  classOf[org.bson.BSONObject])
println(inputRdd.count())
It opens a connection to a MongoDB server and simply counts the objects.
Seems pretty straightforward to me.
According to MongoDB there are 3,349,495 entries.
Here is my Spark output; all runs used the same jar:
spark1 : 3.257.048
spark2 : 3.303.272
spark3 : 3.303.272
spark4 : 3.303.272
spark5 : 3.303.271
spark6 : 3.303.271
spark7 : 3.303.272
spark8 : 3.303.272
spark9 : 3.306.300
spark10: 3.303.272
spark11: 3.303.271
Spark and MongoDB run on the same cluster.
We are running:
Spark version 1.5.0-cdh5.6.1
Scala version 2.10.4
MongoDb version 2.6.12
Unfortunately we cannot update these.
Is Spark non-deterministic?
Is there anyone who can enlighten me?
Thanks in advance
EDIT/ Further Info
I just noticed an error in our mongod.log.
Could this error cause the inconsistent behaviour?
[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds
[rsBackgroundSync] replSet syncing to: hadoop05:27017
[rsBackgroundSync] replSet not trying to sync from hadoop05:27017, it is vetoed for 600 more seconds
[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds
[rsBackgroundSync] replSet not trying to sync from hadoop05:27017, it is vetoed for 600 more seconds
[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds
[rsBackgroundSync] replSet error RS102 too stale to catch up, at least from hadoop05:27017
[rsBackgroundSync] replSet our last optime : Jul 2 10:19:44 57777920:111
[rsBackgroundSync] replSet oldest at hadoop05:27017 : Jul 5 15:17:58 577bb386:59
[rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
[rsBackgroundSync] replSet error RS102 too stale to catch up
As you already spotted, the problem does not appear to be with Spark (or Scala) but with MongoDB.
As such, the question regarding the differing counts seems to be resolved.
You will still want to troubleshoot the actual MongoDB error; the link from the log is a good starting point: http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
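Before resyncing, it can be worth checking how far behind the stale member fell relative to the primary's oplog window; the two shell helpers below exist in the 2.6/3.x shell (later shells rename the second one to rs.printSecondaryReplicationInfo()):
// On the primary: oplog size and the time range it currently covers.
rs.printReplicationInfo()
// Lag of each secondary behind the primary; a lag larger than the oplog
// window is exactly the RS102 "too stale to catch up" situation.
rs.printSlaveReplicationInfo()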
count returns an estimated count, so the value returned can change even if the number of documents hasn't.
countDocuments was added in MongoDB 4.0 to provide an accurate count (and it also works in multi-document transactions).
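As a small illustration from the shell, with a placeholder collection name (countDocuments needs a 4.0+ server and a shell or driver that exposes the helper):
// count() may be answered from collection metadata, which can be inaccurate
// after an unclean shutdown or in sharded/replicated setups with orphaned documents:
db.events.count()
// countDocuments() runs an aggregation and counts the documents that actually match the filter:
db.events.countDocuments({})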

MongoDB 2.6.6 crashed with "Invalid access at address"

Running MongoDB on my company's QA env, I ran into this error in the log:
2015-02-22T04:48:06.194-0500 [rsHealthPoll] SEVERE: Invalid access at address: 0
2015-02-22T04:48:06.290-0500 [rsHealthPoll] SEVERE: Got signal: 11 (Segmentation fault).
Backtrace:0xf62526 0xf62300 0xf6241f 0x7fc70b581710 0xca12c2 0xca14e7 0xca3bb6 0xf02995 0xefb6d8 0xf9af1c 0x7fc70b5799d1 0x7fc70a2d28fd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x26) [0xf62526]
/usr/bin/mongod() [0xf62300]
/usr/bin/mongod() [0xf6241f]
/lib64/libpthread.so.0(+0xf710) [0x7fc70b581710]
/usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask12tryHeartbeatEPNS_7BSONObjEPi+0x52) [0xca12c2]
/usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask17_requestHeartbeatERNS_13HeartbeatInfoERNS_7BSONObjERi+0xf7) [0xca14e7]
/usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask6doWorkEv+0x96) [0xca3bb6]
/usr/bin/mongod(_ZN5mongo4task4Task3runEv+0x25) [0xf02995]
/usr/bin/mongod(_ZN5mongo13BackgroundJob7jobBodyEv+0x128) [0xefb6d8]
/usr/bin/mongod() [0xf9af1c]
/lib64/libpthread.so.0(+0x79d1) [0x7fc70b5799d1]
/lib64/libc.so.6(clone+0x6d) [0x7fc70a2d28fd]
It seems there's some segmentation fault in rsHealthPoll.
This is from a mongod instance running as part of a replica set in a shard-ready cluster (2 mongods + 1 arbiter running with config servers and mongos processes).
This DB mostly receives writes of new records, periodic updates that set a boolean to true for some records, and some reads driven by user activity querying it (a single collection at the moment).
Googling this error only turned up other, older, already-solved segmentation fault bugs in the MongoDB Jira.
Has anyone seen this recently, or does anyone know the reason?

MongoDB: How to remove an index on a replicaset?

I see that the MongoDB documentation says an index is removed by calling db.accounts.dropIndex( { "tax-id": 1 } ), but it does not say whether the node needs to be removed from the replica set or not.
I tried taking a secondary node in the replica set offline, restarting it as a standalone node (on a different port), and dropping the index there.
But after bringing the node back into the replica set with the regular sudo service mongod start, the mongod process dies saying the index is corrupted.
Thu Oct 31 19:52:38.098 [repl writer worker 1] Assertion: 15898:error in index possibly corruption consider repairing 382
0xdddd81 0xd9f55b 0xd9fa9c 0x7edb83 0x7fb332 0x7fdc08 0x9d3b50 0x9c796e 0x9deb64 0xac45dd 0xac58df 0xa903fa 0xa924c7 0xa71f6c 0xc273d3 0xc26b18 0xdab721 0xe26609 0x7ff4d05f0c6b 0x7ff4cf9965ed
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdddd81]
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0xd9f55b]
/usr/bin/mongod() [0xd9fa9c]
/usr/bin/mongod(_ZN5mongo11checkFailedEj+0x143) [0x7edb83]
/usr/bin/mongod(_ZNK5mongo12BucketBasicsINS_12BtreeData_V1EE11basicInsertENS_7DiskLocERiS3_RKNS_5KeyV1ERKNS_8OrderingE+0x222) [0x7fb332]
/usr/bin/mongod(_ZNK5mongo11BtreeBucketINS_12BtreeData_V1EE10insertHereENS_7DiskLocEiS3_RKNS_5KeyV1ERKNS_8OrderingES3_S3_RNS_12IndexDetailsE+0x68) [0x7fdc08]
/usr/bin/mongod(_ZNK5mongo30IndexInsertionContinuationImplINS_12BtreeData_V1EE22doIndexInsertionWritesEv+0xa0) [0x9d3b50]
/usr/bin/mongod(_ZN5mongo14IndexInterface13IndexInserter19finishAllInsertionsEv+0x1e) [0x9c796e]
/usr/bin/mongod(_ZN5mongo24indexRecordUsingTwoStepsEPKcPNS_16NamespaceDetailsENS_7BSONObjENS_7DiskLocEb+0x754) [0x9deb64]
/usr/bin/mongod(_ZN5mongo11DataFileMgr6insertEPKcPKvibbbPb+0x123d) [0xac45dd]
/usr/bin/mongod(_ZN5mongo11DataFileMgr16insertWithObjModEPKcRNS_7BSONObjEbb+0x4f) [0xac58df]
/usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEPNS_11RemoveSaverEbRKNS_24QueryPlanSelectionPolicyEb+0x2eda) [0xa903fa]
/usr/bin/mongod(_ZN5mongo27updateObjectsForReplicationEPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEbRKNS_24QueryPlanSelectionPolicyE+0xb7) [0xa924c7]
/usr/bin/mongod(_ZN5mongo21applyOperation_inlockERKNS_7BSONObjEbb+0x65c) [0xa71f6c]
/usr/bin/mongod(_ZN5mongo7replset8SyncTail9syncApplyERKNS_7BSONObjEb+0x713) [0xc273d3]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x48) [0xc26b18]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdab721]
/usr/bin/mongod() [0xe26609]
/lib64/libpthread.so.0(+0x7c6b) [0x7ff4d05f0c6b]
/lib64/libc.so.6(clone+0x6d) [0x7ff4cf9965ed]
Thu Oct 31 19:52:38.106 [repl writer worker 1] ERROR: writer worker caught exception: error in index possibly corruption consider repairing 382 on:
xxxxxxxx--deleted content related to the data...xxxxxxxxxxxxx
Thu Oct 31 19:52:38.106 [repl writer worker 1] Fatal Assertion 16360
0xdddd81 0xd9dc13 0xc26bfc 0xdab721 0xe26609 0x7ff4d05f0c6b 0x7ff4cf9965ed
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdddd81]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd9dc13]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc26bfc]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdab721]
/usr/bin/mongod() [0xe26609]
/lib64/libpthread.so.0(+0x7c6b) [0x7ff4d05f0c6b]
/lib64/libc.so.6(clone+0x6d) [0x7ff4cf9965ed]
Thu Oct 31 19:52:38.108 [repl writer worker 1]
***aborting after fassert() failure
Thu Oct 31 19:52:38.108 Got signal: 6 (Aborted).
Is this due to dropping the index while the secondary was offline? Any suggestions on the proper way to drop the index are highly appreciated.
The proper way to remove an index from a replica set is to drop it on the primary. The idea of replication is that every member keeps the same copy of the data (with small time lags), so whatever you do on the primary is replicated to the secondaries once it completes on the primary.
So if you remove an index on the primary, the index will be removed on the secondaries as well.
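For example, connected to the primary (collection and index key taken from the question):
// Drop the index on the primary; the drop is written to the oplog and
// replicated to the secondaries automatically, with no restarts needed.
db.accounts.dropIndex({ "tax-id": 1 })
// Verify that it is gone:
db.accounts.getIndexes()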