Running MongoDB on my company's QA env, I ran into this error in the log:
2015-02-22T04:48:06.194-0500 [rsHealthPoll] SEVERE: Invalid access at address: 0
2015-02-22T04:48:06.290-0500 [rsHealthPoll] SEVERE: Got signal: 11 (Segmentation fault).
Backtrace:0xf62526 0xf62300 0xf6241f 0x7fc70b581710 0xca12c2 0xca14e7 0xca3bb6 0xf02995 0xefb6d8 0xf9af1c 0x7fc70b5799d1 0x7fc70a2d28fd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x26) [0xf62526]
/usr/bin/mongod() [0xf62300]
/usr/bin/mongod() [0xf6241f]
/lib64/libpthread.so.0(+0xf710) [0x7fc70b581710]
/usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask12tryHeartbeatEPNS_7BSONObjEPi+0x52) [0xca12c2]
/usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask17_requestHeartbeatERNS_13HeartbeatInfoERNS_7BSONObjERi+0xf7) [0xca14e7]
/usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask6doWorkEv+0x96) [0xca3bb6]
/usr/bin/mongod(_ZN5mongo4task4Task3runEv+0x25) [0xf02995]
/usr/bin/mongod(_ZN5mongo13BackgroundJob7jobBodyEv+0x128) [0xefb6d8]
/usr/bin/mongod() [0xf9af1c]
/lib64/libpthread.so.0(+0x79d1) [0x7fc70b5799d1]
/lib64/libc.so.6(clone+0x6d) [0x7fc70a2d28fd]
It seems there's some segmentation fault in rsHealthPoll.
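For anyone trying to read the backtrace above: the `_ZN...` strings are C++ symbols mangled under the Itanium ABI, and the simple ones can be decoded by hand. Below is a minimal Python sketch that handles only plain nested names (`_ZN<len><name>...E`). The helper name `demangle_nested_name` is made up for illustration; for templates or anything more complex, a real demangler such as `c++filt` is the better tool.

```python
import re


def demangle_nested_name(symbol: str) -> str:
    """Decode the leading nested name of an Itanium-ABI mangled symbol.

    Only the simple `_ZN[K]<len><name>...E` form is handled. Anything else
    (templates, substitutions, unmangled names) is returned unchanged.
    """
    prefix = re.match(r"_ZNK?", symbol)
    if not prefix:
        return symbol
    pos = prefix.end()
    parts = []
    # Each component is a decimal length followed by that many characters.
    while pos < len(symbol) and symbol[pos].isdigit():
        digits = re.match(r"\d+", symbol[pos:]).group()
        start = pos + len(digits)
        end = start + int(digits)
        parts.append(symbol[start:end])
        pos = end
    # A simple nested name is terminated by 'E'; otherwise give up.
    if not parts or pos >= len(symbol) or symbol[pos] != "E":
        return symbol
    return "::".join(parts)


# Frames copied from the backtrace in the question.
frames = [
    "_ZN5mongo21ReplSetHealthPollTask12tryHeartbeatEPNS_7BSONObjEPi",
    "_ZN5mongo21ReplSetHealthPollTask17_requestHeartbeatERNS_13HeartbeatInfoERNS_7BSONObjERi",
    "_ZN5mongo21ReplSetHealthPollTask6doWorkEv",
]
for frame in frames:
    print(demangle_nested_name(frame))
```

Read this way, the crash is in `mongo::ReplSetHealthPollTask::tryHeartbeat`, called from `_requestHeartbeat` and `doWork`, which matches the `[rsHealthPoll]` thread name in the log.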
This is from a mongod instance running as part of a replica set in a shard-ready cluster (2 mongods + 1 arbiter running with config servers and mongos processes).
This DB mostly receives writes of new records, periodic updates that set a boolean to True on some records, and some reads driven by user queries. (It holds a single collection at the moment.)
Googling this error only turned up older, already-resolved segmentation fault bugs in the MongoDB Jira.
Has anyone seen this recently, or does anyone know the reason?
In my MongoDB replica set, I found that one of the secondary nodes was down, and when I checked db.log, I found this:
I REPL [rsBackgroundSync] repl: old cursor isDead, will initiate a new one
I REPL [ReplicationExecutor] syncing from: primary-node-ip:portNum
I REPL [SyncSourceFeedback] replset setting syncSourceFeedback to primary-node-ip:portNum
I REPL [rsBackgroundSync] replSet our last op time fetched: Nov 25 05:41:01:85
I REPL [rsBackgroundSync] replset source's GTE: Nov 25 05:41:02:1
F REPL [rsBackgroundSync] replSet need to rollback, but in inconsistent state
I REPL [rsBackgroundSync] minvalid: 5a1891f0:187 our last optime: 5a1891ed:28
I - [rsBackgroundSync] Fatal Assertion 18750
I - [rsBackgroundSync]
***aborting after fassert() failure
I googled, but couldn't find any page that clearly explains this fatal assertion 18750.
The MongoDB version is 3.0.
You didn't say the MongoDB version you're using, but that particular assertion can be traced back to MongoDB 3.0 series.
Particularly, the cause of the assertion is printed in the logs you posted:
F REPL [rsBackgroundSync] replSet need to rollback, but in inconsistent state
This message was printed by this part of the source code: https://github.com/mongodb/mongo/blob/v3.0/src/mongo/db/repl/rs_rollback.cpp#L837-L841
What that message means is that the node needs to perform a rollback, but it discovered that it is unable to do so because it is in an inconsistent state (i.e. no rollback can be performed).
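To make the "inconsistent state" more concrete, here is a small Python sketch that decodes the two optimes from the `minvalid:` log line in the question. It assumes both fields are printed as `<hex seconds>:<hex increment>`, and that the 3.0 check fires when minvalid is ahead of the last applied optime. Both are my reading of the log and the linked source, so treat them as assumptions. The `parse_optime` helper is made up for illustration.

```python
import re
from typing import Tuple


def parse_optime(text: str) -> Tuple[int, int]:
    """Parse '<hex seconds>:<hex increment>' into an (seconds, increment) pair.

    Assumption: both fields are hexadecimal, as they appear to be in 3.0 logs.
    """
    secs_hex, inc_hex = text.split(":")
    return int(secs_hex, 16), int(inc_hex, 16)


# Log line copied from the question.
log_line = "minvalid: 5a1891f0:187 our last optime: 5a1891ed:28"
match = re.search(r"minvalid: (\S+) our last optime: (\S+)", log_line)
minvalid = parse_optime(match.group(1))
last_optime = parse_optime(match.group(2))

# Tuples compare by seconds first, then by increment.
seconds_ahead = minvalid[0] - last_optime[0]
inconsistent = minvalid > last_optime

print("minvalid:", minvalid)
print("last optime:", last_optime)
print("minvalid ahead by (s):", seconds_ahead)
print("inconsistent:", inconsistent)
```

With the values from the log, minvalid is a few seconds ahead of the last applied optime. That means the node stopped partway through applying a batch, so its data does not correspond to any single point in the oplog that it could roll back from.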
One possible cause of this issue is an unreliable network connection between the replica set and the application, or between the replica set nodes themselves, although the exact cause may differ from one deployment to another.
Please see Rollbacks During Replica Set Failover for more information regarding rollbacks.
Unfortunately there's not much that can be done in this case except resyncing the asserting node. Please see Resync a Member of a Replica Set for details on how to do so.
centos 6.7
postgresql 9.5.3
I have DB servers running in master-standby replication.
Suddenly, the standby server's PostgreSQL process stopped with these logs:
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]WARNING: page 1671400 of relation base/16400/559613 is uninitialized
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]CONTEXT: xlog redo Heap2/VISIBLE: cutoff xid 1902107520
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]PANIC: WAL contains references to invalid pages
2016-07-14 18:14:19.544 JST [][5783e03b.3cdb][0][15579]CONTEXT: xlog redo Heap2/VISIBLE: cutoff xid 1902107520
2016-07-14 18:14:21.026 JST [][5783e038.3cd9][0][15577]LOG: startup process (PID 15579) was terminated by signal 6: Aborted
2016-07-14 18:14:21.026 JST [][5783e038.3cd9][0][15577]LOG: terminating any other active server processes
The master server's PostgreSQL logs showed nothing special.
However, the master server's /var/log/messages contained the following:
Jul 14 05:38:44 host kernel: sbridge: HANDLING MCE MEMORY ERROR
Jul 14 05:38:44 host kernel: CPU 8: Machine Check Exception: 0 Bank 9: 8c000040000800c0
Jul 14 05:38:44 host kernel: TSC 0 ADDR 1f7dad7000 MISC 90004000400008c PROCESSOR 0:306e4 TIME 1468442324 SOCKET 1 APIC 20
Jul 14 05:38:44 host kernel: EDAC MC1: CE row 1, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#1": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=8 Err=0008:00c0 (ch=0), addr = 0x1f7dad7000 => socket=1, Channel=0(mask=1), rank=4
Jul 14 05:38:44 host kernel:
Jul 14 18:30:40 host kernel: sbridge: HANDLING MCE MEMORY ERROR
Jul 14 18:30:40 host kernel: CPU 8: Machine Check Exception: 0 Bank 9: 8c000040000800c0
Jul 14 18:30:40 host kernel: TSC 0 ADDR 1f7dad7000 MISC 90004000400008c PROCESSOR 0:306e4 TIME 1468488640 SOCKET 1 APIC 20
Jul 14 18:30:41 host kernel: EDAC MC1: CE row 1, channel 0, label "CPU_SrcID#1_Channel#0_DIMM#1": 1 Unknown error(s): memory scrubbing on FATAL area : cpu=8 Err=0008:00c0 (ch=0), addr = 0x1f7dad7000 => socket=1, Channel=0(mask=1), rank=4
Jul 14 18:30:41 host kernel:
The memory errors started about a week ago, so I suspect they are causing the PostgreSQL error.
My questions are:
1) Can a memory error reported by the kernel cause PostgreSQL's "WAL contains references to invalid pages" error?
2) Why are there no related entries in the master server's PostgreSQL log?
Thanks.
Faulty memory can cause all kinds of data corruption, so that seems like a good enough explanation to me.
Perhaps there are no log entries at the master PostgreSQL server because all that was corrupted was the WAL stream.
You can run
oid2name
to find out which database has OID 16400 and then
oid2name -d <database with OID 16400> -f 559613
to find out which table belongs to file 559613.
Is that table larger than 12 GB? If not, that would mean that page 1671400 is indeed an invalid value.
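The 12 GB figure comes from simple arithmetic, sketched below in Python. It assumes the default build settings of 8 kB blocks and 1 GB segment files; a server compiled with other values would give different numbers. The `locate_page` helper is made up for illustration.

```python
BLOCK_SIZE = 8192          # bytes per page (assumed default BLCKSZ)
SEGMENT_SIZE = 1024 ** 3   # bytes per segment file (assumed default, 1 GB)


def locate_page(page_number: int, filenode: int):
    """Return (segment file name, page index within it, minimum relation size in bytes)."""
    pages_per_segment = SEGMENT_SIZE // BLOCK_SIZE
    segment = page_number // pages_per_segment
    page_in_segment = page_number % pages_per_segment
    # Pages are numbered from 0, so the relation must hold page_number + 1 pages.
    min_size_bytes = (page_number + 1) * BLOCK_SIZE
    # The first segment has no numeric suffix; later ones are <filenode>.<n>.
    segment_file = f"{filenode}.{segment}" if segment else str(filenode)
    return segment_file, page_in_segment, min_size_bytes


# Values taken from the WARNING line in the question.
segment_file, page_in_segment, min_size_bytes = locate_page(1671400, 559613)
print(segment_file, page_in_segment, round(min_size_bytes / 2 ** 30, 2))
```

So page 1671400 can only exist if the table is at least about 12.75 GiB, and it would live in the segment file base/16400/559613.12 on disk.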
You didn't say which PostgreSQL version you are using, but there may be replication bugs, fixed in later versions, that could cause replication problems even without a hardware fault; read the release notes.
I would perform a new pg_basebackup and reinitialize the slave system.
But what I'd really be worried about is possible data corruption on the master server. Block checksums are cool (turned on if pg_controldata <data directory> | grep checksum gives you 1), but possibly won't detect the effects of memory corruption.
Try something like
pg_dumpall -f /dev/null
on the master and see if there are errors.
Keep your old backups in case you need to repair something!
I am running mongodb 3.2 on ubuntu 14.04 server 64 bit. The mongodb server keeps crashing. Whenever I restart the server I see this:
stop: Unknown instance:
mongod start/running, process 25687
When I then run the mongo shell, I also get the following error in it:
src/third_party/gperftools-2.2/src/page_heap_allocator.h:74] FATAL ERROR: Out of memory trying to allocate internal tcmalloc data (bytes, object-size) 131072 48
This was not happening on my system before. How can I correct this error? It keeps happening every 2-3 hours.
Complete MongoDB Log File : Download Here
EDIT1: Added mongodb log file.
Mongodb stops whenever I try to make a connection.
When I run
sudo service mongod start
I get a message that mongodb is running.
But then when I try to make a connection using PyMongo I get an error that says
Autoreconnect: connection closed
I check my mongodb status:
sudo service mongod status
And it says that my mongodb instance is stopped/waiting.
My mongo log file reports the following:
2015-09-17T18:19:46.816+0000 I NETWORK [initandlisten] waiting for connections on port 7000
2015-09-17T18:19:58.813+0000 I NETWORK [initandlisten] connection accepted from 54.152.111.120:51387 #1 (1 connection now open)
2015-09-17T18:19:58.816+0000 I STORAGE [conn1] _getOpenFile() invalid file index requested 4
2015-09-17T18:19:58.816+0000 I - [conn1] Invariant failure false src/mongo/db/storage/mmap_v1/mmap_v1_extent_manager.cpp 201
2015-09-17T18:19:58.837+0000 I CONTROL [conn1]
This is followed by a lengthy backtrace that I can't decipher, then closes with:
2015-09-17T18:19:58.837+0000 I - [conn1]
***aborting after invariant() failure
I've looked around SO, particularly trying the top two answers here, but haven't been able to figure out how to solve the problem.
I'll also note that the last time I connected, a week ago, it was working fine.
The MongoDB documentation says that an index is removed by calling db.accounts.dropIndex( { "tax-id": 1 } ), but it does not say whether the node needs to be taken out of the replica set first.
I took a secondary node in a replica set offline, restarted it as a standalone node (on a different port), and dropped the index there.
But after bringing the node back into the replica set the regular way, with sudo service mongod start, the mongod process dies, saying the index is corrupted.
Thu Oct 31 19:52:38.098 [repl writer worker 1] Assertion: 15898:error in index possibly corruption consider repairing 382
0xdddd81 0xd9f55b 0xd9fa9c 0x7edb83 0x7fb332 0x7fdc08 0x9d3b50 0x9c796e 0x9deb64 0xac45dd 0xac58df 0xa903fa 0xa924c7 0xa71f6c 0xc273d3 0xc26b18 0xdab721 0xe26609 0x7ff4d05f0c6b 0x7ff4cf9965ed
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdddd81]
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0xd9f55b]
/usr/bin/mongod() [0xd9fa9c]
/usr/bin/mongod(_ZN5mongo11checkFailedEj+0x143) [0x7edb83]
/usr/bin/mongod(_ZNK5mongo12BucketBasicsINS_12BtreeData_V1EE11basicInsertENS_7DiskLocERiS3_RKNS_5KeyV1ERKNS_8OrderingE+0x222) [0x7fb332]
/usr/bin/mongod(_ZNK5mongo11BtreeBucketINS_12BtreeData_V1EE10insertHereENS_7DiskLocEiS3_RKNS_5KeyV1ERKNS_8OrderingES3_S3_RNS_12IndexDetailsE+0x68) [0x7fdc08]
/usr/bin/mongod(_ZNK5mongo30IndexInsertionContinuationImplINS_12BtreeData_V1EE22doIndexInsertionWritesEv+0xa0) [0x9d3b50]
/usr/bin/mongod(_ZN5mongo14IndexInterface13IndexInserter19finishAllInsertionsEv+0x1e) [0x9c796e]
/usr/bin/mongod(_ZN5mongo24indexRecordUsingTwoStepsEPKcPNS_16NamespaceDetailsENS_7BSONObjENS_7DiskLocEb+0x754) [0x9deb64]
/usr/bin/mongod(_ZN5mongo11DataFileMgr6insertEPKcPKvibbbPb+0x123d) [0xac45dd]
/usr/bin/mongod(_ZN5mongo11DataFileMgr16insertWithObjModEPKcRNS_7BSONObjEbb+0x4f) [0xac58df]
/usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEPNS_11RemoveSaverEbRKNS_24QueryPlanSelectionPolicyEb+0x2eda) [0xa903fa]
/usr/bin/mongod(_ZN5mongo27updateObjectsForReplicationEPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEbRKNS_24QueryPlanSelectionPolicyE+0xb7) [0xa924c7]
/usr/bin/mongod(_ZN5mongo21applyOperation_inlockERKNS_7BSONObjEbb+0x65c) [0xa71f6c]
/usr/bin/mongod(_ZN5mongo7replset8SyncTail9syncApplyERKNS_7BSONObjEb+0x713) [0xc273d3]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x48) [0xc26b18]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdab721]
/usr/bin/mongod() [0xe26609]
/lib64/libpthread.so.0(+0x7c6b) [0x7ff4d05f0c6b]
/lib64/libc.so.6(clone+0x6d) [0x7ff4cf9965ed]
Thu Oct 31 19:52:38.106 [repl writer worker 1] ERROR: writer worker caught exception: error in index possibly corruption consider repairing 382 on:
xxxxxxxx--deleted content related to the data...xxxxxxxxxxxxx
Thu Oct 31 19:52:38.106 [repl writer worker 1] Fatal Assertion 16360
0xdddd81 0xd9dc13 0xc26bfc 0xdab721 0xe26609 0x7ff4d05f0c6b 0x7ff4cf9965ed
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdddd81]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd9dc13]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc26bfc]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdab721]
/usr/bin/mongod() [0xe26609]
/lib64/libpthread.so.0(+0x7c6b) [0x7ff4d05f0c6b]
/lib64/libc.so.6(clone+0x6d) [0x7ff4cf9965ed]
Thu Oct 31 19:52:38.108 [repl writer worker 1]
***aborting after fassert() failure
Thu Oct 31 19:52:38.108 Got signal: 6 (Aborted).
Is this due to dropping the index in offline mode on the secondary? Any suggestions on the proper way to drop the index are highly appreciated.
The proper way to remove an index from a replica set is to drop it on the primary. The idea of a replica set is that every member holds the same copy of the data (with a small time lag), so whatever you do on the primary is replicated to the secondaries: once an operation finishes on the primary, it propagates to the secondaries.
If you drop the index on the primary, it will be dropped on the secondaries as well.