Overview: I have a 3-member replica set configuration: one primary, one secondary, and one arbiter. All three are running and in perfect sync with each other.
Problem: All the instances were stopped for server maintenance. They were started in the following order:
Primary instance.
Secondary instance.
Arbiter instance.
After all three services were started, the arbiter failed to rejoin the replica set correctly. Upon checking the logs, I found the following line being printed repeatedly:
Thu Jul 12 18:32:20 [rsStart] replSet can't get local.system.replset config from self or any seed (yet)
Thu Jul 12 18:32:30 [rsStart] replSet can't get local.system.replset config from self or any seed (yet)
Thu Jul 12 18:32:40 [rsStart] replSet can't get local.system.replset config from self or any seed (yet)
Thu Jul 12 18:32:50 [rsStart] replSet can't get local.system.replset config from self or any seed (yet)
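For reference, this is how the state can be checked from the shell while that message is repeating (a minimal sketch; hostnames and ports are whatever your instances use):

// on the arbiter: in this state rs.status() typically returns a startup/error
// status echoing the same "can't get local.system.replset config" message
rs.status()

// on the primary: confirm the arbiter is still listed in the replica set config
rs.conf().members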
Steps performed to resolve:
Stopped the arbiter instance.
Deleted the local db files from arbiter's data folder.
Removed the instances from the replica set (including the secondary).
Started the arbiter instance.
Added both the secondary and the arbiter instance back to the replica set (the commands are sketched below).
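For reference, the shell commands behind those steps were roughly the following (hostnames and ports are placeholders, not the real ones):

// on the primary: drop the secondary and the arbiter from the config
rs.remove("secondary.example.net:27017")
rs.remove("arbiter.example.net:27019")

// on the arbiter host: stop mongod, delete the local.* files from its dbpath, start mongod again

// back on the primary: re-add both members
rs.add("secondary.example.net:27017")
rs.addArb("arbiter.example.net:27019")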
I am trying to figure out the following questions:
Why is the above log statement printed repeatedly?
What might be the cause of such an issue?
Here is the topology I have, running on self-hosted Ubuntu machines on AWS (EC2) (a quick way to confirm the layout from mongos is sketched below the list):
MongoDB version 5.0.3
Mongos - 1 server
Config Servers (3 servers)
Shards (3), each a replica set with 3 members, so 9 data nodes in total
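For reference, the layout can be confirmed from a mongos shell; a minimal check (output omitted, nothing here is specific to this cluster):

sh.status()                          // lists the config servers and each shard's replica set members
db.adminCommand({ listShards: 1 })   // the same shard membership in raw form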
For various reasons, which don't seem to be in the logs, all 3 members of Shard 3 went down. Restarting the processes on these 3 members was the most obvious step; however, this was the output:
ubuntu@XXXXXXXX:~$ mongosh
Current Mongosh Log ID: 619aa17d4fac3c77a109b11b
Connecting to: mongodb://127.0.0.1:27017/?directConnection=true&serverSelectionTimeoutMS=2000
MongoNetworkError: connect ECONNREFUSED 127.0.0.1:27017
Eventually, one member came back as SECONDARY, but restarting the mongod processes on the other members seemed to cascade and kill the first member. At least, so it appeared.
Following this guide, I entered mongosh on Shard 3, Member 1 (3-1), and then disconnected the other members from the replica set. After this, Member 1 refuses to start. The process logs show this:
ubuntu@XXXXXXX:~$ sudo systemctl status mongod
● mongod.service - MongoDB Database Server
Loaded: loaded (/lib/systemd/system/mongod.service; enabled; vendor preset: enabled)
Active: failed (Result: signal) since Sun 2021-11-21 20:02:31 UTC; 798ms ago
Docs: https://docs.mongodb.org/manual
Process: 1094 ExecStart=/usr/bin/mongod --config /etc/mongod.conf (code=killed, signal=ABRT)
Main PID: 1094 (code=killed, signal=ABRT)
Nov 21 20:02:26 ip-172-31-32-42 systemd[1]: Started MongoDB Database Server.
Nov 21 20:02:31 ip-172-31-32-42 systemd[1]: mongod.service: Main process exited, code=killed, sta>
Nov 21 20:02:31 ip-172-31-32-42 systemd[1]: mongod.service: Failed with result 'signal'.
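The systemd status output truncates the interesting lines; the full abort reason should be in the journal or in the mongod log file (the paths below assume the default Ubuntu package layout):

journalctl -u mongod.service --no-pager -n 100    # untruncated unit log around the abort
tail -n 200 /var/log/mongodb/mongod.log           # or whatever systemLog.path points to in /etc/mongod.conf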
Is it possible to get any of these members in RS3 back up?
Is it possible to restore the data that was sharded onto Shard 3? Of course this is only part of the data; Shard 1 and Shard 2 (and their replica sets) are okay.
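A rough sketch of the standalone-dump route I am considering for salvaging the data, with placeholder paths and ports (whether it works at all depends on why mongod keeps aborting):

# start one member's data files standalone, i.e. without the replSet/sharding options
mongod --dbpath /var/lib/mongodb --port 27100 --logpath /tmp/standalone.log --fork

# dump whatever is readable
mongodump --port 27100 --out /backup/shard3-dump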
Currently I have a replica set for the production data.
I am adding new members to the replica set. The state of the new members becomes SECONDARY (after STARTUP, STARTUP2, etc.).
Does that guarantee that all data on the primary member has been replicated to the new members?
Is there any way to make sure that no data is lost after replication?
(Is there anything specified in the official MongoDB docs, i.e. any guarantee that data is not lost, or something similar? I am using MongoDB 3.2.)
When the initial sync is completed (it clones the data from the source and then applies the oplog to keep up with changes to the data set), you can call rs.printSlaveReplicationInfo() from the primary's mongo shell.
rs.printSlaveReplicationInfo()
This will return the last oplog entry applied on each secondary, which is copied from the primary's oplog.rs collection.
The response is returned as:
source: m1.example.net:27017
syncedTo: Thu Apr 10 2014 10:27:47 GMT-0400 (EDT)
0 secs (0 hrs) behind the primary
source: m2.example.net:27017
syncedTo: Thu Apr 10 2014 10:27:47 GMT-0400 (EDT)
0 secs (0 hrs) behind the primary
Notice that both secondary members are 0 seconds behind the primary which indicates no replication lag.
That is essentially the difference between the last operation recorded on the primary and the time it was applied on the secondary.
As an additional precaution, you can note db.stats() on the primary right before starting the sync and collect the same stats (db.stats()) from the newly synced secondaries.
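A minimal sketch of that comparison (nothing below is specific to your setup; the fields are standard db.stats() output):

// on the primary, just before adding the new member
var before = db.stats()
printjson({ objects: before.objects, dataSize: before.dataSize, indexes: before.indexes })

// on the newly synced secondary, once it reports SECONDARY
rs.slaveOk()    // 3.2-era helper that allows reads on a secondary
var after = db.stats()
printjson({ objects: after.objects, dataSize: after.dataSize, indexes: after.indexes })

The numbers will only match exactly if the primary is not taking writes while you compare.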
Read about initial sync here
Running MongoDB on my company's QA env, I ran into this error in the log:
2015-02-22T04:48:06.194-0500 [rsHealthPoll] SEVERE: Invalid access at address: 0
2015-02-22T04:48:06.290-0500 [rsHealthPoll] SEVERE: Got signal: 11 (Segmentation fault).
Backtrace:0xf62526 0xf62300 0xf6241f 0x7fc70b581710 0xca12c2 0xca14e7 0xca3bb6 0xf02995 0xefb6d8 0xf9af1c 0x7fc70b5799d1 0x7fc70a2d28fd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x26) [0xf62526]
/usr/bin/mongod() [0xf62300]
/usr/bin/mongod() [0xf6241f]
/lib64/libpthread.so.0(+0xf710) [0x7fc70b581710]
/usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask12tryHeartbeatEPNS_7BSONObjEPi+0x52) [0xca12c2]
/usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask17_requestHeartbeatERNS_13HeartbeatInfoERNS_7BSONObjERi+0xf7) [0xca14e7]
/usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask6doWorkEv+0x96) [0xca3bb6]
/usr/bin/mongod(_ZN5mongo4task4Task3runEv+0x25) [0xf02995]
/usr/bin/mongod(_ZN5mongo13BackgroundJob7jobBodyEv+0x128) [0xefb6d8]
/usr/bin/mongod() [0xf9af1c]
/lib64/libpthread.so.0(+0x79d1) [0x7fc70b5799d1]
/lib64/libc.so.6(clone+0x6d) [0x7fc70a2d28fd]
It seems there's some segmentation fault in rsHealthPoll.
This is from a mongod instance running as part of a replica set in a shard-ready cluster (2 mongods + 1 arbiter running with config servers and mongos processes).
This DB mostly receives writes of new records, periodic updates setting a boolean to true for some records, and some reads driven by user activity querying it. (Single collection at the moment.)
Googling this error only gave me other, older, already-solved segmentation fault bugs in MongoDB Jira.
Has anyone seen this recently, or does anyone know the reason?
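In case it helps anyone match it against a known SERVER ticket, the exact build can be pulled from the running instance like this:

db.version()            // server version string
db.serverBuildInfo()    // includes the git version and build environment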
I see that the MongoDB documentation says that an index is removed by calling db.accounts.dropIndex( { "tax-id": 1 } ), but it does not say whether the node needs to be removed from the replica set first or not.
I tried to take a secondary node in the replica set offline, restart it as a standalone node (on a different port), and drop the index there.
But after bringing the node back into the replica set with the regular process (sudo service mongod start), the mongod process dies, saying the index got corrupted:
Thu Oct 31 19:52:38.098 [repl writer worker 1] Assertion: 15898:error in index possibly corruption consider repairing 382
0xdddd81 0xd9f55b 0xd9fa9c 0x7edb83 0x7fb332 0x7fdc08 0x9d3b50 0x9c796e 0x9deb64 0xac45dd 0xac58df 0xa903fa 0xa924c7 0xa71f6c 0xc273d3 0xc26b18 0xdab721 0xe26609 0x7ff4d05f0c6b 0x7ff4cf9965ed
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdddd81]
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0xd9f55b]
/usr/bin/mongod() [0xd9fa9c]
/usr/bin/mongod(_ZN5mongo11checkFailedEj+0x143) [0x7edb83]
/usr/bin/mongod(_ZNK5mongo12BucketBasicsINS_12BtreeData_V1EE11basicInsertENS_7DiskLocERiS3_RKNS_5KeyV1ERKNS_8OrderingE+0x222) [0x7fb332]
/usr/bin/mongod(_ZNK5mongo11BtreeBucketINS_12BtreeData_V1EE10insertHereENS_7DiskLocEiS3_RKNS_5KeyV1ERKNS_8OrderingES3_S3_RNS_12IndexDetailsE+0x68) [0x7fdc08]
/usr/bin/mongod(_ZNK5mongo30IndexInsertionContinuationImplINS_12BtreeData_V1EE22doIndexInsertionWritesEv+0xa0) [0x9d3b50]
/usr/bin/mongod(_ZN5mongo14IndexInterface13IndexInserter19finishAllInsertionsEv+0x1e) [0x9c796e]
/usr/bin/mongod(_ZN5mongo24indexRecordUsingTwoStepsEPKcPNS_16NamespaceDetailsENS_7BSONObjENS_7DiskLocEb+0x754) [0x9deb64]
/usr/bin/mongod(_ZN5mongo11DataFileMgr6insertEPKcPKvibbbPb+0x123d) [0xac45dd]
/usr/bin/mongod(_ZN5mongo11DataFileMgr16insertWithObjModEPKcRNS_7BSONObjEbb+0x4f) [0xac58df]
/usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEPNS_11RemoveSaverEbRKNS_24QueryPlanSelectionPolicyEb+0x2eda) [0xa903fa]
/usr/bin/mongod(_ZN5mongo27updateObjectsForReplicationEPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEbRKNS_24QueryPlanSelectionPolicyE+0xb7) [0xa924c7]
/usr/bin/mongod(_ZN5mongo21applyOperation_inlockERKNS_7BSONObjEbb+0x65c) [0xa71f6c]
/usr/bin/mongod(_ZN5mongo7replset8SyncTail9syncApplyERKNS_7BSONObjEb+0x713) [0xc273d3]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x48) [0xc26b18]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdab721]
/usr/bin/mongod() [0xe26609]
/lib64/libpthread.so.0(+0x7c6b) [0x7ff4d05f0c6b]
/lib64/libc.so.6(clone+0x6d) [0x7ff4cf9965ed]
Thu Oct 31 19:52:38.106 [repl writer worker 1] ERROR: writer worker caught exception: error in index possibly corruption consider repairing 382 on:
xxxxxxxx--deleted content related to the data...xxxxxxxxxxxxx
Thu Oct 31 19:52:38.106 [repl writer worker 1] Fatal Assertion 16360
0xdddd81 0xd9dc13 0xc26bfc 0xdab721 0xe26609 0x7ff4d05f0c6b 0x7ff4cf9965ed
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdddd81]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd9dc13]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc26bfc]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdab721]
/usr/bin/mongod() [0xe26609]
/lib64/libpthread.so.0(+0x7c6b) [0x7ff4d05f0c6b]
/lib64/libc.so.6(clone+0x6d) [0x7ff4cf9965ed]
Thu Oct 31 19:52:38.108 [repl writer worker 1]
***aborting after fassert() failure
Thu Oct 31 19:52:38.108 Got signal: 6 (Aborted).
Is this due to dropping the index in offline mode on the secondary? Any suggestions on the proper way to drop the index are highly appreciated.
The proper way to remove an index from a replica set is to drop it on the primary. The idea of replication is that every member keeps the same copy of the data (with small time lags), so whatever you do on the primary is copied to the secondaries; as soon as an operation finishes on the primary, it propagates to them.
If you remove the index on the primary, it will be removed on the secondaries as well.
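A minimal sketch, using the collection and index from your question:

// on the PRIMARY
db.accounts.dropIndex({ "tax-id": 1 })

// once the drop has replicated, verify on a secondary
rs.slaveOk()                 // allow reads on the secondary (old-style shell helper)
db.accounts.getIndexes()     // the { "tax-id": 1 } index should no longer be listed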
I am trying to configure a standalone MongoDB replica set with 3 instances. I seem to have gotten into a funky state: two of my instances went down, and I was left with only secondary nodes. I tried to follow this: http://docs.mongodb.org/manual/tutorial/reconfigure-replica-set-with-unavailable-members/
I got this error though:
rs0:SECONDARY> rs.reconfig(cfg, {force : true})
{
"errmsg" : "exception: need most members up to reconfigure, not ok : obfuscated_hostname:27019",
"code" : 13144,
"ok" : 0
}
When I look at the logs I see this:
Fri Aug 2 20:45:11.895 [initandlisten] options: { config: "/etc/mongodb1.conf",
dbpath: "/var/lib/mongodb1", logappend: "true", logpath: "/var/log/mongodb/mongodb1.log",
port: 27018, replSet: "rs0" }
Fri Aug 2 20:45:11.897 [initandlisten] journal dir=/var/lib/mongodb1/journal
Fri Aug 2 20:45:11.897 [initandlisten] recover begin
Fri Aug 2 20:45:11.897 [initandlisten] recover lsn: 0
Fri Aug 2 20:45:11.897 [initandlisten] recover /var/lib/mongodb1/journal/j._0
Fri Aug 2 20:45:11.899 [initandlisten] recover cleaning up
Fri Aug 2 20:45:11.899 [initandlisten] removeJournalFiles
Fri Aug 2 20:45:11.899 [initandlisten] recover done
Fri Aug 2 20:45:11.923 [initandlisten] waiting for connections on port 27018
Fri Aug 2 20:45:11.925 [websvr] admin web console waiting for connections on port 28018
Fri Aug 2 20:45:11.927 [rsStart] replSet I am hostname_obfuscated:27018
Fri Aug 2 20:45:11.927 [rsStart] replSet STARTUP2
Fri Aug 2 20:45:11.929 [rsHealthPoll] replset info hostname_obf:27017 thinks that we are down
Fri Aug 2 20:45:11.929 [rsHealthPoll] replSet member hostname_obf:27017 is up
Fri Aug 2 20:45:11.929 [rsHealthPoll] replSet member hostname_obf:27017 is now in state SECONDARY
Fri Aug 2 20:45:12.587 [initandlisten] connection accepted from ip_obf:52446 #1 (1 connection now open)
Fri Aug 2 20:45:12.587 [initandlisten] connection accepted from ip_obf:52447 #2 (2 connections now open)
Fri Aug 2 20:45:12.588 [conn1] end connection ip_obf:52446 (1 connection now open)
Fri Aug 2 20:45:12.928 [rsSync] replSet SECONDARY
I'm unable to connect to the mongo instances, even though the logs say they are up and running. Any ideas on what to do here?
You did not mention which version of mongodb you are using, but I assume it is post-2.0.
I think the problem with your forced reconfiguration is that, even after the reconfiguration, you still need the minimum number of nodes for a functioning replica set, i.e. 3. But since you originally had 3 members and lost 2, there is no way to turn that single surviving node into a functioning replica set.
Your only option for recovery would be to bring up the surviving node as a stand-alone server, back up the database, and then create a new 3-node replica set with that data.
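A rough sketch of that recovery path (paths and ports are placeholders; the essential part is starting mongod without the replSet option):

# 1. bring the surviving node up standalone
mongod --dbpath /var/lib/mongodb1 --port 27018

# 2. back up the data
mongodump --port 27018 --out /backup/rs0-dump

# 3. restart with --replSet rs0, run rs.initiate() in the shell,
#    then rs.add() the two rebuilt members to form the new 3-node set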
Yes, you can promote a single secondary member to primary if the secondary server is running fine. Follow the simple steps below:
Step 1: Connect to the member and check the current configuration:
rs.conf()
Step 2: Save the current configuration to another variable:
x = rs.conf()
Step 3: Select the _id, host, and port of the member that is to be made primary:
x.members = [{"_id":1,"host" : "localhost.localdomain:27017"}]
Step 4: Reconfigure the replica set by force:
rs.reconfig(x, {force:true})
Now the desired member will be promoted to primary.
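To confirm, once the forced reconfig has been accepted:

rs.status().members.map(function (m) { return m.name + " " + m.stateStr; })
// the remaining member should report PRIMARY shortly afterwards
db.isMaster().ismaster    // returns true once this node has become primary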