MongoDB Primary fails to come back

I have a MongoDB 3 member replica set running on Windows. When the primary server (S1) goes down, the secondary is elected correctly. When the primary server comes back up, the replica member stays in an invalid state:
{
"state" : 10,
"stateStr" : "REMOVED",
"uptime" : 111,
"optime" : Timestamp(1448462710, 6),
"optimeDate" : ISODate("2015-11-25T14:45:10Z"),
"ok" : 0,
"errmsg" : "Our replica set config is invalid or we are not a member of it",
"code" : 93
}
After that, the secondary keeps switching between primary and secondary every few seconds, making my application unstable.
The only way to bring the primary server back is to run rs.reconfig(c) manually.
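For reference, the manual workaround is roughly this, run in the mongo shell on the member that is currently primary (a sketch, not an exact transcript):
    // grab the current replica set configuration
    var c = rs.conf()
    // re-apply it so the REMOVED member picks up the config again
    rs.reconfig(c)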
I couldn't find anything wrong with the config files.
Any help will be appreciated.
UPDATE:
Here's the current config:
{
"_id" : "companyName",
"version" : 32593,
"protocolVersion" : NumberLong(1),
"members" : [
{
"_id" : 1,
"host" : "arb.companyName.com:40000",
"arbiterOnly" : true,
"buildIndexes" : true,
"hidden" : false,
"priority" : 1,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 2,
"host" : "m3.companyName.com:40000",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 11,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
},
{
"_id" : 4,
"host" : "m2.companyName.com:40000",
"arbiterOnly" : false,
"buildIndexes" : true,
"hidden" : false,
"priority" : 3,
"tags" : {
},
"slaveDelay" : NumberLong(0),
"votes" : 1
}
],
"settings" : {
"chainingAllowed" : true,
"heartbeatIntervalMillis" : 2000,
"heartbeatTimeoutSecs" : 10,
"electionTimeoutMillis" : 10000,
"getLastErrorModes" : {
},
"getLastErrorDefaults" : {
"w" : 1,
"wtimeout" : 0
},
"replicaSetId" : ObjectId("573dfcd0e8ae6154ff80c50d")
}
}
Should I be using IP addresses rather than host names?
UPDATE 2:
This is the log for the primary (m3.companyName.com - IP 1.1.1.1) from when it was rebooted until I went into the other server (m2.companyName.com - IP 2.2.2.2) and did a manual rs.reconfig().
2016-09-06T07:42:05.953Z I NETWORK [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-09-06T07:42:05.953Z I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory 'c:/mongossl/data3/diagnostic.data'
2016-09-06T07:42:05.954Z I NETWORK [initandlisten] waiting for connections on port 40000 ssl
2016-09-06T07:42:05.955Z W NETWORK [ReplicationExecutor] getaddrinfo("arb.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.955Z I NETWORK [ReplicationExecutor] getaddrinfo("arb.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.957Z W NETWORK [ReplicationExecutor] getaddrinfo("m3.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.957Z I NETWORK [ReplicationExecutor] getaddrinfo("m3.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.958Z W NETWORK [ReplicationExecutor] getaddrinfo("m2.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.959Z I NETWORK [ReplicationExecutor] getaddrinfo("m2.companyName.com") failed: errno:11001 No such host is known.
2016-09-06T07:42:05.959Z W REPL [ReplicationExecutor] Locally stored replica set configuration does not have a valid entry for the current node; waiting for reconfig or remote heartbeat; Got "NodeNotFound: No host described in new configuration 32592 for replica set companyName2 maps to this node" while validating { _id: "companyName2", version: 32592, protocolVersion: 1, members: [ { _id: 1, host: "arb.companyName.com:40000", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "m3.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 11.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 4, host: "m2.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('573dfcd0e8ae6154ff80c50d') } }
2016-09-06T07:42:05.959Z I REPL [ReplicationExecutor] New replica set config in use: { _id: "companyName2", version: 32592, protocolVersion: 1, members: [ { _id: 1, host: "arb.companyName.com:40000", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "m3.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 11.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 4, host: "m2.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('573dfcd0e8ae6154ff80c50d') } }
2016-09-06T07:42:05.959Z I REPL [ReplicationExecutor] This node is not a member of the config
2016-09-06T07:42:05.959Z I REPL [ReplicationExecutor] transition to REMOVED
2016-09-06T07:42:05.959Z I REPL [ReplicationExecutor] Starting replication applier threads
2016-09-06T07:42:06.651Z I NETWORK [initandlisten] connection accepted from 2.2.2.2:53746 #1 (1 connection now open)
2016-09-06T07:42:06.760Z I NETWORK [initandlisten] connection accepted from 2.2.2.2:53747 #2 (2 connections now open)
2016-09-06T07:42:06.864Z I NETWORK [initandlisten] connection accepted from 2.2.2.2:53748 #3 (3 connections now open)
2016-09-06T07:42:06.993Z I ACCESS [conn1] authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=m2.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:42:07.067Z I ACCESS [conn2] authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=m2.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:42:07.159Z I ACCESS [conn3] authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=m2.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:42:07.552Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:42:07.627Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:42:08.975Z I NETWORK [conn1] end connection 2.2.2.2:53746 (2 connections now open)
2016-09-06T07:42:08.975Z I NETWORK [conn2] end connection 2.2.2.2:53747 (2 connections now open)
2016-09-06T07:42:08.975Z I NETWORK [conn3] end connection 2.2.2.2:53748 (2 connections now open)
2016-09-06T07:42:09.371Z I NETWORK [initandlisten] connection accepted from 2.2.2.2:53763 #4 (1 connection now open)
2016-09-06T07:42:09.639Z I ACCESS [conn4] authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=m2.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:42:13.059Z I NETWORK [initandlisten] connection accepted from 3.3.3.3:58220 #5 (2 connections now open)
2016-09-06T07:42:13.127Z I ACCESS [conn5] authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=arb.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:42:13.292Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to arb.companyName.com:40000
2016-09-06T07:42:13.301Z I REPL [ReplicationExecutor] Member arb.companyName.com:40000 is now in state ARBITER
2016-09-06T07:42:13.974Z I NETWORK [initandlisten] connection accepted from 2.2.2.2:53765 #6 (3 connections now open)
2016-09-06T07:42:14.433Z I ACCESS [conn6] Successfully authenticated as principal appUser on companyName
2016-09-06T07:42:16.629Z I NETWORK [initandlisten] connection accepted from 1.1.1.13:49162 #7 (4 connections now open)
2016-09-06T07:42:16.853Z I ACCESS [conn7] Successfully authenticated as principal appUser on companyName
2016-09-06T07:42:17.703Z I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:42:17.703Z I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:42:18.131Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:42:18.206Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:42:23.369Z I NETWORK [initandlisten] connection accepted from 2.2.2.2:53767 #8 (5 connections now open)
2016-09-06T07:42:23.832Z I ACCESS [conn8] Successfully authenticated as principal sa on admin
2016-09-06T07:42:28.356Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:42:38.431Z I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:42:38.431Z I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:42:38.861Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:42:38.936Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:42:49.086Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:42:59.161Z I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:42:59.161Z I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:42:59.590Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:42:59.665Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:43:09.814Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:43:19.889Z I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:43:19.889Z I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:43:20.317Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:43:20.392Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:43:30.542Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:43:34.054Z I NETWORK [initandlisten] connection accepted from 1.1.1.13:49188 #9 (6 connections now open)
2016-09-06T07:43:34.106Z I ACCESS [conn9] Successfully authenticated as principal sa on admin
2016-09-06T07:43:40.617Z I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:43:40.617Z I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:43:41.045Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:43:41.120Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:43:51.270Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:43:51.277Z I NETWORK [initandlisten] connection accepted from 1.1.1.13:49193 #10 (7 connections now open)
2016-09-06T07:43:51.339Z I ACCESS [conn10] Successfully authenticated as principal sa on admin
2016-09-06T07:44:01.346Z I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:44:01.346Z I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:44:01.775Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:44:01.850Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:44:12.001Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:44:22.077Z I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:44:22.077Z I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:44:22.506Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:44:22.582Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:44:32.732Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:44:42.807Z I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:44:42.807Z I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:44:43.237Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:44:43.312Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:44:53.462Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:45:03.537Z I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:45:03.537Z I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:45:03.966Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:45:04.041Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:45:14.191Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:45:24.266Z I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:45:24.266Z I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:45:24.700Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:45:24.775Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:45:34.925Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:45:45.000Z I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:45:45.000Z I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:45:45.428Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:45:45.504Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:45:55.654Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:46:05.729Z I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:46:05.729Z I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:46:06.157Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:46:06.232Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:46:16.382Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:46:26.458Z I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to m2.companyName.com:40000
2016-09-06T07:46:26.458Z I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-09-06T07:46:26.889Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:46:26.964Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state SECONDARY
2016-09-06T07:46:37.115Z I REPL [ReplicationExecutor] Member m2.companyName.com:40000 is now in state PRIMARY
2016-09-06T07:46:43.185Z I NETWORK [initandlisten] connection accepted from 2.2.2.2:53847 #11 (8 connections now open)
2016-09-06T07:46:43.392Z I ACCESS [conn11] authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=m2.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:46:43.541Z I NETWORK [conn11] end connection 2.2.2.2:53847 (7 connections now open)
2016-09-06T07:46:44.370Z I NETWORK [initandlisten] connection accepted from 3.3.3.3:58224 #12 (8 connections now open)
2016-09-06T07:46:44.434Z I ACCESS [conn12] authenticate db: $external { authenticate: 1, mechanism: "MONGODB-X509", user: "CN=arb.companyName.com,O=companyName,ST=ON,C=CA" }
2016-09-06T07:46:44.451Z I NETWORK [conn12] end connection 3.3.3.3:58224 (7 connections now open)
2016-09-06T07:46:47.832Z I REPL [ReplicationExecutor] New replica set config in use: { _id: "companyName2", version: 32593, protocolVersion: 1, members: [ { _id: 1, host: "arb.companyName.com:40000", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "m3.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 11.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 4, host: "m2.companyName.com:40000", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('573dfcd0e8ae6154ff80c50d') } }
2016-09-06T07:46:47.832Z I REPL [ReplicationExecutor] This node is m3.companyName.com:40000 in the config
2016-09-06T07:46:47.832Z I REPL [ReplicationExecutor] transition to STARTUP2
2016-09-06T07:46:47.907Z I REPL [ReplicationExecutor] Scheduling priority takeover at 2016-09-06T03:46:57.907-0400
2016-09-06T07:46:48.040Z I REPL [ReplicationExecutor] syncing from: m2.companyName.com:40000
2016-09-06T07:46:48.545Z I REPL [SyncSourceFeedback] setting syncSourceFeedback to m2.companyName.com:40000
2016-09-06T07:46:48.977Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:46:50.983Z I REPL [ReplicationExecutor] transition to RECOVERING
2016-09-06T07:46:50.985Z I REPL [ReplicationExecutor] transition to SECONDARY
2016-09-06T07:46:51.438Z I REPL [ReplicationExecutor] could not find member to sync from
2016-09-06T07:46:57.907Z I REPL [ReplicationExecutor] Canceling priority takeover callback
2016-09-06T07:46:57.907Z I REPL [ReplicationExecutor] Starting an election for a priority takeover
2016-09-06T07:46:57.907Z I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
2016-09-06T07:46:57.916Z I REPL [ReplicationExecutor] dry election run succeeded, running for election
2016-09-06T07:46:57.925Z I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 244
2016-09-06T07:46:57.925Z I REPL [ReplicationExecutor] transition to PRIMARY
2016-09-06T07:46:58.345Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:46:58.362Z I ASIO [NetworkInterfaceASIO-0] Successfully connected to m2.companyName.com:40000
2016-09-06T07:46:58.440Z I REPL [rsSync] transition to primary complete; database writes are now permitted
The most obvious thing I noticed is the "No such host is known" error. Maybe Mongo is trying to start before Windows can resolve the names?

Delay the startup of mongod until the network and DNS are available; this will resolve the issue.
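If mongod runs as a Windows service, one way to do this (a sketch; the service name "MongoDB" is an assumption, check yours with sc query) is to switch it to delayed automatic start, or make it depend on the TCP/IP and DNS Client services so it only starts once name resolution works:
    REM delayed automatic start (the space after "=" is required by sc)
    sc config MongoDB start= delayed-auto
    REM optionally start only after Tcpip and the DNS Client service are up
    REM note: depend= overwrites the service's existing dependency list
    sc config MongoDB depend= Tcpip/Dnscache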

I ran into the same problem when I tried to replace a secondary with one restored from a backup. The problem was that I started the mongod process on the backup server before it was reachable by the rest of the replica set (before switching from the old server to the new server restored from the backup). After restarting the mongod process the problem was solved.
My suggestion is to start the mongod process only once it is reachable by the replica set it should belong to.
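A quick way to check this before starting the member is to confirm, from the restored server, that it can resolve and reach an existing member (a sketch using the host and port from the question above; the SSL/x.509 options used in that cluster are omitted):
    mongo --host m2.companyName.com --port 40000 --eval "db.adminCommand({ ping: 1 })"
It is also worth checking the reverse direction, i.e. that the existing members can resolve the restored member's host name.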

Related

data lost in mongodb replica set mode

My replica set has two nodes:
1: the master node
2: a slave node with priority:0, votes:0
The oplog size is 5000MB.
run this for loop in master shell:
for (i = 0; i < 1000000; i++) {
    db.getSiblingDB("ff").c.insert({
        a: i,
        d: i+".#234"+(++i)+".234546"+(++i)+".568679"+(++i)+"31234."+(++i)+".12342354"+(++i)+"5346457."+(++i)+"33543465456."+(++i)+".6346456"+(++i)+"123235434."+(++i)+".2345345345"+(++i)
    })
}
Kill the slave node while the for loop is running: kill -9 $(pidof slave_node)
Stop the for loop after a second; then restart the slave node.
Then run db.getSiblingDB("ff").c.count() to check data in both slave and master nodes, with the results:
master: 20w (about 200,000 documents)
slave: 15w (about 150,000 documents)
The slave node does catch up with the primary, but a lot of data is missing from the slave.
Why is this?
Here is the slave node's log as it restarts after being killed:
2017-11-27T05:53:53.873+0000 I NETWORK [thread1] waiting for connections on port 28006
2017-11-27T05:53:53.876+0000 I REPL [replExecDBWorker-0] New replica set config in use: { _id: "cpconfig2", version: 2, protocolVersion: 1, members: [ { _id: 0, host: "127.0.0.1:28007", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "127.0.0.1:28006", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 0.0, tags: {}, slaveDelay: 0, votes: 0 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: 60000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('5a1ba5bbb0a652502a5f002a') } }
2017-11-27T05:53:53.876+0000 I REPL [replExecDBWorker-0] This node is 127.0.0.1:28006 in the config
2017-11-27T05:53:53.876+0000 I REPL [replExecDBWorker-0] transition to STARTUP2
2017-11-27T05:53:53.876+0000 I REPL [replExecDBWorker-0] Starting replication storage threads
2017-11-27T05:53:53.877+0000 I REPL [replExecDBWorker-0] Starting replication fetcher thread
2017-11-27T05:53:53.877+0000 I REPL [replExecDBWorker-0] Starting replication applier thread
2017-11-27T05:53:53.877+0000 I REPL [replExecDBWorker-0] Starting replication reporter thread
2017-11-27T05:53:53.877+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 127.0.0.1:28007
2017-11-27T05:53:53.877+0000 I REPL [rsSync] transition to RECOVERING
2017-11-27T05:53:53.878+0000 I REPL [rsSync] transition to SECONDARY
2017-11-27T05:53:53.879+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 127.0.0.1:28007, took 2ms (1 connections now open to 127.0.0.1:28007)
2017-11-27T05:53:53.879+0000 I REPL [ReplicationExecutor] Member 127.0.0.1:28007 is now in state PRIMARY
2017-11-27T05:53:54.011+0000 I FTDC [ftdc] Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost. OK
2017-11-27T05:53:54.645+0000 I NETWORK [thread1] connection accepted from 127.0.0.1:52404 #1 (1 connection now open)
2017-11-27T05:53:54.645+0000 I NETWORK [conn1] received client metadata from 127.0.0.1:52404 conn1: { driver: { name: "NetworkInterfaceASIO-Replication", version: "3.4.9" }, os: { type: "Linux", name: "PRETTY_NAME="Debian GNU/Linux 8 (jessie)"", architecture: "x86_64", version: "Kernel 3.10.0" } }
2017-11-27T05:53:59.878+0000 I REPL [rsBackgroundSync] sync source candidate: 127.0.0.1:28007
See the page Accuracy after Unexpected Shutdown for details and information on how to recover from this situation.
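Before comparing counts it is also worth confirming that the secondary has actually finished catching up; a quick sketch from the mongo shell (both helpers exist in 3.4):
    // shows how far each secondary is behind the primary
    rs.printSlaveReplicationInfo()
    // or compare the optimes member by member
    rs.status().members.forEach(function (m) { print(m.name, m.stateStr, m.optimeDate) })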

could not get updated shard list from config server due to Operation timed out

MongoDB v3.4.9
The error is raised from this line of the MongoDB C++ source: https://github.com/mongodb/mongo/blob/367d31e1da549c460ae710a8cc280f4c235ab24f/src/mongo/s/client/shard_registry.cpp#L384
Mongos throws this error after I add a new node to the sharded cluster, and none of the collections with sharding enabled can be queried (ExceededTimeLimit).
Can this be repaired?
Marking host config.app.com as failed :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 871 -- target:config.app.com db:config expDate:2017-10-21T13:16:38.250+0000 cmd:{ find: "shards", readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1508586527000|1, t: 24 } }, maxTimeMS: 30000 }
2017-10-21T13:16:38.250+0000 I SHARDING [shard registry reload] Operation timed out :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 871 -- target:config.app.com db:config expDate:2017-10-21T13:16:38.250+0000 cmd:{ find: "shards", readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1508586527000|1, t: 24 } }, maxTimeMS: 30000 }
2017-10-21T13:16:38.250+0000 I SHARDING [shard registry reload] Periodic reload of shard registry failed :: caused by :: 50 could not get updated shard list from config server due to Operation timed out, request was RemoteCommand 871 -- target:config.app.com db:config expDate:2017-10-21T13:16:38.250+0000 cmd:{ find: "shards", readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1508586527000|1, t: 24 } }, maxTimeMS: 30000 }; will retry after 30s
This happened to me and after hours of debugging I found that my config server was started without the configsvr: true option in rs.initiate. So mongos was requesting data from my config server but the config server didn't know how to respond. fwiw, I had
sharding:
clusterRole: configsvr
in my conf file, but it looks like that wasn't picked up.
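For anyone hitting the same thing, this is roughly the rs.initiate call the config server replica set needs (a sketch; the replica set name, host, and port below are illustrative, taken from the log above and the default config server port):
    rs.initiate({
        _id: "configReplSet",                              // must match replSetName in the config server's conf file
        configsvr: true,                                   // the flag that was missing
        members: [ { _id: 0, host: "config.app.com:27019" } ]
    })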

Mongos can add replica set, but can't connect

I'm setting up a sharded mongo cluster. I have two replica sets consisting of two nodes each, a replica set of three config servers, and a single mongos instance.
I have been able to add the replica set to the mongos instance:
sh.addShard("rs1/shard-rs01-s01");
This returns {"ok" : 1} and the same is true of the second replica set.
However when I try to do any database operations such as db.test.insert(...) I receive this error:
2017-02-23T01:17:28.599+0000 I ASIO [CatalogManagerReplacer]
Connecting to shard-RS01-S01:27017
2017-02-23T01:17:28.600+0000 I ASIO [CatalogManagerReplacer]
Connecting to config-01:27019
2017-02-23T01:17:28.603+0000 I ASIO [CatalogManagerReplacer]
Successfully connected to config-01:27019
2017-02-23T01:17:48.600+0000 I ASIO [CatalogManagerReplacer] Failed to connect to shard-RS01-S01:27017 - ExceededTimeLimit: Operation timed out
I double-checked that the firewall wasn't blocking the connection by disabling it on all of the systems. For what it is worth, on the node that hosts the mongos instance I can connect to the replica set directly from the command line with this command, regardless of the firewall state:
mongo --host rs1/shard-rs01-s01:27017
So I am fairly sure it is not a firewall issue. Anyone have any ideas?
Here's a shard map of the setup if it is useful for anyone able to help...
mongos> db.runCommand("getShardMap")
{
"map" : {
"config" : "rs0/config-01:27019,config-02:27019,config-03:27019",
"config-01:27019" : "rs0/config-01:27019,config-02:27019,config-03:27019",
"config-02:27019" : "rs0/config-01:27019,config-02:27019,config-03:27019",
"config-03:27019" : "rs0/config-01:27019,config-02:27019,config-03:27019",
"rs0/config-01:27019,config-02:27019,config-03:27019" : "rs0/config-01:27019,config-02:27019,config-03:27019",
"rs1" : "rs1/shard-RS01-S01:27017,shard-RS01-S02:27017",
"rs1/shard-RS01-S01:27017,shard-RS01-S02:27017" : "rs1/shard-RS01-S01:27017,shard-RS01-S02:27017",
"rs2" : "rs2/shard-RS02-S03:27017,shard-RS02-S04:27017",
"rs2/shard-RS02-S03:27017,shard-RS02-S04:27017" : "rs2/shard-RS02-S03:27017,shard-RS02-S04:27017",
"shard-RS01-S01:27017" : "rs1/shard-RS01-S01:27017,shard-RS01-S02:27017",
"shard-RS01-S02:27017" : "rs1/shard-RS01-S01:27017,shard-RS01-S02:27017",
"shard-RS02-S03:27017" : "rs2/shard-RS02-S03:27017,shard-RS02-S04:27017",
"shard-RS02-S04:27017" : "rs2/shard-RS02-S03:27017,shard-RS02-S04:27017"
},
"ok" : 1
}
You need to initiate your config server replica set with the configsvr option:
rs.initiate( { _id: "configReplSet", configsvr: true, members: [ { _id: 0, host: "mongo-config-1:27017" }] } )
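Once the config server replica set is initiated with configsvr: true, you can verify from the mongos shell that the shards are reachable, e.g.:
    sh.status()                          // should list rs1 and rs2 under "shards"
    db.adminCommand({ listShards: 1 })   // same information as a raw command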

MongoDs in ReplSet won't start after trying out some MapReduce

I was practicing some MapReduce inside my primary's mongo shell when it suddenly became a secondary. I SSHed into the two other VMs with the other secondaries and discovered that the mongods had been rendered inoperable. I killed them, issued mongod --config /etc/mongod.conf to start them again, and entered the mongo shell. After a few seconds they were interrupted with:
2014-09-14T22:29:54.142-0500 DBClientCursor::init call() failed
2014-09-14T22:29:54.143-0500 trying reconnect to 127.0.0.1:27017 (127.0.0.1) failed
2014-09-14T22:29:54.143-0500 warning: Failed to connect to 127.0.0.1:27017, reason: errno:111 Connection refused
2014-09-14T22:29:54.143-0500 reconnect 127.0.0.1:27017 (127.0.0.1) failed failed couldn't connect to server 127.0.0.1:27017 (127.0.0.1), connection attempt failed
>
This is from their (the two original secondaries in the replicaset) logs:
2014-09-14T22:09:21.879-0500 [rsBackgroundSync] replSet syncing to: vm-billing-001:27017
2014-09-14T22:09:21.880-0500 [rsSync] replSet still syncing, not yet to minValid optime 54165090:1
2014-09-14T22:09:21.882-0500 [rsBackgroundSync] replset setting syncSourceFeedback to vm-billing-001:27017
2014-09-14T22:09:21.886-0500 [rsSync] replSet SECONDARY
2014-09-14T22:09:21.886-0500 [repl writer worker 1] build index on: test.tmp.mr.CCS.nonconforming_1_inc properties: { v: 1, key: { 0: 1 }, name: "_temp_0", ns: "test.tmp.mr.CCS.nonconforming_1_inc" }
2014-09-14T22:09:21.887-0500 [repl writer worker 1] added index to empty collection
2014-09-14T22:09:21.887-0500 [repl writer worker 1] build index on: test.tmp.mr.CCS.nonconforming_1 properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.tmp.mr.CCS.nonconforming_1" }
2014-09-14T22:09:21.887-0500 [repl writer worker 1] added index to empty collection
2014-09-14T22:09:21.888-0500 [repl writer worker 1] build index on: test.tmp.mr.CCS.nonconforming_1 properties: { v: 1, unique: true, key: { id: 1.0 }, name: "id_1", ns: "test.tmp.mr.CCS.nonconforming_1" }
2014-09-14T22:09:21.888-0500 [repl writer worker 1] added index to empty collection
2014-09-14T22:09:21.891-0500 [repl writer worker 2] ERROR: writer worker caught exception: :: caused by :: 11000 insertDocument :: caused by :: 11000 E11000 duplicate key error index: cisco.tmp.mr.CCS.nonconforming_1.$id_1 dup key: { : null } on: { ts: Timestamp 1410748561000|46, h: 9014687153249982311, v: 2, op: "i", ns: "cisco.tmp.mr.CCS.nonconforming_1", o: { _id: 14, value: 1.0 } }
2014-09-14T22:09:21.891-0500 [repl writer worker 2] Fatal Assertion 16360
2014-09-14T22:09:21.891-0500 [repl writer worker 2]
From both of the VMs whose mongods won't stay up, I can run mongo --host ... --port ... against the original primary, but I do see some connection-refused messages in the error log above.
My original primary mongod can still be connected to from the mongo shell, but it is a primary. I can kill it and restart it, and it will start up as a secondary.
How can I roll back to the last known state and restart my replica set?
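A common way to get the crashed secondaries running again, assuming the surviving member still holds a complete copy of the data, is to let them resync from scratch (a sketch; the dbpath below is illustrative, use the one from your mongod.conf):
    # on each broken secondary, with mongod stopped after the fatal assertion:
    # move the old data files aside so the member starts empty
    mv /var/lib/mongodb /var/lib/mongodb.bak
    # start it again; it rejoins the set and runs an initial sync from the healthy member
    mongod --config /etc/mongod.conf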

MongoDB sharding problems

Our MongoDB deployment has 2 shards, each with 1 master server and 2 slave servers.
The four slave servers run mongo config as proxy, and two of the slave servers run arbiters.
But MongoDB can't be used now.
I can connect to 192.168.0.1:8000 (mongos) and run commands like 'use database' or 'show dbs', but I can't run queries against a chosen database, such as 'db.foo.count()' or 'db.foo.findOne()'.
Here is the error log:
mongos> db.dev.count()
Fri Aug 16 12:55:36 uncaught exception: count failed: {
"assertion" : "DBClientBase::findN: transport error: 10.81.4.72:7100 query: { setShardVersion: \"\", init: true, configdb: \"10.81.4.72:7300,10.42.50.26:7300,10.81.51.235:7300\", serverID: ObjectId('520db0a51fa00999772612b9'), authoritative: true }",
"assertionCode" : 10276,
"errmsg" : "db assertion failure",
"ok" : 0
}
Fri Aug 16 11:23:29 [conn8431] DBClientCursor::init call() failed
Fri Aug 16 11:23:29 [conn8430] Socket recv() errno:104 Connection reset by peer 10.81.4.72:7100
Fri Aug 16 11:23:29 [conn8430] SocketException: remote: 10.81.4.72:7100 error: 9001 socket exception [1] server [10.81.4.72:7100]
Fri Aug 16 11:23:29 [conn8430] DBClientCursor::init call() failed
Fri Aug 16 11:23:29 [conn8430] DBException in process: could not initialize cursor across all shards because : DBClientBase::findN: transport error: 10.81.4.72:7100 query: { setShardVersion: "", init: true, configdb: "10.81.4.72:7300,10.42.50.26:7300,10.81.51.235:7300", serverID: ObjectId('520d99c972581e6a124d0561'), authoritative: true } # s01/10.36.31.36:7100,10.42.50.24:7100,10.81.4.72:7100
I can only start one mongos; queries won't execute if more than one mongos is running at the same time. Error log:
mongos> db.dev.count()
Fri Aug 16 15:12:29 uncaught exception: count failed: {
"assertion" : "DBClientBase::findN: transport error: 10.81.4.72:7100 query: { setShardVersion: \"\", init: true, configdb: \"10.81.4.72:7300,10.42.50.26:7300,10.81.51.235:7300\", serverID: ObjectId('520dd04967557902f73a9fba'), authoritative: true }",
"assertionCode" : 10276,
"errmsg" : "db assertion failure",
"ok" : 0
}
Could you please clarify whether your setup was working before, or whether you are just setting it up now?
To repair your MongoDB, you might want to follow this link:
http://docs.mongodb.org/manual/tutorial/recover-data-following-unexpected-shutdown/
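The core step in that tutorial is to run mongod with the --repair option against the affected data directory (a sketch; back up the data files first and substitute your own dbpath):
    mongod --dbpath /data/db --repair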
References
MongoDB Documentation: Deploying a Shard-Cluster
MongoDB Documentation: Add Shards to an existing cluster
Older, outdated(!) info:
YouTube video on setting up sharding for MongoDB
Corresponding blog post on blog.serverdensity.com