I have a test case: a sharded cluster with 1 shard.
The shard is a replica set with 1 primary and 2 secondaries.
My application uses the secondaryPreferred read preference, and at first the queries were balanced across the two secondaries. Then I stopped one secondary, 10.160.243.22, to simulate a fault and restarted it. The status is OK:
rs10032:PRIMARY> rs.status()
{
"set" : "rs10032",
"date" : ISODate("2014-12-05T09:21:07Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "10.160.243.22:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 2211,
"optime" : Timestamp(1417771218, 3),
"optimeDate" : ISODate("2014-12-05T09:20:18Z"),
"lastHeartbeat" : ISODate("2014-12-05T09:21:05Z"),
"lastHeartbeatRecv" : ISODate("2014-12-05T09:21:07Z"),
"pingMs" : 0,
"lastHeartbeatMessage" : "syncing to: 10.160.188.52:27017",
"syncingTo" : "10.160.188.52:27017"
},
{
"_id" : 1,
"name" : "10.160.188.52:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 2211,
"optime" : Timestamp(1417771218, 3),
"optimeDate" : ISODate("2014-12-05T09:20:18Z"),
"electionTime" : Timestamp(1417770837, 1),
"electionDate" : ISODate("2014-12-05T09:13:57Z"),
"self" : true
},
{
"_id" : 2,
"name" : "10.160.189.52:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 2209,
"optime" : Timestamp(1417771218, 3),
"optimeDate" : ISODate("2014-12-05T09:20:18Z"),
"lastHeartbeat" : ISODate("2014-12-05T09:21:07Z"),
"lastHeartbeatRecv" : ISODate("2014-12-05T09:21:06Z"),
"pingMs" : 0,
"syncingTo" : "10.160.188.52:27017"
}
],
"ok" : 1
}
But all queries go to the other secondary, 10.160.188.52, and 10.160.243.22 is idle.
Why are the queries not balanced across the two secondaries after recovery, and how can I fix it?
Your application uses some kind of driver (I don't know which technology stack you are using) to connect to MongoDB. The driver may cache the replica set status or its connections for some period of time, so there is no guarantee that it will start using a secondary node immediately after that node recovers.
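One way to narrow this down is to check from the server side which members are actually eligible for secondary reads. If the recovered node shows up as eligible, the imbalance is more likely on the driver side. Here is a rough sketch in Python, assuming the rs.status() output has been loaded into a plain dict; the function name readable_secondaries is made up for illustration and is not part of any driver:

```python
def readable_secondaries(status):
    """Return names of members that are healthy and in SECONDARY state (state 2)."""
    return [
        m["name"]
        for m in status.get("members", [])
        if m.get("health") == 1 and m.get("state") == 2
    ]


# Trimmed-down copy of the rs.status() output from the question.
status = {
    "set": "rs10032",
    "members": [
        {"name": "10.160.243.22:27017", "health": 1, "state": 2},
        {"name": "10.160.188.52:27017", "health": 1, "state": 1},
        {"name": "10.160.189.52:27017", "health": 1, "state": 2},
    ],
}

secondaries = readable_secondaries(status)
print(secondaries)
```

This only tells you what the replica set reports; it says nothing about what the driver has cached.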
I have a replica set and, to free some disk space, I want to resync its members.
So, on the SECONDARY member of the replica set, I emptied the /var/lib/mongodb/ directory, which holds the data for the database.
When I open a shell to the replica set and run rs.status(), the following is shown:
{
"set" : "rs1",
"date" : ISODate("2016-12-13T08:28:00.414Z"),
"myState" : 5,
"term" : NumberLong(29),
"heartbeatIntervalMillis" : NumberLong(2000),
"members" : [
{
"_id" : 0,
"name" : "10.20.2.87:27017",
"health" : 1.0,
"state" : 5,
"stateStr" : "SECONDARY",
"uptime" : 148,
"optime" : {
"ts" : Timestamp(6363490787761586, 1),
"t" : NumberLong(29)
},
"optimeDate" : ISODate("2016-12-13T07:54:16.000Z"),
"infoMessage" : "could not find member to sync from",
"configVersion" : 3,
"self" : true
},
{
"_id" : 1,
"name" : "10.20.2.95:27017",
"health" : 1.0,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 146,
"optime" : {
"ts" : Timestamp(6363490787761586, 1),
"t" : NumberLong(29)
},
"optimeDate" : ISODate("2016-12-13T07:54:16.000Z"),
"lastHeartbeat" : ISODate("2016-12-13T08:27:58.435Z"),
"lastHeartbeatRecv" : ISODate("2016-12-13T08:27:59.447Z"),
"pingMs" : NumberLong(0),
"electionTime" : Timestamp(6363486827801739, 1),
"electionDate" : ISODate("2016-12-13T07:38:54.000Z"),
"configVersion" : 3
},
{
"_id" : 2,
"name" : "10.20.2.93:30001",
"health" : 1.0,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 146,
"lastHeartbeat" : ISODate("2016-12-13T08:27:58.437Z"),
"lastHeartbeatRecv" : ISODate("2016-12-13T08:27:59.394Z"),
"pingMs" : NumberLong(0),
"configVersion" : 3
}
],
"ok" : 1.0
}
Why does my secondary member show "could not find member to sync from" even though my primary is up and running?
My collection is sharded over 6 servers, and I get this message on 2 replica set members: the ones that have the SECONDARY member at the top of the members array in the rs.status() output.
I really would like to get rid of this error message.
It scares me :-)
Kind regards
I had a similar problem, and it was caused by a heartbeat timeout that was too short. You can solve that problem here
I am trying to initialise a MongoDB replica set, but whenever I add the new node it never makes it past state 3 (RECOVERING). Here is a snapshot from rs.status():
rs0:OTHER> rs.status()
{
"set" : "rs0",
"date" : ISODate("2015-04-27T14:09:21.973Z"),
"myState" : 3,
"members" : [
{
"_id" : 0,
"name" : "10.0.1.184:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 6899,
"optime" : Timestamp(1430143759, 9),
"optimeDate" : ISODate("2015-04-27T14:09:19Z"),
"lastHeartbeat" : ISODate("2015-04-27T14:09:20.133Z"),
"lastHeartbeatRecv" : ISODate("2015-04-27T14:09:20.160Z"),
"pingMs" : 0,
"electionTime" : Timestamp(1430127299, 1),
"electionDate" : ISODate("2015-04-27T09:34:59Z"),
"configVersion" : 109483
},
{
"_id" : 1,
"name" : "10.0.1.119:27017",
"health" : 1,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 6899,
"lastHeartbeat" : ISODate("2015-04-27T14:09:20.133Z"),
"lastHeartbeatRecv" : ISODate("2015-04-27T14:09:20.166Z"),
"pingMs" : 0,
"configVersion" : 109483
},
{
"_id" : 2,
"name" : "10.0.1.179:27017",
"health" : 1,
"state" : 3,
"stateStr" : "RECOVERING",
"uptime" : 15651,
"optime" : Timestamp(1430136863, 2),
"optimeDate" : ISODate("2015-04-27T12:14:23Z"),
"infoMessage" : "could not find member to sync from",
"configVersion" : 109483,
"self" : true
}
],
"ok" : 1
}
Occasionally the infoMessage "could not find member to sync from" is visible on the new node. I note that the oplog on the current primary is only 0.12 hours (1.7GB) and that it is taking approx. 2 hours to copy over the majority of the dataset (as seen by network usage).
Is it correct to assume that the oplog must be greater than this 2 hour period for the initial sync to complete successfully?
It was indeed necessary for the oplog to be LARGER (in time) than the expected time to synchronise the data between two replicas. Disk is cheap, so we increased our oplog to 50GB and restarted the sync; it worked the first time.
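To make the rule concrete, here is a back-of-the-envelope sketch in Python using the numbers from the question (1.7GB of oplog covering 0.12 hours, a sync that takes about 2 hours). It assumes the write rate stays roughly constant, so that the oplog window scales linearly with oplog size; that is an assumption, and the function names are made up for illustration:

```python
def oplog_window_hours(new_size_gb, observed_size_gb, observed_window_hours):
    """Estimate the oplog window for a new size, assuming a constant write rate."""
    gb_per_hour = observed_size_gb / observed_window_hours
    return new_size_gb / gb_per_hour


def covers_initial_sync(window_hours, sync_hours):
    """The oplog window has to be longer than the time the initial sync takes."""
    return window_hours > sync_hours


sync_hours = 2.0          # approximate copy time from the question
current_window = 0.12     # hours covered by the 1.7GB oplog
resized_window = oplog_window_hours(50, 1.7, 0.12)

print(covers_initial_sync(current_window, sync_hours))
print(covers_initial_sync(resized_window, sync_hours))
```

Under that assumption the 50GB oplog covers roughly three and a half hours, which leaves some headroom over the 2 hour sync. A burst of writes during the sync would shrink the real window, so treat the estimate as a lower bound on how much oplog you want.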
From the MongoDB documentation:
At this point, the mongod will perform an initial sync. The length of the initial sync process depends on the size of the database and network connection between members of the replica set.
Source
My question is very simple: how can I know when it's safe to stepDown the PRIMARY member of my replica set? I just upgraded my secondary to use WiredTiger.
Output of rs.status():
{
"set" : "m0",
"date" : ISODate("2015-03-18T09:59:21.486Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "example.com",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 4642,
"optime" : Timestamp(1426672500, 1),
"optimeDate" : ISODate("2015-03-18T09:55:00Z"),
"electionTime" : Timestamp(1426668268, 1),
"electionDate" : ISODate("2015-03-18T08:44:28Z"),
"configVersion" : 7,
"self" : true
},
{
"_id" : 1,
"name" : "example.com",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 1309,
"optime" : Timestamp(1426672500, 1),
"optimeDate" : ISODate("2015-03-18T09:55:00Z"),
"lastHeartbeat" : ISODate("2015-03-18T09:59:20.968Z"),
"lastHeartbeatRecv" : ISODate("2015-03-18T09:59:20.762Z"),
"pingMs" : 0,
"syncingTo" : "example.com",
"configVersion" : 7
},
{
"_id" : 2,
"name" : "example.com",
"health" : 1,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 4640,
"lastHeartbeat" : ISODate("2015-03-18T09:59:21.009Z"),
"lastHeartbeatRecv" : ISODate("2015-03-18T09:59:21.238Z"),
"pingMs" : 59,
"configVersion" : 7
}
],
"ok" : 1
}
Found the solution:
While performing the initial sync, the status is RECOVERING.
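In other words, wait until the upgraded member reports SECONDARY and has caught up with the primary. Here is a small sketch of that check in Python, assuming the rs.status() output is available as a dict and each Timestamp has been converted to a (seconds, increment) tuple. The function name safe_to_step_down is made up, and this is a heuristic, not an official MongoDB check:

```python
def safe_to_step_down(status):
    """True if at least one healthy SECONDARY has caught up to the primary's optime."""
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        return False
    # Arbiters carry no optime; the stateStr check comes first so they are skipped.
    return any(
        m["stateStr"] == "SECONDARY"
        and m["health"] == 1
        and m["optime"] >= primary["optime"]
        for m in members
    )


# Values taken from the rs.status() output in the question.
caught_up = {
    "members": [
        {"stateStr": "PRIMARY", "health": 1, "optime": (1426672500, 1)},
        {"stateStr": "SECONDARY", "health": 1, "optime": (1426672500, 1)},
        {"stateStr": "ARBITER", "health": 1},
    ]
}

# Hypothetical snapshot taken while the initial sync is still running.
still_syncing = {
    "members": [
        {"stateStr": "PRIMARY", "health": 1, "optime": (1426672500, 1)},
        {"stateStr": "RECOVERING", "health": 1, "optime": (1426668000, 1)},
        {"stateStr": "ARBITER", "health": 1},
    ]
}

print(safe_to_step_down(caught_up))
print(safe_to_step_down(still_syncing))
```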
Could you please tell me if this will cause any issues with failover? For example, what would happen if host mongo2.local is down (assuming the original host and the arbiter go down and only 2 members are left)? Will the rest of the members ever be able to elect a new primary?
I know that there shouldn't be an arbiter here, as it makes things worse, but I wanted to know whether a failover will occur with this setup if mongo2.local goes down.
mongo:ARBITER> rs.status()
{
"set" : "mongo",
"date" : ISODate("2015-02-12T09:00:08Z"),
"myState" : 7,
"members" : [
{
"_id" : 0,
"name" : "mongo1.local:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 2572473,
"optime" : Timestamp(1423731603, 4),
"optimeDate" : ISODate("2015-02-12T09:00:03Z"),
"lastHeartbeat" : ISODate("2015-02-12T09:00:07Z"),
"lastHeartbeatRecv" : ISODate("2015-02-12T09:00:07Z"),
"pingMs" : 0,
"syncingTo" : "mongo2.local:27017"
},
{
"_id" : 1,
"name" : "mongo2.local:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 12148099,
"optime" : Timestamp(1423731603, 4),
"optimeDate" : ISODate("2015-02-12T09:00:03Z"),
"lastHeartbeat" : ISODate("2015-02-12T09:00:08Z"),
"lastHeartbeatRecv" : ISODate("2015-02-12T09:00:08Z"),
"pingMs" : 0,
"electionTime" : Timestamp(1423711411, 1),
"electionDate" : ISODate("2015-02-12T03:23:31Z")
},
{
"_id" : 2,
"name" : "mongo3.local:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 5474488,
"optime" : Timestamp(1423731603, 4),
"optimeDate" : ISODate("2015-02-12T09:00:03Z"),
"lastHeartbeat" : ISODate("2015-02-12T09:00:07Z"),
"lastHeartbeatRecv" : ISODate("2015-02-12T09:00:08Z"),
"pingMs" : 139,
"syncingTo" : "mongo2.local:27017"
},
{
"_id" : 3,
"name" : "mongo2.local:27020",
"health" : 1,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 12148101,
"self" : true
}
],
"ok" : 1
}
and:
mongo:ARBITER> rs.config()
{
"_id" : "mongo",
"version" : 5,
"members" : [
{
"_id" : 0,
"host" : "mongo1.local:27017",
"priority" : 0.5
},
{
"_id" : 1,
"host" : "mongo2.local:27017"
},
{
"_id" : 2,
"host" : "mongo3.local:27017",
"priority" : 0.5
},
{
"_id" : 3,
"host" : "mongo2.local:27020",
"arbiterOnly" : true
}
]
}
If fewer than a majority of the votes in a replica set are available, the replica set cannot elect or maintain a primary; it will be unhealthy and read-only. Ergo, if only 2 of your 4 members are up, you will not have a primary. No automatic failover will occur, as there aren't enough votes for an election.
Don't have an even number of nodes in a replica set. It increases the chances that there will be problems, just because there are more servers, without increasing the failure tolerance of the set. With 3 or 4 replica set members, 2 down servers will render the set unhealthy.
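The arithmetic behind this can be sketched in a few lines of Python. It assumes every member has exactly one vote, which matches the config shown in the question; the function names are made up for illustration:

```python
def majority(voting_members):
    """Votes needed to elect or keep a primary: strictly more than half."""
    return voting_members // 2 + 1


def fault_tolerance(voting_members):
    """How many members can be down while a primary can still be elected."""
    return voting_members - majority(voting_members)


def can_elect_primary(voting_members, members_up):
    """True if the members that are still up hold a majority of the votes."""
    return members_up >= majority(voting_members)


for n in (3, 4, 5):
    print(n, majority(n), fault_tolerance(n))

print(can_elect_primary(4, 2))
```

A set with 3 members and a set with 4 members both tolerate only one failure, which is why the fourth member (the arbiter here) adds risk without adding tolerance.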
I have a replica set with three members, with host0:27100 as the primary member. Recently I changed the configuration and made host2:27102 the primary member, following these docs.
After changing the configuration, the rs.status() output says that host1:27101 is "syncingTo" : "host2:27102", which is intended.
But the output for the new primary, host2:27102, shows "syncingTo" : "host0:27100", which is the previous primary member and is now a secondary.
I cannot understand why it's syncing to the secondary member. Is this normal behavior?
s0:SECONDARY> rs.status()
{
"set" : "s0",
"date" : ISODate("2013-09-25T12:31:42Z"),
"myState" : 2,
"syncingTo" : "host2:27102",
"members" : [
{
"_id" : 0,
"name" : "host0:27100",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 428068,
"optime" : Timestamp(1380112272, 1),
"optimeDate" : ISODate("2013-09-25T12:31:12Z"),
"self" : true
},
{
"_id" : 1,
"name" : "host1:27101",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 397,
"optime" : Timestamp(1380112272, 1),
"optimeDate" : ISODate("2013-09-25T12:31:12Z"),
"lastHeartbeat" : ISODate("2013-09-25T12:31:42Z"),
"lastHeartbeatRecv" : ISODate("2013-09-25T12:31:41Z"),
"pingMs" : 10,
"syncingTo" : "host2:27102"
},
{
"_id" : 2,
"name" : "host2:27102",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 397,
"optime" : Timestamp(1380112272, 1),
"optimeDate" : ISODate("2013-09-25T12:31:12Z"),
"lastHeartbeat" : ISODate("2013-09-25T12:31:42Z"),
"lastHeartbeatRecv" : ISODate("2013-09-25T12:31:41Z"),
"pingMs" : 2,
"syncingTo" : "host0:27100"
}
],
"ok" : 1
}
This is a known issue. There is an open ticket about rs.status() showing the primary as syncingTo when run from a secondary if the current primary was a secondary in the past (SERVER-9989). The fix version is 2.5.1.