Initialising a MongoDB replica set never completes

I am trying to initialise a MongoDB replica set, but whenever I add a new node it never makes it past state 3 (RECOVERING). Here is a snapshot from rs.status():
rs0:OTHER> rs.status()
{
"set" : "rs0",
"date" : ISODate("2015-04-27T14:09:21.973Z"),
"myState" : 3,
"members" : [
{
"_id" : 0,
"name" : "10.0.1.184:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 6899,
"optime" : Timestamp(1430143759, 9),
"optimeDate" : ISODate("2015-04-27T14:09:19Z"),
"lastHeartbeat" : ISODate("2015-04-27T14:09:20.133Z"),
"lastHeartbeatRecv" : ISODate("2015-04-27T14:09:20.160Z"),
"pingMs" : 0,
"electionTime" : Timestamp(1430127299, 1),
"electionDate" : ISODate("2015-04-27T09:34:59Z"),
"configVersion" : 109483
},
{
"_id" : 1,
"name" : "10.0.1.119:27017",
"health" : 1,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 6899,
"lastHeartbeat" : ISODate("2015-04-27T14:09:20.133Z"),
"lastHeartbeatRecv" : ISODate("2015-04-27T14:09:20.166Z"),
"pingMs" : 0,
"configVersion" : 109483
},
{
"_id" : 2,
"name" : "10.0.1.179:27017",
"health" : 1,
"state" : 3,
"stateStr" : "RECOVERING",
"uptime" : 15651,
"optime" : Timestamp(1430136863, 2),
"optimeDate" : ISODate("2015-04-27T12:14:23Z"),
"infoMessage" : "could not find member to sync from",
"configVersion" : 109483,
"self" : true
}
],
"ok" : 1
}
Occasionally the infoMessage "could not find member to sync from" is visible on the new node. I note that the oplog on the current primary covers only 0.12 hours (1.7GB), and that it is taking approx. 2 hours to copy over the majority of the dataset (as seen from network usage).
Is it correct to assume that the oplog must cover more than this 2-hour period for the initial sync to complete successfully?

It was indeed necessary for the oplog to be LARGER (in time) than the expected time to synchronise the data between two replicas. Disk is cheap, so we increased our oplog to 50GB and restarted the sync; it worked first time.
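You can check the oplog window from the shell before starting a sync; it needs to exceed the expected copy time. A minimal sketch (the 50GB figure matches what we used; replSetResizeOplog only exists on MongoDB 3.6+ with WiredTiger, and on older versions you must instead set oplogSizeMB in the config before the member's first start):

rs.printReplicationInfo()              // prints "log length start to end", i.e. the oplog window
db.getReplicationInfo().timeDiffHours  // the same window in hours, as a plain number

// MongoDB 3.6+ only: resize the oplog at runtime (size is in MB, so 51200 = 50GB)
db.adminCommand({ replSetResizeOplog: 1, size: 51200 })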

Related

Resync a Mongo Replica Set

I have a replica set, and to free some disk space, I want to resync my replica set members.
To that end, on the SECONDARY member of the replica set, I emptied the /var/lib/mongodb/ directory, which holds the data for the database.
When I open a shell to the replica set and execute rs.status(), the following is shown.
{
"set" : "rs1",
"date" : ISODate("2016-12-13T08:28:00.414Z"),
"myState" : 5,
"term" : NumberLong(29),
"heartbeatIntervalMillis" : NumberLong(2000),
"members" : [
{
"_id" : 0,
"name" : "10.20.2.87:27017",
"health" : 1.0,
"state" : 5,
"stateStr" : "SECONDARY",
"uptime" : 148,
"optime" : {
"ts" : Timestamp(6363490787761586, 1),
"t" : NumberLong(29)
},
"optimeDate" : ISODate("2016-12-13T07:54:16.000Z"),
"infoMessage" : "could not find member to sync from",
"configVersion" : 3,
"self" : true
},
{
"_id" : 1,
"name" : "10.20.2.95:27017",
"health" : 1.0,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 146,
"optime" : {
"ts" : Timestamp(6363490787761586, 1),
"t" : NumberLong(29)
},
"optimeDate" : ISODate("2016-12-13T07:54:16.000Z"),
"lastHeartbeat" : ISODate("2016-12-13T08:27:58.435Z"),
"lastHeartbeatRecv" : ISODate("2016-12-13T08:27:59.447Z"),
"pingMs" : NumberLong(0),
"electionTime" : Timestamp(6363486827801739, 1),
"electionDate" : ISODate("2016-12-13T07:38:54.000Z"),
"configVersion" : 3
},
{
"_id" : 2,
"name" : "10.20.2.93:30001",
"health" : 1.0,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 146,
"lastHeartbeat" : ISODate("2016-12-13T08:27:58.437Z"),
"lastHeartbeatRecv" : ISODate("2016-12-13T08:27:59.394Z"),
"pingMs" : NumberLong(0),
"configVersion" : 3
}
],
"ok" : 1.0
}
Why does my secondary member show "could not find member to sync from" when my primary is up and running?
My collection is sharded over 6 servers, and I see this message on 2 replica set members: the ones that are listed first (as SECONDARY) in the members array when requesting the replica set status.
I really would like to get rid of this error message.
It scares me :-)
Kind regards
I had a similar problem, and it was due to the heartbeat timeout being too short; you can solve that problem here.
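For reference, the heartbeat timeout lives in the replica set configuration and can be raised with a reconfig. A minimal sketch, run on the primary, assuming the default of 10 seconds is too short for your network:

cfg = rs.conf()
cfg.settings = cfg.settings || {}        // older configs may lack a settings subdocument
cfg.settings.heartbeatTimeoutSecs = 30   // raise from the default of 10 seconds
rs.reconfig(cfg)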

How to know when replica set initial sync completed

From the MongoDB documentation:
At this point, the mongod will perform an initial sync. The length of the initial sync process depends on the size of the database and network connection between members of the replica set.
Source
My question is very simple: how can I know when it's safe to stepDown the PRIMARY member of my replica set? I just upgraded my secondary to use WiredTiger.
Output of rs.status():
{
"set" : "m0",
"date" : ISODate("2015-03-18T09:59:21.486Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "example.com",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 4642,
"optime" : Timestamp(1426672500, 1),
"optimeDate" : ISODate("2015-03-18T09:55:00Z"),
"electionTime" : Timestamp(1426668268, 1),
"electionDate" : ISODate("2015-03-18T08:44:28Z"),
"configVersion" : 7,
"self" : true
},
{
"_id" : 1,
"name" : "example.com"",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 1309,
"optime" : Timestamp(1426672500, 1),
"optimeDate" : ISODate("2015-03-18T09:55:00Z"),
"lastHeartbeat" : ISODate("2015-03-18T09:59:20.968Z"),
"lastHeartbeatRecv" : ISODate("2015-03-18T09:59:20.762Z"),
"pingMs" : 0,
"syncingTo" : "example.com"",
"configVersion" : 7
},
{
"_id" : 2,
"name" : "example.com"",
"health" : 1,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 4640,
"lastHeartbeat" : ISODate("2015-03-18T09:59:21.009Z"),
"lastHeartbeatRecv" : ISODate("2015-03-18T09:59:21.238Z"),
"pingMs" : 59,
"configVersion" : 7
}
],
"ok" : 1
}
Found the solution:
While performing the initial sync, the status is RECOVERING
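That is, it is safe to step down the primary once the new member has left RECOVERING and reports SECONDARY; a minimal shell sketch that waits for this:

// poll until no member is still in RECOVERING (3) or STARTUP2 (5)
while (rs.status().members.some(function (m) {
    return m.state === 3 || m.state === 5;
})) {
    sleep(5000);   // sleep() is a mongo shell built-in; the argument is milliseconds
}
print("initial sync complete; safe to rs.stepDown() the primary")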

MongoDB replica set configuration

Could you please tell me if this will cause any issues with failover? For example, what would happen if host mongo2.local went down? (Since it hosts both the primary and the arbiter, only 2 members would be left.) Would the remaining members ever be able to elect a new primary?
I know that there shouldn't be an arbiter here, as it makes things worse, but I wanted to know whether a failover would occur in this setup if mongo2.local went down.
mongo:ARBITER> rs.status()
{
"set" : "mongo",
"date" : ISODate("2015-02-12T09:00:08Z"),
"myState" : 7,
"members" : [
{
"_id" : 0,
"name" : "mongo1.local:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 2572473,
"optime" : Timestamp(1423731603, 4),
"optimeDate" : ISODate("2015-02-12T09:00:03Z"),
"lastHeartbeat" : ISODate("2015-02-12T09:00:07Z"),
"lastHeartbeatRecv" : ISODate("2015-02-12T09:00:07Z"),
"pingMs" : 0,
"syncingTo" : "mongo2.local:27017"
},
{
"_id" : 1,
"name" : "mongo2.local:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 12148099,
"optime" : Timestamp(1423731603, 4),
"optimeDate" : ISODate("2015-02-12T09:00:03Z"),
"lastHeartbeat" : ISODate("2015-02-12T09:00:08Z"),
"lastHeartbeatRecv" : ISODate("2015-02-12T09:00:08Z"),
"pingMs" : 0,
"electionTime" : Timestamp(1423711411, 1),
"electionDate" : ISODate("2015-02-12T03:23:31Z")
},
{
"_id" : 2,
"name" : "mongo3.local:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 5474488,
"optime" : Timestamp(1423731603, 4),
"optimeDate" : ISODate("2015-02-12T09:00:03Z"),
"lastHeartbeat" : ISODate("2015-02-12T09:00:07Z"),
"lastHeartbeatRecv" : ISODate("2015-02-12T09:00:08Z"),
"pingMs" : 139,
"syncingTo" : "mongo2.local:27017"
},
{
"_id" : 3,
"name" : "mongo2.local:27020",
"health" : 1,
"state" : 7,
"stateStr" : "ARBITER",
"uptime" : 12148101,
"self" : true
}
],
"ok" : 1
}
and:
mongo:ARBITER> rs.config()
{
"_id" : "mongo",
"version" : 5,
"members" : [
{
"_id" : 0,
"host" : "mongo1.local:27017",
"priority" : 0.5
},
{
"_id" : 1,
"host" : "mongo2.local:27017"
},
{
"_id" : 2,
"host" : "mongo3.local:27017",
"priority" : 0.5
},
{
"_id" : 3,
"host" : "mongo2.local:27020",
"arbiterOnly" : true
}
]
}
If less than a majority of the votes in a replica set are available, the set will not be able to elect or maintain a primary; it will be unhealthy and read-only. Ergo, if you only have 2 of your 4 members up, you will not have a primary. No automatic failover will occur, as there aren't enough votes for an election.
Don't have an even number of nodes in a replica set. It increases the chance of problems, simply because there are more servers, without increasing the failure tolerance of the set. With either 3 or 4 replica set members, 2 down servers will render the set unhealthy.
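The arithmetic is easy to verify from the shell; a small sketch, assuming every member carries the default single vote:

var n = rs.conf().members.length;       // 4 in the configuration above
var majority = Math.floor(n / 2) + 1;   // 3 votes are required to elect a primary
print("members: " + n + ", majority needed: " + majority);
// if mongo2.local goes down, the primary and the arbiter both disappear,
// leaving 2 of 4 votes: below the majority of 3, so no election can succeed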

mongo secondary has no queries after recovery

I have a test case: a sharded cluster with 1 shard.
The shard is a replica set with 1 primary and 2 secondaries.
My application uses the secondaryPreferred read preference, and at first the queries are balanced over the two secondaries. Then I stop one secondary, 10.160.243.22, to simulate a fault, and reboot it; the status afterwards is ok:
rs10032:PRIMARY> rs.status()
{
"set" : "rs10032",
"date" : ISODate("2014-12-05T09:21:07Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "10.160.243.22:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 2211,
"optime" : Timestamp(1417771218, 3),
"optimeDate" : ISODate("2014-12-05T09:20:18Z"),
"lastHeartbeat" : ISODate("2014-12-05T09:21:05Z"),
"lastHeartbeatRecv" : ISODate("2014-12-05T09:21:07Z"),
"pingMs" : 0,
"lastHeartbeatMessage" : "syncing to: 10.160.188.52:27017",
"syncingTo" : "10.160.188.52:27017"
},
{
"_id" : 1,
"name" : "10.160.188.52:27017",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 2211,
"optime" : Timestamp(1417771218, 3),
"optimeDate" : ISODate("2014-12-05T09:20:18Z"),
"electionTime" : Timestamp(1417770837, 1),
"electionDate" : ISODate("2014-12-05T09:13:57Z"),
"self" : true
},
{
"_id" : 2,
"name" : "10.160.189.52:27017",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 2209,
"optime" : Timestamp(1417771218, 3),
"optimeDate" : ISODate("2014-12-05T09:20:18Z"),
"lastHeartbeat" : ISODate("2014-12-05T09:21:07Z"),
"lastHeartbeatRecv" : ISODate("2014-12-05T09:21:06Z"),
"pingMs" : 0,
"syncingTo" : "10.160.188.52:27017"
}
],
"ok" : 1
}
but all queries go to the other secondary, 10.160.188.52, while 10.160.243.22 sits idle.
Why are the queries not balanced across the two secondaries after recovery, and how can I fix it?
Your application uses some kind of driver (I don't know the exact technology stack you are using) to connect to MongoDB. Your driver could remember (cache) the replica set status or connections for some period of time, so there is no guarantee that a secondary node will start receiving queries immediately after a recovery.
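One way to confirm from the shell whether the recovered secondary is actually serving reads is to connect to it directly and sample its query opcounter twice; a minimal sketch:

// run while connected directly to 10.160.243.22, not through the mongos
var before = db.serverStatus().opcounters.query;
sleep(10000);                                      // wait ten seconds
var after = db.serverStatus().opcounters.query;
print("queries served in the last 10s: " + (after - before));

If the counter stays flat, restarting the application (or waiting for the driver to refresh its view of the replica set) usually brings the node back into rotation.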

MongoDB secondaries not catching up

I have a replica set in which I am trying to upgrade the primary to a machine with more memory and more disk space. So I RAIDed a couple of disks together on the new machine, rsync'd the data over from a secondary, and added it to the replica set. After checking rs.status(), I noticed that all the secondaries are about 12 hours behind the primary. So when I try to force the new server into the primary spot it won't work, because it is not up to date.
This seems like a big issue, because if the primary fails, we are at least 12 hours, and in one case almost 48 hours, behind.
The oplogs all overlap and the oplog size is fairly large. The only thing I can figure is that I am performing a lot of writes/reads on the primary, which could keep the server locked and prevent proper catch-up.
Is there a way to possibly force a secondary to catch up to the primary?
There are currently 5 servers; the last 2 are intended to replace 2 of the other nodes.
The node with _id 6 is the one meant to replace the primary. The node that is furthest behind the primary's optime is a little over 48 hours behind.
{
"set" : "gryffindor",
"date" : ISODate("2011-05-12T19:34:57Z"),
"myState" : 2,
"members" : [
{
"_id" : 1,
"name" : "10******:27018",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 20231,
"optime" : {
"t" : 1305057514000,
"i" : 31
},
"optimeDate" : ISODate("2011-05-10T19:58:34Z"),
"lastHeartbeat" : ISODate("2011-05-12T19:34:56Z")
},
{
"_id" : 2,
"name" : "10******:27018",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 20231,
"optime" : {
"t" : 1305056009000,
"i" : 400
},
"optimeDate" : ISODate("2011-05-10T19:33:29Z"),
"lastHeartbeat" : ISODate("2011-05-12T19:34:56Z")
},
{
"_id" : 3,
"name" : "10******:27018",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"uptime" : 20229,
"optime" : {
"t" : 1305228858000,
"i" : 422
},
"optimeDate" : ISODate("2011-05-12T19:34:18Z"),
"lastHeartbeat" : ISODate("2011-05-12T19:34:56Z")
},
{
"_id" : 5,
"name" : "10*******:27018",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 20231,
"optime" : {
"t" : 1305058009000,
"i" : 226
},
"optimeDate" : ISODate("2011-05-10T20:06:49Z"),
"lastHeartbeat" : ISODate("2011-05-12T19:34:56Z")
},
{
"_id" : 6,
"name" : "10*******:27018",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"optime" : {
"t" : 1305050495000,
"i" : 384
},
"optimeDate" : ISODate("2011-05-10T18:01:35Z"),
"self" : true
}
],
"ok" : 1
}
I'm not sure why the syncing has failed in your case, but one way to brute-force a resync is to remove the data files on the replica and restart the mongod; this will initiate a resync. See http://www.mongodb.org/display/DOCS/Halted+Replication. It is likely to take quite some time, depending on the size of your database.
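Before wiping a member, it can help to measure exactly how far behind each secondary is; a small sketch that computes the lag from rs.status() (the shell of this era exposes the same information via rs.printSlaveReplicationInfo()):

var s = rs.status();
var primary = s.members.filter(function (m) { return m.state === 1; })[0];
s.members.forEach(function (m) {
    if (m.state === 2) {   // SECONDARY
        // optimeDate values are Dates, so subtracting them yields milliseconds
        print(m.name + " is " + (primary.optimeDate - m.optimeDate) / 1000 + "s behind");
    }
});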
After looking through everything I saw a single error, which led me back to a map-reduce that was run on the primary and hit this issue: https://jira.mongodb.org/browse/SERVER-2861. So when replication was attempted, it failed to sync because of a faulty/corrupt operation in the oplog.