We have a setup of two MongoDB shards. Each shard contains a master, a slave, a slave with a 24h slaveDelay, and an arbiter.
However, the balancer fails to migrate any chunks because it keeps waiting for the delayed slave to catch up.
I have tried setting _secondaryThrottle to false in the balancer config, but I still have the issue.
It seems the migration goes on for a day and then fails (a ton of "waiting for slave" messages in the logs). Eventually it gives up and starts a new migration. The message says it is waiting for 3 slaves, but the delayed slave is hidden and has priority 0, so it should not wait for that one. And if _secondaryThrottle worked, it should not wait for any slave, right?
It's been like this for a few months now, so the config should have been reloaded on all mongoses. Some of the mongoses running the balancer have been restarted recently.
Does anyone have any idea how to solve the problem? We did not have these issues before starting the delayed slave, but that is just our theory.
Config:
{ "_id" : "balancer", "_secondaryThrottle" : false, "stopped" : false }
Log from shard1 master process:
[migrateThread] warning: migrate commit waiting for 3 slaves for 'xxx.xxx' { shardkey: ObjectId('4fd2025ae087c37d32039a9e') } -> {shardkey: ObjectId('4fd2035ae087c37f04014a79') } waiting for: 529dc9d9:7a
[migrateThread] Waiting for replication to catch up before entering critical section
Log from shard2 master process:
Tue Dec 3 14:52:25.302 [conn1369472] moveChunk data transfer progress: { active: true, ns: "xxx.xxx", from: "shard2/mongo2:27018,mongob2:27018", min: { shardkey: ObjectId('4fd2025ae087c37d32039a9e') }, max: { shardkey: ObjectId('4fd2035ae087c37f04014a79') }, shardKeyPattern: { shardkey: 1.0 }, state: "catchup", counts: { cloned: 22773, clonedBytes: 36323458, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Update:
I confirmed that removing slaveDelay got the balancer working again. As soon as the formerly delayed slaves caught up, chunks moved. So the problem seems to be related to slaveDelay. I also confirmed that the balancer runs with "secondaryThrottle" : false. It does seem to wait for slaves anyway.
Shard2:
Tue Dec 10 11:44:25.423 [migrateThread] warning: migrate commit waiting for 3 slaves for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') } waiting for: 52a6f089:81
Tue Dec 10 11:44:26.423 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:27.423 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:28.423 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:29.424 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:30.424 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:31.424 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:31.424 [migrateThread] migrate commit succeeded flushing to secondaries for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') }
Tue Dec 10 11:44:31.425 [migrateThread] migrate commit flushed to journal for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') }
Tue Dec 10 11:44:31.647 [migrateThread] migrate commit succeeded flushing to secondaries for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') }
Tue Dec 10 11:44:31.667 [migrateThread] migrate commit flushed to journal for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') }
The balancer is properly waiting for the MAJORITY of the replica set of the destination shard to have the documents being migrated before initiating the delete of those documents on the source shard.
The issue is that you have FOUR members in your replica set (master, a slave, a 24h slave delay slave and an arbiter). That means three is the majority. I'm not sure why you added an arbiter, but if you remove it, then TWO will be the majority and the balancer will not have to wait for the delayed slave.
The alternative way of achieving the same result is to configure the delayed slave with the votes:0 property and leave the arbiter as the third voting node.
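The arithmetic behind this answer can be sketched in plain JavaScript. This is only a model of the reasoning above, not the server's actual logic: the host names, the helper names and the rule "a migration can proceed when the non-delayed data-bearing members reach the voting majority" are assumptions made for illustration.

```javascript
// Majority of voting members: floor(n / 2) + 1. votes defaults to 1.
function votingMajority(members) {
  const voters = members.filter(m => (m.votes === undefined ? 1 : m.votes) > 0);
  return Math.floor(voters.length / 2) + 1;
}

// Members that hold data and apply writes without a configured delay.
function promptDataMembers(members) {
  return members.filter(m => !m.arbiterOnly && !m.slaveDelay).length;
}

// Assumed rule: enough prompt data-bearing members to reach the majority.
function migrationCanProceed(members) {
  return promptDataMembers(members) >= votingMajority(members);
}

// Hypothetical member list mirroring the set described in the question.
const current = [
  { host: 'master' },
  { host: 'slave' },
  { host: 'delayed', slaveDelay: 86400, hidden: true, priority: 0 },
  { host: 'arbiter', arbiterOnly: true },
];
const withoutArbiter = current.filter(m => !m.arbiterOnly);
const delayedNonVoting = current.map(m => (m.slaveDelay ? { ...m, votes: 0 } : m));

for (const [name, set] of Object.entries({ current, withoutArbiter, delayedNonVoting })) {
  console.log(name, 'majority =', votingMajority(set), 'can proceed =', migrationCanProceed(set));
}
```

With four voting members the majority is three, which only the delayed slave can complete; both suggested changes bring the majority down to two.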
What version are you running? There is a known bug in 2.4.2 and below, as well as 2.2.4 and below that causes an incorrect count of the number of secondaries in the set (and hence makes it impossible to satisfy the default w:majority write for the migration). This is the bug (fixed in 2.4.3+ and 2.2.5+):
https://jira.mongodb.org/browse/SERVER-8420
Turning off the secondary throttle should be a valid workaround, but you may want to do a flushRouterConfig on any mongos processes (or just restart all the mongos processes) to make sure the setting is taking effect for your migrations, especially if they are taking a day to time out. As another potential fix prior to upgrade, you can also drop the local.slaves collection (it will be recreated).
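To check a version string against the ranges quoted above, a small helper along these lines may be handy. The function name is made up for this sketch, and it deliberately returns null for release series that the answer does not mention rather than guessing.

```javascript
// Returns true/false for the 2.2 and 2.4 series based on the ranges quoted
// for SERVER-8420 (affected: <= 2.2.4 and <= 2.4.2), and null for any other
// series, since the answer gives no information about them.
function affectedBySecondaryCountBug(version) {
  const [major, minor, patch] = version.split('.').map(part => parseInt(part, 10));
  if (major !== 2 || Number.isNaN(patch)) return null;
  if (minor === 4) return patch <= 2;
  if (minor === 2) return patch <= 4;
  return null;
}

for (const v of ['2.2.4', '2.2.5', '2.4.2', '2.4.3', '3.4.5']) {
  console.log(v, affectedBySecondaryCountBug(v));
}
```

The version string itself can be read from db.version() in the shell on each shard's primary.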
Setup: replica set with 5 nodes, version 3.4.5.
Trying to switch PRIMARY with rs.stepDown(60, 30) but consistently getting the error:
rs0:PRIMARY> rs.stepDown(60, 30)
{
"ok" : 0,
"errmsg" : "No electable secondaries caught up as of 2017-07-11T00:21:11.205+0000. Please use {force: true} to force node to step down.",
"code" : 50,
"codeName" : "ExceededTimeLimit"
}
However, rs.printSlaveReplicationInfo() running in a parallel terminal confirms that all replicas are fully caught up:
rs0:PRIMARY> rs.printSlaveReplicationInfo()
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
Am I doing something wrong?
UPD: I've checked long running operations before and during rs.stepDown as was suggested below and it looks like this:
# Before rs.stepDown
$ watch "mongo --quiet --eval 'JSON.stringify(db.currentOp())' | jq -r '.inprog[] | \"\(.secs_running) \(.desc) \(.op)\"' | sort -rnk1"
984287 rsSync none
984287 ReplBatcher none
67 WT RecordStoreThread: local.oplog.rs none
null SyncSourceFeedback none
null NoopWriter none
0 conn615153 command
0 conn614948 update
0 conn614748 getmore
...
# During rs.stepDown
984329 rsSync none
984329 ReplBatcher none
108 WT RecordStoreThread: local.oplog.rs none
16 conn615138 command
16 conn615136 command
16 conn615085 update
16 conn615079 insert
...
Basically, long-running user operations seem to appear as a result of rs.stepDown(): secs_running becomes nonzero once the PRIMARY attempts to step down and keeps growing until stepDown fails. Then everything goes back to normal.
Any ideas on why this happens and whether that's normal at all?
I used the command below to step the primary down to secondary:
db.adminCommand( { replSetStepDown: 120, secondaryCatchUpPeriodSecs: 15, force: true } )
You can find this in the official MongoDB documentation:
https://docs.mongodb.com/manual/reference/command/replSetStepDown/
To close the loop on this question, it was determined that the failed stepdown was due to time going backward on the host.
MongoDB 3.4.6 is more resilient to time issues on the host, and upgrading the deployment fixes the stalling issues.
Before stepping down, rs.stepDown() will attempt to terminate long running user operations that would block the primary from stepping down, such as an index build, a write operation or a map-reduce job.
Do you have any long-running jobs in progress? Check the result of db.currentOp().
You can also try setting a longer step-down time: rs.stepDown(60, 360).
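As a rough sketch of what to look for in the db.currentOp() result, the function below filters a currentOp-shaped object for operations that have been running for at least a given number of seconds. The sample data is copied from the "During rs.stepDown" listing in the question; the function name, the threshold, and the choice to skip entries whose op is "none" (treated here as internal threads) are assumptions of this example.

```javascript
// Lists operations running for at least minSecs seconds, longest first,
// formatted like the jq output in the question: "<secs> <desc> <op>".
function longRunningOps(currentOp, minSecs) {
  return currentOp.inprog
    .filter(op => typeof op.secs_running === 'number' && op.secs_running >= minSecs)
    .filter(op => op.op !== 'none') // assumption: "none" marks internal threads
    .sort((a, b) => b.secs_running - a.secs_running)
    .map(op => `${op.secs_running} ${op.desc} ${op.op}`);
}

// Sample shaped like db.currentOp() output, values taken from the question.
const duringStepDown = {
  inprog: [
    { secs_running: 984329, desc: 'rsSync', op: 'none' },
    { secs_running: 984329, desc: 'ReplBatcher', op: 'none' },
    { secs_running: 108, desc: 'WT RecordStoreThread: local.oplog.rs', op: 'none' },
    { secs_running: 16, desc: 'conn615138', op: 'command' },
    { secs_running: 16, desc: 'conn615136', op: 'command' },
    { secs_running: 16, desc: 'conn615085', op: 'update' },
    { secs_running: 16, desc: 'conn615079', op: 'insert' },
  ],
};

console.log(longRunningOps(duringStepDown, 10).join('\n'));
```

In the shell, the same function could be applied to the object returned by db.currentOp() directly.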
Quoting an answer from https://jira.mongodb.org/browse/SERVER-27015:
This is most likely due to the fact that by default the shutdown command will only succeed on a primary if the secondaries are fully caught up at the exact moment that the shutdown command is executed.
I faced a similar issue and tried the db.shutdownServer() command several times; it only worked when the secondary was exactly 0 seconds behind the primary.
I have a 3-node MongoDB replica set. I managed to start the first two nodes, but the third one is failing with:
[rsBackgroundSync] starting rollback: OplogStartMissing our last op time fetched: (term: 33, timestamp: Jan 22 09:34:52:1). source's GTE: (term: 34, timestamp: Jan 22 09:35:25:1) hashes: (-9060984734961038872/2476820215102251535)
2017-01-22T14:01:51.206+0000 F REPL [rsBackgroundSync] need to rollback, but in inconsistent state
2017-01-22T14:01:51.206+0000 I - [rsBackgroundSync] Fatal assertion 28723 UnrecoverableRollbackError need to rollback, but in inconsistent state. minvalid: (term: 38, timestamp: Jan 22 11:13:01:1) > our last optime: (term: 33, timestamp: Jan 22 09:34:52:1) # 18750
I took a mongodump from the Primary, removed this third replica (mongoreplica3) from the replica set and restored the dump onto it, but after I tried to add the node back to the replica set it still failed with the same error.
Any idea how I can manually sync and start this mongoreplica3 with my replica set?
This was solved by removing everything from /data and starting mongoreplica3, which then synced with the Primary.
After I restarted my sharded cluster I noticed the balancer was not migrating any data anymore but the command sh.isBalancerRunning() always returned true.
I tried to run the command sh.stopBalancer() and it got stuck forever on:
sh.stopBalancer()
Waiting for active hosts...
Waiting for the balancer lock...
Checking on the config server locks here is the data:
configsvr> db.locks.find({_id: "balancer"})
{ "_id" : "balancer", "process" : "myserver.mongodb.com:27017:1452776409:1804289383",
"state" : 2, "ts" : ObjectId("56cb817f2c4edd1226d6ae07"), "when" : ISODate("2016-02-22T21:45:35.360Z"), "who" : "myserver.mongodb.com:27017:1452776409:1804289383:Balancer:846930886",
"why" : "doing balance round" }
Also, if I try to run sh.startBalancer() it times out:
mongos> sh.startBalancer()
2016-02-23T22:51:11.204-0500 E QUERY [thread1] Error: assert.soon failed, msg:Waited too long for lock balancer to change to state undefined :
doassert#src/mongo/shell/assert.js:15:14
assert.soon#src/mongo/shell/assert.js:200:13
sh.waitForDLock#src/mongo/shell/utils_sh.js:171:1
sh.waitForBalancer#src/mongo/shell/utils_sh.js:264:9
sh.startBalancer#src/mongo/shell/utils_sh.js:146:5
#(shell):1:1
In the sh.status() output:
balancer:
Currently enabled: yes
Currently running: yes
Balancer lock taken at Mon Feb 22 2016 16:45:35 GMT-0500 (EST) by myserver.mongodb.com:27017:1452776409:1804289383:Balancer:846930886
Balancer active window is set between 8:00 and 6:00 server local time
Failed balancer rounds in last 5 attempts: 5
Last reported error: Connection refused
Time of Reported error: Tue Feb 23 2016 17:27:26 GMT-0500 (EST)
Migration Results for the last 24 hours:
No recent migrations
I have tried restarting the servers, stepping down primaries, changing the balancer lock's state to 0 and running sh.startBalancer(), and removing the balancer entry from the locks collection and running sh.startBalancer() again, all with no results.
In the end it was an issue with the server clocks being out of sync; for some reason the log messages about this did not appear until the next day.
Hope this helps someone with a similar issue :)
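For anyone looking at a similar lock document, here is a small sketch that decodes the creation time embedded in the lock's ts ObjectId (its first four bytes are seconds since the Unix epoch) and computes how long the lock has been held. The values are the ones shown in the question; the function names are made up for this example, and a long-held lock only hints that something is stuck, it does not prove clock skew on its own.

```javascript
// The first 8 hex characters of an ObjectId encode seconds since the epoch.
function objectIdToDate(hex) {
  return new Date(parseInt(hex.slice(0, 8), 16) * 1000);
}

// Hours elapsed between the lock's "when" field and a reference time.
function lockAgeHours(lockDoc, now) {
  return (now.getTime() - lockDoc.when.getTime()) / 3600000;
}

// Values copied from the config.locks document in the question.
const balancerLock = {
  _id: 'balancer',
  state: 2,
  ts: '56cb817f2c4edd1226d6ae07',
  when: new Date('2016-02-22T21:45:35.360Z'),
};

// Time of the failed sh.startBalancer() call from the question, in UTC.
const observedAt = new Date('2016-02-23T22:51:11.204-05:00');

console.log('ts created at:', objectIdToDate(balancerLock.ts).toISOString());
console.log('lock held for hours:', lockAgeHours(balancerLock, observedAt).toFixed(1));
```

Here the ts and when values agree to the second, and the lock had been held for more than a day when sh.startBalancer() timed out.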
I am running a conventional MongoDB Replica Set consisting of 3 members (member1 in datacenter A, member2 and member3 in datacenter B).
member1 is the current PRIMARY and I am adding members 2 and 3 via rs.add(). They perform their initial sync and become SECONDARY very soon. Everything is fine all day long and the replication delay of both members is 0 seconds until 2 AM.
Now: every night at 2 AM both members shift into the RECOVERING state and stop replicating altogether, which leads to a replication delay of hours when I look at rs.printSlaveReplicationInfo() in the morning. As far as I know, there are no massive inserts or maintenance tasks at around 2 AM.
I get the following log entries on the PRIMARY:
2015-10-09T01:59:38.914+0200 [initandlisten] connection accepted from 192.168.227.209:59905 #11954 (37 connections now open)
2015-10-09T01:59:55.751+0200 [conn11111] warning: Collection dropped or state deleted during yield of CollectionScan
2015-10-09T01:59:55.869+0200 [conn11111] warning: Collection dropped or state deleted during yield of CollectionScan
2015-10-09T01:59:55.870+0200 [conn11111] getmore local.oplog.rs cursorid:1155433944036 ntoreturn:0 keyUpdates:0 numYields:1 locks(micros) r:32168 nreturned:0 reslen:20 134ms
2015-10-09T01:59:55.872+0200 [conn11111] end connection 192.168.227.209:58972 (36 connections now open)
And, which is more interesting, I get the following log entries on both SECONDARYs:
2015-10-09T01:59:55.873+0200 [rsBackgroundSync] repl: old cursor isDead, will initiate a new one
2015-10-09T01:59:55.873+0200 [rsBackgroundSync] replSet syncing to: member1:27017
2015-10-09T01:59:56.065+0200 [rsBackgroundSync] replSet error RS102 too stale to catch up, at least from member1:27017
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet our last optime : Oct 9 01:59:23 5617035b:17f
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet oldest at member1:27017 : Oct 9 01:59:23 5617035b:1af
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet error RS102 too stale to catch up
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet RECOVERING
Also striking: the start of the oplog "resets" itself every night at around 2 AM:
configured oplog size: 990MB
log length start to end: 19485secs (5.41hrs)
oplog first event time: Fri Oct 09 2015 02:00:33 GMT+0200 (CEST)
oplog last event time: Fri Oct 09 2015 07:25:18 GMT+0200 (CEST)
now: Fri Oct 09 2015 07:25:26 GMT+0200 (CEST)
I am not sure whether this is somehow correlated with the issue. I am also surprised that such a small delay (Oct 9 01:59:23 5617035b:17f <-> Oct 9 01:59:23 5617035b:1af) makes the members become stale.
Could this also be a server (VM host) time issue, or is it something completely different? (Why is the first oplog event being "reset" every night instead of "shifting" to a timestamp like NOW minus 24 hrs?)
What can I do to investigate and avoid this?
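On the "small delay" point, it may help to decode the two optimes from the RS102 log lines. The sketch below assumes the format is hexadecimal seconds since the epoch, a colon, then a hexadecimal counter within that second, and that a member counts as too stale when its last optime is older than the oldest entry still in the source's oplog. Both are assumptions of this example, as are the function names.

```javascript
// Parses an optime such as "5617035b:17f" into seconds and an increment.
function parseOptime(text) {
  const [secondsHex, incrementHex] = text.split(':');
  return { seconds: parseInt(secondsHex, 16), increment: parseInt(incrementHex, 16) };
}

// Negative when a is older than b, positive when newer, 0 when equal.
function compareOptime(a, b) {
  return a.seconds - b.seconds || a.increment - b.increment;
}

// Assumed staleness rule: our last optime predates the source's oldest entry.
function isTooStale(ourLast, oldestOnSource) {
  return compareOptime(ourLast, oldestOnSource) < 0;
}

const ourLast = parseOptime('5617035b:17f');        // "our last optime"
const oldestOnSource = parseOptime('5617035b:1af'); // "oldest at member1:27017"

console.log(new Date(ourLast.seconds * 1000).toISOString());
console.log('increments:', ourLast.increment, oldestOnSource.increment);
console.log('too stale:', isTooStale(ourLast, oldestOnSource));
```

Under these assumptions both optimes fall in the same second, but the secondary's position is a few dozen operations before the oldest entry the primary still holds, so the size of the gap in time does not matter: the entries it needs next are already gone.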
Upping the oplog size should solve this (per our comments).
Some references for others who run into this issue:
Workloads that Might Require a Larger Oplog Size
Error: replSet error RS102 too stale to catch up link1 & link2
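As a very rough starting point for choosing a new size, the figures from the oplog summary in the question can be extrapolated linearly. This is a sketch under a strong assumption: that the write rate is constant. The nightly "reset" of the oplog start suggests a burst around 2 AM that fills the oplog much faster than the daytime rate, so the real requirement is probably higher. The function name and the 24-hour target are illustrative only.

```javascript
// Scales the configured oplog size by the ratio of target to observed window.
// Assumes a constant write rate, which a nightly burst would violate.
function estimateOplogSizeMB(currentSizeMB, currentWindowSecs, targetWindowSecs) {
  return Math.ceil(currentSizeMB * (targetWindowSecs / currentWindowSecs));
}

// 990 MB covered 19485 seconds in the question's output.
const sizeFor24h = estimateOplogSizeMB(990, 19485, 24 * 3600);
console.log('estimated oplog size for a 24h window (MB):', sizeFor24h);
```

A more reliable figure would come from measuring how much oplog the 2 AM workload consumes and sizing for that peak.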
I see that the MongoDB documentation says that an index is removed by calling db.accounts.dropIndex( { "tax-id": 1 } ). But it does not say whether the node needs to be removed from the replica set or not.
I tried taking a secondary node in a replica set offline, restarting it as a standalone node (on a different port), and dropping the index there.
But after bringing the node back into the replica set with the regular process (sudo service mongod start), the mongod process dies, saying the index got corrupted.
Thu Oct 31 19:52:38.098 [repl writer worker 1] Assertion: 15898:error in index possibly corruption consider repairing 382
0xdddd81 0xd9f55b 0xd9fa9c 0x7edb83 0x7fb332 0x7fdc08 0x9d3b50 0x9c796e 0x9deb64 0xac45dd 0xac58df 0xa903fa 0xa924c7 0xa71f6c 0xc273d3 0xc26b18 0xdab721 0xe26609 0x7ff4d05f0c6b 0x7ff4cf9965ed
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdddd81]
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0xd9f55b]
/usr/bin/mongod() [0xd9fa9c]
/usr/bin/mongod(_ZN5mongo11checkFailedEj+0x143) [0x7edb83]
/usr/bin/mongod(_ZNK5mongo12BucketBasicsINS_12BtreeData_V1EE11basicInsertENS_7DiskLocERiS3_RKNS_5KeyV1ERKNS_8OrderingE+0x222) [0x7fb332]
/usr/bin/mongod(_ZNK5mongo11BtreeBucketINS_12BtreeData_V1EE10insertHereENS_7DiskLocEiS3_RKNS_5KeyV1ERKNS_8OrderingES3_S3_RNS_12IndexDetailsE+0x68) [0x7fdc08]
/usr/bin/mongod(_ZNK5mongo30IndexInsertionContinuationImplINS_12BtreeData_V1EE22doIndexInsertionWritesEv+0xa0) [0x9d3b50]
/usr/bin/mongod(_ZN5mongo14IndexInterface13IndexInserter19finishAllInsertionsEv+0x1e) [0x9c796e]
/usr/bin/mongod(_ZN5mongo24indexRecordUsingTwoStepsEPKcPNS_16NamespaceDetailsENS_7BSONObjENS_7DiskLocEb+0x754) [0x9deb64]
/usr/bin/mongod(_ZN5mongo11DataFileMgr6insertEPKcPKvibbbPb+0x123d) [0xac45dd]
/usr/bin/mongod(_ZN5mongo11DataFileMgr16insertWithObjModEPKcRNS_7BSONObjEbb+0x4f) [0xac58df]
/usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEPNS_11RemoveSaverEbRKNS_24QueryPlanSelectionPolicyEb+0x2eda) [0xa903fa]
/usr/bin/mongod(_ZN5mongo27updateObjectsForReplicationEPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEbRKNS_24QueryPlanSelectionPolicyE+0xb7) [0xa924c7]
/usr/bin/mongod(_ZN5mongo21applyOperation_inlockERKNS_7BSONObjEbb+0x65c) [0xa71f6c]
/usr/bin/mongod(_ZN5mongo7replset8SyncTail9syncApplyERKNS_7BSONObjEb+0x713) [0xc273d3]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x48) [0xc26b18]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdab721]
/usr/bin/mongod() [0xe26609]
/lib64/libpthread.so.0(+0x7c6b) [0x7ff4d05f0c6b]
/lib64/libc.so.6(clone+0x6d) [0x7ff4cf9965ed]
Thu Oct 31 19:52:38.106 [repl writer worker 1] ERROR: writer worker caught exception: error in index possibly corruption consider repairing 382 on:
xxxxxxxx--deleted content related to the data...xxxxxxxxxxxxx
Thu Oct 31 19:52:38.106 [repl writer worker 1] Fatal Assertion 16360
0xdddd81 0xd9dc13 0xc26bfc 0xdab721 0xe26609 0x7ff4d05f0c6b 0x7ff4cf9965ed
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdddd81]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd9dc13]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc26bfc]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdab721]
/usr/bin/mongod() [0xe26609]
/lib64/libpthread.so.0(+0x7c6b) [0x7ff4d05f0c6b]
/lib64/libc.so.6(clone+0x6d) [0x7ff4cf9965ed]
Thu Oct 31 19:52:38.108 [repl writer worker 1]
***aborting after fassert() failure
Thu Oct 31 19:52:38.108 Got signal: 6 (Aborted).
Is this due to dropping the index in offline mode on the secondary? Any suggestions on the proper way to drop the index are highly appreciated.
The proper way to remove an index from a replica set is to drop it on the primary. The idea of a replica set is that every member holds the same copy of the data (with a small time lag), so whatever you do on the primary is copied to the secondaries: once an operation finishes on the primary, it propagates to the secondaries.
If you remove the index on the primary, it will be removed on the secondaries as well.