After I restarted my sharded cluster I noticed that the balancer was no longer migrating any data, but the command sh.isBalancerRunning() always returned true.
I tried to run sh.stopBalancer() and it hung forever at:
sh.stopBalancer()
Waiting for active hosts...
Waiting for the balancer lock...
Checking the locks on the config server, here is the data:
configsvr> db.locks.find({_id: "balancer"})
{ "_id" : "balancer", "process" : "myserver.mongodb.com:27017:1452776409:1804289383",
"state" : 2, "ts" : ObjectId("56cb817f2c4edd1226d6ae07"), "when" : ISODate("2016-02-22T21:45:35.360Z"), "who" : "myserver.mongodb.com:27017:1452776409:1804289383:Balancer:846930886",
"why" : "doing balance round" }
Also, if I try to run sh.startBalancer() it times out:
mongos> sh.startBalancer()
2016-02-23T22:51:11.204-0500 E QUERY [thread1] Error: assert.soon failed, msg:Waited too long for lock balancer to change to state undefined :
doassert#src/mongo/shell/assert.js:15:14
assert.soon#src/mongo/shell/assert.js:200:13
sh.waitForDLock#src/mongo/shell/utils_sh.js:171:1
sh.waitForBalancer#src/mongo/shell/utils_sh.js:264:9
sh.startBalancer#src/mongo/shell/utils_sh.js:146:5
#(shell):1:1
In the sh.status() output:
balancer:
Currently enabled: yes
Currently running: yes
Balancer lock taken at Mon Feb 22 2016 16:45:35 GMT-0500 (EST) by myserver.mongodb.com:27017:1452776409:1804289383:Balancer:846930886
Balancer active window is set between 8:00 and 6:00 server local time
Failed balancer rounds in last 5 attempts: 5
Last reported error: Connection refused
Time of Reported error: Tue Feb 23 2016 17:27:26 GMT-0500 (EST)
Migration Results for the last 24 hours:
No recent migrations
I have tried restarting the servers, stepping down the primaries, changing the balancer lock's state to 0 and running sh.startBalancer(), and removing the balancer document from the locks collection before running sh.startBalancer() again, all with no results.
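For the record, the lock-reset attempt mentioned above looked roughly like this (a sketch, assuming the legacy config.locks balancer document shown earlier; only do this while the balancer is stopped on every mongos):
// Force-release the stale balancer lock by setting its state back to 0 (unlocked).
// Run against the config database, e.g. through a mongos.
db.getSiblingDB("config").locks.update(
    { _id: "balancer" },
    { $set: { state: 0 } }
)
// Then try to re-enable the balancer:
sh.startBalancer()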
In the end it was an issue with the server clocks being out of sync; for some reason the logs about this issue didn't appear until the next day.
Hope this helps someone with a similar issue :)
Related
We had an outage on one of our PostgreSQL 14 clusters (managed by Zalando) due to the k8s control plane being unreachable for 30 minutes.
The control plane is now OK, but the primary PostgreSQL instance does not want to start:
LOG,00000,"listening on IPv4 address ""0.0.0.0"", port 5432"
LOG,00000,"listening on IPv6 address ""::"", port 5432"
LOG,00000,"listening on Unix socket ""/var/run/postgresql/.s.PGSQL.5432"""
LOG,00000,"database system was shut down at 2023-01-30 02:51:10 UTC"
WARNING,01000,"specified neither primary_conninfo nor restore_command",,"The database server will regularly poll the pg_wal subdirectory to check for files placed there."
LOG,00000,"entering standby mode"
FATAL,XX000,"requested timeline 5 is not a child of this server's history","Latest checkpoint is at 2/82000028 on timeline 4, but in the history of the requested timeline, the server forked off from that timeline at 0/530000A0."
LOG,00000,"startup process (PID 23007) exited with exit code 1"
LOG,00000,"aborting startup due to startup process failure"
LOG,00000,"database system is shut down"
We can see in the archive_status folder:
-rw-------. 1 postgres postgres 0 Jan 30 02:51 000000040000000200000081.ready
-rw-------. 1 postgres postgres 0 Jan 30 02:51 00000005.history.done
Would you know how we can recover safely from this?
I guess switching back to timeline 4 would be enough, as timeline 5 was created after the start of the outage.
The server is starting in standby mode. Remove standby.signal if you want to start the server as a primary server.
I have an IBM WebSphere Application Server 8.5 that has worked with Db2 11.1 for 2 years. For about a month, the application server has been hanging: the DB CPU drops to 0, the application server CPU goes above 80%, and it hangs; after nearly 24 hours the same problem repeats, every day. Logs from the servers are below.
db2diag error from today:
2020-12-09-10.03.24.732486+120 I1234525159E610 LEVEL: Error
PID : 5737 TID : 139739072030464 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : WPJCR
APPHDL : 0-38161 APPID: ::ffff:x.42258.201209075007
UOWID : 199 ACTID: 1
AUTHID : DB2INST1 HOSTNAME: ERTUWCMDB1Az
EDUID : 1760 EDUNAME: db2agent (WPJCR) 0
FUNCTION: DB2 UDB, common communication, sqlcctest, probe:50
MESSAGE : sqlcctest RC
DATA #1 : Hexdump, 2 bytes
0x00007F1789BFCDE0 : 3600 6.
2020-12-09-10.03.24.732661+120 I1234525770E601 LEVEL: Error
PID : 5737 TID : 139739072030464 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : WPJCR
APPHDL : 0-38161 APPID: ::ffff:x.42258.201209075007
UOWID : 199 ACTID: 1
AUTHID : DB2INST1 HOSTNAME: ERTUWCMDB1Az
EDUID : 1760 EDUNAME: db2agent (WPJCR) 0
FUNCTION: DB2 UDB, base sys utilities, sqeAgent::AgentBreathingPoint, probe:10
CALLED : DB2 UDB, common communication, sqlcctest
RETCODE : ZRC=0x00000036=54
[11/3/20 6:42:13:596 EET] 000006ad XATransaction E J2CA0027E: An
exception occurred while invoking rollback on an XA Resource Adapter
from DataSource jdbc/wpjcrdbDS, within transaction ID {XidImpl:
formatId(57415344), gtrid_length(36), bqual_length(54),
data(000001758c648aa7000000082a775800f8c220c5f6bdab92156eae0be31e28ea7605ade8000001758c648aa7000000082a775800f8c220c5f6bdab92156eae0be31e28ea7605ade8000000010000000000000000000000000001)}
: com.ibm.db2.jcc.am.XaException: [jcc][t4][2041][12326][4.25.13]
Error executing XAResource.rollback(). Server returned XAER_NOTA.
ERRORCODE=-4203, SQLSTATE=null
After a while the DB CPU drops to 0, the application server CPU goes above 80%, and it hangs; after nearly 24 hours the same problem repeats.
Is this a deadlock or a lock timeout due to data corruption?
Without seeing any other app server logs, the combination of your noting that the problem repeats after nearly 24 hours, the sqeAgent::AgentBreathingPoint error (see the IBM technote https://www.ibm.com/support/pages/what-does-agentbreathingpoint-error-mean-db2 for more info), and the fact that it worked for 2 years and has only been hanging for the last month would lead me to look for a change in your network where a connection timeout has recently been set, closing connections after 24 hours. This can be caused by replacing a router or upgrading firmware where the settings are different. Does this occur at about the same time every day, and if so, is it occurring as the app goes from a quiet state (like overnight) to a busy state (like the start of a workday)?
Based on your answer, it sounds like the entire connection pool is becoming "stale" overnight, meaning the connections are not being used and a network timeout is causing them to become disconnected from the DB server. You can try changing the WAS datasource settings for "Minimum connections" to 0 and the "Unused Timeout" to perhaps 12 hours. This will allow the connection pool to drain overnight as the server traffic quiesces. As the app load starts in the morning, new connections will be obtained, avoiding the errors. If your "Maximum Connections" setting is very large, you may experience some slowness while the connection pool is being filled.
Setup: replica set with 5 nodes, version 3.4.5.
Trying to switch PRIMARY with rs.stepDown(60, 30) but consistently getting the error:
rs0:PRIMARY> rs.stepDown(60, 30)
{
"ok" : 0,
"errmsg" : "No electable secondaries caught up as of 2017-07-11T00:21:11.205+0000. Please use {force: true} to force node to step down.",
"code" : 50,
"codeName" : "ExceededTimeLimit"
}
However, rs.printSlaveReplicationInfo() running in a parallel terminal confirms that all replicas are fully caught up:
rs0:PRIMARY> rs.printSlaveReplicationInfo()
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
source: X.X.X.X:27017
syncedTo: Tue Jul 11 2017 00:21:11 GMT+0000 (UTC)
0 secs (0 hrs) behind the primary
Am I doing something wrong?
UPD: I've checked long-running operations before and during rs.stepDown(), as suggested below, and it looks like this:
# Before rs.stepDown
$ watch "mongo --quiet --eval 'JSON.stringify(db.currentOp())' | jq -r '.inprog[] | \"\(.secs_running) \(.desc) \(.op)\"' | sort -rnk1"
984287 rsSync none
984287 ReplBatcher none
67 WT RecordStoreThread: local.oplog.rs none
null SyncSourceFeedback none
null NoopWriter none
0 conn615153 command
0 conn614948 update
0 conn614748 getmore
...
# During rs.stepDown
984329 rsSync none
984329 ReplBatcher none
108 WT RecordStoreThread: local.oplog.rs none
16 conn615138 command
16 conn615136 command
16 conn615085 update
16 conn615079 insert
...
Basically, long-running user operations seem to appear as a result of rs.stepDown(): secs_running becomes nonzero once the PRIMARY attempts to switch over and keeps growing until stepDown fails. Then everything goes back to normal.
Any ideas on why this happens and whether that's normal at all?
I have used the command below to step down to secondary:
db.adminCommand( { replSetStepDown: 120, secondaryCatchUpPeriodSecs: 15, force: true } )
You can find this in the MongoDB official documentation below:
https://docs.mongodb.com/manual/reference/command/replSetStepDown/
To close the loop on this question, it was determined that the failed stepdown was due to time going backward on the host.
MongoDB 3.4.6 is more resilient to time issues on the host, and upgrading the deployment fixes the stalling issues.
Before stepping down, rs.stepDown() will attempt to terminate long-running user operations that would block the primary from stepping down, such as an index build, a write operation, or a map-reduce job.
Do you have some long-running jobs ongoing? Check the result of db.currentOp().
You can try to set a longer stepping-down time: rs.stepDown(60, 360).
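To make that check concrete, here is a minimal sketch (not from the original answer) that filters db.currentOp() for operations running longer than 30 seconds; the 30-second threshold is an arbitrary example:
// List user operations that have been running for more than 30 seconds.
db.currentOp(true).inprog
    .filter(function (op) { return op.secs_running && op.secs_running > 30; })
    .forEach(function (op) {
        printjson({ opid: op.opid, secs_running: op.secs_running, op: op.op, ns: op.ns });
    });
// If one of them is blocking the step-down and is safe to interrupt, it can be killed with db.killOp(<opid>).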
Quoting an answer from https://jira.mongodb.org/browse/SERVER-27015:
This is most likely due to the fact that by default the shutdown command will only succeed on a primary if the secondaries are fully caught up at the exact moment that the shutdown command is executed.
I faced a similar issue and tried the db.shutdownServer() command several times; however, it worked only at the exact moment when the secondary was 0 seconds behind the primary.
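If you are relying on db.shutdownServer() rather than rs.stepDown(), an alternative to retrying in a loop is to give the shutdown its own catch-up window. A sketch, assuming a MongoDB 3.x shell where the timeoutSecs option is supported:
// Ask the primary to wait up to 60 seconds for a secondary to catch up before shutting down.
db.shutdownServer({ timeoutSecs: 60 })
// Equivalent form via the underlying command:
db.adminCommand({ shutdown: 1, timeoutSecs: 60 })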
We have a setup of two MongoDB shards. Each shard contains a master, a slave, a slave with a 24h slaveDelay, and an arbiter.
However, the balancer fails to migrate any chunks, waiting for the delayed slave during migration.
I have tried setting _secondaryThrottle to false in the balancer config, but I still have the issue.
It seems the migration goes on for a day and then fails (a ton of "waiting for slave" messages in the logs). Eventually it gives up and starts a new migration. The message says waiting for 3 slaves, but the delayed slave is hidden and priority 0, so it shouldn't have to wait for that one. And if _secondaryThrottle worked, it should not wait for any slave, right?
It's been like this for a few months now, so the config should have been reloaded on all mongoses. Some of the mongoses running the balancer have been restarted recently.
Does anyone have any idea how to solve this problem? We did not have these issues before adding the delayed slave, but that is just our theory.
Config:
{ "_id" : "balancer", "_secondaryThrottle" : false, "stopped" : false }
Log from shard1 master process:
[migrateThread] warning: migrate commit waiting for 3 slaves for 'xxx.xxx' { shardkey: ObjectId('4fd2025ae087c37d32039a9e') } -> { shardkey: ObjectId('4fd2035ae087c37f04014a79') } waiting for: 529dc9d9:7a
[migrateThread] Waiting for replication to catch up before entering critical section
Log from shard2 master process:
Tue Dec 3 14:52:25.302 [conn1369472] moveChunk data transfer progress: { active: true, ns: "xxx.xxx", from: "shard2/mongo2:27018,mongob2:27018", min: { shardkey: ObjectId('4fd2025ae087c37d32039a9e') }, max: { shardkey: ObjectId('4fd2035ae087c37f04014a79') }, shardKeyPattern: { shardkey: 1.0 }, state: "catchup", counts: { cloned: 22773, clonedBytes: 36323458, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Update:
I confirmed that removing slaveDelay got the balancer working again. As soon as they got up to speed, chunks moved. So the problem seems to be related to the slaveDelay. I also confirmed that the balancer runs with "secondaryThrottle" : false; it does seem to wait for slaves anyway.
Shard2:
Tue Dec 10 11:44:25.423 [migrateThread] warning: migrate commit waiting for 3 slaves for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') } waiting for: 52a6f089:81
Tue Dec 10 11:44:26.423 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:27.423 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:28.423 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:29.424 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:30.424 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:31.424 [migrateThread] Waiting for replication to catch up before entering critical section
Tue Dec 10 11:44:31.424 [migrateThread] migrate commit succeeded flushing to secondaries for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') }
Tue Dec 10 11:44:31.425 [migrateThread] migrate commit flushed to journal for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') }
Tue Dec 10 11:44:31.647 [migrateThread] migrate commit succeeded flushing to secondaries for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') }
Tue Dec 10 11:44:31.667 [migrateThread] migrate commit flushed to journal for 'xxx.xxx' { shardkey: ObjectId('4ff1213ee087c3516b2f703f') } -> { shardkey: ObjectId('4ff12a5eddf2b32dff1e7bea') }
The balancer is properly waiting for the MAJORITY of the replica set of the destination shard to have the documents being migrated before initiating the delete of those documents on the source shard.
The issue is that you have FOUR members in your replica set (a master, a slave, a 24h-delayed slave, and an arbiter). That means three is the majority. I'm not sure why you added an arbiter, but if you remove it, then TWO will be the majority and the balancer will not have to wait for the delayed slave.
An alternative way of achieving the same result is to configure the delayed slave with votes: 0 and leave the arbiter as the third voting node.
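A sketch of that reconfiguration, assuming the delayed slave is the member at index 3 in rs.conf() (check the actual index before running this):
// Run on the primary of the shard's replica set.
cfg = rs.conf()
cfg.members[3].votes = 0      // assumption: members[3] is the delayed slave
cfg.members[3].priority = 0   // it is already hidden and priority 0 per the question
rs.reconfig(cfg)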
What version are you running? There is a known bug in 2.4.2 and below, as well as 2.2.4 and below, that causes an incorrect count of the number of secondaries in the set (and hence makes it impossible to satisfy the default w:majority write for the migration). This is the bug (fixed in 2.4.3+ and 2.2.5+):
https://jira.mongodb.org/browse/SERVER-8420
Turning off the secondary throttle should be a valid workaround, but you may want to do a flushRouterConfig on any mongos processes (or just restart all the mongos processes) to make sure the setting is taking effect for your migrations, especially if they are taking a day to time out. As another potential fix prior to upgrade, you can also drop the local.slaves collection (it will be recreated).
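Concretely, a sketch of those two steps (run the first on each mongos and the second on the primary of each shard):
// On every mongos: force it to reload the sharding configuration, including the throttle setting.
db.adminCommand({ flushRouterConfig: 1 })
// On each shard primary: drop the stale slave-tracking collection (it is recreated automatically).
db.getSiblingDB("local").slaves.drop()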
We have a MongoDB cluster with 2 shards; each shard has these servers:
Shard 1: Master, running MongoD and Config server
Shard 1-s1: Slave, running MongoD and MongoS server
Shard 1-s2: Slave, running MongoD and MongoS and Arbiter server
Shard 2: Master, running MongoD and Config Server
Shard 2-s1: Slave, running MongoD and Config and MongoS server
Shard 2-s2: Slave, running MongoD and MongoS and Arbiter server
But MongoDB has kept failing in recent days. After days of searching, I found out that the mongod running on Shard 1 (Master) always goes down after receiving too many connections; the other mongod processes don't have this problem.
When Shard 1 Master's mongod has been running with too many connections for about 2 hours, the 4 mongos servers shut down one by one. Here is the mongos error log (10.81.4.72:7100 runs mongod):
Tue Aug 20 20:01:52 [conn8526] DBClientCursor::init call() failed
Tue Aug 20 20:01:52 [conn3897] ns: user.dev could not initialize cursor across all shards because : stale config detected for ns: user.dev ParallelCursor::_init # s01/10.36.31.36:7100,10.42.50.24:7100,10.81.4.72:7100 attempt: 0
Tue Aug 20 20:01:52 [conn744] ns: user.dev could not initialize cursor across all shards because : stale config detected for ns: user.dev ParallelCursor::_init # s01/10.36.31.36:7100,10.42.50.24:7100,10.81.4.72:7100 attempt: 0
I don't know why this mongod received so many connections; the chunk distribution shows the sharding is working well.
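A hedged sketch of how the connection load on that mongod could be broken down by client host (the threshold-free grouping below is illustrative, not from the original post):
// On the overloaded mongod (10.81.4.72:7100):
db.serverStatus().connections   // shows { current: ..., available: ... }
// Count connections per client host, including idle ones.
var byClient = {};
db.currentOp(true).inprog.forEach(function (op) {
    if (op.client) {
        var host = op.client.split(":")[0];
        byClient[host] = (byClient[host] || 0) + 1;
    }
});
printjson(byClient);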