Let's say I have a three-server setup. Two servers store data, and one server is an arbiter.
Last week, my 'PRIMARY' server went down, and as expected the 'SECONDARY' was promoted and things continued working.
However, I'm now debugging another bug in my application that I think might be related to this change in the replication setup.
Is there any way I can find out (from the logs or whatnot) WHEN exactly the election occurred?
You can find the following lines in the logs of the new 'PRIMARY':
2018-08-02T03:56:49.817+0000 I REPL [ReplicationExecutor] Standing for election
2018-08-02T03:56:49.818+0000 I REPL [ReplicationExecutor] not electing self, ip-xxx-xxx-xxx-xxx.eu-west-1.compute.internal:27017 has same OpTime as us: { : Timestamp 1533182831000|1 }
2018-08-02T03:56:49.818+0000 I REPL [ReplicationExecutor] possible election tie; sleeping 445ms until 2018-08-02T03:56:50.263+0000
2018-08-02T03:56:50.263+0000 I REPL [ReplicationExecutor] Standing for election
2018-08-02T03:56:50.265+0000 I REPL [ReplicationExecutor] not electing self, ip-xxx-xxx-xxx-xxx.eu-west-1.compute.internal:27017 has same OpTime as us: { : Timestamp 1533182831000|1 }
2018-08-02T03:56:50.265+0000 I REPL [ReplicationExecutor] running for election; slept last election, so running regardless of possible tie
2018-08-02T03:56:50.265+0000 I REPL [ReplicationExecutor] received vote: 1 votes from ip-xxx-xxx-xxx-xxx.eu-west-1.compute.internal:27017
2018-08-02T03:56:50.265+0000 I REPL [ReplicationExecutor] election succeeded, assuming primary role
2018-08-02T03:56:50.265+0000 I REPL [ReplicationExecutor] transition to PRIMARY
You can see that the election took place at 03:56:50 UTC.
I advise you to use the less tool to search your log file:
less /var/log/mongodb/mongod.log
Then navigate to the end of the file with G, search backward with ?, and look for 'Standing for election'.
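Alternatively, grep can pull the relevant lines straight out of the log; the path below is the default one, so adjust it if your deployment logs elsewhere:
grep -E 'Standing for election|election succeeded|transition to PRIMARY' /var/log/mongodb/mongod.log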
I am trying to deploy an automated, highly available PostgreSQL cluster on Kubernetes. In case of a master failover or a temporary failure of the master, the standby loses its streaming replication connection, and when it retries, it takes a long time for the attempt to fail and be retried.
I use PostgreSQL 10 and streaming replication (cluster-main-cluster-master-service is a service that always routes to the master, and all the replicas connect to this service for replication). I've tried setting options like connect_timeout and keepalives in primary_conninfo in the standby's recovery.conf, and wal_receiver_timeout in the standby's postgresql.conf, but I could not make any progress with them.
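For reference, the standby settings being described would look roughly like the following sketch; the user name and the exact keepalive and timeout values are placeholders, not taken from the original configuration:
# recovery.conf on the standby
standby_mode = 'on'
primary_conninfo = 'host=cluster-main-cluster-master-service port=5432 user=replicator connect_timeout=10 keepalives=1 keepalives_idle=5 keepalives_interval=2 keepalives_count=3'
# postgresql.conf on the standby
wal_receiver_timeout = 30s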
When the master first goes down, replication stops with the following error (state 1):
2019-10-06 14:14:54.042 +0330 [3039] LOG: replication terminated by primary server
2019-10-06 14:14:54.042 +0330 [3039] DETAIL: End of WAL reached on timeline 17 at 0/33000098.
2019-10-06 14:14:54.042 +0330 [3039] FATAL: could not send end-of-streaming message to primary: no COPY in progress
2019-10-06 14:14:55.534 +0330 [12] LOG: record with incorrect prev-link 0/2D000028 at 0/33000098
After investigating Postgres activity I found out that the walreceiver process gets stuck in the LibPQWalReceiverConnect wait_event (state 2), but the timeout is way longer than what I configured (although I set connect_timeout to 10 seconds, it takes about 2 minutes). Then it fails with the following error (state 3):
2019-10-06 14:17:06.035 +0330 [3264] FATAL: could not connect to the primary server: could not connect to server: Connection timed out
Is the server running on host "cluster-main-cluster-master-service" (192.168.0.166) and accepting
TCP/IP connections on port 5432?
On the next try, it successfully connects to the primary (state 4):
2019-10-06 14:17:07.892 +0330 [5786] LOG: started streaming WAL from primary at 0/33000000 on timeline 17
I also tried killing the process when it gets stuck (state 2), and when I do, the process is started again, connects, and then streams normally (jumps to state 4).
After checking netstat, I also found that the walreceiver process has a connection in the SYN_SENT state to the old master (in the failover case).
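For what it's worth, the stuck walreceiver and its wait event (state 2 above) can be inspected with a query along these lines; this is an assumed example relying on PostgreSQL 10's pg_stat_activity, which exposes backend_type and wait_event:
psql -c "SELECT pid, backend_type, wait_event_type, wait_event FROM pg_stat_activity WHERE backend_type = 'walreceiver';"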
connect_timeout governs how long PostgreSQL will wait for the replication connection to succeed, but that does not include establishing the TCP connection.
To reduce the time that the kernel waits for a successful answer to a TCP SYN request, reduce the number of retries. In /etc/sysctl.conf, set:
net.ipv4.tcp_syn_retries = 3
and run sysctl -p.
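To apply the setting without a reboot, it can also be set directly at runtime; a minimal sketch (the persistent entry in /etc/sysctl.conf is still what survives reboots):
sudo sysctl -w net.ipv4.tcp_syn_retries=3
sudo sysctl -p    # reloads /etc/sysctl.conf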
That should reduce the time significantly.
Reducing the value too much might make your system less stable.
Below is the issue:
--> Let's say NODE2 is restored using an LVM snapshot backup of the primary.
--> If mongodb is running on the primary, let's say NODE1, and NODE2 is then added as a member of the replica set, the below error is thrown:
REPL [ReplicationExecutor] Error in heartbeat request to NODE2; InvalidReplicaSetConfig: Our replica set configuration is invalid or does not include us
and the primary NODE1 starts an initial sync with the secondary, despite the fact that data is already present in NODE2's data directory and NODE2 is in the STARTUP phase.
---> If mongodb is running on the primary, let's say NODE1, we shut down NODE1 using the db.shutdownServer({force:true}) command.
Then we start NODE1 and add NODE2 as a secondary. It becomes SECONDARY and the initial sync is avoided, but the below error is thrown:
2018-01-19T12:57:50.229-0500 I NETWORK [thread1] connection accepted from NODE2:60570 #49 (5 connections now open)
2018-01-19T12:57:50.229-0500 I REPL [ReplicationExecutor] Error in heartbeat request to NODE2:27021; InvalidReplicaSetConfig: Our replica set configuration is invalid or does not include us
2018-01-19T12:57:50.230-0500 I REPL [ReplicationExecutor] Error in heartbeat request to NODE2:27021; InvalidReplicaSetConfig: Our replica set configuration is invalid or does not include us
2018-01-19T12:57:50.231-0500 I - [conn49] end connection NODE2:60570 (6 connections now open)
2018-01-19T12:57:50.231-0500 I NETWORK [thread1] connection accepted from NODE2:60574 #50 (5 connections now open)
2018-01-19T12:57:50.231-0500 I REPL [ReplicationExecutor] Error in heartbeat request to NODE2:27021; InvalidReplicaSetConfig: Our replica set configuration is invalid or does not include us
2018-01-19T12:57:50.233-0500 I - [conn50] end connection NODE2:60574 (6 connections now open)
2018-01-19T12:57:50.236-0500 I NETWORK [thread1] connection accepted from NODE2 #51 (5 connections now open)
2018-01-19T12:57:50.237-0500 I - [conn51] end connection NODE2:60578 (6 connections now open)
2018-01-19T12:57:52.232-0500 I REPL [ReplicationExecutor] Member NODE2:27021 is now in state SECONDARY
The question is why this error always comes up in mongo, and what can be done to avoid it:
REPL [ReplicationExecutor] Error in heartbeat request to NODE2; InvalidReplicaSetConfig: Our replica set configuration is invalid or does not include us
Is there a way I can add a node as a secondary from an LVM snapshot backup, without running the shutdownServer command and without an initial sync?
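For context, the seeding procedure described above amounts to restoring the snapshot into the new member's data directory before adding it to the set; a rough sketch, where the paths, config file, and exact commands are placeholders rather than details from the original post:
# on NODE2, with mongod stopped
rsync -a /mnt/lvm-snapshot/mongodb/ /var/lib/mongodb/    # restore the data files from the primary's snapshot
mongod --config /etc/mongod.conf                         # start NODE2 with the same replSetName as the set
# on the primary, NODE1
mongo --eval 'rs.add("NODE2:27021")'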
In my mongodb replica set, I found that one of the secondary nodes is down, and when I checked the db.log, I found this:
I REPL [rsBackgroundSync] repl: old cursor isDead, will initiate a new one
I REPL [ReplicationExecutor] syncing from: primary-node-ip:portNum
I REPL [SyncSourceFeedback] replset setting syncSourceFeedback to primary-node-ip:portNum
I REPL [rsBackgroundSync] replSet our last op time fetched: Nov 25 05:41:01:85
I REPL [rsBackgroundSync] replset source's GTE: Nov 25 05:41:02:1
F REPL [rsBackgroundSync] replSet need to rollback, but in inconsistent state
I REPL [rsBackgroundSync] minvalid: 5a1891f0:187 our last optime: 5a1891ed:28
I - [rsBackgroundSync] Fatal Assertion 18750
I - [rsBackgroundSync]
***aborting after fassert() failure
I googled, but couldn't really find any page that explains this 18750 fatal assertion clearly.
The MongoDB version is 3.0.
That particular assertion can be traced back to the MongoDB 3.0 series, which matches the version you're using.
Particularly, the cause of the assertion is printed in the logs you posted:
F REPL [rsBackgroundSync] replSet need to rollback, but in inconsistent state
This message was printed by this part of the source code: https://github.com/mongodb/mongo/blob/v3.0/src/mongo/db/repl/rs_rollback.cpp#L837-L841
What that message means is that the node needs to perform a rollback, but it discovered that it cannot do so because it is in an inconsistent state, so no rollback can be performed.
One possible cause of this issue is an unreliable network connection between the replica set and the application, or between the replica set nodes themselves, although the exact cause may differ from one deployment to another.
Please see Rollbacks During Replica Set Failover for more information regarding rollbacks.
Unfortunately, there's not much that can be done in this case except resyncing the asserting node. Please see Resync a Member of a Replica Set for details on how to do so.
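As a rough outline, the resync amounts to wiping the asserting member's data directory and letting it perform a fresh initial sync; a minimal sketch, assuming a default dbPath and a service-managed mongod (adjust paths, user, and commands to your deployment):
sudo service mongod stop                                   # stop the asserting member
sudo mv /var/lib/mongo /var/lib/mongo.bak                  # move its dbPath aside (or delete its contents)
sudo mkdir /var/lib/mongo && sudo chown mongod:mongod /var/lib/mongo
sudo service mongod start                                  # on restart it rejoins the set and runs an initial sync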
I'm trying to establish a replica set with three members (one of which is an arbiter). For technical reasons the members must access each other using SSH tunnelling. I am reasonably confident this is configured correctly, as on all the mongodb hosts I am able to connect to the other nodes using mongo by providing the relevant --host and --port parameters. When I initiate the replica set on what I'd like to be the primary, the logs show the "initiator" successfully connecting to the other members:
REPL [ReplicationExecutor] transition to RECOVERING
REPL [ReplicationExecutor] transition to SECONDARY
REPL [ReplicationExecutor] Member 10.x.x.1:27017 is now in state STARTUP
REPL [ReplicationExecutor] Member 10.x.x.2:27017 is now in state STARTUP
REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
However, the other members refuse to vote, as they don't have any configuration for the replica set:
REPL [ReplicationExecutor] VoteRequester: Got error processing response with status: BadValue: Unexpected field info in ReplSetRequestVotes, resp:{ info: "run rs.initiate(...) if not yet done for the set", ok: 0.0, errmsg: "no replset config has been received", code: 94 }
REPL [ReplicationExecutor] VoteRequester: Got no vote from 10.16.10.4:30000 because: , resp:{ info: "run rs.initiate(...) if not yet done for the set", ok: 0.0, errmsg: "no replset config has been received", code: 94 }
REPL [ReplicationExecutor] VoteRequester: Got error processing response with status: BadValue: Unexpected field info in ReplSetRequestVotes, resp:{ info: "run rs.initiate(...) if not yet done for the set", ok: 0.0, errmsg: "no replset config has been received", code: 94 }
REPL [ReplicationExecutor] VoteRequester: Got no vote from 10.16.10.4:27018 because: , resp:{ info: "run rs.initiate(...) if not yet done for the set", ok: 0.0, errmsg: "no replset config has been received", code: 94 }
REPL [ReplicationExecutor] not running for primary, we received insufficient votes
This process repeats every electionTimeoutMillis.
Running rs.status() on the initiator of the replica set gives a suspicious time for the last heartbeat received from each member:
> rs.status()
...
"lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z")
...
I'm not sure what's cause and effect here. Will a member of the replica set only receive the configuration after the "initiator" has received the heartbeat response? Is there a way to force the initiator to send the configuration to the other members?
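For reference, the configuration normally reaches the other members through the heartbeats that follow an rs.initiate() run on a single node with an explicit member list, roughly like the sketch below; the set name and addresses are placeholders, not those of the original deployment:
mongo --port 27017 --eval 'rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "10.x.x.0:27017" },
    { _id: 1, host: "10.x.x.1:27017" },
    { _id: 2, host: "10.x.x.2:27017", arbiterOnly: true }
  ]
})'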
I have a single-node replica set with auth activated, a root user, and a keyFile I've created following this tutorial. I also have two more mongod processes on the same server on different ports (37017 and 47017) with the same replSet name, but when I try to add the secondaries in the mongo shell connected to the PRIMARY with rs.add("172.31.48.41:37017"), I get:
{
"ok" : 0,
"errmsg" : "Quorum check failed because not enough voting nodes responded; required 2 but only the following 1 voting nodes responded: 172.31.48.41:27017; the following nodes did not respond affirmatively: 172.31.48.41:37017 failed with Failed attempt to connect to 172.31.48.41:37017; couldn't connect to server 172.31.48.41:37017 (172.31.48.41), connection attempt failed",
"code" : 74
}
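(For reference, each of the three mongod processes described above is presumably started with something along these lines; the dbpath, keyFile path, and set name are placeholders, with only the port differing per process:)
mongod --port 37017 --dbpath /data/rs-37017 --replSet rs0 --auth --keyFile /opt/mongodb/keyfile --fork --logpath /var/log/mongodb/mongod-37017.log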
Then I went to the mongod process log of the PRIMARY and found out this:
2015-05-19T20:53:59.848-0400 I REPL [conn51] replSetReconfig admin command received from client
2015-05-19T20:53:59.848-0400 W NETWORK [conn51] Failed to connect to 172.31.48.41:37017, reason: errno:13 Permission denied
2015-05-19T20:53:59.848-0400 I REPL [conn51] replSetReconfig config object with 2 members parses ok
2015-05-19T20:53:59.849-0400 W NETWORK [ReplExecNetThread-0] Failed to connect to 172.31.48.41:37017, reason: errno:13 Permission denied
2015-05-19T20:53:59.849-0400 W REPL [ReplicationExecutor] Failed to complete heartbeat request to 172.31.48.41:37017; Location18915 Failed attempt to connect to 172.31.48.41:37017; couldn't connect to server 172.31.48.41:37017 (172.31.48.41), connection attempt failed
2015-05-19T20:53:59.849-0400 E REPL [conn51] replSetReconfig failed; NodeNotFound Quorum check failed because not enough voting nodes responded; required 2 but only the following 1 voting nodes responded: 172.31.48.41:27017; the following nodes did not respond affirmatively: 172.31.48.41:37017 failed with Failed attempt to connect to 172.31.48.41:37017; couldn't connect to server 172.31.48.41:37017 (172.31.48.41), connection attempt failed
And the log of the mongod that should become SECONDARY shows nothing; the last two lines are:
2015-05-19T20:48:36.584-0400 I REPL [initandlisten] Did not find local replica set configuration document at startup; NoMatchingDocument Did not find replica set configuration document in local.system.replset
2015-05-19T20:48:36.591-0400 I NETWORK [initandlisten] waiting for connections on port 37017
It's clear that I cannot run rs.initiate() on this node, because it would vote for itself to become PRIMARY and that would create a conflict, so the line that states "Did not find local replica set configuration document at startup" is to be ignored, as far as I know.
So I would think that the permissions should be OK, since I'm using the same key file in every mongod process and the replSet name is the same in every config file, and that's all the tutorial says is needed, but obviously something is missing.
Any ideas? Is this a bug?
If you are using EC2 instances and have only port 27017 open in the security group for both instances, just add the secondary instance's port as well. It worked for me.
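If that is the situation, a hypothetical AWS CLI call for opening the additional port might look like this (the security group ID and CIDR are placeholders):
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 37017 --cidr 172.31.0.0/16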