LVM snapshot restored node is added as a member with an invalid config error - MongoDB

Below is the issue:
--> Let's say NODE2 is restored using an LVM snapshot backup of the primary.
--> If mongod is running on the primary, let's say NODE1, and NODE2 is then added as a member of the replica set, the error below is thrown:
REPL [ReplicationExecutor] Error in heartbeat request to NODE2; InvalidReplicaSetConfig: Our replica set configuration is invalid or does not include us
and the primary NODE1 starts an initial sync with the secondary, despite the fact that the data is already present in NODE2's data directory and NODE2 is in the STARTUP phase.
--> If mongod is running on the primary, let's say NODE1, and we shut down NODE1 using the db.shutdownServer({force:true}) command, then start NODE1 again and add NODE2 as a secondary, NODE2 becomes SECONDARY and the initial sync is avoided, but the error below is thrown:
2018-01-19T12:57:50.229-0500 I NETWORK [thread1] connection accepted from NODE2:60570 #49 (5 connections now open)
2018-01-19T12:57:50.229-0500 I REPL [ReplicationExecutor] Error in heartbeat request to NODE2:27021; InvalidReplicaSetConfig: Our replica set configuration is invalid or does not include us
2018-01-19T12:57:50.230-0500 I REPL [ReplicationExecutor] Error in heartbeat request to NODE2:27021; InvalidReplicaSetConfig: Our replica set configuration is invalid or does not include us
2018-01-19T12:57:50.231-0500 I - [conn49] end connection NODE2:60570 (6 connections now open)
2018-01-19T12:57:50.231-0500 I NETWORK [thread1] connection accepted from NODE2:60574 #50 (5 connections now open)
2018-01-19T12:57:50.231-0500 I REPL [ReplicationExecutor] Error in heartbeat request to NODE2:27021; InvalidReplicaSetConfig: Our replica set configuration is invalid or does not include us
2018-01-19T12:57:50.233-0500 I - [conn50] end connection NODE2:60574 (6 connections now open)
2018-01-19T12:57:50.236-0500 I NETWORK [thread1] connection accepted from NODE2 #51 (5 connections now open)
2018-01-19T12:57:50.237-0500 I - [conn51] end connection NODE2:60578 (6 connections now open)
2018-01-19T12:57:52.232-0500 I REPL [ReplicationExecutor] Member NODE2:27021 is now in state SECONDARY
The question is why this error always comes up in MongoDB, and what can be done to avoid it:
REPL [ReplicationExecutor] Error in heartbeat request to NODE2; InvalidReplicaSetConfig: Our replica set configuration is invalid or does not include us
Is there a way I can add a node as a secondary from an LVM snapshot backup without running the shutdownServer command and without an initial sync?
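For what it's worth, here is a minimal sketch of the documented "seed a member from a file system snapshot" approach, assuming hypothetical paths, a replica set named rs0, and the ports from the log above; the snapshot must be recent enough that the primary's oplog still covers the gap since it was taken:

# On NODE2: place the data restored from the LVM snapshot into the (as yet unused) dbpath
rsync -a /mnt/lvm-restore/ /data/db/
mongod --replSet rs0 --port 27021 --dbpath /data/db --fork --logpath /data/db/mongod.log

# On NODE1 (the running primary): add the seeded node
mongo --port 27017 --eval 'rs.add("NODE2:27021")'

The transient heartbeat error is NODE2 answering heartbeats while the configuration it copied from the snapshot does not yet list NODE2 itself; as the second log above shows, it clears within a couple of seconds once the updated configuration reaches NODE2 and it reports SECONDARY.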

Related

MongoDB secondary replica set sync failure with Primary (in the reporting replica)

In the MongoDB secondary replica (reporting) I am getting a sync failure and also the error messages below after about 30 hours of the traffic run:
2020-07-15T01:50:29.987+0000 W EXECUTOR [conn408277] Terminating session due to error: InternalError: failed to create service entry worker thread
2020-07-15T01:50:31.656+0000 I - [listener] pthread_create failed: Resource temporarily unavailable
Can anyone help with how to debug and fix this issue, please?
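This is not an answer, but a hedged starting point: "pthread_create failed: Resource temporarily unavailable" usually means the mongod process has hit an operating-system limit on threads/processes (or is out of memory), so checking the limits that apply to the running process is a reasonable first step. A sketch, assuming Linux and a single mongod:

# Limits actually in effect for the running mongod (not necessarily the shell's limits)
grep -i -e 'processes' -e 'open files' /proc/$(pidof mongod)/limits

# Number of threads the process is currently using
ls /proc/$(pidof mongod)/task | wc -l

If the thread count is close to the "Max processes" value, raising the mongod user's nproc limit (for example in /etc/security/limits.conf, or via LimitNPROC in the systemd unit) and restarting is the usual remedy.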

Postgresql WalReceiver process waits on connecting master regardless of "connect_timeout"

I am trying to deploy an automated, highly available PostgreSQL cluster on Kubernetes. In cases of master failover or temporary failures on the master, the standby loses the streaming replication connection, and when retrying, it takes a long time for the attempt to fail and be retried.
I use PostgreSQL 10 and streaming replication (cluster-main-cluster-master-service is a service that always routes to the master, and all the replicas connect to this service for replication). I've tried setting options like connect_timeout and keepalives in primary_conninfo in recovery.conf, and wal_receiver_timeout in postgresql.conf on the standby, but I could not make any progress with them.
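For context, a sketch of the standby settings described above, assuming a hypothetical replication user and that $PGDATA points at the standby's data directory (on PostgreSQL 10 these settings live in recovery.conf):

# A primary_conninfo with an explicit connect_timeout and TCP keepalives
cat >> "$PGDATA/recovery.conf" <<'EOF'
primary_conninfo = 'host=cluster-main-cluster-master-service port=5432 user=replicator connect_timeout=10 keepalives=1 keepalives_idle=5 keepalives_interval=2 keepalives_count=3'
EOF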
Initially, when the master goes down, replication stops with the following error (state 1):
2019-10-06 14:14:54.042 +0330 [3039] LOG: replication terminated by primary server
2019-10-06 14:14:54.042 +0330 [3039] DETAIL: End of WAL reached on timeline 17 at 0/33000098.
2019-10-06 14:14:54.042 +0330 [3039] FATAL: could not send end-of-streaming message to primary: no COPY in progress
2019-10-06 14:14:55.534 +0330 [12] LOG: record with incorrect prev-link 0/2D000028 at 0/33000098
After investigating Postgres activity, I found that the WalReceiver process gets stuck in the LibPQWalReceiverConnect wait_event (state 2), but the timeout is far longer than what I configured (although I set connect_timeout to 10 seconds, it takes about 2 minutes). Then it fails with the following error (state 3):
2019-10-06 14:17:06.035 +0330 [3264] FATAL: could not connect to the primary server: could not connect to server: Connection timed out
Is the server running on host "cluster-main-cluster-master-service" (192.168.0.166) and accepting
TCP/IP connections on port 5432?
On the next try, it successfully connects to the primary (state 4):
2019-10-06 14:17:07.892 +0330 [5786] LOG: started streaming WAL from primary at 0/33000000 on timeline 17
I also tried killing the process when the stuck event occurs (state 2); when I do, the process starts again, connects, and then streams normally (jumps to state 4).
After checking netstat, I also found that the walreceiver process has a connection in the SYN_SENT state to the old master (in the failover case).
connect_timeout governs how long PostgreSQL will wait for the replication connection to succeed, but that does not include establishing the TCP connection.
To reduce the time that the kernel waits for a successful answer to a TCP SYN request, reduce the number of retries. In /etc/sysctl.conf, set:
net.ipv4.tcp_syn_retries = 3
and run sysctl -p.
That should reduce the time significantly.
Reducing the value too much might make your system less stable.
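A minimal sketch of applying that suggestion (run as root; the value 3 is the one given above):

# Persist the setting and reload kernel parameters
echo 'net.ipv4.tcp_syn_retries = 3' >> /etc/sysctl.conf
sysctl -p

# Verify the active value
sysctl net.ipv4.tcp_syn_retries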

MongoDB Replica Set election fails with "no replset config has been received" response from peers

I'm trying to establish a replica set with three members (one of which is an arbiter). For technical reasons the members must access each other through SSH tunnelling. I am reasonably confident this is configured correctly, as on each MongoDB host I am able to connect to the other nodes with the mongo shell by providing the relevant --host and --port parameters. When I initiate the replica set on what I'd like to be the primary, the logs show the "initiator" successfully connecting to the other members:
REPL [ReplicationExecutor] transition to RECOVERING
REPL [ReplicationExecutor] transition to SECONDARY
REPL [ReplicationExecutor] Member 10.x.x.1:27017 is now in state STARTUP
REPL [ReplicationExecutor] Member 10.x.x.2:27017 is now in state STARTUP
REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
However, the other members refuse to vote, as they don't have any configuration for the replica set:
REPL [ReplicationExecutor] VoteRequester: Got error processing response with status: BadValue: Unexpected field info in ReplSetRequestVotes, resp:{ info: "run rs.initiate(...) if not yet done for the set", ok: 0.0, errmsg: "no replset config has been received", code: 94 }
REPL [ReplicationExecutor] VoteRequester: Got no vote from 10.16.10.4:30000 because: , resp:{ info: "run rs.initiate(...) if not yet done for the set", ok: 0.0, errmsg: "no replset config has been received", code: 94 }
REPL [ReplicationExecutor] VoteRequester: Got error processing response with status: BadValue: Unexpected field info in ReplSetRequestVotes, resp:{ info: "run rs.initiate(...) if not yet done for the set", ok: 0.0, errmsg: "no replset config has been received", code: 94 }
REPL [ReplicationExecutor] VoteRequester: Got no vote from 10.16.10.4:27018 because: , resp:{ info: "run rs.initiate(...) if not yet done for the set", ok: 0.0, errmsg: "no replset config has been received", code: 94 }
REPL [ReplicationExecutor] not running for primary, we received insufficient votes
This process repeats every electionTimeoutMillis.
Running rs.status() on the initiator of the replica set gives a suspicious time for the last heartbeat received from each member:
> rs.status()
...
"lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z")
...
I'm not sure what's cause and effect here. Will a member of the replica set only receive the configuration after the "initiator" has received the heartbeat response? Is there a way to force the initiator to send the configuration to the other members?
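Not a definitive answer, but the configuration is propagated to the other members via heartbeats, and that can only happen once they are reachable at exactly the host:port values listed in the config; with SSH tunnelling, every member must be able to reach every other member at those same addresses. The usual pattern is to pass the full member list to rs.initiate() on one node only. A sketch, with the set name, hosts, and ports as assumptions:

# Run once, on the intended primary only
mongo --port 27017 <<'EOF'
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "10.x.x.0:27017" },
    { _id: 1, host: "10.x.x.1:27017" },
    { _id: 2, host: "10.x.x.2:27017", arbiterOnly: true }
  ]
})
rs.status()
EOF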

MongoDB stops when I try to make a connection

Mongodb stops whenever I try to make a connection.
When I run
sudo service mongod start
I get a message that mongodb is running.
But then when I try to make a connection using PyMongo I get an error that says
Autoreconnect: connection closed
I check my mongodb status:
sudo service mongod status
And it says that my mongodb instance is stopped/waiting.
My mongo log file reports the following:
2015-09-17T18:19:46.816+0000 I NETWORK [initandlisten] waiting for connections on port 7000
2015-09-17T18:19:58.813+0000 I NETWORK [initandlisten] connection accepted from 54.152.111.120:51387 #1 (1 connection now open)
2015-09-17T18:19:58.816+0000 I STORAGE [conn1] _getOpenFile() invalid file index requested 4
2015-09-17T18:19:58.816+0000 I - [conn1] Invariant failure false src/mongo/db/storage/mmap_v1/mmap_v1_extent_manager.cpp 201
2015-09-17T18:19:58.837+0000 I CONTROL [conn1]
This is followed by a lengthy backtrace that I can't decipher, then closes with:
2015-09-17T18:19:58.837+0000 I - [conn1]
***aborting after invariant() failure
I've looked around SO, particularly trying the top two answers here, but haven't been able to figure out how to solve the problem.
I'll also note that the last time I tried to connect, last week, it was working fine.
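Not a definitive fix, but the invariant failure comes from the MMAPv1 extent manager asking for a data file that is missing or invalid, which points at damaged data files. One commonly suggested step is a repair after first copying the data directory aside; the dbpath and user below are assumptions, and --repair can discard data it cannot recover, hence the backup:

# Stop the service and back up the data files first
sudo service mongod stop
sudo cp -a /var/lib/mongodb /var/lib/mongodb.bak

# Attempt a repair as the mongod user, then restart
sudo -u mongodb mongod --dbpath /var/lib/mongodb --repair
sudo service mongod start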

mongodb keyFile between replicas throws Permission denied

I have a single-node replica set with auth enabled, a root user, and a keyFile I created following this tutorial. I also have two more mongod processes on the same server on different ports (37017 and 47017) with the same replSet name, but when I try to add the secondaries from a mongo shell connected to the PRIMARY with rs.add("172.31.48.41:37017"), I get:
{
"ok" : 0,
"errmsg" : "Quorum check failed because not enough voting nodes responded; required 2 but only the following 1 voting nodes responded: 172.31.48.41:27017; the following nodes did not respond affirmatively: 172.31.48.41:37017 failed with Failed attempt to connect to 172.31.48.41:37017; couldn't connect to server 172.31.48.41:37017 (172.31.48.41), connection attempt failed",
"code" : 74
}
Then I went to the mongod process log of the PRIMARY and found this:
2015-05-19T20:53:59.848-0400 I REPL [conn51] replSetReconfig admin command received from client
2015-05-19T20:53:59.848-0400 W NETWORK [conn51] Failed to connect to 172.31.48.41:37017, reason: errno:13 Permission denied
2015-05-19T20:53:59.848-0400 I REPL [conn51] replSetReconfig config object with 2 members parses ok
2015-05-19T20:53:59.849-0400 W NETWORK [ReplExecNetThread-0] Failed to connect to 172.31.48.41:37017, reason: errno:13 Permission denied
2015-05-19T20:53:59.849-0400 W REPL [ReplicationExecutor] Failed to complete heartbeat request to 172.31.48.41:37017; Location18915 Failed attempt to connect to 172.31.48.41:37017; couldn't connect to server 172.31.48.41:37017 (172.31.48.41), connection attempt failed
2015-05-19T20:53:59.849-0400 E REPL [conn51] replSetReconfig failed; NodeNotFound Quorum check failed because not enough voting nodes responded; required 2 but only the following 1 voting nodes responded: 172.31.48.41:27017; the following nodes did not respond affirmatively: 172.31.48.41:37017 failed with Failed attempt to connect to 172.31.48.41:37017; couldn't connect to server 172.31.48.41:37017 (172.31.48.41), connection attempt failed
The log of the mongod that should become SECONDARY shows nothing relevant; the last two lines are:
2015-05-19T20:48:36.584-0400 I REPL [initandlisten] Did not find local replica set configuration document at startup; NoMatchingDocument Did not find replica set configuration document in local.system.replset
2015-05-19T20:48:36.591-0400 I NETWORK [initandlisten] waiting for connections on port 37017
It's clear that I cannot run rs.initiate() on this node, because it would vote for itself to become PRIMARY and that would create a conflict, so the line stating "Did not find local replica set configuration document at startup" is to be ignored, as far as I know.
So I would think the permissions should be OK, since I'm using the same keyFile for every mongod process and the replSet name is the same in every config file, which is all the tutorial says is needed, but obviously something is missing.
Any ideas? Is this a bug?
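As a quick way to test the keyFile hypothesis raised above: MongoDB requires the keyFile contents to be identical for every member and the file to be readable only by the user running mongod. A sketch, with the path and user as assumptions:

# Same checksum for the keyFile used by every mongod process
md5sum /etc/mongodb/keyfile

# Permissions must exclude group/other, and the owner must be the mongod user
ls -l /etc/mongodb/keyfile
chmod 400 /etc/mongodb/keyfile
chown mongodb:mongodb /etc/mongodb/keyfile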
If you are using EC2 instances and have only opened port 27017 in the security group for both instances, just add the secondary instance's port as well. It worked for me.
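A sketch of what that answer amounts to with the AWS CLI; the security group ID and CIDR below are placeholders:

# Open the non-default mongod ports in the instances' security group
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 37017 --cidr 172.31.0.0/16
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 47017 --cidr 172.31.0.0/16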