Strange thing about mongodb-erlang driver when using replica set - mongodb

My code is like this:
Replset = {<<"rs1">>, [{localhost, 27017}, {localhost, 27018}, {localhost, 27019}]},
Conn_Pool = resource_pool:new(mongo:rs_connect_factory(Replset), 10),
...
Conn = resource_pool:get(Conn_Pool),
case mongo:do(safe, master, Conn, ?DATABASE,
              fun() ->
                  mongo:insert(mytable, {'_id', 26, d, 11})
              end) of
...
27017 is the primary node, so of course I can insert the data successfully.
But when I put only one secondary node in the code instead of all of the replica set members: Replset = {<<"rs1">>, [{localhost, 27019}]}, I can still insert the data.
I expected an exception or an error, but the write succeeded.
Why does that happen?

When you connect to a replica set, you specify the name of the replSet and some of the node names as seeds. The driver connects to the seed nodes in turn and discovers the real replica set membership/config/status via the isMaster command.
Since it discovers which node is the primary that way, it is able to route all your write requests accordingly. The same technique is what enables it to fail over automatically to the newly elected primary when the original primary fails and a new one is elected.
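For illustration, this is roughly what that discovery data looks like if you run db.isMaster() in the mongo shell against the single secondary seed from the question (a trimmed sketch of the reply; the real output contains more fields):

db.isMaster()
// {
//     "setName" : "rs1",
//     "ismaster" : false,
//     "secondary" : true,
//     "hosts" : [ "localhost:27017", "localhost:27018", "localhost:27019" ],
//     "primary" : "localhost:27017",
//     ...
// }

Because the reply from any reachable member names the primary, a single secondary seed is enough for the driver to find localhost:27017 and send the insert there, which is why the write still succeeds.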

Related

MongoDB Resync Failure

We have a sharded cluster with 4 shards in a PSA architecture. The overall DB size is around 5 TB. The secondary of one of the shards failed, so we started a resync from the primary.
We are facing an issue when trying to resync data from the primary to the secondary.
MongoDB Version: 4.0.18
DataSize for that Shard: 571 GB
Oplog Size: Default
Error Message:
2020-10-06T08:57:57.165+0530 I REPL [replication-339] We are too stale to use host:port as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1601947649, 446) is before their earliest timestamp: Timestamp(1601951946, 330) for 1min until: 2020-10-06T08:58:57.165+0530
You need to do an initial sync on the stale node. See https://docs.mongodb.com/manual/core/replica-set-sync/#initial-sync.
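As a rough check before (re)starting the sync (a diagnostic sketch, not part of the linked procedure), you can compare the boundaries of the sync source's oplog with the "last fetched timestamp" from the error message in the mongo shell:

var oplog = db.getSiblingDB("local").oplog.rs;
// oldest entry in the sync source's oplog; if its ts is newer than the stale member's
// last fetched timestamp, the member cannot catch up incrementally and needs an initial sync
printjson(oplog.find({}, { ts: 1 }).sort({ $natural: 1 }).limit(1).next());
// newest entry, for reference
printjson(oplog.find({}, { ts: 1 }).sort({ $natural: -1 }).limit(1).next());

If writes during the sync exceed what the default-sized oplog can hold, the member falls off the oplog again, and only a fresh initial sync (ideally with a larger oplog) can recover it.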

mongoimport loading only 1000 rows on sharding

I have a mongo sharding setup configured like this:
6 config servers
3 shard servers (with replicas)
6 routers
For example:
s1->s2 (first shard with replica: primary s1, secondary s2)
s3->s4 (second shard with replica: primary s3, secondary s4)
s5->s6 (third shard with replica: primary s5, secondary s6)
A config server and a router run on every server, i.e. s1 to s6.
I am not able to import data into one of the empty sharded collections; the data is in CSV format.
I am running mongoimport in the background, and the nohup output shows:
2017-01-10T17:13:18.444+0530 [........................] dbname.collectionname 364.0 KB/46.1 MB (0.8%)
mongoimport is stuck. How do I fix this?
I first tried to run mongoimport on s2 without success, then tried it on s1, also without success.
The following are the errors from the router and config server logs:
HostnameCanonicalizationWorker
[rsBackgroundSync] we are too stale to use **** as a
sync source
REPL [ReplicationExecutor] could not find member to sync from
REPL [ReplicationExecutor] The liveness timeout does not match callback handle, so not resetting it.
REPL [rsBackgroundSync] too stale to catch up -- entering maintenance mode
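Those rsBackgroundSync messages indicate a member that has fallen off its sync source's oplog and entered maintenance mode. A quick way to see each member's replication state from the mongo shell (a diagnostic sketch, not from the original post):

rs.status().members.forEach(function (m) {
    // prints e.g. "s4:27017  RECOVERING" for a member that entered maintenance mode
    print(m.name + "  " + m.stateStr);
});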

Data loss due to unexpected failover of MongoDB replica set

So I encountered the following issue recently:
I have a 5-member replica set (priority in parentheses):
1 x primary (2)
2 x secondary (0.5)
1 x hidden backup (0)
1 x arbiter (0)
One of the secondary replicas with 0.5 priority (let's call it B) encountered some network issues and had intermittent connectivity with the rest of the replica set. However, despite having staler data and a lower priority than the existing primary (let's call it A), it assumed the primary role:
[ReplicationExecutor] VoteRequester: Got no vote from xxx because: candidate's data is staler than mine, resp:{ term: 29, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
[ReplicationExecutor] election succeeded, assuming primary role in term 29
[ReplicationExecutor] transition to PRIMARY
And for A, despite not having any connection issues with the rest of the replica set:
[ReplicationExecutor] stepping down from primary, because a new term has begun: 29
So Question 1 is, how could this have been possible given the circumstances?
Moving on, A (now a secondary) began rolling back data:
[rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 28, timestamp: xxx). source's GTE: (term: 29, timestamp: xxx) hashes: (xxx/xxx)
[rsBackgroundSync] beginning rollback
[rsBackgroundSync] rollback 0
[ReplicationExecutor] transition to ROLLBACK
This caused data that had already been written to be removed. So Question 2 is: how does an OplogStart go missing?
Last but not least, Question 3, how can this be prevented?
Thank you in advance!
Are you using version 3.2.x with protocolVersion=1? (You can check it with the rs.conf() command.) There is a known "bug" in the voting for that combination.
You can prevent this bug by (choose one or both):
changing protocolVersion to 0:
cfg = rs.conf();
cfg.protocolVersion=0;
rs.reconfig(cfg);
changing all priorities to the same value, for example:
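A minimal sketch of that second option, assuming the member order from rs.conf() and leaving the hidden member (which must stay at priority 0) and the arbiter untouched:

cfg = rs.conf();
// give the three electable, data-bearing members equal priority
cfg.members[0].priority = 1;
cfg.members[1].priority = 1;
cfg.members[2].priority = 1;
rs.reconfig(cfg);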
EDIT:
These are the tickets that explain it, more or less:
Ticket 1
Ticket 2

MongoS sharding metadata manager failed asking for instance is manually reset

My MongoS servers are not starting; they log this error:
SHARDING [Balancer] caught exception while doing balance: Server's
sharding metadata manager failed to initialize and will remain in
this state until the instance is manually reset :: caused by ::
HostNotFound: unable to resolve DNS for host confserv_1.xyz.com
2016-05-02T17:57:06.612+0530 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "DB2255-2016-05-02T17:57:06.611+0530-5727479aa1051c5fb04fcc49", server: "mongoS1", clientAddr: "", time: new Date(1462192026611), what: "balancer.round", ns: "", details: { executionTimeMillis: 35, errorOccured: true, errmsg: "Server's sharding metadata manager failed to initialize and will remain in this state until the instance is manually reset :: caused by :: HostNotFoun..." } }
When I connect to the config server using its host name, it works fine.
I tried to restart the MongoS server, but it does not come up.
I checked the Mongo source code and found this error mentioned in
https://github.com/mongodb/mongo/blob/master/src/mongo/db/s/sharding_state.cpp
// TODO: remove after v3.4.
// This is for backwards compatibility with old style initialization through metadata
// commands/setShardVersion. As well as all assignments to _initializationStatus and
// _setInitializationState_inlock in this method.
if (_getInitializationState() == InitializationState::kInitializing) {
    auto waitStatus = _waitForInitialization_inlock(deadline, lk);
    if (!waitStatus.isOK()) {
        return waitStatus;
    }
}

if (_getInitializationState() == InitializationState::kError) {
    return {ErrorCodes::ManualInterventionRequired,
            str::stream() << "Server's sharding metadata manager failed to initialize and will "
                             "remain in this state until the instance is manually reset"
                          << causedBy(_initializationStatus)};
}
But it does not mention what manual intervention is required.
The current Mongo version is 3.2.6.
I just ran into this problem while trying to harden the security configuration. As in your case, I was able to connect to the config servers from all mongos instances.
In my case I was also testing a setup with members of the replica sets in different datacenters, and I had the problem only after stepping down some primaries.
I eventually noticed that, contrary to what the error message suggests, the issue was occurring on some primaries in one datacenter, which were not able to route back to the config server. After fixing the routing problem (via /etc/hosts in the end), no more problems occurred on the Mongo side.
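A quick way to reproduce that check from the mongo shell on a suspect member (a sketch; the config server host and port are placeholders taken from the question) is to open a direct connection and see whether name resolution and routing work from that machine:

try {
    var conn = new Mongo("confserv_1.xyz.com:27019");  // placeholder host:port
    print("connected to " + conn.host);
} catch (e) {
    print("cannot reach config server: " + e);
}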

SocketException in Mongo

I just set up a replica set in Mongo (prod environment). I'm now getting a lot of exceptions like the one below (clipped).
I went into the mongo shell and ran a serverStatus command on my primary node; it only has about 300 connections, so it is hardly under any load.
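For reference, that check looks roughly like this in the mongo shell (numbers are illustrative):

db.serverStatus().connections
// { "current" : 300, "available" : 51100, "totalCreated" : NumberLong(4230) }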
Below are my connection option settings in my server code:
auto_connect_retry = false
connections_per_host = 10
threads_multiplier = 10
max_wait_time = 120000
connect_timeout = 10000
socket_timeout = 0
Do I have something misconfigured?
Sep 9, 2013 8:31:26 PM com.mongodb.DBPortPool gotError
WARNING: emptying DBPortPool to /10.0.8.10:27017 b/c of error
java.net.SocketException: Connection timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:146)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at org.bson.io.Bits.readFully(Bits.java:46)
at org.bson.io.Bits.readFully(Bits.java:33)
at org.bson.io.Bits.readFully(Bits.java:28)
at com.mongodb.Response.<init>(Response.java:40)
at com.mongodb.DBPort.go(DBPort.java:142)
at com.mongodb.DBPort.call(DBPort.java:92)
at com.mongodb.DBTCPConnector.innerCall(DBTCPConnector.java:244)
at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:216)
at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:288)
at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:273)
at com.mongodb.DBCollection.findOne(DBCollection.java:347)
at com.mongodb.DBCollection.findOne(DBCollection.java:332)
at com.mongodb.casbah.MongoCollectionBase$class.findOneByID(MongoCollection.scala:232)
at com.mongodb.casbah.MongoCollection.findOneByID(MongoCollection.scala:866)
at com.novus.salat.dao.SalatDAO.findOneById(SalatDAO.scala:353)
at com.novus.salat.dao.ModelCompanion$class.findOneById(ModelCompanion.scala:173)
Generally, a connection timeout in a replica set comes from one of the following:
1) The members are not able to communicate with each other
2) A program is sending writes to the replica set but cannot reach the primary, because of overload or because of 1) as well
3) The replicas are not in sync and one is lagging too far behind
4) A primary election is in progress but has not completed for some reason
Please check that your replica set is consistent and all nodes are working by issuing rs.status() on the primary node (for example, as sketched below), and, as suggested earlier, check the primary's logs for more information.
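A compact version of that rs.status() check, printing each member's name, state, and how far it has applied the oplog (a sketch; the fields shown are as reported by rs.status()):

rs.status().members.forEach(function (m) {
    // stateStr is PRIMARY/SECONDARY/RECOVERING/..., optimeDate shows replication progress
    print(m.name + "  " + m.stateStr + "  optime: " + m.optimeDate);
});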