Problems with clustered MongoDB 4.x multi-document transactions

I have a problem with my MongoDB cluster when it comes to using multi-document transactions.
I have a MongoDB replica set of 5 servers spread across three data centers: two servers in the first data center, two in the second, and one (an arbiter) in the third. At any given time one of the servers is primary and the three others are secondary replicas.
I wrote an application in Java using Spring Boot, and I make heavy use of multi-document transactions in Mongo.
Everything works fine when all the DB servers are up. But when I tried to test high availability by taking down one of the data centers, I ran into strange problems. My application started to hang on every transaction (I have to wait about a minute, and then I get a timeout from the database), but it still works fine when transactions are not used :-(.
Below is the exception I get in my application:
2020-07-06 16:58:51.748 ERROR 6 --- [0.0-5555-exec-4] o.a.c.c.C.[.[.[/].[dispatcherServlet] : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.springframework.data.mongodb.MongoTransactionException: Query failed with error code 251 and error message 'Encountered non-retryable error during query :: caused by :: Transaction 4 has been aborted.' on server mongos-dc2.fake.domain.com:27017; nested exception is com.mongodb.MongoQueryException: Query failed with error code 251 and error message 'Encountered non-retryable error during query :: caused by :: Transaction 4 has been aborted.' on server mongos-dc2.fake.domain.com:27017]
What could be the reason for this? Could you tell me what I should do to fix this behaviour?
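For reference, the transactional code in the application looks roughly like the minimal sketch below (class, bean and method names here are illustrative, not the real application code; on older Spring Data versions the transaction manager takes a MongoDbFactory instead of a MongoDatabaseFactory):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.mongodb.MongoDatabaseFactory;
import org.springframework.data.mongodb.MongoTransactionManager;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

class Order { String id; }        // illustrative domain class
class AuditEntry { String id; }   // illustrative domain class

@Configuration
class TransactionConfig {
    // Registering a MongoTransactionManager is what enables @Transactional for MongoDB.
    @Bean
    MongoTransactionManager transactionManager(MongoDatabaseFactory dbFactory) {
        return new MongoTransactionManager(dbFactory);
    }
}

@Service
class OrderService {
    private final MongoTemplate mongoTemplate;

    OrderService(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    // Both inserts commit or abort together as one multi-document transaction.
    @Transactional
    public void placeOrder(Order order, AuditEntry audit) {
        mongoTemplate.insert(order);
        mongoTemplate.insert(audit);
    }
}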

I figured out that the guilty party is the Mongo arbiter.
https://docs.mongodb.com/manual/tutorial/add-replica-set-arbiter/
As I described, four of my Mongo servers are ordinary replicas (two in the first data center and two in the second), but the last one (in the third data center) is an arbiter, whose only role is to vote when a new primary has to be elected.
I changed that node's type from arbiter to a normal replica and everything started to work as expected :/
That still doesn't answer my question, because I have no idea why something that was supposed to work, an official Mongo feature, simply failed. But that is probably a question for the MongoDB team rather than a topic for Stack Overflow.
Nevertheless, I hope this helps someone.

Related

ADF Dataflow stuck in progress and fails with the errors below

The ADF pipeline Data Flow task is stuck in progress. It was working seamlessly for the last couple of months, but suddenly the data flow gets stuck in progress and times out after a certain time. We are using a managed Virtual Network IR. I am using a ForEach loop to run the data flow for multiple entities in parallel, and it always gets stuck randomly on the last entity.
What can I try to resolve this?
Error in Dev environment:
Error code 4508: Spark cluster not found
Error in Prod environment:
Error code: 5000
Failure type: User configuration issue
Details: [plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
I tried the steps below:
Changing the IR configuration
Trying the Data Flow retry and retry interval settings
Running the ForEach loop one batch at a time instead of 4 batches in parallel
None of the above troubleshooting steps worked. These pipelines have been running for the last 3-4 months without a single failure; suddenly they started failing consistently over the last 3 days. The data flow always gets stuck in progress randomly on a different entity and eventually times out, throwing the errors above.
Error code 4508: Spark cluster not found.
This error can occur for two reasons:
The debug session was closed before the data flow finished its transformation; in that case the recommendation is to restart the debug session.
The second reason is a resource problem, or an outage in that particular region.
Error code 5000, failure type "User configuration issue", details: [plugins.*** ADF.adf-ir-001 WorkspaceType: CCID:] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
This is a temporary error that says "Livy job state dead caused by unknown error." A Spark cluster is used at the back end of the data flow, and this error is generated by that Spark cluster. To get more information about the error, check the StdOut of the Spark pool execution.
The backend cluster may be experiencing a network problem, a resource problem, or an outage.
If the error persists, my suggestion is to raise a Microsoft support ticket.

MongoDB clients stopped connecting to secondary nodes of the replicaset

Our application stopped connecting to the secondary members of the MongoDB replica set. We have the read preference set to secondary:
mongodb://db0.example.com,db1.example.com,db2.example.com/?replicaSet=myRepl&readPreference=secondary&maxStalenessSeconds=120
Connections always go to the primary, overloading the primary node. This issue started after patching and a restart of the servers.
I tried mongo shell connectivity using the above connection string, and the command gets abruptly terminated. I can see the process for that connection on the server in ps -ef | grep mongo.
Has anyone faced this issue? Any troubleshooting tips are appreciated. The logs aren't showing anything related to the terminated/stopped connection process.
We were able to fix the issue. It was a problem on the Spring Boot side. Once the right bean was injected (we have two beans: one for primary connections and one for secondary connections), the connection was established to the secondary node for heavy reading and reporting purposes.
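For anyone hitting the same problem, the two-bean setup mentioned above looks roughly like the sketch below (a hedged sketch: the database name "mydb", the bean names and the client construction are illustrative assumptions; the connection strings reuse the hosts from the question, and older Spring Data versions take com.mongodb.MongoClient instead):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Primary;
import org.springframework.data.mongodb.core.MongoTemplate;

@Configuration
class MongoReadConfig {

    // Default template: reads and writes go to the primary.
    @Bean
    @Primary
    MongoTemplate primaryMongoTemplate() {
        MongoClient client = MongoClients.create(
                "mongodb://db0.example.com,db1.example.com,db2.example.com/?replicaSet=myRepl");
        return new MongoTemplate(client, "mydb");
    }

    // Reporting template: pinned to secondaries via the read preference in the connection string.
    @Bean("secondaryMongoTemplate")
    MongoTemplate secondaryMongoTemplate() {
        MongoClient client = MongoClients.create(
                "mongodb://db0.example.com,db1.example.com,db2.example.com/?replicaSet=myRepl"
                        + "&readPreference=secondary&maxStalenessSeconds=120");
        return new MongoTemplate(client, "mydb");
    }
}

The reporting code then injects the secondary template explicitly, for example with @Qualifier("secondaryMongoTemplate"), so only the heavy read paths use it.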

How to handle a socket error / connection closed exception while loading huge amounts of data using the Spring Data MongoDB repository saveAll()

I'm using Spring Data MongoDB and loading a huge amount of data. While loading, we sometimes see:
Closed connection [connectionId{localValue:10, serverValue:111926}] to *.gcp.mongodb.net:27017 because there was a socket exception raised by this connection
Here's what we're doing:
Running a simple app in some 20 pods inside GCP
All apps call MongoRepository.saveAll({huge list})
A scheduler is invoked to read the data (domain objects) from an instance variable and store it in the DB
We did not specify any batch size, assuming Spring Data MongoDB will take care of it for us
These documents have ids, so basically we're doing upserts
No two pods will have documents with duplicate ids; the app takes care of that
Most of the pods run successfully, but at least 2 or 3 pods seem to hit the above connection issue
Our connection string in application.yml looks like below:
data:
  mongodb:
    uri: mongodb+srv://${MONGODB_USER}:${MONGODB_SECRET}@*.gcp.mongodb.net/my-db-name?retryWrites=true&w=majority&minPoolSize=0&maxPoolSize=100&maxIdleTimeMS=300000
I'm looking for some input on how to handle this exception. My questions are:
Is it only the current connection that has the issue, or all connections in the pool?
If we encounter this issue for the current scheduled job, can we safely ignore it, assuming that the next scheduled job will get a different connection and will update the DB successfully?
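To make the batch-size point above concrete: instead of relying on a default batch size, the list can be chunked before each saveAll call, so a single socket error only affects one small batch that can be retried. This is a minimal sketch under stated assumptions; MyDocument, MyDocumentRepository and the batch size of 1000 are illustrative and not part of the original application:

import java.util.List;
import org.springframework.data.annotation.Id;
import org.springframework.data.mongodb.repository.MongoRepository;

class MyDocument { @Id String id; }  // illustrative document with an id, so saveAll upserts

interface MyDocumentRepository extends MongoRepository<MyDocument, String> { }  // illustrative repository

class BatchedLoader {
    private static final int BATCH_SIZE = 1_000;  // illustrative value

    private final MyDocumentRepository repository;

    BatchedLoader(MyDocumentRepository repository) {
        this.repository = repository;
    }

    // Upserts the documents in fixed-size chunks instead of one huge saveAll call.
    void saveInBatches(List<MyDocument> documents) {
        for (int i = 0; i < documents.size(); i += BATCH_SIZE) {
            List<MyDocument> chunk =
                    documents.subList(i, Math.min(i + BATCH_SIZE, documents.size()));
            repository.saveAll(chunk);
        }
    }
}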

Why is the config server string order sensitive when launching mongos?

Thanks in advance for your time.
For a given sharded setup, mongos is launched while specifying the config server(s) to talk to. Say we start with the following mongos option:
--configdb=cf1,cf2,cf3
Everything is all fine and dandy. If you were to relaunch mongos (or launch a different mongos) with:
--configdb=cf3,cf2,cf1
It results in the following error:
Tue Jul 9 23:32:41 uncaught exception: error: { "$err" : "could not initialize sharding on connection rs1/db1.me.net:27017,db2.me.net:27017,db3.me.net:27017, :: caused by :: mongos specified a different config database string : stored :cfg1:27017,cfg2:27017,cfg3:27017 vs given :cfg3:27017,cfg2:27017,cfg1:27017","code" : 15907}
My question is: what is the reason mongos is sensitive to the order of the config server string? I would imagine that at some point it parses the individual server hostnames/ports, so why not just compare the set? I know you can see from the source code that it's just a string comparison, but my question is about the underlying reason for this.
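To make the question concrete, here is a purely illustrative snippet (not the actual mongos source code) contrasting the ordered string check mongos effectively performs with the set comparison asked about above:

import java.util.Arrays;
import java.util.HashSet;

class ConfigStringCheck {
    public static void main(String[] args) {
        String stored = "cfg1:27017,cfg2:27017,cfg3:27017";
        String given = "cfg3:27017,cfg2:27017,cfg1:27017";

        // Ordered comparison, which is effectively what mongos does: fails, producing error 15907.
        System.out.println(stored.equals(given));  // false

        // Order-insensitive comparison of the same hosts: would succeed.
        System.out.println(new HashSet<>(Arrays.asList(stored.split(",")))
                .equals(new HashSet<>(Arrays.asList(given.split(",")))));  // true
    }
}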
Some context for this problem: I am using Chef for my Mongo deployments. We recently went through the exercise of migrating a config server while keeping the same hostname. However, this still ended up being a disruptive process, because the order in which Chef picked up the config servers had changed, thus changing the order mongos starts with. I understand that this issue is directly caused by Chef's behaviour, but I am curious as to why Mongo is not more flexible here.
Thanks again for your time.
When a mongos process changes metadata for the sharded cluster, it has to change it in all three config servers "simultaneously" (i.e. all three must agree in order to have a valid metadata change).
If the system were to go down in the middle of such a metadata change, and the config database order were not fixed, there would be many more possible permutations of incorrect states to unwind. Requiring a fixed sequence of config DBs allows (a) simpler checking of whether all members of the cluster are viewing the same metadata and (b) a significant reduction in the possible states when a system crashes or otherwise stops unexpectedly.
In addition, it reduces the chances of "race condition" sorts of bugs if different mongos processes could initiate the same operations on different config servers. Even for as simple a change as a mongos process taking a "virtual" distributed lock to see if a change is necessary - how could you handle the case of different mongos processes checking the config servers in a different order to check on (and take out) the lock?
In summary, the three config servers are not a replica set, but one of them still has to be the one that always accepts changes "first" - think of the order of config DBs passed to mongos as the designation of that "first" status.

MongoDB writeback exception

Our MongoDB cluster in production is a sharded cluster with 3 replica sets of 3 servers each and, of course, another 3 config servers.
We also have 14 web servers that connect directly to MongoDB through the mongos processes that run on each of these web servers (clients).
The entire cluster receives 5000 inserts per minute.
Sometimes we start getting exceptions from our Java applications when they try to perform operations against MongoDB.
This is the stack trace:
caused by com.mongodb.MongoException: writeback
com.mongodb.CommandResult.getException(CommandResult.java:100)
com.mongodb.CommandResult.throwOnError(CommandResult.java:134)
com.mongodb.DBTCPConnector._checkWriteError(DBTCPConnector.java:142)
com.mongodb.DBTCPConnector.say(DBTCPConnector.java:183)
com.mongodb.DBTCPConnector.say(DBTCPConnector.java:155)
com.mongodb.DBApiLayer$MyCollection.insert(DBApiLayer.java:270)
com.mongodb.DBApiLayer$MyCollection.insert(DBApiLayer.java:226)
com.mongodb.DBCollection.insert(DBCollection.java:147)
com.mongodb.DBCollection.insert(DBCollection.java:90)
com.mongodb.DBCollection$insert$0.call(Unknown Source)
If I check the mongos process through the REST _status command that it provides, it returns 200 OK. We can work around the problem by restarting the Tomcat we are using and restarting the mongos process, but I would like to find a definitive solution. Having to restart everything in the middle of the night is not a happy solution.
When this error happens, maybe 2 or 3 other web servers get the same error at the same time, so I imagine there is a problem in the entire MongoDB cluster, not a problem in a single isolated web server.
Does anyone know why Mongo returns a writeback error, and how to fix it?
I'm using MongoDB 2.2.0.
Thanks in advance.
Fer
I believe you are seeing the writeback error "leak" into the getLastError output and then continue to be reported even when the operation in question has not errored. This was an issue in the earlier versions of MongoDB 2.2 and has since been fixed; see:
https://jira.mongodb.org/browse/SERVER-7958
https://jira.mongodb.org/browse/SERVER-7369
https://jira.mongodb.org/browse/SERVER-4532
As of writing this answer, I would recommend 2.2.4, but basically whatever the latest 2.2 release is, to resolve your problem.
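For reference, the write path in the stack trace above corresponds roughly to the legacy 2.x Java driver pattern sketched below (host, database and collection names are illustrative assumptions). With an acknowledged write concern the driver checks getLastError after each insert, which is where the leaked writeback error surfaces via throwOnError:

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;
import com.mongodb.WriteConcern;

class InsertExample {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017);     // each web server talks to its local mongos
        DB db = mongo.getDB("mydb");                     // illustrative database name
        DBCollection coll = db.getCollection("events");  // illustrative collection name

        // SAFE makes the driver call getLastError after each write; a "writeback" entry
        // leaking into that response is thrown as com.mongodb.MongoException.
        coll.setWriteConcern(WriteConcern.SAFE);
        coll.insert(new BasicDBObject("value", 42));
    }
}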