SocketException in Mongo - mongodb

I just set up a replica set in Mongo (prod environment). I'm now getting a lot of exceptions like below (clipped).
I went into mongo and ran a serverStatus command on my primary mongo node and only have about 300 connections going, so it's hardly working.
Below are my connection option settings in my server code:
auto_connect_retry = false
connections_per_host = 10
threads_multiplier = 10
max_wait_time = 120000
connect_timeout = 10000
socket_timeout = 0
Do I have something mis-configured?
Sep 9, 2013 8:31:26 PM com.mongodb.DBPortPool gotError
WARNING: emptying DBPortPool to /10.0.8.10:27017 b/c of error
java.net.SocketException: Connection timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:146)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at org.bson.io.Bits.readFully(Bits.java:46)
at org.bson.io.Bits.readFully(Bits.java:33)
at org.bson.io.Bits.readFully(Bits.java:28)
at com.mongodb.Response.<init>(Response.java:40)
at com.mongodb.DBPort.go(DBPort.java:142)
at com.mongodb.DBPort.call(DBPort.java:92)
at com.mongodb.DBTCPConnector.innerCall(DBTCPConnector.java:244)
at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:216)
at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:288)
at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:273)
at com.mongodb.DBCollection.findOne(DBCollection.java:347)
at com.mongodb.DBCollection.findOne(DBCollection.java:332)
at com.mongodb.casbah.MongoCollectionBase$class.findOneByID(MongoCollection.scala:232)
at com.mongodb.casbah.MongoCollection.findOneByID(MongoCollection.scala:866)
at com.novus.salat.dao.SalatDAO.findOneById(SalatDAO.scala:353)
at com.novus.salat.dao.ModelCompanion$class.findOneById(ModelCompanion.scala:173)

Generally a connection timeout occurs from one of the following in a replica set
1) All members are not able to communicate with each other
2) A program is connecting to replica for update and it is unable to send it to primary due to overload or 1st as well
3) All relicas are not in sync and one is lagging behind too much
4) Leader election is going on but not completed due to some reason
Please check if your relica set is consistent and all nodes are working by issuing rs.status() on primary node , also as earlier suggested check primary logs for more information

Related

Postgres crashes when selecting from view

I have a view in Postgres with the following definition:
CREATE VIEW participant_data_view AS
SELECT participant_profile.*,
"user".public_id, "user".created, "user".status, "user".region,"user".name, "user".email, "user".locale,
(SELECT COUNT(id) FROM message_log WHERE message_log.target_id = "user".id AND message_log.type = 'diary') AS diary_reminder_count,
(SELECT SUM(pills) FROM "order" WHERE "order".user_id = "user".id AND "order".status = 'complete') AS pills
FROM participant_profile
JOIN "user" ON "user".id = participant_profile.id
;
The view creation works just fine. However, when I query the view SELECT * FROM participant_data_view, postgres crashes with
10:24:46.345 WARN HikariPool-1 - Connection org.postgresql.jdbc.PgConnection#172d19fe marked as broken because of SQLSTATE(08006), ErrorCode(0) c.z.h.p.ProxyConnection
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
this question suggests to me that it might be an internal assertion that causes it to crash.
If I remove the diary_reminder_count field from the view definition, the select works just fine.
What am I doing wrong? How can I fix my view, or change it so I can query the same data in a different way?
Note that creating the view works just fine, it only crashes when querying it.
I tried running explain (analyze) select * from participant_data_view; from the IntelliJ query console, which only returns
[2020-12-08 11:13:56] [08006] An I/O error occurred while sending to the backend.
[2020-12-08 11:13:56] java.io.EOFException
I ran the same using psql, there it returns
my-database=# explain (analyze) select * from participant_data_view;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Looking at the log files, it contains:
2020-12-08 10:24:01.383 CET [111] LOG: server process (PID 89670) was terminated by signal 9: Killed: 9
2020-12-08 10:24:01.383 CET [111] DETAIL: Failed process was running: select "public"."participant_data_view"."id", "public"."participant_data_view"."study_number", <snip many other fields>,
"public"."participant_data_view"."diary_reminder_count", "public"."participant
2020-12-08 10:24:01.383 CET [111] LOG: terminating any other active server processes
In all likelihood, the Linux kernel out-of-memory killer killed your query because the system ran out of memory.
Either restrict the number of database sessions (for example with a connection pool) or reduce work_mem.
It is usually a good idea to set vm.overcommit_memory = 2 in the kernel and tune vm.overcommit_ratio appropriately.

org.postgresql.util.PSQLException: This connection has been closed. error for long running transactions

We are getting "org.postgresql.util.PSQLException: This connection has been closed." on one of our deployments for only long running transactions (more than a few minutes):
Caused by: org.hibernate.TransactionException: rollback failed
at org.hibernate.engine.transaction.spi.AbstractTransactionImpl.rollback(AbstractTransactionImpl.java:217)
at org.springframework.orm.hibernate4.HibernateTransactionManager.doRollback(HibernateTransactionManager.java:604)
... 87 more
Caused by: org.hibernate.TransactionException: unable to rollback against JDBC connection
at org.hibernate.engine.transaction.internal.jdbc.JdbcTransaction.doRollback(JdbcTransaction.java:167)
at org.hibernate.engine.transaction.spi.AbstractTransactionImpl.rollback(AbstractTransactionImpl.java:211)
... 88 more
Caused by: org.postgresql.util.PSQLException: This connection has been closed.
at org.postgresql.jdbc2.AbstractJdbc2Connection.checkClosed(AbstractJdbc2Connection.java:822)
at org.postgresql.jdbc2.AbstractJdbc2Connection.rollback(AbstractJdbc2Connection.java:839)
at org.apache.commons.dbcp2.DelegatingConnection.rollback(DelegatingConnection.java:492)
at org.apache.commons.dbcp2.DelegatingConnection.rollback(DelegatingConnection.java:492)
at org.hibernate.engine.transaction.internal.jdbc.JdbcTransaction.doRollback(JdbcTransaction.java:163)
... 89 more
Our stack is as follows:
Postgresql 9.2 (on db server Ubuntu 16.03)
PgBouncer (on application server Ubuntu 16.03)
Jars (on application server Ubuntu 16.03)
org.postgresql:postgresql:9.2-1004-jdbc41
javax.transaction:jta:1.1
org.apache.commons:commons-pool2:2.4.2
org.apache.commons:commons-dbcp2:2.1.1'
Postgresql and Pgbouncer use the default parameters and we use the following parameters for dbcp:
database-initial-size = 2
database-max-total = 200
database-validation-query = SELECT 1
database-test-on-borrow = true
database-test-while-idle = true
database-max-wait-millis = 3000
database-time-between-eviction-runs-millis = 34000
database-min-evictable-idle-time-millis = 55000
We have other deployments with same parameters but we are not having the same problem there.
I suspect there is a Firewall/Nat which resets the connection after some timeout but I don't know how to check if this is the case. I would appreciate very much if you can guide me about what logs/parameters/configurations to check which may cause this exception.
I have tested that if Postgresql and PgBouncer are on the same server this problem does not occur.
I have also investigated the Postgresql logs and no error messages are logged.

Data loss due to unexpected failover of MongoDB replica set

So I encountered the following issue recently:
I have a 5-member set replica set (priority)
1 x primary (2)
2 x secondary (0.5)
1 x hidden backup (0)
1 x arbiter (0)
One of the secondary replicas with 0.5 priority (let's call it B) encountered some network issue and had intermittent connectivity with the rest of the replica set. However, despite having staler data and a lower priority than the existing primary (let's call it A) it assumed primary role:
[ReplicationExecutor] VoteRequester: Got no vote from xxx because: candidate's data is staler than mine, resp:{ term: 29, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
[ReplicationExecutor] election succeeded, assuming primary role in term 29
[ReplicationExecutor] transition to PRIMARY
And for A, despite not having any connection issues with the rest of the replica set:
[ReplicationExecutor] stepping down from primary, because a new term has begun: 29
So Question 1 is, how could this have been possible given the circumstances?
Moving on, A (now a secondary) began rolling back data:
[rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 28, timestamp: xxx). source's GTE: (term: 29, timestamp: xxx) hashes: (xxx/xxx)
[rsBackgroundSync] beginning rollback
[rsBackgroundSync] rollback 0
[ReplicationExecutor] transition to ROLLBACK
This caused data which was written to be removed. So Question 2 is: How does an OplogStart go missing?
Last but not least, Question 3, how can this be prevented?
Thank you in advance!
You are using version 3.2.x and protocolVersion=1 (you can check it with rs.conf() -command)? Because there is "bug" on voting.
You can prevent this bug by (choose one or both):
change protocolVersion to 0.
cfg = rs.conf();
cfg.protocolVersion=0;
rs.reconfig(cfg);
change all priorities to same value
EDIT:
These are tickets what explain.. More or less..
Ticket 1
Ticket 2

MongoS sharding metadata manager failed asking for instance is manually reset

My MongoS servers are not staring they are sending this error in logs.
SHARDING [Balancer] caught exception while doing balance: Server's
sharding metadata manager failed to initialize and will remain in
this state until the instance is manually reset :: caused by ::
HostNotFound: unable to resolve DNS for host confserv_1.xyz.com
2016-05-02T17:57:06.612+0530 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "DB2255-2016-05-02T17:57:06.611+0530-5727479aa1051c5fb04fcc49", server: "mongoS1", clientAddr: "", time: new Date(1462192026611), what: "balancer.round", ns: "", details: { executionTimeMillis: 35, errorOccured: true, errmsg: "Server's sharding metadata manager failed to initialize and will remain in this state until the instance is manually reset :: caused by :: HostNotFoun..." } }
When I connect config server using host name it is working fine.
I tried to restart MongoS server it is not coming up.
I check Mongo code and found this error mentioned in
https://github.com/mongodb/mongo/blob/master/src/mongo/db/s/sharding_state.cpp
/ TODO: remove after v3.4.
// This is for backwards compatibility with old style initialization through metadata
// commands/setShardVersion. As well as all assignments to _initializationStatus and
// _setInitializationState_inlock in this method.
if (_getInitializationState() == InitializationState::kInitializing) {
auto waitStatus = _waitForInitialization_inlock(deadline, lk);
if (!waitStatus.isOK()) {
return waitStatus;
}
}
if (_getInitializationState() == InitializationState::kError) {
return {ErrorCodes::ManualInterventionRequired,
str::stream() << "Server's sharding metadata manager failed to initialize and will "
"remain in this state until the instance is manually reset"
<< causedBy(_initializationStatus)};
}
But it does not mention anything what manual intervention is required.
Current Mongo version is 3.2.6
I just ran into this problem while trying to harden the security configuration. As in your case, I was able to connect to the config servers from all mongos instances.
In my case I was also testing a case with members of replica sets being in different datacenters, and I had the problem only after steppingDown some primaries.
I noticed at the end that, not as the error message is pretending, the issue was happening on some primaries of one datacenter, who were not able to route back to the config server. After fixing the routing problem (/etc/hosts eventually), no more problems occurred on the mongo side.

Mongo CursorNotFound exception in active cursor via Grails domain criteria

I'm using Grails 2.4.4, mongo plugin 3.0.2, MongoDB 2.4.10 using a remote database connection.
grails {
mongo {
host = "11.12.13.14" // A remote server IP
port = 27017
databaseName = "blogger"
username = "blog"
password = "xyz"
options {
autoConnectRetry = true
connectTimeout = 3000
connectionsPerHost = 40
socketTimeout = 120000
threadsAllowedToBlockForConnectionMultiplier = 5
maxAutoConnectRetryTime=5
maxWaitTime=120000
}
}
}
In a part of our application, a service method iterates over a 20,000 user's and sends them an email:
Person.withCriteria { // Line 323
eq("active", true)
order("dateJoined", "asc")
}.each { personInstance ->
// Code to send an email which takes an average of 1 second
}
After executing this for some 6000 user, I'm getting a MongoDB cursor exception:
2015-04-11 07:31:14,218 [quartzScheduler_Worker-1] ERROR listeners.ExceptionPrinterJobListener - Exception occurred in job: Grails Job
org.quartz.JobExecutionException: com.mongodb.MongoException$CursorNotFound: Cursor 1337814790631604331 not found on server 11.12.13.14:27017 [See nested exception: com.mongodb.MongoException$CursorNotFound: Cursor 1337814790631604331 not found on server 11.12.13.14:27017]
at grails.plugins.quartz.GrailsJobFactory$GrailsJob.execute(GrailsJobFactory.java:111)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: com.mongodb.MongoException$CursorNotFound: Cursor 1337814790631604331 not found on server 11.12.13.14:27017
at com.mongodb.QueryResultIterator.throwOnQueryFailure(QueryResultIterator.java:218)
at com.mongodb.QueryResultIterator.init(QueryResultIterator.java:198)
at com.mongodb.QueryResultIterator.initFromQueryResponse(QueryResultIterator.java:176)
at com.mongodb.QueryResultIterator.getMore(QueryResultIterator.java:141)
at com.mongodb.QueryResultIterator.hasNext(QueryResultIterator.java:127)
at com.mongodb.DBCursor._hasNext(DBCursor.java:551)
at com.mongodb.DBCursor.hasNext(DBCursor.java:571)
at org.grails.datastore.mapping.mongo.query.MongoQuery$MongoResultList$1.hasNext(MongoQuery.java:1893)
at com.test.person.PersonService.sendWeeklyEmail(PersonService.groovy:323)
at com.test.WeeklyJob.execute(WeeklyJob.groovy:41)
at grails.plugins.quartz.GrailsJobFactory$GrailsJob.execute(GrailsJobFactory.java:104)
... 2 more
I looked for documentation and found that cursor automatically get closed in 20 minutes and when I confirmed it with the logs, this exception came exactly after 20 minutes.
But this behaviour of auto-close in 20 minutes is applicable for inactive cursor but here the cursor is active.
UPDATE:
I read some articles and found that, this could be an issue of TCP keepalive timeout. So we changed the TCP keepalive timeout to 2 minutes from default 2 hours, but it still doesn't solve the problem.
Looks like a compatibility issue with MonoDB on this server. Read more on the Jira for details https://jira.mongodb.org/browse/SERVER-18439
Hope this helps to someone else!