OrientDB 2.2.26 cluster setup - queries hang

OrientDB version - 2.2.26
Cluster - 3-node setup, readQuorum = 2, writeQuorum = "majority", ridBag.embeddedToSbtreeBonsaiThreshold = 2147483647
Nodes - CentOS 7.0, 24 cores and 96 GB RAM
Gremlin-Scala/TinkerPop APIs are used for querying and inserting.
This code works fine on a single-node setup.
The code checks for an existing vertex in the graph. If the vertex does not exist, the insert operations are batched and sent to the database within a transaction.
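The check-then-insert logic is essentially TinkerPop's "get or create" pattern, sketched here with the TinkerPop Java API rather than gremlin-scala (the vertex label and property names are made up for illustration):

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.addV;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.unfold;

public final class VertexUpsert {
    // Returns the vertex with the given serial, creating it only if absent.
    // "device" and "serial" are hypothetical label/property names.
    public static Vertex getOrCreate(GraphTraversalSource g, String serial) {
        return g.V().has("device", "serial", serial)
                .fold()
                .coalesce(unfold(), addV("device").property("serial", serial))
                .next();
    }
}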
I see the following warnings in the OrientDB log on all three nodes:
2017-09-15 16:37:31:025 WARNI [dev2] Timeout (852567ms) on waiting for synchronous responses from nodes=[dev1, dev3, dev2] responsesSoFar=[] request=(id=1.354 task=record_read(#65:22)) [ODistributedDatabaseImpl]
2017-09-15 16:52:18:239 WARNI [dev2] Timeout (1049042ms) on waiting for synchronous responses from nodes=[dev1, dev3, dev2] responsesSoFar=[] request=(id=1.568 task=record_read(#63:24)) [ODistributedDatabaseImpl]
2017-09-15 17:25:22:477 WARNI [dev2] Timeout (1984236ms) on waiting for synchronous responses from nodes=[dev1, dev3, dev2] responsesSoFar=[] request=(id=1.889 task=record_read(#63:24)) [ODistributedDatabaseImpl]
There is no problem with the network, and the firewall is disabled on all three nodes.
Are these logs related to the problem?
What else should I check to fix the problem?
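For reference, the ridBag threshold above is typically pinned at JVM startup; a hedged Java sketch (the enum constant name is assumed and should be verified against the 2.2.x build):

import com.orientechnologies.orient.core.config.OGlobalConfiguration;

public final class RidBagSettings {
    // Keeps RidBags embedded instead of converting them to SB-Tree Bonsai,
    // the usual recommendation for distributed setups on 2.2.x.
    public static void apply() {
        OGlobalConfiguration.RID_BAG_EMBEDDED_TO_SBTREEBONSAI_THRESHOLD
                .setValue(Integer.MAX_VALUE);
    }
}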

Related

Hikari CP connections are suddenly invalidated

Hi StackOverflow family,
We have an application with Kotlin & Spring Boot that uses a single PostgreSQL DB instance (1 GB memory, instance class db.t3.micro) hosted in AWS. For the last couple of days, connections in my pool are suddenly invalidated (2-3 times a day) and the pool size drops drastically. In summary:
Let's say everything is normal in Hikari: connections are closed and added according to maxLifetime, which is 30 minutes by default, and the logs look like below:
HikariPool-1 - Pool stats (total=40, active=0, idle=40, waiting=0)
HikariPool-1 - Fill pool skipped, pool is at sufficient level.
Suddenly, most of the connections (say 30 out of 40) become invalidated. The connections are closed before they reach their max lifetime, and the logs look like below for all closed connections:
HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection#5257d7b2 (This connection has been closed.). Possibly consider using a shorter maxLifetime value.
HikariPool-1 - Closing connection org.postgresql.jdbc.PgConnection#7b673105: (connection is dead)
Additionally, these messages are followed by multiple logs like the one below:
Add connection elided, waiting 6, queue 13
And the timeout failure stats look like below:
HikariPool-1 - Timeout failure stats (total=12, active=12, idle=0, waiting=51)
Finally, I am left with lots of connection timeouts because no connection was available for most of the requests:
java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms
I have added leak-detection-threshold, and it also logs the following while the problem is happening:
Connection leak detection triggered for org.postgresql.jdbc.PgConnection#3bb5f155 on thread http-nio-8080-exec-482, stack trace follows
java.lang.Exception: Apparent connection leak detected
The hikari config is like below:
hikari:
  data-source-properties: stringtype=unspecified
  maximum-pool-size: 40
  leak-detection-threshold: 30000
When this problem happens, queries in PostgreSQL also take a lot of time: 8-9 seconds, increasing up to 15-35 seconds, with some queries even at 55-65 seconds (queries that usually take 1-3 seconds at most). That is why we think it is not a query issue.
Some sources suggest using try-with-resources; however, that is not the case for us, as we do not obtain connections manually. Increasing the max pool size from 20 to 40 also did not help. I would really appreciate any comment or hint, as we have been dealing with this issue for almost a week.
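For completeness, the equivalent programmatic pool setup looks roughly like this (a minimal Java sketch; the JDBC URL is a placeholder):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public final class PoolSetup {
    // Mirrors the YAML above; maxLifetime stays at the 30-minute default.
    public static HikariDataSource buildPool() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://db-host:5432/appdb"); // placeholder URL
        config.addDataSourceProperty("stringtype", "unspecified");
        config.setMaximumPoolSize(40);
        config.setLeakDetectionThreshold(30_000); // 30 s
        return new HikariDataSource(config);
    }
}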

Backup restore fails on multi-user mode error

I have a script to automate restoring a database from a backup. My script first stops all appserver instances, stops all databases, then restores from a backup. Below is the pseudo-code:
foreach appserver:
    asbman -name (appserver) -stop
foreach database:
    dbman -name (database) -stop
proutil database.db -C enablelargefiles
echo y | prorest database.db backup.bak -verbose
Once my script reaches the prorest command, it outputs the following error:
** The database D:\Directory\Wrk\db\database is in use in multi-user mode. (276)
After waiting ~60 seconds, running the prorest command again executes as expected, and the database is restored correctly. My guess is that there are processes tied to the database that are still running after the database is stopped. I would like to find a solution to this problem without having to use methods such as a sleep-retry to determine when the database is capable of being restored. Is there a solution to this problem, or are there better methods for restoring a database in this way?
There are some timeouts that can come into play:
When an unconditional batch shutdown runs (PROSHUT -by), the following sequence of events takes place if there are any running processes left:
30 Seconds - wake up clients waiting for locks.
60 Seconds - wake up clients waiting for locks.
90 Seconds - wake up clients waiting on screen input.
5 Minutes - Resend the shutdown signal to remaining clients.
10 Minutes - Send a terminate (SIGTERM) signal to remaining clients.
More info here:
http://knowledgebase.progress.com/articles/Article/P3222
You can tail the database.lg file and look for the messages telling you that the database is shut down:
[2017/02/06#20:20:56.353+0100] P-14292 T-13420 I SHUT 5: (542) Server shutdown started by Jens on CON:.
[2017/02/06#20:20:56.499+0100] P-10276 T-11404 I BROKER 0: (15193) The normal shutdown of the database will continue for 10 Min 0 Sec if required.
[2017/02/06#20:20:56.499+0100] P-10276 T-11404 I BROKER 0: (2248) Begin normal shutdown
[2017/02/06#20:20:57.499+0100] P-10276 T-11404 I BROKER 0: (2263) Resending shutdown request to 0 user(s).
[2017/02/06#20:21:01.692+0100] P-10276 T-11404 I BROKER 0: (15109) At Database close the number of live transactions is 0.
[2017/02/06#20:21:01.692+0100] P-10276 T-11404 I BROKER 0: (15743) Before Image Log Completion at Block 1 Offset 5300.
[2017/02/06#20:21:01.693+0100] P-10276 T-11404 I BROKER 0: (453) Logout by Jens on CON:.
[2017/02/06#20:21:01.694+0100] P-10276 T-11404 I BROKER : (16869) Removed shared memory with segment_id: 50528256
[2017/02/06#20:21:01.694+0100] P-10276 T-11404 I BROKER : (334) Multi-user session end.
[2017/02/06#20:21:02.356+0100] P-14292 T-13420 I SHUT 5: (453) Logout by Jens on CON:.
The (334) message is basically telling you that the database is shut down.
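A minimal sketch of automating that wait before running prorest (the log path and timeout below are made up):

import java.io.RandomAccessFile;

public class WaitForDbShutdown {
    // Polls database.lg until the "(334) Multi-user session end." message
    // appears, then returns so the restore can proceed.
    public static void main(String[] args) throws Exception {
        String logPath = "D:/Directory/Wrk/db/database.lg";
        long deadline = System.currentTimeMillis() + 11 * 60 * 1000; // past the 10-min shutdown cap
        long offset = 0;
        while (System.currentTimeMillis() < deadline) {
            try (RandomAccessFile lg = new RandomAccessFile(logPath, "r")) {
                lg.seek(offset);
                String line;
                while ((line = lg.readLine()) != null) {
                    if (line.contains("(334)")) {
                        System.out.println("Database is shut down; safe to restore.");
                        return;
                    }
                }
                offset = lg.getFilePointer();
            }
            Thread.sleep(1000);
        }
        System.err.println("Timed out waiting for database shutdown.");
        System.exit(1);
    }
}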
Another option could be to check for the database lock file (database.lk). It's only there if the database is running:
...
2017-02-06 20:21 2 228 224 mySportsDb.b1
2017-02-06 20:21 1 703 936 mySportsDb.d1
2017-02-06 20:21 32 768 mySportsDb.db
2017-02-06 20:21 89 643 mySportsDb.lg
2017-02-06 18:00 920 mySportsDb.lic
2017-02-06 20:26 265 mySportsDb.lk
...
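A minimal sketch of that check (the path and poll interval are made up):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WaitForLockFile {
    // Waits until database.lk disappears; the .lk file exists only while the
    // database is running, so its removal signals that prorest is safe to run.
    public static void main(String[] args) throws InterruptedException {
        Path lk = Paths.get("D:/Directory/Wrk/db/database.lk");
        while (Files.exists(lk)) {
            Thread.sleep(1000);
        }
        System.out.println("Lock file gone; database is down.");
    }
}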
There are also a couple of scripts you can run to check the status of the database. See more here:
http://knowledgebase.progress.com/articles/Article/P136887

Mesos Kafka task failed memory limit

I am going to set up a Kafka cluster on Apache Mesos.
I followed the instructions at kafka-mesos on GitHub. I installed a Mesos cluster (using Mesosphere, without Marathon) with 3 nodes, each with 2 CPUs and 4 GB of memory. I tested the cluster with hello-world examples successfully.
I can run the kafka-mesos scheduler on it and can add brokers to it.
But when I want to start a broker, a memory limit issue appears:
broker-191-.... TASK_FAILED slave:#c3-S1 reason:REASON_MEMORY_LIMIT
Although the cluster has 12 GB of memory, the broker just needs 3 GB with a 1 GB heap. (I tested it with various configurations from 512 MB up to 3 GB, but none worked.)
What is the problem, and what is the solution?
The complete story is here:
2015-10-17 17:39:24,748 [Jetty-17] INFO ly.stealth.mesos.kafka.HttpServer$ - handling - http://192.168.11.191:7000/api/broker/start
2015-10-17 17:39:28,202 [Thread-605] INFO ly.stealth.mesos.kafka.Scheduler$ - [resourceOffers]
mesos-2#O1160 cpus:2.00 mem:4098.00 disk:9869.00 ports:[31000..32000]
mesos-3#O1161 cpus:2.00 mem:4098.00 disk:9869.00 ports:[31000..32000]
mesos-1#O1162 cpus:2.00 mem:4098.00 disk:9869.00 ports:[31000..32000]
2015-10-17 17:39:28,204 [Thread-605] INFO ly.stealth.mesos.kafka.Scheduler$ - Starting broker 191: launching task broker-191-0abe9e57-b0fb-4d87-a1b4-529acb111940 by offer mesos-2#O1160
broker-191-0abe9e57-b0fb-4d87-a1b4-529acb111940 slave:#c6-S3 cpus:1.00 mem:3096.00 ports:[31000..31000] data:defaults=broker.id\=191\,log.dirs\=kafka-logs\,port\=31000\,zookeeper.connect\=192.168.11.191:2181\\\,192.168.11.192:2181\\\,192.168.11.193:2181\,host.name\=mesos-2\,log.retention.bytes\=10737418240,broker={"stickiness" : {"period" : "10m"\, "stopTime" : "2015-10-17 13:43:29.278"}\, "id" : "191"\, "mem" : 3096\, "cpus" : 1.0\, "heap" : 1024\, "failover" : {"delay" : "1m"\, "maxDelay" : "10m"}\, "active" : true}
2015-10-17 17:39:28,417 [Thread-606] INFO ly.stealth.mesos.kafka.Scheduler$ - [statusUpdate] broker-191-0abe9e57-b0fb-4d87-a1b4-529acb111940 TASK_FAILED slave:#c6-S3 reason:REASON_MEMORY_LIMIT
2015-10-17 17:39:28,418 [Thread-606] INFO ly.stealth.mesos.kafka.Scheduler$ - Broker 191 failed 1, waiting 1m, next start ~ 2015-10-17 17:40:28+03
2015-10-17 17:39:29,202 [Thread-607] INFO ly.stealth.mesos.kafka.Scheduler$ - [resourceOffers]
I found the following in the Mesos master log:
...validation.cpp:422] Executor broker-191-... for task broker-191-... uses less CPUs (None) than the minimum required (0.01). Please update your executor, as this will be mandatory in future releases.
...validation.cpp:434] Executor broker-191-... for task broker-191-... uses less memory (None) than the minimum required (32MB). Please update your executor, as this will be mandatory in future releases.
But I set the CPU and memory for the brokers via broker add (update):
broker updated:
id: 191
active: false
state: stopped
resources: cpus:1.00, mem:2048, heap:1024, port:auto
failover: delay:1m, max-delay:10m
stickiness: period:10m, expires:2015-10-19 11:15:53+03
The executor doesn't get the heap setting, just the broker. I opened an issue for this: https://github.com/mesos/kafka/issues/137. Please increase the mem until a patch is available.
I suspect this hasn't been seen as a problem before because mem usually gets set to a larger value (the size of the data set you don't want to hit disk for when reading), so there is page cache for maximum efficiency: http://kafka.apache.org/documentation.html#maximizingefficiency

Google Cloud SQL + Hikari CP + Communications link failure

I'm experiencing intermittent connectivity errors from a Spring Boot application communicating with a D1 Google Cloud SQL server, with the configuration settings described here: HikariCP MySQL settings.
I was wondering if anyone has encountered this before.
I've read the FAQ posted here (Hikari FAQ) and I'm wondering if my default idleTimeout and maxLifeTime (30 mins) settings might be at fault; wait_timeout and interactive_timeout on the server are both set to the default 28800s (8 hours).
The FAQ says that these two settings should be about a minute less than the server settings, but if I'm losing connections after 30 minutes, I can't quite see how upping the maxLifeTime to 7 hrs 59 mins is going to improve the situation.
Does anyone have any recommendations?
Redacted stack trace(s), which I get from time to time:
org.springframework.security.authentication.InternalAuthenticationServiceException: Could not get JDBC Connection; nested exception is java.sql.SQLException: Timeout after 30018ms of waiting for a connection.
at org.springframework.security.authentication.dao.DaoAuthenticationProvider.retrieveUser(DaoAuthenticationProvider.java:110)
at org.springframework.security.authentication.dao.AbstractUserDetailsAuthenticationProvider.authenticate(AbstractUserDetailsAuthenticationProvider.java:132)
at org.springframework.security.authentication.ProviderManager.authenticate(ProviderManager.java:156)
at org.springframework.security.authentication.ProviderManager.authenticate(ProviderManager.java:177)
...
Caused by: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: Timeout after 30023ms of waiting for a connection.
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:80)
....
Caused by: java.sql.SQLException: Timeout after 30023ms of waiting for a connection.
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:208)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:108)
at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111)
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:77)
... 59 common frames omitted
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:630)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:695)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:727)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:737)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:787)
Hibernate Search:
2015-02-17 10:34:17.090 INFO 1 --- [ entityloader-2] o.h.s.i.SimpleIndexingProgressMonitor : HSEARCH000030: 31050 documents indexed in 1147865 ms
2015-02-17 10:34:17.090 INFO 1 --- [ entityloader-2] o.h.s.i.SimpleIndexingProgressMonitor : HSEARCH000031: Indexing speed: 27.050219 documents/second; progress: 99.89%
2015-02-17 10:41:59.917 WARN 1 --- [ntifierloader-1] com.zaxxer.hikari.proxy.ConnectionProxy : Connection com.mysql.jdbc.JDBC4Connection#372f2018 (HikariPool-0) marked as broken because of SQLSTATE(08S01), ErrorCode(0).
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 1,611,087 milliseconds ago. The last packet sent successfully to the server was 927,899 milliseconds ago.
The indexing isn't particularly quick at the moment, I think because I'm not using projections. The process takes about 30 minutes to execute.
Thanks
It could be a couple of things here. First, the network infrastructure (firewalls, load-balancers, etc.) between the application tier and the database tier can impose their own connection timeouts, regardless of MySql settings.
The indexing failure indicates that the connection was out of the pool for ~27 minutes with no SQL activity when that failure occurred.
Second, specifically regarding the "Could not get JDBC Connection" error, you may be running into Cloud SQL connection limits.
I recommend three things. One, make sure you are on the latest HikariCP (2.3.2) and latest MySql Connector/J driver (5.1.34). Two, enable DEBUG-level logging for the com.zaxxer.hikari package. HikariCP debug logging is not "chatty", but will log pool statistics every 30 seconds (and sometimes more detail in failure conditions). Lastly, try setting the maxPoolSize to something smaller (unless already at the default), and setting maxLifeTime to 15 or 20 minutes (1200000ms).
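Applied programmatically, that tuning would look something like this (a hedged Java sketch; the pool size is illustrative):

import com.zaxxer.hikari.HikariConfig;

public final class TunedPool {
    // Applies the suggested tuning: a smaller pool and a maxLifetime of
    // 20 minutes (1200000 ms), well under any intermediate idle timeout.
    public static HikariConfig tunedConfig() {
        HikariConfig config = new HikariConfig();
        config.setMaximumPoolSize(10);    // illustrative value
        config.setMaxLifetime(1_200_000); // 20 minutes
        return config;
    }
}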
If the error occurs again, post updated logs containing the HikariCP debug logs around the time of failure. Also, feel free to open a tracking issue over on Github as larger logs etc. are easier there.

MSDTC (Distributed Transaction Coordinator) Stops working. Error code -1073737669

I cannot start the Distributed Transaction Coordinator service.
It stopped working a few days ago.
When I try to start the service:
Registry properties:
RPC (for a test, the values here were changed to the opposite and back, without any results):
Windows Logs\Application logs:
53283
A MS DTC component has encountered an internal error. The process is being terminated. Error Specifics: DtcSystemShutdown (d:\w7rtm\com\complus\dtc\dtc\msdtc\src\msdtc.cpp#2539): Shutting down with an error
4111
The MS DTC service is stopping.
4102
DTC Security Configuration values (OFF = 0 and ON = 1): Network Administration of Transactions = 1,
Network Clients = 1,
Inbound Distributed Transactions using Native MSDTC Protocol = 1,
Outbound Distributed Transactions using Native MSDTC Protocol = 1,
Transaction Internet Protocol (TIP) = 0,
XA Transactions = 1,
SNA LU 6.2 Transactions = 1
Could not initialize the MS DTC Transaction Manager.
4356
Failed to initialize the MS DTC Communication Manager. Error Specifics: hr = 0x80070057, d:\w7rtm\com\complus\dtc\dtc\cm\src\ccm.cpp:2117, CmdLine: C:\Windows\System32\msdtc.exe, Pid: 5332
4358
The MS DTC Connection Manager is unable to register with RPC to use one of LRPC, TCP/IP, or UDP/IP. Please ensure that RPC is configured properly. If "ServerTcpPort" registry key is configured(DWORD value under the HKEY_LOCAL_MACHINE\Software\Microsoft\MSDTC for local DTC instance or under cluster hive for clustered DTC instance), please verify if the configured port is valid and the port is not already in use by a different component. Error Specifics:hr = 0x80070057, d:\w7rtm\com\complus\dtc\dtc\cm\src\iomgrsrv.cpp:2523, CmdLine: C:\Windows\System32\msdtc.exe, Pid: 5332
4156
String message: RPC raised an exception with a return code RPC_S_INVALIDA_ARG..
I found that we can use the -resetlog command, but this does not resolve my problem.
The firewall is disabled.
Try deleting the key HKLM\Software\Microsoft\Rpc\Internet from the registry.
To get around this issue, I had to copy the log file (which I had accidentally deleted) back to the location specified in the Local DTC log information settings.