High tomcat thread count caused by threads waiting on MultiThreadedHttpConnectionManager ConnectionPool - httpclient

We've recently started seeing spikes in the thread counts on our tomcat servers (peaking at over 1000 when normally around 100). We took a thread dump on one of the tomcat servers whilst its thread count was high and found that a large number of the threads were waiting on MultiThreadedHttpConnectionManager$ConnectionPool; stack trace as follows:
"TP-Processor21700" daemon prio=10 tid=0x4a0b3400 nid=0x2091 in Object.wait() [0x399f3000..0x399f4004]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x58ee5030> (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518)
- locked <0x58ee5030> (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
...
There are 3 points in our code where httpClient.executeMethod() is called (to obtain info via an http request to another tomcat server). In each case the GetMethod object passed to it has had its socket timeout value set beforehand (i.e. via getMethod.getParams().setSoTimeout()), and the MultiThreadedHttpConnectionManager is configured in Spring with a connectionTimeout value of 10 seconds. One thing I have noticed is that only 2 of the 3 httpClient.executeMethod() invocations are followed by a call to getMethod.releaseConnection(), so I'm wondering if this may be the cause of the problem (i.e. connections not being explicitly released).
What's strange, however, is that the problem only started occurring in the last few days, the source code has not been modified for over a year, and there has been no recent surge in requests coming through to the tomcat servers. One change that did occur a couple of days before the problem started was that we upgraded the JVM used by the tomcat server from Java 5 (1.5 update 14) to Java 6 (1.6 update 25). We tried temporarily reverting the JVM version to Java 5 to see if the problem stopped occurring, but it did not.
Another point to note is that in most cases the tomcat server eventually recovers and the thread count drops back to normal - we've only had one instance where a tomcat process appears to have crashed because of the thread count increase.
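For reference, the equivalent plain-Java setup looks roughly like this (a minimal sketch only; the classes are the commons-httpclient 3.x API, but the URL is a placeholder and the actual wiring is done via Spring beans in our application):
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.methods.GetMethod;

// Shared connection manager; in Spring this is a bean with connectionTimeout set to 10 seconds.
MultiThreadedHttpConnectionManager connectionManager = new MultiThreadedHttpConnectionManager();
connectionManager.getParams().setConnectionTimeout(10000);

HttpClient httpClient = new HttpClient(connectionManager);

// Each request sets its own socket timeout before being executed.
GetMethod getMethod = new GetMethod("http://other-tomcat-server/info"); // placeholder URL
getMethod.getParams().setSoTimeout(10000);
int status = httpClient.executeMethod(getMethod);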
We are running Tomcat 5.5 with commons-httpclient-3.1.jar running against a Java 1.6 update 25 on a Red Hat linux environment.
Please let me know if you can suggest any ideas as to what may be the cause of this issue.
Thanks.

The problem was indeed caused by the fact that only 2 of the 3 httpClient.executeMethod(getMethod) invocations were followed by a call to getMethod.releaseConnection(). Ensuring all 3 invocations were inside a try/catch block with a finally block containing a call to getMethod.releaseConnection() prevented the high thread counts from occurring.
Although this code had been in our live system for over a year, it appears the high thread count issue only started recently because various search engine crawlers had begun hitting the site with lots of URL requests that exercised the code path where the connection was used but never released. Problem solved.
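For anyone hitting the same thing, the fix boils down to the standard HttpClient 3.x usage pattern, roughly like this (a minimal sketch; the URL and timeout value are placeholders, and httpClient is assumed to be the shared client backed by the MultiThreadedHttpConnectionManager):
import java.io.IOException;
import org.apache.commons.httpclient.methods.GetMethod;

GetMethod getMethod = new GetMethod("http://other-tomcat-server/info"); // placeholder URL
getMethod.getParams().setSoTimeout(10000);
try {
    int status = httpClient.executeMethod(getMethod);
    String body = getMethod.getResponseBodyAsString(); // read the response before releasing
    // ... process the response ...
} catch (IOException e) {
    // log / handle the failure
} finally {
    // Always return the connection to the MultiThreadedHttpConnectionManager,
    // otherwise the pool slot leaks and other threads end up waiting in doGetConnection().
    getMethod.releaseConnection();
}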

Related

TIMED_WAITING on message listener thread

I use ActiveMQ Artemis 2.10.1 and I am getting a message listener thread hanging issue.
The thread goes into TIMED_WAITING and only recovers after a client JVM restart. This is an intermittent issue and not easy to reproduce. The client library version is 2.16.0.
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl.waitCompletion(LargeMessageControllerImpl.java:301)
- locked <0x000000050cd4e4f0> (a org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl)
at org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl.saveBuffer(LargeMessageControllerImpl.java:275)
- locked <0x000000050cd4e4f0> (a org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl)
at org.apache.activemq.artemis.core.client.impl.ClientLargeMessageImpl.checkBuffer(ClientLargeMessageImpl.java:159)
at org.apache.activemq.artemis.core.client.impl.ClientLargeMessageImpl.getBodyBuffer(ClientLargeMessageImpl.java:91)
at org.apache.activemq.artemis.jms.client.ActiveMQBytesMessage.readBytes(ActiveMQBytesMessage.java:220)
at com.eu.jms.JMSEventBus.onMessage(JMSEventBus.java:385)
at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:746)
at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:684)
at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:651)
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:317)
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:255)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1166)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1158)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:1055)
at java.lang.Thread.run(Thread.java:748)
The client is waiting in LargeMessageControllerImpl.waitCompletion. This wait will not block forever. The code waits in a loop for packets of a large message. As long as packets of the large message are still arriving the client will continue to wait until all the packets have arrived or if a packet doesn't arrive within the given timeout it will throw an error. The timeout is based on the callTimeout which is configured on the client's URL. The default callTimeout is 30000 (i.e. 30 seconds).
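If very large messages legitimately need more time, callTimeout can be raised on the connection URL, for example (a sketch only; the host, port, and 60 second value are just illustrations):
import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

// callTimeout is a URI parameter, specified in milliseconds.
ActiveMQConnectionFactory factory =
        new ActiveMQConnectionFactory("tcp://localhost:61616?callTimeout=60000");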
My guess is that your client is receiving a very large message, or the network has slowed down, or perhaps a combination of these things. You can turn on TRACE logging for org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl to see the individual large message packets arriving at the client if you want more insight into what's happening.
To be clear, it's not surprising that thread dumps show your client waiting here as this is the most likely place for the code to be waiting while it receives a large message. It doesn't mean the client is stuck.
Keep in mind that if there is an actual network error or loss of connection the client will throw an error. Also, the client maintains an independent thread which sends & receives "ping" packets to & from the broker respectively. If the client doesn't get the expected ping response then it will throw an error as well. The fact that none of this happened with your client indicates the connection is valid.
I would recommend checking the size of the message at the head of the queue. The broker supports arbitrarily large messages so it could potentially be many gigs which the client will happily sit and receive as long as the connection is valid.
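One way to check that from the client side is a plain JMS QueueBrowser, e.g. (a sketch only; it assumes the messages are BytesMessages and that the queue name is "exampleQueue" - the broker's management API is another option):
import java.util.Enumeration;
import javax.jms.BytesMessage;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.Queue;
import javax.jms.QueueBrowser;
import javax.jms.Session;

// Prints the body length of the message at the head of the queue, if it is a BytesMessage.
static void printHeadMessageSize(ConnectionFactory factory) throws JMSException {
    Connection connection = factory.createConnection();
    try {
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("exampleQueue"); // assumed queue name
        QueueBrowser browser = session.createBrowser(queue);
        Enumeration<?> messages = browser.getEnumeration();
        if (messages.hasMoreElements()) {
            Message head = (Message) messages.nextElement();
            if (head instanceof BytesMessage) {
                System.out.println("Head message body length: " + ((BytesMessage) head).getBodyLength());
            }
        }
    } finally {
        connection.close();
    }
}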

kafka connect 100% cpu

I am using kafka_2.12-2.3.1 on Ubuntu 18 with JDK 13. While everything is working fine, I noticed that Kafka Connect is consuming 100% CPU. Even though there are no messages to process, it continues to consume CPU.
I took a thread dump but there are no clues in it.
I am using the Postgres connector with PostgreSQL 11.x. I am also starting Kafka Connect in distributed mode.
UPDATE 1
After searching through various sites, I got a few hints and realised that it could be a bug in the JDK itself (caused by polling). Below is the section of the thread dump that most of them highlighted.
I will check the behaviour with standalone Kafka Connect.
"KafkaBasedLog Work Thread - connect-offsets" #31 prio=5 os_prio=0 cpu=828.02ms elapsed=1039.24s tid=0x00007f8530089800 nid=0x305b runnable [0x00007f852a3f4000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPoll.wait(java.base#13.0.1/Native Method)
at sun.nio.ch.EPollSelectorImpl.doSelect(java.base#13.0.1/EPollSelectorImpl.java:120)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base#13.0.1/SelectorImpl.java:124)
- locked <0x00000000809efd98> (a sun.nio.ch.Util$2)
- locked <0x00000000809efd40> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(java.base#13.0.1/SelectorImpl.java:136)
at org.apache.kafka.common.network.Selector.select(Selector.java:794)
at org.apache.kafka.common.network.Selector.poll(Selector.java:467)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:539)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:262)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:233)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1281)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1225)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1201)
at org.apache.kafka.connect.util.KafkaBasedLog.poll(KafkaBasedLog.java:262)
at org.apache.kafka.connect.util.KafkaBasedLog.access$500(KafkaBasedLog.java:71)
at org.apache.kafka.connect.util.KafkaBasedLog$WorkThread.run(KafkaBasedLog.java:337)
UPDATE 2
I tried standalone mode and the CPU still tops out at 100%. I confirm that there are no messages to process.

Google Cloud SQL + Hikari CP + Communications link failure

I'm experiencing intermittent connectivity errors from a Spring Boot application communicating with a D1 Google Cloud SQL server, using the configuration settings described here: HikariCP MySQL settings
I was wondering if anyone has encountered this before.
I've read the FAQ posted here Hikari FAQ and I'm wondering if my default idleTimeout and maxLifeTime (30 mins) settings might be at fault; wait_timeout and interactive_timeout on the server are both set to the default 28800s (8 hours).
The FAQ says that these two settings should be about a minute less than the server settings, but if I'm losing connections after 30 minutes I can't quite see how upping maxLifeTime to 7 hrs 59 mins is going to improve the situation.
Does anyone have any recommendations?
Redacted stack trace(s):
Get these from time to time
org.springframework.security.authentication.InternalAuthenticationServiceException: Could not get JDBC Connection; nested exception is java.sql.SQLException: Timeout after 30018ms of waiting for a connection.
at org.springframework.security.authentication.dao.DaoAuthenticationProvider.retrieveUser(DaoAuthenticationProvider.java:110)
at org.springframework.security.authentication.dao.AbstractUserDetailsAuthenticationProvider.authenticate(AbstractUserDetailsAuthenticationProvider.java:132)
at org.springframework.security.authentication.ProviderManager.authenticate(ProviderManager.java:156)
at org.springframework.security.authentication.ProviderManager.authenticate(ProviderManager.java:177)
...
Caused by: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is java.sql.SQLException: Timeout after 30023ms of waiting for a connection.
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:80)
....
Caused by: java.sql.SQLException: Timeout after 30023ms of waiting for a connection.
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:208)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:108)
at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111)
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:77)
... 59 common frames omitted
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:630)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:695)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:727)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:737)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:787)
Hibernate search:
2015-02-17 10:34:17.090 INFO 1 --- [ entityloader-2] o.h.s.i.SimpleIndexingProgressMonitor : HSEARCH000030: 31050 documents indexed in 1147865 ms
2015-02-17 10:34:17.090 INFO 1 --- [ entityloader-2] o.h.s.i.SimpleIndexingProgressMonitor : HSEARCH000031: Indexing speed: 27.050219 documents/second; progress: 99.89%
2015-02-17 10:41:59.917 WARN 1 --- [ntifierloader-1] com.zaxxer.hikari.proxy.ConnectionProxy : Connection com.mysql.jdbc.JDBC4Connection#372f2018 (HikariPool-0) marked as broken because of SQLSTATE(08S01), ErrorCode(0).
com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
The last packet successfully received from the server was 1,611,087 milliseconds ago. The last packet sent successfully to the server was 927,899 milliseconds ago.
The indexing isn't particularly quick at the moment, I think because I'm not using projections. The process takes about 30 minutes to execute.
Thanks
It could be a couple of things here. First, the network infrastructure (firewalls, load-balancers, etc.) between the application tier and the database tier can impose their own connection timeouts, regardless of MySql settings.
The indexing failure indicates that the connection was out of the pool for ~27 minutes with no SQL activity when that failure occurred.
Second, specifically regarding the "Could not get JDBC Connection" error, you may be running into Cloud SQL connection limits.
I recommend three things. One, make sure you are on the latest HikariCP (2.3.2) and latest MySql Connector/J driver (5.1.34). Two, enable DEBUG-level logging for the com.zaxxer.hikari package. HikariCP debug logging is not "chatty", but will log pool statistics every 30 seconds (and sometimes more detail in failure conditions). Lastly, try setting the maxPoolSize to something smaller (unless already at the default), and setting maxLifeTime to 15 or 20 minutes (1200000ms).
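As a rough illustration of those pool settings in code (a sketch only - the JDBC URL and credentials are placeholders, and the values are the ones suggested above):
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:mysql://<cloud-sql-host>:3306/app"); // placeholder
config.setUsername("app");                                   // placeholder
config.setPassword("secret");                                // placeholder
config.setMaximumPoolSize(10);   // keep the pool modest to stay under Cloud SQL connection limits
config.setMaxLifetime(1200000);  // 20 minutes, well under the 8 hour wait_timeout
HikariDataSource dataSource = new HikariDataSource(config);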
If the error occurs again, post updated logs containing the HikariCP debug logs around the time of failure. Also, feel free to open a tracking issue over on Github as larger logs etc. are easier there.

JVM crashes frequently

The JVM crashes unexpectedly and frequently in our production environment, resulting in JBoss (EAP 6.3) going down. We have Java 7 update 72 installed.
The crash logs all have the same output, where the current thread is:
Current thread (0x00000000d1d99000): JavaThread "Lucene Merge Thread #0" daemon [_thread_in_Java, id=1144, stack(0x00000000f6a00000,0x00000000f6b00000)]
and the rest of the log is full of:
JavaThread "elasticsearch[Node BD852E44][search][T#68]" daemon [_thread_blocked, id=14396, stack(0x00000000f7b30000,0x00000000f7c30000)]
Elasticsearch is somehow related to indexing and it uses Lucene under the hood as far as I understand, but we have a number of applications deployed. How can I check what is causing this? Can someone please help? The complete crash logs are at: http://pastebin.com/845LU9iK
Looks like it didn't manage to record stack traces for the affected thread.
If that's the same for all crashes then it doesn't seem to match known lucene or jboss bugs.
# guarantee(result == EXCEPTION_CONTINUE_EXECUTION) failed: Unexpected result from topLevelExceptionFilter
AIUI this indicates an error in native exception handling, so it's one error masking another, probably making this crash log fairly useless.
So I can only provide really generic advice:
you're using an older JVM version; update to the latest Java 7, Java 8 or possibly even a Java 9 dev build and see if the crash goes away. Even if they still crash, they might provide different/more useful error reports
to diagnose potential compiler bugs you can try running with the following flags
-XX:-TieredCompilation 1 should disable the C1 compiler
-XX:+TieredCompilation -XX:TieredStopAtLevel=1 should disable the C2 compiler
-Xint disables all JIT, very slow
ask on the hotspot-dev mailing list for further guidance
1: Tiered compilation is a new Java 7 feature; it basically combines the interpreter and the C1 and C2 JIT compilers (which were formerly used separately in the client and server VMs) into different optimizing stages.
Each of them can have optimization bugs. Turning off individual stages helps isolate them as a potential cause.
Edit: The new crash report is more useful since it at least has Java frames; the interesting part is the following:
J 1559 sun.misc.Unsafe.getByte(J)B (0 bytes) # 0x000000000178e99b [0x000000000178e960+0x3b]
j java.nio.DirectByteBuffer.get()B+11
j org.apache.lucene.store.ByteBufferIndexInput.readByte()B+4
J 9447 C2 org.apache.lucene.store.DataInput.readVInt()I (114 bytes) # 0x000000000348cc00 [0x000000000348cbc0+0x40]
DataInput.readVInt seems to be an ongoing source of grief, see this SO answer for possible solutions

Orientdb client using 100% cpu after connection has been established

When I use the following code in a servlet running in Tomcat 7 (also tested in Tomcat 6) on Ubuntu 10.10 Server (OpenJDK 1.6.0_20, 64-bit), the Java process uses 100% CPU or more after the connection has been established once.
// Acquire an object database connection from the global pool (remote server, database "test").
ODatabaseObjectTx db = ODatabaseObjectPool.global().acquire("remote:localhost/test", "test", "test");
// Register the entity class so query results are mapped to BlogPost instances.
db.getEntityManager().registerEntityClass(BlogPost.class);
// Run a synchronous SQL query against the remote database.
List<BlogPost> posts = db.query(new OSQLSynchQuery<BlogPost>("select * from BlogPost order by date desc"));
db.close();
Does anyone know how to resolve this issue?
EDIT:
The issue also happens right after acquire. There is a Thread "ClientService" being spawned that causes the high load.
I took several thread dumps and it always shows the same for this thread:
"ClientService" daemon prio=10 tid=0x00007f4d88344800 nid=0x558e runnable [0x00007f4d872da000]
java.lang.Thread.State: RUNNABLE
at java.util.concurrent.locks.ReentrantLock$Sync.tryRelease(ReentrantLock.java:127)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(AbstractQueuedSynchronizer.java:1239)
at java.util.concurrent.locks.ReentrantLock.unlock(ReentrantLock.java:431)
at com.orientechnologies.orient.enterprise.channel.binary.OChannelBinaryAsynch.endResponse(OChannelBinaryAsynch.java:73)
at com.orientechnologies.orient.client.remote.OStorageRemoteServiceThread.execute(OStorageRemoteServiceThread.java:59)
at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:48)
When I suspend the thread in the debugger, the high load stops until I press continue.
Does it happen after the acquire() or after the query?