TIMED_WAITING on message listener thread - activemq-artemis

I'm using ActiveMQ Artemis 2.10.1 and seeing a message listener thread hang. The thread goes into TIMED_WAITING and only recovers after a client JVM restart. The issue is intermittent and not easy to reproduce. The client library version is 2.16.0.
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl.waitCompletion(LargeMessageControllerImpl.java:301)
- locked <0x000000050cd4e4f0> (a org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl)
at org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl.saveBuffer(LargeMessageControllerImpl.java:275)
- locked <0x000000050cd4e4f0> (a org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl)
at org.apache.activemq.artemis.core.client.impl.ClientLargeMessageImpl.checkBuffer(ClientLargeMessageImpl.java:159)
at org.apache.activemq.artemis.core.client.impl.ClientLargeMessageImpl.getBodyBuffer(ClientLargeMessageImpl.java:91)
at org.apache.activemq.artemis.jms.client.ActiveMQBytesMessage.readBytes(ActiveMQBytesMessage.java:220)
at com.eu.jms.JMSEventBus.onMessage(JMSEventBus.java:385)
at org.springframework.jms.listener.AbstractMessageListenerContainer.doInvokeListener(AbstractMessageListenerContainer.java:746)
at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:684)
at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:651)
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.doReceiveAndExecute(AbstractPollingMessageListenerContainer.java:317)
at org.springframework.jms.listener.AbstractPollingMessageListenerContainer.receiveAndExecute(AbstractPollingMessageListenerContainer.java:255)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.invokeListener(DefaultMessageListenerContainer.java:1166)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.executeOngoingLoop(DefaultMessageListenerContainer.java:1158)
at org.springframework.jms.listener.DefaultMessageListenerContainer$AsyncMessageListenerInvoker.run(DefaultMessageListenerContainer.java:1055)
at java.lang.Thread.run(Thread.java:748)

The client is waiting in LargeMessageControllerImpl.waitCompletion. This wait will not block forever. The code waits in a loop for the packets of a large message: as long as packets are still arriving the client will continue to wait until all of them have arrived, and if a packet doesn't arrive within the given timeout it will throw an error. The timeout is based on the callTimeout which is configured on the client's URL. The default callTimeout is 30000 (i.e. 30 seconds).
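If you need more headroom, callTimeout can be raised via the connection URL. Here's a minimal sketch; the host, port, and timeout value are just placeholders:

import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

// callTimeout is in milliseconds; the default is 30000 (30 seconds).
// It bounds how long the client waits for each large-message packet.
ActiveMQConnectionFactory cf =
        new ActiveMQConnectionFactory("tcp://broker-host:61616?callTimeout=60000");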
My guess is that your client is receiving a very large message, or the network has slowed down, or perhaps a combination of these things. You can turn on TRACE logging for org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl to see the individual large message packets arriving at the client if you want more insight into what's happening.
To be clear, it's not surprising that thread dumps show your client waiting here as this is the most likely place for the code to be waiting while it receives a large message. It doesn't mean the client is stuck.
Keep in mind that if there is an actual network error or loss of connection the client will throw an error. Also, the client maintains an independent thread which sends & receives "ping" packets to & from the broker respectively. If the client doesn't get the expected ping response then it will throw an error as well. The fact that none of this happened with your client indicates the connection is valid.
I would recommend checking the size of the message at the head of the queue. The broker supports arbitrarily large messages so it could potentially be many gigs which the client will happily sit and receive as long as the connection is valid.

Related

ActiveMQ Artemis publish message loss during HA fail-over

I use ActiveMQ Artemis 2.17.0, and I'm looking to avoid message loss in a producer during fail-over.
Message loss during the active-to-passive switch is handled by catching ActiveMQUnBlockedException and sending the message again.
The brokers are configured as active/passive HA with a shared store. The active node runs on host1 and the passive node on host2.
The URL is:
(tcp://host1:61616,tcp://host2:61616)?ha=true&reconnectAttempts=-1&blockOnDurableSend=false
blockOnDurableSend is set to false for high throughput.
During the active-to-passive switch the publishing code throws ActiveMQUnBlockedException, but not during the passive-to-active switch.
We're using Spring 4.2.5 with CachingConnectionFactory as the connection factory.
I'm using the following code to send messages:
private void sendMessageInternal(final ConnectionFactory connectionFactory, final Destination queue, final String message)
        throws JMSException {
    try (final Connection connection = connectionFactory.createConnection()) {
        connection.start();
        try (final Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
             final MessageProducer producer = session.createProducer(queue)) {
            final TextMessage textMessage = session.createTextMessage(message);
            producer.send(textMessage);
        }
    } catch (JMSException thr) {
        if (thr.getCause() instanceof ActiveMQUnBlockedException) {
            // consider as fail-over disconnection, send message again.
        } else {
            throw thr;
        }
    }
}
In the host1 machine, Artemis is deployed as the master (node1).
In the host2 machine, Artemis is deployed as the slave (node2).
I did the following steps to simulate fail-over:
1. node1 and node2 started
2. node1 started as the live server and node2 started as the backup server
3. killed node1; node2 became the live server
4. the client publish code threw ActiveMQUnBlockedException and was handled by sending the message again
5. started node1 again; node1 became the live server and node2 became the backup again
6. the client publish code did not throw ActiveMQUnBlockedException and a message was lost
I'm getting the following error stack during step 3 (killed node1 and node2 became the live server):
javax.jms.JMSException: AMQ219016: Connection failure detected. Unblocking a blocking call that will never get a response
at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:540)
at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:434)
at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.sessionStop(ActiveMQSessionContext.java:470)
at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.stop(ClientSessionImpl.java:1121)
at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.stop(ClientSessionImpl.java:1110)
at org.apache.activemq.artemis.jms.client.ActiveMQSession.stop(ActiveMQSession.java:1244)
at org.apache.activemq.artemis.jms.client.ActiveMQConnection.stop(ActiveMQConnection.java:339)
at org.springframework.jms.connection.SingleConnectionFactory$SharedConnectionInvocationHandler.localStop(SingleConnectionFactory.java:644)
at org.springframework.jms.connection.SingleConnectionFactory$SharedConnectionInvocationHandler.invoke(SingleConnectionFactory.java:577)
at com.sun.proxy.$Proxy5.close(Unknown Source)
at com.eu.amq.failover.test.ProducerNodeTest.sendMessageInternal(ProducerNodeTest.java:133)
at com.eu.amq.failover.test.ProducerNodeTest.sendMessage(ProducerNodeTest.java:110)
at com.eu.amq.failover.test.ProducerNodeTest.main(ProducerNodeTest.java:90)
The ActiveMQUnBlockedException you're getting is coming from Spring's invocation of javax.jms.Connection#stop. It's not related to sending a message. Re-sending a message when you get this specific exception could result in a duplicate message.
Ultimately your problem is directly related to setting blockOnDurableSend=false. This tells the client to "fire and forget." In other words, the client won't wait for a response from the broker to ensure the message actually arrived successfully. This lack of waiting increases throughput but decreases reliability.
If you really want to mitigate potential message loss you have two main options:
1. Set blockOnDurableSend=true. This will reduce message throughput, but it's the simplest way to guarantee the message arrived at the broker successfully.
2. Use a CompletionListener, as shown in the sketch below. This will allow you to keep blockOnDurableSend=false, but the application will still be informed if there are problems sending the message, although the information will be provided asynchronously. This feature was added in JMS 2 specifically for this kind of scenario. See the JavaDoc for more details.
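Here's a minimal sketch of option 2 using the standard JMS 2 API; the method name and the handling inside the listener callbacks are placeholders for whatever your application needs:

import javax.jms.CompletionListener;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Destination;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;

private void sendAsyncWithListener(final ConnectionFactory connectionFactory, final Destination queue,
                                   final String message) throws JMSException {
    try (final Connection connection = connectionFactory.createConnection()) {
        connection.start();
        try (final Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
             final MessageProducer producer = session.createProducer(queue)) {
            final TextMessage textMessage = session.createTextMessage(message);
            // The broker's acknowledgement (or failure) is reported asynchronously,
            // so the send itself stays non-blocking even though the message is durable.
            producer.send(textMessage, new CompletionListener() {
                @Override
                public void onCompletion(Message msg) {
                    // message is safely on the broker
                }

                @Override
                public void onException(Message msg, Exception exception) {
                    // decide here whether to log, alert, or re-send the message
                }
            });
        }
    }
}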

Akka connection actor has terminated

I'm working on a REST API that uses Akka. We inherited it from a previous team, and none of us had experience with Akka before this.
Akka is being used to process the data the API is returning, and acting as the HTTP server.
Recently when the API was under load, we started getting failures like so:
),HttpProtocol(HTTP/1.1)), Response: HttpResponse(500 Internal Server Error,List(),HttpEntity.Strict(text/plain; charset=UTF-8,
Error Code: 500
Type: Internal Server Error
Stack Trace:
akka.stream.StreamTcpException: The connection actor has terminated. Stopping now.
),HttpProtocol(HTTP/1.1)), Time: 6430 ms
I have no idea where the above error is happening in the code, or how to appropriately handle this error when it happens.
Can anyone give suggestions on how to trace this down further, or suggestions on how to handle and recover from these types of issues?

Connection failure has been detected: HQ119014: Did not receive data from invm:0

I have a TimerWS running correctly, but after a few hours I get this error:
WARN [org.hornetq.core.client] (hornetq-failure-check-thread) HQ212037: Connection failure has been detected: HQ119014: Did not receive data from invm:0. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. You also might have configured connection-ttl and client-failure-check-period incorrectly. Please check user manual for more information. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
WARN [org.hornetq.jms.server] (Thread-3 (HornetQ-client-global-threads-451033388)) Notified of connection failure in xa discovery, we will retry on the next recovery: HornetQException[errorType=NOT_CONNECTED message=HQ119006: Channel disconnected]
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
Any idea?
Something is going wrong with a client's connection to the server.
These warnings are generated by the server when it determines that a client is no longer responding. When that happens, the server cleans up the server-side resources related to the client's connections.
I have seen this many times with clients that are not properly coded to close their resources when they exit (e.g. in a finally block).
Also look for network problems that break the connection.
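As a rough sketch of the kind of cleanup I mean (the connectionFactory variable and the error handling are placeholders):

import javax.jms.Connection;
import javax.jms.JMSException;

Connection connection = null;
try {
    connection = connectionFactory.createConnection();
    connection.start();
    // ... consume or produce messages ...
} catch (JMSException e) {
    // handle or log the failure
} finally {
    if (connection != null) {
        try {
            connection.close(); // releases the server-side resources tied to this client
        } catch (JMSException ignored) {
            // nothing useful to do if close itself fails
        }
    }
}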

Handling connection failures in apache-camel

I am writing an apache-camel RabbitMQ consumer. I would like to react somehow to connection problems (i.e. try to reconnect). Is it possible to configure apache-camel to automatically reconnect?
If not, how can I find out that a connection to the queue was interrupted? I've done the following test:
start the queue (and some producer)
start my consumer (it was getting messages as expected)
stop the queue (the messages stopped arriving, as expected, but no exception was thrown)
start the queue (no new messages were received)
I am using Camel in Scala (via akka-camel), but a Java solution would probably also be OK.
You can pass the flag automaticRecoveryEnabled=true in the URI; Camel will then reconnect if the connection is lost.
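For example, something along these lines; the host, exchange, and queue names are placeholders, so check the camel-rabbitmq docs for the exact options your version supports:

import org.apache.camel.builder.RouteBuilder;

public class RabbitMqConsumerRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        // automaticRecoveryEnabled=true asks the underlying RabbitMQ Java client
        // to re-establish the connection and channel if they are lost.
        from("rabbitmq://localhost:5672/my.exchange?queue=my.queue&automaticRecoveryEnabled=true")
            .to("log:incoming");
    }
}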
For automatic RabbitMQ resource recovery (Connections/Channels/Consumers/Queues/Exchanages/Bindings) when failures occur, check out Lyra (which I authored). Example usage:
Config config = new Config()
.withRecoveryPolicy(new RecoveryPolicy()
.withMaxAttempts(20)
.withInterval(Duration.seconds(1))
.withMaxDuration(Duration.minutes(5)));
ConnectionOptions options = new ConnectionOptions().withHost("localhost");
Connection connection = Connections.create(options, config);
The rest of the API is just the amqp-client API, except your resources are automatically recovered when failures occur.
I'm not sure about camel-rabbitmq specifically, but hopefully there's a way you can swap in your own resource creation via Lyra.
The current camel-rabbitmq component just creates a connection and channel when the consumer or producer is started, so it doesn't have a chance to catch the connection exception :(.

High tomcat thread count caused by threads waiting on MultiThreadedHttpConnectionManager ConnectionPool

We've recently started seeing spikes in the thread counts on our tomcat servers (peaking at over 1000 when normally at around 100). We performed a thread dump on one of the tomcat servers whilst its thread count was high and found that a large number of the threads were waiting on MultiThreadedHttpConnectionManager$ConnectionPool, stack trace as follows:
"TP-Processor21700" daemon prio=10 tid=0x4a0b3400 nid=0x2091 in Object.wait() [0x399f3000..0x399f4004]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x58ee5030> (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518)
- locked <0x58ee5030> (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
...
There are 3 points in our code where httpClient.executeMethod() is called (to obtain info via an HTTP request to another tomcat server). In each case the GetMethod object passed to it has had its socket timeout value set beforehand (i.e. via getMethod.getParams().setSoTimeout()), and the MultiThreadedHttpConnectionManager is configured in Spring to have a connectionTimeout value of 10 seconds. One thing I have noticed is that only 2 of the 3 httpClient.executeMethod() invocations are followed by a call to getMethod.releaseConnection(), so I'm wondering if this may be the cause of the problem (i.e. connections not being explicitly released).
However, what's strange is that the problem has only started occurring in the last few days and the source code has not been modified for over a year, plus there has been no recent surge in requests coming through to the tomcat servers. One change that did occur a couple of days before the problem started was that we upgraded the JVM used by the tomcat server from Java 5 (1.5 update 14) to Java 6 (1.6 update 25). We have tried temporarily reverting the JVM version to Java 5 to see if the problem stopped occurring, but it did not. Another point to note is that in most cases the tomcat server eventually recovers and the thread count drops back to normal - we've only had one instance where a tomcat process appears to have crashed because of the thread count increase.
We are running Tomcat 5.5 with commons-httpclient-3.1.jar on Java 1.6 update 25 in a Red Hat Linux environment.
Please let me know if you can suggest any ideas as to what may be the cause of this issue.
Thanks.
The problem was indeed caused by the fact that only 2 of the 3 httpClient.executeMethod(getMethod) invocations were followed by a call to getMethod.releaseConnection(). Ensuring all 3 httpClient.executeMethod(getMethod) invocations were inside a try/catch block followed by a finally block containing a call to getMethod.releaseConnection() prevented the high thread counts from occurring. Although this code had been in our live system for over a year, it appears that the reason the high thread count issue only recently started occurring was because various search engine crawlers had started hitting the site with lots of URL requests that caused the code where the connection was being used but not subsequently released to execute. Problem solved.
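For reference, this is roughly the pattern that fixed it (a sketch only; the connectionManager variable, URL, and response handling are placeholders):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

// Each executeMethod() call borrows a connection from the
// MultiThreadedHttpConnectionManager pool; releaseConnection() in a finally
// block guarantees it is returned even if the request or response handling fails.
HttpClient httpClient = new HttpClient(connectionManager); // the shared MultiThreadedHttpConnectionManager
GetMethod getMethod = new GetMethod("http://other-tomcat-server/some/path");
getMethod.getParams().setSoTimeout(10000); // socket timeout in milliseconds
try {
    int status = httpClient.executeMethod(getMethod);
    // ... read and process the response body ...
} finally {
    getMethod.releaseConnection();
}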