What does the message "The Critical Analyzer detected slow paths on the broker" mean in an Artemis broker? - activemq-artemis

Setup: I have an Artemis broker HA cluster with 3 brokers. The HA policy is replication. Each broker is running in its own VM.
Problem: When I leave my brokers running for a long time, usually after 5-6 hours, I get the error below.
2022-11-21 21:32:37,902 WARN [org.apache.activemq.artemis.utils.critical.CriticalMeasure] Component org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager is expired on path 0
2022-11-21 21:32:37,902 INFO [org.apache.activemq.artemis.core.server] AMQ224107: The Critical Analyzer detected slow paths on the broker. It is recommended that you enable trace logs on org.apache.activemq.artemis.utils.critical while you troubleshoot this issue. You should disable the trace logs when you have finished troubleshooting.
2022-11-21 21:32:37,902 ERROR [org.apache.activemq.artemis.core.server] AMQ224079: The process for the virtual machine will be killed, as component org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager#46d59067 is not responsive
2022-11-21 21:32:37,969 WARN [org.apache.activemq.artemis.core.server] AMQ222199: Thread dump:
*******************************************************************************
Complete Thread dump
"Thread-517 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$7#437da279)" Id=602 TIMED_WAITING on java.util.concurrent.SynchronousQueue$TransferStack#75f49105
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.SynchronousQueue$TransferStack#75f49105
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
What does this really mean? I understand that the critical analyzer sees an error and halts the broker, but what is causing this error?

You may take a look at the documentation. Basically you are experiencing some issue that the broker detects, and it shuts itself down before it becomes too unresponsive. By setting the critical-analyzer-policy to LOG you might get more clues about the issue.
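For reference, the analyzer is configured in broker.xml. A minimal sketch, assuming a recent Artemis version (the values shown are the documented defaults except for the policy, which is switched from HALT to LOG; this only changes how the broker reacts, the underlying slow path still has to be investigated):

<configuration xmlns="urn:activemq">
  <core xmlns="urn:activemq:core">
    <!-- keep the Critical Analyzer running -->
    <critical-analyzer>true</critical-analyzer>
    <!-- log the slow path instead of halting/shutting down the JVM -->
    <critical-analyzer-policy>LOG</critical-analyzer-policy>
    <!-- how long a critical component may stay blocked before it is flagged (ms) -->
    <critical-analyzer-timeout>120000</critical-analyzer-timeout>
    <!-- how often components are checked (ms, defaults to half the timeout) -->
    <critical-analyzer-check-period>60000</critical-analyzer-check-period>
  </core>
</configuration>

Since the component that expired in your log is the JournalStorageManager, slow disk I/O on the VM hosting the journal is a likely area to look at while the trace logs are enabled.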

Related

Large volume of scheduled messages seem to get stuck on ActiveMQ Artemis broker

I am using Apache ActiveMQ Artemis 2.17.0 to store a few million scheduled messages. Due to the volume of messages, paging is triggered and almost half of the messages are stored on a shared filesystem (master/slave shared-filesystem store (NFSv4) HA topology).
These messages are scheduled every X hours, and each "batch" is around 500k messages, with each individual message a bit larger than 1KB.
In essence, my use case dictates producing 4-5 million messages at some point near midnight, scheduled to leave the next day in batches at predefined times (e.g. 11 a.m., 3 p.m., 6 p.m.). The messages are not produced in scheduled-time order, since messages for the 6 p.m. timeslot can be written to the queue earlier than others, so scheduled messages can be interleaved. Also, since the volume of messages is quite large, I can see the address memory used maxing out and multiple files being created in the paging directory for the queue.
My issue appears when my JMS application starts to consume messages from the specified queue: although it starts consuming very fast, at some point it blocks and becomes unresponsive. When I check the broker's logs I can see the following:
2021-03-31 15:26:03,520 WARN [org.apache.activemq.artemis.utils.critical.CriticalMeasure] Component org.apache.activemq.artemis.core.server.impl.QueueImpl is expired on path 3: java.lang.Exception: entered
at org.apache.activemq.artemis.utils.critical.CriticalMeasure.enterCritical(CriticalMeasure.java:56) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.critical.CriticalComponentImpl.enterCritical(CriticalComponentImpl.java:52) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.server.impl.QueueImpl.addConsumer(QueueImpl.java:1403) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.server.impl.ServerConsumerImpl.<init>(ServerConsumerImpl.java:262) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.createConsumer(ServerSessionImpl.java:569) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.slowPacketHandler(ServerSessionPacketHandler.java:328) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.onMessagePacket(ServerSessionPacketHandler.java:292) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.Actor.doTask(Actor.java:33) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65) [artemis-commons-2.17.0.jar:2.17.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_262]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_262]
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.17.0.jar:2.17.0]
2021-03-31 15:26:03,525 ERROR [org.apache.activemq.artemis.core.server] AMQ224079: The process for the virtual machine will be killed, as component QueueImpl[name=my-queue, postOffice=PostOfficeImpl [server=ActiveMQServerImpl::serverUUID=f3fddf74-9212-11eb-9a18-005056b570b4], temp=false]#5a4be15a is not responsive
2021-03-31 15:26:03,980 WARN [org.apache.activemq.artemis.core.server] AMQ222199: Thread dump: **********
The broker halts and the slave broker takes over, but the scheduled messages are still hanging on the queue.
When restarting the master broker I can see logs like those below:
2021-03-31 15:59:41,810 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session f558ac8f-9220-11eb-98a4-005056b5d5f6
2021-03-31 15:59:41,814 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /ip-app:52922 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-03-31 16:01:14,163 WARN [org.apache.activemq.artemis.core.server] AMQ222172: Queue my-queue was busy for more than 10,000 milliseconds. There are possibly consumers hanging on a network operation
2021-03-31 16:01:14,163 WARN [org.apache.activemq.artemis.core.server] AMQ222144: Queue could not finish waiting executors. Try increasing the thread pool size
Looking at CPU and memory metrics I do not see anything unusual: at consumption time the CPU is under 50% of its maximum load, and memory on the broker host is at similar levels (60% used). I/O is rather insignificant, but what may be helpful is that the number of blocked threads increases sharply just before the error (0 -> 40). Heap memory is also maxed out, but I do not see any GC activity out of the ordinary as far as I can tell.
This figure is from reproducing the issue for messages scheduled to leave at 2:30 p.m.
Also, here is part of the thread dump showing BLOCKED and TIMED_WAITING threads:
"Thread-2 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6#2a54a73f)" Id=44 TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject#10e20f4f
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject#10e20f4f
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
at org.apache.activemq.artemis.utils.ActiveMQThreadPoolExecutor$ThreadPoolQueue.poll(ActiveMQThreadPoolExecutor.java:112)
at org.apache.activemq.artemis.utils.ActiveMQThreadPoolExecutor$ThreadPoolQueue.poll(ActiveMQThreadPoolExecutor.java:45)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
"Thread-1 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6#2a54a73f)" Id=43 BLOCKED on org.apache.activemq.artemis.core.server.impl.QueueImpl#64e9ee3c owned by "Thread-3 (ActiveMQ-scheduled-threads)" Id=24
at org.apache.activemq.artemis.core.server.impl.RefsOperation.afterCommit(RefsOperation.java:182)
- blocked on org.apache.activemq.artemis.core.server.impl.QueueImpl#64e9ee3c
at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl.afterCommit(TransactionImpl.java:579)
- locked org.apache.activemq.artemis.core.transaction.impl.TransactionImpl#26fb9cb9
at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl.access$100(TransactionImpl.java:40)
at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl$2.done(TransactionImpl.java:322)
at org.apache.activemq.artemis.core.persistence.impl.journal.OperationContextImpl$1.run(OperationContextImpl.java:279)
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$30/1259174396.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker#535779e4
"Thread-3 (ActiveMQ-scheduled-threads)" Id=24 RUNNABLE
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at org.apache.activemq.artemis.core.io.nio.NIOSequentialFile.open(NIOSequentialFile.java:143)
at org.apache.activemq.artemis.core.io.nio.NIOSequentialFile.open(NIOSequentialFile.java:98)
- locked org.apache.activemq.artemis.core.io.nio.NIOSequentialFile#520b145f
at org.apache.activemq.artemis.core.paging.cursor.impl.PageReader.openPage(PageReader.java:114)
at org.apache.activemq.artemis.core.paging.cursor.impl.PageReader.getMessage(PageReader.java:83)
at org.apache.activemq.artemis.core.paging.cursor.impl.PageReader.getMessage(PageReader.java:105)
- locked org.apache.activemq.artemis.core.paging.cursor.impl.PageReader#669a8420
at org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.getMessage(PageCursorProviderImpl.java:151)
at org.apache.activemq.artemis.core.paging.cursor.impl.PageSubscriptionImpl.queryMessage(PageSubscriptionImpl.java:634)
at org.apache.activemq.artemis.core.paging.cursor.PagedReferenceImpl.getPagedMessage(PagedReferenceImpl.java:132)
- locked org.apache.activemq.artemis.core.paging.cursor.PagedReferenceImpl#3bfc8d39
at org.apache.activemq.artemis.core.paging.cursor.PagedReferenceImpl.getMessage(PagedReferenceImpl.java:99)
at org.apache.activemq.artemis.core.paging.cursor.PagedReferenceImpl.getMessageMemoryEstimate(PagedReferenceImpl.java:186)
at org.apache.activemq.artemis.core.server.impl.QueueImpl.internalAddHead(QueueImpl.java:2839)
at org.apache.activemq.artemis.core.server.impl.QueueImpl.addHead(QueueImpl.java:1102)
- locked org.apache.activemq.artemis.core.server.impl.QueueImpl#64e9ee3c
at org.apache.activemq.artemis.core.server.impl.QueueImpl.addHead(QueueImpl.java:1138)
- locked org.apache.activemq.artemis.core.server.impl.QueueImpl#64e9ee3c
at org.apache.activemq.artemis.core.server.impl.ScheduledDeliveryHandlerImpl$ScheduledDeliveryRunnable.run(ScheduledDeliveryHandlerImpl.java:264)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker#11f0a5a1
Note also that I did try increasing the memory resources on the broker so as to avoid triggering paging to disk, and doing so made the problem disappear, but since my message volume is going to be erratic I do not see that as a long-term solution.
Can you give me any pointers on how to resolve this issue? How can I cope with large volumes of paged data stored in the broker that need to be released in large chunks to consumers?
Edit: After increasing the number of scheduled threads
With an increased number of scheduled threads the critical analyzer did not terminate the broker, but I got constant warnings like the ones below:
2021-04-14 17:48:26,818 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 4606893a-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,818 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 460eedac-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,818 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 460eedac-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,818 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 46194def-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 46194def-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 4620ef13-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 4620ef13-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 46289036-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 46289036-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 562d6a93-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 562d6a93-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 56324c96-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 56324c96-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,838 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /my-host:47392 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-04-14 17:48:26,840 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /my-host:47392 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-04-14 17:48:26,855 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /my-host:47392 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-04-14 17:48:26,864 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /my-host:47392 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-04-14 17:49:26,804 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 82978142-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:49:26,804 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 82978142-9d30-11eb-9b31-005056b5d5f6
Traffic on my consumer side had spikes and dips, as shown in the following figure, which essentially crippled throughput. Note that more than 80% of the messages were already in memory and only a small portion was paged to disk.
I think the two most important things for your use case are going to be:
Avoid paging. Paging is a palliative measure meant to be used as a last resort to keep the broker functioning. If at all possible you should configure your broker to handle your load without paging (e.g. acquire more RAM, allocate more heap). It's worth noting that the broker is not designed like a database; it is designed for messages to flow through it. It can certainly buffer messages (potentially millions, depending on the configuration and hardware), but when it's forced to page, performance will drop substantially simply because disk is orders of magnitude slower than RAM.
Increase scheduled-thread-pool-max-size. Dumping this many scheduled messages on the broker is going to put tremendous pressure on the scheduled thread pool. The default size is only 5. I suggest you increase that until you stop seeing performance benefits.
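A hedged broker.xml sketch of both suggestions (the element names are standard Artemis settings; the sizes are only placeholders that need tuning to your heap and hardware, and the match value reuses the my-queue address from your logs):

<core xmlns="urn:activemq:core">
  <!-- suggestion 2: more threads for the scheduled executor (default is 5) -->
  <scheduled-thread-pool-max-size>32</scheduled-thread-pool-max-size>

  <!-- suggestion 1: raise the per-address memory limit so the scheduled backlog
       fits in heap and paging is never triggered (requires enough -Xmx as well) -->
  <address-settings>
    <address-setting match="my-queue">
      <max-size-bytes>4294967296</max-size-bytes>   <!-- 4 GiB, placeholder -->
      <page-size-bytes>10485760</page-size-bytes>   <!-- 10 MiB per page file -->
      <address-full-policy>PAGE</address-full-policy>
    </address-setting>
  </address-settings>
</core>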

Cannot start Zookeeper due to: Exception causing close of session 0x0 due to java.io.EOFException

I am trying to start up Zookeeper via the CLI with the command:
bin/zookeeper-server-start.sh ../config/zookeeper.properties
It hums along for a second, with everything seemingly correct, until it says this:
INFO binding to port 0.0.0.0/0.0.0.0:2181 (org.apache.zookeeper.server.NIOServerCnxnFactory)
and then the following loops indefinitely until I exit:
[2018-08-10 15:07:48,223] INFO Accepted socket connection from /172.31.39.32:46374 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2018-08-10 15:07:48,228] WARN Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running (org.apache.zookeeper.server.NIOServerCnxn)
[2018-08-10 15:07:48,228] INFO Closed socket connection for client /172.31.39.32:46374 (no session established for client) (org.apache.zookeeper.server.NIOServerCnxn)
This is a single server, and I believe a single-node test server, so there isn't a quorum or other pieces running. My ZooKeeper config is basic; it only contains this:
dataDir=/tmp/zookeeper
clientPort=2181
maxClientCnxns=0
The weird thing is, my ZooKeeper had been running fine, and I had made NO changes to the config. I pulled it down to fix something else with a quick restart of ZooKeeper, and now it won't budge. I've checked, and nothing else is running on port 2181.
I see this question has been asked several times with no answers; any ideas?
This might be happening because of some corruption in the ZooKeeper data. You should not set dataDir to /tmp/*. If your machine purges data from /tmp, it will be difficult for ZooKeeper to restore its state upon restart. If you check the ZooKeeper logs, you should see some kind of exception there.
Since you mentioned this ZooKeeper instance is for test purposes only, you should set dataDir to anything but /tmp and try restarting.
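For example, assuming the zookeeper.properties bundled with Kafka/Confluent, something like this (the path is only an illustration, any persistent directory works):

# zookeeper.properties - keep ZooKeeper state somewhere the OS will not purge
dataDir=/var/lib/zookeeper
clientPort=2181
maxClientCnxns=0

If the data under /tmp/zookeeper is already corrupted, either delete it or copy the snapshot/log files into the new dataDir before restarting, depending on whether you need to keep the state.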

Root cause for Connection broken for id 1, my id = 3, error =

I am using Confluent 4 for the Kafka and ZooKeeper installation.
In our Kafka cluster environment (3 brokers and 3 ZooKeeper nodes running on 3 AWS instances) we are seeing the set of warnings below, repeatedly recorded in the broker's server.log file.
We have not observed any functional issues due to this yet, but we cannot find the root cause, and there is a chance it will affect clients or other broker nodes in the future. We are not sure about this yet. Below is the set of warnings:
[2018-04-03 12:00:40,707] WARN Interrupted while waiting for message on queue (org.apache.zookeeper.server.quorum.QuorumCnxManager)
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
[2018-04-03 12:00:40,707] WARN Connection broken for id 1, my id = 3, error = (org.apache.zookeeper.server.quorum.QuorumCnxManager)
java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
[2018-04-03 12:00:40,708] WARN Interrupting SendWorker (org.apache.zookeeper.server.quorum.QuorumCnxManager)
[2018-04-03 12:00:40,707] WARN Send worker leaving thread (org.apache.zookeeper.server.quorum.QuorumCnxManager)
This set of warnings is repeated and observed on all 3 Kafka nodes.
If anyone has any idea why these warnings are generated, please let me know.
Thanks in advance.
This sounds like a known issue with newer versions of ZooKeeper. Check out this JIRA: https://issues.apache.org/jira/browse/ZOOKEEPER-2938
In my case, I was replacing a ZK node and the old one was still running, which I didn't realize, so I had created two nodes with the same "myid".
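If you suspect the same situation, a quick check (the paths are illustrative; use your actual dataDir and quorum config location) is to compare each node's myid with the server.N entries:

# on every ZooKeeper host: dataDir/myid must be unique across the ensemble
cat /var/lib/zookeeper/myid

# and each id must correspond to exactly one server.N line in the shared config
grep '^server\.' /etc/kafka/zookeeper.properties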

Apache Kafka: lots of WARN logs showing Unexpected error

I was running Kafka with 2 brokers as a cluster, but I keep getting the WARN log below.
I checked all my systems and there was no host using IP 10.8.7.1.
By the way, there were more IPs like this, which look like they come from ZooKeeper or a broker?
If I shut down one of the Kafka brokers, the WARNING logs become less frequent.
I am not familiar with Kafka and ZooKeeper, just getting started and studying.
Any ideas?
Kafka version: 1.0.1
The WARN log is similar to the one below (I get this kind of log about every 10 seconds):
[2018-04-19 09:13:08,342] WARN [SocketServer brokerId=0] Unexpected error from /10.8.7.1; closing connection (org.apache.kafka.common.network.Selector)
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 369295616 larger than 104857600)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:132)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:235)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:196)
at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:545)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:483)
at org.apache.kafka.common.network.Selector.poll(Selector.java:412)
at kafka.network.Processor.poll(SocketServer.scala:551)
at kafka.network.Processor.run(SocketServer.scala:468)
at java.lang.Thread.run(Thread.java:748)
One possible cause is that a Kafka producer on 10.8.7.1 is attempting to send 0.369 GB of data in a single batch instead of streaming it. You may have to track down that Kafka producer and see what is going on.
Hope this helps.
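For context, the 104857600 in the message is the broker's socket.request.max.bytes limit (100 MB by default). If the traffic from 10.8.7.1 turns out to be a legitimate oversized request you could raise it in server.properties, but normally the fix belongs on the client side:

# server.properties - maximum size of a single request the broker will accept
# default is 104857600 (100 MB); only raise it if you really need requests that large
socket.request.max.bytes=104857600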

Zookeeper: Connection request from old client will be dropped if server is in r-o mode

Storm version: 0.8.2
ZooKeeper version: 3.4.5
We have a small Storm cluster (1 nimbus and 3 supervisors), so we are using just 1 ZooKeeper instance that is co-located with the Storm nimbus.
Infrequently we start getting the following errors in the ZooKeeper logs, and our Storm cluster comes to a standstill.
2014-04-05 13:27:32,885 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#197] - Accepted socket connection from /10.0.1.183:56121
2014-04-05 13:27:32,886 [myid:] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#793] - Connection request from old client /10.0.1.183:56121; will be dropped if server is in r-o mode
2014-04-05 13:27:32,886 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#832] - Client attempting to renew session 0x1452dd02834002e at /10.0.1.183:56121
2014-04-05 13:27:32,886 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#595] - Established session 0x1452dd02834002e with negotiated timeout 40000 for client /10.0.1.183:56121
On the storm end we start seeing the following in supervisor and worker logs:
2014-04-05 11:37:29 ConnectionStateManager [WARN] There are no ConnectionStateListeners registered.
2014-04-05 11:37:29 cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
2014-04-05 11:37:31 ClientCnxn [WARN] Session 0x1452dd028340015 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
2014-04-05 11:37:42 CuratorFrameworkImpl [ERROR] Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at com.netflix.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:380)
at com.netflix.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl.java:49)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
Do we need to downgrade zookeeper to 3.3.3 or is there a known issue/config that we're missing?
We also experienced several issues with Storm 0.9 and ZooKeeper 3.4.X, even though not exactly the one you describe.
The Storm mailing lists are also reporting such incompatibility issues:
https://mail.google.com/mail/u/0/#search/label%3Astorm+zookeeper+3.4/144313a45ba069b5
https://mail.google.com/mail/u/0/#search/label%3Astorm+zookeeper+3.4/1447d95d10ce7582
This latter one points to this Storm pull request, which should hopefully let us use ZK 3.4.X with future versions of Storm once it is released:
https://github.com/apache/incubator-storm/pull/29
Until then, I would recommend downgrading ZK to 3.3.6 (you may install a specific separate instance of ZK for Storm if you absolutely need ZK 3.4.X for another system). You could also clone the Storm code and merge that pull request locally, or compile the latest version of the trunk, but that's a bit adventurous and more tiresome than just waiting for those nice folks to deliver a new release for us :)
A workaround for this situation is to clear Storm's data directory (configured in storm.yaml => storm.local.dir), then restart the supervisor. I did that in my test environment by clearing Storm's data directory and restarting the nimbus and supervisor.
I think it is caused by a previous crash of the Storm cluster, from which the supervisor cannot recover.
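A sketch of that workaround, assuming a typical standalone installation (the storm.local.dir value below is only an example; use whatever your storm.yaml actually points to):

# stop nimbus and the supervisor on the affected node first, then:

# wipe the local state the supervisor could not recover from
rm -rf /mnt/storm/*        # storm.local.dir from storm.yaml

# restart the daemons (usually run under supervision such as supervisord)
bin/storm nimbus &
bin/storm supervisor &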