spring-kafka java.lang.OutOfMemoryError: GC overhead limit exceeded - apache-kafka

We haven't upgraded the kafka client library version in a while.
kafka-clients-2.0.1.jar
Stacktrace:
2022-01-20 12:06:23,937 ERROR [kafka-coordinator-heartbeat-thread | prod] internals.AbstractCoordinator$HeartbeatThread (AbstractCoordinator.java:1083) - [Consumer clientId=consumer-2, groupId=prod] Heartbeat thread failed due to unexpected error
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3332) ~[?:1.8.0_271]
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124) ~[?:1.8.0_271]
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448) ~[?:1.8.0_271]
at java.lang.StringBuilder.append(StringBuilder.java:136) ~[?:1.8.0_271]
at org.apache.kafka.common.utils.LogContext$AbstractKafkaLogger.addPrefix(LogContext.java:66) ~[kafka-clients-2.0.1.jar:?]
at org.apache.kafka.common.utils.LogContext$LocationAwareKafkaLogger.writeLog(LogContext.java:434) [kafka-clients-2.0.1.jar:?]
at org.apache.kafka.common.utils.LogContext$LocationAwareKafkaLogger.info(LogContext.java:382) [kafka-clients-2.0.1.jar:?]
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.markCoordinatorUnknown(AbstractCoordinator.java:729) ~[kafka-clients-2.0.1.jar:?]
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.markCoordinatorUnknown(AbstractCoordinator.java:724) ~[kafka-clients-2.0.1.jar:?]
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread.run(AbstractCoordinator.java:1031) [kafka-clients-2.0.1.jar:?]
Could there be a memory leak in this version of kafka-clients? Is an upgrade needed?
On the redundant Tomcat server (which became primary after the primary server crashed due to the OOM), we noticed that the blocked-count on the kafka-consumer threads is huge, on the order of 4,580,000, over a JVM uptime of ~5 days 12 h. We got these numbers from Java Mission Control.
Is it normal to see such numbers for the blocked-count?
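For what it's worth, the blocked count exposed through ThreadInfo.getBlockedCount() (which is likely what JMC is reporting here) is cumulative since thread start, so a large absolute value over several days of uptime is not necessarily abnormal by itself; the rate of increase between samples matters more. A minimal sketch for sampling it directly from the running JVM (the output format is an arbitrary choice):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BlockedCountProbe {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Dump name, state and cumulative blocked/waited counts for every live thread.
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info == null) {
                continue; // thread terminated between the two calls
            }
            // getBlockedCount() counts how often the thread blocked to enter or
            // reenter a monitor since it started, so compare successive samples
            // rather than judging the absolute value.
            System.out.printf("%-60s %-15s blocked=%d waited=%d%n",
                    info.getThreadName(), info.getThreadState(),
                    info.getBlockedCount(), info.getWaitedCount());
        }
    }
}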

Related

What does the message "The Critical Analyzer detected slow paths on the broker" mean in the Artemis broker?

Setup: I have an Artemis broker HA cluster with 3 brokers; the HA policy is replication. Each broker is running in its own VM.
Problem: When I leave my brokers running for a long time, usually after 5-6 hours, I get the error below.
2022-11-21 21:32:37,902 WARN  [org.apache.activemq.artemis.utils.critical.CriticalMeasure] Component org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager is expired on path 0
2022-11-21 21:32:37,902 INFO  [org.apache.activemq.artemis.core.server] AMQ224107: The Critical Analyzer detected slow paths on the broker. It is recommended that you enable trace logs on org.apache.activemq.artemis.utils.critical while you troubleshoot this issue. You should disable the trace logs when you have finished troubleshooting.
2022-11-21 21:32:37,902 ERROR [org.apache.activemq.artemis.core.server] AMQ224079: The process for the virtual machine will be killed, as component org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager#46d59067 is not responsive
2022-11-21 21:32:37,969 WARN  [org.apache.activemq.artemis.core.server] AMQ222199: Thread dump: *******************************************************************************
Complete Thread dump
"Thread-517 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$7#437da279)" Id=602 TIMED_WAITING on java.util.concurrent.SynchronousQueue$TransferStack#75f49105
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.SynchronousQueue$TransferStack#75f49105
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
at java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
What does this really mean? I understand that the critical analyzer sees an error and it halts the broker but what is causing this error?
You may take a look at the documentation. Basically, the broker detects that some internal component is taking too long on a critical path and shuts itself down before it becomes completely unresponsive. Setting the critical-analyzer policy to LOG instead of halting the broker might give you more clues about the underlying issue.
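As an illustration of that suggestion, here is a minimal sketch of switching the critical analyzer to LOG on an embedded broker; it assumes the fluent Configuration setters match your Artemis version, and on a standalone broker you would normally set the equivalent critical-analyzer-policy element in broker.xml instead:

import org.apache.activemq.artemis.core.config.Configuration;
import org.apache.activemq.artemis.core.config.impl.ConfigurationImpl;
import org.apache.activemq.artemis.core.server.embedded.EmbeddedActiveMQ;
import org.apache.activemq.artemis.utils.critical.CriticalAnalyzerPolicy;

public class CriticalAnalyzerLogPolicy {
    public static void main(String[] args) throws Exception {
        Configuration config = new ConfigurationImpl()
                .setSecurityEnabled(false)
                .addAcceptorConfiguration("in-vm", "vm://0")
                // Keep the critical analyzer running, but only log slow paths
                // instead of killing the JVM when a component is flagged.
                .setCriticalAnalyzer(true)
                .setCriticalAnalyzerPolicy(CriticalAnalyzerPolicy.LOG);

        EmbeddedActiveMQ broker = new EmbeddedActiveMQ().setConfiguration(config);
        broker.start();
        // ... reproduce the slow-path warnings, inspect the logs, then:
        broker.stop();
    }
}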

Large volume of scheduled messages seem to get stuck on ActiveMQ Artemis broker

I am using Apache ActiveMQ Artemis 2.17.0 to store a few million scheduled messages. Due to the volume of messages, paging is
triggered and almost half of the messages are stored on a shared filesystem (master-slave shared-filesystem store (NFSv4) HA topology).
These messages are scheduled every X hours and each "batch" is around 500k messages with the size of each individual
message a bit larger than 1KB.
In essence, my use case dictates that at some point near midnight I produce 4-5 million messages which are scheduled to leave the next day in batches at predefined scheduled times (e.g. 11 a.m., 3 p.m., 6 p.m.). The messages are not produced in order of scheduled time: messages for the 6 p.m. timeslot can be written to the queue earlier than other messages, so scheduled messages can be interleaved with respect to their delivery time. Also, since the volume of messages is pretty large, I can see that the address memory used is maxing out and multiple files are created in the paging directory for the queue.
My issue appears when my JMS application starts to consume messages from the specified queue: although it consumes data very fast at first, at some point it blocks and becomes unresponsive. When I check the broker's logs I can see the following:
2021-03-31 15:26:03,520 WARN [org.apache.activemq.artemis.utils.critical.CriticalMeasure] Component org.apache.activemq.artemis.core.server.impl.QueueImpl
is expired on path 3: java.lang.Exception: entered
at org.apache.activemq.artemis.utils.critical.CriticalMeasure.enterCritical(CriticalMeasure.java:56) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.critical.CriticalComponentImpl.enterCritical(CriticalComponentImpl.java:52) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.server.impl.QueueImpl.addConsumer(QueueImpl.java:1403) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.server.impl.ServerConsumerImpl.<init>(ServerConsumerImpl.java:262) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.createConsumer(ServerSessionImpl.java:569) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.slowPacketHandler(ServerSessionPacketHandler.java:328) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.onMessagePacket(ServerSessionPacketHandler.java:292) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.Actor.doTask(Actor.java:33) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65) [artemis-commons-2.17.0.jar:2.17.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_262]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_262]
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.17.0.jar:2.17.0]
2021-03-31 15:26:03,525 ERROR [org.apache.activemq.artemis.core.server] AMQ224079: The process for the virtual machine will be killed, as component QueueImpl[name=my-queue, postOffice=PostOfficeImpl [server=ActiveMQServerImpl::serverUUID=f3fddf74-9212-11eb-9a18-005056b570b4], temp=false]#5a4be15a is not responsive
2021-03-31 15:26:03,980 WARN [org.apache.activemq.artemis.core.server] AMQ222199: Thread dump: **********
The broker halts and the slave broker takes over, but the scheduled messages are still stuck on the queue.
When restarting the master broker I see logs like these:
2021-03-31 15:59:41,810 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session f558ac8f-9220-11eb-98a4-005056b5d5f6
2021-03-31 15:59:41,814 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /ip-app:52922 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-03-31 16:01:14,163 WARN [org.apache.activemq.artemis.core.server] AMQ222172: Queue my-queue was busy for more than 10,000 milliseconds. There are possibly consumers hanging on a network operation
2021-03-31 16:01:14,163 WARN [org.apache.activemq.artemis.core.server] AMQ222144: Queue could not finish waiting executors. Try increasing the thread pool size
Looking at CPU and memory metrics I do not see anything unusual: CPU at the time of consuming is less than 50% of the max load, and memory on the broker host is at similar levels (60% used). I/O is rather insignificant, but what may be helpful is that the number of blocked threads shows a sharp increase just before that error (0 -> 40). Also, heap memory is maxed out, but I do not see anything out of the ordinary in GC as far as I can tell.
These metrics were captured after reproducing the issue for messages scheduled to leave at 2:30 p.m. (figure omitted).
Below is part of a thread dump showing BLOCKED and TIMED_WAITING threads:
"Thread-2 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6#2a54a73f)" Id=44 TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject#10e20f4f
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject#10e20f4f
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
at org.apache.activemq.artemis.utils.ActiveMQThreadPoolExecutor$ThreadPoolQueue.poll(ActiveMQThreadPoolExecutor.java:112)
at org.apache.activemq.artemis.utils.ActiveMQThreadPoolExecutor$ThreadPoolQueue.poll(ActiveMQThreadPoolExecutor.java:45)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
"Thread-1 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6#2a54a73f)" Id=43 BLOCKED on org.apache.activemq.artemis.core.server.impl.QueueImpl#64e9ee3c owned by "Thread-3 (ActiveMQ-scheduled-threads)" Id=24
at org.apache.activemq.artemis.core.server.impl.RefsOperation.afterCommit(RefsOperation.java:182)
- blocked on org.apache.activemq.artemis.core.server.impl.QueueImpl#64e9ee3c
at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl.afterCommit(TransactionImpl.java:579)
- locked org.apache.activemq.artemis.core.transaction.impl.TransactionImpl#26fb9cb9
at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl.access$100(TransactionImpl.java:40)
at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl$2.done(TransactionImpl.java:322)
at org.apache.activemq.artemis.core.persistence.impl.journal.OperationContextImpl$1.run(OperationContextImpl.java:279)
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$30/1259174396.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker#535779e4
"Thread-3 (ActiveMQ-scheduled-threads)" Id=24 RUNNABLE
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at org.apache.activemq.artemis.core.io.nio.NIOSequentialFile.open(NIOSequentialFile.java:143)
at org.apache.activemq.artemis.core.io.nio.NIOSequentialFile.open(NIOSequentialFile.java:98)
- locked org.apache.activemq.artemis.core.io.nio.NIOSequentialFile#520b145f
at org.apache.activemq.artemis.core.paging.cursor.impl.PageReader.openPage(PageReader.java:114)
at org.apache.activemq.artemis.core.paging.cursor.impl.PageReader.getMessage(PageReader.java:83)
at org.apache.activemq.artemis.core.paging.cursor.impl.PageReader.getMessage(PageReader.java:105)
- locked org.apache.activemq.artemis.core.paging.cursor.impl.PageReader#669a8420
at org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.getMessage(PageCursorProviderImpl.java:151)
at org.apache.activemq.artemis.core.paging.cursor.impl.PageSubscriptionImpl.queryMessage(PageSubscriptionImpl.java:634)
at org.apache.activemq.artemis.core.paging.cursor.PagedReferenceImpl.getPagedMessage(PagedReferenceImpl.java:132)
- locked org.apache.activemq.artemis.core.paging.cursor.PagedReferenceImpl#3bfc8d39
at org.apache.activemq.artemis.core.paging.cursor.PagedReferenceImpl.getMessage(PagedReferenceImpl.java:99)
at org.apache.activemq.artemis.core.paging.cursor.PagedReferenceImpl.getMessageMemoryEstimate(PagedReferenceImpl.java:186)
at org.apache.activemq.artemis.core.server.impl.QueueImpl.internalAddHead(QueueImpl.java:2839)
at org.apache.activemq.artemis.core.server.impl.QueueImpl.addHead(QueueImpl.java:1102)
- locked org.apache.activemq.artemis.core.server.impl.QueueImpl#64e9ee3c
at org.apache.activemq.artemis.core.server.impl.QueueImpl.addHead(QueueImpl.java:1138)
- locked org.apache.activemq.artemis.core.server.impl.QueueImpl#64e9ee3c
at org.apache.activemq.artemis.core.server.impl.ScheduledDeliveryHandlerImpl$ScheduledDeliveryRunnable.run(ScheduledDeliveryHandlerImpl.java:264)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker#11f0a5a1
Note also that I did try increasing the memory resources on the broker so as to avoid paging messages to disk, and doing so made the problem disappear, but since my message volume is going to be erratic I do not see that as a long-term solution.
Can you give me any pointers on how to resolve this issue? How can I cope with large volumes of paged data stored in the broker that need to be released in large chunks to consumers?
Edit: after increasing the number of scheduled threads
With an increased number of scheduled threads the critical analyzer no longer terminated the broker, but I got constant warnings like the ones below:
2021-04-14 17:48:26,818 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 4606893a-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,818 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 460eedac-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,818 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 460eedac-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,818 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 46194def-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 46194def-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 4620ef13-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 4620ef13-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 46289036-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 46289036-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 562d6a93-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 562d6a93-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 56324c96-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 56324c96-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,838 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /my-host:47392 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-04-14 17:48:26,840 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /my-host:47392 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-04-14 17:48:26,855 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /my-host:47392 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-04-14 17:48:26,864 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /my-host:47392 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-04-14 17:49:26,804 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 82978142-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:49:26,804 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 82978142-9d30-11eb-9b31-005056b5d5f6
Traffic on my consumer side showed spikes and dips (figure omitted), which essentially crippled throughput. Note that more than 80% of the messages were already in memory and only a small portion was paged to disk.
I think the two most important things for your use-case are going to be:
Avoid paging. Paging is a palliative measure meant to be used as a last resort to keep the broker functioning. If at all possible you should configure your broker to handle your load without paging (e.g. acquire more RAM, allocate more heap). It's worth noting that the broker is not designed like a database; it is designed for messages to flow through it. It can certainly buffer messages (potentially millions, depending on the configuration and hardware), but when it's forced to page, performance will drop substantially simply because disk is orders of magnitude slower than RAM.
Increase scheduled-thread-pool-max-size. Dumping this many scheduled messages on the broker is going to put tremendous pressure on the scheduled thread pool. The default size is only 5. I suggest you increase that until you stop seeing performance benefits.
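To make both points concrete, here is a hedged sketch using the embedded-broker Configuration API; the queue name, sizes and thread count are placeholders, the exact setter names may vary slightly by Artemis version, and a standalone broker would normally express the same thing via max-size-bytes / address-full-policy and scheduled-thread-pool-max-size in broker.xml:

import org.apache.activemq.artemis.core.config.Configuration;
import org.apache.activemq.artemis.core.config.impl.ConfigurationImpl;
import org.apache.activemq.artemis.core.settings.impl.AddressFullMessagePolicy;
import org.apache.activemq.artemis.core.settings.impl.AddressSettings;

public class SchedulerAndPagingTuning {
    public static Configuration tune() {
        // max-size-bytes is the per-address memory limit after which paging starts,
        // so size it (and the JVM heap) so the whole scheduled backlog fits in memory.
        AddressSettings settings = new AddressSettings()
                .setMaxSizeBytes(8L * 1024 * 1024 * 1024)   // placeholder: 8 GiB
                .setAddressFullMessagePolicy(AddressFullMessagePolicy.PAGE);

        return new ConfigurationImpl()
                // The scheduled thread pool defaults to 5 threads; raise it so millions
                // of scheduled deliveries are not funneled through a handful of threads.
                .setScheduledThreadPoolMaxSize(64)
                .addAddressesSetting("my-queue", settings);
    }
}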

Kafka cluster streams timeouts at high input

I'm running a Kafka cluster with 7 nodes and a lot of stream processing. Now I see infrequent errors in my Kafka Streams applications at high input rates, like:
[2018-07-23 14:44:24,351] ERROR task [0_5] Error sending record to topic topic-name. No more offsets will be recorded for this task and the exception will eventually be thrown (org.apache.kafka.streams.processor.internals.RecordCollectorImpl)
org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append
[2018-07-23 14:44:31,021] ERROR stream-thread [StreamThread-2] Failed to commit StreamTask 0_5 state: (org.apache.kafka.streams.processor.internals.StreamThread)
org.apache.kafka.streams.errors.StreamsException: task [0_5] exception caught when producing
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129)
at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:76)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:281)
at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:807)
at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:794)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:769)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:647)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append
[2018-07-23 14:44:31,033] ERROR stream-thread [StreamThread-2] Failed while executing StreamTask 0_5 due to flush state: (org.apache.kafka.streams.processor.internals.StreamThread)
org.apache.kafka.streams.errors.StreamsException: task [0_5] exception caught when producing
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129)
at org.apache.kafka.streams.processor.internals.StreamTask.flushState(StreamTask.java:423)
at org.apache.kafka.streams.processor.internals.StreamThread$4.apply(StreamThread.java:555)
at org.apache.kafka.streams.processor.internals.StreamThread.performOnTasks(StreamThread.java:501)
at org.apache.kafka.streams.processor.internals.StreamThread.flushAllState(StreamThread.java:551)
at org.apache.kafka.streams.processor.internals.StreamThread.shutdownTasksAndState(StreamThread.java:449)
at org.apache.kafka.streams.processor.internals.StreamThread.shutdown(StreamThread.java:391)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:372)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append
[2018-07-23 14:44:31,039] WARN stream-thread [StreamThread-2] Unexpected state transition from RUNNING to NOT_RUNNING. (org.apache.kafka.streams.processor.internals.StreamThread)
Exception in thread "StreamThread-2" org.apache.kafka.streams.errors.StreamsException: task [0_5] exception caught when producing
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129)
at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:76)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188)
at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:281)
at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:807)
at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:794)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:769)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:647)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append
If I reduce the input rate (from 20k to 10k events/s) the errors go away, so obviously I'm hitting some kind of limit. I have played around with different options (request.timeout.ms, linger.ms and batch.size), but got the same result every time.
You seem to have reached some kind of limit. Based on the message 60060 ms has passed since last append I'd assume it's writer-thread starvation due to high load, so disk would be the first thing to check:
disk usage - if you're reaching write speed limit, switching from hdd to ssd might help
load distribution - is your traffic split +- equally to all nodes?
CPU load - lots of processing on the same hosts can also leave too little CPU for the brokers and the producers' sender threads
We had a similar issue.
In our case we had the following configuration for replication and acknowledgement:
replication.factor: 3
producer.acks: all
Under high load the same error occurred multiple times: TimeoutException: Expiring N record(s) for topic: N ms has passed since last append.
After removing our custom replication.factor and producer.acks configs (so we are now using the default values), this error disappeared.
With those settings it definitely takes much more time on the producer side, because the leader has to wait for the full set of in-sync replicas to acknowledge the record and for records to be replicated according to the specified replication.factor.
You will be slightly less protected in terms of fault tolerance with the default values.
Also consider increasing the number of partitions per topic and the number of application nodes (on which your Kafka Streams logic is processed).
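For reference, both the replication factor of the Streams-internal topics and the embedded producer's settings can be overridden from the Streams configuration itself; a minimal sketch where the application id, bootstrap server and the concrete values are only examples, not recommendations:

import java.util.Properties;

import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsProducerTuning {
    public static Properties buildConfig() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        // Replication factor for the internal changelog/repartition topics.
        props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);

        // Producer-level overrides are namespaced with producerPrefix(...).
        // acks=all waits for all in-sync replicas; "1" trades durability for latency.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.ACKS_CONFIG), "all");
        // Give the producer more time before expiring batches under load.
        props.put(StreamsConfig.producerPrefix(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG), 120_000);
        props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRIES_CONFIG), Integer.MAX_VALUE);
        return props;
    }
}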

Hazelcast Multicasting error after stopping the nodes of the cluster

I have a cluster of two nodes, i.e. two OrientDB servers (Enterprise Edition 2.2.3) running on two separate machines. Both machines are VMs running Fedora 18. The OrientDB database consists of approximately 75,000 edges and 5,000 nodes.
When I try to stop either node, or both nodes one after the other, I get the following error:
Node1
2017-05-02 17:32:44:811 WARNI Received signal: SIGINT [OSignalHandler]
Exception in thread "Timer-1" com.hazelcast.core.HazelcastInstanceNotActiveException: Hazelcast instance is not active!
at com.hazelcast.spi.AbstractDistributedObject.throwNotActiveException(AbstractDistributedObject.java:85)
at com.hazelcast.spi.AbstractDistributedObject.lifecycleCheck(AbstractDistributedObject.java:80)
at com.hazelcast.spi.AbstractDistributedObject.getNodeEngine(AbstractDistributedObject.java:74)
at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:309)
at com.hazelcast.map.impl.proxy.MapProxySupport.getInternal(MapProxySupport.java:250)
at com.hazelcast.map.impl.proxy.MapProxyImpl.get(MapProxyImpl.java:94)
at com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedMap.get(OHazelcastDistributedMap.java:53)
at com.orientechnologies.agent.profiler.OEnterpriseProfiler$14.run(OEnterpriseProfiler.java:772)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid11478.hprof ...
Heap dump file created [744789648 bytes in 21.248 secs]
Node2
2017-05-02 17:32:41:108 INFO  [192.168.6.153]:2434 [orientdb] [3.6.3] Running shutdown hook... Current state: ACTIVE [Node]
Exception in thread "Timer-1" com.hazelcast.core.HazelcastInstanceNotActiveException: Hazelcast instance is not active!
at com.hazelcast.spi.AbstractDistributedObject.throwNotActiveException(AbstractDistributedObject.java:85)
at com.hazelcast.spi.AbstractDistributedObject.lifecycleCheck(AbstractDistributedObject.java:80)
at com.hazelcast.spi.AbstractDistributedObject.getNodeEngine(AbstractDistributedObject.java:74)
at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:309)
at com.hazelcast.map.impl.proxy.MapProxySupport.getInternal(MapProxySupport.java:250)
at com.hazelcast.map.impl.proxy.MapProxyImpl.get(MapProxyImpl.java:94)
at com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedMap.get(OHazelcastDistributedMap.java:53)
at com.orientechnologies.agent.profiler.OEnterpriseProfiler$14.run(OEnterpriseProfiler.java:772)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
How can I solve the heap memory issue?
It seems your real problem is the OutOfMemoryError; the exception from Hazelcast just means that the HazelcastInstance was stopped, most probably as a consequence of the OOME.

Memory error in standalone Spark cluster: "shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[Remote]"

I got the following memory error in my standalone Spark cluster after 140 iterations of my code. How can I run my code without the memory fault?
I have 7 nodes with 8 GB RAM, of which 6 GB is allocated to the workers. The master also has 8 GB RAM.
[error] application - Remote calculator (Actor[akka.tcp://Remote#127.0.0.1:44545/remote/akka.tcp/NotebookServer#127.0.0.1:50778/user/$c/$a#872469007]) has been terminated !!!!!
[info] application - View notebook 'kamaruddin/PSOAANN_BreastCancer_optimized.snb', presentation: 'None'
[info] application - Closing websockets for kernel 6c8e8090-cbeb-430e-9d45-5710ce60b984
Uncaught error from thread [Remote-akka.actor.default-dispatcher-6] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[Remote]
Exception in thread "Thread-36" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.jar.Attributes.read(Attributes.java:394)
at java.util.jar.Manifest.read(Manifest.java:199)
at java.util.jar.Manifest.<init>(Manifest.java:69)
at java.util.jar.JarFile.getManifestFromReference(JarFile.java:186)
at java.util.jar.JarFile.getManifest(JarFile.java:167)
at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:779)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:416)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.bindError(SparkIMain.scala:1041)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1347)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at notebook.kernel.Repl$$anonfun$3.apply(Repl.scala:173)
at notebook.kernel.Repl$$anonfun$3.apply(Repl.scala:173)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at scala.Console$.withOut(Console.scala:126)
at notebook.kernel.Repl.evaluate(Repl.scala:172)
at notebook.client.ReplCalculator$$anonfun$10$$anon$1$$anonfun$24.apply(ReplCalculator.scala:364)
at notebook.client.ReplCalculator$$anonfun$10$$anon$1$$anonfun$24.apply(ReplCalculator.scala:361)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
Uncaught error from thread [Remote-akka.remote.default-remote-dispatcher-445] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[Remote]
java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
at java.lang.StringBuffer.append(StringBuffer.java:322)
at java.io.StringWriter.write(StringWriter.java:94)
at com.fasterxml.jackson.core.json.WriterBasedJsonGenerator._flushBuffer(WriterBasedJsonGenerator.java:1879)
at com.fasterxml.jackson.core.json.WriterBasedJsonGenerator._writeString(WriterBasedJsonGenerator.java:916)
at com.fasterxml.jackson.core.json.WriterBasedJsonGenerator._writeFieldName(WriterBasedJsonGenerator.java:213)
at com.fasterxml.jackson.core.json.WriterBasedJsonGenerator.writeFieldName(WriterBasedJsonGenerator.java:104)
at play.api.libs.json.JsValueSerializer$$anonfun$serialize$2.apply(JsValue.scala:319)
at play.api.libs.json.JsValueSerializer$$anonfun$serialize$2.apply(JsValue.scala:318)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at play.api.libs.json.JsValueSerializer.serialize(JsValue.scala:318)
at play.api.libs.json.JsValueSerializer$$anonfun$serialize$1.apply(JsValue.scala:312)
at play.api.libs.json.JsValueSerializer$$anonfun$serialize$1.apply(JsValue.scala:311)
at scala.collection.immutable.List.foreach(List.scala:318)
at play.api.libs.json.JsValueSerializer.serialize(JsValue.scala:311)
at play.api.libs.json.JsValueSerializer$$anonfun$serialize$2.apply(JsValue.scala:320)
at play.api.libs.json.JsValueSerializer$$anonfun$serialize$2.apply(JsValue.scala:318)
at scala.collection.immutable.List.foreach(List.scala:318)
at play.api.libs.json.JsValueSerializer.serialize(JsValue.scala:318)
at play.api.libs.json.JsValueSerializer.serialize(JsValue.scala:302)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:128)
at com.fasterxml.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:1902)
at play.api.libs.json.JacksonJson$.generateFromJsValue(JsValue.scala:494)
at play.api.libs.json.Json$.stringify(Json.scala:51)
at play.api.libs.json.JsValue$class.toString(JsValue.scala:80)
at play.api.libs.json.JsObject.toString(JsValue.scala:166)
at java.util.Formatter$FormatSpecifier.printString(Formatter.java:2838)
at java.util.Formatter$FormatSpecifier.print(Formatter.java:2718)
Uncaught error from thread [Remote-akka.remote.default-remote-dispatcher-446] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[Remote]
java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "appclient-receive-and-reply-threadpool-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "appclient-receive-and-reply-threadpool-2" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "appclient-receive-and-reply-threadpool-4" java.lang.OutOfMemoryError: GC overhead limit exceeded
Exception in thread "appclient-receive-and-reply-threadpool-6" java.lang.OutOfMemoryError: GC overhead limit exceeded
[error] application - Process exited with an error: 255 (Exit value: 255)
org.apache.commons.exec.ExecuteException: Process exited with an error: 255 (Exit value: 255)
at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
at org.apache.commons.exec.DefaultExecutor.access$200(DefaultExecutor.java:48)
at org.apache.commons.exec.DefaultExecutor$1.run(DefaultExecutor.java:200)
at java.lang.Thread.run(Thread.java:745)
Maybe you can try using checkpointing.
Data checkpointing - Saving of the generated RDDs to reliable storage. This is necessary in some stateful transformations that combine data across multiple batches. In such transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time (proportional to the dependency chain), intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chain.
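Since this is an iterative job rather than Spark Streaming, the same idea applies at the RDD level: checkpoint every few iterations so the lineage (and the driver-side metadata that grows with it) gets truncated. A rough sketch where the paths, the every-10-iterations cadence and the parse/step helpers are all placeholders:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class IterativeCheckpointing {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("pso-aann-checkpointing");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Checkpoint files must go to reliable storage reachable by all workers.
        sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints");

        JavaRDD<double[]> population = sc.textFile("hdfs:///data/input.csv")
                .map(line -> parse(line));

        for (int i = 1; i <= 140; i++) {
            population = population.map(candidate -> step(candidate));
            // Keep the current iteration cached so checkpointing does not recompute it
            // (unpersisting earlier iterations is omitted for brevity).
            population.persist(StorageLevel.MEMORY_AND_DISK());
            if (i % 10 == 0) {
                // Mark for checkpointing, then force materialization with an action;
                // this cuts the dependency chain back to the checkpointed data.
                population.checkpoint();
                population.count();
            }
        }
        sc.stop();
    }

    // Hypothetical helpers standing in for the real parsing and iteration logic.
    private static double[] parse(String line) { return new double[0]; }
    private static double[] step(double[] candidate) { return candidate; }
}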