Apache Geode SerialGatewaySenderQueue blocked: There are 86 stuck threads in this node

2022-12-20 02:09:13.085 [Gateway Sender Primary Lock Acquisition Thread Volunteer] INFO org.apache.geode.internal.cache.wan.GatewaySenderAdvisor - org.apache.geode.internal.cache.wan.GatewaySenderAdvisor#31e1b6b8 is becoming primary gateway Sender.
2022-12-20 02:09:13.085 [Gateway Sender Primary Lock Acquisition Thread Volunteer] INFO org.apache.geode.internal.cache.wan.GatewaySenderAdvisor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =false} : Starting as primary
2022-12-20 02:09:13.128 [Gateway Sender Primary Lock Acquisition Thread Volunteer] INFO org.apache.geode.internal.cache.wan.GatewaySenderAdvisor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =false} : Becoming primary gateway sender
2022-12-20 02:09:13.129 [Event Processor for GatewaySender_sender1.3] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - Gateway Failover Initiated: Processing 5 unprocessed events.
2022-12-20 02:09:13.129 [Event Processor for GatewaySender_sender1.2] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - Gateway Failover Initiated: Processing 6 unprocessed events.
2022-12-20 02:09:13.129 [Event Processor for GatewaySender_sender1.1] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - Gateway Failover Initiated: Processing 2 unprocessed events.
2022-12-20 02:09:13.129 [Event Processor for GatewaySender_sender1.4] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - Gateway Failover Initiated: Processing 6 unprocessed events.
2022-12-20 02:09:13.129 [Event Processor for GatewaySender_sender1.0] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - Gateway Failover Initiated: Processing 6 unprocessed events.
2022-12-20 02:09:13.143 [Pooled Serial Message Processor2-1] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - GatewayEventProcessor[gatewaySenderId=sender1;remoteDSId=2;batchSize=100] : Waiting for failover completion
2022-12-20 02:09:13.146 [Event Processor for GatewaySender_sender1.1] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =true} : Marking 5 events as possible duplicates
2022-12-20 02:09:13.148 [Event Processor for GatewaySender_sender1.3] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =true} : Marking 6 events as possible duplicates
2022-12-20 02:09:13.149 [Event Processor for GatewaySender_sender1.4] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =true} : Marking 7 events as possible duplicates
2022-12-20 02:09:13.150 [Event Processor for GatewaySender_sender1.2] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =true} : Marking 9 events as possible duplicates
2022-12-20 02:09:13.150 [Event Processor for GatewaySender_sender1.0] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =true} : Marking 9 events as possible duplicates
2022-12-20 02:09:17.146 [Pooled High Priority Message Processor 15] INFO org.apache.geode.distributed.internal.membership.gms.Services - received leave request from 10.4.20.34(20.34-S1:121257)<v203>:1025 for 10.4.20.34(20.34-S1:121257)<v203>:1025
2022-12-20 02:09:17.147 [Pooled High Priority Message Processor 15] INFO org.apache.geode.distributed.internal.membership.gms.Services - JoinLeave.processMessage(LeaveRequestMessage) invoked. isCoordinator=false; isStopping=false; cancelInProgress=false
2022-12-20 02:09:17.147 [Pooled High Priority Message Processor 15] INFO org.apache.geode.distributed.internal.membership.gms.Services - Checking to see if I should become coordinator. My address is 10.4.20.148(20.148-S1:54029)<v229>:1024
2022-12-20 02:09:17.148 [Pooled High Priority Message Processor 15] INFO org.apache.geode.distributed.internal.membership.gms.Services - View with removed and left members removed is View[10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024|235] members: [10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.148(20.148-S1:54029)<v229>:1024{lead}, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024] and coordinator would be 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024
2022-12-20 02:09:17.251 [unicast receiver,95306-YY-V020148-64048] INFO org.apache.geode.distributed.internal.membership.gms.Services - received leave request from 10.4.20.34(20.34-S1:121257)<v203>:1025 for 10.4.20.34(20.34-S1:121257)<v203>:1025
2022-12-20 02:09:17.252 [unicast receiver,95306-YY-V020148-64048] INFO org.apache.geode.distributed.internal.membership.gms.Services - JoinLeave.processMessage(LeaveRequestMessage) invoked. isCoordinator=false; isStopping=false; cancelInProgress=false
2022-12-20 02:09:17.252 [unicast receiver,95306-YY-V020148-64048] INFO org.apache.geode.distributed.internal.membership.gms.Services - Checking to see if I should become coordinator. My address is 10.4.20.148(20.148-S1:54029)<v229>:1024
2022-12-20 02:09:17.252 [unicast receiver,95306-YY-V020148-64048] INFO org.apache.geode.distributed.internal.membership.gms.Services - View with removed and left members removed is View[10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024|235] members: [10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.148(20.148-S1:54029)<v229>:1024{lead}, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024] and coordinator would be 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024
2022-12-20 02:09:17.451 [unicast receiver,95306-YY-V020148-64048] INFO org.apache.geode.distributed.internal.membership.gms.Services - received new view: View[10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024|235] members: [10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.148(20.148-S1:54029)<v229>:1024{lead}, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024] shutdown: [10.4.20.34(20.34-S1:121257)<v203>:1025]
old view is: View[10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024|234] members: [10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.34(20.34-S1:121257)<v203>:1025{lead}, 10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]
2022-12-20 02:09:17.453 [Geode View Processor1] INFO org.apache.geode.distributed.internal.ClusterDistributionManager - Member at 10.4.20.34(20.34-S1:121257)<v203>:1025 gracefully left the distributed cache: departed membership view
2022-12-20 02:09:17.453 [Geode View Processor1] INFO org.apache.geode.distributed.internal.ClusterOperationExecutors - Marking the SerialQueuedExecutor with id : 3 used by the member 10.4.20.34(20.34-S1:121257)<v203>:1025 to be unused.
2022-12-20 02:09:23.540 [Function Execution Processor12] DEBUG com.wntime.geode.function.OperateDataFunction - start exec logic : OperateDataFunction
2022-12-20 02:09:27.880 [Function Execution Processor11] DEBUG com.wntime.geode.function.OperateDataFunction - start exec logic : OperateDataFunction
2022-12-20 02:09:31.419 [Thread-7] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11238 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:31.419 [Thread-8] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11239 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:31.421 [Thread-10] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11240 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:31.424 [Thread-9] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11241 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:31.427 [Thread-6] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11242 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:42.518 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.170.107(15:loner):49844:f703d5d6,connection=2; port=45420] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:09:42.518 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.28.152(15:loner):45140:3406d5d6,connection=2; port=41590] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:09:42.519 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - ClientHealthMonitor: Unregistering client with member id identity(10.255.28.152(15:loner):45140:3406d5d6,connection=2 due to: Unknown reason
2022-12-20 02:09:42.519 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.242.77(13:loner):48228:1706d5d6,connection=2; port=57978] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:09:42.519 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - ClientHealthMonitor: Unregistering client with member id identity(10.255.242.77(13:loner):48228:1706d5d6,connection=2 due to: Unknown reason
2022-12-20 02:09:42.703 [ServerConnection on port 42973 Thread 61] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11246 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:47.522 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.16.178(20:loner):35578:8902d5d6,connection=2; port=60302] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:09:48.527 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.150.234(39:loner):50436:3406d5d6,connection=2; port=53498] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:09:51.529 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.95.181(14:loner):42602:2f02d5d6,connection=2; port=53556] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:10:25.576 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.150.234(39:loner):50436:3406d5d6,connection=2; port=56390] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:10:26.577 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.16.178(20:loner):35578:8902d5d6,connection=2; port=60832] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:10:27.392 [Function Execution Processor7] DEBUG com.wntime.geode.function.OperateDataFunction - start exec logic : OperateDataFunction
2022-12-20 02:10:29.380 [ThreadsMonitor] WARN org.apache.geode.internal.monitoring.ThreadsMonitoringProcess - Thread 641 (0x281) is stuck
2022-12-20 02:10:29.396 [ThreadsMonitor] WARN org.apache.geode.internal.monitoring.executor.AbstractExecutor - Thread <641> (0x281) that was executed at <20 Dec 2022 02:09:19 CST> has been stuck for <70.145 seconds> and number of thread monitor iteration <1>
Thread Name <Pooled Serial Message Processor1-1> state <WAITING>
Waiting on <java.util.concurrent.locks.ReentrantReadWriteLock$FairSync#1ce290a7>
Owned By <ServerConnection on port 42973 Thread 63> with ID <930>
Executor Group <SerialQueuedExecutorWithDMStats>
Monitored metric <ResourceManagerStats.numThreadsStuck>
Thread stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:223)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:477)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:445)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:162)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:116)
org.apache.geode.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:1082)
org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6141)
org.apache.geode.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5777)
org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)
org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$258/445977285.run(Unknown Source)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)
org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$256/2071190751.run(Unknown Source)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)
org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$255/113991150.run(Unknown Source)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)
org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)
org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2044)
org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5602)
org.apache.geode.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:387)
org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:170)
org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5573)
org.apache.geode.internal.cache.AbstractUpdateOperation.doPutOrCreate(AbstractUpdateOperation.java:156)
org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.basicOperateOnRegion(AbstractUpdateOperation.java:307)
org.apache.geode.internal.cache.DistributedPutAllOperation$PutAllMessage.doEntryPut(DistributedPutAllOperation.java:1114)
org.apache.geode.internal.cache.DistributedPutAllOperation$PutAllMessage$1.run(DistributedPutAllOperation.java:1194)
org.apache.geode.internal.cache.event.DistributedEventTracker.syncBulkOp(DistributedEventTracker.java:481)
Lock owner thread stack
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)
org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)
org.apache.geode.internal.cache.DistributedCacheOperation.waitForAckIfNeeded(DistributedCacheOperation.java:779)
org.apache.geode.internal.cache.DistributedCacheOperation._distribute(DistributedCacheOperation.java:676)
org.apache.geode.internal.cache.DistributedCacheOperation.startOperation(DistributedCacheOperation.java:277)
org.apache.geode.internal.cache.DistributedCacheOperation.distribute(DistributedCacheOperation.java:318)
org.apache.geode.internal.cache.DistributedRegion.distributeUpdate(DistributedRegion.java:514)
org.apache.geode.internal.cache.DistributedRegion.basicPutPart3(DistributedRegion.java:492)
org.apache.geode.internal.cache.map.RegionMapPut.doAfterCompletionActions(RegionMapPut.java:307)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:185)
org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$255/113991150.run(Unknown Source)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)
org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)
org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2044)
org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5602)
org.apache.geode.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:387)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue$SerialGatewaySenderQueueMetaRegion.virtualPut(SerialGatewaySenderQueue.java:1215)
org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5580)
org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:156)
org.apache.geode.internal.cache.LocalRegion.basicPut(LocalRegion.java:5038)
org.apache.geode.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1637)
org.apache.geode.internal.cache.LocalRegion.put(LocalRegion.java:1624)
org.apache.geode.internal.cache.AbstractRegion.put(AbstractRegion.java:442)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:245)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:232)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:477)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:445)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:162)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:116)
org.apache.geode.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:1082)
org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6141)
2022-12-20 02:10:29.397 [ThreadsMonitor] WARN org.apache.geode.internal.monitoring.ThreadsMonitoringProcess - Thread 646 (0x286) is stuck
2022-12-20 02:10:29.413 [ThreadsMonitor] WARN org.apache.geode.internal.monitoring.executor.AbstractExecutor - Thread <646> (0x286) that was executed at <20 Dec 2022 02:09:16 CST> has been stuck for <72.891 seconds> and number of thread monitor iteration <1>
Thread Name <Pooled Serial Message Processor2-1> state <WAITING>
Waiting on <java.util.concurrent.locks.ReentrantReadWriteLock$FairSync#1ce290a7>
Owned By <ServerConnection on port 42973 Thread 63> with ID <930>
Executor Group <SerialQueuedExecutorWithDMStats>
Monitored metric <ResourceManagerStats.numThreadsStuck>
Thread stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:223)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:477)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:445)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:162)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:116)
org.apache.geode.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:1082)
org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6141)
org.apache.geode.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5777)
org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)
There are 86 stuck threads in this node
The cluster uses a WAN gateway in a bidirectional (two-way) data-transmission architecture.
When I restart the primary sending gateway, the cluster fails: it cannot write data and threads get stuck. Over time more and more threads block, until finally 86 threads are stuck. After 57 minutes all of the stuck threads recovered, but during that whole period the log kept reporting exceptions and the cluster could not recover on its own.
I would like to know what causes this and whether there is a quick way to recover.
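For reference, the 15-second ReplyProcessor21 warnings above fire once the distributed system's ack-wait-threshold (default 15 seconds) elapses without replies. As a hedged sketch, these are gfsh commands one might use to inspect, and as a last resort bounce, the sender during such an incident (the sender id sender1 and the locator host come from the logs above; 10334 is assumed to be the default locator port, and exact options vary by Geode version):
gfsh> connect --locator=10.4.20.34[10334]
gfsh> list gateways                        # shows which member is the primary sender
gfsh> status gateway-sender --id=sender1   # per-member sender state and queue size
gfsh> pause gateway-sender --id=sender1    # pauses dispatch; events continue to queue
gfsh> resume gateway-sender --id=sender1
gfsh> stop gateway-sender --id=sender1     # last resort; a secondary can then volunteer as primary
gfsh> start gateway-sender --id=sender1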

Related

Losing connection to Kafka. What happens?

A JobManager and a TaskManager are running on a single VM. Kafka also runs on the same server.
I have 10 tasks, all reading from different Kafka topics, processing messages, and writing back to Kafka.
Sometimes I find that my task manager is down and nothing is working. I tried to figure out the problem by checking the logs, and I believe it is a problem with the Kafka connection. (Or maybe a network problem? But everything is on a single server.)
What I want to ask is: if I lose the connection to Kafka for a short period, what happens? Why do the tasks fail, and most importantly, why does the task manager crash?
Some logs:
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-8] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=cpualgosgroup1-1, groupId=cpualgosgroup1] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,692 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=telefilter1-0, groupId=telefilter1] Cancelled in-flight FETCH request with correlation id 3630156 due to node 0 being disconnected (elapsed time since creation: 61648ms, elapsed time since send: 61648ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Cancelled in-flight PRODUCE request with correlation id 2159429 due to node 0 being disconnected (elapsed time since creation: 51069ms, elapsed time since send: 51069ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=cpualgosgroup1-1, groupId=cpualgosgroup1] Cancelled in-flight FETCH request with correlation id 2344708 due to node 0 being disconnected (elapsed time since creation: 51184ms, elapsed time since send: 51184ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Cancelled in-flight PRODUCE request with correlation id 2159430 due to node 0 being disconnected (elapsed time since creation: 51069ms, elapsed time since send: 51069ms, request timeout: 30000ms)
2022-11-26 23:35:15,842 WARN org.apache.kafka.clients.producer.internals.Sender [] - [Producer clientId=producer-15] Received invalid metadata error in produce request on partition tele.alerts.cpu-4 due to org.apache.kafka.common.errors.NetworkException: Disconnected from node 0. Going to request metadata update now
2022-11-26 23:35:15,842 WARN org.apache.kafka.clients.producer.internals.Sender [] - [Producer clientId=producer-8] Received invalid metadata error in produce request on partition tele.alerts.cpu-6 due to org.apache.kafka.common.errors.NetworkException: Disconnected from node 0. Going to request metadata update now
and then
2022-11-26 23:35:56,673 WARN org.apache.flink.runtime.taskmanager.Task [] - CPUTemperatureAnalysisAlgorithm -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_d0ae1ab03e621ff140fb6b0b0a2932f9_0_0) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for 8d57994a59ab86ea9ee48076e80a7c7f.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1702)
...
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 99d52303d7e24496ae661ddea2b6a372 timed out.
2022-11-26 23:35:56,682 INFO org.apache.flink.runtime.taskmanager.Task [] - Triggering cancellation of task code CPUTemperatureAnalysisAlgorithm -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_d0ae1ab03e621ff140fb6b0b0a2932f9_0_0).
2022-11-26 23:35:57,199 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to fail task externally TemperatureAnalysis -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_15071110d0eea9f1c7f3d75503ff58eb_0_0).
2022-11-26 23:35:57,202 WARN org.apache.flink.runtime.taskmanager.Task [] - TemperatureAnalysis -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_15071110d0eea9f1c7f3d75503ff58eb_0_0) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for 8d57994a59ab86ea9ee48076e80a7c7f.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1702)
Why does the TaskExecutor lose its connection to the JobManager?
If I don't care about any data loss, how should I configure the Kafka clients and Flink recovery? I just want the Kafka client not to die. In particular, I don't want my tasks or task managers to crash. If I lose the connection, is it possible to configure Flink to just wait? If we can't read, wait; and if we can't write back to Kafka, just wait?
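For context on that last question, these are roughly the knobs involved: the Kafka producer can be configured to keep retrying instead of failing fast, and Flink's restart strategy controls what happens when a task does fail anyway. A hedged sketch, with illustrative values rather than recommendations:
# Kafka producer properties
retries=2147483647               # keep retrying sends instead of failing the record
delivery.timeout.ms=300000       # total time a send may take, including retries
request.timeout.ms=60000         # per-request timeout to the broker
reconnect.backoff.max.ms=10000   # cap on backoff between reconnect attempts
# flink-conf.yaml
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 2147483647
restart-strategy.fixed-delay.delay: 10 s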
The heartbeat of JobManager with id 99d52303d7e24496ae661ddea2b6a372 timed out.
Sounds like the server is somewhat overloaded. But you could try increasing the heartbeat timeout.
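For example, in flink-conf.yaml (the keys are standard Flink options; the raised timeout is only illustrative):
heartbeat.interval: 10000    # ms between heartbeats (the default)
heartbeat.timeout: 180000    # ms of silence before the peer is declared dead (default 50000)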

Kafka Zookeeper Random Restarts

We are running a Hyperledger Fabric network with Kafka and ZooKeeper in production, using Docker Swarm on Azure VMs (4 Kafka nodes, 3 ZooKeeper nodes). It had been running fine, but two days ago ZooKeeper suddenly restarted, and since then ZooKeeper keeps restarting at intervals of 6-8 hours.
Logs on a Kafka node:
[2020-07-04 07:48:53,492] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
[2020-07-04 07:48:53,492] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Shutdown completed (kafka.server.ReplicaFetcherThread)
[2020-07-04 07:48:53,499] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions xxxx-xxxxx-xxx-xxxxx.
ZooKeeper leader logs:
2020-07-04 07:46:27,070 [myid:3] - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor#653] - Got user-level KeeperException when processing sessionid:0x10101beb22c0000 type:create cxid:0x4 zxid:0x2e00000114 txntype:-1 reqpath:n/a Error Path:/brokers/ids Error:KeeperErrorCode = NodeExists for /brokers/ids
2020-07-04 07:48:43,084 [myid:3] - INFO [SessionTracker:ZooKeeperServer#355] - Expiring session 0x2010551ef290000, timeout of 6000ms exceeded
2020-07-04 07:48:43,085 [myid:3] - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor#487] - Processed session termination for sessionid: 0x2010551ef290000
2020-07-04 07:48:43,091 [myid:3] - INFO [CommitProcessor:3:NIOServerCnxn#1056] - Closed socket connection for client /100.0.20.80:60672 which had sessionid 0x2010551ef290000
2020-07-04 07:48:55,182 [myid:3] - ERROR [LearnerHandler-/100.0.20.80:58940:LearnerHandler#648] - Unexpected exception causing shutdown while sock still open
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
2020-07-04 07:48:55,183 [myid:3] - WARN [LearnerHandler-/100.0.20.80:58940:LearnerHandler#661] - ******* GOODBYE /100.0.20.80:58940 ********
2020-07-04 07:49:57,623 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#222] - Accepted socket connection from /100.0.20.80:37838
2020-07-04 07:49:57,637 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#949] - Client attempting to establish new session at /100.0.20.80:37838
2020-07-04 07:49:57,641 [myid:3] - INFO [CommitProcessor:3:ZooKeeperServer#694] - Established session 0x300ed4720900000 with negotiated timeout 12000 for client /100.0.20.80:37838
2020-07-04 07:49:57,670 [myid:3] - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor#653] - Got user-level KeeperException when processing sessionid:0x300ed4720900000 type:setData cxid:0x1 zxid:0x2e000003b2 txntype:-1 reqpath:n/a Error Path:/brokers/topics/xxxxxxxxxxxx/partitions/0/state Error:KeeperErrorCode = BadVersion for /brokers/topics/xxxxxxxxxxxx/partitions/0/state
My zoo.cfg:
clientPort=2181
dataDir=/data
dataLogDir=/datalog
tickTime=6000
initLimit=10
syncLimit=2
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
server.1=xxx.xxx.com:2888:3888
server.2=xxx.xxx.com:2888:3888
server.3=0.0.0.0:2888:3888
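Worth noting: ZooKeeper negotiates every session timeout into the window [2 * tickTime, 20 * tickTime]. With tickTime=6000 as above (three times the usual 2000), that window is 12000-120000 ms, which matches the "negotiated timeout 12000" line in the leader log. A hedged sketch of the related settings (the property names are real; the values are illustrative):
# zoo.cfg: derived session-timeout window, given tickTime=6000
#   minSessionTimeout defaults to 2 * tickTime  = 12000 ms
#   maxSessionTimeout defaults to 20 * tickTime = 120000 ms
# Kafka server.properties: request a session timeout inside that window
zookeeper.session.timeout.ms=12000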

Zookeeper unable to talk to new Kafka broker

In an attempt to reduce the storage on my AWS instance, I decided to launch a new, smaller instance and set up Kafka again from scratch using the Ansible playbook we had from before. I then terminated the old, larger instance and assigned the IP address it had been using (and that the other brokers were configured with) to my new instance.
When tailing my ZooKeeper logs, however, I'm seeing this error:
2018-04-13 14:17:34,884 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker#810] - Connection broken for id 1, my id = 2, error =
java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:153)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.net.SocketInputStream.read(SocketInputStream.java:211)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:795)
2018-04-13 14:17:34,885 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker#813] - Interrupting SendWorker
2018-04-13 14:17:34,884 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker#727] - Interrupted while waiting for message on queue
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:879)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:715)
I double-checked that all 3 Kafka broker IP addresses are correctly listed in these locations, and I restarted all of the services to be safe:
/etc/hosts
/etc/kafka/config/server.properties
/etc/zookeeper/conf/zoo.cfg
/etc/filebeat/filebeat.yml
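The QuorumCnxManager lines above are ZooKeeper server-to-server (quorum) traffic rather than broker-to-ZooKeeper traffic, so it may also be worth confirming that each ensemble member's myid file still matches its server.N line in zoo.cfg after the instance swap. A hedged sketch (the zoo.cfg path is from the list above; the myid path is a common default and may differ on your hosts):
grep '^server\.' /etc/zookeeper/conf/zoo.cfg   # server.1=..., server.2=..., server.3=...
cat /var/lib/zookeeper/myid                    # must print this node's own N from its server.N line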

Kafka Zookeeper Connection drop continuously

I have set up a 3-node Kafka cluster and a 3-node ZooKeeper cluster on separate nodes. Using Kafka I can produce and consume messages successfully and run commands like kafka-topics.sh to get the topic list and topic information from ZooKeeper, but there are some errors in the Kafka server.log file. The following warning appears continuously:
[2018-02-18 21:50:01,241] WARN Client session timed out, have not heard from server in 320190154ms for sessionid 0x161a94b101f0001 (org.apache.zookeeper.ClientCnxn)
[2018-02-18 21:50:01,242] INFO Client session timed out, have not heard from server in 320190154ms for sessionid 0x161a94b101f0001, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2018-02-18 21:50:01,343] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2018-02-18 21:50:01,989] INFO Opening socket connection to server zookeeper3/192.168.1.206:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2018-02-18 21:50:02,008] INFO Socket connection established to zookeeper3/192.168.1.206:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2018-02-18 21:50:02,042] INFO Session establishment complete on server zookeeper3/192.168.1.206:2181, sessionid = 0x161a94b101f0001, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2018-02-18 21:50:02,042] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
[2018-02-18 21:59:31,570] INFO [Group Metadata Manager on Broker 102]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
It seems the Kafka sessions in ZooKeeper expire periodically!
The ZooKeeper logs contain the following warnings, too:
2018-02-18 18:20:06,149 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#368] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x161a94b101f0001, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.lang.Thread.run(Thread.java:748)
2018-02-18 18:20:06,151 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#1044] - Closed socket connection for client /192.168.1.203:43162 which had sessionid 0x161a94b101f0001
2018-02-18 18:20:06,781 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#368] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x161a94b101f0002, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.lang.Thread.run(Thread.java:748)
2018-02-18 18:20:06,782 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#1044] - Closed socket connection for client /192.168.1.201:45330 which had sessionid 0x161a94b101f0002
2018-02-18 18:37:29,127 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#192] - Accepted socket connection from /192.168.1.202:52480
2018-02-18 18:37:29,139 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer#942] - Client attempting to establish new session at /192.168.1.202:52480
2018-02-18 18:37:29,143 [myid:1] - INFO [CommitProcessor:1:ZooKeeperServer#687] - Established session 0x161a94b101f0003 with negotiated timeout 30000 for client /192.168.1.202:52480
2018-02-18 18:37:29,432 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#1044] - Closed socket connection for client /192.168.1.202:52480 which had sessionid 0x161a94b101f0003
I think it's because ZooKeeper can't get heartbeats from the Kafka nodes. The following is the ZooKeeper zoo.cfg:
tickTime=2000
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zookeeper1:2888:3888
server.2=zookeeper2:2888:3888
server.3=zookeeper3:2888:3888
and the customized Kafka server.properties settings:
broker.id=1
listeners = PLAINTEXT://kafka1:9092
num.partitions=24
delete.topic.enable=true
default.replication.factor=2
log.dirs=/data/kafka/data
zookeeper.connect=zookeeper1:2181,zookeeper2:2181,zookeeper3:2181
log.retention.hours=168
I use the same ZooKeeper cluster for Hadoop HA without any problem. I think there is something wrong with the Kafka properties listeners and advertised.listeners; I read the Kafka documentation but couldn't understand their meaning.
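For what it's worth, the usual distinction is: listeners is the address/interface the broker binds to, while advertised.listeners is the address the broker publishes (via ZooKeeper) for clients and other brokers to connect back to. A hedged sketch for broker 1 (the hostname kafka1 comes from the question; the values are illustrative):
listeners=PLAINTEXT://0.0.0.0:9092             # bind on all interfaces
advertised.listeners=PLAINTEXT://kafka1:9092   # address published to clients and other brokers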
In the hosts file of every OS, the hostnames zookeeper1 through zookeeper3 and kafka1 through kafka3 are defined and reachable via the ping command. I removed the following lines from the hosts files:
127.0.0.1 localhost
127.0.1.1 hostname
I don't think that could have caused the problem.
Kafka version: 0.11
Zookeeper version: 3.4.10
Can anyone help?
We were facing a similar issue with Kafka. As @Soheil pointed out, it was due to a major GC running.
When a major GC runs, Kafka is sometimes unable to send its heartbeat to ZooKeeper. For us, the major GC was running almost once every 15 seconds. On taking a heap dump, we realized it was due to a metrics memory leak in Kafka.

Zookeeper sessions keep expiring...no heartbeats?

We are using the Kafka high-level consumer, and we are able to consume messages successfully, but the ZooKeeper connections keep expiring and being re-established.
I am wondering why there are no heartbeats to keep the connections alive:
Kafka Consumer Logs
====================
[localhost-startStop-1-SendThread(10.41.105.23:2181)] [ClientCnxn$SendThread] [line : 1096 ] - Client session timed out, have not heard from server in 2666ms for sessionid 0x153175bd3860159, closing socket connection and attempting reconnect
2016-03-08 18:00:06,750 INFO [localhost-startStop-1-SendThread(10.41.105.23:2181)] [ClientCnxn$SendThread] [line : 975 ] - Opening socket connection to server 10.41.105.23/10.41.105.23:2181. Will not attempt to authenticate using SASL (unknown error)
2016-03-08 18:00:06,823 INFO [localhost-startStop-1-SendThread(10.41.105.23:2181)] [ClientCnxn$SendThread] [line : 852 ] - Socket connection established to 10.41.105.23/10.41.105.23:2181, initiating session
2016-03-08 18:00:06,892 INFO [localhost-startStop-1-SendThread(10.41.105.23:2181)] [ClientCnxn$SendThread] [line : 1235 ] - Session establishment complete on server 10.41.105.23/10.41.105.23:2181, sessionid = 0x153175bd3860159, negotiated timeout = 4000
Zookeeper Logs
==================
[2016-03-08 17:44:37,722] INFO Accepted socket connection from /10.10.113.92:51333 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2016-03-08 17:44:37,742] INFO Client attempting to renew session 0x153175bd3860159 at /10.10.113.92:51333 (org.apache.zookeeper.server.ZooKeeperServer)
[2016-03-08 17:44:37,742] INFO Established session 0x153175bd3860159 with negotiated timeout 4000 for client /10.10.113.92:51333 (org.apache.zookeeper.server.ZooKeeperServer)
[2016-03-08 17:46:56,000] INFO Expiring session 0x153175bd3860151, timeout of 4000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
[2016-03-08 17:46:56,001] INFO Processed session termination for sessionid: 0x153175bd3860151 (org.apache.zookeeper.server.PrepRequestProcessor)
[2016-03-08 17:46:56,011] INFO Closed socket connection for client /10.10.114.183:38324 which had sessionid 0x153175bd3860151 (org.apache.zookeeper.server.NIOServerCnxn)
Often ZooKeeper session timeouts are caused by "soft failures," which are most commonly a garbage collection pause. Turn on GC logging and see if a long GC occurs at the time the connection times out. Also, read about JVM tuning in Kafka.
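As a sketch, on a JDK 8-era JVM (which matches the stack traces elsewhere in this thread), GC logging can be switched on with flags like these in the process's JVM options (the log path is hypothetical):
-verbose:gc
-Xloggc:/var/log/kafka/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime   # records total stop-the-world pause time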
[2016-03-08 17:46:56,000] INFO Expiring session 0x153175bd3860151, timeout of 4000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
What is ZooKeeper's maxSessionTimeout? If it's just 4000 ms (4 seconds), then it's way too small.
In the Cloudera distribution of Hadoop, ZK's maxSessionTimeout defaults to 40 s (40000 ms).
As explained in the ZK configuration documentation (https://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html), it defaults to 20 ticks (and one tick is 2 seconds by default).
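Putting numbers on that: with the default tickTime=2000, the server accepts session timeouts between 2 * tickTime = 4000 ms and 20 * tickTime = 40000 ms, so a client asking for 4000 ms (as in the logs above) sits right at the floor. A hedged zoo.cfg sketch that sets the window explicitly (illustrative values):
tickTime=2000
minSessionTimeout=4000      # defaults to 2 * tickTime
maxSessionTimeout=40000     # defaults to 20 * tickTime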