Apache Geode SerialGatewaySenderQueue blocked: there are 86 stuck threads in this node
2022-12-20 02:09:13.085 [Gateway Sender Primary Lock Acquisition Thread Volunteer] INFO org.apache.geode.internal.cache.wan.GatewaySenderAdvisor - org.apache.geode.internal.cache.wan.GatewaySenderAdvisor@31e1b6b8 is becoming primary gateway Sender.
2022-12-20 02:09:13.085 [Gateway Sender Primary Lock Acquisition Thread Volunteer] INFO org.apache.geode.internal.cache.wan.GatewaySenderAdvisor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =false} : Starting as primary
2022-12-20 02:09:13.128 [Gateway Sender Primary Lock Acquisition Thread Volunteer] INFO org.apache.geode.internal.cache.wan.GatewaySenderAdvisor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =false} : Becoming primary gateway sender
2022-12-20 02:09:13.129 [Event Processor for GatewaySender_sender1.3] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - Gateway Failover Initiated: Processing 5 unprocessed events.
2022-12-20 02:09:13.129 [Event Processor for GatewaySender_sender1.2] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - Gateway Failover Initiated: Processing 6 unprocessed events.
2022-12-20 02:09:13.129 [Event Processor for GatewaySender_sender1.1] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - Gateway Failover Initiated: Processing 2 unprocessed events.
2022-12-20 02:09:13.129 [Event Processor for GatewaySender_sender1.4] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - Gateway Failover Initiated: Processing 6 unprocessed events.
2022-12-20 02:09:13.129 [Event Processor for GatewaySender_sender1.0] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - Gateway Failover Initiated: Processing 6 unprocessed events.
2022-12-20 02:09:13.143 [Pooled Serial Message Processor2-1] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - GatewayEventProcessor[gatewaySenderId=sender1;remoteDSId=2;batchSize=100] : Waiting for failover completion
2022-12-20 02:09:13.146 [Event Processor for GatewaySender_sender1.1] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =true} : Marking 5 events as possible duplicates
2022-12-20 02:09:13.148 [Event Processor for GatewaySender_sender1.3] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =true} : Marking 6 events as possible duplicates
2022-12-20 02:09:13.149 [Event Processor for GatewaySender_sender1.4] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =true} : Marking 7 events as possible duplicates
2022-12-20 02:09:13.150 [Event Processor for GatewaySender_sender1.2] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =true} : Marking 9 events as possible duplicates
2022-12-20 02:09:13.150 [Event Processor for GatewaySender_sender1.0] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =true} : Marking 9 events as possible duplicates
2022-12-20 02:09:17.146 [Pooled High Priority Message Processor 15] INFO org.apache.geode.distributed.internal.membership.gms.Services - received leave request from 10.4.20.34(20.34-S1:121257)<v203>:1025 for 10.4.20.34(20.34-S1:121257)<v203>:1025
2022-12-20 02:09:17.147 [Pooled High Priority Message Processor 15] INFO org.apache.geode.distributed.internal.membership.gms.Services - JoinLeave.processMessage(LeaveRequestMessage) invoked. isCoordinator=false; isStopping=false; cancelInProgress=false
2022-12-20 02:09:17.147 [Pooled High Priority Message Processor 15] INFO org.apache.geode.distributed.internal.membership.gms.Services - Checking to see if I should become coordinator. My address is 10.4.20.148(20.148-S1:54029)<v229>:1024
2022-12-20 02:09:17.148 [Pooled High Priority Message Processor 15] INFO org.apache.geode.distributed.internal.membership.gms.Services - View with removed and left members removed is View[10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024|235] members: [10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.148(20.148-S1:54029)<v229>:1024{lead}, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024] and coordinator would be 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024
2022-12-20 02:09:17.251 [unicast receiver,95306-YY-V020148-64048] INFO org.apache.geode.distributed.internal.membership.gms.Services - received leave request from 10.4.20.34(20.34-S1:121257)<v203>:1025 for 10.4.20.34(20.34-S1:121257)<v203>:1025
2022-12-20 02:09:17.252 [unicast receiver,95306-YY-V020148-64048] INFO org.apache.geode.distributed.internal.membership.gms.Services - JoinLeave.processMessage(LeaveRequestMessage) invoked. isCoordinator=false; isStopping=false; cancelInProgress=false
2022-12-20 02:09:17.252 [unicast receiver,95306-YY-V020148-64048] INFO org.apache.geode.distributed.internal.membership.gms.Services - Checking to see if I should become coordinator. My address is 10.4.20.148(20.148-S1:54029)<v229>:1024
2022-12-20 02:09:17.252 [unicast receiver,95306-YY-V020148-64048] INFO org.apache.geode.distributed.internal.membership.gms.Services - View with removed and left members removed is View[10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024|235] members: [10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.148(20.148-S1:54029)<v229>:1024{lead}, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024] and coordinator would be 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024
2022-12-20 02:09:17.451 [unicast receiver,95306-YY-V020148-64048] INFO org.apache.geode.distributed.internal.membership.gms.Services - received new view: View[10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024|235] members: [10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.148(20.148-S1:54029)<v229>:1024{lead}, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024] shutdown: [10.4.20.34(20.34-S1:121257)<v203>:1025]
old view is: View[10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024|234] members: [10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.34(20.34-S1:121257)<v203>:1025{lead}, 10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]
2022-12-20 02:09:17.453 [Geode View Processor1] INFO org.apache.geode.distributed.internal.ClusterDistributionManager - Member at 10.4.20.34(20.34-S1:121257)<v203>:1025 gracefully left the distributed cache: departed membership view
2022-12-20 02:09:17.453 [Geode View Processor1] INFO org.apache.geode.distributed.internal.ClusterOperationExecutors - Marking the SerialQueuedExecutor with id : 3 used by the member 10.4.20.34(20.34-S1:121257)<v203>:1025 to be unused.
2022-12-20 02:09:23.540 [Function Execution Processor12] DEBUG com.wntime.geode.function.OperateDataFunction - start exec logic : OperateDataFunction
2022-12-20 02:09:27.880 [Function Execution Processor11] DEBUG com.wntime.geode.function.OperateDataFunction - start exec logic : OperateDataFunction
2022-12-20 02:09:31.419 [Thread-7] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11238 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:31.419 [Thread-8] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11239 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:31.421 [Thread-10] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11240 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:31.424 [Thread-9] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11241 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:31.427 [Thread-6] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11242 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:42.518 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.170.107(15:loner):49844:f703d5d6,connection=2; port=45420] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:09:42.518 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.28.152(15:loner):45140:3406d5d6,connection=2; port=41590] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:09:42.519 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - ClientHealthMonitor: Unregistering client with member id identity(10.255.28.152(15:loner):45140:3406d5d6,connection=2 due to: Unknown reason
2022-12-20 02:09:42.519 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.242.77(13:loner):48228:1706d5d6,connection=2; port=57978] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:09:42.519 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - ClientHealthMonitor: Unregistering client with member id identity(10.255.242.77(13:loner):48228:1706d5d6,connection=2 due to: Unknown reason
2022-12-20 02:09:42.703 [ServerConnection on port 42973 Thread 61] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11246 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:47.522 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.16.178(20:loner):35578:8902d5d6,connection=2; port=60302] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:09:48.527 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.150.234(39:loner):50436:3406d5d6,connection=2; port=53498] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:09:51.529 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.95.181(14:loner):42602:2f02d5d6,connection=2; port=53556] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:10:25.576 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.150.234(39:loner):50436:3406d5d6,connection=2; port=56390] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:10:26.577 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.16.178(20:loner):35578:8902d5d6,connection=2; port=60832] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:10:27.392 [Function Execution Processor7] DEBUG com.wntime.geode.function.OperateDataFunction - start exec logic : OperateDataFunction
2022-12-20 02:10:29.380 [ThreadsMonitor] WARN org.apache.geode.internal.monitoring.ThreadsMonitoringProcess - Thread 641 (0x281) is stuck
2022-12-20 02:10:29.396 [ThreadsMonitor] WARN org.apache.geode.internal.monitoring.executor.AbstractExecutor - Thread <641> (0x281) that was executed at <20 Dec 2022 02:09:19 CST> has been stuck for <70.145 seconds> and number of thread monitor iteration <1>
Thread Name <Pooled Serial Message Processor1-1> state <WAITING>
Waiting on <java.util.concurrent.locks.ReentrantReadWriteLock$FairSync@1ce290a7>
Owned By <ServerConnection on port 42973 Thread 63> with ID <930>
Executor Group <SerialQueuedExecutorWithDMStats>
Monitored metric <ResourceManagerStats.numThreadsStuck>
Thread stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:223)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:477)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:445)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:162)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:116)
org.apache.geode.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:1082)
org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6141)
org.apache.geode.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5777)
org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)
org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$258/445977285.run(Unknown Source)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)
org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$256/2071190751.run(Unknown Source)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)
org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$255/113991150.run(Unknown Source)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)
org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)
org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2044)
org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5602)
org.apache.geode.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:387)
org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:170)
org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5573)
org.apache.geode.internal.cache.AbstractUpdateOperation.doPutOrCreate(AbstractUpdateOperation.java:156)
org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.basicOperateOnRegion(AbstractUpdateOperation.java:307)
org.apache.geode.internal.cache.DistributedPutAllOperation$PutAllMessage.doEntryPut(DistributedPutAllOperation.java:1114)
org.apache.geode.internal.cache.DistributedPutAllOperation$PutAllMessage$1.run(DistributedPutAllOperation.java:1194)
org.apache.geode.internal.cache.event.DistributedEventTracker.syncBulkOp(DistributedEventTracker.java:481)
Lock owner thread stack
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)
org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)
org.apache.geode.internal.cache.DistributedCacheOperation.waitForAckIfNeeded(DistributedCacheOperation.java:779)
org.apache.geode.internal.cache.DistributedCacheOperation._distribute(DistributedCacheOperation.java:676)
org.apache.geode.internal.cache.DistributedCacheOperation.startOperation(DistributedCacheOperation.java:277)
org.apache.geode.internal.cache.DistributedCacheOperation.distribute(DistributedCacheOperation.java:318)
org.apache.geode.internal.cache.DistributedRegion.distributeUpdate(DistributedRegion.java:514)
org.apache.geode.internal.cache.DistributedRegion.basicPutPart3(DistributedRegion.java:492)
org.apache.geode.internal.cache.map.RegionMapPut.doAfterCompletionActions(RegionMapPut.java:307)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:185)
org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$255/113991150.run(Unknown Source)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)
org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)
org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2044)
org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5602)
org.apache.geode.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:387)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue$SerialGatewaySenderQueueMetaRegion.virtualPut(SerialGatewaySenderQueue.java:1215)
org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5580)
org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:156)
org.apache.geode.internal.cache.LocalRegion.basicPut(LocalRegion.java:5038)
org.apache.geode.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1637)
org.apache.geode.internal.cache.LocalRegion.put(LocalRegion.java:1624)
org.apache.geode.internal.cache.AbstractRegion.put(AbstractRegion.java:442)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:245)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:232)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:477)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:445)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:162)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:116)
org.apache.geode.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:1082)
org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6141)
2022-12-20 02:10:29.397 [ThreadsMonitor] WARN org.apache.geode.internal.monitoring.ThreadsMonitoringProcess - Thread 646 (0x286) is stuck
2022-12-20 02:10:29.413 [ThreadsMonitor] WARN org.apache.geode.internal.monitoring.executor.AbstractExecutor - Thread <646> (0x286) that was executed at <20 Dec 2022 02:09:16 CST> has been stuck for <72.891 seconds> and number of thread monitor iteration <1>
Thread Name <Pooled Serial Message Processor2-1> state <WAITING>
Waiting on <java.util.concurrent.locks.ReentrantReadWriteLock$FairSync@1ce290a7>
Owned By <ServerConnection on port 42973 Thread 63> with ID <930>
Executor Group <SerialQueuedExecutorWithDMStats>
Monitored metric <ResourceManagerStats.numThreadsStuck>
Thread stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:223)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:477)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:445)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:162)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:116)
org.apache.geode.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:1082)
org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6141)
org.apache.geode.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5777)
org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)
There are 86 stuck threads in this node. The cluster uses a WAN gateway in a bidirectional (two-way) data transmission architecture. When I restart the primary sending gateway, the cluster fails: it can no longer write data and threads start getting stuck, until eventually 86 threads are stuck. After 57 minutes all of the stuck threads recovered on their own; during that whole period the log kept reporting exceptions and the cluster could not be recovered. I hope to learn what causes this and to find a quick way to recover.
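Reading the stacks above: every stuck thread is parked in SerialGatewaySenderQueue.put() waiting for the queue's write lock, and the lock owner (ServerConnection on port 42973 Thread 63) holds that lock while blocked in ReplyProcessor21, waiting for acks from the other members for a put on the queue's meta region (the same waits the "15 seconds have elapsed" warnings report). The five "Event Processor for GatewaySender_sender1.0" through ".4" threads indicate a serial sender running with five dispatcher threads. For context, a minimal sketch of how such a sender is defined with the Geode Java API; the id, remoteDsId, batch size, and dispatcher count come from the log, while the properties and region name are assumptions:
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.RegionShortcut;
import org.apache.geode.cache.wan.GatewaySender;
import org.apache.geode.cache.wan.GatewaySenderFactory;

public class Sender1ConfigSketch {
  public static void main(String[] args) {
    // distributed-system-id and remote-locators are assumptions; the log only
    // shows that this site sends to the remote distributed system with id 2.
    Cache cache = new CacheFactory()
        .set("distributed-system-id", "1")
        .set("remote-locators", "remotehost[10334]")
        .create();

    GatewaySenderFactory factory = cache.createGatewaySenderFactory();
    factory.setParallel(false);       // SerialGatewaySender, as in the log
    factory.setDispatcherThreads(5);  // matches threads sender1.0 .. sender1.4
    factory.setBatchSize(100);        // batchSize=100 appears in the log
    GatewaySender sender = factory.create("sender1", 2); // remoteDsId=2

    // Every put on a region carrying this sender id is enqueued through
    // SerialGatewaySenderQueue.put(), the method contending for the write
    // lock in the stuck-thread stacks above.
    Region<String, Object> region = cache
        .<String, Object>createRegionFactory(RegionShortcut.REPLICATE)
        .addGatewaySenderId(sender.getId())
        .create("exampleRegion");
  }
}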
Related
Losing connection to Kafka. What happens?
A jobmanager and taskmanager are running on a single VM. Kafka also runs on the same server. I have 10 tasks that all read from different Kafka topics, process messages, and write back to Kafka. Sometimes I find my task manager is down and nothing is working. I tried to figure out the problem by checking the logs, and I believe it is a problem with the Kafka connection (or maybe a network problem? But everything is on a single server). What I want to ask is: if I lose the connection to Kafka for a short period, what happens? Why do tasks fail, and most importantly, why does the task manager crash? Some logs:
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-8] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=cpualgosgroup1-1, groupId=cpualgosgroup1] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,692 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=telefilter1-0, groupId=telefilter1] Cancelled in-flight FETCH request with correlation id 3630156 due to node 0 being disconnected (elapsed time since creation: 61648ms, elapsed time since send: 61648ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Cancelled in-flight PRODUCE request with correlation id 2159429 due to node 0 being disconnected (elapsed time since creation: 51069ms, elapsed time since send: 51069ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=cpualgosgroup1-1, groupId=cpualgosgroup1] Cancelled in-flight FETCH request with correlation id 2344708 due to node 0 being disconnected (elapsed time since creation: 51184ms, elapsed time since send: 51184ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Cancelled in-flight PRODUCE request with correlation id 2159430 due to node 0 being disconnected (elapsed time since creation: 51069ms, elapsed time since send: 51069ms, request timeout: 30000ms)
2022-11-26 23:35:15,842 WARN org.apache.kafka.clients.producer.internals.Sender [] - [Producer clientId=producer-15] Received invalid metadata error in produce request on partition tele.alerts.cpu-4 due to org.apache.kafka.common.errors.NetworkException: Disconnected from node 0. Going to request metadata update now
2022-11-26 23:35:15,842 WARN org.apache.kafka.clients.producer.internals.Sender [] - [Producer clientId=producer-8] Received invalid metadata error in produce request on partition tele.alerts.cpu-6 due to org.apache.kafka.common.errors.NetworkException: Disconnected from node 0. Going to request metadata update now
and then:
2022-11-26 23:35:56,673 WARN org.apache.flink.runtime.taskmanager.Task [] - CPUTemperatureAnalysisAlgorithm -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_d0ae1ab03e621ff140fb6b0b0a2932f9_0_0) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for 8d57994a59ab86ea9ee48076e80a7c7f.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1702)
...
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 99d52303d7e24496ae661ddea2b6a372 timed out.
2022-11-26 23:35:56,682 INFO org.apache.flink.runtime.taskmanager.Task [] - Triggering cancellation of task code CPUTemperatureAnalysisAlgorithm -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_d0ae1ab03e621ff140fb6b0b0a2932f9_0_0).
2022-11-26 23:35:57,199 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to fail task externally TemperatureAnalysis -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_15071110d0eea9f1c7f3d75503ff58eb_0_0).
2022-11-26 23:35:57,202 WARN org.apache.flink.runtime.taskmanager.Task [] - TemperatureAnalysis -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_15071110d0eea9f1c7f3d75503ff58eb_0_0) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for 8d57994a59ab86ea9ee48076e80a7c7f.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1702)
Why does the taskexecutor lose connection to the JobManager? If I don't care about any data loss, how should I configure the Kafka clients and Flink recovery? I just want the Kafka client not to die. Especially, I don't want my tasks or task managers to crash. If I lose the connection, is it possible to configure Flink to just wait? If we can't read, wait; and if we can't write back to Kafka, just wait?
The heartbeat of JobManager with id 99d52303d7e24496ae661ddea2b6a372 timed out. Sounds like the server is somewhat overloaded. But you could try increasing the heartbeat timeout.
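If you go that route, the knobs live in flink-conf.yaml. A sketch with illustrative values (not recommendations), plus a restart strategy so a transient Kafka outage retries the job instead of killing it:
# flink-conf.yaml -- illustrative values, tune for your environment
heartbeat.interval: 10000      # ms between heartbeats (Flink default)
heartbeat.timeout: 180000      # ms of silence before the JM/TM connection is failed (default 50000)
# Let jobs restart on failure instead of dying outright:
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 30 s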
Kafka Zookeeper Random Restarts
We are running a Hyperledger Fabric network with Kafka and ZooKeeper in production using Docker Swarm on Azure VMs (4 Kafka nodes, 3 ZooKeeper nodes). It was running fine, but just 2 days back ZooKeeper suddenly restarted, and since then ZooKeeper keeps restarting at intervals of 6-8 hours.
Logs on a Kafka node:
[2020-07-04 07:48:53,492] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Stopped (kafka.server.ReplicaFetcherThread)
[2020-07-04 07:48:53,492] INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=0] Shutdown completed (kafka.server.ReplicaFetcherThread)
[2020-07-04 07:48:53,499] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions xxxx-xxxxx-xxx-xxxxx.
ZooKeeper leader logs:
2020-07-04 07:46:27,070 [myid:3] - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@653] - Got user-level KeeperException when processing sessionid:0x10101beb22c0000 type:create cxid:0x4 zxid:0x2e00000114 txntype:-1 reqpath:n/a Error Path:/brokers/ids Error:KeeperErrorCode = NodeExists for /brokers/ids
2020-07-04 07:48:43,084 [myid:3] - INFO [SessionTracker:ZooKeeperServer@355] - Expiring session 0x2010551ef290000, timeout of 6000ms exceeded
2020-07-04 07:48:43,085 [myid:3] - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@487] - Processed session termination for sessionid: 0x2010551ef290000
2020-07-04 07:48:43,091 [myid:3] - INFO [CommitProcessor:3:NIOServerCnxn@1056] - Closed socket connection for client /100.0.20.80:60672 which had sessionid 0x2010551ef290000
2020-07-04 07:48:55,182 [myid:3] - ERROR [LearnerHandler-/100.0.20.80:58940:LearnerHandler@648] - Unexpected exception causing shutdown while sock still open
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:559)
2020-07-04 07:48:55,183 [myid:3] - WARN [LearnerHandler-/100.0.20.80:58940:LearnerHandler@661] - ******* GOODBYE /100.0.20.80:58940 ********
2020-07-04 07:49:57,623 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@222] - Accepted socket connection from /100.0.20.80:37838
2020-07-04 07:49:57,637 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@949] - Client attempting to establish new session at /100.0.20.80:37838
2020-07-04 07:49:57,641 [myid:3] - INFO [CommitProcessor:3:ZooKeeperServer@694] - Established session 0x300ed4720900000 with negotiated timeout 12000 for client /100.0.20.80:37838
2020-07-04 07:49:57,670 [myid:3] - INFO [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@653] - Got user-level KeeperException when processing sessionid:0x300ed4720900000 type:setData cxid:0x1 zxid:0x2e000003b2 txntype:-1 reqpath:n/a Error Path:/brokers/topics/xxxxxxxxxxxx/partitions/0/state Error:KeeperErrorCode = BadVersion for /brokers/topics/xxxxxxxxxxxx/partitions/0/state
My zoo.cfg:
clientPort=2181
dataDir=/data
dataLogDir=/datalog
tickTime=6000
initLimit=10
syncLimit=2
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
server.1=xxx.xxx.com:2888:3888
server.2=xxx.xxx.com:2888:3888
server.3=0.0.0.0:2888:3888
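One detail worth calling out in that zoo.cfg: ZooKeeper derives its session-timeout bounds from tickTime, so a 6-second tick makes every timeout coarse. A short annotated sketch of the arithmetic, assuming the stock defaults of 2 and 20 ticks for the min and max session timeout:
# zoo.cfg as posted
tickTime=6000
# minSessionTimeout defaults to 2 * tickTime  = 12000 ms
#   (which matches "negotiated timeout 12000" in the log above)
# maxSessionTimeout defaults to 20 * tickTime = 120000 ms
# initLimit/syncLimit are measured in ticks, so here:
#   initLimit = 10 ticks = 60 s, syncLimit = 2 ticks = 12 s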
Zookeeper unable to talk to new Kafka broker
In an attempt to reduce the storage on my AWS instance, I decided to launch a new, smaller instance and set up Kafka again from scratch using the Ansible playbook we had from before. I then terminated the old, larger instance and assigned the IP address that it and the other brokers were using to my new instance. When tailing my ZooKeeper logs, however, I'm receiving this error:
2018-04-13 14:17:34,884 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@810] - Connection broken for id 1, my id = 2, error =
java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:153)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.net.SocketInputStream.read(SocketInputStream.java:211)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:795)
2018-04-13 14:17:34,885 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@813] - Interrupting SendWorker
2018-04-13 14:17:34,884 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker@727] - Interrupted while waiting for message on queue
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:879)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:715)
I double-checked that all 3 Kafka broker IP addresses are correctly listed in these locations, and I restarted all of their services to be safe:
/etc/hosts
/etc/kafka/config/server.properties
/etc/zookeeper/conf/zoo.cfg
/etc/filebeat/filebeat.yml
Kafka Zookeeper Connection drop continuously
I have set up a 3-node Kafka cluster and a 3-node ZooKeeper cluster on separate nodes. Using Kafka I can produce and consume messages successfully, and run commands like kafka-topics.sh to get topic lists and their information from ZooKeeper, but there are some errors in the Kafka server.log file. The following warning appears continuously:
[2018-02-18 21:50:01,241] WARN Client session timed out, have not heard from server in 320190154ms for sessionid 0x161a94b101f0001 (org.apache.zookeeper.ClientCnxn)
[2018-02-18 21:50:01,242] INFO Client session timed out, have not heard from server in 320190154ms for sessionid 0x161a94b101f0001, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
[2018-02-18 21:50:01,343] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient)
[2018-02-18 21:50:01,989] INFO Opening socket connection to server zookeeper3/192.168.1.206:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2018-02-18 21:50:02,008] INFO Socket connection established to zookeeper3/192.168.1.206:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2018-02-18 21:50:02,042] INFO Session establishment complete on server zookeeper3/192.168.1.206:2181, sessionid = 0x161a94b101f0001, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2018-02-18 21:50:02,042] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
[2018-02-18 21:59:31,570] INFO [Group Metadata Manager on Broker 102]: Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
It seems the Kafka sessions in ZooKeeper expire periodically! The ZooKeeper logs contain the following warnings, too:
2018-02-18 18:20:06,149 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@368] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x161a94b101f0001, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.lang.Thread.run(Thread.java:748)
2018-02-18 18:20:06,151 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1044] - Closed socket connection for client /192.168.1.203:43162 which had sessionid 0x161a94b101f0001
2018-02-18 18:20:06,781 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@368] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x161a94b101f0002, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.lang.Thread.run(Thread.java:748)
2018-02-18 18:20:06,782 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1044] - Closed socket connection for client /192.168.1.201:45330 which had sessionid 0x161a94b101f0002
2018-02-18 18:37:29,127 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /192.168.1.202:52480
2018-02-18 18:37:29,139 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@942] - Client attempting to establish new session at /192.168.1.202:52480
2018-02-18 18:37:29,143 [myid:1] - INFO [CommitProcessor:1:ZooKeeperServer@687] - Established session 0x161a94b101f0003 with negotiated timeout 30000 for client /192.168.1.202:52480
2018-02-18 18:37:29,432 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1044] - Closed socket connection for client /192.168.1.202:52480 which had sessionid 0x161a94b101f0003
I think it's because ZooKeeper can't get heartbeats from the Kafka nodes. Here is the ZooKeeper zoo.cfg:
tickTime=2000
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zookeeper1:2888:3888
server.2=zookeeper2:2888:3888
server.3=zookeeper3:2888:3888
and the customized Kafka server.properties settings:
broker.id=1
listeners=PLAINTEXT://kafka1:9092
num.partitions=24
delete.topic.enable=true
default.replication.factor=2
log.dirs=/data/kafka/data
zookeeper.connect=zookeeper1:2181,zookeeper2:2181,zookeeper3:2181
log.retention.hours=168
I use the same ZooKeeper cluster for Hadoop HA without any problem, so I think there is something wrong with the Kafka properties listeners and advertised.listeners. I read the Kafka documentation but couldn't understand their meaning. In the hosts file of every OS, hostnames such as zookeeper1 to zookeeper3 and kafka1 to kafka3 are defined and reachable via the ping command. I removed the following lines from the hosts file, though I don't think that could have caused the problem:
127.0.0.1 localhost
127.0.1.1 hostname
Kafka version: 0.11. ZooKeeper version: 3.4.10. Can anyone help?
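On the listeners vs. advertised.listeners confusion mentioned above: listeners is the socket the broker binds, while advertised.listeners is the address the broker publishes to ZooKeeper and returns to clients in metadata, so it must be resolvable from every client and broker. A hedged server.properties sketch with placeholder hostnames:
# server.properties -- illustrative, hostnames are placeholders
# What the broker process binds and listens on:
listeners=PLAINTEXT://0.0.0.0:9092
# What gets registered in ZooKeeper and handed to clients in metadata;
# must be reachable from all clients and brokers:
advertised.listeners=PLAINTEXT://kafka1:9092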
We were facing a similar issue with Kafka. As @Soheil pointed out, it was due to a major GC running. When a major GC runs, Kafka can sometimes fail to send its heartbeat to ZooKeeper. For us, a major GC was running almost once every 15 seconds. On taking a heap dump, we realized it was due to a metric memory leak in Kafka.
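For anyone wanting to check for the same thing, a heap dump of the broker JVM can be captured with stock JDK tools (the pid and output path are placeholders):
# Capture a heap dump of the running broker for leak analysis
jcmd <broker-pid> GC.heap_dump /tmp/kafka-heap.hprof
# or, with the older tooling:
jmap -dump:live,format=b,file=/tmp/kafka-heap.hprof <broker-pid>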
Zookeeper sessions keep expiring...no heartbeats?
We are using Kafka high level consumer , and we are able to successfully consume messages but the zookeeper connections keep expiring and reestablishing. I am wondering why are there no heartbeats to keep the connections alive: Kafka Consumer Logs ==================== [localhost-startStop-1-SendThread(10.41.105.23:2181)] [ClientCnxn$SendThread] [line : 1096 ] - Client session timed out, have not heard from server in 2666ms for sessionid 0x153175bd3860159, closing socket connection and attempting reconnect 2016-03-08 18:00:06,750 INFO [localhost-startStop-1-SendThread(10.41.105.23:2181)] [ClientCnxn$SendThread] [line : 975 ] - Opening socket connection to server 10.41.105.23/10.41.105.23:2181. Will not attempt to authenticate using SASL (unknown error) 2016-03-08 18:00:06,823 INFO [localhost-startStop-1-SendThread(10.41.105.23:2181)] [ClientCnxn$SendThread] [line : 852 ] - Socket connection established to 10.41.105.23/10.41.105.23:2181, initiating session 2016-03-08 18:00:06,892 INFO [localhost-startStop-1-SendThread(10.41.105.23:2181)] [ClientCnxn$SendThread] [line : 1235 ] - Session establishment complete on server 10.41.105.23/10.41.105.23:2181, sessionid = 0x153175bd3860159, negotiated timeout = 4000 Zookeeper Logs ================== [2016-03-08 17:44:37,722] INFO Accepted socket connection from /10.10.113.92:51333 (org.apache.zookeeper.server.NIOServerCnxnFactory) [2016-03-08 17:44:37,742] INFO Client attempting to renew session 0x153175bd3860159 at /10.10.113.92:51333 (org.apache.zookeeper.server.ZooKeeperServer) [2016-03-08 17:44:37,742] INFO Established session 0x153175bd3860159 with negotiated timeout 4000 for client /10.10.113.92:51333 (org.apache.zookeeper.server.ZooKeeperServer) [2016-03-08 17:46:56,000] INFO Expiring session 0x153175bd3860151, timeout of 4000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer) [2016-03-08 17:46:56,001] INFO Processed session termination for sessionid: 0x153175bd3860151 (org.apache.zookeeper.server.PrepRequestProcessor) [2016-03-08 17:46:56,011] INFO Closed socket connection for client /10.10.114.183:38324 which had sessionid 0x153175bd3860151 (org.apache.zookeeper.server.NIOServerCnxn)
Often ZooKeeper session timeouts are caused by "soft failures," which are most commonly a garbage collection pause. Turn on GC logging and see if a long GC occurs at the time the connection times out. Also, read about JVM tuning in Kafka.
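A sketch of how to turn that on, assuming JDK 8 flags (on JDK 9+ the unified -Xlog:gc* flag replaces these); the log path is a placeholder:
# JDK 8 GC logging flags, e.g. appended to KAFKA_OPTS or the consumer's JVM args
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/kafka/gc.log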
[2016-03-08 17:46:56,000] INFO Expiring session 0x153175bd3860151, timeout of 4000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
What is ZooKeeper's maxSessionTimeout? If it's just 4000ms (4 seconds), then it's way too small. In the Cloudera distribution of Hadoop, ZK's maxSessionTimeout defaults to 40s (40000ms). As explained in the ZK administrator's guide - https://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html - it defaults to 20 ticks (and one tick by default is 2 seconds).
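Concretely, the default bounds work out like this (a sketch; set maxSessionTimeout explicitly if clients should be allowed longer sessions):
# zoo.cfg -- illustrative
tickTime=2000
# default maxSessionTimeout = 20 * tickTime = 40000 ms
# default minSessionTimeout =  2 * tickTime =  4000 ms
#   (a client asking for 4000 ms, as in the logs above, sits right at the floor)
maxSessionTimeout=60000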