MessageAwareListener Error processing courier, backing off for 32000 milliseconds (AS 5.1 + ESB 4.10) - jboss5.x

I have a problem with my ESB archives. I am porting ESB projects from ESB server 4.4 to ESB 4.10 integrated into JBoss AS 5.1.0. I built the projects against the correct runtime environment (5.1). When the server starts I get periodic warnings. I have read several threads in the JBoss ESB community, but none of them resolves the problem. I tried increasing my connection pool in jboss-esb.properties, but nothing changed. Here is an extract from the log:
17:19:40,559 INFO [MessageAwareListener] State reached : false
17:19:41,565 WARN [MessageAwareListener] Error processing courier, backing off for 8000 milliseconds
17:19:49,568 INFO [MessageAwareListener] State reached : false
17:19:50,582 WARN [MessageAwareListener] Error processing courier, backing off for 16000 milliseconds
17:20:06,584 INFO [MessageAwareListener] State reached : false
17:20:07,592 WARN [MessageAwareListener] Error processing courier, backing off for 32000 milliseconds
17:20:39,594 INFO [MessageAwareListener] State reached : false
17:20:40,603 WARN [MessageAwareListener] Error processing courier, backing off for 32000 milliseconds
17:21:12,617 INFO [MessageAwareListener] State reached : false
17:21:13,639 WARN [MessageAwareListener] Error processing courier, backing off for 32000 milliseconds
17:21:45,641 INFO [MessageAwareListener] State reached : false
17:21:46,650 WARN [MessageAwareListener] Error processing courier, backing off for 32000 milliseconds
17:22:18,652 INFO [MessageAwareListener] State reached : false
17:22:19,662 WARN [MessageAwareListener] Error processing courier, backing off for 32000 milliseconds
17:22:51,664 INFO [MessageAwareListener] State reached : false
17:22:52,672 WARN [MessageAwareListener] Error processing courier, backing off for 32000 milliseconds
17:23:24,674 INFO [MessageAwareListener] State reached : false
17:23:25,683 WARN [MessageAwareListener] Error processing courier, backing off for 32000 milliseconds
17:23:57,684 INFO [MessageAwareListener] State reached : false
17:23:58,692 WARN [MessageAwareListener] Error processing courier, backing off for 32000 milliseconds
17:24:30,694 INFO [MessageAwareListener] State reached : false
17:24:31,701 WARN [MessageAwareListener] Error processing courier, backing off for 32000 milliseconds
17:25:03,703 INFO [MessageAwareListener] State reached : false
17:25:04,711 WARN [MessageAwareListener] Error processing courier, backing off for 32000 milliseconds
17:25:36,713 INFO [MessageAwareListener] State reached : false
17:25:37,720 WARN [MessageAwareListener] Error processing courier, backing off for 32000 milliseconds
17:26:09,722 INFO [MessageAwareListener] State reached : false
17:26:10,729 WARN [MessageAwareListener] Error processing courier, backing off for 32000 milliseconds
17:26:42,731 INFO [MessageAwareListener] State reached : false
17:26:43,738 WARN [MessageAwareListener] Error processing courier, backing off for 32000 milliseconds
17:27:15,740 INFO [MessageAwareListener] State reached : false
17:27:16,747 WARN [MessageAwareListener] Error processing courier, backing off for 32000 milliseconds
17:27:48,749 INFO [MessageAwareListener] State reached : false
17:27:49,756 WARN [MessageAwareListener] Error processing courier, backing off for 32000 milliseconds
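For what it's worth, the pool setting I tried to increase is the one in the jms section of the ESB properties file (jbossesb-properties.xml inside jbossesb.sar in my deployment). The property names below are the ones documented for ESB 4.x, and 50 is simply the value I tested (the default is 20):

<properties name="jms">
    <!-- maximum number of pooled JMS sessions available to the couriers -->
    <property name="org.jboss.soa.esb.jms.connectionPool" value="50"/>
    <!-- milliseconds to wait between attempts to obtain a pooled session -->
    <property name="org.jboss.soa.esb.jms.sessionSleep" value="50"/>
</properties>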
Thanks in advance for your help
Best regards

Related

Prysm.sh beacon-chain stopped synchronization - How to fix?

I am trying to sync my Geth node, but it is stuck.
I see the following errors in Prysm:
[2022-12-30 08:24:56] INFO p2p: Peer summary activePeers=42 inbound=0 outbound=42
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=could not process block in batch: could not set node to invalid: invalid nil or unknown node
[2022-12-30 08:24:59] INFO initial-sync: Processing block batch of size 64 starting from 0x87ca0daa... 5192464/5463723 - estimated time remaining 7h50m56s blocksPerSecond=9.6 peers=41
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x1a0877b34ccd9e92bdfe7859dbd3478d1c785fd6e970d8b8b637d5c457efec83 (in processBatchedBlocks, slot=5192464)
[2022-12-30 08:24:59] INFO initial-sync: Processing block batch of size 64 starting from 0xa48361f2... 5192528/5463723 - estimated time remaining 5h53m7s blocksPerSecond=12.8 peers=41
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xc569dd3f4241031b835bd7dd528d2337cca5b0c8315ae55f56b810d1b0a1aa40 (in processBatchedBlocks, slot=5192528)
[2022-12-30 08:24:59] INFO initial-sync: Processing block batch of size 61 starting from 0x06de2537... 5192592/5463723 - estimated time remaining 4h45m6s blocksPerSecond=15.8 peers=41
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xf9a776157c86f918aa82c39a69a2a880b0eb42a692bc6e1a97b3fe2fce73c916 (in processBatchedBlocks, slot=5192592)
[2022-12-30 08:24:59] INFO initial-sync: Processing block batch of size 62 starting from 0x865d9e60... 5192656/5463723 - estimated time remaining 3h58m24s blocksPerSecond=18.9 peers=41
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x03140ee59185dc417bf95c370ed4070458855da01b54d0b2186d9e11fd328314 (in processBatchedBlocks, slot=5192656)
[2022-12-30 08:25:09] INFO initial-sync: Processing block batch of size 64 starting from 0x33e52bbd... 5192336/5463723 - estimated time remaining 3h58m41s blocksPerSecond=18.9 peers=44
[2022-12-30 08:25:16] WARN initial-sync: Skip processing batched blocks error=could not process block in batch: could not set node to invalid: invalid nil or unknown node
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 62 starting from 0x0ed9c790... 5192400/5463724 - estimated time remaining 11h57m47s blocksPerSecond=6.3 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x4f227a74a20bebe01369ac220aedacd7e4c986e8f569694f3c643da0cc9cfe83 (in processBatchedBlocks, slot=5192400)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 64 starting from 0x87ca0daa... 5192464/5463724 - estimated time remaining 7h55m53s blocksPerSecond=9.5 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x1a0877b34ccd9e92bdfe7859dbd3478d1c785fd6e970d8b8b637d5c457efec83 (in processBatchedBlocks, slot=5192464)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 64 starting from 0xa48361f2... 5192528/5463724 - estimated time remaining 5h55m54s blocksPerSecond=12.7 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xc569dd3f4241031b835bd7dd528d2337cca5b0c8315ae55f56b810d1b0a1aa40 (in processBatchedBlocks, slot=5192528)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 61 starting from 0x06de2537... 5192592/5463724 - estimated time remaining 4h46m54s blocksPerSecond=15.8 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xf9a776157c86f918aa82c39a69a2a880b0eb42a692bc6e1a97b3fe2fce73c916 (in processBatchedBlocks, slot=5192592)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 62 starting from 0x865d9e60... 5192656/5463724 - estimated time remaining 3h59m40s blocksPerSecond=18.9 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x03140ee59185dc417bf95c370ed4070458855da01b54d0b2186d9e11fd328314 (in processBatchedBlocks, slot=5192656)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 64 starting from 0x74b66233... 5192720/5463724 - estimated time remaining 3h24m50s blocksPerSecond=22.1 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xed2e36835d677750e41fecc7a16c7d8669fd8a8ff629c9480b575d6d27c26085 (in processBatchedBlocks, slot=5192720)
[2022-12-30 08:25:27] INFO initial-sync: Processing block batch of size 64 starting from 0x33e52bbd... 5192336/5463725 - estimated time remaining 2h59m8s blocksPerSecond=25.2 peers=45
[2022-12-30 08:25:33] WARN initial-sync: Skip processing batched blocks error=could not process block in batch: could not set node to invalid: invalid nil or unknown node
And this is what I see in Geth:
ERROR[12-30|08:30:48.985] Error in block freeze operation err="block receipts missing, can't freeze block 15850304"
WARN [12-30|08:30:55.145] Previously seen beacon client is offline. Please ensure it is operational to follow the chain!
WARN [12-30|08:30:56.688] Ignoring already known beacon payload number=16,026,540 hash=145907..4f081f age=1mo1w16h
WARN [12-30|08:30:56.696] Ignoring already known beacon payload number=16,026,541 hash=6acd62..e3bac2 age=1mo1w16h
WARN [12-30|08:30:56.701] Ignoring already known beacon payload number=16,026,542 hash=1da1d8..c55a69 age=1mo1w16h
WARN [12-30|08:30:56.708] Ignoring already known beacon payload number=16,026,543 hash=762b24..f56957 age=1mo1w16h
WARN [12-30|08:30:56.719] Ignoring already known beacon payload number=16,026,544 hash=ff1aef..389471 age=1mo1w16h
WARN [12-30|08:30:56.725] Ignoring already known beacon payload number=16,026,545 hash=6767aa..85bf1d age=1mo1w16h
WARN [12-30|08:30:56.730] Ignoring already known beacon payload number=16,026,546 hash=95b736..dc2456 age=1mo1w16h
WARN [12-30|08:30:56.732] Ignoring already known beacon payload number=16,026,547 hash=34e43f..777810 age=1mo1w16h
WARN [12-30|08:30:56.742] Ignoring already known beacon payload number=16,026,548 hash=1c67b8..cbc356 age=1mo1w16h
WARN [12-30|08:30:56.750] Ignoring already known beacon payload number=16,026,549 hash=fe9e47..ed347e age=1mo1w16h
WARN [12-30|08:30:56.754] Ignoring already known beacon payload number=16,026,550 hash=c98bf1..40560a age=1mo1w16h
WARN [12-30|08:30:56.772] Ignoring already known beacon payload number=16,026,551 hash=f55377..a1582e age=1mo1w16h
WARN [12-30|08:30:56.780] Ignoring already known beacon payload number=16,026,552 hash=0bf769..af0ed8 age=1mo1w16h
WARN [12-30|08:30:56.784] Ignoring already known beacon payload number=16,026,553 hash=382866..a5a4f8 age=1mo1w16h
WARN [12-30|08:30:56.907] Ignoring already known beacon payload number=16,026,554 hash=65d2ff..6ebef5 age=1mo1w16h
WARN [12-30|08:30:56.918] Ignoring already known beacon payload number=16,026,555 hash=f04209..4779e9 age=1mo1w16h
WARN [12-30|08:30:56.935] Ignoring already known beacon payload number=16,026,556 hash=f2b1ab..373dc0 age=1mo1w16h
WARN [12-30|08:30:56.943] Ignoring already known beacon payload number=16,026,557 hash=979712..00891d age=1mo1w16h
WARN [12-30|08:30:56.956] Ignoring already known beacon payload number=16,026,558 hash=b53705..8483a1 age=1mo1w16h
WARN [12-30|08:30:56.979] Ignoring already known beacon payload number=16,026,559 hash=61e689..7e7c79 age=1mo1w16h
WARN [12-30|08:30:56.994] Ignoring already known beacon payload number=16,026,560 hash=a0f45b..802daf age=1mo1w16h
WARN [12-30|08:30:57.004] Ignoring already known beacon payload number=16,026,561 hash=037435..474e8d age=1mo1w16h
WARN [12-30|08:30:57.009] Ignoring already known beacon payload number=16,026,562 hash=565f15..bf9980 age=1mo1w16h
WARN [12-30|08:30:57.031] Ignoring already known beacon payload number=16,026,563 hash=c7f6ef..cc5ddf age=1mo1w16h
WARN [12-30|08:30:57.033] Ignoring already known beacon payload number=16,026,564 hash=c87d53..223987 age=1mo1w16h
WARN [12-30|08:30:57.068] Ignoring already known beacon payload number=16,026,565 hash=51f821..fc1a26 age=1mo1w16h
Has anyone encountered similar problems before?
I already tried to sync again; it didn't help.
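One thing I am considering next (assuming a recent Prysm v3.x build that supports checkpoint sync; the URL and data directory below are placeholders) is restarting the beacon node from a trusted finalized checkpoint instead of replaying initial sync, which should sidestep the failing batches above:

beacon-chain \
  --datadir=/data/prysm \
  --execution-endpoint=http://localhost:8551 \
  --checkpoint-sync-url=https://<trusted-checkpoint-provider> \
  --genesis-beacon-api-url=https://<trusted-checkpoint-provider>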

Apache Geode SerialGatewaySenderQueue blocked: there are 86 stuck threads in this node

2022-12-20 02:09:13.085 [Gateway Sender Primary Lock Acquisition Thread Volunteer] INFO org.apache.geode.internal.cache.wan.GatewaySenderAdvisor - org.apache.geode.internal.cache.wan.GatewaySenderAdvisor#31e1b6b8 is becoming primary gateway Sender.
2022-12-20 02:09:13.085 [Gateway Sender Primary Lock Acquisition Thread Volunteer] INFO org.apache.geode.internal.cache.wan.GatewaySenderAdvisor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =false} : Starting as primary
2022-12-20 02:09:13.128 [Gateway Sender Primary Lock Acquisition Thread Volunteer] INFO org.apache.geode.internal.cache.wan.GatewaySenderAdvisor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =false} : Becoming primary gateway sender
2022-12-20 02:09:13.129 [Event Processor for GatewaySender_sender1.3] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - Gateway Failover Initiated: Processing 5 unprocessed events.
2022-12-20 02:09:13.129 [Event Processor for GatewaySender_sender1.2] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - Gateway Failover Initiated: Processing 6 unprocessed events.
2022-12-20 02:09:13.129 [Event Processor for GatewaySender_sender1.1] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - Gateway Failover Initiated: Processing 2 unprocessed events.
2022-12-20 02:09:13.129 [Event Processor for GatewaySender_sender1.4] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - Gateway Failover Initiated: Processing 6 unprocessed events.
2022-12-20 02:09:13.129 [Event Processor for GatewaySender_sender1.0] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - Gateway Failover Initiated: Processing 6 unprocessed events.
2022-12-20 02:09:13.143 [Pooled Serial Message Processor2-1] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - GatewayEventProcessor[gatewaySenderId=sender1;remoteDSId=2;batchSize=100] : Waiting for failover completion
2022-12-20 02:09:13.146 [Event Processor for GatewaySender_sender1.1] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =true} : Marking 5 events as possible duplicates
2022-12-20 02:09:13.148 [Event Processor for GatewaySender_sender1.3] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =true} : Marking 6 events as possible duplicates
2022-12-20 02:09:13.149 [Event Processor for GatewaySender_sender1.4] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =true} : Marking 7 events as possible duplicates
2022-12-20 02:09:13.150 [Event Processor for GatewaySender_sender1.2] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =true} : Marking 9 events as possible duplicates
2022-12-20 02:09:13.150 [Event Processor for GatewaySender_sender1.0] INFO org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor - SerialGatewaySender{id=sender1,remoteDsId=2,isRunning =true,isPrimary =true} : Marking 9 events as possible duplicates
2022-12-20 02:09:17.146 [Pooled High Priority Message Processor 15] INFO org.apache.geode.distributed.internal.membership.gms.Services - received leave request from 10.4.20.34(20.34-S1:121257)<v203>:1025 for 10.4.20.34(20.34-S1:121257)<v203>:1025
2022-12-20 02:09:17.147 [Pooled High Priority Message Processor 15] INFO org.apache.geode.distributed.internal.membership.gms.Services - JoinLeave.processMessage(LeaveRequestMessage) invoked. isCoordinator=false; isStopping=false; cancelInProgress=false
2022-12-20 02:09:17.147 [Pooled High Priority Message Processor 15] INFO org.apache.geode.distributed.internal.membership.gms.Services - Checking to see if I should become coordinator. My address is 10.4.20.148(20.148-S1:54029)<v229>:1024
2022-12-20 02:09:17.148 [Pooled High Priority Message Processor 15] INFO org.apache.geode.distributed.internal.membership.gms.Services - View with removed and left members removed is View[10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024|235] members: [10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.148(20.148-S1:54029)<v229>:1024{lead}, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024] and coordinator would be 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024
2022-12-20 02:09:17.251 [unicast receiver,95306-YY-V020148-64048] INFO org.apache.geode.distributed.internal.membership.gms.Services - received leave request from 10.4.20.34(20.34-S1:121257)<v203>:1025 for 10.4.20.34(20.34-S1:121257)<v203>:1025
2022-12-20 02:09:17.252 [unicast receiver,95306-YY-V020148-64048] INFO org.apache.geode.distributed.internal.membership.gms.Services - JoinLeave.processMessage(LeaveRequestMessage) invoked. isCoordinator=false; isStopping=false; cancelInProgress=false
2022-12-20 02:09:17.252 [unicast receiver,95306-YY-V020148-64048] INFO org.apache.geode.distributed.internal.membership.gms.Services - Checking to see if I should become coordinator. My address is 10.4.20.148(20.148-S1:54029)<v229>:1024
2022-12-20 02:09:17.252 [unicast receiver,95306-YY-V020148-64048] INFO org.apache.geode.distributed.internal.membership.gms.Services - View with removed and left members removed is View[10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024|235] members: [10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.148(20.148-S1:54029)<v229>:1024{lead}, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024] and coordinator would be 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024
2022-12-20 02:09:17.451 [unicast receiver,95306-YY-V020148-64048] INFO org.apache.geode.distributed.internal.membership.gms.Services - received new view: View[10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024|235] members: [10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.148(20.148-S1:54029)<v229>:1024{lead}, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024] shutdown: [10.4.20.34(20.34-S1:121257)<v203>:1025]
old view is: View[10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024|234] members: [10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.34(20.34-S1:121257)<v203>:1025{lead}, 10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]
2022-12-20 02:09:17.453 [Geode View Processor1] INFO org.apache.geode.distributed.internal.ClusterDistributionManager - Member at 10.4.20.34(20.34-S1:121257)<v203>:1025 gracefully left the distributed cache: departed membership view
2022-12-20 02:09:17.453 [Geode View Processor1] INFO org.apache.geode.distributed.internal.ClusterOperationExecutors - Marking the SerialQueuedExecutor with id : 3 used by the member 10.4.20.34(20.34-S1:121257)<v203>:1025 to be unused.
2022-12-20 02:09:23.540 [Function Execution Processor12] DEBUG com.wntime.geode.function.OperateDataFunction - start exec logic : OperateDataFunction
2022-12-20 02:09:27.880 [Function Execution Processor11] DEBUG com.wntime.geode.function.OperateDataFunction - start exec logic : OperateDataFunction
2022-12-20 02:09:31.419 [Thread-7] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11238 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:31.419 [Thread-8] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11239 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:31.421 [Thread-10] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11240 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:31.424 [Thread-9] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11241 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:31.427 [Thread-6] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11242 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:42.518 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.170.107(15:loner):49844:f703d5d6,connection=2; port=45420] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:09:42.518 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.28.152(15:loner):45140:3406d5d6,connection=2; port=41590] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:09:42.519 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - ClientHealthMonitor: Unregistering client with member id identity(10.255.28.152(15:loner):45140:3406d5d6,connection=2 due to: Unknown reason
2022-12-20 02:09:42.519 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.242.77(13:loner):48228:1706d5d6,connection=2; port=57978] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:09:42.519 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - ClientHealthMonitor: Unregistering client with member id identity(10.255.242.77(13:loner):48228:1706d5d6,connection=2 due to: Unknown reason
2022-12-20 02:09:42.703 [ServerConnection on port 42973 Thread 61] WARN org.apache.geode.distributed.internal.ReplyProcessor21 - 15 seconds have elapsed while waiting for replies: <DistributedCacheOperation$CacheOperationReplyProcessor 11246 waiting for 3 replies from [10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]> on 10.4.20.148(20.148-S1:54029)<v229>:1024 whose current membership list is: [[10.4.20.148(20.148-S1:54029)<v229>:1024, 10.4.20.35(20.35-S1:67284)<v230>:1025, 10.4.20.34(20.34-L1:12493:locator)<ec><v0>:1024, 10.4.20.146(20.146-S1:125081)<v232>:1024, 10.4.20.35(20.35-L1:13244:locator)<ec><v1>:1024, 10.4.20.147(20.147-S1:22219)<v234>:1024]]
2022-12-20 02:09:47.522 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.16.178(20:loner):35578:8902d5d6,connection=2; port=60302] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:09:48.527 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.150.234(39:loner):50436:3406d5d6,connection=2; port=53498] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:09:51.529 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.95.181(14:loner):42602:2f02d5d6,connection=2; port=53556] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:10:25.576 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.150.234(39:loner):50436:3406d5d6,connection=2; port=56390] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:10:26.577 [ClientHealthMonitor Thread] WARN org.apache.geode.internal.cache.tier.sockets.ClientHealthMonitor - Server connection from [identity(10.255.16.178(20:loner):35578:8902d5d6,connection=2; port=60832] is being terminated because its client timeout of 10000 has expired.
2022-12-20 02:10:27.392 [Function Execution Processor7] DEBUG com.wntime.geode.function.OperateDataFunction - start exec logic : OperateDataFunction
2022-12-20 02:10:29.380 [ThreadsMonitor] WARN org.apache.geode.internal.monitoring.ThreadsMonitoringProcess - Thread 641 (0x281) is stuck
2022-12-20 02:10:29.396 [ThreadsMonitor] WARN org.apache.geode.internal.monitoring.executor.AbstractExecutor - Thread <641> (0x281) that was executed at <20 Dec 2022 02:09:19 CST> has been stuck for <70.145 seconds> and number of thread monitor iteration <1>
Thread Name <Pooled Serial Message Processor1-1> state <WAITING>
Waiting on <java.util.concurrent.locks.ReentrantReadWriteLock$FairSync#1ce290a7>
Owned By <ServerConnection on port 42973 Thread 63> with ID <930>
Executor Group <SerialQueuedExecutorWithDMStats>
Monitored metric <ResourceManagerStats.numThreadsStuck>
Thread stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:223)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:477)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:445)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:162)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:116)
org.apache.geode.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:1082)
org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6141)
org.apache.geode.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5777)
org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)
org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$258/445977285.run(Unknown Source)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)
org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$256/2071190751.run(Unknown Source)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)
org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$255/113991150.run(Unknown Source)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)
org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)
org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2044)
org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5602)
org.apache.geode.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:387)
org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:170)
org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5573)
org.apache.geode.internal.cache.AbstractUpdateOperation.doPutOrCreate(AbstractUpdateOperation.java:156)
org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.basicOperateOnRegion(AbstractUpdateOperation.java:307)
org.apache.geode.internal.cache.DistributedPutAllOperation$PutAllMessage.doEntryPut(DistributedPutAllOperation.java:1114)
org.apache.geode.internal.cache.DistributedPutAllOperation$PutAllMessage$1.run(DistributedPutAllOperation.java:1194)
org.apache.geode.internal.cache.event.DistributedEventTracker.syncBulkOp(DistributedEventTracker.java:481)
Lock owner thread stack
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:72)
org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:731)
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:802)
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:779)
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:865)
org.apache.geode.internal.cache.DistributedCacheOperation.waitForAckIfNeeded(DistributedCacheOperation.java:779)
org.apache.geode.internal.cache.DistributedCacheOperation._distribute(DistributedCacheOperation.java:676)
org.apache.geode.internal.cache.DistributedCacheOperation.startOperation(DistributedCacheOperation.java:277)
org.apache.geode.internal.cache.DistributedCacheOperation.distribute(DistributedCacheOperation.java:318)
org.apache.geode.internal.cache.DistributedRegion.distributeUpdate(DistributedRegion.java:514)
org.apache.geode.internal.cache.DistributedRegion.basicPutPart3(DistributedRegion.java:492)
org.apache.geode.internal.cache.map.RegionMapPut.doAfterCompletionActions(RegionMapPut.java:307)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:185)
org.apache.geode.internal.cache.map.AbstractRegionMapPut$$Lambda$255/113991150.run(Unknown Source)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)
org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)
org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2044)
org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5602)
org.apache.geode.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:387)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue$SerialGatewaySenderQueueMetaRegion.virtualPut(SerialGatewaySenderQueue.java:1215)
org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5580)
org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:156)
org.apache.geode.internal.cache.LocalRegion.basicPut(LocalRegion.java:5038)
org.apache.geode.internal.cache.LocalRegion.validatedPut(LocalRegion.java:1637)
org.apache.geode.internal.cache.LocalRegion.put(LocalRegion.java:1624)
org.apache.geode.internal.cache.AbstractRegion.put(AbstractRegion.java:442)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.putAndGetKey(SerialGatewaySenderQueue.java:245)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:232)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:477)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:445)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:162)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:116)
org.apache.geode.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:1082)
org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6141)
2022-12-20 02:10:29.397 [ThreadsMonitor] WARN org.apache.geode.internal.monitoring.ThreadsMonitoringProcess - Thread 646 (0x286) is stuck
2022-12-20 02:10:29.413 [ThreadsMonitor] WARN org.apache.geode.internal.monitoring.executor.AbstractExecutor - Thread <646> (0x286) that was executed at <20 Dec 2022 02:09:16 CST> has been stuck for <72.891 seconds> and number of thread monitor iteration <1>
Thread Name <Pooled Serial Message Processor2-1> state <WAITING>
Waiting on <java.util.concurrent.locks.ReentrantReadWriteLock$FairSync#1ce290a7>
Owned By <ServerConnection on port 42973 Thread 63> with ID <930>
Executor Group <SerialQueuedExecutorWithDMStats>
Monitored metric <ResourceManagerStats.numThreadsStuck>
Thread stack:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:943)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderQueue.put(SerialGatewaySenderQueue.java:223)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.queuePrimaryEvent(SerialGatewaySenderEventProcessor.java:477)
org.apache.geode.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.enqueueEvent(SerialGatewaySenderEventProcessor.java:445)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:162)
org.apache.geode.internal.cache.wan.serial.ConcurrentSerialGatewaySenderEventProcessor.enqueueEvent(ConcurrentSerialGatewaySenderEventProcessor.java:116)
org.apache.geode.internal.cache.wan.AbstractGatewaySender.distribute(AbstractGatewaySender.java:1082)
org.apache.geode.internal.cache.LocalRegion.notifyGatewaySender(LocalRegion.java:6141)
org.apache.geode.internal.cache.LocalRegion.basicPutPart2(LocalRegion.java:5777)
org.apache.geode.internal.cache.map.RegionMapPut.doBeforeCompletionActions(RegionMapPut.java:282)
org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:301)
There are 86 stuck threads in this node
The cluster uses a WAN gateway in a two-way data transmission architecture.
When I restart the primary sending gateway, the cluster fails: it cannot write data and threads get stuck. Over time, 86 threads end up stuck in this node. After 57 minutes all of the stuck threads recover, but throughout that period the log keeps reporting exceptions and the cluster cannot recover on its own.
I hope to learn what causes this and to find a quick way to recover.
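For context, the serial sender in these logs would have been created along the lines of the following Java sketch, reconstructed from the log output above (id sender1, remote DS id 2, five dispatcher threads matching the sender1.0-sender1.4 event processors, batch size 100; the KEY order policy is an assumption):

import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.wan.GatewaySender;
import org.apache.geode.cache.wan.GatewaySenderFactory;

public class SenderSetup {
    public static void main(String[] args) {
        Cache cache = new CacheFactory().create();

        GatewaySenderFactory factory = cache.createGatewaySenderFactory();
        factory.setParallel(false);       // serial sender, as in the logs
        factory.setDispatcherThreads(5);  // logs show processors sender1.0 .. sender1.4
        // A serial sender with multiple dispatcher threads needs an order policy;
        // KEY is an assumption here.
        factory.setOrderPolicy(GatewaySender.OrderPolicy.KEY);
        factory.setBatchSize(100);        // batchSize=100 appears in the failover log lines

        GatewaySender sender = factory.create("sender1", 2); // id and remote DS id from the logs
    }
}

The stack traces show every dispatcher blocking on the same write lock in SerialGatewaySenderQueue.put, so more dispatcher threads alone would not remove that contention point.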

Losing connection to Kafka. What happens?

A JobManager and a TaskManager are running on a single VM; Kafka also runs on the same server.
I have 10 tasks; all read from different Kafka topics, process the messages, and write back to Kafka.
Sometimes I find my task manager is down and nothing is working. I tried to figure out the problem by checking the logs, and I believe it is a problem with the Kafka connection (or maybe a network problem? But everything is on a single server).
What I want to ask is: what happens if I lose the connection to Kafka for a short period? Why do the tasks fail, and most importantly, why does the task manager crash?
Some logs:
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-8] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,626 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=cpualgosgroup1-1, groupId=cpualgosgroup1] Disconnecting from node 0 due to request timeout.
2022-11-26 23:35:15,692 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=telefilter1-0, groupId=telefilter1] Cancelled in-flight FETCH request with correlation id 3630156 due to node 0 being disconnected (elapsed time since creation: 61648ms, elapsed time since send: 61648ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Cancelled in-flight PRODUCE request with correlation id 2159429 due to node 0 being disconnected (elapsed time since creation: 51069ms, elapsed time since send: 51069ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Consumer clientId=cpualgosgroup1-1, groupId=cpualgosgroup1] Cancelled in-flight FETCH request with correlation id 2344708 due to node 0 being disconnected (elapsed time since creation: 51184ms, elapsed time since send: 51184ms, request timeout: 30000ms)
2022-11-26 23:35:15,702 INFO org.apache.kafka.clients.NetworkClient [] - [Producer clientId=producer-15] Cancelled in-flight PRODUCE request with correlation id 2159430 due to node 0 being disconnected (elapsed time since creation: 51069ms, elapsed time since send: 51069ms, request timeout: 30000ms)
2022-11-26 23:35:15,842 WARN org.apache.kafka.clients.producer.internals.Sender [] - [Producer clientId=producer-15] Received invalid metadata error in produce request on partition tele.alerts.cpu-4 due to org.apache.kafka.common.errors.NetworkException: Disconnected from node 0. Going to request metadata update now
2022-11-26 23:35:15,842 WARN org.apache.kafka.clients.producer.internals.Sender [] - [Producer clientId=producer-8] Received invalid metadata error in produce request on partition tele.alerts.cpu-6 due to org.apache.kafka.common.errors.NetworkException: Disconnected from node 0. Going to request metadata update now
and then
2022-11-26 23:35:56,673 WARN org.apache.flink.runtime.taskmanager.Task [] - CPUTemperatureAnalysisAlgorithm -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_d0ae1ab03e621ff140fb6b0b0a2932f9_0_0) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for 8d57994a59ab86ea9ee48076e80a7c7f.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1702)
...
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 99d52303d7e24496ae661ddea2b6a372 timed out.
2022-11-26 23:35:56,682 INFO org.apache.flink.runtime.taskmanager.Task [] - Triggering cancellation of task code CPUTemperatureAnalysisAlgorithm -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_d0ae1ab03e621ff140fb6b0b0a2932f9_0_0).
2022-11-26 23:35:57,199 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to fail task externally TemperatureAnalysis -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_15071110d0eea9f1c7f3d75503ff58eb_0_0).
2022-11-26 23:35:57,202 WARN org.apache.flink.runtime.taskmanager.Task [] - TemperatureAnalysis -> Sink: Writer -> Sink: Committer (1/1)#0 (619139347a459b6de22089ff34edff39_15071110d0eea9f1c7f3d75503ff58eb_0_0) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for 8d57994a59ab86ea9ee48076e80a7c7f.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1702)
Why does the task executor lose its connection to the JobManager?
If I don't care about losing data, how should I configure the Kafka clients and Flink's recovery? I just want the Kafka client not to die, and above all I don't want my tasks or task managers to crash. If I lose the connection, is it possible to configure Flink to simply wait? If we can't read, wait; and if we can't write back to Kafka, just wait?
The heartbeat of JobManager with id 99d52303d7e24496ae661ddea2b6a372 timed out.
Sounds like the server is somewhat overloaded. But you could try increasing the heartbeat timeout.
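A minimal sketch of both knobs (the heartbeat options correspond to heartbeat.timeout / heartbeat.interval in flink-conf.yaml, which is where they belong on a standalone cluster; the numbers are illustrative, not tuned recommendations):

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.HeartbeatManagerOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HeartbeatAndRestartConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Equivalent to heartbeat.timeout / heartbeat.interval in flink-conf.yaml.
        conf.set(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT, 120_000L); // default is 50 s
        conf.set(HeartbeatManagerOptions.HEARTBEAT_INTERVAL, 10_000L);

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);

        // Keep retrying failed tasks instead of letting the whole job die.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                Integer.MAX_VALUE,   // restart attempts
                Time.seconds(10)));  // delay between attempts

        // ... build the Kafka sources/sinks and execute as usual ...
    }
}

Note that a Kafka client that cannot reach the broker will still throw once its own timeouts expire; the restart strategy is what turns that into a retry loop rather than a dead job.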

Large volume of scheduled messages seem to get stuck on ActiveMQ Artemis broker

I am using Apache ActiveMQ Artemis 2.17.0 to store a few million scheduled messages. Due to the volume of messages, paging is triggered and almost half of the messages are stored on a shared filesystem (master-slave shared-filesystem store (NFSv4) HA topology). These messages are scheduled every X hours, and each "batch" is around 500k messages, with each individual message a bit larger than 1 KB.
In essence, my use case dictates producing 4-5 million messages at some point near midnight, all scheduled to leave the next day in bunches at predefined periods (e.g. 11 a.m., 3 p.m., 6 p.m.). The messages are not produced in scheduled-time order: messages for the 6 p.m. timeslot can be written to the queue before messages for earlier slots, so scheduled messages can be interleaved. Also, since the volume of messages is pretty large, I can see that the address memory used is maxing out and multiple files are created in the paging directory for the queue.
My issue appears when my JMS application starts to consume messages from the specified queue: although it starts consuming data very fast, at some point it blocks and becomes unresponsive. When I check the broker's logs I see the following:
2021-03-31 15:26:03,520 WARN [org.apache.activemq.artemis.utils.critical.CriticalMeasure] Component org.apache.activemq.artemis.core.server.impl.QueueImpl
is expired on path 3: java.lang.Exception: entered
at org.apache.activemq.artemis.utils.critical.CriticalMeasure.enterCritical(CriticalMeasure.java:56) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.critical.CriticalComponentImpl.enterCritical(CriticalComponentImpl.java:52) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.server.impl.QueueImpl.addConsumer(QueueImpl.java:1403) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.server.impl.ServerConsumerImpl.<init>(ServerConsumerImpl.java:262) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.createConsumer(ServerSessionImpl.java:569) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.slowPacketHandler(ServerSessionPacketHandler.java:328) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.onMessagePacket(ServerSessionPacketHandler.java:292) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.Actor.doTask(Actor.java:33) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65) [artemis-commons-2.17.0.jar:2.17.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_262]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_262]
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.17.0.jar:2.17.0]
2021-03-31 15:26:03,525 ERROR [org.apache.activemq.artemis.core.server] AMQ224079: The process for the virtual machine will be killed, as component
QueueImpl[name=my-queue, postOffice=PostOfficeImpl [server=ActiveMQServerImpl::serverUUID=f3fddf74-9212-11eb-9a18-005056b570b4],
temp=false]#5a4be15a is not responsive
2021-03-31 15:26:03,980 WARN [org.apache.activemq.artemis.core.server] AMQ222199: Thread dump: **********
The broker halts and the slave broker takes over, but the scheduled messages are still hanging on the queue.
When restarting the master broker I see logs like the ones below:
2021-03-31 15:59:41,810 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session f558ac8f-9220-11eb-98a4-005056b5d5f6
2021-03-31 15:59:41,814 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /ip-app:52922 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-03-31 16:01:14,163 WARN [org.apache.activemq.artemis.core.server] AMQ222172: Queue my-queue was busy for more than 10,000 milliseconds. There are possibly consumers hanging on a network operation
2021-03-31 16:01:14,163 WARN [org.apache.activemq.artemis.core.server] AMQ222144: Queue could not finish waiting executors. Try increasing the thread pool size
Looking at CPU and memory metrics I do not see anything unusual: CPU at the time of consuming is less than 50% of the maximum load, and memory on the broker host is at similar levels (60% used). I/O is rather insignificant. What may be helpful is that the number of blocked threads increases sharply just before the error (0 -> 40). Heap memory is also maxed out, but as far as I can tell there is no out-of-the-ordinary GC activity.
This was reproduced for messages scheduled to leave at 2:30 p.m.
Below is also the part of the thread dump showing the BLOCKED and TIMED_WAITING threads:
"Thread-2 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6#2a54a73f)" Id=44 TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject#10e20f4f
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject#10e20f4f
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
at org.apache.activemq.artemis.utils.ActiveMQThreadPoolExecutor$ThreadPoolQueue.poll(ActiveMQThreadPoolExecutor.java:112)
at org.apache.activemq.artemis.utils.ActiveMQThreadPoolExecutor$ThreadPoolQueue.poll(ActiveMQThreadPoolExecutor.java:45)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
"Thread-1 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6#2a54a73f)" Id=43 BLOCKED on org.apache.activemq.artemis.core.server.impl.QueueImpl#64e9ee3c owned by "Thread-3 (ActiveMQ-scheduled-threads)" Id=24
at org.apache.activemq.artemis.core.server.impl.RefsOperation.afterCommit(RefsOperation.java:182)
- blocked on org.apache.activemq.artemis.core.server.impl.QueueImpl#64e9ee3c
at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl.afterCommit(TransactionImpl.java:579)
- locked org.apache.activemq.artemis.core.transaction.impl.TransactionImpl#26fb9cb9
at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl.access$100(TransactionImpl.java:40)
at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl$2.done(TransactionImpl.java:322)
at org.apache.activemq.artemis.core.persistence.impl.journal.OperationContextImpl$1.run(OperationContextImpl.java:279)
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$30/1259174396.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker#535779e4
"Thread-3 (ActiveMQ-scheduled-threads)" Id=24 RUNNABLE
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at org.apache.activemq.artemis.core.io.nio.NIOSequentialFile.open(NIOSequentialFile.java:143)
at org.apache.activemq.artemis.core.io.nio.NIOSequentialFile.open(NIOSequentialFile.java:98)
- locked org.apache.activemq.artemis.core.io.nio.NIOSequentialFile#520b145f
at org.apache.activemq.artemis.core.paging.cursor.impl.PageReader.openPage(PageReader.java:114)
at org.apache.activemq.artemis.core.paging.cursor.impl.PageReader.getMessage(PageReader.java:83)
at org.apache.activemq.artemis.core.paging.cursor.impl.PageReader.getMessage(PageReader.java:105)
- locked org.apache.activemq.artemis.core.paging.cursor.impl.PageReader#669a8420
at org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.getMessage(PageCursorProviderImpl.java:151)
at org.apache.activemq.artemis.core.paging.cursor.impl.PageSubscriptionImpl.queryMessage(PageSubscriptionImpl.java:634)
at org.apache.activemq.artemis.core.paging.cursor.PagedReferenceImpl.getPagedMessage(PagedReferenceImpl.java:132)
- locked org.apache.activemq.artemis.core.paging.cursor.PagedReferenceImpl#3bfc8d39
at org.apache.activemq.artemis.core.paging.cursor.PagedReferenceImpl.getMessage(PagedReferenceImpl.java:99)
at org.apache.activemq.artemis.core.paging.cursor.PagedReferenceImpl.getMessageMemoryEstimate(PagedReferenceImpl.java:186)
at org.apache.activemq.artemis.core.server.impl.QueueImpl.internalAddHead(QueueImpl.java:2839)
at org.apache.activemq.artemis.core.server.impl.QueueImpl.addHead(QueueImpl.java:1102)
- locked org.apache.activemq.artemis.core.server.impl.QueueImpl#64e9ee3c
at org.apache.activemq.artemis.core.server.impl.QueueImpl.addHead(QueueImpl.java:1138)
- locked org.apache.activemq.artemis.core.server.impl.QueueImpl#64e9ee3c
at org.apache.activemq.artemis.core.server.impl.ScheduledDeliveryHandlerImpl$ScheduledDeliveryRunnable.run(ScheduledDeliveryHandlerImpl.java:264)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker#11f0a5a1
Note also that I did try increasing the memory resources on the broker so as to avoid paging messages to disk, and doing so made the problem disappear. But since my message volume is going to be erratic, I do not see that as a long-term solution. Can you give me any pointers on how to resolve this issue? How can I cope with large volumes of paged data stored in the broker that need to be released in large chunks to consumers?
Edit: after increasing the number of scheduled threads
With an increased number of scheduled threads the critical analyzer did not terminate the broker, but I got constant warnings like the ones below:
2021-04-14 17:48:26,818 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 4606893a-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,818 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 460eedac-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,818 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 460eedac-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,818 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 46194def-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 46194def-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 4620ef13-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 4620ef13-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 46289036-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 46289036-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 562d6a93-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 562d6a93-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 56324c96-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 56324c96-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,838 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /my-host:47392 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-04-14 17:48:26,840 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /my-host:47392 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-04-14 17:48:26,855 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /my-host:47392 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-04-14 17:48:26,864 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /my-host:47392 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-04-14 17:49:26,804 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 82978142-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:49:26,804 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 82978142-9d30-11eb-9b31-005056b5d5f6
Traffic on my consumer side then showed spikes and dips that essentially crippled throughput. Note that more than 80% of the messages were already in memory and only a small portion was paged to disk.
I think the two most important things for your use case are going to be:
Avoid paging. Paging is a palliative measure, meant as a last resort to keep the broker functioning. If at all possible you should configure your broker to handle your load without paging (e.g. acquire more RAM, allocate more heap). It's worth noting that the broker is not designed like a database; it is designed for messages to flow through it. It can certainly buffer messages (potentially millions, depending on the configuration and hardware), but when it is forced to page, performance drops substantially, simply because disk is orders of magnitude slower than RAM.
Increase scheduled-thread-pool-max-size. Dumping this many scheduled messages on the broker is going to put tremendous pressure on the scheduled thread pool. The default size is only 5. I suggest you increase that until you stop seeing performance benefits. See the broker.xml sketch below.
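A sketch of both changes in broker.xml (the element names are standard Artemis core configuration; the sizes are placeholders to tune against your heap and backlog, not recommendations):

<core xmlns="urn:activemq:core">
   <!-- Default is 5; the ScheduledDeliveryHandlerImpl runnables in the thread dump run on this pool. -->
   <scheduled-thread-pool-max-size>64</scheduled-thread-pool-max-size>

   <!-- Headroom so the scheduled backlog stays in memory instead of paging. -->
   <global-max-size>8GB</global-max-size>

   <address-settings>
      <address-setting match="my-queue">
         <!-- -1 defers to global-max-size rather than paging per address. -->
         <max-size-bytes>-1</max-size-bytes>
      </address-setting>
   </address-settings>
</core>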

ms has passed since batch creation plus linger time

I am using SQL 2.4.1v as the consumer and Spring Boot as the producer for my Kafka topic. While trying to insert records onto the topic, I get the following error messages:
WARN 12044 --- [ad | producer-1] o.a.k.clients.producer.internals.Sender : [Producer clientId=producer-1] Got error produce response with correlation id 4457 on topic-partition TRANS_INBOUND-20, retrying (0 attempts left). Error: NETWORK_EXCEPTION
WARN 12044 --- [ad | producer-1] o.a.k.clients.producer.internals.Sender : [Producer clientId=producer-1] Received invalid metadata error in produce request on partition TRANS_INBOUND-20 due to org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.. Going to request metadata update now
ERROR 12044 --- [ad | producer-1] o.s.k.support.LoggingProducerListener : Exception thrown when sending a message with key='1-356194-2018-01-02-STATUS' and payload='com.TransRecord#48323556' to topic TRANS_INBOUND:
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for TRANS_INBOUND-82: 501 ms has passed since batch creation plus linger time
Following are my producer settings:
acks: 1
retries: 1
batchSize: 100
lingerMs: 5
bufferMemory: 33554432
requestTimeoutMs: 600
autoOffsetReset: latest
enableAutoCommit: false
reconnectBackoffMaxMs: 1000
reconnectBackoffMs: 50
retryBackoffMs: 100
I have already tried many combinations of batchSize, lingerMs, and requestTimeoutMs; nothing is working. I still see the above errors quite often. What might be wrong here, and how can I fix this problem?
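For comparison, here is a sketch of the same producer using the stock Kafka timeouts (standard ProducerConfig keys; the broker address and serializers are placeholders, since the real payload is a com.TransRecord object). Two things stand out in the settings above: requestTimeoutMs: 600 is far below the broker default of 30,000 ms, so any brief broker stall expires in-flight batches, and batchSize is measured in bytes, so 100 effectively means one record per batch:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerDefaults {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName()); // placeholder
        props.put(ProducerConfig.ACKS_CONFIG, "1");
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);  // broker default; was 600 above
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16_384);          // default 16 KB; batch.size is in bytes
        props.put(ProducerConfig.RETRIES_CONFIG, 10);                 // was 1; gives transient network errors a chance
        // On kafka-clients 2.1+, delivery.timeout.ms (default 120 s) bounds the whole send
        // and must be >= linger.ms + request.timeout.ms.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send records here
        }
    }
}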