Prysm.sh beacon-chain stopped synchronization - How to fix?

I'm trying to sync my Geth node, but it's stuck.
I see the following errors in Prysm:
[2022-12-30 08:24:56] INFO p2p: Peer summary activePeers=42 inbound=0 outbound=42
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=could not process block in batch: could not set node to invalid: invalid nil or unknown node
[2022-12-30 08:24:59] INFO initial-sync: Processing block batch of size 64 starting from 0x87ca0daa... 5192464/5463723 - estimated time remaining 7h50m56s blocksPerSecond=9.6 peers=41
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x1a0877b34ccd9e92bdfe7859dbd3478d1c785fd6e970d8b8b637d5c457efec83 (in processBatchedBlocks, slot=5192464)
[2022-12-30 08:24:59] INFO initial-sync: Processing block batch of size 64 starting from 0xa48361f2... 5192528/5463723 - estimated time remaining 5h53m7s blocksPerSecond=12.8 peers=41
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xc569dd3f4241031b835bd7dd528d2337cca5b0c8315ae55f56b810d1b0a1aa40 (in processBatchedBlocks, slot=5192528)
[2022-12-30 08:24:59] INFO initial-sync: Processing block batch of size 61 starting from 0x06de2537... 5192592/5463723 - estimated time remaining 4h45m6s blocksPerSecond=15.8 peers=41
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xf9a776157c86f918aa82c39a69a2a880b0eb42a692bc6e1a97b3fe2fce73c916 (in processBatchedBlocks, slot=5192592)
[2022-12-30 08:24:59] INFO initial-sync: Processing block batch of size 62 starting from 0x865d9e60... 5192656/5463723 - estimated time remaining 3h58m24s blocksPerSecond=18.9 peers=41
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x03140ee59185dc417bf95c370ed4070458855da01b54d0b2186d9e11fd328314 (in processBatchedBlocks, slot=5192656)
[2022-12-30 08:25:09] INFO initial-sync: Processing block batch of size 64 starting from 0x33e52bbd... 5192336/5463723 - estimated time remaining 3h58m41s blocksPerSecond=18.9 peers=44
[2022-12-30 08:25:16] WARN initial-sync: Skip processing batched blocks error=could not process block in batch: could not set node to invalid: invalid nil or unknown node
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 62 starting from 0x0ed9c790... 5192400/5463724 - estimated time remaining 11h57m47s blocksPerSecond=6.3 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x4f227a74a20bebe01369ac220aedacd7e4c986e8f569694f3c643da0cc9cfe83 (in processBatchedBlocks, slot=5192400)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 64 starting from 0x87ca0daa... 5192464/5463724 - estimated time remaining 7h55m53s blocksPerSecond=9.5 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x1a0877b34ccd9e92bdfe7859dbd3478d1c785fd6e970d8b8b637d5c457efec83 (in processBatchedBlocks, slot=5192464)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 64 starting from 0xa48361f2... 5192528/5463724 - estimated time remaining 5h55m54s blocksPerSecond=12.7 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xc569dd3f4241031b835bd7dd528d2337cca5b0c8315ae55f56b810d1b0a1aa40 (in processBatchedBlocks, slot=5192528)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 61 starting from 0x06de2537... 5192592/5463724 - estimated time remaining 4h46m54s blocksPerSecond=15.8 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xf9a776157c86f918aa82c39a69a2a880b0eb42a692bc6e1a97b3fe2fce73c916 (in processBatchedBlocks, slot=5192592)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 62 starting from 0x865d9e60... 5192656/5463724 - estimated time remaining 3h59m40s blocksPerSecond=18.9 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x03140ee59185dc417bf95c370ed4070458855da01b54d0b2186d9e11fd328314 (in processBatchedBlocks, slot=5192656)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 64 starting from 0x74b66233... 5192720/5463724 - estimated time remaining 3h24m50s blocksPerSecond=22.1 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xed2e36835d677750e41fecc7a16c7d8669fd8a8ff629c9480b575d6d27c26085 (in processBatchedBlocks, slot=5192720)
[2022-12-30 08:25:27] INFO initial-sync: Processing block batch of size 64 starting from 0x33e52bbd... 5192336/5463725 - estimated time remaining 2h59m8s blocksPerSecond=25.2 peers=45
[2022-12-30 08:25:33] WARN initial-sync: Skip processing batched blocks error=could not process block in batch: could not set node to invalid: invalid nil or unknown node
And this is what I see in Geth:
ERROR[12-30|08:30:48.985] Error in block freeze operation err="block receipts missing, can't freeze block 15850304"
WARN [12-30|08:30:55.145] Previously seen beacon client is offline. Please ensure it is operational to follow the chain!
WARN [12-30|08:30:56.688] Ignoring already known beacon payload number=16,026,540 hash=145907..4f081f age=1mo1w16h
WARN [12-30|08:30:56.696] Ignoring already known beacon payload number=16,026,541 hash=6acd62..e3bac2 age=1mo1w16h
WARN [12-30|08:30:56.701] Ignoring already known beacon payload number=16,026,542 hash=1da1d8..c55a69 age=1mo1w16h
WARN [12-30|08:30:56.708] Ignoring already known beacon payload number=16,026,543 hash=762b24..f56957 age=1mo1w16h
WARN [12-30|08:30:56.719] Ignoring already known beacon payload number=16,026,544 hash=ff1aef..389471 age=1mo1w16h
WARN [12-30|08:30:56.725] Ignoring already known beacon payload number=16,026,545 hash=6767aa..85bf1d age=1mo1w16h
WARN [12-30|08:30:56.730] Ignoring already known beacon payload number=16,026,546 hash=95b736..dc2456 age=1mo1w16h
WARN [12-30|08:30:56.732] Ignoring already known beacon payload number=16,026,547 hash=34e43f..777810 age=1mo1w16h
WARN [12-30|08:30:56.742] Ignoring already known beacon payload number=16,026,548 hash=1c67b8..cbc356 age=1mo1w16h
WARN [12-30|08:30:56.750] Ignoring already known beacon payload number=16,026,549 hash=fe9e47..ed347e age=1mo1w16h
WARN [12-30|08:30:56.754] Ignoring already known beacon payload number=16,026,550 hash=c98bf1..40560a age=1mo1w16h
WARN [12-30|08:30:56.772] Ignoring already known beacon payload number=16,026,551 hash=f55377..a1582e age=1mo1w16h
WARN [12-30|08:30:56.780] Ignoring already known beacon payload number=16,026,552 hash=0bf769..af0ed8 age=1mo1w16h
WARN [12-30|08:30:56.784] Ignoring already known beacon payload number=16,026,553 hash=382866..a5a4f8 age=1mo1w16h
WARN [12-30|08:30:56.907] Ignoring already known beacon payload number=16,026,554 hash=65d2ff..6ebef5 age=1mo1w16h
WARN [12-30|08:30:56.918] Ignoring already known beacon payload number=16,026,555 hash=f04209..4779e9 age=1mo1w16h
WARN [12-30|08:30:56.935] Ignoring already known beacon payload number=16,026,556 hash=f2b1ab..373dc0 age=1mo1w16h
WARN [12-30|08:30:56.943] Ignoring already known beacon payload number=16,026,557 hash=979712..00891d age=1mo1w16h
WARN [12-30|08:30:56.956] Ignoring already known beacon payload number=16,026,558 hash=b53705..8483a1 age=1mo1w16h
WARN [12-30|08:30:56.979] Ignoring already known beacon payload number=16,026,559 hash=61e689..7e7c79 age=1mo1w16h
WARN [12-30|08:30:56.994] Ignoring already known beacon payload number=16,026,560 hash=a0f45b..802daf age=1mo1w16h
WARN [12-30|08:30:57.004] Ignoring already known beacon payload number=16,026,561 hash=037435..474e8d age=1mo1w16h
WARN [12-30|08:30:57.009] Ignoring already known beacon payload number=16,026,562 hash=565f15..bf9980 age=1mo1w16h
WARN [12-30|08:30:57.031] Ignoring already known beacon payload number=16,026,563 hash=c7f6ef..cc5ddf age=1mo1w16h
WARN [12-30|08:30:57.033] Ignoring already known beacon payload number=16,026,564 hash=c87d53..223987 age=1mo1w16h
WARN [12-30|08:30:57.068] Ignoring already known beacon payload number=16,026,565 hash=51f821..fc1a26 age=1mo1w16h
Has anyone encountered similar problems before?
I already tried syncing again; it didn't help.

Related

How to fix prysm.sh beacon-chain stopped at 99% progress?

Geth stopped working after the Merge. When I tried to fix it by running beacon-chain, the beacon node's progress stopped at 99%.
command:
prysm.sh beacon-chain --execution-endpoint=http://localhost:8551 --datadir=/disk1/prysm/.eth2
logs:
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 63 starting from 0x35967a9b... 4700032/4737924 - estimated time remaining 22m9s blocksPerSecond=28.5 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x97b0b7b53582569689c52dbee87990ea2d7a94b17ee823e704c99d07e81b5376 (in processBatchedBlocks, slot=4700032)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 63 starting from 0x063c579d... 4700096/4737924 - estimated time remaining 19m55s blocksPerSecond=31.6 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x91593f4ce1da4260a4475807af54ada66481b2e5529859fbcdd636c59966ac5d (in processBatchedBlocks, slot=4700096)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 62 starting from 0xe5a59df5... 4700160/4737924 - estimated time remaining 18m6s blocksPerSecond=34.8 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x099ce628bdb98cd34673e06f779a695a9fa903472f95f778a823c4b271296669 (in processBatchedBlocks, slot=4700160)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 64 starting from 0xf0b0a565... 4700224/4737924 - estimated time remaining 16m33s blocksPerSecond=38.0 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x6a644e5ce7eb9063ac0334eb070469ffe1babef71b42fc295a0098410c8509ff (in processBatchedBlocks, slot=4700224)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 64 starting from 0x4aef416e... 4700288/4737924 - estimated time remaining 15m14s blocksPerSecond=41.1 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xa165b5186776ed6adbeffbe7f9861a25cfe9e9a79b79fbf63c44f0f3f0fd2433 (in processBatchedBlocks, slot=4700288)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 64 starting from 0xe2ad65e3... 4700352/4737924 - estimated time remaining 14m7s blocksPerSecond=44.4 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x0d5b8ab2983591a9dd27b6b6b99540f75de7a6b7f88dfe6dd83ac5e8316b0d79 (in processBatchedBlocks, slot=4700352)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 63 starting from 0xe08fde61... 4700416/4737924 - estimated time remaining 13m9s blocksPerSecond=47.5 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xb493b115a9e7dadff196d1fd9092c477b3503a148983a6cd37111a00ba526862 (in processBatchedBlocks, slot=4700416)
I'm getting the same error. Check your Geth execution node; I think it is stuck because of this Geth error (a quick check is sketched after these logs):
WARN [09-21|18:45:48.100] Ignoring already known beacon payload number=15,580,285 hash=d2b656..2c59c3 age=3h23m49s
My beacon-chain produced these errors/warnings:
[2022-09-21 18:48:26] WARN powchain: Execution client is not syncing
[2022-09-21 18:48:37] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x43b75c52f29244ca3caee03c5d3dbc52ac19e12d1fd8d2ae3e28c358719cb028 (in processBatchedBlocks, slot=4743744)
My eth.blockNumber is stuck at 15,580,285.
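For reference, a quick way to check from the shell whether the execution client is actually syncing. The HTTP endpoint below is an assumption (it requires Geth's HTTP JSON-RPC to be enabled on localhost:8545); otherwise attach to your geth.ipc socket instead.

# Ask Geth whether it is syncing and what its current head block is.
geth attach --exec 'eth.syncing' http://localhost:8545
geth attach --exec 'eth.blockNumber' http://localhost:8545

# Equivalent raw JSON-RPC call, if you prefer curl:
curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
  http://localhost:8545

If eth.syncing returns false while eth.blockNumber stays far behind the chain head, the execution client has stalled rather than syncing, which matches the "Execution client is not syncing" warning from the beacon node.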

Large volume of scheduled messages seems to get stuck on ActiveMQ Artemis broker

I am using Apache ActiveMQ Artemis 2.17.0 to store a few million scheduled messages. Due to the volume of messages, paging is
triggered and almost half of the messages are stored on a shared filesystem (master-slave shared-filesystem store (NFSv4) HA topology).
These messages are scheduled every X hours and each "batch" is around 500k messages with the size of each individual
message a bit larger than 1KB.
In essence, my use case requires producing 4-5 million messages at some point near midnight, scheduled to leave the next day in bunches at predefined times (e.g. 11 a.m., 3 p.m., 6 p.m.). The messages are not produced in order of scheduled time: messages for the 6 p.m. slot can be written to the queue before messages for earlier slots, so scheduled messages end up interleaved. Also, since the volume of messages is quite large, I
can see that the address memory used is maxing out and multiple files are created in the paging directory for the queue.
My issue appears when my JMS application starts consuming messages from the specified queue: although it starts consuming data very fast, at some point it blocks and becomes unresponsive. When I check the broker's logs I see the following:
2021-03-31 15:26:03,520 WARN [org.apache.activemq.artemis.utils.critical.CriticalMeasure] Component org.apache.activemq.artemis.core.server.impl.QueueImpl
is expired on path 3: java.lang.Exception: entered
at org.apache.activemq.artemis.utils.critical.CriticalMeasure.enterCritical(CriticalMeasure.java:56) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.critical.CriticalComponentImpl.enterCritical(CriticalComponentImpl.java:52) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.server.impl.QueueImpl.addConsumer(QueueImpl.java:1403) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.server.impl.ServerConsumerImpl.<init>(ServerConsumerImpl.java:262) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.createConsumer(ServerSessionImpl.java:569) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.slowPacketHandler(ServerSessionPacketHandler.java:328) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.onMessagePacket(ServerSessionPacketHandler.java:292) [artemis-server-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.Actor.doTask(Actor.java:33) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) [artemis-commons-2.17.0.jar:2.17.0]
at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65) [artemis-commons-2.17.0.jar:2.17.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_262]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_262]
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.17.0.jar:2.17.0]
2021-03-31 15:26:03,525 ERROR [org.apache.activemq.artemis.core.server] AMQ224079: The process for the virtual machine will be killed, as component
QueueImpl[name=my-queue, postOffice=PostOfficeImpl [server=ActiveMQServerImpl::serverUUID=f3fddf74-9212-11eb-9a18-005056b570b4],
temp=false]#5a4be15a is not responsive
2021-03-31 15:26:03,980 WARN [org.apache.activemq.artemis.core.server] AMQ222199: Thread dump: **********
The broker halts and the slave broker takes over, but the scheduled messages are still hanging on the queue.
When restarting the master broker I see logs like the ones below:
2021-03-31 15:59:41,810 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session f558ac8f-9220-11eb-98a4-005056b5d5f6
2021-03-31 15:59:41,814 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /ip-app:52922 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-03-31 16:01:14,163 WARN [org.apache.activemq.artemis.core.server] AMQ222172: Queue my-queue was busy for more than 10,000 milliseconds. There are possibly consumers hanging on a network operation
2021-03-31 16:01:14,163 WARN [org.apache.activemq.artemis.core.server] AMQ222144: Queue could not finish waiting executors. Try increasing the thread pool size
Looking at CPU and memory metrics I do not see anything unusual: CPU at the time of consuming is less than 50% of the maximum load, and memory on the broker host is at similar levels (60% used). I/O is rather insignificant. What may be helpful is that the number of blocked threads increases sharply just before the error (0 -> 40). Heap memory is also maxed out, but I do not see any GC activity out of the ordinary as far as I can tell.
This figure was captured after reproducing the issue for messages scheduled to leave at 2:30 p.m.
Here is also part of a thread dump showing BLOCKED and TIMED_WAITING threads:
"Thread-2 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6#2a54a73f)" Id=44 TIMED_WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject#10e20f4f
at sun.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject#10e20f4f
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
at org.apache.activemq.artemis.utils.ActiveMQThreadPoolExecutor$ThreadPoolQueue.poll(ActiveMQThreadPoolExecutor.java:112)
at org.apache.activemq.artemis.utils.ActiveMQThreadPoolExecutor$ThreadPoolQueue.poll(ActiveMQThreadPoolExecutor.java:45)
at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
"Thread-1 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6#2a54a73f)" Id=43 BLOCKED on org.apache.activemq.artemis.core.server.impl.QueueImpl#64e9ee3c owned by "Thread-3 (ActiveMQ-scheduled-threads)" Id=24
at org.apache.activemq.artemis.core.server.impl.RefsOperation.afterCommit(RefsOperation.java:182)
- blocked on org.apache.activemq.artemis.core.server.impl.QueueImpl#64e9ee3c
at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl.afterCommit(TransactionImpl.java:579)
- locked org.apache.activemq.artemis.core.transaction.impl.TransactionImpl#26fb9cb9
at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl.access$100(TransactionImpl.java:40)
at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl$2.done(TransactionImpl.java:322)
at org.apache.activemq.artemis.core.persistence.impl.journal.OperationContextImpl$1.run(OperationContextImpl.java:279)
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:65)
at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$30/1259174396.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker#535779e4
"Thread-3 (ActiveMQ-scheduled-threads)" Id=24 RUNNABLE
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at org.apache.activemq.artemis.core.io.nio.NIOSequentialFile.open(NIOSequentialFile.java:143)
at org.apache.activemq.artemis.core.io.nio.NIOSequentialFile.open(NIOSequentialFile.java:98)
- locked org.apache.activemq.artemis.core.io.nio.NIOSequentialFile#520b145f
at org.apache.activemq.artemis.core.paging.cursor.impl.PageReader.openPage(PageReader.java:114)
at org.apache.activemq.artemis.core.paging.cursor.impl.PageReader.getMessage(PageReader.java:83)
at org.apache.activemq.artemis.core.paging.cursor.impl.PageReader.getMessage(PageReader.java:105)
- locked org.apache.activemq.artemis.core.paging.cursor.impl.PageReader#669a8420
at org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.getMessage(PageCursorProviderImpl.java:151)
at org.apache.activemq.artemis.core.paging.cursor.impl.PageSubscriptionImpl.queryMessage(PageSubscriptionImpl.java:634)
at org.apache.activemq.artemis.core.paging.cursor.PagedReferenceImpl.getPagedMessage(PagedReferenceImpl.java:132)
- locked org.apache.activemq.artemis.core.paging.cursor.PagedReferenceImpl#3bfc8d39
at org.apache.activemq.artemis.core.paging.cursor.PagedReferenceImpl.getMessage(PagedReferenceImpl.java:99)
at org.apache.activemq.artemis.core.paging.cursor.PagedReferenceImpl.getMessageMemoryEstimate(PagedReferenceImpl.java:186)
at org.apache.activemq.artemis.core.server.impl.QueueImpl.internalAddHead(QueueImpl.java:2839)
at org.apache.activemq.artemis.core.server.impl.QueueImpl.addHead(QueueImpl.java:1102)
- locked org.apache.activemq.artemis.core.server.impl.QueueImpl#64e9ee3c
at org.apache.activemq.artemis.core.server.impl.QueueImpl.addHead(QueueImpl.java:1138)
- locked org.apache.activemq.artemis.core.server.impl.QueueImpl#64e9ee3c
at org.apache.activemq.artemis.core.server.impl.ScheduledDeliveryHandlerImpl$ScheduledDeliveryRunnable.run(ScheduledDeliveryHandlerImpl.java:264)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Number of locked synchronizers = 1
- java.util.concurrent.ThreadPoolExecutor$Worker#11f0a5a1
Note also that I did try increasing the broker's memory resources so as to avoid paging messages to disk, and doing so made the problem disappear; but since my message volume is going to be erratic, I do not see that as a long-term solution.
Can you give me any pointers on how to resolve this issue? How can I cope with large volumes of paged data stored on the broker that need to be released to consumers in large chunks?
Edit: after increasing the number of scheduled threads
With an increased number of scheduled threads the critical analyzer no longer terminated the broker, but I got constant warnings like the ones below:
2021-04-14 17:48:26,818 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 4606893a-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,818 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 460eedac-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,818 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 460eedac-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,818 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 46194def-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 46194def-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 4620ef13-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 4620ef13-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 46289036-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 46289036-9d2b-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 562d6a93-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 562d6a93-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 56324c96-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,819 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 56324c96-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:48:26,838 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /my-host:47392 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-04-14 17:48:26,840 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /my-host:47392 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-04-14 17:48:26,855 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /my-host:47392 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-04-14 17:48:26,864 WARN [org.apache.activemq.artemis.core.server] AMQ222066: Reattach request from /my-host:47392 failed as there is no confirmationWindowSize configured, which may be ok for your system
2021-04-14 17:49:26,804 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 82978142-9d30-11eb-9b31-005056b5d5f6
2021-04-14 17:49:26,804 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 82978142-9d30-11eb-9b31-005056b5d5f6
Traffic on my consumer side showed spikes and dips, as shown in the following figure, which essentially crippled throughput. Note that more than 80% of the messages were already in memory and only a small portion was paged to disk.
I think the two most important things for your use case are going to be the following (a configuration sketch follows these points):
1. Avoid paging. Paging is a palliative measure meant to be used as a last resort to keep the broker functioning. If at all possible you should configure your broker to handle your load without paging (e.g. acquire more RAM, allocate more heap). It's worth noting that the broker is not designed like a database; it is designed for messages to flow through it. It can certainly buffer messages (potentially millions, depending on the configuration and hardware), but when it's forced to page, performance will drop substantially simply because disk is orders of magnitude slower than RAM.
2. Increase scheduled-thread-pool-max-size. Dumping this many scheduled messages on the broker is going to put tremendous pressure on the scheduled thread pool. The default size is only 5. I suggest you increase that until you stop seeing performance benefits.
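For reference, both knobs live in broker.xml under the <core> element; the instance path and values below are assumptions for illustration, not a verified configuration, so tune them for your hardware:

# Sketch only: instance path and values are examples, not a tested setup.
BROKER_ETC=/var/lib/artemis/mybroker/etc    # assumed Artemis instance path

# 1) More headroom before paging kicks in: raise <global-max-size> in the
#    <core> section of broker.xml (and the JVM heap in artemis.profile
#    accordingly), e.g.
#      <global-max-size>8GB</global-max-size>
# 2) A larger scheduled thread pool (default is 5) so scheduled deliveries
#    don't starve, e.g.
#      <scheduled-thread-pool-max-size>64</scheduled-thread-pool-max-size>

# Check what is currently configured:
grep -n 'global-max-size\|scheduled-thread-pool-max-size' "$BROKER_ETC/broker.xml"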

Kafka Broker Not able to start

I have a 3-node Kafka cluster. One of the brokers is not starting, and I am getting the error below. I have tried deleting the index files, but the same error keeps coming. Please help me understand what this issue is and how I can recover.
INFO [2018-09-05 11:58:49,585] kafka.log.Log:[Logging$class:info:66] - [pool-4-thread-1] - [Log partition=Topic3-15, dir=/var/lib/kafka/kafka-logs] Completed load of log with 1 segments, log start offset 11547004 and log end offset 11559178 in 1552 ms
INFO [2018-09-05 11:58:49,589] kafka.log.Log:[Logging$class:info:66] - [pool-4-thread-1] - [Log partition=Topic3-13, dir=/var/lib/kafka/kafka-logs] Recovering unflushed segment 12399433
ERROR [2018-09-05 11:58:49,591] kafka.log.LogManager:[Logging$class:error:74] - [main] - There was an error in one of the threads during logs loading: java.lang.IllegalArgumentException: inconsistent range
WARN [2018-09-05 11:58:49,591] kafka.log.Log:[Logging$class:warn:70] - [pool-4-thread-1] - [Log partition=Topic3-35, dir=/var/lib/kafka/kafka-logs] Found a corrupted index file corresponding to log file /var/lib/kafka/kafka-logs/Topic3-35/00000000000011110038.log due to Corrupt time index found, time index file (/var/lib/kafka/kafka-logs/Topic3-35/00000000000011110038.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1536129815049}, recovering segment and rebuilding index files...
INFO [2018-09-05 11:58:49,594] kafka.log.ProducerStateManager:[Logging$class:info:66] - [pool-4-thread-1] - [ProducerStateManager partition=Topic3-35] Loading producer state from snapshot file '/var/lib/kafka/kafka-logs/Topic3-35/00000000000011110038.snapshot'
ERROR [2018-09-05 11:58:49,599] kafka.server.KafkaServer:[MarkerIgnoringBase:error:159] - [main] - [KafkaServer id=2] Fatal error during KafkaServer startup. Prepare to shutdown
java.lang.IllegalArgumentException: inconsistent range
at java.util.concurrent.ConcurrentSkipListMap$SubMap.(ConcurrentSkipListMap.java:2620)
at java.util.concurrent.ConcurrentSkipListMap.subMap(ConcurrentSkipListMap.java:2078)
at java.util.concurrent.ConcurrentSkipListMap.subMap(ConcurrentSkipListMap.java:2114)
at kafka.log.Log$$anonfun$12.apply(Log.scala:1561)
at kafka.log.Log$$anonfun$12.apply(Log.scala:1560)
at scala.Option.map(Option.scala:146)
at kafka.log.Log.logSegments(Log.scala:1560)
at kafka.log.Log.kafka$log$Log$$recoverSegment(Log.scala:358)
at kafka.log.Log.recoverLog(Log.scala:448)
at kafka.log.Log.loadSegments(Log.scala:421)
at kafka.log.Log.(Log.scala:216)
at kafka.log.Log$.apply(Log.scala:1747)
at kafka.log.LogManager.kafka$log$LogManager$$loadLog(LogManager.scala:255)
at kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$11$$anonfun$apply$15$$anonfun$apply$2.apply$mcV$sp(LogManager.scala:335)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:62)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
INFO [2018-09-05 11:58:49,606] kafka.server.KafkaServer:[Logging$class:info:66] - [main] - [KafkaServer id=2] shutting down

Kafka cluster streams timeouts at high input

I'm running a Kafka cluster with 7 nodes and a lot of stream processing. At high input rates I now see infrequent errors like these in my Kafka Streams applications:
[2018-07-23 14:44:24,351] ERROR task [0_5] Error sending record to topic topic-name. No more offsets will be recorded for this task and the exception will eventually be thrown (org.apache.kafka.streams.processor.internals.RecordCollectorImpl) org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append
[2018-07-23 14:44:31,021] ERROR stream-thread [StreamThread-2] Failed to commit StreamTask 0_5 state: (org.apache.kafka.streams.processor.internals.StreamThread) org.apache.kafka.streams.errors.StreamsException: task [0_5] exception caught when producing at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121) at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129) at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:76) at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188) at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:281) at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:807) at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:794) at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:769) at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:647) at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361) Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append
[2018-07-23 14:44:31,033] ERROR stream-thread [StreamThread-2] Failed while executing StreamTask 0_5 due to flush state: (org.apache.kafka.streams.processor.internals.StreamThread) org.apache.kafka.streams.errors.StreamsException: task [0_5] exception caught when producing at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121) at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129) at org.apache.kafka.streams.processor.internals.StreamTask.flushState(StreamTask.java:423) at org.apache.kafka.streams.processor.internals.StreamThread$4.apply(StreamThread.java:555) at org.apache.kafka.streams.processor.internals.StreamThread.performOnTasks(StreamThread.java:501) at org.apache.kafka.streams.processor.internals.StreamThread.flushAllState(StreamThread.java:551) at org.apache.kafka.streams.processor.internals.StreamThread.shutdownTasksAndState(StreamThread.java:449) at org.apache.kafka.streams.processor.internals.StreamThread.shutdown(StreamThread.java:391) at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:372) Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append
[2018-07-23 14:44:31,039] WARN stream-thread [StreamThread-2] Unexpected state transition from RUNNING to NOT_RUNNING. (org.apache.kafka.streams.processor.internals.StreamThread) Exception in thread "StreamThread-2" org.apache.kafka.streams.errors.StreamsException: task [0_5] exception caught when producing at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.checkForException(RecordCollectorImpl.java:121) at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.flush(RecordCollectorImpl.java:129) at org.apache.kafka.streams.processor.internals.StreamTask$1.run(StreamTask.java:76) at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:188) at org.apache.kafka.streams.processor.internals.StreamTask.commit(StreamTask.java:281) at org.apache.kafka.streams.processor.internals.StreamThread.commitOne(StreamThread.java:807) at org.apache.kafka.streams.processor.internals.StreamThread.commitAll(StreamThread.java:794) at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:769) at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:647) at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:361) Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 13 record(s) for topic-name-3: 60060 ms has passed since last append
If I reduce the input rate (from 20k to 10k events/s) the errors go away, so I'm obviously hitting some kind of limit. I have played around with different options (request.timeout.ms, linger.ms and batch.size), but I get the same result every time.
You seem to have reached some kind of limit. Based on the message 60060 ms has passed since last append I'd assume it's writer-thread starvation due to high load, so disk would be the first thing to check (a few quick checks are sketched after this list):
disk usage - if you're reaching the write speed limit, switching from HDD to SSD might help
load distribution - is your traffic split roughly equally across all nodes?
CPU load - lots of processing can saturate the CPU as well
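A few quick ways to check those points (tool availability and the log directory path are assumptions about the broker hosts):

# Disk: look for devices pinned near 100% utilisation while the errors occur
iostat -x 5
# Space and mount point of the Kafka log directories (path is an example)
df -h /var/lib/kafka
# Rough CPU picture on each broker host
top -b -n 1 | head -n 20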
We had a similar issue.
In our case we had the following configuration for replication and acknowledgement:
replication.factor: 3
producer.acks: all
Under high load the same error occurred multiple times: TimeoutException: Expiring N record(s) for topic: N ms has passed since last append.
After removing our custom replication.factor and producer.acks configs (so we are now using the default values), this error disappeared.
It simply takes much more time on the producer side for the leader to receive acknowledgements from the full set of in-sync replicas and for records to be replicated with the specified replication.factor.
You will be slightly less protected in terms of fault tolerance with the default values.
Also consider increasing the number of partitions per topic and the number of application nodes on which your Kafka Streams logic runs. A sketch of the overrides involved is shown below.
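For illustration, this is roughly what those overrides look like in Kafka Streams configuration. Producer settings are prefixed with producer. per the Streams convention; the file name and the alternative timeout values are assumptions, not a verified fix:

# streams.properties (file name assumed) -- the overrides that were removed:
#   replication.factor=3
#   producer.acks=all
# Streams then falls back to its defaults (replication.factor=1 for internal
# topics; acks at the producer client's default).
#
# Alternative, if you need to keep RF=3 and acks=all: give the producer more
# time before it expires batches (values are examples only):
#   producer.request.timeout.ms=120000
#   producer.retries=10
grep -n 'replication.factor\|producer.acks' streams.properties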

Why are the executors getting killed by the driver?

The first stage of my spark job is quite simple.
It reads from a big number of files (around 30,000 files and 100GB in total) -> RDD[String]
does a map (to parse each line) -> RDD[Map[String,Any]]
filters -> RDD[Map[String,Any]]
coalesces (.coalesce(100, true))
When running it, I observe quite peculiar behavior. The number of executors grows until the limit I specified in spark.dynamicAllocation.maxExecutors (typically 100 or 200 in my application). Then it starts decreasing quickly (at approx. 14000/33428 tasks) and only a few executors remain; they are killed by the driver. When this task is done, the number of executors increases back to its maximum value.
Below is a screenshot of the number of executors at its lowest.
And here is a screenshot of the task summary.
I guess that these executors are killed because they are idle. But in that case I do not understand why they would become idle; there are still plenty of tasks left to do in the stage...
Do you have any idea of why it happens?
EDIT
More details about the driver logs when an executor is killed:
16/09/30 12:23:33 INFO cluster.YarnClusterSchedulerBackend: Disabling executor 91.
16/09/30 12:23:33 INFO scheduler.DAGScheduler: Executor lost: 91 (epoch 0)
16/09/30 12:23:33 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 91 from BlockManagerMaster.
16/09/30 12:23:33 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(91, server.com, 40923)
16/09/30 12:23:33 INFO storage.BlockManagerMaster: Removed 91 successfully in removeExecutor
16/09/30 12:23:33 INFO cluster.YarnClusterScheduler: Executor 91 on server.com killed by driver.
16/09/30 12:23:33 INFO spark.ExecutorAllocationManager: Existing executor 91 has been removed (new total is 94)
Logs on the executor
16/09/30 12:26:28 INFO rdd.HadoopRDD: Input split: hdfs://...
16/09/30 12:26:32 INFO executor.Executor: Finished task 38219.0 in stage 0.0 (TID 26519). 2312 bytes result sent to driver
16/09/30 12:27:33 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
16/09/30 12:27:33 INFO storage.DiskBlockManager: Shutdown hook called
16/09/30 12:27:33 INFO util.ShutdownHookManager: Shutdown hook called
I'm seeing this problem on executors that are killed as a result of an idle timeout. I have an exceedingly demanding computational load, but it's mostly computed in a UDF, invisible to Spark. I believe there is some Spark parameter that can be adjusted.
Try looking through the spark.executor parameters in https://spark.apache.org/docs/latest/configuration.html#spark-properties and see if anything jumps out; a hedged example is sketched below.
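Besides the spark.executor.* properties, the dynamic-allocation idle timeouts are the usual suspects when the driver kills executors that look idle. A minimal sketch, assuming dynamic allocation is enabled as in the question; the values and the application jar name are illustrative, not a verified fix:

# Keep idle executors around longer before the driver removes them.
# Defaults are 60s and infinity respectively; the values below are examples.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.maxExecutors=200 \
  --conf spark.dynamicAllocation.executorIdleTimeout=600s \
  --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=3600s \
  your-app.jar   # placeholder for your existing spark-submit arguments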