How to fix prysm.sh beacon-chain stuck at 99% progress? - merge

Geth stopped working after the merge. When I tried to fix it by running beacon-chain, the beacon node's sync got stuck at 99% progress.
command:
prysm.sh beacon-chain --execution-endpoint=http://localhost:8551 --datadir=/disk1/prysm/.eth2
logs:
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 63 starting from 0x35967a9b... 4700032/4737924 - estimated time remaining 22m9s blocksPerSecond=28.5 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x97b0b7b53582569689c52dbee87990ea2d7a94b17ee823e704c99d07e81b5376 (in processBatchedBlocks, slot=4700032)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 63 starting from 0x063c579d... 4700096/4737924 - estimated time remaining 19m55s blocksPerSecond=31.6 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x91593f4ce1da4260a4475807af54ada66481b2e5529859fbcdd636c59966ac5d (in processBatchedBlocks, slot=4700096)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 62 starting from 0xe5a59df5... 4700160/4737924 - estimated time remaining 18m6s blocksPerSecond=34.8 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x099ce628bdb98cd34673e06f779a695a9fa903472f95f778a823c4b271296669 (in processBatchedBlocks, slot=4700160)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 64 starting from 0xf0b0a565... 4700224/4737924 - estimated time remaining 16m33s blocksPerSecond=38.0 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x6a644e5ce7eb9063ac0334eb070469ffe1babef71b42fc295a0098410c8509ff (in processBatchedBlocks, slot=4700224)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 64 starting from 0x4aef416e... 4700288/4737924 - estimated time remaining 15m14s blocksPerSecond=41.1 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xa165b5186776ed6adbeffbe7f9861a25cfe9e9a79b79fbf63c44f0f3f0fd2433 (in processBatchedBlocks, slot=4700288)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 64 starting from 0xe2ad65e3... 4700352/4737924 - estimated time remaining 14m7s blocksPerSecond=44.4 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x0d5b8ab2983591a9dd27b6b6b99540f75de7a6b7f88dfe6dd83ac5e8316b0d79 (in processBatchedBlocks, slot=4700352)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 63 starting from 0xe08fde61... 4700416/4737924 - estimated time remaining 13m9s blocksPerSecond=47.5 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xb493b115a9e7dadff196d1fd9092c477b3503a148983a6cd37111a00ba526862 (in processBatchedBlocks, slot=4700416)

I'm getting the same error. You can check your Geth execution node; I think it is stuck because of this Geth error:
WARN [09-21|18:45:48.100] Ignoring already known beacon payload number=15,580,285 hash=d2b656..2c59c3 age=3h23m49s
My beacon-chain is logging these errors/warnings:
[2022-09-21 18:48:26] WARN powchain: Execution client is not syncing
[2022-09-21 18:48:37] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x43b75c52f29244ca3caee03c5d3dbc52ac19e12d1fd8d2ae3e28c358719cb028 (in processBatchedBlocks, slot=4743744)
My eth.blockNumber is stuck at 15,580,285.
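To confirm that, here is a minimal sketch (mine, not from the original post) for polling Geth over its JSON-RPC and watching whether the head block moves at all. It assumes Geth exposes HTTP RPC on the default port 8545 (started with --http) and that the web3.py package is installed.

from web3 import Web3

# Sketch only: check whether Geth itself is advancing, independently of Prysm.
w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))
print("connected:", w3.is_connected())     # w3.isConnected() on web3.py < 6
print("syncing:", w3.eth.syncing)          # False, or an object with currentBlock/highestBlock
print("head block:", w3.eth.block_number)  # if this never changes, Geth is the one that is stuck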

Related

Prysm.sh beacon-chain stopped synchronization - How to fix?

I am trying to sync my Geth node, but it is stuck.
I see the following errors in Prysm:
[2022-12-30 08:24:56] INFO p2p: Peer summary activePeers=42 inbound=0 outbound=42
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=could not process block in batch: could not set node to invalid: invalid nil or unknown node
[2022-12-30 08:24:59] INFO initial-sync: Processing block batch of size 64 starting from 0x87ca0daa... 5192464/5463723 - estimated time remaining 7h50m56s blocksPerSecond=9.6 peers=41
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x1a0877b34ccd9e92bdfe7859dbd3478d1c785fd6e970d8b8b637d5c457efec83 (in processBatchedBlocks, slot=5192464)
[2022-12-30 08:24:59] INFO initial-sync: Processing block batch of size 64 starting from 0xa48361f2... 5192528/5463723 - estimated time remaining 5h53m7s blocksPerSecond=12.8 peers=41
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xc569dd3f4241031b835bd7dd528d2337cca5b0c8315ae55f56b810d1b0a1aa40 (in processBatchedBlocks, slot=5192528)
[2022-12-30 08:24:59] INFO initial-sync: Processing block batch of size 61 starting from 0x06de2537... 5192592/5463723 - estimated time remaining 4h45m6s blocksPerSecond=15.8 peers=41
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xf9a776157c86f918aa82c39a69a2a880b0eb42a692bc6e1a97b3fe2fce73c916 (in processBatchedBlocks, slot=5192592)
[2022-12-30 08:24:59] INFO initial-sync: Processing block batch of size 62 starting from 0x865d9e60... 5192656/5463723 - estimated time remaining 3h58m24s blocksPerSecond=18.9 peers=41
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x03140ee59185dc417bf95c370ed4070458855da01b54d0b2186d9e11fd328314 (in processBatchedBlocks, slot=5192656)
[2022-12-30 08:25:09] INFO initial-sync: Processing block batch of size 64 starting from 0x33e52bbd... 5192336/5463723 - estimated time remaining 3h58m41s blocksPerSecond=18.9 peers=44
[2022-12-30 08:25:16] WARN initial-sync: Skip processing batched blocks error=could not process block in batch: could not set node to invalid: invalid nil or unknown node
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 62 starting from 0x0ed9c790... 5192400/5463724 - estimated time remaining 11h57m47s blocksPerSecond=6.3 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x4f227a74a20bebe01369ac220aedacd7e4c986e8f569694f3c643da0cc9cfe83 (in processBatchedBlocks, slot=5192400)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 64 starting from 0x87ca0daa... 5192464/5463724 - estimated time remaining 7h55m53s blocksPerSecond=9.5 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x1a0877b34ccd9e92bdfe7859dbd3478d1c785fd6e970d8b8b637d5c457efec83 (in processBatchedBlocks, slot=5192464)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 64 starting from 0xa48361f2... 5192528/5463724 - estimated time remaining 5h55m54s blocksPerSecond=12.7 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xc569dd3f4241031b835bd7dd528d2337cca5b0c8315ae55f56b810d1b0a1aa40 (in processBatchedBlocks, slot=5192528)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 61 starting from 0x06de2537... 5192592/5463724 - estimated time remaining 4h46m54s blocksPerSecond=15.8 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xf9a776157c86f918aa82c39a69a2a880b0eb42a692bc6e1a97b3fe2fce73c916 (in processBatchedBlocks, slot=5192592)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 62 starting from 0x865d9e60... 5192656/5463724 - estimated time remaining 3h59m40s blocksPerSecond=18.9 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x03140ee59185dc417bf95c370ed4070458855da01b54d0b2186d9e11fd328314 (in processBatchedBlocks, slot=5192656)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 64 starting from 0x74b66233... 5192720/5463724 - estimated time remaining 3h24m50s blocksPerSecond=22.1 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xed2e36835d677750e41fecc7a16c7d8669fd8a8ff629c9480b575d6d27c26085 (in processBatchedBlocks, slot=5192720)
[2022-12-30 08:25:27] INFO initial-sync: Processing block batch of size 64 starting from 0x33e52bbd... 5192336/5463725 - estimated time remaining 2h59m8s blocksPerSecond=25.2 peers=45
[2022-12-30 08:25:33] WARN initial-sync: Skip processing batched blocks error=could not process block in batch: could not set node to invalid: invalid nil or unknown node
And this is what I see in Geth:
ERROR[12-30|08:30:48.985] Error in block freeze operation err="block receipts missing, can't freeze block 15850304"
WARN [12-30|08:30:55.145] Previously seen beacon client is offline. Please ensure it is operational to follow the chain!
WARN [12-30|08:30:56.688] Ignoring already known beacon payload number=16,026,540 hash=145907..4f081f age=1mo1w16h
WARN [12-30|08:30:56.696] Ignoring already known beacon payload number=16,026,541 hash=6acd62..e3bac2 age=1mo1w16h
WARN [12-30|08:30:56.701] Ignoring already known beacon payload number=16,026,542 hash=1da1d8..c55a69 age=1mo1w16h
WARN [12-30|08:30:56.708] Ignoring already known beacon payload number=16,026,543 hash=762b24..f56957 age=1mo1w16h
WARN [12-30|08:30:56.719] Ignoring already known beacon payload number=16,026,544 hash=ff1aef..389471 age=1mo1w16h
WARN [12-30|08:30:56.725] Ignoring already known beacon payload number=16,026,545 hash=6767aa..85bf1d age=1mo1w16h
WARN [12-30|08:30:56.730] Ignoring already known beacon payload number=16,026,546 hash=95b736..dc2456 age=1mo1w16h
WARN [12-30|08:30:56.732] Ignoring already known beacon payload number=16,026,547 hash=34e43f..777810 age=1mo1w16h
WARN [12-30|08:30:56.742] Ignoring already known beacon payload number=16,026,548 hash=1c67b8..cbc356 age=1mo1w16h
WARN [12-30|08:30:56.750] Ignoring already known beacon payload number=16,026,549 hash=fe9e47..ed347e age=1mo1w16h
WARN [12-30|08:30:56.754] Ignoring already known beacon payload number=16,026,550 hash=c98bf1..40560a age=1mo1w16h
WARN [12-30|08:30:56.772] Ignoring already known beacon payload number=16,026,551 hash=f55377..a1582e age=1mo1w16h
WARN [12-30|08:30:56.780] Ignoring already known beacon payload number=16,026,552 hash=0bf769..af0ed8 age=1mo1w16h
WARN [12-30|08:30:56.784] Ignoring already known beacon payload number=16,026,553 hash=382866..a5a4f8 age=1mo1w16h
WARN [12-30|08:30:56.907] Ignoring already known beacon payload number=16,026,554 hash=65d2ff..6ebef5 age=1mo1w16h
WARN [12-30|08:30:56.918] Ignoring already known beacon payload number=16,026,555 hash=f04209..4779e9 age=1mo1w16h
WARN [12-30|08:30:56.935] Ignoring already known beacon payload number=16,026,556 hash=f2b1ab..373dc0 age=1mo1w16h
WARN [12-30|08:30:56.943] Ignoring already known beacon payload number=16,026,557 hash=979712..00891d age=1mo1w16h
WARN [12-30|08:30:56.956] Ignoring already known beacon payload number=16,026,558 hash=b53705..8483a1 age=1mo1w16h
WARN [12-30|08:30:56.979] Ignoring already known beacon payload number=16,026,559 hash=61e689..7e7c79 age=1mo1w16h
WARN [12-30|08:30:56.994] Ignoring already known beacon payload number=16,026,560 hash=a0f45b..802daf age=1mo1w16h
WARN [12-30|08:30:57.004] Ignoring already known beacon payload number=16,026,561 hash=037435..474e8d age=1mo1w16h
WARN [12-30|08:30:57.009] Ignoring already known beacon payload number=16,026,562 hash=565f15..bf9980 age=1mo1w16h
WARN [12-30|08:30:57.031] Ignoring already known beacon payload number=16,026,563 hash=c7f6ef..cc5ddf age=1mo1w16h
WARN [12-30|08:30:57.033] Ignoring already known beacon payload number=16,026,564 hash=c87d53..223987 age=1mo1w16h
WARN [12-30|08:30:57.068] Ignoring already known beacon payload number=16,026,565 hash=51f821..fc1a26 age=1mo1w16h
Has anyone encountered similar problems before?
I have already tried syncing again; it didn't help.

Timeout while streaming messages from message queue

I am processing messages from IBM MQ with a Scala program. It was working fine and then stopped working without any code change.
The timeout occurs intermittently, without any specific pattern.
I run the application like this:
spark-submit --conf spark.streaming.driver.writeAheadLog.allowBatching=true --conf spark.streaming.driver.writeAheadLog.batchingTimeout=15000 --class com.ibm.spark.streaming.mq.SparkMQExample --master yarn --deploy-mode client --num-executors 1 $jar_file_loc lots of args here >> script.out.log 2>> script.err.log < /dev/null
I tried two properties:
spark.streaming.driver.writeAheadLog.batchingTimeout 15000
spark.streaming.driver.writeAheadLog.allowBatching true
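For what it's worth, the same two settings can also be set programmatically on the SparkConf; below is a minimal PySpark sketch (the original job is Scala, so treat this only as an illustration of where the properties go). The documented default for batchingTimeout is 5000 ms, which is the figure that shows up in the stack trace below.

from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

# Sketch only: the property names are standard Spark Streaming configuration;
# the app name and the 60 s batch interval are assumptions.
conf = (
    SparkConf()
    .setAppName("SparkMQExample")
    .set("spark.streaming.driver.writeAheadLog.allowBatching", "true")
    .set("spark.streaming.driver.writeAheadLog.batchingTimeout", "15000")  # milliseconds
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
ssc = StreamingContext(spark.sparkContext, 60)  # batch interval in seconds (assumed)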
See error:
2021-12-14 14:13:05 WARN ReceivedBlockTracker:90 - Exception thrown while writing record: BatchAllocationEvent(1639487580000 ms,AllocatedBlocks(Map(0 -> Queue()))) to the WriteAheadLog.
java.util.concurrent.TimeoutException: Futures timed out after [5000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:238)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.allocateBlocksToBatch(ReceivedBlockTracker.scala:118)
at org.apache.spark.streaming.scheduler.ReceiverTracker.allocateBlocksToBatch(ReceiverTracker.scala:209)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
2021-12-14 14:13:05 INFO ReceivedBlockTracker:57 - Possibly processed batch 1639487580000 ms needs to be processed again in WAL recovery
2021-12-14 14:13:05 INFO JobScheduler:57 - Added jobs for time 1639487580000 ms
2021-12-14 14:13:05 INFO JobGenerator:57 - Checkpointing graph for time 1639487580000 ms
2021-12-14 14:13:05 INFO DStreamGraph:57 - Updating checkpoint data for time 1639487580000 ms
rdd is empty
2021-12-14 14:13:05 INFO JobScheduler:57 - Starting job streaming job 1639487580000 ms.0 from job set of time 1639487580000 ms
2021-12-14 14:13:05 INFO DStreamGraph:57 - Updated checkpoint data for time 1639487580000 ms
2021-12-14 14:13:05 INFO JobScheduler:57 - Finished job streaming job 1639487580000 ms.0 from job set of time 1639487580000 ms
2021-12-14 14:13:05 INFO JobScheduler:57 - Total delay: 5.011 s for time 1639487580000 ms (execution: 0.001 s)
2021-12-14 14:13:05 INFO CheckpointWriter:57 - Submitted checkpoint of time 1639487580000 ms to writer queue
2021-12-14 14:13:05 INFO BlockRDD:57 - Removing RDD 284 from persistence list
2021-12-14 14:13:05 INFO PluggableInputDStream:57 - Removing blocks of RDD BlockRDD[284] at receiverStream at JmsStreamUtils.scala:64 of time 1639487580000 ms
2021-12-14 14:13:05 INFO BlockManager:57 - Removing RDD 284
2021-12-14 14:13:05 INFO JobGenerator:57 - Checkpointing graph for time 1639487580000 ms
2021-12-14 14:13:05 INFO DStreamGraph:57 - Updating checkpoint data for time 1639487580000 ms
2021-12-14 14:13:05 INFO DStreamGraph:57 - Updated checkpoint data for time 1639487580000 ms
2021-12-14 14:13:05 INFO CheckpointWriter:57 - Submitted checkpoint of time 1639487580000 ms to writer queue
Any kind of information would be useful. Thank you!

Kafka Broker Not able to start

I have a 3-node Kafka cluster. One of the brokers is not starting; I am getting the error below. I have tried deleting index files, but the same error keeps coming back. Please help me understand what this issue is and how I can recover.
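Not an answer from the thread, but since "deleting index files" can mean different things: here is a minimal sketch, assuming the broker is stopped and the partition has healthy replicas elsewhere, of moving the index files of the segment named in the error aside so Kafka rebuilds them on the next start. The directory and base offset are copied from the log lines below and are only illustrative.

import shutil
from pathlib import Path

# Sketch only: move (rather than delete) the index files of the segment from the
# error below so Kafka rebuilds them at startup. Run with the broker stopped.
partition_dir = Path("/var/lib/kafka/kafka-logs/Topic3-35")
base_offset = "00000000000011110038"
backup_dir = Path("/tmp/kafka-index-backup") / partition_dir.name
backup_dir.mkdir(parents=True, exist_ok=True)

for suffix in (".index", ".timeindex", ".txnindex"):
    f = partition_dir / (base_offset + suffix)
    if f.exists():
        shutil.move(str(f), str(backup_dir / f.name))
        print("moved", f)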
INFO [2018-09-05 11:58:49,585] kafka.log.Log:[Logging$class:info:66] - [pool-4-thread-1] - [Log partition=Topic3-15, dir=/var/lib/kafka/kafka-logs] Completed load of log with 1 segments, log start offset 11547004 and log end offset 11559178 in 1552 ms
INFO [2018-09-05 11:58:49,589] kafka.log.Log:[Logging$class:info:66] - [pool-4-thread-1] - [Log partition=Topic3-13, dir=/var/lib/kafka/kafka-logs] Recovering unflushed segment 12399433
ERROR [2018-09-05 11:58:49,591] kafka.log.LogManager:[Logging$class:error:74] - [main] - There was an error in one of the threads during logs loading: java.lang.IllegalArgumentException: inconsistent range
WARN [2018-09-05 11:58:49,591] kafka.log.Log:[Logging$class:warn:70] - [pool-4-thread-1] - [Log partition=Topic3-35, dir=/var/lib/kafka/kafka-logs] Found a corrupted index file corresponding to log file /var/lib/kafka/kafka-logs/Topic3-35/00000000000011110038.log due to Corrupt time index found, time index file (/var/lib/kafka/kafka-logs/Topic3-35/00000000000011110038.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1536129815049}, recovering segment and rebuilding index files...
INFO [2018-09-05 11:58:49,594] kafka.log.ProducerStateManager:[Logging$class:info:66] - [pool-4-thread-1] - [ProducerStateManager partition=Topic3-35] Loading producer state from snapshot file '/var/lib/kafka/kafka-logs/Topic3-35/00000000000011110038.snapshot'
ERROR [2018-09-05 11:58:49,599] kafka.server.KafkaServer:[MarkerIgnoringBase:error:159] - [main] - [KafkaServer id=2] Fatal error during KafkaServer startup. Prepare to shutdown
java.lang.IllegalArgumentException: inconsistent range
at java.util.concurrent.ConcurrentSkipListMap$SubMap.&lt;init&gt;(ConcurrentSkipListMap.java:2620)
at java.util.concurrent.ConcurrentSkipListMap.subMap(ConcurrentSkipListMap.java:2078)
at java.util.concurrent.ConcurrentSkipListMap.subMap(ConcurrentSkipListMap.java:2114)
at kafka.log.Log$$anonfun$12.apply(Log.scala:1561)
at kafka.log.Log$$anonfun$12.apply(Log.scala:1560)
at scala.Option.map(Option.scala:146)
at kafka.log.Log.logSegments(Log.scala:1560)
at kafka.log.Log.kafka$log$Log$$recoverSegment(Log.scala:358)
at kafka.log.Log.recoverLog(Log.scala:448)
at kafka.log.Log.loadSegments(Log.scala:421)
at kafka.log.Log.&lt;init&gt;(Log.scala:216)
at kafka.log.Log$.apply(Log.scala:1747)
at kafka.log.LogManager.kafka$log$LogManager$$loadLog(LogManager.scala:255)
at kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$11$$anonfun$apply$15$$anonfun$apply$2.apply$mcV$sp(LogManager.scala:335)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:62)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
INFO [2018-09-05 11:58:49,606] kafka.server.KafkaServer:[Logging$class:info:66] - [main] - [KafkaServer id=2] shutting down

Airflow scheduler keeps on Failing jobs without heartbeat

I'm new to Airflow and I tried to manually trigger a job through the UI. When I did that, the scheduler kept logging that it is "Failing jobs without heartbeat", as follows:
[2018-05-28 12:13:48,248] {jobs.py:1662} INFO - Heartbeating the executor
[2018-05-28 12:13:48,250] {jobs.py:1672} INFO - Heartbeating the scheduler
[2018-05-28 12:13:48,259] {jobs.py:368} INFO - Started process (PID=58141) to work on /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:48,264] {jobs.py:1742} INFO - Processing file /Users/gkumar6/airflow/dags/tutorial.py for tasks to queue
[2018-05-28 12:13:48,265] {models.py:189} INFO - Filling up the DagBag from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:48,275] {jobs.py:1754} INFO - DAG(s) ['tutorial'] retrieved from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:48,298] {models.py:341} INFO - Finding 'running' jobs without a recent heartbeat
[2018-05-28 12:13:48,299] {models.py:345} INFO - Failing jobs without heartbeat after 2018-05-28 06:38:48.299278
[2018-05-28 12:13:48,304] {jobs.py:375} INFO - Processing /Users/gkumar6/airflow/dags/tutorial.py took 0.045 seconds
[2018-05-28 12:13:49,266] {jobs.py:1627} INFO - Heartbeating the process manager
[2018-05-28 12:13:49,267] {dag_processing.py:468} INFO - Processor for /Users/gkumar6/airflow/dags/tutorial.py finished
[2018-05-28 12:13:49,271] {dag_processing.py:537} INFO - Started a process (PID: 58149) to generate tasks for /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:49,272] {jobs.py:1662} INFO - Heartbeating the executor
[2018-05-28 12:13:49,283] {jobs.py:368} INFO - Started process (PID=58149) to work on /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:49,288] {jobs.py:1742} INFO - Processing file /Users/gkumar6/airflow/dags/tutorial.py for tasks to queue
[2018-05-28 12:13:49,289] {models.py:189} INFO - Filling up the DagBag from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:49,300] {jobs.py:1754} INFO - DAG(s) ['tutorial'] retrieved from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:49,326] {models.py:341} INFO - Finding 'running' jobs without a recent heartbeat
[2018-05-28 12:13:49,327] {models.py:345} INFO - Failing jobs without heartbeat after 2018-05-28 06:38:49.327218
[2018-05-28 12:13:49,332] {jobs.py:375} INFO - Processing /Users/gkumar6/airflow/dags/tutorial.py took 0.049 seconds
[2018-05-28 12:13:50,279] {jobs.py:1627} INFO - Heartbeating the process manager
[2018-05-28 12:13:50,280] {dag_processing.py:468} INFO - Processor for /Users/gkumar6/airflow/dags/tutorial.py finished
[2018-05-28 12:13:50,283] {dag_processing.py:537} INFO - Started a process (PID: 58150) to generate tasks for /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:50,285] {jobs.py:1662} INFO - Heartbeating the executor
[2018-05-28 12:13:50,296] {jobs.py:368} INFO - Started process (PID=58150) to work on /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:50,301] {jobs.py:1742} INFO - Processing file /Users/gkumar6/airflow/dags/tutorial.py for tasks to queue
[2018-05-28 12:13:50,302] {models.py:189} INFO - Filling up the DagBag from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:50,312] {jobs.py:1754} INFO - DAG(s) ['tutorial'] retrieved from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:50,338] {models.py:341} INFO - Finding 'running' jobs without a recent heartbeat
[2018-05-28 12:13:50,339] {models.py:345} INFO - Failing jobs without heartbeat after 2018-05-28 06:38:50.339147
[2018-05-28 12:13:50,344] {jobs.py:375} INFO - Processing /Users/gkumar6/airflow/dags/tutorial.py took 0.048 seconds
And the status of the job in the UI is stuck at "running". Is there something I need to configure to solve this issue?
It seems that this is not a "Failing jobs" problem but a logging problem. Here's what I found when I tried to fix it.
Does this message indicate that there's something wrong I should be concerned about?
No.
"Finding 'running' jobs" and "Failing jobs..." are INFO-level logs generated by the find_zombies function of the heartbeat utility, so these lines will be produced every heartbeat interval even if you don't have any failing jobs running.
How do I turn it off?
The logging_level option in airflow.cfg does not control the scheduler logging.
There is one hard-coded value in airflow/settings.py:
LOGGING_LEVEL = logging.INFO
You could change this to:
LOGGING_LEVEL = logging.WARN
Then restart the scheduler and the problem will be gone.
I think that, regarding point 2, if you just change logging_level = INFO to WARN in airflow.cfg, you won't get the INFO logs; you don't need to modify the settings.py file.
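For reference, the airflow.cfg change is just the following (a sketch: on Airflow 1.x the option lives in the [core] section, on newer versions in [logging], and the value should be a standard Python logging level name such as WARNING):

[core]
logging_level = WARNING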

Why are the executors getting killed by the driver?

The first stage of my Spark job is quite simple (a rough sketch follows the list below).
It reads from a large number of files (around 30,000 files, 100 GB in total) -> RDD[String]
does a map (to parse each line) -> RDD[Map[String,Any]]
filters -> RDD[Map[String,Any]]
coalesces (.coalesce(100, true))
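For concreteness, here is a rough PySpark equivalent of that first stage; it is only a sketch, and the input path, parsing, and filter predicate are placeholders I made up (the original job may well be in Scala).

import json
from pyspark.sql import SparkSession

# Placeholder sketch of the stage above; path, parsing and predicate are invented.
sc = SparkSession.builder.getOrCreate().sparkContext

lines = sc.textFile("hdfs:///data/input/*")               # ~30,000 files, ~100 GB -> RDD[String]
parsed = lines.map(json.loads)                            # parse each line -> RDD[dict]
kept = parsed.filter(lambda rec: rec.get("keep", False))  # filter
result = kept.coalesce(100, shuffle=True)                 # .coalesce(100, true) in the original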
When running it, I observe quite peculiar behavior. The number of executors grows until the limit I specified in spark.dynamicAllocation.maxExecutors (typically 100 or 200 in my application). Then it starts decreasing quickly (at approximately 14000/33428 tasks) and only a few executors remain; they are killed by the driver. When this stage is done, the number of executors increases back to its maximum value.
Below is a screenshot of the number of executors at its lowest.
And here is a screenshot of the task summary.
I guess that these executors are killed because they are idle. But in that case, I do not understand why they would become idle. There remain a lot of tasks to do in the stage...
Do you have any idea why this happens?
EDIT
More details about the driver logs when an executor is killed:
16/09/30 12:23:33 INFO cluster.YarnClusterSchedulerBackend: Disabling executor 91.
16/09/30 12:23:33 INFO scheduler.DAGScheduler: Executor lost: 91 (epoch 0)
16/09/30 12:23:33 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 91 from BlockManagerMaster.
16/09/30 12:23:33 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(91, server.com, 40923)
16/09/30 12:23:33 INFO storage.BlockManagerMaster: Removed 91 successfully in removeExecutor
16/09/30 12:23:33 INFO cluster.YarnClusterScheduler: Executor 91 on server.com killed by driver.
16/09/30 12:23:33 INFO spark.ExecutorAllocationManager: Existing executor 91 has been removed (new total is 94)
Logs on the executor
16/09/30 12:26:28 INFO rdd.HadoopRDD: Input split: hdfs://...
16/09/30 12:26:32 INFO executor.Executor: Finished task 38219.0 in stage 0.0 (TID 26519). 2312 bytes result sent to driver
16/09/30 12:27:33 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
16/09/30 12:27:33 INFO storage.DiskBlockManager: Shutdown hook called
16/09/30 12:27:33 INFO util.ShutdownHookManager: Shutdown hook called
I'm seeing this problem on executors that are killed as a result of an idle timeout. I have an exceedingly demanding computational load, but it's mostly computed in a UDF, invisible to Spark. I believe that there's some spark parameter that can be adjusted.
Try looking through the spark.executor parameters in https://spark.apache.org/docs/latest/configuration.html#spark-properties and see if anything jumps out.
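In case it helps, the specific knob for that idle reclamation is spark.dynamicAllocation.executorIdleTimeout (60 s by default), which matches the roughly one-minute gap between the last finished task and the SIGTERM in the executor log above. A hedged sketch of raising it follows; the values are only illustrative.

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Illustrative values only: keep executors alive longer under dynamic allocation.
conf = (
    SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.maxExecutors", "200")          # as in the question
    .set("spark.dynamicAllocation.executorIdleTimeout", "600s")  # default is 60s
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()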