Geth stopped working after the Merge. When I tried to fix it by running beacon-chain, the beacon node's sync stopped progressing at 99%.
command:
prysm.sh beacon-chain --execution-endpoint=http://localhost:8551 --datadir=/disk1/prysm/.eth2
logs:
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 63 starting from 0x35967a9b... 4700032/4737924 - estimated time remaining 22m9s blocksPerSecond=28.5 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x97b0b7b53582569689c52dbee87990ea2d7a94b17ee823e704c99d07e81b5376 (in processBatchedBlocks, slot=4700032)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 63 starting from 0x063c579d... 4700096/4737924 - estimated time remaining 19m55s blocksPerSecond=31.6 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x91593f4ce1da4260a4475807af54ada66481b2e5529859fbcdd636c59966ac5d (in processBatchedBlocks, slot=4700096)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 62 starting from 0xe5a59df5... 4700160/4737924 - estimated time remaining 18m6s blocksPerSecond=34.8 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x099ce628bdb98cd34673e06f779a695a9fa903472f95f778a823c4b271296669 (in processBatchedBlocks, slot=4700160)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 64 starting from 0xf0b0a565... 4700224/4737924 - estimated time remaining 16m33s blocksPerSecond=38.0 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x6a644e5ce7eb9063ac0334eb070469ffe1babef71b42fc295a0098410c8509ff (in processBatchedBlocks, slot=4700224)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 64 starting from 0x4aef416e... 4700288/4737924 - estimated time remaining 15m14s blocksPerSecond=41.1 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xa165b5186776ed6adbeffbe7f9861a25cfe9e9a79b79fbf63c44f0f3f0fd2433 (in processBatchedBlocks, slot=4700288)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 64 starting from 0xe2ad65e3... 4700352/4737924 - estimated time remaining 14m7s blocksPerSecond=44.4 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x0d5b8ab2983591a9dd27b6b6b99540f75de7a6b7f88dfe6dd83ac5e8316b0d79 (in processBatchedBlocks, slot=4700352)
[2022-09-20 13:05:11] INFO initial-sync: Processing block batch of size 63 starting from 0xe08fde61... 4700416/4737924 - estimated time remaining 13m9s blocksPerSecond=47.5 peers=47
[2022-09-20 13:05:11] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xb493b115a9e7dadff196d1fd9092c477b3503a148983a6cd37111a00ba526862 (in processBatchedBlocks, slot=4700416)
I'm getting the same error. You can check your Geth execution node; I think it's stuck because of this Geth error:
WARN [09-21|18:45:48.100] Ignoring already known beacon payload number=15,580,285 hash=d2b656..2c59c3 age=3h23m49s
My beacon-chain shows these errors/warnings:
[2022-09-21 18:48:26] WARN powchain: Execution client is not syncing
[2022-09-21 18:48:37] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x43b75c52f29244ca3caee03c5d3dbc52ac19e12d1fd8d2ae3e28c358719cb028 (in processBatchedBlocks, slot=4743744)
My eth.blockNumber stays at 15,580,285.
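For reference, one way to confirm whether the execution client is advancing at all is to query it over JSON-RPC. Below is a minimal sketch using web3.py; the HTTP endpoint on port 8545 is an assumption and should be adjusted to your own setup.
# Minimal sketch: check whether the Geth execution client is advancing.
# Assumes the JSON-RPC HTTP endpoint is enabled on localhost:8545 (adjust as needed)
# and web3.py is installed (pip install web3).
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))
print("connected:", w3.is_connected())    # older web3.py versions use w3.isConnected()
print("latest block:", w3.eth.block_number)
print("syncing:", w3.eth.syncing)         # False once Geth considers itself fully synced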
Related
I am trying to sync my Geth node, but it is stuck.
I see the following errors in Prysm:
[2022-12-30 08:24:56] INFO p2p: Peer summary activePeers=42 inbound=0 outbound=42
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=could not process block in batch: could not set node to invalid: invalid nil or unknown node
[2022-12-30 08:24:59] INFO initial-sync: Processing block batch of size 64 starting from 0x87ca0daa... 5192464/5463723 - estimated time remaining 7h50m56s blocksPerSecond=9.6 peers=41
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x1a0877b34ccd9e92bdfe7859dbd3478d1c785fd6e970d8b8b637d5c457efec83 (in processBatchedBlocks, slot=5192464)
[2022-12-30 08:24:59] INFO initial-sync: Processing block batch of size 64 starting from 0xa48361f2... 5192528/5463723 - estimated time remaining 5h53m7s blocksPerSecond=12.8 peers=41
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xc569dd3f4241031b835bd7dd528d2337cca5b0c8315ae55f56b810d1b0a1aa40 (in processBatchedBlocks, slot=5192528)
[2022-12-30 08:24:59] INFO initial-sync: Processing block batch of size 61 starting from 0x06de2537... 5192592/5463723 - estimated time remaining 4h45m6s blocksPerSecond=15.8 peers=41
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xf9a776157c86f918aa82c39a69a2a880b0eb42a692bc6e1a97b3fe2fce73c916 (in processBatchedBlocks, slot=5192592)
[2022-12-30 08:24:59] INFO initial-sync: Processing block batch of size 62 starting from 0x865d9e60... 5192656/5463723 - estimated time remaining 3h58m24s blocksPerSecond=18.9 peers=41
[2022-12-30 08:24:59] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x03140ee59185dc417bf95c370ed4070458855da01b54d0b2186d9e11fd328314 (in processBatchedBlocks, slot=5192656)
[2022-12-30 08:25:09] INFO initial-sync: Processing block batch of size 64 starting from 0x33e52bbd... 5192336/5463723 - estimated time remaining 3h58m41s blocksPerSecond=18.9 peers=44
[2022-12-30 08:25:16] WARN initial-sync: Skip processing batched blocks error=could not process block in batch: could not set node to invalid: invalid nil or unknown node
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 62 starting from 0x0ed9c790... 5192400/5463724 - estimated time remaining 11h57m47s blocksPerSecond=6.3 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x4f227a74a20bebe01369ac220aedacd7e4c986e8f569694f3c643da0cc9cfe83 (in processBatchedBlocks, slot=5192400)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 64 starting from 0x87ca0daa... 5192464/5463724 - estimated time remaining 7h55m53s blocksPerSecond=9.5 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x1a0877b34ccd9e92bdfe7859dbd3478d1c785fd6e970d8b8b637d5c457efec83 (in processBatchedBlocks, slot=5192464)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 64 starting from 0xa48361f2... 5192528/5463724 - estimated time remaining 5h55m54s blocksPerSecond=12.7 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xc569dd3f4241031b835bd7dd528d2337cca5b0c8315ae55f56b810d1b0a1aa40 (in processBatchedBlocks, slot=5192528)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 61 starting from 0x06de2537... 5192592/5463724 - estimated time remaining 4h46m54s blocksPerSecond=15.8 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xf9a776157c86f918aa82c39a69a2a880b0eb42a692bc6e1a97b3fe2fce73c916 (in processBatchedBlocks, slot=5192592)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 62 starting from 0x865d9e60... 5192656/5463724 - estimated time remaining 3h59m40s blocksPerSecond=18.9 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0x03140ee59185dc417bf95c370ed4070458855da01b54d0b2186d9e11fd328314 (in processBatchedBlocks, slot=5192656)
[2022-12-30 08:25:20] INFO initial-sync: Processing block batch of size 64 starting from 0x74b66233... 5192720/5463724 - estimated time remaining 3h24m50s blocksPerSecond=22.1 peers=44
[2022-12-30 08:25:20] WARN initial-sync: Skip processing batched blocks error=beacon node doesn't have a parent in db with root: 0xed2e36835d677750e41fecc7a16c7d8669fd8a8ff629c9480b575d6d27c26085 (in processBatchedBlocks, slot=5192720)
[2022-12-30 08:25:27] INFO initial-sync: Processing block batch of size 64 starting from 0x33e52bbd... 5192336/5463725 - estimated time remaining 2h59m8s blocksPerSecond=25.2 peers=45
[2022-12-30 08:25:33] WARN initial-sync: Skip processing batched blocks error=could not process block in batch: could not set node to invalid: invalid nil or unknown node
And here is what I see in Geth:
ERROR[12-30|08:30:48.985] Error in block freeze operation err="block receipts missing, can't freeze block 15850304"
WARN [12-30|08:30:55.145] Previously seen beacon client is offline. Please ensure it is operational to follow the chain!
WARN [12-30|08:30:56.688] Ignoring already known beacon payload number=16,026,540 hash=145907..4f081f age=1mo1w16h
WARN [12-30|08:30:56.696] Ignoring already known beacon payload number=16,026,541 hash=6acd62..e3bac2 age=1mo1w16h
WARN [12-30|08:30:56.701] Ignoring already known beacon payload number=16,026,542 hash=1da1d8..c55a69 age=1mo1w16h
WARN [12-30|08:30:56.708] Ignoring already known beacon payload number=16,026,543 hash=762b24..f56957 age=1mo1w16h
WARN [12-30|08:30:56.719] Ignoring already known beacon payload number=16,026,544 hash=ff1aef..389471 age=1mo1w16h
WARN [12-30|08:30:56.725] Ignoring already known beacon payload number=16,026,545 hash=6767aa..85bf1d age=1mo1w16h
WARN [12-30|08:30:56.730] Ignoring already known beacon payload number=16,026,546 hash=95b736..dc2456 age=1mo1w16h
WARN [12-30|08:30:56.732] Ignoring already known beacon payload number=16,026,547 hash=34e43f..777810 age=1mo1w16h
WARN [12-30|08:30:56.742] Ignoring already known beacon payload number=16,026,548 hash=1c67b8..cbc356 age=1mo1w16h
WARN [12-30|08:30:56.750] Ignoring already known beacon payload number=16,026,549 hash=fe9e47..ed347e age=1mo1w16h
WARN [12-30|08:30:56.754] Ignoring already known beacon payload number=16,026,550 hash=c98bf1..40560a age=1mo1w16h
WARN [12-30|08:30:56.772] Ignoring already known beacon payload number=16,026,551 hash=f55377..a1582e age=1mo1w16h
WARN [12-30|08:30:56.780] Ignoring already known beacon payload number=16,026,552 hash=0bf769..af0ed8 age=1mo1w16h
WARN [12-30|08:30:56.784] Ignoring already known beacon payload number=16,026,553 hash=382866..a5a4f8 age=1mo1w16h
WARN [12-30|08:30:56.907] Ignoring already known beacon payload number=16,026,554 hash=65d2ff..6ebef5 age=1mo1w16h
WARN [12-30|08:30:56.918] Ignoring already known beacon payload number=16,026,555 hash=f04209..4779e9 age=1mo1w16h
WARN [12-30|08:30:56.935] Ignoring already known beacon payload number=16,026,556 hash=f2b1ab..373dc0 age=1mo1w16h
WARN [12-30|08:30:56.943] Ignoring already known beacon payload number=16,026,557 hash=979712..00891d age=1mo1w16h
WARN [12-30|08:30:56.956] Ignoring already known beacon payload number=16,026,558 hash=b53705..8483a1 age=1mo1w16h
WARN [12-30|08:30:56.979] Ignoring already known beacon payload number=16,026,559 hash=61e689..7e7c79 age=1mo1w16h
WARN [12-30|08:30:56.994] Ignoring already known beacon payload number=16,026,560 hash=a0f45b..802daf age=1mo1w16h
WARN [12-30|08:30:57.004] Ignoring already known beacon payload number=16,026,561 hash=037435..474e8d age=1mo1w16h
WARN [12-30|08:30:57.009] Ignoring already known beacon payload number=16,026,562 hash=565f15..bf9980 age=1mo1w16h
WARN [12-30|08:30:57.031] Ignoring already known beacon payload number=16,026,563 hash=c7f6ef..cc5ddf age=1mo1w16h
WARN [12-30|08:30:57.033] Ignoring already known beacon payload number=16,026,564 hash=c87d53..223987 age=1mo1w16h
WARN [12-30|08:30:57.068] Ignoring already known beacon payload number=16,026,565 hash=51f821..fc1a26 age=1mo1w16h
Has anyone encountered similar problems before?
I already tried syncing again; it didn't help.
I am processing messages from IBM MQ with a Scala program. It was working fine and then stopped working without any code change.
The timeout occurs from time to time, without any specific pattern.
I run the application like this:
spark-submit --conf spark.streaming.driver.writeAheadLog.allowBatching=true --conf spark.streaming.driver.writeAheadLog.batchingTimeout=15000 --class com.ibm.spark.streaming.mq.SparkMQExample --master yarn --deploy-mode client --num-executors 1 $jar_file_loc lots of args here >> script.out.log 2>> script.err.log < /dev/null
I tried two properties (one way to set them programmatically is sketched right after this list):
spark.streaming.driver.writeAheadLog.batchingTimeout 15000
spark.streaming.driver.writeAheadLog.allowBatching true
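For illustration only, the same two properties can be set on the SparkConf instead of being passed with --conf on spark-submit. A minimal PySpark sketch follows (the original job is Scala; the application name reuse and the 60 s batch interval here are assumptions):
# Minimal sketch: set the write-ahead-log batching properties on the SparkConf
# instead of passing them via --conf. Values mirror the ones listed above;
# the app name and the 60 s batch interval are placeholders.
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("SparkMQExample")
        .set("spark.streaming.driver.writeAheadLog.allowBatching", "true")
        .set("spark.streaming.driver.writeAheadLog.batchingTimeout", "15000"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 60)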
See error:
2021-12-14 14:13:05 WARN ReceivedBlockTracker:90 - Exception thrown while writing record: BatchAllocationEvent(1639487580000 ms,AllocatedBlocks(Map(0 -> Queue()))) to the WriteAheadLog.
java.util.concurrent.TimeoutException: Futures timed out after [5000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:238)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.allocateBlocksToBatch(ReceivedBlockTracker.scala:118)
at org.apache.spark.streaming.scheduler.ReceiverTracker.allocateBlocksToBatch(ReceiverTracker.scala:209)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
2021-12-14 14:13:05 INFO ReceivedBlockTracker:57 - Possibly processed batch 1639487580000 ms needs to be processed again in WAL recovery
2021-12-14 14:13:05 INFO JobScheduler:57 - Added jobs for time 1639487580000 ms
2021-12-14 14:13:05 INFO JobGenerator:57 - Checkpointing graph for time 1639487580000 ms
2021-12-14 14:13:05 INFO DStreamGraph:57 - Updating checkpoint data for time 1639487580000 ms
rdd is empty
2021-12-14 14:13:05 INFO JobScheduler:57 - Starting job streaming job 1639487580000 ms.0 from job set of time 1639487580000 ms
2021-12-14 14:13:05 INFO DStreamGraph:57 - Updated checkpoint data for time 1639487580000 ms
2021-12-14 14:13:05 INFO JobScheduler:57 - Finished job streaming job 1639487580000 ms.0 from job set of time 1639487580000 ms
2021-12-14 14:13:05 INFO JobScheduler:57 - Total delay: 5.011 s for time 1639487580000 ms (execution: 0.001 s)
2021-12-14 14:13:05 INFO CheckpointWriter:57 - Submitted checkpoint of time 1639487580000 ms to writer queue
2021-12-14 14:13:05 INFO BlockRDD:57 - Removing RDD 284 from persistence list
2021-12-14 14:13:05 INFO PluggableInputDStream:57 - Removing blocks of RDD BlockRDD[284] at receiverStream at JmsStreamUtils.scala:64 of time 1639487580000 ms
2021-12-14 14:13:05 INFO BlockManager:57 - Removing RDD 284
2021-12-14 14:13:05 INFO JobGenerator:57 - Checkpointing graph for time 1639487580000 ms
2021-12-14 14:13:05 INFO DStreamGraph:57 - Updating checkpoint data for time 1639487580000 ms
2021-12-14 14:13:05 INFO DStreamGraph:57 - Updated checkpoint data for time 1639487580000 ms
2021-12-14 14:13:05 INFO CheckpointWriter:57 - Submitted checkpoint of time 1639487580000 ms to writer queue
Any kind of information would be useful. Thank you!
I have a 3-node Kafka cluster. One of the brokers is not starting, and I am getting the error below. I have tried deleting index files, but the same error keeps coming. Please help me understand what this issue is and how I can recover.
INFO [2018-09-05 11:58:49,585] kafka.log.Log:[Logging$class:info:66] - [pool-4-thread-1] - [Log partition=Topic3-15, dir=/var/lib/kafka/kafka-logs] Completed load of log with 1 segments, log start offset 11547004 and log end offset 11559178 in 1552 ms
INFO [2018-09-05 11:58:49,589] kafka.log.Log:[Logging$class:info:66] - [pool-4-thread-1] - [Log partition=Topic3-13, dir=/var/lib/kafka/kafka-logs] Recovering unflushed segment 12399433
ERROR [2018-09-05 11:58:49,591] kafka.log.LogManager:[Logging$class:error:74] - [main] - There was an error in one of the threads during logs loading: java.lang.IllegalArgumentException: inconsistent range
WARN [2018-09-05 11:58:49,591] kafka.log.Log:[Logging$class:warn:70] - [pool-4-thread-1] - [Log partition=Topic3-35, dir=/var/lib/kafka/kafka-logs] Found a corrupted index file corresponding to log file /var/lib/kafka/kafka-logs/Topic3-35/00000000000011110038.log due to Corrupt time index found, time index file (/var/lib/kafka/kafka-logs/Topic3-35/00000000000011110038.timeindex) has non-zero size but the last timestamp is 0 which is less than the first timestamp 1536129815049}, recovering segment and rebuilding index files...
INFO [2018-09-05 11:58:49,594] kafka.log.ProducerStateManager:[Logging$class:info:66] - [pool-4-thread-1] - [ProducerStateManager partition=Topic3-35] Loading producer state from snapshot file '/var/lib/kafka/kafka-logs/Topic3-35/00000000000011110038.snapshot'
ERROR [2018-09-05 11:58:49,599] kafka.server.KafkaServer:[MarkerIgnoringBase:error:159] - [main] - [KafkaServer id=2] Fatal error during KafkaServer startup. Prepare to shutdown
java.lang.IllegalArgumentException: inconsistent range
at java.util.concurrent.ConcurrentSkipListMap$SubMap.<init>(ConcurrentSkipListMap.java:2620)
at java.util.concurrent.ConcurrentSkipListMap.subMap(ConcurrentSkipListMap.java:2078)
at java.util.concurrent.ConcurrentSkipListMap.subMap(ConcurrentSkipListMap.java:2114)
at kafka.log.Log$$anonfun$12.apply(Log.scala:1561)
at kafka.log.Log$$anonfun$12.apply(Log.scala:1560)
at scala.Option.map(Option.scala:146)
at kafka.log.Log.logSegments(Log.scala:1560)
at kafka.log.Log.kafka$log$Log$$recoverSegment(Log.scala:358)
at kafka.log.Log.recoverLog(Log.scala:448)
at kafka.log.Log.loadSegments(Log.scala:421)
at kafka.log.Log.<init>(Log.scala:216)
at kafka.log.Log$.apply(Log.scala:1747)
at kafka.log.LogManager.kafka$log$LogManager$$loadLog(LogManager.scala:255)
at kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$11$$anonfun$apply$15$$anonfun$apply$2.apply$mcV$sp(LogManager.scala:335)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:62)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
INFO [2018-09-05 11:58:49,606] kafka.server.KafkaServer:[Logging$class:info:66] - [main] - [KafkaServer id=2] shutting down
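For context, "deleting index files" here usually means removing the per-segment .index and .timeindex files of the affected partition while the broker is stopped, so that Kafka rebuilds them on the next startup (the same rebuild referred to by the "recovering segment and rebuilding index files" line above). A hedged sketch of that step; the partition directory is taken from the logs above, and the directory should be backed up first:
# Hedged sketch: remove the offset/time index files of one partition so the broker
# rebuilds them on the next startup. Run only while the broker is stopped, and back
# up the partition directory first. The path matches the partition named in the logs.
from pathlib import Path

partition_dir = Path("/var/lib/kafka/kafka-logs/Topic3-35")

for index_file in sorted(partition_dir.glob("*.index")) + sorted(partition_dir.glob("*.timeindex")):
    print("removing", index_file)
    index_file.unlink()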
I'm new to Airflow, and I tried to manually trigger a job through the UI. When I did that, the scheduler kept logging that it is failing jobs without a heartbeat, as follows:
[2018-05-28 12:13:48,248] {jobs.py:1662} INFO - Heartbeating the executor
[2018-05-28 12:13:48,250] {jobs.py:1672} INFO - Heartbeating the scheduler
[2018-05-28 12:13:48,259] {jobs.py:368} INFO - Started process (PID=58141) to work on /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:48,264] {jobs.py:1742} INFO - Processing file /Users/gkumar6/airflow/dags/tutorial.py for tasks to queue
[2018-05-28 12:13:48,265] {models.py:189} INFO - Filling up the DagBag from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:48,275] {jobs.py:1754} INFO - DAG(s) ['tutorial'] retrieved from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:48,298] {models.py:341} INFO - Finding 'running' jobs without a recent heartbeat
[2018-05-28 12:13:48,299] {models.py:345} INFO - Failing jobs without heartbeat after 2018-05-28 06:38:48.299278
[2018-05-28 12:13:48,304] {jobs.py:375} INFO - Processing /Users/gkumar6/airflow/dags/tutorial.py took 0.045 seconds
[2018-05-28 12:13:49,266] {jobs.py:1627} INFO - Heartbeating the process manager
[2018-05-28 12:13:49,267] {dag_processing.py:468} INFO - Processor for /Users/gkumar6/airflow/dags/tutorial.py finished
[2018-05-28 12:13:49,271] {dag_processing.py:537} INFO - Started a process (PID: 58149) to generate tasks for /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:49,272] {jobs.py:1662} INFO - Heartbeating the executor
[2018-05-28 12:13:49,283] {jobs.py:368} INFO - Started process (PID=58149) to work on /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:49,288] {jobs.py:1742} INFO - Processing file /Users/gkumar6/airflow/dags/tutorial.py for tasks to queue
[2018-05-28 12:13:49,289] {models.py:189} INFO - Filling up the DagBag from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:49,300] {jobs.py:1754} INFO - DAG(s) ['tutorial'] retrieved from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:49,326] {models.py:341} INFO - Finding 'running' jobs without a recent heartbeat
[2018-05-28 12:13:49,327] {models.py:345} INFO - Failing jobs without heartbeat after 2018-05-28 06:38:49.327218
[2018-05-28 12:13:49,332] {jobs.py:375} INFO - Processing /Users/gkumar6/airflow/dags/tutorial.py took 0.049 seconds
[2018-05-28 12:13:50,279] {jobs.py:1627} INFO - Heartbeating the process manager
[2018-05-28 12:13:50,280] {dag_processing.py:468} INFO - Processor for /Users/gkumar6/airflow/dags/tutorial.py finished
[2018-05-28 12:13:50,283] {dag_processing.py:537} INFO - Started a process (PID: 58150) to generate tasks for /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:50,285] {jobs.py:1662} INFO - Heartbeating the executor
[2018-05-28 12:13:50,296] {jobs.py:368} INFO - Started process (PID=58150) to work on /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:50,301] {jobs.py:1742} INFO - Processing file /Users/gkumar6/airflow/dags/tutorial.py for tasks to queue
[2018-05-28 12:13:50,302] {models.py:189} INFO - Filling up the DagBag from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:50,312] {jobs.py:1754} INFO - DAG(s) ['tutorial'] retrieved from /Users/gkumar6/airflow/dags/tutorial.py
[2018-05-28 12:13:50,338] {models.py:341} INFO - Finding 'running' jobs without a recent heartbeat
[2018-05-28 12:13:50,339] {models.py:345} INFO - Failing jobs without heartbeat after 2018-05-28 06:38:50.339147
[2018-05-28 12:13:50,344] {jobs.py:375} INFO - Processing /Users/gkumar6/airflow/dags/tutorial.py took 0.048 seconds
And the status of the job in the UI is stuck at running. Is there something I need to configure to solve this issue?
It seems that it's not a "Failing jobs" problem but a logging problem. Here's what I found when I tried to fix this problem.
Does this message indicate that there is something wrong that I should be concerned about?
No.
"Finding 'running' jobs" and "Failing jobs..." are INFO level logs
generated from find_zombies function of heartbeat utility. So there will be logs generated every
heartbeat interval even if you don't have any failing jobs
running.
How do I turn it off?
The logging_level option in airflow.cfg does not control the scheduler logging.
There is one hard-coded value in airflow/settings.py:
LOGGING_LEVEL = logging.INFO
You could change this to:
LOGGING_LEVEL = logging.WARN
Then restart the scheduler and the problem will be gone.
I think that, for point 2, if you just change logging_level = INFO to WARN in airflow.cfg, you won't get the INFO logs; you don't need to modify the settings.py file.
The first stage of my Spark job is quite simple (a rough sketch of it follows this list). It:
reads from a large number of files (around 30,000 files and 100 GB in total) -> RDD[String]
does a map (to parse each line) -> RDD[Map[String,Any]]
filters -> RDD[Map[String,Any]]
coalesces (.coalesce(100, true))
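Here is a rough PySpark sketch of that stage, just to make its shape concrete; the input path and the parse_line / keep_record helpers are placeholders, not the actual job (which may well be written in Scala):
# Rough sketch of the first stage described above. The input path and the
# parse_line / keep_record helpers are placeholders, not the real job.
from pyspark import SparkContext

def parse_line(line):
    # stand-in parser: turn "key=value" tokens into a dict (the real parsing is unknown)
    return dict(token.split("=", 1) for token in line.split() if "=" in token)

def keep_record(record):
    # stand-in filter: keep records that carry some required field (assumption)
    return "status" in record

sc = SparkContext(appName="first-stage-sketch")
lines = sc.textFile("hdfs:///data/input/*")    # ~30,000 files, ~100 GB -> RDD of lines
records = lines.map(parse_line)                # RDD[Map[String,Any]] in the Scala version
kept = records.filter(keep_record)
result = kept.coalesce(100, shuffle=True)      # matches .coalesce(100, true)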
When running it, I observe quite peculiar behavior. The number of executors grows until the limit I specified in spark.dynamicAllocation.maxExecutors (typically 100 or 200 in my application). Then it starts decreasing quickly (at approx. 14000/33428 tasks) and only a few executors remain; they are killed by the driver. When this task is done, the number of executors increases back to its maximum value.
Below is a screenshot of the number of executors at its lowest.
And here is a screenshot of the task summary.
I guess that these executors are killed because they are idle. But in that case I do not understand why they would become idle; there are still a lot of tasks to do in the stage...
Do you have any idea why this happens?
EDIT
More details about the driver logs when an executor is killed:
16/09/30 12:23:33 INFO cluster.YarnClusterSchedulerBackend: Disabling executor 91.
16/09/30 12:23:33 INFO scheduler.DAGScheduler: Executor lost: 91 (epoch 0)
16/09/30 12:23:33 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 91 from BlockManagerMaster.
16/09/30 12:23:33 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(91, server.com, 40923)
16/09/30 12:23:33 INFO storage.BlockManagerMaster: Removed 91 successfully in removeExecutor
16/09/30 12:23:33 INFO cluster.YarnClusterScheduler: Executor 91 on server.com killed by driver.
16/09/30 12:23:33 INFO spark.ExecutorAllocationManager: Existing executor 91 has been removed (new total is 94)
Logs on the executor
16/09/30 12:26:28 INFO rdd.HadoopRDD: Input split: hdfs://...
16/09/30 12:26:32 INFO executor.Executor: Finished task 38219.0 in stage 0.0 (TID 26519). 2312 bytes result sent to driver
16/09/30 12:27:33 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
16/09/30 12:27:33 INFO storage.DiskBlockManager: Shutdown hook called
16/09/30 12:27:33 INFO util.ShutdownHookManager: Shutdown hook called
I'm seeing this problem on executors that are killed as a result of an idle timeout. I have an exceedingly demanding computational load, but it's mostly computed in a UDF, invisible to Spark. I believe there is some Spark parameter that can be adjusted.
Try looking through the spark.executor parameters in https://spark.apache.org/docs/latest/configuration.html#spark-properties and see if anything jumps out.
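If idle executors are indeed being reclaimed too eagerly, the relevant knobs on that page live under spark.dynamicAllocation rather than spark.executor, in particular spark.dynamicAllocation.executorIdleTimeout (60s by default). A hedged sketch of raising it follows; the concrete timeout values are placeholders, and the maxExecutors figure is taken from the question:
# Hedged sketch: raise the dynamic-allocation idle timeout so executors busy inside
# long UDF computations are not reclaimed as "idle". The timeout values are placeholders.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.shuffle.service.enabled", "true")                   # required for dynamic allocation
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.maxExecutors", "200")             # limit mentioned in the question
        .set("spark.dynamicAllocation.executorIdleTimeout", "300s")     # default is 60s
        .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "600s"))

sc = SparkContext(conf=conf)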