executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(57)) - Driver commanded a shutdown - how can I debug on the driver side? - scala

I'm getting these logs from the executor (read from the bottom up):
2021-11-30 18:44:42,911 INFO [shutdown-hook-0] util.ShutdownHookManager (Logging.scala:logInfo(57)) - Deleting directory /var/data/spark-0646270c-a2d0-47d4-8e6c-0bc735bc255d/spark-a54cf7e4-baaf-4411-9073-0c1fb1e4cc5b
2021-11-30 18:44:42,910 INFO [shutdown-hook-0] util.ShutdownHookManager (Logging.scala:logInfo(57)) - Shutdown hook called
2021-11-30 18:44:42,902 ERROR [SIGTERM handler] executor.CoarseGrainedExecutorBackend (SignalUtils.scala:$anonfun$registerLogger$2(43)) - RECEIVED SIGNAL TERM
2021-11-30 18:44:42,823 INFO [CoarseGrainedExecutorBackend-stop-executor] storage.BlockManager (Logging.scala:logInfo(57)) - BlockManager stopped
2021-11-30 18:44:42,822 INFO [CoarseGrainedExecutorBackend-stop-executor] memory.MemoryStore (Logging.scala:logInfo(57)) - MemoryStore cleared
2021-11-30 18:44:42,798 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(57)) - Driver commanded a shutdown
How can I enable any kind of logging on the Spark driver to understand what event on the driver triggered the executor shutdown? There is no memory pressure on the driver or the executor; the pod metrics show their usage is well within the limit plus overhead. So it looks like the reason for the shutdown signal isn't a lack of resources, but maybe some hidden exception that isn't logged anywhere.
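One low-effort way to get more driver-side detail is to raise the driver's log level from the application itself (a minimal sketch; DEBUG is just an example level and can be very noisy):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
// raises the root log level on the driver JVM; valid values include INFO, DEBUG, TRACE
spark.sparkContext.setLogLevel("DEBUG")

Alternatively, a custom log4j.properties can be passed to the driver via spark-submit's --driver-java-options (for example -Dlog4j.configuration=file:/path/to/log4j.properties), assuming a log4j 1.x based Spark build.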
Following the advice of #mazaneicha, I have tried setting longer timeouts, but I am still getting the same error:
implicit val spark: SparkSession = SparkSession
  .builder
  .master("local[1]")
  .config(new SparkConf()
    .setIfMissing("spark.master", "local[1]")
    .set("spark.eventLog.dir", "file:///tmp/spark-events")
    .set("spark.dynamicAllocation.executorIdleTimeout", "100s")
    .set("spark.dynamicAllocation.schedulerBacklogTimeout", "100s"))
  .getOrCreate()

The reason for the failure was actually in the logs:
2021-12-01 15:05:46,906 WARN [main] streaming.StreamingQueryManager (Logging.scala:logWarning(69)) - Stopping existing streaming query [id=b13a69d7-5a2f-461e-91a7-a9138c4aa716, runId=9cb31852-d276-42d8-ade6-9839fa97f85c], as a new run is being started.
Why was the query stopped? Because in Scala I was creating streaming queries in a loop over a collection while keeping all the query names and all the checkpoint locations the same. After making them unique (I just used the string values from the collection, as in the sketch below), the problem was gone.
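A minimal sketch of that fix (the Kafka source, console sink, and paths below are illustrative, not the original job): each element of the collection gets its own query name and checkpoint location:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val topics = Seq("orders", "payments")  // hypothetical collection driving the loop

val queries = topics.map { topic =>
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", topic)
    .load()
    .writeStream
    .queryName(s"query-$topic")                                // unique per element
    .option("checkpointLocation", s"/tmp/checkpoints/$topic")  // unique per element
    .format("console")
    .start()
}

spark.streams.awaitAnyTermination()

With identical checkpoint locations, each new start() is treated as a new run of the same query and stops the previous one, which matches the "Stopping existing streaming query ... as a new run is being started" warning above.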

Related

Timeout while streaming messages from message queue

I am processing messages from IBM MQ with a Scala program. It was working fine and stopped working without any code change.
The timeout occurs intermittently, with no specific pattern.
I run the application like this:
spark-submit --conf spark.streaming.driver.writeAheadLog.allowBatching=true --conf spark.streaming.driver.writeAheadLog.batchingTimeout=15000 --class com.ibm.spark.streaming.mq.SparkMQExample --master yarn --deploy-mode client --num-executors 1 $jar_file_loc lots of args here >> script.out.log 2>> script.err.log < /dev/null
I tried two properties:
spark.streaming.driver.writeAheadLog.batchingTimeout 15000
spark.streaming.driver.writeAheadLog.allowBatching true
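For reference, the same two settings expressed programmatically on the SparkConf (a minimal sketch; the values mirror the spark-submit flags above):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.driver.writeAheadLog.allowBatching", "true")
  .set("spark.streaming.driver.writeAheadLog.batchingTimeout", "15000")  // milliseconds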
See error:
2021-12-14 14:13:05 WARN ReceivedBlockTracker:90 - Exception thrown while writing record: BatchAllocationEvent(1639487580000 ms,AllocatedBlocks(Map(0 -> Queue()))) to the WriteAheadLog.
java.util.concurrent.TimeoutException: Futures timed out after [5000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
at org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:238)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.allocateBlocksToBatch(ReceivedBlockTracker.scala:118)
at org.apache.spark.streaming.scheduler.ReceiverTracker.allocateBlocksToBatch(ReceiverTracker.scala:209)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:248)
at org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$3.apply(JobGenerator.scala:247)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.JobGenerator.generateJobs(JobGenerator.scala:247)
at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:183)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:89)
at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:88)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
2021-12-14 14:13:05 INFO ReceivedBlockTracker:57 - Possibly processed batch 1639487580000 ms needs to be processed again in WAL recovery
2021-12-14 14:13:05 INFO JobScheduler:57 - Added jobs for time 1639487580000 ms
2021-12-14 14:13:05 INFO JobGenerator:57 - Checkpointing graph for time 1639487580000 ms
2021-12-14 14:13:05 INFO DStreamGraph:57 - Updating checkpoint data for time 1639487580000 ms
rdd is empty
2021-12-14 14:13:05 INFO JobScheduler:57 - Starting job streaming job 1639487580000 ms.0 from job set of time 1639487580000 ms
2021-12-14 14:13:05 INFO DStreamGraph:57 - Updated checkpoint data for time 1639487580000 ms
2021-12-14 14:13:05 INFO JobScheduler:57 - Finished job streaming job 1639487580000 ms.0 from job set of time 1639487580000 ms
2021-12-14 14:13:05 INFO JobScheduler:57 - Total delay: 5.011 s for time 1639487580000 ms (execution: 0.001 s)
2021-12-14 14:13:05 INFO CheckpointWriter:57 - Submitted checkpoint of time 1639487580000 ms to writer queue
2021-12-14 14:13:05 INFO BlockRDD:57 - Removing RDD 284 from persistence list
2021-12-14 14:13:05 INFO PluggableInputDStream:57 - Removing blocks of RDD BlockRDD[284] at receiverStream at JmsStreamUtils.scala:64 of time 1639487580000 ms
2021-12-14 14:13:05 INFO BlockManager:57 - Removing RDD 284
2021-12-14 14:13:05 INFO JobGenerator:57 - Checkpointing graph for time 1639487580000 ms
2021-12-14 14:13:05 INFO DStreamGraph:57 - Updating checkpoint data for time 1639487580000 ms
2021-12-14 14:13:05 INFO DStreamGraph:57 - Updated checkpoint data for time 1639487580000 ms
2021-12-14 14:13:05 INFO CheckpointWriter:57 - Submitted checkpoint of time 1639487580000 ms to writer queue
Any kind of information would be useful. Thank you!

JobManager doesn't automatically redirect all requests to the remaining / running TaskManager

Problem Description
2 computers (203, 204)
Created a standalone-mode HA Flink v1.6.1 cluster
Each computer runs a JobManager and a TaskManager (2 task slots)
After I start a job (examples/streaming/SocketWindowWordCount.jar, via ./flink run ../examples/streaming/SocketWindowWordCount.jar --hostname 10.1.2.9 --port 9000) on the JobManager node, I kill the working TaskManager instance.
In the Web Dashboard I can see the job being cancelled and then failing. (Web Dashboard image)
flink-conf.yaml
state.backend: filesystem
state.checkpoints.dir: hdfs://10.1.2.109:8020/wulin/flink-checkpoints
rest.port: 9081
blob.server.port: 6124
query.server.port: 6125
web.tmpdir: /home/flink/deploy/webTmp
web.log.path: /home/flink/deploy/log
io.tmp.dirs: /home/flink/deploy/taskManagerTmp
high-availability: zookeeper
high-availability.zookeeper.quorum: 10.0.1.79:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: flink
high-availability.storageDir: hdfs://10.1.2.109:8020/wulin
security.kerberos.login.principal: xxxx
security.kerberos.login.keytab: /home/ctu/flink/flink-1.6/conf/user.keytab
full logs
log-standalonesession-203
log-taskexecutor-203
log-standalonesession-204
Exception
After killing the working TaskManager, I get an exception like this:
2018-12-28 11:04:27,877 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#hz203:42861] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#hz203:42861]] Caused by: [Connection refused: hz203/10.0.0.203:42861]
2018-12-28 11:04:28,660 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: hz203/10.0.0.203:42861
2018-12-28 11:04:28,660 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#hz203:42861] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#hz203:42861]] Caused by: [Connection refused: hz203/10.0.0.203:42861]
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - The heartbeat of TaskManager with id 0f41bca09600cd25000e19801076fa1f timed out.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Closing TaskExecutor connection 0f41bca09600cd25000e19801076fa1f because: The heartbeat of TaskManager with id 0f41bca09600cd25000e19801076fa1f timed out.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Unregister TaskManager dcf3bb5b7ed2208cf45b658d212fd8d2 from the SlotManager.
2018-12-28 11:04:28,678 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (88aa62ad152f4df6b39a969dd32c0249) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot 0f41bca09600cd25000e19801076fa1f_0 was removed.
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:803)
at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1116)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2018-12-28 11:04:28,680 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Socket Window WordCount (61f55876e79934d515c163d095d706a6) switched from state RUNNING to FAILING.
Submit job
Running ./bin/flink run -d ./examples/streaming/SocketWindowWordCount.jar --port 9000 --hostname 10.1.2.9 produces JM logs like this:
2018-12-28 19:20:01,354 INFO org.apache.flink.runtime.jobmaster.JobMaster - Starting execution of job Socket Window WordCount (5cdb91c15ee12ec6e74256eed10b5291)
2018-12-28 19:20:01,354 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Socket Window WordCount (5cdb91c15ee12ec6e74256eed10b5291) switched from state CREATED to RUNNING.
2018-12-28 19:20:01,356 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from CREATED to SCHEDULED.
2018-12-28 19:20:01,359 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from CREATED to SCHEDULED.
2018-12-28 19:20:01,364 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e33a40832a3922897470fb76bcf76b29}]
2018-12-28 19:20:01,367 INFO org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager akka.tcp://flink#hz203:46596/user/resourcemanager(b22f96303e74df23645fe4567f884b9e)
2018-12-28 19:20:01,370 INFO org.apache.flink.runtime.jobmaster.JobMaster - Resolved ResourceManager address, beginning registration
2018-12-28 19:20:01,370 INFO org.apache.flink.runtime.jobmaster.JobMaster - Registration at ResourceManager attempt 1 (timeout=100ms)
2018-12-28 19:20:01,371 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/5cdb91c15ee12ec6e74256eed10b5291/job_manager_lock.
2018-12-28 19:20:01,371 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job manager 9a31e8b4e8dfbf7b31d6ed3d227648b6#akka.tcp://flink#hz203:46596/user/jobmanager_0 for job 5cdb91c15ee12ec6e74256eed10b5291.
2018-12-28 19:20:01,431 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Registered job manager 9a31e8b4e8dfbf7b31d6ed3d227648b6#akka.tcp://flink#hz203:46596/user/jobmanager_0 for job 5cdb91c15ee12ec6e74256eed10b5291.
2018-12-28 19:20:01,432 INFO org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager, leader id: b22f96303e74df23645fe4567f884b9e.
2018-12-28 19:20:01,433 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Requesting new slot [SlotRequestId{e33a40832a3922897470fb76bcf76b29}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
2018-12-28 19:20:01,434 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 5cdb91c15ee12ec6e74256eed10b5291 with allocation id AllocationID{f7a24e609e2ec618ccb456076049fa3b}.
2018-12-28 19:20:01,510 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from SCHEDULED to DEPLOYING.
2018-12-28 19:20:01,511 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Source: Socket Stream -> Flat Map (1/1) (attempt #0) to hz203
2018-12-28 19:20:01,515 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from SCHEDULED to DEPLOYING.
2018-12-28 19:20:01,515 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (attempt #0) to hz203
2018-12-28 19:20:01,674 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from DEPLOYING to RUNNING.
2018-12-28 19:20:01,708 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from DEPLOYING to RUNNING.
2018-12-28 19:20:43,267 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-61808afb630553305c73a0a23f9231ffd6b2b448-513fbe1e6ddf69d10689eccf4c65da97 from hz203/10.0.0.203:6124
2018-12-28 19:20:48,339 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-dd915bb9821ff6ced34dd5e489966b674de5a48f-7ea2600930e5fc5a4fbb7d47ee198789 from hz203/10.0.0.203:6124
2018-12-28 19:20:52,623 INFO org.apache.flink.runtime.blob.BlobClient - Downloading null/t-61808afb630553305c73a0a23f9231ffd6b2b448-0bd1ab86fa4cc54daeb472079bfbea8c from hz203/10.0.0.203:6124
Kill TM
The question body is limited to 30000 characters, so please see the linked JM logs for what happens after the TaskManager is killed.
The logs indicate that your RestartStrategy has depleted its restart attempts or that no RestartStrategy has been configured. Please check whether you specified a RestartStrategy in your program via env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, 0L)) or in flink-conf.yaml via restart-strategy: fixed-delay. If you want to learn more about Flink's restart strategies check out the documentation.
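For example, a minimal sketch of configuring a fixed-delay restart strategy in the job itself (the attempt count and delay are illustrative):

import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment
// retry the job up to 10 times with no delay between attempts before giving up
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, 0L))

The flink-conf.yaml equivalent would be restart-strategy: fixed-delay together with restart-strategy.fixed-delay.attempts and restart-strategy.fixed-delay.delay.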

What does one enter on the command line to run Spark in a bokeh serve app? Do I simply separate the two command-line entries with &&?

My effort does not work:
/usr/local/spark/spark-2.3.2-bin-hadoop2.7/bin/spark-submit --driver-memory 6g --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.2 runspark.py && bokeh serve --show bokeh_app
runspark.py contains the instantiation of Spark, and bokeh_app is the folder of the Bokeh server app. Spark is being used to update a streaming Dask dataframe.
WHAT HAPPENS:
The Spark instance starts up and loads as it normally would without the Bokeh server. However, as soon as the Bokeh server app kicks in (i.e., the web page opens), the Spark instance shuts down. It doesn't report any errors in the console output.
OUTPUT BELOW:
2018-11-26 21:04:05 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#4f0492c9{/static/sql,null,AVAILABLE,#Spark}
2018-11-26 21:04:06 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
2018-11-26 21:04:06 INFO SparkContext:54 - Invoking stop() from shutdown hook
2018-11-26 21:04:06 INFO AbstractConnector:318 - Stopped Spark#4f3c4272{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
2018-11-26 21:04:06 INFO SparkUI:54 - Stopped Spark web UI at http://192.168.1.25:4041
2018-11-26 21:04:06 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-11-26 21:04:06 INFO MemoryStore:54 - MemoryStore cleared
2018-11-26 21:04:06 INFO BlockManager:54 - BlockManager stopped
2018-11-26 21:04:06 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2018-11-26 21:04:07 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-11-26 21:04:07 INFO SparkContext:54 - Successfully stopped SparkContext
2018-11-26 21:04:07 INFO ShutdownHookManager:54 - Shutdown hook called
2018-11-26 21:04:07 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-c42ce0b3-d49e-48ce-962c-277b42166267
2018-11-26 21:04:07 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bd448b2e-6b0f-467a-9e43-689542c42a6f
2018-11-26 21:04:07 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bd448b2e-6b0f-467a-9e43-689542c42a6f/pyspark-117d2a10-7cb9-4eb3-b4d0-f92f9046522c
2018-11-26 21:04:08,542 Starting Bokeh server version 0.13.0 (running on Tornado 5.1.1)
2018-11-26 21:04:08,547 Bokeh app running at: http://localhost:5006/aion_analytics
2018-11-26 21:04:08,547 Starting Bokeh server with process id: 10769
OK, I found the answer. The idea is simply to embed the Bokeh server in the PySpark code instead of running the Bokeh server from the command line, and to use spark-submit as normal.
https://github.com/bokeh/bokeh/blob/1.0.1/examples/howto/server_embed/standalone_embed.py
I did exactly what is shown in the link above.

zeppelin spark context closed after one paragraph

I have a notebook in Zeppelin containing multiple paragraphs which were running fine earlier; suddenly, after a cluster restart, it has started behaving weirdly.
The first paragraph runs fine while anything that runs afterwards says Connection Refused.
Checking the logs in the $ZEPPELIN_HOME/logs folder (zeppelin-interpreter-spark-root-mn.log, where mn is the machine name) shows:
INFO [2018-02-21 21:42:43,301] ({dispatcher-event-loop-15} Logging.scala[logInfo]:54) - Removed broadcast_12_piece0 on mn5:45284 in memory (size: 88.2 KB, free: 2004.5 MB)
INFO [2018-02-21 21:42:43,401] ({Thread-3} Logging.scala[logInfo]:54) - Invoking stop() from shutdown hook
INFO [2018-02-21 21:42:43,412] ({Thread-3} AbstractConnector.java[doStop]:310) - Stopped Spark#7de3e842{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
INFO [2018-02-21 21:42:43,416] ({Thread-3} Logging.scala[logInfo]:54) - Stopped Spark web UI at http://10.28.37.82:4040
INFO [2018-02-21 21:42:43,440] ({Yarn application state monitor} Logging.scala[logInfo]:54) - Interrupting monitor thread
INFO [2018-02-21 21:42:43,442] ({Thread-3} Logging.scala[logInfo]:54) - Shutting down all executors
INFO [2018-02-21 21:42:43,443] ({dispatcher-event-loop-4} Logging.scala[logInfo]:54) - Asking each executor to shut down
INFO [2018-02-21 21:42:43,447] ({Thread-3} Logging.scala[logInfo]:54) - Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
INFO [2018-02-21 21:42:43,450] ({Thread-3} Logging.scala[logInfo]:54) - Stopped
INFO [2018-02-21 21:42:43,454] ({dispatcher-event-loop-9} Logging.scala[logInfo]:54) - MapOutputTrackerMasterEndpoint stopped!
INFO [2018-02-21 21:42:43,466] ({Thread-3} Logging.scala[logInfo]:54) - MemoryStore cleared
INFO [2018-02-21 21:42:43,466] ({Thread-3} Logging.scala[logInfo]:54) - BlockManager stopped
INFO [2018-02-21 21:42:43,467] ({Thread-3} Logging.scala[logInfo]:54) - BlockManagerMaster stopped
INFO [2018-02-21 21:42:43,471] ({dispatcher-event-loop-0} Logging.scala[logInfo]:54) - OutputCommitCoordinator stopped!
INFO [2018-02-21 21:42:43,472] ({Thread-3} Logging.scala[logInfo]:54) - Successfully stopped SparkContext
INFO [2018-02-21 21:42:43,473] ({Thread-3} Logging.scala[logInfo]:54) - Shutdown hook called
So the shutdown hook is getting called. I have checked other posts on SO (like this and this) but they didn't help. The logs are not much help either.
Do I need to tweak the code to add additional logging to debug this problem? Has someone already faced and resolved the same issue?
It turned out to be a case of bad logging. I had checked the YARN logs as well but couldn't find anything. It turns out that the second paragraph threw a RuntimeException which wasn't visible in any of the logs; when I tried the same command in spark-shell I realized what the problem was and fixed it.
Run the Scala command in spark-shell and see what exception it throws.
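If reproducing the paragraph in spark-shell is inconvenient, another option (a sketch, not specific to Zeppelin) is to wrap the paragraph's body so any swallowed exception is printed explicitly:

import scala.util.{Failure, Success, Try}

Try {
  // ... the body of the failing Zeppelin paragraph goes here ...
} match {
  case Success(_)  => println("paragraph completed")
  case Failure(ex) => ex.printStackTrace()  // prints the hidden RuntimeException and its stack trace
}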

Using Scala IDE and Apache Spark on Windows

I want to start working on a project that uses Spark with Scala on Windows 7.
I downloaded the Apache Spark build pre-built for Hadoop 2.4 (download page) and I can run it from the command prompt (cmd). I can run all of the code in the Spark quick start guide up to the self-contained application section.
Then I downloaded Scala IDE 4.0.0 from its download page (sorry, it's not possible to post more than 2 links).
Now I created a new Scala project and imported the Spark assembly JAR into it. When I try to run the example from the self-contained application section of the quick start page, I get the following errors:
15/03/26 11:59:55 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster#myhost:7077/user/Master...
15/03/26 11:59:58 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster#myhost:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster#myhost:7077
15/03/26 11:59:58 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster#myhost:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: myhost
15/03/26 12:00:15 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster#myhost:7077/user/Master...
15/03/26 12:00:17 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster#myhost:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster#myhost:7077
15/03/26 12:00:17 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster#myhost:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: myhost
15/03/26 12:00:35 INFO AppClient$ClientActor: Connecting to master akka.tcp://sparkMaster#myhost:7077/user/Master...
15/03/26 12:00:37 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster#myhost:7077: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster#myhost:7077
15/03/26 12:00:37 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkMaster#myhost:7077]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: myhost
15/03/26 12:00:55 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
15/03/26 12:00:55 ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up.
15/03/26 12:00:55 WARN SparkDeploySchedulerBackend: Application ID is not initialized yet.
The only line of code that I added to the example is .setMaster("spark://myhost:7077") in the SparkConf definition. I think I need to configure the Scala IDE to use the pre-built Spark on my computer, but I don't know how and couldn't find anything by googling.
Could you help me get the Scala IDE working with Spark on Windows 7?
Thanks in advance
I found the answer:
I should correct the master definition in my code as follows:
replace:
.setMaster("spark://myhost:7077")
with:
.setMaster("local[*]")
Hope that it helps you as well.
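For context, a minimal sketch of the quick-start self-contained application with that change applied (the input path is illustrative); no standalone master needs to be running:

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Simple Application")
      .setMaster("local[*]")  // run against all local cores instead of a remote master
    val sc = new SparkContext(conf)
    val logData = sc.textFile("README.md").cache()  // any local text file
    val numAs = logData.filter(_.contains("a")).count()
    val numBs = logData.filter(_.contains("b")).count()
    println(s"Lines with a: $numAs, lines with b: $numBs")
    sc.stop()
  }
}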