Spark driver disassociated and removed by the master - scala

I have set up a cluster with two slaves and one master, and I submit a jar (Scala) to the Spark master (192.168.1.64):
spark-submit --master spark://spark-master:7077 --class tests.elements target/scala-2.10/zzz-project_2.10-1.0.jar
After running just fine for quite some time, it stops abruptly, with the last lines on the terminal being:
...
15/08/19 17:45:24 INFO scheduler.TaskSchedulerImpl: Adding task set 411292.0 with 6 tasks
15/08/19 17:45:24 WARN scheduler.TaskSetManager: Stage 411292 contains a task of very large size (2762 KB). The maximum recommended task size is 100 KB.
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 411292.0 (TID 1832, 192.168.1.64, PROCESS_LOCAL, 2828792 bytes)
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 411292.0 (TID 1833, 192.168.1.62, PROCESS_LOCAL, 2310009 bytes)
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 411292.0 (TID 1834, 192.168.1.64, PROCESS_LOCAL, 2669188 bytes)
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 411292.0 (TID 1835, 192.168.1.62, PROCESS_LOCAL, 2295676 bytes)
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 411292.0 (TID 1836, 192.168.1.64, PROCESS_LOCAL, 2847786 bytes)
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 411292.0 (TID 1837, 192.168.1.64, PROCESS_LOCAL, 2913528 bytes)
Killed
and the error in the master log is the following:
...
15/08/19 16:09:49 INFO master.Master: Launching executor app-20150819160949-0001/0 on worker worker-20150819160925-192.168.1.64-51640
15/08/19 16:09:49 INFO master.Master: Launching executor app-20150819160949-0001/1 on worker worker-20150819160938-192.168.1.62-38007
15/08/19 16:15:44 INFO master.Master: akka.tcp://sparkDriver@192.168.1.64:46823 got disassociated, removing it.
15/08/19 16:15:44 INFO master.Master: Removing app app-20150819160949-0001
15/08/19 16:15:44 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.1.64:46823] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/08/19 16:15:44 WARN master.Master: Application testPageRank is still in progress, it may be terminated abnormally.
...
Both workers have something like this in their logs:
...
15/08/19 16:15:49 INFO worker.Worker: Executor app-20150819160949-0001/0 finished with state EXITED message Command exited with code 1 exitStatus 1
15/08/19 16:15:50 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@192.168.1.64:54799] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
and
...
15/08/19 16:15:43 INFO worker.Worker: Executor app-20150819160949-0001/1 finished with state EXITED message Command exited with code 1 exitStatus 1
15/08/19 16:15:43 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@192.168.1.62:53325] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
respectively. The work/app files contain something like this
...
15/08/19 16:15:41 INFO executor.Executor: Finished task 1.0 in stage 387758.0 (TID 1803). 1911 bytes result sent to driver
15/08/19 16:15:41 INFO executor.Executor: Finished task 4.0 in stage 387758.0 (TID 1806). 1911 bytes result sent to driver
15/08/19 16:15:41 INFO storage.BlockManager: Found block rdd_1206_5 locally
15/08/19 16:15:41 INFO executor.Executor: Finished task 5.0 in stage 387758.0 (TID 1807). 1911 bytes result sent to driver
15/08/19 16:15:41 INFO storage.BlockManager: Found block rdd_1206_3 locally
15/08/19 16:15:41 INFO executor.Executor: Finished task 3.0 in stage 387758.0 (TID 1805). 1911 bytes result sent to driver
15/08/19 16:15:44 ERROR executor.CoarseGrainedExecutorBackend: Driver 192.168.1.64:46823 disassociated! Shutting down.
15/08/19 16:15:44 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.1.64:46823] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/08/19 16:15:45 INFO storage.DiskBlockManager: Shutdown hook called
15/08/19 16:15:46 INFO util.Utils: Shutdown hook called
and
...
15/08/19 16:15:41 INFO storage.BlockManager: Found block rdd_1206_0 locally
15/08/19 16:15:41 INFO executor.Executor: Finished task 2.0 in stage 387758.0 (TID 1804). 1911 bytes result sent to driver
15/08/19 16:15:41 INFO executor.Executor: Finished task 0.0 in stage 387758.0 (TID 1802). 1911 bytes result sent to driver
15/08/19 16:15:42 ERROR executor.CoarseGrainedExecutorBackend: Driver 192.168.1.64:46823 disassociated! Shutting down.
15/08/19 16:15:42 INFO storage.DiskBlockManager: Shutdown hook called
15/08/19 16:15:42 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.1.64:46823] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/08/19 16:15:42 INFO util.Utils: Shutdown hook called
respectively. There seem to be no other errors in HDFS or Spark.
I suspect that the error lies in the third line of the master log excerpt (15/08/19 16:15:44 INFO master.Master: akka.tcp://sparkDriver@192.168.1.64:46823 got disassociated, removing it.), but I can't figure out why. I tried changing spark.akka.heartbeat.interval to 100, as suggested in some posts, but with no luck. Does anyone know why this happens and how to solve it? Thanks so much.

As mentioned in a very similar question ("WARN ReliableDeliverySupervisor: Association with remote system has failed, address is now gated for [5000] ms. Reason: [Disassociated]"), the problem is likely to be a lack of memory. Adding more memory (or, in that case, more nodes) should solve the problem.
(Alternatively, making the job need less memory should work too, of course.)
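The driver log in the question offers one place to look for memory pressure: TaskSetManager warns that each task is serialized at roughly 2 to 3 MB, far above the recommended 100 KB. The following is a minimal Python sketch, not part of the original answer, that scans log lines for that warning. The regular expression is derived only from the log line quoted in the question, and the function name `find_large_tasks` is hypothetical. It reports the warning; it does not prove that memory is the cause.

```python
import re

# Pattern derived from the TaskSetManager warning shown in the question's log.
LARGE_TASK = re.compile(
    r"Stage (\d+) contains a task of very large size \((\d+) KB\)\. "
    r"The maximum recommended task size is (\d+) KB\."
)

def find_large_tasks(log_lines):
    """Return (stage, size_kb, recommended_kb) for every large-task warning found."""
    hits = []
    for line in log_lines:
        m = LARGE_TASK.search(line)
        if m:
            hits.append(tuple(int(g) for g in m.groups()))
    return hits

# Sample line copied from the question.
sample = [
    "15/08/19 17:45:24 WARN scheduler.TaskSetManager: Stage 411292 contains a task "
    "of very large size (2762 KB). The maximum recommended task size is 100 KB.",
]

for stage, size_kb, limit_kb in find_large_tasks(sample):
    print(f"stage {stage}: {size_kb} KB, {size_kb / limit_kb:.1f}x the recommended size")
```

If the reported sizes keep growing from stage to stage, that would be consistent with the job needing more and more memory over time, which is where "needing less memory" could start.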

Related

Spark Driver terminates with code 137 and no error message. What is the cause?

My Spark program is failing, and neither the scheduler, the driver, nor the executors provide any useful error apart from Exit status 137. What could be causing Spark to fail?
The crash seems to happen during the conversion of an RDD to a DataFrame:
val df = sqlc.createDataFrame(processedData, schema).persist()
Right before the crash, the logs look like this:
Scheduler
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 WARN TaskSetManager: Stage 11 contains a task of very large size (22028 KB). The maximum recommended task size is 100 KB.
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 INFO TaskSetManager: Starting task 0.0 in stage 11.0 (TID 23, 10.141.1.247, executor 1133b735-967d-136c-2bbf-ffcb3884c88c-1548129213980, partition 0, PROCESS_LOCAL, 22557269 bytes)
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 INFO TaskSetManager: Starting task 1.0 in stage 11.0 (TID 24, 10.141.3.144, executor a92ceb18-b46a-c986-4672-cab9086c54c2-1548129202094, partition 1, PROCESS_LOCAL, 22558910 bytes)
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 INFO TaskSetManager: Starting task 2.0 in stage 11.0 (TID 25, 10.141.1.56, executor b9167d92-bed2-fe21-46fd-08f2c6fd1998-1548129206680, partition 2, PROCESS_LOCAL, 22558910 bytes)
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 INFO TaskSetManager: Starting task 3.0 in stage 11.0 (TID 26, 10.141.3.146, executor 0cf7394b-540d-2a6c-258a-e27bbedbdd0e-1548129212488, partition 3, PROCESS_LOCAL, 22558910 bytes)
19/01/22 04:01:09 DEBUG JobUtils: Tracing alloc 12943f1a-82ed-d4f4-07b3-dfbe5a46716b for driver
...
19/01/22 04:13:45 DEBUG JobUtils: Tracing alloc 12943f1a-82ed-d4f4-07b3-dfbe5a46716b for driver
19/01/22 04:13:46 INFO JobUtils: driver Terminated -- Exit status 137
19/01/22 04:13:46 INFO JobUtils: driver Restarting -- Restart within policy
Driver
19/01/22 04:01:12 INFO DAGScheduler: Job 7 finished: runJob at SparkHadoopMapReduceWriter.scala:88, took 8.008375 s
19/01/22 04:01:12 INFO SparkHadoopMapReduceWriter: Job job_20190122040104_0032 committed.
19/01/22 04:01:13 INFO MapPartitionsRDD: Removing RDD 28 from persistence list
19/01/22 04:01:13 INFO BlockManager: Removing RDD 28
Executors (Some variation of this)
19/01/22 04:01:13 INFO BlockManager: Removing RDD 28
19/01/22 04:13:45 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver 10.141.2.48:21297 disassociated! Shutting down.
19/01/22 04:13:45 INFO DiskBlockManager: Shutdown hook called
19/01/22 04:13:45 INFO ShutdownHookManager: Shutdown hook called
19/01/22 04:13:45 INFO ShutdownHookManager: Deleting directory /alloc/spark-ce736cb6-8b8e-4891-b9c7-06ea9d9cf797
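A note on reading the status itself: by the usual shell convention, an exit status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9 (SIGKILL). SIGKILL cannot be caught, which would fit a driver log that simply stops without an error. A common sender is the kernel's out-of-memory killer or a container memory limit, but the logs above do not confirm who sent it. The sketch below is a hedged illustration of that decoding; the helper name `describe_exit_status` is hypothetical.

```python
import signal

def describe_exit_status(status):
    """Interpret a shell-style exit status; values above 128 mean 128 + signal number."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The platform has no named signal with this number.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with code {status}"

print(describe_exit_status(137))
```

On Linux this reports SIGKILL for 137, which points the investigation at whatever killed the driver process from outside rather than at an exception inside Spark.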

Spark unable to read Kafka topic and gives error "unable to connect to zookeeper server within timeout 6000"

I'm trying to run the example program shipped in the Spark directory on an HDP cluster, "/spark2/examples/src/main/python/streaming/kafka_wordcount.py", which tries to read a Kafka topic but fails with a ZooKeeper server timeout error.
Spark is installed on the HDP cluster and Kafka is running on an HDF cluster. The two are separate clusters in the same VPC on AWS.
The command executed to run the Spark example on the HDP cluster is "bin/spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.3.0.jar examples/src/main/python/streaming/kafka_wordcount.py HDF-cluster-ip-address:2181 topic"
Error output:
-------------------------------------------
Time: 2018-06-20 07:51:56
-------------------------------------------
18/06/20 07:51:56 INFO JobScheduler: Finished job streaming job 1529481116000 ms.0 from job set of time 1529481116000 ms
18/06/20 07:51:56 INFO JobScheduler: Total delay: 0.171 s for time 1529481116000 ms (execution: 0.145 s)
18/06/20 07:51:56 INFO PythonRDD: Removing RDD 94 from persistence list
18/06/20 07:51:56 INFO BlockManager: Removing RDD 94
18/06/20 07:51:56 INFO BlockRDD: Removing RDD 89 from persistence list
18/06/20 07:51:56 INFO BlockManager: Removing RDD 89
18/06/20 07:51:56 INFO KafkaInputDStream: Removing blocks of RDD BlockRDD[89] at createStream at NativeMethodAccessorImpl.java:0 of time 1529481116000 ms
18/06/20 07:51:56 INFO ReceivedBlockTracker: Deleting batches: 1529481114000 ms
18/06/20 07:51:56 INFO InputInfoTracker: remove old batch metadata: 1529481114000 ms
18/06/20 07:51:57 INFO JobScheduler: Added jobs for time 1529481117000 ms
18/06/20 07:51:57 INFO JobScheduler: Starting job streaming job 1529481117000 ms.0 from job set of time 1529481117000 ms
18/06/20 07:51:57 INFO SparkContext: Starting job: runJob at PythonRDD.scala:141
18/06/20 07:51:57 INFO DAGScheduler: Registering RDD 107 (call at /usr/hdp/2.6.5.0-292/spark2/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py:2257)
18/06/20 07:51:57 INFO DAGScheduler: Got job 27 (runJob at PythonRDD.scala:141) with 1 output partitions
18/06/20 07:51:57 INFO DAGScheduler: Final stage: ResultStage 54 (runJob at PythonRDD.scala:141)
18/06/20 07:51:57 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 53)
18/06/20 07:51:57 INFO DAGScheduler: Missing parents: List()
18/06/20 07:51:57 INFO DAGScheduler: Submitting ResultStage 54 (PythonRDD[111] at RDD at PythonRDD.scala:48), which has no missing parents
18/06/20 07:51:57 INFO MemoryStore: Block broadcast_27 stored as values in memory (estimated size 7.0 KB, free 366.0 MB)
18/06/20 07:51:57 INFO MemoryStore: Block broadcast_27_piece0 stored as bytes in memory (estimated size 4.1 KB, free 366.0 MB)
18/06/20 07:51:57 INFO BlockManagerInfo: Added broadcast_27_piece0 in memory on ip-10-29-3-74.ec2.internal:46231 (size: 4.1 KB, free: 366.2 MB)
18/06/20 07:51:57 INFO SparkContext: Created broadcast 27 from broadcast at DAGScheduler.scala:1039
18/06/20 07:51:57 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 54 (PythonRDD[111] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(0))
18/06/20 07:51:57 INFO TaskSchedulerImpl: Adding task set 54.0 with 1 tasks
18/06/20 07:51:57 INFO TaskSetManager: Starting task 0.0 in stage 54.0 (TID 53, localhost, executor driver, partition 0, PROCESS_LOCAL, 7649 bytes)
18/06/20 07:51:57 INFO Executor: Running task 0.0 in stage 54.0 (TID 53)
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -881, init = 921, finish = 0
18/06/20 07:51:57 INFO PythonRunner: Times: total = 41, boot = -881, init = 922, finish = 0
18/06/20 07:51:57 INFO Executor: Finished task 0.0 in stage 54.0 (TID 53). 1493 bytes result sent to driver
18/06/20 07:51:57 INFO TaskSetManager: Finished task 0.0 in stage 54.0 (TID 53) in 48 ms on localhost (executor driver) (1/1)
18/06/20 07:51:57 INFO TaskSchedulerImpl: Removed TaskSet 54.0, whose tasks have all completed, from pool
18/06/20 07:51:57 INFO DAGScheduler: ResultStage 54 (runJob at PythonRDD.scala:141) finished in 0.055 s
18/06/20 07:51:57 INFO DAGScheduler: Job 27 finished: runJob at PythonRDD.scala:141, took 0.058062 s
18/06/20 07:51:57 INFO ZooKeeper: Session: 0x0 closed
18/06/20 07:51:57 INFO SparkContext: Starting job: runJob at PythonRDD.scala:141
18/06/20 07:51:57 INFO DAGScheduler: Got job 28 (runJob at PythonRDD.scala:141) with 3 output partitions
18/06/20 07:51:57 INFO DAGScheduler: Final stage: ResultStage 56 (runJob at PythonRDD.scala:141)
18/06/20 07:51:57 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 55)
18/06/20 07:51:57 INFO DAGScheduler: Missing parents: List()
18/06/20 07:51:57 INFO DAGScheduler: Submitting ResultStage 56 (PythonRDD[112] at RDD at PythonRDD.scala:48), which has no missing parents
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Stopping receiver with message: Error starting receiver 0: org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Called receiver onStop
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Deregistering receiver 0
18/06/20 07:51:57 INFO MemoryStore: Block broadcast_28 stored as values in memory (estimated size 7.0 KB, free 365.9 MB)
18/06/20 07:51:57 INFO MemoryStore: Block broadcast_28_piece0 stored as bytes in memory (estimated size 4.1 KB, free 365.9 MB)
18/06/20 07:51:57 INFO ClientCnxn: EventThread shut down
18/06/20 07:51:57 INFO BlockManagerInfo: Added broadcast_28_piece0 in memory on ip-10-29-3-74.ec2.internal:46231 (size: 4.1 KB, free: 366.2 MB)
18/06/20 07:51:57 INFO SparkContext: Created broadcast 28 from broadcast at DAGScheduler.scala:1039
18/06/20 07:51:57 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 56 (PythonRDD[112] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(1, 2, 3))
18/06/20 07:51:57 INFO TaskSchedulerImpl: Adding task set 56.0 with 3 tasks
18/06/20 07:51:57 INFO TaskSetManager: Starting task 0.0 in stage 56.0 (TID 54, localhost, executor driver, partition 1, PROCESS_LOCAL, 7649 bytes)
18/06/20 07:51:57 INFO TaskSetManager: Starting task 1.0 in stage 56.0 (TID 55, localhost, executor driver, partition 2, PROCESS_LOCAL, 7649 bytes)
18/06/20 07:51:57 INFO TaskSetManager: Starting task 2.0 in stage 56.0 (TID 56, localhost, executor driver, partition 3, PROCESS_LOCAL, 7649 bytes)
18/06/20 07:51:57 INFO Executor: Running task 1.0 in stage 56.0 (TID 55)
18/06/20 07:51:57 INFO Executor: Running task 2.0 in stage 56.0 (TID 56)
18/06/20 07:51:57 INFO Executor: Running task 0.0 in stage 56.0 (TID 54)
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/20 07:51:57 ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Stopped receiver 0
18/06/20 07:51:57 INFO BlockGenerator: Stopping BlockGenerator
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -947, init = 987, finish = 0
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -947, init = 987, finish = 0
18/06/20 07:51:57 INFO PythonRunner: Times: total = 41, boot = -944, init = 985, finish = 0
18/06/20 07:51:57 INFO Executor: Finished task 1.0 in stage 56.0 (TID 55). 1536 bytes result sent to driver
18/06/20 07:51:57 INFO TaskSetManager: Finished task 1.0 in stage 56.0 (TID 55) in 52 ms on localhost (executor driver) (1/3)
18/06/20 07:51:57 INFO PythonRunner: Times: total = 45, boot = -944, init = 989, finish = 0
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -32, init = 72, finish = 0
18/06/20 07:51:57 INFO Executor: Finished task 0.0 in stage 56.0 (TID 54). 1536 bytes result sent to driver
18/06/20 07:51:57 INFO TaskSetManager: Finished task 0.0 in stage 56.0 (TID 54) in 56 ms on localhost (executor driver) (2/3)
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -33, init = 73, finish = 0
18/06/20 07:51:57 INFO Executor: Finished task 2.0 in stage 56.0 (TID 56). 1536 bytes result sent to driver
18/06/20 07:51:57 INFO TaskSetManager: Finished task 2.0 in stage 56.0 (TID 56) in 58 ms on localhost (executor driver) (3/3)
18/06/20 07:51:57 INFO TaskSchedulerImpl: Removed TaskSet 56.0, whose tasks have all completed, from pool
18/06/20 07:51:57 INFO DAGScheduler: ResultStage 56 (runJob at PythonRDD.scala:141) finished in 0.063 s
18/06/20 07:51:57 INFO DAGScheduler: Job 28 finished: runJob at PythonRDD.scala:141, took 0.065728 s
-------------------------------------------
Time: 2018-06-20 07:51:57
-------------------------------------------
18/06/20 07:51:57 INFO JobScheduler: Finished job streaming job 1529481117000 ms.0 from job set of time 1529481117000 ms
18/06/20 07:51:57 INFO JobScheduler: Total delay: 0.169 s for time 1529481117000 ms (execution: 0.149 s)
18/06/20 07:51:57 INFO PythonRDD: Removing RDD 102 from persistence list
18/06/20 07:51:57 INFO BlockManager: Removing RDD 102
18/06/20 07:51:57 INFO BlockRDD: Removing RDD 97 from persistence list
18/06/20 07:51:57 INFO KafkaInputDStream: Removing blocks of RDD BlockRDD[97] at createStream at NativeMethodAccessorImpl.java:0 of time 1529481117000 ms
18/06/20 07:51:57 INFO BlockManager: Removing RDD 97
18/06/20 07:51:57 INFO ReceivedBlockTracker: Deleting batches: 1529481115000 ms
18/06/20 07:51:57 INFO InputInfoTracker: remove old batch metadata: 1529481115000 ms
18/06/20 07:51:57 INFO RecurringTimer: Stopped timer for BlockGenerator after time 1529481117400
18/06/20 07:51:57 INFO BlockGenerator: Waiting for block pushing thread to terminate
18/06/20 07:51:57 INFO BlockGenerator: Pushing out the last 0 blocks
18/06/20 07:51:57 INFO BlockGenerator: Stopped block pushing thread
18/06/20 07:51:57 INFO BlockGenerator: Stopped BlockGenerator
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Waiting for receiver to be stopped
18/06/20 07:51:57 ERROR ReceiverSupervisorImpl: Stopped receiver with error: org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
18/06/20 07:51:57 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/06/20 07:51:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/06/20 07:51:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
18/06/20 07:51:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/06/20 07:51:57 INFO TaskSchedulerImpl: Cancelling stage 0
18/06/20 07:51:57 INFO DAGScheduler: ResultStage 0 (start at NativeMethodAccessorImpl.java:0) failed in 13.256 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
18/06/20 07:51:57 ERROR ReceiverTracker: Receiver has been stopped. Try to restart it.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Even within the same VPC, check the security groups of the two systems. If they are in different security groups, you probably need to allow the required inbound and outbound ports (such as 2181, the ZooKeeper port used in the command above). Another way to verify connectivity is to telnet and ping each system from the other.
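Where telnet is not installed, the same reachability check can be approximated in a few lines of Python. This is a minimal sketch, not part of the original answer: the function name `can_connect` is hypothetical, and it only tests whether a TCP connection opens, not whether ZooKeeper itself is healthy.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run from a Spark node, `can_connect("HDF-cluster-ip-address", 2181)` (with the placeholder replaced by the real ZooKeeper host) returning False would point to a security group, firewall, or routing problem rather than to Spark or Kafka configuration.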

Finished task before parsing query

I am trying to run a Spark application on AWS EMR. I have written the whole program in Spark SQL. Since the program was taking too long to complete, I checked the log output and observed that executors were already executing tasks, but I did not find any log entries about parsing the SQL commands.
Here is a snippet of the log info.
17/02/12 04:32:56 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 1603 on executor id: 20 hostname: ip-10-11-203-20.ec2.internal.
17/02/12 04:32:56 INFO TaskSetManager: Finished task 179.0 in stage 32.0 (TID 1585) in 42268 ms on ip-10-11-203-20.ec2.internal (182/200)
17/02/12 04:33:02 INFO TaskSetManager: Starting task 198.0 in stage 32.0 (TID 1604, ip-10-178-43-214.ec2.internal, partition 198, NODE_LOCAL, 5295 bytes)
17/02/12 04:33:02 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 1604 on executor id: 13 hostname: ip-10-178-43-214.ec2.internal.
17/02/12 04:33:02 INFO TaskSetManager: Finished task 180.0 in stage 32.0 (TID 1588) in 39417 ms on ip-10-178-43-214.ec2.internal (183/200)
17/02/12 04:33:03 INFO TaskSetManager: Starting task 199.0 in stage 32.0 (TID 1605, ip-10-11-203-20.ec2.internal, partition 199, NODE_LOCAL, 5295 bytes)
17/02/12 04:33:03 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 1605 on executor id: 18 hostname: ip-10-11-203-20.ec2.internal.
17/02/12 04:33:03 INFO TaskSetManager: Finished task 183.0 in stage 32.0 (TID 1589) in 38574 ms on ip-10-11-203-20.ec2.internal (184/200)
17/02/12 04:33:04 INFO TaskSetManager: Finished task 186.0 in stage 32.0 (TID 1592) in 34329 ms on ip-10-11-203-20.ec2.internal (185/200)
17/02/12 04:33:15 INFO TaskSetManager: Finished task 187.0 in stage 32.0 (TID 1593) in 38905 ms on ip-10-178-43-214.ec2.internal (186/200)
Can anyone please explain what is going on here? Thanks.
Parsing SQL is actually quite fast in Spark, and if you look at the beginning of the logs you will surely find the parsing entries.
What you see now is just the execution of the query: Spark divides every execution stage into tasks (to achieve parallel execution), and these "Finished task" lines simply tell you that your query is in progress.

Subclassing SparkException doesn't allow for message passing between Master and Workers

I need to build an application where a master node distributes a large dataset to a number of worker nodes for parallel processing. I'm running this application on a single machine and JVM, therefore I've called setMaster("local[4]") on my SparkConf object. I'm using Spark 1.5.2 and Scala 2.10.5 through IntelliJ.
If a certain condition occurs in the portions of the dataset handled by the executors, I need the master node to be notified and perform some action. In addition to that, I need the other executors to die. To that end, I looked around the Scala Spark API and realized that SparkException allows me to do the first portion of what I'm looking for, by propagating the exception (which is Serializable, by the way) to the driver. I have verified this experimentally, as follows:
def main(args: Array[String]) = {
  val conf = new SparkConf().setAppName("Spark Exceptions").setMaster("local[4]")
  val sc = new SparkContext(conf)
  val l = Range(1, 5000)
  val parl = sc.parallelize(l, 8)
  val mappedRDD = parl.map(func)
  try {
    val res = mappedRDD.collect()
    println(res)
  } catch {
    case s: SparkException => println("A worker threw an exception.")
    case t: Throwable => throw(t)
  }
}
def func(i: Int) = {
  if (i == 1 || i == 4000)
    throw new SparkException("Bad number detected.")
  else
    Math.pow(i, 2)
}
If you look closely at the example above, you will note that since the original Range contains both 1 and 4000, two failures are guaranteed in the worker nodes. Indeed, I see two executors failing in stderr, while my stdout is populated with:
A worker threw an exception.
Process finished with exit code 0
Unfortunately, the SparkException thrown does not kill the other executors, since, as mentioned before, I can see both executors failing in stderr, while two other executors complete their tasks successfully. So my first question is: is there any way I can immediately kill the other executors once this exception is caught by the driver program?
My second question is a little bit more subtle: I'd like some information to be passed from the executors to the driver node about which piece of data caused the error. Sure, I could write to and read from a file, particularly since I'm on the same filesystem, but I'd like a faster and more elegant solution. So I thought I'd subclass SparkException in order to add a field that describes which piece of data caused the error:
import org.apache.spark.SparkException
class WorkerViolation(msg: String, data: Any) extends SparkException(msg) {
  override def toString = "A worker violation occurred: " + msg
  def getData = data
  def this(dat: Any) = this("Error at worker.", dat)
}
The goal is to be able to use the getData accessor to retrieve some information. To that end, I tried modifying the program above, as follows:
...
  catch {
    case w: WorkerViolation => println("A worker threw an exception, with data: " + w.getData)
    case t: Throwable => throw(t)
  }
}
def func(i: Int) = {
  if (i == 1 || i == 4000)
    throw new WorkerViolation("Bad number detected.", i)
  else
    Math.pow(i, 2)
}
Note that this time I'm both throwing and catching WorkerViolations. Unfortunately, this particular exception seems to be killing the driver node as well. The full trace is of course gigantic, yet copied for consistency:
15/12/07 18:31:17 WARN util.Utils: Your hostname, debian resolves to a loopback address: 127.0.1.1; using 192.168.2.222 instead (on interface eth0)
15/12/07 18:31:17 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/12/07 18:31:17 INFO spark.SecurityManager: Changing view acls to: jason
15/12/07 18:31:17 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jason)
15/12/07 18:31:17 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/12/07 18:31:17 INFO Remoting: Starting remoting
15/12/07 18:31:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark#192.168.2.222:33572]
15/12/07 18:31:17 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark#192.168.2.222:33572]
15/12/07 18:31:17 INFO spark.SparkEnv: Registering MapOutputTracker
15/12/07 18:31:17 INFO spark.SparkEnv: Registering BlockManagerMaster
15/12/07 18:31:17 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20151207183117-4300
15/12/07 18:31:17 INFO storage.MemoryStore: MemoryStore started with capacity 2.1 GB.
15/12/07 18:31:17 INFO network.ConnectionManager: Bound socket to port 34704 with id = ConnectionManagerId(192.168.2.222,34704)
15/12/07 18:31:17 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/12/07 18:31:17 INFO storage.BlockManagerInfo: Registering block manager 192.168.2.222:34704 with 2.1 GB RAM
15/12/07 18:31:17 INFO storage.BlockManagerMaster: Registered BlockManager
15/12/07 18:31:17 INFO spark.HttpServer: Starting HTTP Server
15/12/07 18:31:17 INFO server.Server: jetty-8.1.14.v20131031
15/12/07 18:31:17 INFO server.AbstractConnector: Started SocketConnector#0.0.0.0:42426
15/12/07 18:31:17 INFO broadcast.HttpBroadcast: Broadcast server started at http://192.168.2.222:42426
15/12/07 18:31:17 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-0ae72587-14c5-4bfe-a151-2bcafc889ee8
15/12/07 18:31:17 INFO spark.HttpServer: Starting HTTP Server
15/12/07 18:31:17 INFO server.Server: jetty-8.1.14.v20131031
15/12/07 18:31:17 INFO server.AbstractConnector: Started SocketConnector#0.0.0.0:55556
15/12/07 18:31:17 INFO server.Server: jetty-8.1.14.v20131031
15/12/07 18:31:17 INFO server.AbstractConnector: Started SelectChannelConnector#0.0.0.0:4040
15/12/07 18:31:17 INFO ui.SparkUI: Started SparkUI at http://192.168.2.222:4040
15/12/07 18:31:18 INFO spark.SparkContext: Starting job: collect at SparkExceptions.scala:16
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Got job 0 (collect at SparkExceptions.scala:16) with 8 output partitions (allowLocal=false)
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Final stage: Stage 0(collect at SparkExceptions.scala:16)
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Missing parents: List()
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at SparkExceptions.scala:14), which has no missing parents
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Submitting 8 missing tasks from Stage 0 (MappedRDD[1] at map at SparkExceptions.scala:14)
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 8 tasks
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 1350 bytes in 4 ms
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as 1350 bytes in 0 ms
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:2 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:2 as 1350 bytes in 0 ms
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:3 as TID 3 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:3 as 1350 bytes in 1 ms
15/12/07 18:31:18 INFO executor.Executor: Running task ID 3
15/12/07 18:31:18 INFO executor.Executor: Running task ID 1
15/12/07 18:31:18 INFO executor.Executor: Running task ID 0
15/12/07 18:31:18 INFO executor.Executor: Running task ID 2
15/12/07 18:31:18 ERROR executor.Executor: Exception in task ID 0
A worker violation occurred: Bad number detected.
at SparkExceptions$.func(SparkExceptions.scala:26)
at SparkExceptions$$anonfun$1.apply$mcDI$sp(SparkExceptions.scala:14)
at SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
at SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/12/07 18:31:18 INFO executor.Executor: Serialized size of result for 2 is 5565
15/12/07 18:31:18 INFO executor.Executor: Serialized size of result for 1 is 5565
15/12/07 18:31:18 INFO executor.Executor: Sending result for 2 directly to driver
15/12/07 18:31:18 INFO executor.Executor: Sending result for 1 directly to driver
15/12/07 18:31:18 INFO executor.Executor: Serialized size of result for 3 is 5565
15/12/07 18:31:18 INFO executor.Executor: Finished task ID 2
15/12/07 18:31:18 INFO executor.Executor: Finished task ID 1
15/12/07 18:31:18 INFO executor.Executor: Sending result for 3 directly to driver
15/12/07 18:31:18 INFO executor.Executor: Finished task ID 3
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:4 as TID 4 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:4 as 1350 bytes in 0 ms
15/12/07 18:31:18 INFO executor.Executor: Running task ID 4
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:5 as TID 5 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:5 as 1350 bytes in 1 ms
15/12/07 18:31:18 INFO executor.Executor: Running task ID 5
15/12/07 18:31:18 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
15/12/07 18:31:18 INFO executor.Executor: Serialized size of result for 4 is 5565
15/12/07 18:31:18 INFO executor.Executor: Sending result for 4 directly to driver
15/12/07 18:31:18 INFO executor.Executor: Finished task ID 4
15/12/07 18:31:18 WARN scheduler.TaskSetManager: Loss was due to helpers.WorkerViolation
A worker violation occurred: Bad number detected.
at SparkExceptions$.func(SparkExceptions.scala:26)
at SparkExceptions$$anonfun$1.apply$mcDI$sp(SparkExceptions.scala:14)
at SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
at SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/12/07 18:31:18 INFO executor.Executor: Serialized size of result for 5 is 5565
15/12/07 18:31:18 INFO executor.Executor: Sending result for 5 directly to driver
15/12/07 18:31:18 INFO executor.Executor: Finished task ID 5
15/12/07 18:31:18 ERROR scheduler.TaskSetManager: Task 0.0:0 failed 1 times; aborting job
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Finished TID 2 in 27 ms on localhost (progress: 1/8)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Finished TID 1 in 30 ms on localhost (progress: 2/8)
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Stage 0 was cancelled
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Finished TID 4 in 11 ms on localhost (progress: 3/8)
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Failed to run collect at SparkExceptions.scala:16
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 1 times, most recent failure: Exception failure in TID 0 on host localhost: A worker violation occurred: Bad number detected.
SparkExceptions$.func(SparkExceptions.scala:26)
SparkExceptions$$anonfun$1.apply$mcDI$sp(SparkExceptions.scala:14)
SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Finished TID 5 in 11 ms on localhost (progress: 4/8)
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Finished TID 3 in 34 ms on localhost (progress: 5/8)
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
Process finished with exit code 1
So my second question would then be: Why does throwing an exception of a class derived from SparkException kill the driver program as well? Is there a different strategy I can use for executor-driver communication?
FWIW, I have decided that in order to allow for a higher degree of message-passing between nodes, going down to the level of akka actors is the preferred way to go.
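One reading of the trace above: the exception that reaches the main thread is an org.apache.spark.SparkException ("Job aborted due to stage failure") rather than a WorkerViolation. If so, the case w: WorkerViolation pattern would not match, and the Throwable case would rethrow and end the program. An alternative that is not from the original post is to return the offending data as an ordinary value instead of throwing. The sketch below uses a plain Scala collection so that it runs without Spark; the assumption is that the same check function could be passed to RDD.map and the result inspected after collect(). All names in it are hypothetical.

```scala
object ViolationAsValue {
  // Hypothetical record describing the offending input (case classes are serializable).
  final case class Violation(msg: String, data: Int)

  // Same condition as func in the question, but the bad input is returned, not thrown.
  def check(i: Int): Either[Violation, Double] =
    if (i == 1 || i == 4000) Left(Violation("Bad number detected.", i))
    else Right(math.pow(i, 2))

  // Separates the offending inputs from the successfully computed values.
  def split(results: Seq[Either[Violation, Double]]): (Seq[Violation], Seq[Double]) =
    (results.collect { case Left(v) => v }, results.collect { case Right(x) => x })
}
```

For example, ViolationAsValue.split(Range(1, 5000).map(ViolationAsValue.check)) yields the two violations (for 1 and 4000) alongside the remaining squares. Note that this only reports the offending data; it does not stop the other tasks early.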

Apache spark message understanding

I would like help understanding this message:
INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 2202921 bytes
What does 2202921 mean here?
My job performs a shuffle, and while reading shuffle files from the previous stage it first prints this message and then, after some time, fails with the error below:
14/11/12 11:09:46 WARN scheduler.TaskSetManager: Lost task 224.0 in stage 4.0 (TID 13938, ip-xx-xxx-xxx-xx.ec2.internal): FetchFailed(BlockManagerId(11, ip-xx-xxx-xxx-xx.ec2.internal, 48073, 0), shuffleId=2, mapId=7468, reduceId=224)
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Marking Stage 4 (coalesce at <console>:49) as failed due to a fetch failure from Stage 3 (map at <console>:42)
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Stage 4 (coalesce at <console>:49) failed in 213.446 s
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Resubmitting Stage 3 (map at <console>:42) and Stage 4 (coalesce at <console>:49) due to fetch failure
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Executor lost: 11 (epoch 2)
14/11/12 11:09:46 INFO storage.BlockManagerMasterActor: Trying to remove executor 11 from BlockManagerMaster.
14/11/12 11:09:46 INFO storage.BlockManagerMaster: Removed 11 successfully in removeExecutor
14/11/12 11:09:46 INFO scheduler.Stage: Stage 3 is now unavailable on executor 11 (11893/12836, false)
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Resubmitting failed stages
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Submitting Stage 3 (MappedRDD[13] at map at <console>:42), which has no missing parents
14/11/12 11:09:46 INFO storage.MemoryStore: ensureFreeSpace(25472) called with curMem=474762, maxMem=11113699737
14/11/12 11:09:46 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 24.9 KB, free 10.3 GB)
14/11/12 11:09:46 INFO storage.MemoryStore: ensureFreeSpace(5160) called with curMem=500234, maxMem=11113699737
14/11/12 11:09:46 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 5.0 KB, free 10.3 GB)
14/11/12 11:09:46 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on ip-xx.ec2.internal:35571 (size: 5.0 KB, free: 10.4 GB)
14/11/12 11:09:46 INFO storage.BlockManagerMaster: Updated info of block broadcast_6_piece0
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Submitting 943 missing tasks from Stage 3 (MappedRDD[13] at map at <console>:42)
14/11/12 11:09:46 INFO cluster.YarnClientClusterScheduler: Adding task set 3.1 with 943 tasks
My code looks like this:
(rdd1 ++ rdd2).map { t => ((t.id), t) }.groupByKey(1280).map {
  case ((id), sequence) =>
    val newrecord = sequence.maxBy {
      case Fact(id, key, type, day, group, c_key, s_key, plan_id, size,
        is_mom, customer_shipment_id, customer_shipment_item_id, asin, company_key, product_line_key, dw_last_updated, measures) => dw_last_updated.toLong
    }
    ((PARTITION_KEY + "=" + newrecord.day.toString + "/part"), (newrecord))
}.coalesce(2048, true).saveAsTextFile("s3://myfolder/PT/test20nodes/")
I chose 1280 because I have 20 nodes with 32 cores each, so 2*32*20 = 1280.
For a shuffle stage, Spark creates ShuffleMapTasks that write their intermediate results to disk. The location information is stored in MapStatuses and sent to the MapOutputTrackerMaster (on the driver).
Then, when the next stage starts to run, it needs these location statuses, so executors ask the MapOutputTrackerMaster for them. The MapOutputTrackerMaster serializes the statuses to bytes and sends them to the executors. The number in the log message is the size of these serialized statuses in bytes.
These statuses are sent via Akka, and Akka limits the maximum message size. You can set the limit via spark.akka.frameSize:
Maximum message size to allow in "control plane" communication (for serialized tasks and task results), in MB. Increase this if your tasks need to send back large results to the driver (e.g. using collect() on a large dataset).
If the size is greater than spark.akka.frameSize, Akka will refuse to deliver the message and your job will fail. The logged size can therefore help you choose a suitable value for spark.akka.frameSize.
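To compare the logged size with a given setting, the arithmetic can be sketched as below. This is an illustration only: it assumes the limit is the configured number of MB times 1024 * 1024 bytes and that the serialized size is compared against it directly, and the object and method names are hypothetical. The frame sizes used in the example (10 and 2) are illustrative values, not a statement about your cluster's configuration.

```scala
object FrameSizeCheck {
  // spark.akka.frameSize is given in MB; convert it to bytes (assuming 1 MB = 1024 * 1024 bytes).
  def frameSizeBytes(frameSizeMb: Int): Long = frameSizeMb.toLong * 1024 * 1024

  // True if a serialized message of messageBytes is no larger than the frame size.
  def fits(messageBytes: Long, frameSizeMb: Int): Boolean =
    messageBytes <= frameSizeBytes(frameSizeMb)
}
```

With the size from the question, 2202921 bytes is roughly 2.1 MB, so FrameSizeCheck.fits(2202921L, 10) is true while FrameSizeCheck.fits(2202921L, 2) is false.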