Related
My Spark program is failing, and neither the scheduler, the driver, nor the executors provide any useful error apart from "Exit status 137". What could be causing Spark to fail?
The crash seems to happen during the conversion of an RDD to a DataFrame:
val df = sqlc.createDataFrame(processedData, schema).persist()
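For context, exit status 137 is 128 + 9, i.e. the process was killed with SIGKILL, which on Linux usually means the kernel's OOM killer (or the cluster manager) terminated it, so memory pressure is the first thing to rule out. The 22 MB tasks in the scheduler log below also suggest the partition data itself is being serialized into the tasks; if processedData is built by parallelizing a large local collection, spreading it over more slices keeps each task small. A minimal sketch, assuming that origin (localRows and the slice count are illustrative, not from the original program):
// hypothetical: more slices mean smaller per-task payloads
val processedData = sqlc.sparkContext.parallelize(localRows, numSlices = 200)
val df = sqlc.createDataFrame(processedData, schema).persist()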
Right before the crash, the logs look like this:
Scheduler
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 WARN TaskSetManager: Stage 11 contains a task of very large size (22028 KB). The maximum recommended task size is 100 KB.
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 INFO TaskSetManager: Starting task 0.0 in stage 11.0 (TID 23, 10.141.1.247, executor 1133b735-967d-136c-2bbf-ffcb3884c88c-1548129213980, partition 0, PROCESS_LOCAL, 22557269 bytes)
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 INFO TaskSetManager: Starting task 1.0 in stage 11.0 (TID 24, 10.141.3.144, executor a92ceb18-b46a-c986-4672-cab9086c54c2-1548129202094, partition 1, PROCESS_LOCAL, 22558910 bytes)
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 INFO TaskSetManager: Starting task 2.0 in stage 11.0 (TID 25, 10.141.1.56, executor b9167d92-bed2-fe21-46fd-08f2c6fd1998-1548129206680, partition 2, PROCESS_LOCAL, 22558910 bytes)
19/01/22 04:01:05 INFO JobUtils: stderr: 19/01/22 04:01:04 INFO TaskSetManager: Starting task 3.0 in stage 11.0 (TID 26, 10.141.3.146, executor 0cf7394b-540d-2a6c-258a-e27bbedbdd0e-1548129212488, partition 3, PROCESS_LOCAL, 22558910 bytes)
19/01/22 04:01:09 DEBUG JobUtils: Tracing alloc 12943f1a-82ed-d4f4-07b3-dfbe5a46716b for driver
...
19/01/22 04:13:45 DEBUG JobUtils: Tracing alloc 12943f1a-82ed-d4f4-07b3-dfbe5a46716b for driver
19/01/22 04:13:46 INFO JobUtils: driver Terminated -- Exit status 137
19/01/22 04:13:46 INFO JobUtils: driver Restarting -- Restart within policy
Driver
19/01/22 04:01:12 INFO DAGScheduler: Job 7 finished: runJob at SparkHadoopMapReduceWriter.scala:88, took 8.008375 s
19/01/22 04:01:12 INFO SparkHadoopMapReduceWriter: Job job_20190122040104_0032 committed.
19/01/22 04:01:13 INFO MapPartitionsRDD: Removing RDD 28 from persistence list
19/01/22 04:01:13 INFO BlockManager: Removing RDD 28
Executors (Some variation of this)
19/01/22 04:01:13 INFO BlockManager: Removing RDD 28
19/01/22 04:13:45 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver 10.141.2.48:21297 disassociated! Shutting down.
19/01/22 04:13:45 INFO DiskBlockManager: Shutdown hook called
19/01/22 04:13:45 INFO ShutdownHookManager: Shutdown hook called
19/01/22 04:13:45 INFO ShutdownHookManager: Deleting directory /alloc/spark-ce736cb6-8b8e-4891-b9c7-06ea9d9cf797
I'm trying to execute the example program shipped with Spark on an HDP cluster, "/spark2/examples/src/main/python/streaming/kafka_wordcount.py", which reads from a Kafka topic but fails with a ZooKeeper server timeout error.
Spark is installed on the HDP cluster and Kafka runs on an HDF cluster; they are different clusters, but both are in the same VPC on AWS.
The command executed to run the Spark example on the HDP cluster is:
bin/spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.3.0.jar examples/src/main/python/streaming/kafka_wordcount.py HDF-cluster-ip-address:2181 topic
Error output:
-------------------------------------------
Time: 2018-06-20 07:51:56
-------------------------------------------
18/06/20 07:51:56 INFO JobScheduler: Finished job streaming job 1529481116000 ms.0 from job set of time 1529481116000 ms
18/06/20 07:51:56 INFO JobScheduler: Total delay: 0.171 s for time 1529481116000 ms (execution: 0.145 s)
18/06/20 07:51:56 INFO PythonRDD: Removing RDD 94 from persistence list
18/06/20 07:51:56 INFO BlockManager: Removing RDD 94
18/06/20 07:51:56 INFO BlockRDD: Removing RDD 89 from persistence list
18/06/20 07:51:56 INFO BlockManager: Removing RDD 89
18/06/20 07:51:56 INFO KafkaInputDStream: Removing blocks of RDD BlockRDD[89] at createStream at NativeMethodAccessorImpl.java:0 of time 1529481116000 ms
18/06/20 07:51:56 INFO ReceivedBlockTracker: Deleting batches: 1529481114000 ms
18/06/20 07:51:56 INFO InputInfoTracker: remove old batch metadata: 1529481114000 ms
18/06/20 07:51:57 INFO JobScheduler: Added jobs for time 1529481117000 ms
18/06/20 07:51:57 INFO JobScheduler: Starting job streaming job 1529481117000 ms.0 from job set of time 1529481117000 ms
18/06/20 07:51:57 INFO SparkContext: Starting job: runJob at PythonRDD.scala:141
18/06/20 07:51:57 INFO DAGScheduler: Registering RDD 107 (call at /usr/hdp/2.6.5.0-292/spark2/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py:2257)
18/06/20 07:51:57 INFO DAGScheduler: Got job 27 (runJob at PythonRDD.scala:141) with 1 output partitions
18/06/20 07:51:57 INFO DAGScheduler: Final stage: ResultStage 54 (runJob at PythonRDD.scala:141)
18/06/20 07:51:57 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 53)
18/06/20 07:51:57 INFO DAGScheduler: Missing parents: List()
18/06/20 07:51:57 INFO DAGScheduler: Submitting ResultStage 54 (PythonRDD[111] at RDD at PythonRDD.scala:48), which has no missing parents
18/06/20 07:51:57 INFO MemoryStore: Block broadcast_27 stored as values in memory (estimated size 7.0 KB, free 366.0 MB)
18/06/20 07:51:57 INFO MemoryStore: Block broadcast_27_piece0 stored as bytes in memory (estimated size 4.1 KB, free 366.0 MB)
18/06/20 07:51:57 INFO BlockManagerInfo: Added broadcast_27_piece0 in memory on ip-10-29-3-74.ec2.internal:46231 (size: 4.1 KB, free: 366.2 MB)
18/06/20 07:51:57 INFO SparkContext: Created broadcast 27 from broadcast at DAGScheduler.scala:1039
18/06/20 07:51:57 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 54 (PythonRDD[111] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(0))
18/06/20 07:51:57 INFO TaskSchedulerImpl: Adding task set 54.0 with 1 tasks
18/06/20 07:51:57 INFO TaskSetManager: Starting task 0.0 in stage 54.0 (TID 53, localhost, executor driver, partition 0, PROCESS_LOCAL, 7649 bytes)
18/06/20 07:51:57 INFO Executor: Running task 0.0 in stage 54.0 (TID 53)
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -881, init = 921, finish = 0
18/06/20 07:51:57 INFO PythonRunner: Times: total = 41, boot = -881, init = 922, finish = 0
18/06/20 07:51:57 INFO Executor: Finished task 0.0 in stage 54.0 (TID 53). 1493 bytes result sent to driver
18/06/20 07:51:57 INFO TaskSetManager: Finished task 0.0 in stage 54.0 (TID 53) in 48 ms on localhost (executor driver) (1/1)
18/06/20 07:51:57 INFO TaskSchedulerImpl: Removed TaskSet 54.0, whose tasks have all completed, from pool
18/06/20 07:51:57 INFO DAGScheduler: ResultStage 54 (runJob at PythonRDD.scala:141) finished in 0.055 s
18/06/20 07:51:57 INFO DAGScheduler: Job 27 finished: runJob at PythonRDD.scala:141, took 0.058062 s
18/06/20 07:51:57 INFO ZooKeeper: Session: 0x0 closed
18/06/20 07:51:57 INFO SparkContext: Starting job: runJob at PythonRDD.scala:141
18/06/20 07:51:57 INFO DAGScheduler: Got job 28 (runJob at PythonRDD.scala:141) with 3 output partitions
18/06/20 07:51:57 INFO DAGScheduler: Final stage: ResultStage 56 (runJob at PythonRDD.scala:141)
18/06/20 07:51:57 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 55)
18/06/20 07:51:57 INFO DAGScheduler: Missing parents: List()
18/06/20 07:51:57 INFO DAGScheduler: Submitting ResultStage 56 (PythonRDD[112] at RDD at PythonRDD.scala:48), which has no missing parents
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Stopping receiver with message: Error starting receiver 0: org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Called receiver onStop
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Deregistering receiver 0
18/06/20 07:51:57 INFO MemoryStore: Block broadcast_28 stored as values in memory (estimated size 7.0 KB, free 365.9 MB)
18/06/20 07:51:57 INFO MemoryStore: Block broadcast_28_piece0 stored as bytes in memory (estimated size 4.1 KB, free 365.9 MB)
18/06/20 07:51:57 INFO ClientCnxn: EventThread shut down
18/06/20 07:51:57 INFO BlockManagerInfo: Added broadcast_28_piece0 in memory on ip-10-29-3-74.ec2.internal:46231 (size: 4.1 KB, free: 366.2 MB)
18/06/20 07:51:57 INFO SparkContext: Created broadcast 28 from broadcast at DAGScheduler.scala:1039
18/06/20 07:51:57 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 56 (PythonRDD[112] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(1, 2, 3))
18/06/20 07:51:57 INFO TaskSchedulerImpl: Adding task set 56.0 with 3 tasks
18/06/20 07:51:57 INFO TaskSetManager: Starting task 0.0 in stage 56.0 (TID 54, localhost, executor driver, partition 1, PROCESS_LOCAL, 7649 bytes)
18/06/20 07:51:57 INFO TaskSetManager: Starting task 1.0 in stage 56.0 (TID 55, localhost, executor driver, partition 2, PROCESS_LOCAL, 7649 bytes)
18/06/20 07:51:57 INFO TaskSetManager: Starting task 2.0 in stage 56.0 (TID 56, localhost, executor driver, partition 3, PROCESS_LOCAL, 7649 bytes)
18/06/20 07:51:57 INFO Executor: Running task 1.0 in stage 56.0 (TID 55)
18/06/20 07:51:57 INFO Executor: Running task 2.0 in stage 56.0 (TID 56)
18/06/20 07:51:57 INFO Executor: Running task 0.0 in stage 56.0 (TID 54)
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/20 07:51:57 ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Stopped receiver 0
18/06/20 07:51:57 INFO BlockGenerator: Stopping BlockGenerator
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -947, init = 987, finish = 0
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -947, init = 987, finish = 0
18/06/20 07:51:57 INFO PythonRunner: Times: total = 41, boot = -944, init = 985, finish = 0
18/06/20 07:51:57 INFO Executor: Finished task 1.0 in stage 56.0 (TID 55). 1536 bytes result sent to driver
18/06/20 07:51:57 INFO TaskSetManager: Finished task 1.0 in stage 56.0 (TID 55) in 52 ms on localhost (executor driver) (1/3)
18/06/20 07:51:57 INFO PythonRunner: Times: total = 45, boot = -944, init = 989, finish = 0
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -32, init = 72, finish = 0
18/06/20 07:51:57 INFO Executor: Finished task 0.0 in stage 56.0 (TID 54). 1536 bytes result sent to driver
18/06/20 07:51:57 INFO TaskSetManager: Finished task 0.0 in stage 56.0 (TID 54) in 56 ms on localhost (executor driver) (2/3)
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -33, init = 73, finish = 0
18/06/20 07:51:57 INFO Executor: Finished task 2.0 in stage 56.0 (TID 56). 1536 bytes result sent to driver
18/06/20 07:51:57 INFO TaskSetManager: Finished task 2.0 in stage 56.0 (TID 56) in 58 ms on localhost (executor driver) (3/3)
18/06/20 07:51:57 INFO TaskSchedulerImpl: Removed TaskSet 56.0, whose tasks have all completed, from pool
18/06/20 07:51:57 INFO DAGScheduler: ResultStage 56 (runJob at PythonRDD.scala:141) finished in 0.063 s
18/06/20 07:51:57 INFO DAGScheduler: Job 28 finished: runJob at PythonRDD.scala:141, took 0.065728 s
-------------------------------------------
Time: 2018-06-20 07:51:57
-------------------------------------------
18/06/20 07:51:57 INFO JobScheduler: Finished job streaming job 1529481117000 ms.0 from job set of time 1529481117000 ms
18/06/20 07:51:57 INFO JobScheduler: Total delay: 0.169 s for time 1529481117000 ms (execution: 0.149 s)
18/06/20 07:51:57 INFO PythonRDD: Removing RDD 102 from persistence list
18/06/20 07:51:57 INFO BlockManager: Removing RDD 102
18/06/20 07:51:57 INFO BlockRDD: Removing RDD 97 from persistence list
18/06/20 07:51:57 INFO KafkaInputDStream: Removing blocks of RDD BlockRDD[97] at createStream at NativeMethodAccessorImpl.java:0 of time 1529481117000 ms
18/06/20 07:51:57 INFO BlockManager: Removing RDD 97
18/06/20 07:51:57 INFO ReceivedBlockTracker: Deleting batches: 1529481115000 ms
18/06/20 07:51:57 INFO InputInfoTracker: remove old batch metadata: 1529481115000 ms
18/06/20 07:51:57 INFO RecurringTimer: Stopped timer for BlockGenerator after time 1529481117400
18/06/20 07:51:57 INFO BlockGenerator: Waiting for block pushing thread to terminate
18/06/20 07:51:57 INFO BlockGenerator: Pushing out the last 0 blocks
18/06/20 07:51:57 INFO BlockGenerator: Stopped block pushing thread
18/06/20 07:51:57 INFO BlockGenerator: Stopped BlockGenerator
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Waiting for receiver to be stopped
18/06/20 07:51:57 ERROR ReceiverSupervisorImpl: Stopped receiver with error: org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
18/06/20 07:51:57 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/06/20 07:51:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/06/20 07:51:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
18/06/20 07:51:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/06/20 07:51:57 INFO TaskSchedulerImpl: Cancelling stage 0
18/06/20 07:51:57 INFO DAGScheduler: ResultStage 0 (start at NativeMethodAccessorImpl.java:0) failed in 13.256 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
18/06/20 07:51:57 ERROR ReceiverTracker: Receiver has been stopped. Try to restart it.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Even within the same VPC, check the security groups of the two systems. If they are in different security groups, you probably need to allow the relevant inbound and outbound ports. Another way to verify connectivity is to telnet and ping each system from the other.
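To check reachability without leaving spark-shell, a plain socket probe against the ZooKeeper port is enough; a minimal sketch, where the host string is a placeholder for the actual HDF cluster address:
import java.net.{InetSocketAddress, Socket}

val sock = new Socket()
try {
  // 10000 ms matches the ZkClient timeout seen in the stack trace above
  sock.connect(new InetSocketAddress("HDF-cluster-ip-address", 2181), 10000)
  println("ZooKeeper port reachable")
} finally {
  sock.close()
}
If the connect call times out, the fix is in the security groups or network ACLs, not in Spark or Kafka.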
I'm running Spark in standalone mode on a single machine. I have an RDD named productUserVector that looks like this:
[("11342",Map(..)),("21435",Map(..)),...]
The number of rows in normalisedVectors (derived from productUserVector below) is 8164. I wanted to get all possible pair combinations between the rows of this RDD and compute a score based on the maps in each row. I used cartesian to get all possible pairs, and I filter them as shown below:
scala> val normalisedVectors = productUserVector.map(line=>utilInst.normaliseVector(line)).sortBy(_._1.toInt)
scala> val combinedRDD = normalisedVectors.cartesian(normalisedVectors).filter(line=>line._1._1.toInt > line._2._1.toInt && utilInst.filterStyleAtp(line._1._1,line._2._1))
scala> val scoresRDD = combinedRDD.map(line=>utilInst.getScore(line)).filter(line=>line._3 > 0)
scala> val finalRDD = scoresRDD.map(line=> (line._1,List((line._2,line._3)))).reduceByKey(_ ++ _)
scala> finalRDD.saveAsTextFile(outputPath)
I have set driver memory to 8 GB and executor memory to 2 GB. Here, utilInst and its functions are used to filter and score the pairs produced by the cartesian of the original RDD. However, the job appears to hang indefinitely, as shown by the logs below:
16/11/17 18:50:14 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/11/17 18:50:14 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/11/17 18:50:14 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/11/17 18:50:14 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/11/17 18:50:14 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/11/17 18:50:31 INFO executor.Executor: Finished task 3.0 in stage 0.0 (TID 3). 1491 bytes result sent to driver
16/11/17 18:50:31 INFO executor.Executor: Finished task 5.0 in stage 0.0 (TID 5). 1491 bytes result sent to driver
16/11/17 18:50:31 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 17339 ms on localhost (1/6)
16/11/17 18:50:31 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 17346 ms on localhost (2/6)
16/11/17 18:50:31 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 1491 bytes result sent to driver
16/11/17 18:50:31 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 17423 ms on localhost (3/6)
16/11/17 18:50:32 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1491 bytes result sent to driver
16/11/17 18:50:32 INFO executor.Executor: Finished task 2.0 in stage 0.0 (TID 2). 1491 bytes result sent to driver
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 18092 ms on localhost (4/6)
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 18063 ms on localhost (5/6)
16/11/17 18:50:32 INFO executor.Executor: Finished task 4.0 in stage 0.0 (TID 4). 1491 bytes result sent to driver
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 18073 ms on localhost (6/6)
16/11/17 18:50:32 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/11/17 18:50:32 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (union at iterateUsers.scala:84) finished in 18.125 s
16/11/17 18:50:32 INFO scheduler.DAGScheduler: looking for newly runnable stages
16/11/17 18:50:32 INFO scheduler.DAGScheduler: running: Set()
16/11/17 18:50:32 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1)
16/11/17 18:50:32 INFO scheduler.DAGScheduler: failed: Set()
16/11/17 18:50:32 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[11] at reduceByKey at iterateUsers.scala:87), which has no missing parents
16/11/17 18:50:32 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.9 KB, free 4.1 GB)
16/11/17 18:50:32 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1819.0 B, free 4.1 GB)
16/11/17 18:50:32 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 127.0.0.1:60497 (size: 1819.0 B, free: 4.1 GB)
16/11/17 18:50:32 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1012
16/11/17 18:50:32 INFO scheduler.DAGScheduler: Submitting 6 missing tasks from ResultStage 1 (ShuffledRDD[11] at reduceByKey at iterateUsers.scala:87)
16/11/17 18:50:32 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 6 tasks
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 6, localhost, partition 0, ANY, 5126 bytes)
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 7, localhost, partition 1, ANY, 5126 bytes)
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 8, localhost, partition 2, ANY, 5126 bytes)
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 1.0 (TID 9, localhost, partition 3, ANY, 5126 bytes)
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 1.0 (TID 10, localhost, partition 4, ANY, 5126 bytes)
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 1.0 (TID 11, localhost, partition 5, ANY, 5126 bytes)
16/11/17 18:50:32 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 6)
16/11/17 18:50:32 INFO executor.Executor: Running task 5.0 in stage 1.0 (TID 11)
16/11/17 18:50:32 INFO executor.Executor: Running task 1.0 in stage 1.0 (TID 7)
16/11/17 18:50:32 INFO executor.Executor: Running task 3.0 in stage 1.0 (TID 9)
16/11/17 18:50:32 INFO executor.Executor: Running task 2.0 in stage 1.0 (TID 8)
16/11/17 18:50:32 INFO executor.Executor: Running task 4.0 in stage 1.0 (TID 10)
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 6 ms
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 6 ms
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms
16/11/17 18:50:32 INFO executor.Executor: Finished task 3.0 in stage 1.0 (TID 9). 1512 bytes result sent to driver
16/11/17 18:50:32 INFO executor.Executor: Finished task 1.0 in stage 1.0 (TID 7). 1512 bytes result sent to driver
16/11/17 18:50:32 INFO executor.Executor: Finished task 4.0 in stage 1.0 (TID 10). 1512 bytes result sent to driver
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 1.0 (TID 9) in 277 ms on localhost (1/6)
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 7) in 283 ms on localhost (2/6)
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 1.0 (TID 10) in 279 ms on localhost (3/6)
16/11/17 18:50:37 INFO executor.Executor: Finished task 2.0 in stage 1.0 (TID 8). 1512 bytes result sent to driver
16/11/17 18:50:37 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 6). 1512 bytes result sent to driver
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 6) in 5120 ms on localhost (4/6)
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 8) in 5114 ms on localhost (5/6)
16/11/17 18:50:37 INFO executor.Executor: Finished task 5.0 in stage 1.0 (TID 11). 1512 bytes result sent to driver
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 1.0 (TID 11) in 5241 ms on localhost (6/6)
16/11/17 18:50:37 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/11/17 18:50:37 INFO scheduler.DAGScheduler: ResultStage 1 (count at iterateUsers.scala:88) finished in 5.254 s
16/11/17 18:50:37 INFO scheduler.DAGScheduler: Job 0 finished: count at iterateUsers.scala:88, took 23.534860 s
8164
16/11/17 18:50:37 INFO rdd.UnionRDD: Removing RDD 10 from persistence list
16/11/17 18:50:37 INFO storage.BlockManager: Removing RDD 10
16/11/17 18:50:37 INFO spark.SparkContext: Starting job: sortBy at iterateUsers.scala:91
16/11/17 18:50:37 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 191 bytes
16/11/17 18:50:37 INFO scheduler.DAGScheduler: Got job 1 (sortBy at iterateUsers.scala:91) with 6 output partitions
16/11/17 18:50:37 INFO scheduler.DAGScheduler: Final stage: ResultStage 3 (sortBy at iterateUsers.scala:91)
16/11/17 18:50:37 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
16/11/17 18:50:37 INFO scheduler.DAGScheduler: Missing parents: List()
16/11/17 18:50:37 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[15] at sortBy at iterateUsers.scala:91), which has no missing parents
16/11/17 18:50:37 INFO memory.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 4.4 KB, free 4.1 GB)
16/11/17 18:50:37 INFO memory.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.5 KB, free 4.1 GB)
16/11/17 18:50:37 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 127.0.0.1:60497 (size: 2.5 KB, free: 4.1 GB)
16/11/17 18:50:37 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1012
16/11/17 18:50:37 INFO scheduler.DAGScheduler: Submitting 6 missing tasks from ResultStage 3 (MapPartitionsRDD[15] at sortBy at iterateUsers.scala:91)
16/11/17 18:50:37 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 6 tasks
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 12, localhost, partition 0, ANY, 5210 bytes)
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 3.0 (TID 13, localhost, partition 1, ANY, 5210 bytes)
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 3.0 (TID 14, localhost, partition 2, ANY, 5210 bytes)
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 3.0 (TID 15, localhost, partition 3, ANY, 5210 bytes)
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 3.0 (TID 16, localhost, partition 4, ANY, 5210 bytes)
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 3.0 (TID 17, localhost, partition 5, ANY, 5210 bytes)
16/11/17 18:50:37 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 12)
16/11/17 18:50:37 INFO executor.Executor: Running task 4.0 in stage 3.0 (TID 16)
16/11/17 18:50:37 INFO executor.Executor: Running task 3.0 in stage 3.0 (TID 15)
16/11/17 18:50:37 INFO executor.Executor: Running task 1.0 in stage 3.0 (TID 13)
16/11/17 18:50:37 INFO executor.Executor: Running task 2.0 in stage 3.0 (TID 14)
16/11/17 18:50:37 INFO executor.Executor: Running task 5.0 in stage 3.0 (TID 17)
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:50:38 INFO executor.Executor: Finished task 5.0 in stage 3.0 (TID 17). 1818 bytes result sent to driver
16/11/17 18:50:38 INFO executor.Executor: Finished task 4.0 in stage 3.0 (TID 16). 1818 bytes result sent to driver
16/11/17 18:50:38 INFO executor.Executor: Finished task 3.0 in stage 3.0 (TID 15). 1728 bytes result sent to driver
16/11/17 18:50:38 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 12). 1724 bytes result sent to driver
16/11/17 18:50:38 INFO executor.Executor: Finished task 2.0 in stage 3.0 (TID 14). 1727 bytes result sent to driver
16/11/17 18:50:38 INFO executor.Executor: Finished task 1.0 in stage 3.0 (TID 13). 1734 bytes result sent to driver
16/11/17 18:50:38 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 3.0 (TID 17) in 117 ms on localhost (1/6)
16/11/17 18:50:38 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 3.0 (TID 16) in 120 ms on localhost (2/6)
16/11/17 18:50:38 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 3.0 (TID 15) in 123 ms on localhost (3/6)
16/11/17 18:50:38 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 12) in 130 ms on localhost (4/6)
16/11/17 18:50:38 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 3.0 (TID 14) in 128 ms on localhost (5/6)
16/11/17 18:50:38 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 3.0 (TID 13) in 130 ms on localhost (6/6)
16/11/17 18:50:38 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
16/11/17 18:50:38 INFO scheduler.DAGScheduler: ResultStage 3 (sortBy at iterateUsers.scala:91) finished in 0.133 s
16/11/17 18:50:38 INFO scheduler.DAGScheduler: Job 1 finished: sortBy at iterateUsers.scala:91, took 0.154474 s
16/11/17 18:50:38 INFO rdd.ShuffledRDD: Removing RDD 11 from persistence list
16/11/17 18:50:38 INFO storage.BlockManager: Removing RDD 11
16/11/17 18:50:44 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on 127.0.0.1:60497 in memory (size: 2.5 KB, free: 4.1 GB)
16/11/17 18:50:44 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on 127.0.0.1:60497 in memory (size: 1819.0 B, free: 4.1 GB)
16/11/17 18:51:37 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on 127.0.0.1:60497 in memory (size: 3.1 KB, free: 4.1 GB)
16/11/17 18:52:48 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/11/17 18:52:48 INFO spark.SparkContext: Starting job: saveAsTextFile at iterateUsers.scala:99
16/11/17 18:52:48 INFO scheduler.DAGScheduler: Registering RDD 13 (sortBy at iterateUsers.scala:91)
16/11/17 18:52:48 INFO scheduler.DAGScheduler: Registering RDD 22 (map at iterateUsers.scala:98)
16/11/17 18:52:48 INFO scheduler.DAGScheduler: Got job 2 (saveAsTextFile at iterateUsers.scala:99) with 36 output partitions
16/11/17 18:52:48 INFO scheduler.DAGScheduler: Final stage: ResultStage 7 (saveAsTextFile at iterateUsers.scala:99)
16/11/17 18:52:48 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 6)
16/11/17 18:52:48 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 6)
16/11/17 18:52:48 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 5 (MapPartitionsRDD[13] at sortBy at iterateUsers.scala:91), which has no missing parents
16/11/17 18:52:50 INFO memory.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 33.5 MB, free 4.1 GB)
16/11/17 18:52:50 INFO memory.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.0 MB, free 4.1 GB)
16/11/17 18:52:50 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on 127.0.0.1:60497 (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:52:50 INFO memory.MemoryStore: Block broadcast_4_piece1 stored as bytes in memory (estimated size 4.0 MB, free 4.1 GB)
16/11/17 18:52:50 INFO storage.BlockManagerInfo: Added broadcast_4_piece1 in memory on 127.0.0.1:60497 (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:52:50 INFO memory.MemoryStore: Block broadcast_4_piece2 stored as bytes in memory (estimated size 4.0 MB, free 4.0 GB)
16/11/17 18:52:50 INFO storage.BlockManagerInfo: Added broadcast_4_piece2 in memory on 127.0.0.1:60497 (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:52:50 INFO memory.MemoryStore: Block broadcast_4_piece3 stored as bytes in memory (estimated size 2.9 MB, free 4.0 GB)
16/11/17 18:52:50 INFO storage.BlockManagerInfo: Added broadcast_4_piece3 in memory on 127.0.0.1:60497 (size: 2.9 MB, free: 4.1 GB)
16/11/17 18:52:50 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1012
16/11/17 18:52:50 INFO scheduler.DAGScheduler: Submitting 6 missing tasks from ShuffleMapStage 5 (MapPartitionsRDD[13] at sortBy at iterateUsers.scala:91)
16/11/17 18:52:50 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 6 tasks
16/11/17 18:52:50 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 5.0 (TID 18, localhost, partition 0, ANY, 5207 bytes)
16/11/17 18:52:50 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 5.0 (TID 19, localhost, partition 1, ANY, 5207 bytes)
16/11/17 18:52:50 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 5.0 (TID 20, localhost, partition 2, ANY, 5207 bytes)
16/11/17 18:52:50 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 5.0 (TID 21, localhost, partition 3, ANY, 5207 bytes)
16/11/17 18:52:50 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 5.0 (TID 22, localhost, partition 4, ANY, 5207 bytes)
16/11/17 18:52:50 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 5.0 (TID 23, localhost, partition 5, ANY, 5207 bytes)
16/11/17 18:52:50 INFO executor.Executor: Running task 0.0 in stage 5.0 (TID 18)
16/11/17 18:52:50 INFO executor.Executor: Running task 1.0 in stage 5.0 (TID 19)
16/11/17 18:52:50 INFO executor.Executor: Running task 2.0 in stage 5.0 (TID 20)
16/11/17 18:52:50 INFO executor.Executor: Running task 3.0 in stage 5.0 (TID 21)
16/11/17 18:52:50 INFO executor.Executor: Running task 4.0 in stage 5.0 (TID 22)
16/11/17 18:52:50 INFO executor.Executor: Running task 5.0 in stage 5.0 (TID 23)
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 2 ms
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/11/17 18:53:02 INFO executor.Executor: Finished task 2.0 in stage 5.0 (TID 20). 1883 bytes result sent to driver
16/11/17 18:53:02 INFO executor.Executor: Finished task 0.0 in stage 5.0 (TID 18). 1883 bytes result sent to driver
16/11/17 18:53:02 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 5.0 (TID 20) in 12006 ms on localhost (1/6)
16/11/17 18:53:02 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 5.0 (TID 18) in 12011 ms on localhost (2/6)
16/11/17 18:53:02 INFO executor.Executor: Finished task 5.0 in stage 5.0 (TID 23). 1883 bytes result sent to driver
16/11/17 18:53:02 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 5.0 (TID 23) in 12019 ms on localhost (3/6)
16/11/17 18:53:02 INFO executor.Executor: Finished task 4.0 in stage 5.0 (TID 22). 1883 bytes result sent to driver
16/11/17 18:53:02 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 5.0 (TID 22) in 12027 ms on localhost (4/6)
16/11/17 18:53:02 INFO executor.Executor: Finished task 3.0 in stage 5.0 (TID 21). 1883 bytes result sent to driver
16/11/17 18:53:02 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 5.0 (TID 21) in 12044 ms on localhost (5/6)
16/11/17 18:53:02 INFO executor.Executor: Finished task 1.0 in stage 5.0 (TID 19). 1883 bytes result sent to driver
16/11/17 18:53:02 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 5.0 (TID 19) in 12059 ms on localhost (6/6)
16/11/17 18:53:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
16/11/17 18:53:02 INFO scheduler.DAGScheduler: ShuffleMapStage 5 (sortBy at iterateUsers.scala:91) finished in 12.061 s
16/11/17 18:53:02 INFO scheduler.DAGScheduler: looking for newly runnable stages
16/11/17 18:53:02 INFO scheduler.DAGScheduler: running: Set()
16/11/17 18:53:02 INFO scheduler.DAGScheduler: waiting: Set(ShuffleMapStage 6, ResultStage 7)
16/11/17 18:53:02 INFO scheduler.DAGScheduler: failed: Set()
16/11/17 18:53:02 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 6 (MapPartitionsRDD[22] at map at iterateUsers.scala:98), which has no missing parents
16/11/17 18:53:05 INFO memory.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 33.5 MB, free 4.0 GB)
16/11/17 18:53:05 INFO memory.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 4.0 MB, free 4.0 GB)
16/11/17 18:53:05 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 127.0.0.1:60497 (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:53:05 INFO memory.MemoryStore: Block broadcast_5_piece1 stored as bytes in memory (estimated size 4.0 MB, free 4.0 GB)
16/11/17 18:53:05 INFO storage.BlockManagerInfo: Added broadcast_5_piece1 in memory on 127.0.0.1:60497 (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:53:05 INFO memory.MemoryStore: Block broadcast_5_piece2 stored as bytes in memory (estimated size 4.0 MB, free 4.0 GB)
16/11/17 18:53:05 INFO storage.BlockManagerInfo: Added broadcast_5_piece2 in memory on 127.0.0.1:60497 (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:53:05 INFO memory.MemoryStore: Block broadcast_5_piece3 stored as bytes in memory (estimated size 2.9 MB, free 4.0 GB)
16/11/17 18:53:05 INFO storage.BlockManagerInfo: Added broadcast_5_piece3 in memory on 127.0.0.1:60497 (size: 2.9 MB, free: 4.1 GB)
16/11/17 18:53:05 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1012
16/11/17 18:53:05 INFO scheduler.DAGScheduler: Submitting 36 missing tasks from ShuffleMapStage 6 (MapPartitionsRDD[22] at map at iterateUsers.scala:98)
16/11/17 18:53:05 INFO scheduler.TaskSchedulerImpl: Adding task set 6.0 with 36 tasks
16/11/17 18:53:05 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 6.0 (TID 24, localhost, partition 0, ANY, 5411 bytes)
16/11/17 18:53:05 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 6.0 (TID 25, localhost, partition 1, ANY, 5420 bytes)
16/11/17 18:53:05 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 6.0 (TID 26, localhost, partition 2, ANY, 5420 bytes)
16/11/17 18:53:05 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 6.0 (TID 27, localhost, partition 3, ANY, 5420 bytes)
16/11/17 18:53:05 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 6.0 (TID 28, localhost, partition 4, ANY, 5420 bytes)
16/11/17 18:53:05 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 6.0 (TID 29, localhost, partition 5, ANY, 5420 bytes)
16/11/17 18:53:05 INFO scheduler.TaskSetManager: Starting task 6.0 in stage 6.0 (TID 30, localhost, partition 6, ANY, 5420 bytes)
16/11/17 18:53:05 INFO scheduler.TaskSetManager: Starting task 7.0 in stage 6.0 (TID 31, localhost, partition 7, ANY, 5411 bytes)
16/11/17 18:53:05 INFO executor.Executor: Running task 1.0 in stage 6.0 (TID 25)
16/11/17 18:53:05 INFO executor.Executor: Running task 0.0 in stage 6.0 (TID 24)
16/11/17 18:53:05 INFO executor.Executor: Running task 4.0 in stage 6.0 (TID 28)
16/11/17 18:53:05 INFO executor.Executor: Running task 2.0 in stage 6.0 (TID 26)
16/11/17 18:53:05 INFO executor.Executor: Running task 3.0 in stage 6.0 (TID 27)
16/11/17 18:53:05 INFO executor.Executor: Running task 5.0 in stage 6.0 (TID 29)
16/11/17 18:53:05 INFO executor.Executor: Running task 6.0 in stage 6.0 (TID 30)
16/11/17 18:53:05 INFO executor.Executor: Running task 7.0 in stage 6.0 (TID 31)
16/11/17 18:53:13 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on 127.0.0.1:60497 in memory (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:53:13 INFO storage.BlockManagerInfo: Removed broadcast_4_piece3 on 127.0.0.1:60497 in memory (size: 2.9 MB, free: 4.1 GB)
16/11/17 18:53:13 INFO storage.BlockManagerInfo: Removed broadcast_4_piece2 on 127.0.0.1:60497 in memory (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:53:13 INFO storage.BlockManagerInfo: Removed broadcast_4_piece1 on 127.0.0.1:60497 in memory (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
It gets stuck endlessly in the last storage.ShuffleBlockFetcherIterator phase while saving finalRDD as a text file. I have no idea why this is happening; any help resolving it is highly appreciated.
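For scale: 8164 rows yield 8164 × 8164 ≈ 66.7 million pairs from cartesian, roughly half of which survive the toInt ordering filter, so the shuffle feeding reduceByKey is large even though the input is small; a long silent ShuffleBlockFetcherIterator phase is often this work grinding (or spilling) rather than a true infinite loop. A hedged sketch of two adjustments that tend to help, not a guaranteed fix (the partition count is illustrative):
import org.apache.spark.storage.StorageLevel

// cache the normalised RDD so cartesian does not recompute it for every pair
val normalisedVectors = productUserVector
  .map(line => utilInst.normaliseVector(line))
  .sortBy(_._1.toInt)
  .persist(StorageLevel.MEMORY_AND_DISK)

// spread the surviving pairs over more, smaller tasks before scoring
val combinedRDD = normalisedVectors
  .cartesian(normalisedVectors)
  .filter(line => line._1._1.toInt > line._2._1.toInt && utilInst.filterStyleAtp(line._1._1, line._2._1))
  .repartition(96)
Replacing reduceByKey(_ ++ _) on Lists with groupByKey, or with aggregateByKey over a mutable buffer, also avoids repeated list concatenation in the final stage.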
I have created a Spark UDF. When I run it in spark-shell it works perfectly fine, but when I register it and use it in my Spark SQL query, it throws a NullPointerException.
scala> test_proc("1605","(#supp In (-1,118)")
16/03/07 10:35:04 INFO TaskSetManager: Finished task 0.0 in stage 21.0 (TID 220) in 62 ms on cdts1hdpdn01d.rxcorp.com (1/1)
16/03/07 10:35:04 INFO YarnScheduler: Removed TaskSet 21.0, whose tasks have all completed, from pool
16/03/07 10:35:04 INFO DAGScheduler: ResultStage 21 (first at <console>:45) finished in 0.062 s
16/03/07 10:35:04 INFO DAGScheduler: Job 16 finished: first at <console>:45, took 2.406408 s
res14: Int = 1
scala>
But when I register it and use it in my Spark SQL query, it gives an NPE.
scala> sqlContext.udf.register("store_proc", test_proc _)
scala> hiveContext.sql("select store_proc('1605' , '(#supp In (-1,118)')").first.getInt(0)
16/03/07 10:37:58 INFO ParseDriver: Parsing command: select store_proc('1605' , '(#supp In (-1,118)')
16/03/07 10:37:58 INFO ParseDriver: Parse Completed
16/03/07 10:37:58 INFO SparkContext: Starting job: first at <console>:24
16/03/07 10:37:58 INFO DAGScheduler: Got job 17 (first at <console>:24) with 1 output partitions
16/03/07 10:37:58 INFO DAGScheduler: Final stage: ResultStage 22 (first at <console>:24)
16/03/07 10:37:58 INFO DAGScheduler: Parents of final stage: List()
16/03/07 10:37:58 INFO DAGScheduler: Missing parents: List()
16/03/07 10:37:58 INFO DAGScheduler: Submitting ResultStage 22 (MapPartitionsRDD[86] at first at <console>:24), which has no missing parents
16/03/07 10:37:58 INFO MemoryStore: ensureFreeSpace(10520) called with curMem=1472899, maxMem=2222739947
16/03/07 10:37:58 INFO MemoryStore: Block broadcast_30 stored as values in memory (estimated size 10.3 KB, free 2.1 GB)
16/03/07 10:37:58 INFO MemoryStore: ensureFreeSpace(4774) called with curMem=1483419, maxMem=2222739947
16/03/07 10:37:58 INFO MemoryStore: Block broadcast_30_piece0 stored as bytes in memory (estimated size 4.7 KB, free 2.1 GB)
16/03/07 10:37:58 INFO BlockManagerInfo: Added broadcast_30_piece0 in memory on 162.44.214.87:47564 (size: 4.7 KB, free: 2.1 GB)
16/03/07 10:37:58 INFO SparkContext: Created broadcast 30 from broadcast at DAGScheduler.scala:861
16/03/07 10:37:58 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 22 (MapPartitionsRDD[86] at first at <console>:24)
16/03/07 10:37:58 INFO YarnScheduler: Adding task set 22.0 with 1 tasks
16/03/07 10:37:58 INFO TaskSetManager: Starting task 0.0 in stage 22.0 (TID 221, cdts1hdpdn02d.rxcorp.com, partition 0,PROCESS_LOCAL, 2155 bytes)
16/03/07 10:37:58 INFO BlockManagerInfo: Added broadcast_30_piece0 in memory on cdts1hdpdn02d.rxcorp.com:33678 (size: 4.7 KB, free: 6.7 GB)
16/03/07 10:37:58 WARN TaskSetManager: Lost task 0.0 in stage 22.0 (TID 221, cdts1hdpdn02d.rxcorp.com): java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveContext.parseSql(HiveContext.scala:291)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
at $line20.$read$iwC$iwC$iwC$iwC$iwC$iwC$iwC$iwC.test_proc(<console>:41)
This is a sample of my 'test_proc':
def test_proc(x:String, y:String):Int = {
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val z:Int = hiveContext.sql("select 7").first.getInt(0)
return z
}
Based on the output from the standalone call, it looks like test_proc is executing some kind of Spark action, and this cannot work inside a UDF because Spark doesn't support nested operations on distributed data structures. If test_proc uses a SQLContext, this will result in an NPE, since Spark contexts exist only on the driver.
If that's the case, you'll have to restructure your code to achieve the desired effect, either using local (most likely broadcast) variables or joins.
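A minimal sketch of that restructuring, under the assumption that the looked-up value can be computed once up front: run the query on the driver, then let the registered function reference only the local (or broadcast) result, never a context:
// runs on the driver, where hiveContext is valid
val z: Int = hiveContext.sql("select 7").first.getInt(0)
val zBc = sc.broadcast(z)

// the UDF body touches only the broadcast value; x and y are kept for the signature
sqlContext.udf.register("store_proc", (x: String, y: String) => zBc.value)
hiveContext.sql("select store_proc('1605' , '(#supp In (-1,118)')").first.getInt(0)
If the value genuinely depends on the arguments, precompute a lookup table on the driver and broadcast that, or express the logic as a join instead.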
I have a cluster set up with two slaves and one master, and I submit a jar (Scala) to the Spark master (192.168.1.64):
spark-submit --master spark://spark-master:7077 --class tests.elements target/scala-2.10/zzz-project_2.10-1.0.jar
After running just fine for quite some time, it stops abruptly, with the last lines on the terminal being:
...
15/08/19 17:45:24 INFO scheduler.TaskSchedulerImpl: Adding task set 411292.0 with 6 tasks
15/08/19 17:45:24 WARN scheduler.TaskSetManager: Stage 411292 contains a task of very large size (2762 KB). The maximum recommended task size is 100 KB.
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 411292.0 (TID 1832, 192.168.1.64, PROCESS_LOCAL, 2828792 bytes)
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 411292.0 (TID 1833, 192.168.1.62, PROCESS_LOCAL, 2310009 bytes)
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 411292.0 (TID 1834, 192.168.1.64, PROCESS_LOCAL, 2669188 bytes)
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 411292.0 (TID 1835, 192.168.1.62, PROCESS_LOCAL, 2295676 bytes)
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 411292.0 (TID 1836, 192.168.1.64, PROCESS_LOCAL, 2847786 bytes)
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 411292.0 (TID 1837, 192.168.1.64, PROCESS_LOCAL, 2913528 bytes)
Killed
and the error that appears in the master log is the following:
...
15/08/19 16:09:49 INFO master.Master: Launching executor app-20150819160949-0001/0 on worker worker-20150819160925-192.168.1.64-51640
15/08/19 16:09:49 INFO master.Master: Launching executor app-20150819160949-0001/1 on worker worker-20150819160938-192.168.1.62-38007
15/08/19 16:15:44 INFO master.Master: akka.tcp://sparkDriver@192.168.1.64:46823 got disassociated, removing it.
15/08/19 16:15:44 INFO master.Master: Removing app app-20150819160949-0001
15/08/19 16:15:44 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.1.64:46823] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/08/19 16:15:44 WARN master.Master: Application testPageRank is still in progress, it may be terminated abnormally.
...
Both workers have something like this in their logs:
...
15/08/19 16:15:49 INFO worker.Worker: Executor app-20150819160949-0001/0 finished with state EXITED message Command exited with code 1 exitStatus 1
15/08/19 16:15:50 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@192.168.1.64:54799] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
and
...
15/08/19 16:15:43 INFO worker.Worker: Executor app-20150819160949-0001/1 finished with state EXITED message Command exited with code 1 exitStatus 1
15/08/19 16:15:43 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@192.168.1.62:53325] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
respectively. The work/app files contain something like this
...
15/08/19 16:15:41 INFO executor.Executor: Finished task 1.0 in stage 387758.0 (TID 1803). 1911 bytes result sent to driver
15/08/19 16:15:41 INFO executor.Executor: Finished task 4.0 in stage 387758.0 (TID 1806). 1911 bytes result sent to driver
15/08/19 16:15:41 INFO storage.BlockManager: Found block rdd_1206_5 locally
15/08/19 16:15:41 INFO executor.Executor: Finished task 5.0 in stage 387758.0 (TID 1807). 1911 bytes result sent to driver
15/08/19 16:15:41 INFO storage.BlockManager: Found block rdd_1206_3 locally
15/08/19 16:15:41 INFO executor.Executor: Finished task 3.0 in stage 387758.0 (TID 1805). 1911 bytes result sent to driver
15/08/19 16:15:44 ERROR executor.CoarseGrainedExecutorBackend: Driver 192.168.1.64:46823 disassociated! Shutting down.
15/08/19 16:15:44 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.1.64:46823] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/08/19 16:15:45 INFO storage.DiskBlockManager: Shutdown hook called
15/08/19 16:15:46 INFO util.Utils: Shutdown hook called
and
...
15/08/19 16:15:41 INFO storage.BlockManager: Found block rdd_1206_0 locally
15/08/19 16:15:41 INFO executor.Executor: Finished task 2.0 in stage 387758.0 (TID 1804). 1911 bytes result sent to driver
15/08/19 16:15:41 INFO executor.Executor: Finished task 0.0 in stage 387758.0 (TID 1802). 1911 bytes result sent to driver
15/08/19 16:15:42 ERROR executor.CoarseGrainedExecutorBackend: Driver 192.168.1.64:46823 disassociated! Shutting down.
15/08/19 16:15:42 INFO storage.DiskBlockManager: Shutdown hook called
15/08/19 16:15:42 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.1.64:46823] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/08/19 16:15:42 INFO util.Utils: Shutdown hook called
respectively. There seems to be no other error in HDFS or Spark.
I suspect the cause is the third line of the master log (15/08/19 16:15:44 INFO master.Master: akka.tcp://sparkDriver@192.168.1.64:46823 got disassociated, removing it.), but I can't figure out why. I tried changing spark.akka.heartbeat.interval to 100 as suggested in some posts, but with no luck. Does anyone know why this happens and how to solve it? Thanks so much.
As mentioned in a very similar question here (WARN ReliableDeliverySupervisor: Association with remote system has failed, address is now gated for [5000] ms. Reason: [Disassociated]), the problem is likely a lack of memory. Adding more memory (or, in that case, more nodes) should solve the problem.
(Alternatively, making the job need less memory should of course work too.)
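For this setup, that would mean raising the limits on the submit command; the sizes below are illustrative and have to fit within what each machine actually has:
spark-submit --master spark://spark-master:7077 --class tests.elements --driver-memory 4g --executor-memory 4g target/scala-2.10/zzz-project_2.10-1.0.jar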