I'm running Spark in standalone mode on a single machine. I have an RDD named productUserVectors that looks like this:
[("11342",Map(..)),("21435",Map(..)),...]
The number of rows in normalisedVectors (derived below) is 8164. I want to get all possible pair combinations between the rows of this RDD and compute a score based on the maps in each row. I used cartesian to get all possible pairs, and I'm filtering them as shown below:
scala> val normalisedVectors = productUserVectors.map(line => utilInst.normaliseVector(line)).sortBy(_._1.toInt)
scala> val combinedRDD = normalisedVectors.cartesian(normalisedVectors).filter(line => line._1._1.toInt > line._2._1.toInt && utilInst.filterStyleAtp(line._1._1, line._2._1)) // keep each unordered pair once
scala> val scoresRDD = combinedRDD.map(line => utilInst.getScore(line)).filter(line => line._3 > 0) // drop non-positive scores
scala> val finalRDD = scoresRDD.map(line => (line._1, List((line._2, line._3)))).reduceByKey(_ ++ _) // group (id, score) pairs by key
scala> finalRDD.saveAsTextFile(outputPath)
I have set driver memory to 8 GB and executor memory to 2 GB. Here, utilInst and its functions are used to filter and score the pairs produced by the cartesian of the original RDD. However, the job appears to hang endlessly, as shown by the logs below:
16/11/17 18:50:14 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/11/17 18:50:14 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/11/17 18:50:14 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/11/17 18:50:14 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/11/17 18:50:14 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/11/17 18:50:31 INFO executor.Executor: Finished task 3.0 in stage 0.0 (TID 3). 1491 bytes result sent to driver
16/11/17 18:50:31 INFO executor.Executor: Finished task 5.0 in stage 0.0 (TID 5). 1491 bytes result sent to driver
16/11/17 18:50:31 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 17339 ms on localhost (1/6)
16/11/17 18:50:31 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 17346 ms on localhost (2/6)
16/11/17 18:50:31 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 1491 bytes result sent to driver
16/11/17 18:50:31 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 17423 ms on localhost (3/6)
16/11/17 18:50:32 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1491 bytes result sent to driver
16/11/17 18:50:32 INFO executor.Executor: Finished task 2.0 in stage 0.0 (TID 2). 1491 bytes result sent to driver
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 18092 ms on localhost (4/6)
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 18063 ms on localhost (5/6)
16/11/17 18:50:32 INFO executor.Executor: Finished task 4.0 in stage 0.0 (TID 4). 1491 bytes result sent to driver
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 18073 ms on localhost (6/6)
16/11/17 18:50:32 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/11/17 18:50:32 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (union at iterateUsers.scala:84) finished in 18.125 s
16/11/17 18:50:32 INFO scheduler.DAGScheduler: looking for newly runnable stages
16/11/17 18:50:32 INFO scheduler.DAGScheduler: running: Set()
16/11/17 18:50:32 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1)
16/11/17 18:50:32 INFO scheduler.DAGScheduler: failed: Set()
16/11/17 18:50:32 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[11] at reduceByKey at iterateUsers.scala:87), which has no missing parents
16/11/17 18:50:32 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.9 KB, free 4.1 GB)
16/11/17 18:50:32 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1819.0 B, free 4.1 GB)
16/11/17 18:50:32 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 127.0.0.1:60497 (size: 1819.0 B, free: 4.1 GB)
16/11/17 18:50:32 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1012
16/11/17 18:50:32 INFO scheduler.DAGScheduler: Submitting 6 missing tasks from ResultStage 1 (ShuffledRDD[11] at reduceByKey at iterateUsers.scala:87)
16/11/17 18:50:32 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 6 tasks
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 6, localhost, partition 0, ANY, 5126 bytes)
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 7, localhost, partition 1, ANY, 5126 bytes)
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 1.0 (TID 8, localhost, partition 2, ANY, 5126 bytes)
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 1.0 (TID 9, localhost, partition 3, ANY, 5126 bytes)
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 1.0 (TID 10, localhost, partition 4, ANY, 5126 bytes)
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 1.0 (TID 11, localhost, partition 5, ANY, 5126 bytes)
16/11/17 18:50:32 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 6)
16/11/17 18:50:32 INFO executor.Executor: Running task 5.0 in stage 1.0 (TID 11)
16/11/17 18:50:32 INFO executor.Executor: Running task 1.0 in stage 1.0 (TID 7)
16/11/17 18:50:32 INFO executor.Executor: Running task 3.0 in stage 1.0 (TID 9)
16/11/17 18:50:32 INFO executor.Executor: Running task 2.0 in stage 1.0 (TID 8)
16/11/17 18:50:32 INFO executor.Executor: Running task 4.0 in stage 1.0 (TID 10)
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 6 ms
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 6 ms
16/11/17 18:50:32 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 5 ms
16/11/17 18:50:32 INFO executor.Executor: Finished task 3.0 in stage 1.0 (TID 9). 1512 bytes result sent to driver
16/11/17 18:50:32 INFO executor.Executor: Finished task 1.0 in stage 1.0 (TID 7). 1512 bytes result sent to driver
16/11/17 18:50:32 INFO executor.Executor: Finished task 4.0 in stage 1.0 (TID 10). 1512 bytes result sent to driver
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 1.0 (TID 9) in 277 ms on localhost (1/6)
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 7) in 283 ms on localhost (2/6)
16/11/17 18:50:32 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 1.0 (TID 10) in 279 ms on localhost (3/6)
16/11/17 18:50:37 INFO executor.Executor: Finished task 2.0 in stage 1.0 (TID 8). 1512 bytes result sent to driver
16/11/17 18:50:37 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 6). 1512 bytes result sent to driver
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 6) in 5120 ms on localhost (4/6)
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 1.0 (TID 8) in 5114 ms on localhost (5/6)
16/11/17 18:50:37 INFO executor.Executor: Finished task 5.0 in stage 1.0 (TID 11). 1512 bytes result sent to driver
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 1.0 (TID 11) in 5241 ms on localhost (6/6)
16/11/17 18:50:37 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/11/17 18:50:37 INFO scheduler.DAGScheduler: ResultStage 1 (count at iterateUsers.scala:88) finished in 5.254 s
16/11/17 18:50:37 INFO scheduler.DAGScheduler: Job 0 finished: count at iterateUsers.scala:88, took 23.534860 s
8164
16/11/17 18:50:37 INFO rdd.UnionRDD: Removing RDD 10 from persistence list
16/11/17 18:50:37 INFO storage.BlockManager: Removing RDD 10
16/11/17 18:50:37 INFO spark.SparkContext: Starting job: sortBy at iterateUsers.scala:91
16/11/17 18:50:37 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 191 bytes
16/11/17 18:50:37 INFO scheduler.DAGScheduler: Got job 1 (sortBy at iterateUsers.scala:91) with 6 output partitions
16/11/17 18:50:37 INFO scheduler.DAGScheduler: Final stage: ResultStage 3 (sortBy at iterateUsers.scala:91)
16/11/17 18:50:37 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 2)
16/11/17 18:50:37 INFO scheduler.DAGScheduler: Missing parents: List()
16/11/17 18:50:37 INFO scheduler.DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[15] at sortBy at iterateUsers.scala:91), which has no missing parents
16/11/17 18:50:37 INFO memory.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 4.4 KB, free 4.1 GB)
16/11/17 18:50:37 INFO memory.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.5 KB, free 4.1 GB)
16/11/17 18:50:37 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 127.0.0.1:60497 (size: 2.5 KB, free: 4.1 GB)
16/11/17 18:50:37 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1012
16/11/17 18:50:37 INFO scheduler.DAGScheduler: Submitting 6 missing tasks from ResultStage 3 (MapPartitionsRDD[15] at sortBy at iterateUsers.scala:91)
16/11/17 18:50:37 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 6 tasks
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 12, localhost, partition 0, ANY, 5210 bytes)
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 3.0 (TID 13, localhost, partition 1, ANY, 5210 bytes)
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 3.0 (TID 14, localhost, partition 2, ANY, 5210 bytes)
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 3.0 (TID 15, localhost, partition 3, ANY, 5210 bytes)
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 3.0 (TID 16, localhost, partition 4, ANY, 5210 bytes)
16/11/17 18:50:37 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 3.0 (TID 17, localhost, partition 5, ANY, 5210 bytes)
16/11/17 18:50:37 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 12)
16/11/17 18:50:37 INFO executor.Executor: Running task 4.0 in stage 3.0 (TID 16)
16/11/17 18:50:37 INFO executor.Executor: Running task 3.0 in stage 3.0 (TID 15)
16/11/17 18:50:37 INFO executor.Executor: Running task 1.0 in stage 3.0 (TID 13)
16/11/17 18:50:37 INFO executor.Executor: Running task 2.0 in stage 3.0 (TID 14)
16/11/17 18:50:37 INFO executor.Executor: Running task 5.0 in stage 3.0 (TID 17)
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:50:37 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:50:38 INFO executor.Executor: Finished task 5.0 in stage 3.0 (TID 17). 1818 bytes result sent to driver
16/11/17 18:50:38 INFO executor.Executor: Finished task 4.0 in stage 3.0 (TID 16). 1818 bytes result sent to driver
16/11/17 18:50:38 INFO executor.Executor: Finished task 3.0 in stage 3.0 (TID 15). 1728 bytes result sent to driver
16/11/17 18:50:38 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 12). 1724 bytes result sent to driver
16/11/17 18:50:38 INFO executor.Executor: Finished task 2.0 in stage 3.0 (TID 14). 1727 bytes result sent to driver
16/11/17 18:50:38 INFO executor.Executor: Finished task 1.0 in stage 3.0 (TID 13). 1734 bytes result sent to driver
16/11/17 18:50:38 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 3.0 (TID 17) in 117 ms on localhost (1/6)
16/11/17 18:50:38 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 3.0 (TID 16) in 120 ms on localhost (2/6)
16/11/17 18:50:38 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 3.0 (TID 15) in 123 ms on localhost (3/6)
16/11/17 18:50:38 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 12) in 130 ms on localhost (4/6)
16/11/17 18:50:38 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 3.0 (TID 14) in 128 ms on localhost (5/6)
16/11/17 18:50:38 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 3.0 (TID 13) in 130 ms on localhost (6/6)
16/11/17 18:50:38 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
16/11/17 18:50:38 INFO scheduler.DAGScheduler: ResultStage 3 (sortBy at iterateUsers.scala:91) finished in 0.133 s
16/11/17 18:50:38 INFO scheduler.DAGScheduler: Job 1 finished: sortBy at iterateUsers.scala:91, took 0.154474 s
16/11/17 18:50:38 INFO rdd.ShuffledRDD: Removing RDD 11 from persistence list
16/11/17 18:50:38 INFO storage.BlockManager: Removing RDD 11
16/11/17 18:50:44 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on 127.0.0.1:60497 in memory (size: 2.5 KB, free: 4.1 GB)
16/11/17 18:50:44 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on 127.0.0.1:60497 in memory (size: 1819.0 B, free: 4.1 GB)
16/11/17 18:51:37 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on 127.0.0.1:60497 in memory (size: 3.1 KB, free: 4.1 GB)
16/11/17 18:52:48 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
16/11/17 18:52:48 INFO spark.SparkContext: Starting job: saveAsTextFile at iterateUsers.scala:99
16/11/17 18:52:48 INFO scheduler.DAGScheduler: Registering RDD 13 (sortBy at iterateUsers.scala:91)
16/11/17 18:52:48 INFO scheduler.DAGScheduler: Registering RDD 22 (map at iterateUsers.scala:98)
16/11/17 18:52:48 INFO scheduler.DAGScheduler: Got job 2 (saveAsTextFile at iterateUsers.scala:99) with 36 output partitions
16/11/17 18:52:48 INFO scheduler.DAGScheduler: Final stage: ResultStage 7 (saveAsTextFile at iterateUsers.scala:99)
16/11/17 18:52:48 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 6)
16/11/17 18:52:48 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 6)
16/11/17 18:52:48 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 5 (MapPartitionsRDD[13] at sortBy at iterateUsers.scala:91), which has no missing parents
16/11/17 18:52:50 INFO memory.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 33.5 MB, free 4.1 GB)
16/11/17 18:52:50 INFO memory.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.0 MB, free 4.1 GB)
16/11/17 18:52:50 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on 127.0.0.1:60497 (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:52:50 INFO memory.MemoryStore: Block broadcast_4_piece1 stored as bytes in memory (estimated size 4.0 MB, free 4.1 GB)
16/11/17 18:52:50 INFO storage.BlockManagerInfo: Added broadcast_4_piece1 in memory on 127.0.0.1:60497 (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:52:50 INFO memory.MemoryStore: Block broadcast_4_piece2 stored as bytes in memory (estimated size 4.0 MB, free 4.0 GB)
16/11/17 18:52:50 INFO storage.BlockManagerInfo: Added broadcast_4_piece2 in memory on 127.0.0.1:60497 (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:52:50 INFO memory.MemoryStore: Block broadcast_4_piece3 stored as bytes in memory (estimated size 2.9 MB, free 4.0 GB)
16/11/17 18:52:50 INFO storage.BlockManagerInfo: Added broadcast_4_piece3 in memory on 127.0.0.1:60497 (size: 2.9 MB, free: 4.1 GB)
16/11/17 18:52:50 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1012
16/11/17 18:52:50 INFO scheduler.DAGScheduler: Submitting 6 missing tasks from ShuffleMapStage 5 (MapPartitionsRDD[13] at sortBy at iterateUsers.scala:91)
16/11/17 18:52:50 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 6 tasks
16/11/17 18:52:50 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 5.0 (TID 18, localhost, partition 0, ANY, 5207 bytes)
16/11/17 18:52:50 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 5.0 (TID 19, localhost, partition 1, ANY, 5207 bytes)
16/11/17 18:52:50 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 5.0 (TID 20, localhost, partition 2, ANY, 5207 bytes)
16/11/17 18:52:50 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 5.0 (TID 21, localhost, partition 3, ANY, 5207 bytes)
16/11/17 18:52:50 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 5.0 (TID 22, localhost, partition 4, ANY, 5207 bytes)
16/11/17 18:52:50 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 5.0 (TID 23, localhost, partition 5, ANY, 5207 bytes)
16/11/17 18:52:50 INFO executor.Executor: Running task 0.0 in stage 5.0 (TID 18)
16/11/17 18:52:50 INFO executor.Executor: Running task 1.0 in stage 5.0 (TID 19)
16/11/17 18:52:50 INFO executor.Executor: Running task 2.0 in stage 5.0 (TID 20)
16/11/17 18:52:50 INFO executor.Executor: Running task 3.0 in stage 5.0 (TID 21)
16/11/17 18:52:50 INFO executor.Executor: Running task 4.0 in stage 5.0 (TID 22)
16/11/17 18:52:50 INFO executor.Executor: Running task 5.0 in stage 5.0 (TID 23)
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 2 ms
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:02 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/11/17 18:53:02 INFO executor.Executor: Finished task 2.0 in stage 5.0 (TID 20). 1883 bytes result sent to driver
16/11/17 18:53:02 INFO executor.Executor: Finished task 0.0 in stage 5.0 (TID 18). 1883 bytes result sent to driver
16/11/17 18:53:02 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 5.0 (TID 20) in 12006 ms on localhost (1/6)
16/11/17 18:53:02 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 5.0 (TID 18) in 12011 ms on localhost (2/6)
16/11/17 18:53:02 INFO executor.Executor: Finished task 5.0 in stage 5.0 (TID 23). 1883 bytes result sent to driver
16/11/17 18:53:02 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 5.0 (TID 23) in 12019 ms on localhost (3/6)
16/11/17 18:53:02 INFO executor.Executor: Finished task 4.0 in stage 5.0 (TID 22). 1883 bytes result sent to driver
16/11/17 18:53:02 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 5.0 (TID 22) in 12027 ms on localhost (4/6)
16/11/17 18:53:02 INFO executor.Executor: Finished task 3.0 in stage 5.0 (TID 21). 1883 bytes result sent to driver
16/11/17 18:53:02 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 5.0 (TID 21) in 12044 ms on localhost (5/6)
16/11/17 18:53:02 INFO executor.Executor: Finished task 1.0 in stage 5.0 (TID 19). 1883 bytes result sent to driver
16/11/17 18:53:02 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 5.0 (TID 19) in 12059 ms on localhost (6/6)
16/11/17 18:53:02 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
16/11/17 18:53:02 INFO scheduler.DAGScheduler: ShuffleMapStage 5 (sortBy at iterateUsers.scala:91) finished in 12.061 s
16/11/17 18:53:02 INFO scheduler.DAGScheduler: looking for newly runnable stages
16/11/17 18:53:02 INFO scheduler.DAGScheduler: running: Set()
16/11/17 18:53:02 INFO scheduler.DAGScheduler: waiting: Set(ShuffleMapStage 6, ResultStage 7)
16/11/17 18:53:02 INFO scheduler.DAGScheduler: failed: Set()
16/11/17 18:53:02 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 6 (MapPartitionsRDD[22] at map at iterateUsers.scala:98), which has no missing parents
16/11/17 18:53:05 INFO memory.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 33.5 MB, free 4.0 GB)
16/11/17 18:53:05 INFO memory.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 4.0 MB, free 4.0 GB)
16/11/17 18:53:05 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on 127.0.0.1:60497 (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:53:05 INFO memory.MemoryStore: Block broadcast_5_piece1 stored as bytes in memory (estimated size 4.0 MB, free 4.0 GB)
16/11/17 18:53:05 INFO storage.BlockManagerInfo: Added broadcast_5_piece1 in memory on 127.0.0.1:60497 (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:53:05 INFO memory.MemoryStore: Block broadcast_5_piece2 stored as bytes in memory (estimated size 4.0 MB, free 4.0 GB)
16/11/17 18:53:05 INFO storage.BlockManagerInfo: Added broadcast_5_piece2 in memory on 127.0.0.1:60497 (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:53:05 INFO memory.MemoryStore: Block broadcast_5_piece3 stored as bytes in memory (estimated size 2.9 MB, free 4.0 GB)
16/11/17 18:53:05 INFO storage.BlockManagerInfo: Added broadcast_5_piece3 in memory on 127.0.0.1:60497 (size: 2.9 MB, free: 4.1 GB)
16/11/17 18:53:05 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1012
16/11/17 18:53:05 INFO scheduler.DAGScheduler: Submitting 36 missing tasks from ShuffleMapStage 6 (MapPartitionsRDD[22] at map at iterateUsers.scala:98)
16/11/17 18:53:05 INFO scheduler.TaskSchedulerImpl: Adding task set 6.0 with 36 tasks
16/11/17 18:53:05 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 6.0 (TID 24, localhost, partition 0, ANY, 5411 bytes)
16/11/17 18:53:05 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 6.0 (TID 25, localhost, partition 1, ANY, 5420 bytes)
16/11/17 18:53:05 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 6.0 (TID 26, localhost, partition 2, ANY, 5420 bytes)
16/11/17 18:53:05 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 6.0 (TID 27, localhost, partition 3, ANY, 5420 bytes)
16/11/17 18:53:05 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 6.0 (TID 28, localhost, partition 4, ANY, 5420 bytes)
16/11/17 18:53:05 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 6.0 (TID 29, localhost, partition 5, ANY, 5420 bytes)
16/11/17 18:53:05 INFO scheduler.TaskSetManager: Starting task 6.0 in stage 6.0 (TID 30, localhost, partition 6, ANY, 5420 bytes)
16/11/17 18:53:05 INFO scheduler.TaskSetManager: Starting task 7.0 in stage 6.0 (TID 31, localhost, partition 7, ANY, 5411 bytes)
16/11/17 18:53:05 INFO executor.Executor: Running task 1.0 in stage 6.0 (TID 25)
16/11/17 18:53:05 INFO executor.Executor: Running task 0.0 in stage 6.0 (TID 24)
16/11/17 18:53:05 INFO executor.Executor: Running task 4.0 in stage 6.0 (TID 28)
16/11/17 18:53:05 INFO executor.Executor: Running task 2.0 in stage 6.0 (TID 26)
16/11/17 18:53:05 INFO executor.Executor: Running task 3.0 in stage 6.0 (TID 27)
16/11/17 18:53:05 INFO executor.Executor: Running task 5.0 in stage 6.0 (TID 29)
16/11/17 18:53:05 INFO executor.Executor: Running task 6.0 in stage 6.0 (TID 30)
16/11/17 18:53:05 INFO executor.Executor: Running task 7.0 in stage 6.0 (TID 31)
16/11/17 18:53:13 INFO storage.BlockManagerInfo: Removed broadcast_4_piece0 on 127.0.0.1:60497 in memory (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:53:13 INFO storage.BlockManagerInfo: Removed broadcast_4_piece3 on 127.0.0.1:60497 in memory (size: 2.9 MB, free: 4.1 GB)
16/11/17 18:53:13 INFO storage.BlockManagerInfo: Removed broadcast_4_piece2 on 127.0.0.1:60497 in memory (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:53:13 INFO storage.BlockManagerInfo: Removed broadcast_4_piece1 on 127.0.0.1:60497 in memory (size: 4.0 MB, free: 4.1 GB)
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/11/17 18:53:30 INFO storage.ShuffleBlockFetcherIterator: Getting 6 non-empty blocks out of 6 blocks
It gets stuck endlessly at that last storage.ShuffleBlockFetcherIterator line while saving finalRDD to a text file. I have no idea why this is happening; any help resolving it would be highly appreciated.
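For reference, here is a variant of the same pipeline I could try, with an explicit repartition of the filtered cartesian output, in case the 36 tasks shown above (6 x 6 partitions from cartesian) are simply too large to finish; the utilInst calls are the same ones from my code above, and the partition count of 200 is illustrative:
val normalisedVectors = productUserVectors
  .map(line => utilInst.normaliseVector(line))
  .sortBy(_._1.toInt)
  .cache() // reused on both sides of the cartesian below

val combinedRDD = normalisedVectors
  .cartesian(normalisedVectors)
  .filter(line => line._1._1.toInt > line._2._1.toInt && utilInst.filterStyleAtp(line._1._1, line._2._1))
  .repartition(200) // illustrative count: spread the surviving pairs over many smaller tasks

val scoresRDD = combinedRDD.map(line => utilInst.getScore(line)).filter(line => line._3 > 0)
val finalRDD = scoresRDD.map(line => (line._1, List((line._2, line._3)))).reduceByKey(_ ++ _)
finalRDD.saveAsTextFile(outputPath)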
Related
I'm trying to execute the example program shipped in the Spark directory on an HDP cluster, "/spark2/examples/src/main/python/streaming/kafka_wordcount.py", which tries to read a Kafka topic but gives a ZooKeeper server timeout error.
Spark is installed on the HDP cluster and Kafka is running on the HDF cluster; the two are different clusters in the same VPC on AWS.
The command executed to run the Spark example on the HDP cluster is "bin/spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.3.0.jar examples/src/main/python/streaming/kafka_wordcount.py HDF-cluster-ip-address:2181 topic"
Error log:
-------------------------------------------
Time: 2018-06-20 07:51:56
-------------------------------------------
18/06/20 07:51:56 INFO JobScheduler: Finished job streaming job 1529481116000 ms.0 from job set of time 1529481116000 ms
18/06/20 07:51:56 INFO JobScheduler: Total delay: 0.171 s for time 1529481116000 ms (execution: 0.145 s)
18/06/20 07:51:56 INFO PythonRDD: Removing RDD 94 from persistence list
18/06/20 07:51:56 INFO BlockManager: Removing RDD 94
18/06/20 07:51:56 INFO BlockRDD: Removing RDD 89 from persistence list
18/06/20 07:51:56 INFO BlockManager: Removing RDD 89
18/06/20 07:51:56 INFO KafkaInputDStream: Removing blocks of RDD BlockRDD[89] at createStream at NativeMethodAccessorImpl.java:0 of time 1529481116000 ms
18/06/20 07:51:56 INFO ReceivedBlockTracker: Deleting batches: 1529481114000 ms
18/06/20 07:51:56 INFO InputInfoTracker: remove old batch metadata: 1529481114000 ms
18/06/20 07:51:57 INFO JobScheduler: Added jobs for time 1529481117000 ms
18/06/20 07:51:57 INFO JobScheduler: Starting job streaming job 1529481117000 ms.0 from job set of time 1529481117000 ms
18/06/20 07:51:57 INFO SparkContext: Starting job: runJob at PythonRDD.scala:141
18/06/20 07:51:57 INFO DAGScheduler: Registering RDD 107 (call at /usr/hdp/2.6.5.0-292/spark2/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py:2257)
18/06/20 07:51:57 INFO DAGScheduler: Got job 27 (runJob at PythonRDD.scala:141) with 1 output partitions
18/06/20 07:51:57 INFO DAGScheduler: Final stage: ResultStage 54 (runJob at PythonRDD.scala:141)
18/06/20 07:51:57 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 53)
18/06/20 07:51:57 INFO DAGScheduler: Missing parents: List()
18/06/20 07:51:57 INFO DAGScheduler: Submitting ResultStage 54 (PythonRDD[111] at RDD at PythonRDD.scala:48), which has no missing parents
18/06/20 07:51:57 INFO MemoryStore: Block broadcast_27 stored as values in memory (estimated size 7.0 KB, free 366.0 MB)
18/06/20 07:51:57 INFO MemoryStore: Block broadcast_27_piece0 stored as bytes in memory (estimated size 4.1 KB, free 366.0 MB)
18/06/20 07:51:57 INFO BlockManagerInfo: Added broadcast_27_piece0 in memory on ip-10-29-3-74.ec2.internal:46231 (size: 4.1 KB, free: 366.2 MB)
18/06/20 07:51:57 INFO SparkContext: Created broadcast 27 from broadcast at DAGScheduler.scala:1039
18/06/20 07:51:57 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 54 (PythonRDD[111] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(0))
18/06/20 07:51:57 INFO TaskSchedulerImpl: Adding task set 54.0 with 1 tasks
18/06/20 07:51:57 INFO TaskSetManager: Starting task 0.0 in stage 54.0 (TID 53, localhost, executor driver, partition 0, PROCESS_LOCAL, 7649 bytes)
18/06/20 07:51:57 INFO Executor: Running task 0.0 in stage 54.0 (TID 53)
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -881, init = 921, finish = 0
18/06/20 07:51:57 INFO PythonRunner: Times: total = 41, boot = -881, init = 922, finish = 0
18/06/20 07:51:57 INFO Executor: Finished task 0.0 in stage 54.0 (TID 53). 1493 bytes result sent to driver
18/06/20 07:51:57 INFO TaskSetManager: Finished task 0.0 in stage 54.0 (TID 53) in 48 ms on localhost (executor driver) (1/1)
18/06/20 07:51:57 INFO TaskSchedulerImpl: Removed TaskSet 54.0, whose tasks have all completed, from pool
18/06/20 07:51:57 INFO DAGScheduler: ResultStage 54 (runJob at PythonRDD.scala:141) finished in 0.055 s
18/06/20 07:51:57 INFO DAGScheduler: Job 27 finished: runJob at PythonRDD.scala:141, took 0.058062 s
18/06/20 07:51:57 INFO ZooKeeper: Session: 0x0 closed
18/06/20 07:51:57 INFO SparkContext: Starting job: runJob at PythonRDD.scala:141
18/06/20 07:51:57 INFO DAGScheduler: Got job 28 (runJob at PythonRDD.scala:141) with 3 output partitions
18/06/20 07:51:57 INFO DAGScheduler: Final stage: ResultStage 56 (runJob at PythonRDD.scala:141)
18/06/20 07:51:57 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 55)
18/06/20 07:51:57 INFO DAGScheduler: Missing parents: List()
18/06/20 07:51:57 INFO DAGScheduler: Submitting ResultStage 56 (PythonRDD[112] at RDD at PythonRDD.scala:48), which has no missing parents
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Stopping receiver with message: Error starting receiver 0: org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Called receiver onStop
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Deregistering receiver 0
18/06/20 07:51:57 INFO MemoryStore: Block broadcast_28 stored as values in memory (estimated size 7.0 KB, free 365.9 MB)
18/06/20 07:51:57 INFO MemoryStore: Block broadcast_28_piece0 stored as bytes in memory (estimated size 4.1 KB, free 365.9 MB)
18/06/20 07:51:57 INFO ClientCnxn: EventThread shut down
18/06/20 07:51:57 INFO BlockManagerInfo: Added broadcast_28_piece0 in memory on ip-10-29-3-74.ec2.internal:46231 (size: 4.1 KB, free: 366.2 MB)
18/06/20 07:51:57 INFO SparkContext: Created broadcast 28 from broadcast at DAGScheduler.scala:1039
18/06/20 07:51:57 INFO DAGScheduler: Submitting 3 missing tasks from ResultStage 56 (PythonRDD[112] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(1, 2, 3))
18/06/20 07:51:57 INFO TaskSchedulerImpl: Adding task set 56.0 with 3 tasks
18/06/20 07:51:57 INFO TaskSetManager: Starting task 0.0 in stage 56.0 (TID 54, localhost, executor driver, partition 1, PROCESS_LOCAL, 7649 bytes)
18/06/20 07:51:57 INFO TaskSetManager: Starting task 1.0 in stage 56.0 (TID 55, localhost, executor driver, partition 2, PROCESS_LOCAL, 7649 bytes)
18/06/20 07:51:57 INFO TaskSetManager: Starting task 2.0 in stage 56.0 (TID 56, localhost, executor driver, partition 3, PROCESS_LOCAL, 7649 bytes)
18/06/20 07:51:57 INFO Executor: Running task 1.0 in stage 56.0 (TID 55)
18/06/20 07:51:57 INFO Executor: Running task 2.0 in stage 56.0 (TID 56)
18/06/20 07:51:57 INFO Executor: Running task 0.0 in stage 56.0 (TID 54)
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 0 blocks
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/20 07:51:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/06/20 07:51:57 ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Stopped receiver 0
18/06/20 07:51:57 INFO BlockGenerator: Stopping BlockGenerator
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -947, init = 987, finish = 0
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -947, init = 987, finish = 0
18/06/20 07:51:57 INFO PythonRunner: Times: total = 41, boot = -944, init = 985, finish = 0
18/06/20 07:51:57 INFO Executor: Finished task 1.0 in stage 56.0 (TID 55). 1536 bytes result sent to driver
18/06/20 07:51:57 INFO TaskSetManager: Finished task 1.0 in stage 56.0 (TID 55) in 52 ms on localhost (executor driver) (1/3)
18/06/20 07:51:57 INFO PythonRunner: Times: total = 45, boot = -944, init = 989, finish = 0
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -32, init = 72, finish = 0
18/06/20 07:51:57 INFO Executor: Finished task 0.0 in stage 56.0 (TID 54). 1536 bytes result sent to driver
18/06/20 07:51:57 INFO TaskSetManager: Finished task 0.0 in stage 56.0 (TID 54) in 56 ms on localhost (executor driver) (2/3)
18/06/20 07:51:57 INFO PythonRunner: Times: total = 40, boot = -33, init = 73, finish = 0
18/06/20 07:51:57 INFO Executor: Finished task 2.0 in stage 56.0 (TID 56). 1536 bytes result sent to driver
18/06/20 07:51:57 INFO TaskSetManager: Finished task 2.0 in stage 56.0 (TID 56) in 58 ms on localhost (executor driver) (3/3)
18/06/20 07:51:57 INFO TaskSchedulerImpl: Removed TaskSet 56.0, whose tasks have all completed, from pool
18/06/20 07:51:57 INFO DAGScheduler: ResultStage 56 (runJob at PythonRDD.scala:141) finished in 0.063 s
18/06/20 07:51:57 INFO DAGScheduler: Job 28 finished: runJob at PythonRDD.scala:141, took 0.065728 s
-------------------------------------------
Time: 2018-06-20 07:51:57
-------------------------------------------
18/06/20 07:51:57 INFO JobScheduler: Finished job streaming job 1529481117000 ms.0 from job set of time 1529481117000 ms
18/06/20 07:51:57 INFO JobScheduler: Total delay: 0.169 s for time 1529481117000 ms (execution: 0.149 s)
18/06/20 07:51:57 INFO PythonRDD: Removing RDD 102 from persistence list
18/06/20 07:51:57 INFO BlockManager: Removing RDD 102
18/06/20 07:51:57 INFO BlockRDD: Removing RDD 97 from persistence list
18/06/20 07:51:57 INFO KafkaInputDStream: Removing blocks of RDD BlockRDD[97] at createStream at NativeMethodAccessorImpl.java:0 of time 1529481117000 ms
18/06/20 07:51:57 INFO BlockManager: Removing RDD 97
18/06/20 07:51:57 INFO ReceivedBlockTracker: Deleting batches: 1529481115000 ms
18/06/20 07:51:57 INFO InputInfoTracker: remove old batch metadata: 1529481115000 ms
18/06/20 07:51:57 INFO RecurringTimer: Stopped timer for BlockGenerator after time 1529481117400
18/06/20 07:51:57 INFO BlockGenerator: Waiting for block pushing thread to terminate
18/06/20 07:51:57 INFO BlockGenerator: Pushing out the last 0 blocks
18/06/20 07:51:57 INFO BlockGenerator: Stopped block pushing thread
18/06/20 07:51:57 INFO BlockGenerator: Stopped BlockGenerator
18/06/20 07:51:57 INFO ReceiverSupervisorImpl: Waiting for receiver to be stopped
18/06/20 07:51:57 ERROR ReceiverSupervisorImpl: Stopped receiver with error: org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
18/06/20 07:51:57 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/06/20 07:51:57 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/06/20 07:51:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
18/06/20 07:51:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/06/20 07:51:57 INFO TaskSchedulerImpl: Cancelling stage 0
18/06/20 07:51:57 INFO DAGScheduler: ResultStage 0 (start at NativeMethodAccessorImpl.java:0) failed in 13.256 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
18/06/20 07:51:57 ERROR ReceiverTracker: Receiver has been stopped. Try to restart it.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Caused by: org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 10000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:171)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:126)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:143)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:94)
at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:100)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:149)
at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:131)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:600)
at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:590)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:2185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Even within the same VPC, check the security groups of the two systems. If they have different security groups, you probably need to allow the relevant inbound and outbound ports. Another way of verifying this is to try to telnet and ping each system from the other.
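If telnet isn't available on the node, a minimal Scala reachability check along the same lines (run from spark-shell on the HDP side; the host placeholder stands for the ZooKeeper address used in the spark-submit command above, and the 10 s timeout mirrors the one in the ZkTimeoutException) could look like this:
import java.net.{InetSocketAddress, Socket}

val zkHost = "HDF-cluster-ip-address" // placeholder for the actual ZooKeeper host
val zkPort = 2181
val socket = new Socket()
try {
  socket.connect(new InetSocketAddress(zkHost, zkPort), 10000) // same 10 s timeout as the error
  println(s"$zkHost:$zkPort is reachable")
} catch {
  case e: java.io.IOException => println(s"$zkHost:$zkPort is NOT reachable: ${e.getMessage}")
} finally {
  socket.close()
}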
In Scala, what happens when I use a global map variable in Spark without broadcasting it?
E.g. if I build a variable using collect* (such as collectAsMap), it behaves like a global variable, and I can use it in all RDD.mapValues() functions without explicitly broadcasting it.
BUT I know Spark works in a distributed fashion, and it should not be able to read a variable stored in the driver's memory without broadcasting it. So, what happened?
Code example (this code computes tf-idf over text, where the document frequencies (df) are stored in a Map):
//dfMap is a String->int Map in memory
//Array[(String, Int)] = Array((B,2), (A,3), (C,1))
val dfMap = dfrdd.collectAsMap;
//tfrdd is an RDD, and I can use dfMap in its mapValues function
//tfrdd: Array((doc1,Map(A -> 3.0)), (doc2,Map(A -> 2.0, B -> 1.0)))
val tfidfrdd = tfrdd.mapValues( e => e.map(x => x._1 -> x._2 * lineNum / dfMap.getOrElse(x._1, 1) ) );
tfidfrdd.saveAsTextFile("/somedir/result/");
The code works just fine. My question is: what happened there? Does the driver send dfMap to all the workers just as broadcasting would, or something else?
What's the difference if I broadcast explicitly, like this:
val dfMap = sc.broadcast(dfrdd.collectAsMap)
val tfidfrdd = tfrdd.mapValues( e => e.map(x => x._1 -> x._2 * lineNum / dfMap.value.getOrElse(x._1, 1) ) )
I've checked more resources, aggregated others' answers, and put them in order. The difference between using an external variable DIRECTLY (my so-called "global variable") and BROADCASTING a variable via sc.broadcast() is this:
1) When you use an external variable directly, Spark sends a copy of the serialized variable along with each TASK, whereas with sc.broadcast the variable is sent once per EXECUTOR. The number of tasks is normally about 10 times the number of executors, so when the variable (say, a map) is large enough (more than 20 KB), the former approach can cost a lot of time in network transfer and cause frequent GC, which slows Spark down. Hence large variables (>20 KB) should be broadcast explicitly.
2) When you use an external variable directly, the variable is not persisted; it ends with the task and cannot be reused. With sc.broadcast(), the variable is automatically persisted in the executors' memory and lasts until you explicitly unpersist it, so a broadcast variable is available across tasks and stages. If the variable is expected to be used multiple times, sc.broadcast() is therefore recommended.
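As a minimal sketch of the explicit pattern (the lookup map and the RDD here are made up for illustration):
val lookup = sc.broadcast(Map("A" -> 1, "B" -> 2)) // shipped once per executor
val keys = sc.parallelize(Seq("A", "B", "C"))
// lookup.value is read on the executors; the same copy is reused across tasks and stages
val resolved = keys.map(k => k -> lookup.value.getOrElse(k, 0))
resolved.collect() // Array((A,1), (B,2), (C,0))
lookup.unpersist() // release the executor-side copies when no longer needed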
There is no difference between a global map variable and a broadcast variable. If we use a global variable in an RDD's map function, it will be broadcast to all nodes. For example:
scala> val list = List(1,2,3)
list: List[Int] = List(1, 2, 3)
scala> val rdd = sc.parallelize(List(1,2,3,4))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24
scala> rdd.filter(elem => list.contains(elem)).collect
17/03/16 10:21:53 INFO SparkContext: Starting job: collect at <console>:29
17/03/16 10:21:53 INFO DAGScheduler: Got job 3 (collect at <console>:29) with 4 output partitions
17/03/16 10:21:53 INFO DAGScheduler: Final stage: ResultStage 3 (collect at <console>:29)
17/03/16 10:21:53 INFO DAGScheduler: Parents of final stage: List()
17/03/16 10:21:53 INFO DAGScheduler: Missing parents: List()
17/03/16 10:21:53 DEBUG DAGScheduler: submitStage(ResultStage 3)
17/03/16 10:21:53 DEBUG DAGScheduler: missing: List()
17/03/16 10:21:53 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[5] at filter at <console>:29), which has no missing parents
17/03/16 10:21:53 DEBUG DAGScheduler: submitMissingTasks(ResultStage 3)
17/03/16 10:21:53 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 5.0 KB, free 366.3 MB)
17/03/16 10:21:53 DEBUG BlockManager: Put block broadcast_4 locally took 1 ms
17/03/16 10:21:53 DEBUG BlockManager: Putting block broadcast_4 without replication took 1 ms
17/03/16 10:21:53 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 2.5 KB, free 366.3 MB)
17/03/16 10:21:53 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.2.123:37645 (size: 2.5 KB, free: 366.3 MB)
17/03/16 10:21:53 DEBUG BlockManagerMaster: Updated info of block broadcast_4_piece0
17/03/16 10:21:53 DEBUG BlockManager: Told master about block broadcast_4_piece0
17/03/16 10:21:53 DEBUG BlockManager: Put block broadcast_4_piece0 locally took 2 ms
17/03/16 10:21:53 DEBUG ContextCleaner: Got cleaning task CleanBroadcast(1)
17/03/16 10:21:53 DEBUG BlockManager: Putting block broadcast_4_piece0 without replication took 2 ms
17/03/16 10:21:53 DEBUG ContextCleaner: Cleaning broadcast 1
17/03/16 10:21:53 DEBUG TorrentBroadcast: Unpersisting TorrentBroadcast 1
17/03/16 10:21:53 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:996
17/03/16 10:21:53 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 3 (MapPartitionsRDD[5] at filter at <console>:29)
17/03/16 10:21:53 DEBUG DAGScheduler: New pending partitions: Set(0, 1, 2, 3)
17/03/16 10:21:53 INFO TaskSchedulerImpl: Adding task set 3.0 with 4 tasks
17/03/16 10:21:53 DEBUG TaskSetManager: Epoch for TaskSet 3.0: 0
17/03/16 10:21:53 DEBUG TaskSetManager: Valid locality levels for TaskSet 3.0: NO_PREF, ANY
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 0
17/03/16 10:21:53 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 12, localhost, executor driver, partition 0, PROCESS_LOCAL, 5886 bytes)
17/03/16 10:21:53 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 13, localhost, executor driver, partition 1, PROCESS_LOCAL, 5886 bytes)
17/03/16 10:21:53 INFO TaskSetManager: Starting task 2.0 in stage 3.0 (TID 14, localhost, executor driver, partition 2, PROCESS_LOCAL, 5886 bytes)
17/03/16 10:21:53 INFO TaskSetManager: Starting task 3.0 in stage 3.0 (TID 15, localhost, executor driver, partition 3, PROCESS_LOCAL, 5886 bytes)
17/03/16 10:21:53 INFO Executor: Running task 0.0 in stage 3.0 (TID 12)
17/03/16 10:21:53 DEBUG Executor: Task 12's epoch is 0
17/03/16 10:21:53 DEBUG BlockManager: Getting local block broadcast_4
17/03/16 10:21:53 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:21:53 INFO Executor: Running task 2.0 in stage 3.0 (TID 14)
17/03/16 10:21:53 INFO Executor: Running task 1.0 in stage 3.0 (TID 13)
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: removing broadcast 1
17/03/16 10:21:53 DEBUG BlockManager: Removing broadcast 1
17/03/16 10:21:53 DEBUG BlockManager: Removing block broadcast_1
17/03/16 10:21:53 INFO Executor: Running task 3.0 in stage 3.0 (TID 15)
17/03/16 10:21:53 DEBUG Executor: Task 13's epoch is 0
17/03/16 10:21:53 DEBUG MemoryStore: Block broadcast_1 of size 5112 dropped from memory (free 384072627)
17/03/16 10:21:53 DEBUG BlockManager: Removing block broadcast_1_piece0
17/03/16 10:21:53 DEBUG MemoryStore: Block broadcast_1_piece0 of size 2535 dropped from memory (free 384075162)
17/03/16 10:21:53 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.2.123:37645 in memory (size: 2.5 KB, free: 366.3 MB)
17/03/16 10:21:53 DEBUG BlockManagerMaster: Updated info of block broadcast_1_piece0
17/03/16 10:21:53 DEBUG BlockManager: Told master about block broadcast_1_piece0
17/03/16 10:21:53 DEBUG BlockManager: Getting local block broadcast_4
17/03/16 10:21:53 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:21:53 DEBUG Executor: Task 14's epoch is 0
17/03/16 10:21:53 DEBUG BlockManager: Getting local block broadcast_4
17/03/16 10:21:53 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:21:53 DEBUG Executor: Task 15's epoch is 0
17/03/16 10:21:53 DEBUG BlockManager: Getting local block broadcast_4
17/03/16 10:21:53 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: Done removing broadcast 1, response is 0
17/03/16 10:21:53 DEBUG ContextCleaner: Cleaned broadcast 1
17/03/16 10:21:53 DEBUG ContextCleaner: Got cleaning task CleanBroadcast(3)
17/03/16 10:21:53 DEBUG ContextCleaner: Cleaning broadcast 3
17/03/16 10:21:53 DEBUG TorrentBroadcast: Unpersisting TorrentBroadcast 3
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: removing broadcast 3
17/03/16 10:21:53 DEBUG BlockManager: Removing broadcast 3
17/03/16 10:21:53 DEBUG BlockManager: Removing block broadcast_3_piece0
17/03/16 10:21:53 DEBUG MemoryStore: Block broadcast_3_piece0 of size 3309 dropped from memory (free 384078471)
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: Sent response: 0 to 192.168.2.123:40909
17/03/16 10:21:53 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 192.168.2.123:37645 in memory (size: 3.2 KB, free: 366.3 MB)
17/03/16 10:21:53 DEBUG BlockManagerMaster: Updated info of block broadcast_3_piece0
17/03/16 10:21:53 DEBUG BlockManager: Told master about block broadcast_3_piece0
17/03/16 10:21:53 DEBUG BlockManager: Removing block broadcast_3
17/03/16 10:21:53 DEBUG MemoryStore: Block broadcast_3 of size 6904 dropped from memory (free 384085375)
17/03/16 10:21:53 INFO Executor: Finished task 1.0 in stage 3.0 (TID 13). 912 bytes result sent to driver
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: Done removing broadcast 3, response is 0
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: Sent response: 0 to 192.168.2.123:40909
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 3
17/03/16 10:21:53 DEBUG TaskSetManager: No tasks for locality level NO_PREF, so moving to locality level ANY
17/03/16 10:21:53 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 13) in 36 ms on localhost (executor driver) (1/4)
17/03/16 10:21:53 INFO Executor: Finished task 2.0 in stage 3.0 (TID 14). 912 bytes result sent to driver
17/03/16 10:21:53 DEBUG ContextCleaner: Cleaned broadcast 3
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 2
17/03/16 10:21:53 INFO Executor: Finished task 0.0 in stage 3.0 (TID 12). 912 bytes result sent to driver
17/03/16 10:21:53 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 14) in 36 ms on localhost (executor driver) (2/4)
17/03/16 10:21:53 INFO Executor: Finished task 3.0 in stage 3.0 (TID 15). 908 bytes result sent to driver
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 1
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 0
17/03/16 10:21:53 INFO TaskSetManager: Finished task 3.0 in stage 3.0 (TID 15) in 36 ms on localhost (executor driver) (3/4)
17/03/16 10:21:53 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 12) in 45 ms on localhost (executor driver) (4/4)
17/03/16 10:21:53 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
17/03/16 10:21:53 INFO DAGScheduler: ResultStage 3 (collect at <console>:29) finished in 0.045 s
17/03/16 10:21:53 DEBUG DAGScheduler: After removal of stage 3, remaining stages = 0
17/03/16 10:21:53 INFO DAGScheduler: Job 3 finished: collect at <console>:29, took 0.097564 s
res4: Array[Int] = Array(1, 2, 3)
In the above log we can clearly see that the global variable list is broadcast. The same thing happens when we explicitly broadcast the list:
scala> val br = sc.broadcast(list)
17/03/16 10:26:40 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 160.0 B, free 366.3 MB)
17/03/16 10:26:40 DEBUG BlockManager: Put block broadcast_5 locally took 1 ms
17/03/16 10:26:40 DEBUG BlockManager: Putting block broadcast_5 without replication took 1 ms
17/03/16 10:26:40 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 227.0 B, free 366.3 MB)
17/03/16 10:26:40 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.2.123:37645 (size: 227.0 B, free: 366.3 MB)
17/03/16 10:26:40 DEBUG BlockManagerMaster: Updated info of block broadcast_5_piece0
17/03/16 10:26:40 DEBUG BlockManager: Told master about block broadcast_5_piece0
17/03/16 10:26:40 DEBUG BlockManager: Put block broadcast_5_piece0 locally took 1 ms
17/03/16 10:26:40 DEBUG BlockManager: Putting block broadcast_5_piece0 without replication took 1 ms
17/03/16 10:26:40 INFO SparkContext: Created broadcast 5 from broadcast at <console>:26
br: org.apache.spark.broadcast.Broadcast[List[Int]] = Broadcast(5)
scala> rdd.filter(elem => br.value.contains(elem)).collect
17/03/16 10:27:50 INFO SparkContext: Starting job: collect at <console>:31
17/03/16 10:27:50 INFO DAGScheduler: Got job 0 (collect at <console>:31) with 4 output partitions
17/03/16 10:27:50 INFO DAGScheduler: Final stage: ResultStage 0 (collect at <console>:31)
17/03/16 10:27:50 INFO DAGScheduler: Parents of final stage: List()
17/03/16 10:27:50 INFO DAGScheduler: Missing parents: List()
17/03/16 10:27:50 DEBUG DAGScheduler: submitStage(ResultStage 0)
17/03/16 10:27:50 DEBUG DAGScheduler: missing: List()
17/03/16 10:27:50 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at filter at <console>:31), which has no missing parents
17/03/16 10:27:50 DEBUG DAGScheduler: submitMissingTasks(ResultStage 0)
17/03/16 10:27:50 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 6.7 KB, free 366.3 MB)
17/03/16 10:27:50 DEBUG BlockManager: Put block broadcast_1 locally took 6 ms
17/03/16 10:27:50 DEBUG BlockManager: Putting block broadcast_1 without replication took 6 ms
17/03/16 10:27:50 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.2 KB, free 366.3 MB)
17/03/16 10:27:50 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.2.123:37303 (size: 3.2 KB, free: 366.3 MB)
17/03/16 10:27:50 DEBUG BlockManagerMaster: Updated info of block broadcast_1_piece0
17/03/16 10:27:50 DEBUG BlockManager: Told master about block broadcast_1_piece0
17/03/16 10:27:50 DEBUG BlockManager: Put block broadcast_1_piece0 locally took 2 ms
17/03/16 10:27:50 DEBUG BlockManager: Putting block broadcast_1_piece0 without replication took 2 ms
17/03/16 10:27:50 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996
17/03/16 10:27:50 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at filter at <console>:31)
17/03/16 10:27:50 DEBUG DAGScheduler: New pending partitions: Set(0, 1, 2, 3)
17/03/16 10:27:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
17/03/16 10:27:50 DEBUG TaskSetManager: Epoch for TaskSet 0.0: 0
17/03/16 10:27:50 DEBUG TaskSetManager: Valid locality levels for TaskSet 0.0: NO_PREF, ANY
17/03/16 10:27:50 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 0
17/03/16 10:27:50 DEBUG TaskSetManager: Valid locality levels for TaskSet 0.0: NO_PREF, ANY
17/03/16 10:27:51 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 5885 bytes)
17/03/16 10:27:51 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 5885 bytes)
17/03/16 10:27:51 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 5885 bytes)
17/03/16 10:27:51 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 5885 bytes)
17/03/16 10:27:51 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
17/03/16 10:27:51 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/03/16 10:27:51 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
17/03/16 10:27:51 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
17/03/16 10:27:51 DEBUG Executor: Task 0's epoch is 0
17/03/16 10:27:51 DEBUG Executor: Task 2's epoch is 0
17/03/16 10:27:51 DEBUG Executor: Task 3's epoch is 0
17/03/16 10:27:51 DEBUG Executor: Task 1's epoch is 0
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_1
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_1
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_1
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_1
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_0
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_0
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_0
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_0
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 908 bytes result sent to driver
17/03/16 10:27:51 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 999 bytes result sent to driver
17/03/16 10:27:51 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 912 bytes result sent to driver
17/03/16 10:27:51 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 912 bytes result sent to driver
17/03/16 10:27:51 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 3
17/03/16 10:27:51 DEBUG TaskSetManager: No tasks for locality level NO_PREF, so moving to locality level ANY
17/03/16 10:27:51 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 2
17/03/16 10:27:51 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 1
17/03/16 10:27:51 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 0
17/03/16 10:27:51 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 165 ms on localhost (executor driver) (1/4)
17/03/16 10:27:51 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 180 ms on localhost (executor driver) (2/4)
17/03/16 10:27:51 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 249 ms on localhost (executor driver) (3/4)
17/03/16 10:27:51 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 186 ms on localhost (executor driver) (4/4)
17/03/16 10:27:51 INFO DAGScheduler: ResultStage 0 (collect at <console>:31) finished in 0.264 s
17/03/16 10:27:51 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/03/16 10:27:51 DEBUG DAGScheduler: After removal of stage 0, remaining stages = 0
17/03/16 10:27:51 INFO DAGScheduler: Job 0 finished: collect at <console>:31, took 0.381615 s
res1: Array[Int] = Array(1, 2, 3)
The same is the case with an explicit broadcast variable.
When you broadcast, the data is cached on all the nodes. So when you perform an action (collect, saveAsTextFile, head), the broadcast values are already available to all the worker nodes.
But if you do not broadcast the value, each task has to fetch a fresh copy of the data from the driver every time an action runs.
First off, this is a Spark thing, not a Scala one.
The difference is that plain values are shipped every time they are used, whereas explicit broadcast variables are cached on the nodes.
"Broadcast variables are created from a variable v by calling
SparkContext.broadcast(v). The broadcast variable is a wrapper around
v, and its value can be accessed by calling the value method ... After the broadcast variable is created, it should
be used instead of the value v in any functions run on the cluster so
that v is not shipped to the nodes more than once"
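To make the caching difference concrete, here is a minimal sketch (the names are illustrative) contrasting the two styles:
// Plain local value: a copy is serialized into the closure and shipped
// with the tasks of every job that uses it
val lookup = Map("a" -> 1, "b" -> 2)
val rdd = sc.parallelize(Seq("a", "b", "c"))
rdd.map(k => lookup.getOrElse(k, 0)).collect()

// Explicit broadcast: sent to each node once and cached there, so later
// jobs reuse the local copy instead of re-shipping it
val lookupBr = sc.broadcast(lookup)
rdd.map(k => lookupBr.value.getOrElse(k, 0)).collect()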
I am trying to run a Spark application in AWS EMR. I have written the whole program in Spark SQL. Since the program was taking too long to complete, I checked the logs and observed that the executors were already executing tasks, but I did not find any log lines about the parsing of the SQL commands.
Here is a snippet of the log info.
17/02/12 04:32:56 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 1603 on executor id: 20 hostname: ip-10-11-203-20.ec2.internal.
17/02/12 04:32:56 INFO TaskSetManager: Finished task 179.0 in stage 32.0 (TID 1585) in 42268 ms on ip-10-11-203-20.ec2.internal (182/200)
17/02/12 04:33:02 INFO TaskSetManager: Starting task 198.0 in stage 32.0 (TID 1604, ip-10-178-43-214.ec2.internal, partition 198, NODE_LOCAL, 5295 bytes)
17/02/12 04:33:02 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 1604 on executor id: 13 hostname: ip-10-178-43-214.ec2.internal.
17/02/12 04:33:02 INFO TaskSetManager: Finished task 180.0 in stage 32.0 (TID 1588) in 39417 ms on ip-10-178-43-214.ec2.internal (183/200)
17/02/12 04:33:03 INFO TaskSetManager: Starting task 199.0 in stage 32.0 (TID 1605, ip-10-11-203-20.ec2.internal, partition 199, NODE_LOCAL, 5295 bytes)
17/02/12 04:33:03 INFO YarnSchedulerBackend$YarnDriverEndpoint: Launching task 1605 on executor id: 18 hostname: ip-10-11-203-20.ec2.internal.
17/02/12 04:33:03 INFO TaskSetManager: Finished task 183.0 in stage 32.0 (TID 1589) in 38574 ms on ip-10-11-203-20.ec2.internal (184/200)
17/02/12 04:33:04 INFO TaskSetManager: Finished task 186.0 in stage 32.0 (TID 1592) in 34329 ms on ip-10-11-203-20.ec2.internal (185/200)
17/02/12 04:33:15 INFO TaskSetManager: Finished task 187.0 in stage 32.0 (TID 1593) in 38905 ms on ip-10-178-43-214.ec2.internal (186/200)
Can anyone please explain what is going on here? Thanks.
Parsing SQL is actually quite fast in Spark, and if you take a look at the beginning of the logs you will find the parsing entries for sure.
What you see now is just the execution of the query: Spark divides every execution stage into tasks (to achieve parallel execution), and these Finished task logs simply inform you that your query is in progress.
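If you want to confirm this, you can raise the driver's log level and look for the parser entries before the first stage starts (with a HiveContext they show up as ParseDriver: Parsing command: ...). A minimal spark-shell sketch; the table and column names are only illustrative:
// Make sure INFO-level messages are not being filtered out
// (setLogLevel is available since Spark 1.4)
sc.setLogLevel("INFO")
// The "Parsing command" line is logged up front, long before the
// per-task "Finished task" lines of the execution stages
sqlContext.sql("SELECT category, count(*) FROM events GROUP BY category").collect()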
I have created a Spark UDF. When I run it in the spark-shell it runs perfectly fine, but when I register it and use it in my Spark SQL query it gives a NullPointerException.
scala> test_proc("1605","(#supp In (-1,118)")
16/03/07 10:35:04 INFO TaskSetManager: Finished task 0.0 in stage 21.0 (TID 220) in 62 ms on cdts1hdpdn01d.rxcorp.com (1/1)
16/03/07 10:35:04 INFO YarnScheduler: Removed TaskSet 21.0, whose tasks have all completed, from pool
16/03/07 10:35:04 INFO DAGScheduler: ResultStage 21 (first at :45) finished in 0.062 s
16/03/07 10:35:04 INFO DAGScheduler: Job 16 finished: first at :45, took 2.406408 s
res14: Int = 1
scala>
But when I register it and use it in my Spark SQL query, it gives an NPE.
scala> sqlContext.udf.register("store_proc", test_proc _)
scala> hiveContext.sql("select store_proc('1605' , '(#supp In (-1,118)')").first.getInt(0)
16/03/07 10:37:58 INFO ParseDriver: Parsing command: select store_proc('1605' , '(#supp In (-1,118)')
16/03/07 10:37:58 INFO ParseDriver: Parse Completed
16/03/07 10:37:58 INFO SparkContext: Starting job: first at :24
16/03/07 10:37:58 INFO DAGScheduler: Got job 17 (first at :24) with 1 output partitions
16/03/07 10:37:58 INFO DAGScheduler: Final stage: ResultStage 22(first at :24)
16/03/07 10:37:58 INFO DAGScheduler: Parents of final stage: List()
16/03/07 10:37:58 INFO DAGScheduler: Missing parents: List()
16/03/07 10:37:58 INFO DAGScheduler: Submitting ResultStage 22 (MapPartitionsRDD[86] at first at :24), which has no missing parents
16/03/07 10:37:58 INFO MemoryStore: ensureFreeSpace(10520) called with curMem=1472899, maxMem=2222739947
16/03/07 10:37:58 INFO MemoryStore: Block broadcast_30 stored as values in memory (estimated size 10.3 KB, free 2.1 GB)
16/03/07 10:37:58 INFO MemoryStore: ensureFreeSpace(4774) called with curMem=1483419, maxMem=2222739947
16/03/07 10:37:58 INFO MemoryStore: Block broadcast_30_piece0 stored as bytes in memory (estimated size 4.7 KB, free 2.1 GB)
16/03/07 10:37:58 INFO BlockManagerInfo: Added broadcast_30_piece0 in memory on 162.44.214.87:47564 (size: 4.7 KB, free: 2.1 GB)
16/03/07 10:37:58 INFO SparkContext: Created broadcast 30 from broadcast at DAGScheduler.scala:861
16/03/07 10:37:58 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 22 (MapPartitionsRDD[86] at first at :24)
16/03/07 10:37:58 INFO YarnScheduler: Adding task set 22.0 with 1 tasks
16/03/07 10:37:58 INFO TaskSetManager: Starting task 0.0 in stage 22.0 (TID 221, cdts1hdpdn02d.rxcorp.com, partition 0,PROCESS_LOCAL, 2155 bytes)
16/03/07 10:37:58 INFO BlockManagerInfo: Added broadcast_30_piece0 in memory on cdts1hdpdn02d.rxcorp.com:33678 (size: 4.7 KB, free: 6.7 GB)
16/03/07 10:37:58 WARN TaskSetManager: Lost task 0.0 in stage 22.0 (TID 221, cdts1hdpdn02d.rxcorp.com): java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveContext.parseSql(HiveContext.scala:291)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
at $line20.$read$iwC$iwC$iwC$iwC$iwC$iwC$iwC$iwC.test_proc(:41)
This is a sample of my 'test_proc':
def test_proc(x: String, y: String): Int = {
  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  val z: Int = hiveContext.sql("select 7").first.getInt(0)
  return z
}
Based on the output from a standalone call, it looks like test_proc is executing some kind of Spark action, and this cannot work inside a UDF because Spark doesn't support nested actions on distributed data structures. If test_proc is using a SQLContext, this will result in an NPE, since Spark contexts exist only on the driver.
If that's the case, you'll have to restructure your code to achieve the desired effect, either using local (most likely broadcast) variables or joins.
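For example, instead of creating a context inside the UDF, you can run the query once on the driver and let the UDF close over the plain result. A minimal sketch of that restructuring, assuming the posted test_proc really only needs the result of that driver-side query:
// Runs on the driver, where the HiveContext actually exists
val z: Int = hiveContext.sql("select 7").first.getInt(0)
// The UDF now closes over a plain Int, so it can safely run on the executors
sqlContext.udf.register("store_proc", (x: String, y: String) => z)
hiveContext.sql("select store_proc('1605' , '(#supp In (-1,118)')").first.getInt(0)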
I am trying to run this simple Spark job from IntelliJ IDEA in Scala. However, the Spark UI stops completely right after the object finishes executing. Is there something that I am missing, or am I listening at the wrong location? Scala version - 2.10.4 and Spark - 1.6.0
import org.apache.spark.{SparkConf, SparkContext}
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "C:/spark-1.6.0-bin-hadoop2.6/spark-1.6.0-bin-hadoop2.6/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
16/02/24 01:24:39 INFO SparkContext: Running Spark version 1.6.0
16/02/24 01:24:40 INFO SecurityManager: Changing view acls to: Sivaram Konanki
16/02/24 01:24:40 INFO SecurityManager: Changing modify acls to: Sivaram Konanki
16/02/24 01:24:40 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Sivaram Konanki); users with modify permissions: Set(Sivaram Konanki)
16/02/24 01:24:41 INFO Utils: Successfully started service 'sparkDriver' on port 54881.
16/02/24 01:24:41 INFO Slf4jLogger: Slf4jLogger started
16/02/24 01:24:42 INFO Remoting: Starting remoting
16/02/24 01:24:42 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem#192.168.1.15:54894]
16/02/24 01:24:42 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 54894.
16/02/24 01:24:42 INFO SparkEnv: Registering MapOutputTracker
16/02/24 01:24:42 INFO SparkEnv: Registering BlockManagerMaster
16/02/24 01:24:42 INFO DiskBlockManager: Created local directory at C:\Users\Sivaram Konanki\AppData\Local\Temp\blockmgr-dad99e77-f3a6-4a1d-88d8-3b030be0bd0a
16/02/24 01:24:42 INFO MemoryStore: MemoryStore started with capacity 2.4 GB
16/02/24 01:24:42 INFO SparkEnv: Registering OutputCommitCoordinator
16/02/24 01:24:42 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/02/24 01:24:42 INFO SparkUI: Started SparkUI at http://192.168.1.15:4040
16/02/24 01:24:42 INFO Executor: Starting executor ID driver on host localhost
16/02/24 01:24:43 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54913.
16/02/24 01:24:43 INFO NettyBlockTransferService: Server created on 54913
16/02/24 01:24:43 INFO BlockManagerMaster: Trying to register BlockManager
16/02/24 01:24:43 INFO BlockManagerMasterEndpoint: Registering block manager localhost:54913 with 2.4 GB RAM, BlockManagerId(driver, localhost, 54913)
16/02/24 01:24:43 INFO BlockManagerMaster: Registered BlockManager
16/02/24 01:24:44 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 127.4 KB)
16/02/24 01:24:44 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 141.3 KB)
16/02/24 01:24:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:54913 (size: 13.9 KB, free: 2.4 GB)
16/02/24 01:24:44 INFO SparkContext: Created broadcast 0 from textFile at SimpleApp.scala:11
16/02/24 01:24:45 WARN : Your hostname, OSG-E5450-42 resolves to a loopback/non-reachable address: fe80:0:0:0:d9ff:4f93:5643:703d%wlan3, but we couldn't find any external IP address!
16/02/24 01:24:46 INFO FileInputFormat: Total input paths to process : 1
16/02/24 01:24:46 INFO SparkContext: Starting job: count at SimpleApp.scala:12
16/02/24 01:24:46 INFO DAGScheduler: Got job 0 (count at SimpleApp.scala:12) with 2 output partitions
16/02/24 01:24:46 INFO DAGScheduler: Final stage: ResultStage 0 (count at SimpleApp.scala:12)
16/02/24 01:24:46 INFO DAGScheduler: Parents of final stage: List()
16/02/24 01:24:46 INFO DAGScheduler: Missing parents: List()
16/02/24 01:24:46 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12), which has no missing parents
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 144.5 KB)
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1886.0 B, free 146.3 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:54913 (size: 1886.0 B, free: 2.4 GB)
16/02/24 01:24:46 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/02/24 01:24:46 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12)
16/02/24 01:24:46 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/02/24 01:24:46 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
16/02/24 01:24:46 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/02/24 01:24:46 INFO CacheManager: Partition rdd_1_1 not found, computing it
16/02/24 01:24:46 INFO CacheManager: Partition rdd_1_0 not found, computing it
16/02/24 01:24:46 INFO HadoopRDD: Input split: file:/C:/spark-1.6.0-bin-hadoop2.6/spark-1.6.0-bin-hadoop2.6/README.md:1679+1680
16/02/24 01:24:46 INFO HadoopRDD: Input split: file:/C:/spark-1.6.0-bin-hadoop2.6/spark-1.6.0-bin-hadoop2.6/README.md:0+1679
16/02/24 01:24:46 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/02/24 01:24:46 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/02/24 01:24:46 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/02/24 01:24:46 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/02/24 01:24:46 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/02/24 01:24:46 INFO MemoryStore: Block rdd_1_1 stored as values in memory (estimated size 4.7 KB, free 151.0 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added rdd_1_1 in memory on localhost:54913 (size: 4.7 KB, free: 2.4 GB)
16/02/24 01:24:46 INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 5.4 KB, free 156.5 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added rdd_1_0 in memory on localhost:54913 (size: 5.4 KB, free: 2.4 GB)
16/02/24 01:24:46 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2662 bytes result sent to driver
16/02/24 01:24:46 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2662 bytes result sent to driver
16/02/24 01:24:46 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 170 ms on localhost (1/2)
16/02/24 01:24:46 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 143 ms on localhost (2/2)
16/02/24 01:24:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/02/24 01:24:46 INFO DAGScheduler: ResultStage 0 (count at SimpleApp.scala:12) finished in 0.187 s
16/02/24 01:24:46 INFO DAGScheduler: Job 0 finished: count at SimpleApp.scala:12, took 0.303861 s
16/02/24 01:24:46 INFO SparkContext: Starting job: count at SimpleApp.scala:13
16/02/24 01:24:46 INFO DAGScheduler: Got job 1 (count at SimpleApp.scala:13) with 2 output partitions
16/02/24 01:24:46 INFO DAGScheduler: Final stage: ResultStage 1 (count at SimpleApp.scala:13)
16/02/24 01:24:46 INFO DAGScheduler: Parents of final stage: List()
16/02/24 01:24:46 INFO DAGScheduler: Missing parents: List()
16/02/24 01:24:46 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13), which has no missing parents
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.1 KB, free 159.6 KB)
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1888.0 B, free 161.5 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:54913 (size: 1888.0 B, free: 2.4 GB)
16/02/24 01:24:46 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/02/24 01:24:46 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13)
16/02/24 01:24:46 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
16/02/24 01:24:46 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, partition 1,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
16/02/24 01:24:46 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
16/02/24 01:24:46 INFO BlockManager: Found block rdd_1_0 locally
16/02/24 01:24:46 INFO BlockManager: Found block rdd_1_1 locally
16/02/24 01:24:46 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 2082 bytes result sent to driver
16/02/24 01:24:46 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 2082 bytes result sent to driver
16/02/24 01:24:46 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 34 ms on localhost (1/2)
16/02/24 01:24:46 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 37 ms on localhost (2/2)
Lines with a: 58, Lines with b: 26
16/02/24 01:24:46 INFO DAGScheduler: ResultStage 1 (count at SimpleApp.scala:13) finished in 0.040 s
16/02/24 01:24:46 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/02/24 01:24:46 INFO DAGScheduler: Job 1 finished: count at SimpleApp.scala:13, took 0.068350 s
16/02/24 01:24:46 INFO SparkContext: Invoking stop() from shutdown hook
16/02/24 01:24:46 INFO SparkUI: Stopped Spark web UI at http://192.168.1.15:4040
16/02/24 01:24:46 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/02/24 01:24:46 INFO MemoryStore: MemoryStore cleared
16/02/24 01:24:46 INFO BlockManager: BlockManager stopped
16/02/24 01:24:46 INFO BlockManagerMaster: BlockManagerMaster stopped
16/02/24 01:24:46 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/02/24 01:24:46 INFO SparkContext: Successfully stopped SparkContext
16/02/24 01:24:46 INFO ShutdownHookManager: Shutdown hook called
16/02/24 01:24:46 INFO ShutdownHookManager: Deleting directory C:\Users\Sivaram Konanki\AppData\Local\Temp\spark-861b5aef-6732-45e4-a4f4-6769370c555e
You can add a
Thread.sleep(1000000) // roughly 1000 seconds; make it longer if you need more time
at the bottom of your Spark job; this will allow you to inspect the web UI in IDEs like IntelliJ while the job is still running.
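In the posted SimpleApp that amounts to (the sleep length is arbitrary):
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
// Keep the driver, and therefore the UI at http://localhost:4040, alive for inspection
Thread.sleep(1000000)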
This is expected behavior. The Spark UI is maintained by the SparkContext, so it cannot be active after the application has finished and the context has been destroyed.
In standalone mode the information is preserved by the cluster web UI; on Mesos or YARN you can use the history server; but in local mode the only option I am aware of is to keep the application running.
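If you go the history-server route, the application also has to write event logs that the server can replay after the fact. A minimal sketch of the relevant settings in the posted SimpleApp; the log directory is illustrative, must already exist, and must match the history server's spark.history.fs.logDirectory:
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[*]")
  // Persist UI events so they survive the SparkContext being destroyed
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "file:///tmp/spark-events")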