Apache Spark - MLlib - K-Means input format - Scala

I want to run a K-Means job, but training the model fails and I get kicked out of Spark's Scala shell before I get my result metrics. I am not sure whether the input format is the problem or something else. I use Spark 1.0.0 and my input text file (400 MB) looks like this:
ID,Category,PruductSize,PurchaseAMount
86252,3711,15.4,4.18
86252,3504,28,1.25
86252,3703,10.75,8.85
86252,3703,10.5,5.55
86252,2201,64,2.79
12262064,7203,32,8.49
etc.
I am not sure whether I can use the first two columns, because the MLlib example file only uses floats. So I also tried a version with just the last two columns:
16 2.49
64 3.29
56 1
etc.
The error output in both cases is shown here:
scala> import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.clustering.KMeans
scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors
scala>
scala> // Load and parse the data
scala> val data = sc.textFile("data/outkmeanssm.txt")
14/08/07 16:15:37 INFO MemoryStore: ensureFreeSpace(35456) called with curMem=0, maxMem=318111744
14/08/07 16:15:37 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.6 KB, free 303.3 MB)
data: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:14
scala> val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[2] at map at <console>:16
scala>
scala> // Cluster the data into two classes using KMeans
scala> val numClusters = 2
numClusters: Int = 2
scala> val numIterations = 20
numIterations: Int = 20
scala> val clusters = KMeans.train(parsedData, numClusters, numIterations)
14/08/07 16:15:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/08/07 16:15:38 WARN LoadSnappy: Snappy native library not loaded
14/08/07 16:15:38 INFO FileInputFormat: Total input paths to process : 1
14/08/07 16:15:38 INFO SparkContext: Starting job: takeSample at KMeans.scala:260
14/08/07 16:15:38 INFO DAGScheduler: Got job 0 (takeSample at KMeans.scala:260) with 7 output partitions (allowLocal=false)
14/08/07 16:15:38 INFO DAGScheduler: Final stage: Stage 0(takeSample at KMeans.scala:260)
14/08/07 16:15:38 INFO DAGScheduler: Parents of final stage: List()
14/08/07 16:15:38 INFO DAGScheduler: Missing parents: List()
14/08/07 16:15:38 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[6] at map at KMeans.scala:123), which has no missing parents
14/08/07 16:15:39 INFO DAGScheduler: Submitting 7 missing tasks from Stage 0 (MappedRDD[6] at map at KMeans.scala:123)
14/08/07 16:15:39 INFO TaskSchedulerImpl: Adding task set 0.0 with 7 tasks
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:0 as 2221 bytes in 3 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:1 as 2221 bytes in 0 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:2 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:2 as 2221 bytes in 0 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:3 as TID 3 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:3 as 2221 bytes in 1 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:4 as TID 4 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:4 as 2221 bytes in 0 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:5 as 2221 bytes in 0 ms
14/08/07 16:15:39 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on executor localhost: localhost (PROCESS_LOCAL)
14/08/07 16:15:39 INFO TaskSetManager: Serialized task 0.0:6 as 2221 bytes in 0 ms
14/08/07 16:15:39 INFO Executor: Running task ID 4
14/08/07 16:15:39 INFO Executor: Running task ID 1
14/08/07 16:15:39 INFO Executor: Running task ID 5
14/08/07 16:15:39 INFO Executor: Running task ID 6
14/08/07 16:15:39 INFO Executor: Running task ID 0
14/08/07 16:15:39 INFO Executor: Running task ID 3
14/08/07 16:15:39 INFO Executor: Running task ID 2
14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
14/08/07 16:15:39 INFO BlockManager: Found block broadcast_0 locally
14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:0+33554432
14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:100663296+33554432
14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:201326592+24305610
14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:33554432+33554432
14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:67108864+33554432
14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:134217728+33554432
14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:167772160+33554432
14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_0 not found, computing it
14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:0+33554432
14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_2 not found, computing it
14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:67108864+33554432
14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_1 not found, computing it
14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:33554432+33554432
14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_4 not found, computing it
14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:134217728+33554432
14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_6 not found, computing it
14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:201326592+24305610
14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_3 not found, computing it
14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:100663296+33554432
14/08/07 16:15:39 INFO CacheManager: Partition rdd_3_5 not found, computing it
14/08/07 16:15:39 INFO HadoopRDD: Input split: file:/Users/admin/BD_Tools/spark-1.0.0/data/outkmeanssm.txt:167772160+33554432
14/08/07 16:16:53 ERROR Executor: Exception in task ID 5
java.lang.OutOfMemoryError: Java heap space
at scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:99)
at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:47)
at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:83)
at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:47)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.ZippedRDD.compute(ZippedRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
14/08/07 16:16:59 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-5,5,main]
java.lang.OutOfMemoryError: Java heap space
at scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:99)
at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:47)
at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:83)
at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:47)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:107)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
at org.apache.spark.rdd.ZippedRDD.compute(ZippedRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
14/08/07 16:17:00 WARN TaskSetManager: Lost TID 5 (task 0.0:5)
Chairs-MacBook-Pro:spark-1.0.0 admin$
Chairs-MacBook-Pro:spark-1.0.0 admin$ // Evaluate clustering by computing Within Set Sum of Squared Errors
-bash: //: is a directory
Chairs-MacBook-Pro:spark-1.0.0 admin$ val WSSSE = clusters.computeCost(parsedData)
-bash: syntax error near unexpected token `('
Chairs-MacBook-Pro:spark-1.0.0 admin$ println("Within Set Sum of Squared Errors = " + WSSSE)
What am I missing?

The "java.lang.OutOfMemoryError: Java heap space" error you are facing is triggered when you try to put more data into the heap than the JVM can accommodate.
This happens because applications running on the Java Virtual Machine are only allowed to use a limited amount of memory, and that limit is specified at application startup. To complicate things, JVM memory is split into several regions, one of which is the heap, and that is the region you have exhausted.
The first solution should be obvious: when you have run out of a particular resource, increase the amount available. In this case, when your application does not have enough Java heap space to run properly, fix it by altering your JVM launch configuration and adding (or increasing, if already present) the following flag:
-Xmx1024m
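For the Spark shell specifically, a minimal sketch of the same idea (assuming the memory options that spark-shell/spark-submit accept in Spark 1.x, and the imports already shown in the transcript above) is to restart the shell with more memory, e.g. ./bin/spark-shell --driver-memory 4g --executor-memory 4g, and to make sure only the numeric columns of the comma-separated file are parsed and cached:
// Illustrative parsing only: the header filter and the choice of columns are assumptions about your file.
val data = sc.textFile("data/outkmeanssm.txt")
val parsedData = data
  .filter(line => !line.startsWith("ID"))                    // skip the header row shown above, if present
  .map(_.split(','))
  .map(cols => Vectors.dense(cols.drop(2).map(_.toDouble)))  // keep only the last two numeric columns
  .cache()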

Related

What happens when I use a global map variable in Scala without broadcasting?

In Scala, what happens when I use a global map variable without broadcasting it?
E.g. if I get a variable using collect* (such as collectAsMap), it seems to be a global variable, and I can use it in all RDD.mapValues() functions without explicitly broadcasting it.
But I know Spark works in a distributed fashion, and it should not be able to process a variable stored in driver memory without broadcasting it. So what happens?
Code example (this code computes tf-idf over text, where df is stored in a Map):
//dfMap is a String->int Map in memory
//Array[(String, Int)] = Array((B,2), (A,3), (C,1))
val dfMap = dfrdd.collectAsMap;
//tfrdd is a rdd, and I can use dfMap in its mapValues function
//tfrdd: Array((doc1,Map(A -> 3.0)), (doc2,Map(A -> 2.0, B -> 1.0)))
val tfidfrdd = tfrdd.mapValues( e => e.map(x => x._1 -> x._2 * lineNum / dfMap.getOrElse(x._1, 1) ) );
tfidfrdd.saveAsTextFile("/somedir/result/");
The code works just fine. My question is: what happens there? Does the driver send dfMap to all workers just like broadcasting, or something else?
What's the difference if I write the broadcast explicitly, like this:
val dfMap = sc.broadcast(dfrdd.collectAsMap)
val tfidfrdd = tfrdd.mapValues( e => e.map(x => x._1 -> x._2 * lineNum / dfMap.value.getOrElse(x._1, 1) ) )
I've checked more resources, aggregated others' answers, and put them in order. The difference between using an external variable DIRECTLY (my so-called "global variable") and BROADCASTING a variable with sc.broadcast() is as follows:
1) When using an external variable directly, Spark sends a copy of the serialized variable together with each TASK, whereas with sc.broadcast the variable is sent once per EXECUTOR. The number of tasks is normally around ten times the number of executors.
So when the variable (say a map) is large enough (more than 20K), the former approach can cost a lot of time in network transfer and cause frequent GC, which slows Spark down. Hence a large variable (>20K) should be broadcast explicitly.
2) When using an external variable directly, the variable is not persisted; it ends with the task and thus cannot be reused. With sc.broadcast(), the variable is automatically persisted in the executors' memory and lasts until you explicitly unpersist it, so a broadcast variable is available across tasks and stages.
So if the variable is expected to be used multiple times, sc.broadcast() is recommended.
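A minimal sketch of that pattern, reusing the question's dfrdd, tfrdd and lineNum names (so the details are assumptions about that code), would be to broadcast the map once, read it through .value inside the closure, and release it when it is no longer needed:
// Broadcast the small df map once; each executor fetches one copy and reuses it across tasks.
val dfBr = sc.broadcast(dfrdd.collectAsMap)
val tfidfrdd = tfrdd.mapValues(e => e.map { case (term, tf) => term -> tf * lineNum / dfBr.value.getOrElse(term, 1) })
tfidfrdd.saveAsTextFile("/somedir/result/")
dfBr.unpersist()  // drop the cached copies on the executors once the result has been written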
There is no difference between a global map variable and a broadcast variable. If we use a global variable in a map function of an RDD, it will be broadcast to all nodes. For example:
scala> val list = List(1,2,3)
list: List[Int] = List(1, 2, 3)
scala> val rdd = sc.parallelize(List(1,2,3,4))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24
scala> rdd.filter(elem => list.contains(elem)).collect
17/03/16 10:21:53 INFO SparkContext: Starting job: collect at <console>:29
17/03/16 10:21:53 INFO DAGScheduler: Got job 3 (collect at <console>:29) with 4 output partitions
17/03/16 10:21:53 INFO DAGScheduler: Final stage: ResultStage 3 (collect at <console>:29)
17/03/16 10:21:53 INFO DAGScheduler: Parents of final stage: List()
17/03/16 10:21:53 INFO DAGScheduler: Missing parents: List()
17/03/16 10:21:53 DEBUG DAGScheduler: submitStage(ResultStage 3)
17/03/16 10:21:53 DEBUG DAGScheduler: missing: List()
17/03/16 10:21:53 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[5] at filter at <console>:29), which has no missing parents
17/03/16 10:21:53 DEBUG DAGScheduler: submitMissingTasks(ResultStage 3)
17/03/16 10:21:53 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 5.0 KB, free 366.3 MB)
17/03/16 10:21:53 DEBUG BlockManager: Put block broadcast_4 locally took 1 ms
17/03/16 10:21:53 DEBUG BlockManager: Putting block broadcast_4 without replication took 1 ms
17/03/16 10:21:53 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 2.5 KB, free 366.3 MB)
17/03/16 10:21:53 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.2.123:37645 (size: 2.5 KB, free: 366.3 MB)
17/03/16 10:21:53 DEBUG BlockManagerMaster: Updated info of block broadcast_4_piece0
17/03/16 10:21:53 DEBUG BlockManager: Told master about block broadcast_4_piece0
17/03/16 10:21:53 DEBUG BlockManager: Put block broadcast_4_piece0 locally took 2 ms
17/03/16 10:21:53 DEBUG ContextCleaner: Got cleaning task CleanBroadcast(1)
17/03/16 10:21:53 DEBUG BlockManager: Putting block broadcast_4_piece0 without replication took 2 ms
17/03/16 10:21:53 DEBUG ContextCleaner: Cleaning broadcast 1
17/03/16 10:21:53 DEBUG TorrentBroadcast: Unpersisting TorrentBroadcast 1
17/03/16 10:21:53 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:996
17/03/16 10:21:53 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 3 (MapPartitionsRDD[5] at filter at <console>:29)
17/03/16 10:21:53 DEBUG DAGScheduler: New pending partitions: Set(0, 1, 2, 3)
17/03/16 10:21:53 INFO TaskSchedulerImpl: Adding task set 3.0 with 4 tasks
17/03/16 10:21:53 DEBUG TaskSetManager: Epoch for TaskSet 3.0: 0
17/03/16 10:21:53 DEBUG TaskSetManager: Valid locality levels for TaskSet 3.0: NO_PREF, ANY
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 0
17/03/16 10:21:53 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 12, localhost, executor driver, partition 0, PROCESS_LOCAL, 5886 bytes)
17/03/16 10:21:53 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 13, localhost, executor driver, partition 1, PROCESS_LOCAL, 5886 bytes)
17/03/16 10:21:53 INFO TaskSetManager: Starting task 2.0 in stage 3.0 (TID 14, localhost, executor driver, partition 2, PROCESS_LOCAL, 5886 bytes)
17/03/16 10:21:53 INFO TaskSetManager: Starting task 3.0 in stage 3.0 (TID 15, localhost, executor driver, partition 3, PROCESS_LOCAL, 5886 bytes)
17/03/16 10:21:53 INFO Executor: Running task 0.0 in stage 3.0 (TID 12)
17/03/16 10:21:53 DEBUG Executor: Task 12's epoch is 0
17/03/16 10:21:53 DEBUG BlockManager: Getting local block broadcast_4
17/03/16 10:21:53 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:21:53 INFO Executor: Running task 2.0 in stage 3.0 (TID 14)
17/03/16 10:21:53 INFO Executor: Running task 1.0 in stage 3.0 (TID 13)
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: removing broadcast 1
17/03/16 10:21:53 DEBUG BlockManager: Removing broadcast 1
17/03/16 10:21:53 DEBUG BlockManager: Removing block broadcast_1
17/03/16 10:21:53 INFO Executor: Running task 3.0 in stage 3.0 (TID 15)
17/03/16 10:21:53 DEBUG Executor: Task 13's epoch is 0
17/03/16 10:21:53 DEBUG MemoryStore: Block broadcast_1 of size 5112 dropped from memory (free 384072627)
17/03/16 10:21:53 DEBUG BlockManager: Removing block broadcast_1_piece0
17/03/16 10:21:53 DEBUG MemoryStore: Block broadcast_1_piece0 of size 2535 dropped from memory (free 384075162)
17/03/16 10:21:53 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.2.123:37645 in memory (size: 2.5 KB, free: 366.3 MB)
17/03/16 10:21:53 DEBUG BlockManagerMaster: Updated info of block broadcast_1_piece0
17/03/16 10:21:53 DEBUG BlockManager: Told master about block broadcast_1_piece0
17/03/16 10:21:53 DEBUG BlockManager: Getting local block broadcast_4
17/03/16 10:21:53 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:21:53 DEBUG Executor: Task 14's epoch is 0
17/03/16 10:21:53 DEBUG BlockManager: Getting local block broadcast_4
17/03/16 10:21:53 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:21:53 DEBUG Executor: Task 15's epoch is 0
17/03/16 10:21:53 DEBUG BlockManager: Getting local block broadcast_4
17/03/16 10:21:53 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: Done removing broadcast 1, response is 0
17/03/16 10:21:53 DEBUG ContextCleaner: Cleaned broadcast 1
17/03/16 10:21:53 DEBUG ContextCleaner: Got cleaning task CleanBroadcast(3)
17/03/16 10:21:53 DEBUG ContextCleaner: Cleaning broadcast 3
17/03/16 10:21:53 DEBUG TorrentBroadcast: Unpersisting TorrentBroadcast 3
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: removing broadcast 3
17/03/16 10:21:53 DEBUG BlockManager: Removing broadcast 3
17/03/16 10:21:53 DEBUG BlockManager: Removing block broadcast_3_piece0
17/03/16 10:21:53 DEBUG MemoryStore: Block broadcast_3_piece0 of size 3309 dropped from memory (free 384078471)
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: Sent response: 0 to 192.168.2.123:40909
17/03/16 10:21:53 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 192.168.2.123:37645 in memory (size: 3.2 KB, free: 366.3 MB)
17/03/16 10:21:53 DEBUG BlockManagerMaster: Updated info of block broadcast_3_piece0
17/03/16 10:21:53 DEBUG BlockManager: Told master about block broadcast_3_piece0
17/03/16 10:21:53 DEBUG BlockManager: Removing block broadcast_3
17/03/16 10:21:53 DEBUG MemoryStore: Block broadcast_3 of size 6904 dropped from memory (free 384085375)
17/03/16 10:21:53 INFO Executor: Finished task 1.0 in stage 3.0 (TID 13). 912 bytes result sent to driver
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: Done removing broadcast 3, response is 0
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: Sent response: 0 to 192.168.2.123:40909
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 3
17/03/16 10:21:53 DEBUG TaskSetManager: No tasks for locality level NO_PREF, so moving to locality level ANY
17/03/16 10:21:53 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 13) in 36 ms on localhost (executor driver) (1/4)
17/03/16 10:21:53 INFO Executor: Finished task 2.0 in stage 3.0 (TID 14). 912 bytes result sent to driver
17/03/16 10:21:53 DEBUG ContextCleaner: Cleaned broadcast 3
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 2
17/03/16 10:21:53 INFO Executor: Finished task 0.0 in stage 3.0 (TID 12). 912 bytes result sent to driver
17/03/16 10:21:53 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 14) in 36 ms on localhost (executor driver) (2/4)
17/03/16 10:21:53 INFO Executor: Finished task 3.0 in stage 3.0 (TID 15). 908 bytes result sent to driver
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 1
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 0
17/03/16 10:21:53 INFO TaskSetManager: Finished task 3.0 in stage 3.0 (TID 15) in 36 ms on localhost (executor driver) (3/4)
17/03/16 10:21:53 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 12) in 45 ms on localhost (executor driver) (4/4)
17/03/16 10:21:53 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
17/03/16 10:21:53 INFO DAGScheduler: ResultStage 3 (collect at <console>:29) finished in 0.045 s
17/03/16 10:21:53 DEBUG DAGScheduler: After removal of stage 3, remaining stages = 0
17/03/16 10:21:53 INFO DAGScheduler: Job 3 finished: collect at <console>:29, took 0.097564 s
res4: Array[Int] = Array(1, 2, 3)
In the above log we can clearly see that the global variable list is broadcast. The same happens when we explicitly broadcast the list:
scala> val br = sc.broadcast(list)
17/03/16 10:26:40 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 160.0 B, free 366.3 MB)
17/03/16 10:26:40 DEBUG BlockManager: Put block broadcast_5 locally took 1 ms
17/03/16 10:26:40 DEBUG BlockManager: Putting block broadcast_5 without replication took 1 ms
17/03/16 10:26:40 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 227.0 B, free 366.3 MB)
17/03/16 10:26:40 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.2.123:37645 (size: 227.0 B, free: 366.3 MB)
17/03/16 10:26:40 DEBUG BlockManagerMaster: Updated info of block broadcast_5_piece0
17/03/16 10:26:40 DEBUG BlockManager: Told master about block broadcast_5_piece0
17/03/16 10:26:40 DEBUG BlockManager: Put block broadcast_5_piece0 locally took 1 ms
17/03/16 10:26:40 DEBUG BlockManager: Putting block broadcast_5_piece0 without replication took 1 ms
17/03/16 10:26:40 INFO SparkContext: Created broadcast 5 from broadcast at <console>:26
br: org.apache.spark.broadcast.Broadcast[List[Int]] = Broadcast(5)
scala> rdd.filter(elem => br.value.contains(elem)).collect
17/03/16 10:27:50 INFO SparkContext: Starting job: collect at <console>:31
17/03/16 10:27:50 INFO DAGScheduler: Got job 0 (collect at <console>:31) with 4 output partitions
17/03/16 10:27:50 INFO DAGScheduler: Final stage: ResultStage 0 (collect at <console>:31)
17/03/16 10:27:50 INFO DAGScheduler: Parents of final stage: List()
17/03/16 10:27:50 INFO DAGScheduler: Missing parents: List()
17/03/16 10:27:50 DEBUG DAGScheduler: submitStage(ResultStage 0)
17/03/16 10:27:50 DEBUG DAGScheduler: missing: List()
17/03/16 10:27:50 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at filter at <console>:31), which has no missing parents
17/03/16 10:27:50 DEBUG DAGScheduler: submitMissingTasks(ResultStage 0)
17/03/16 10:27:50 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 6.7 KB, free 366.3 MB)
17/03/16 10:27:50 DEBUG BlockManager: Put block broadcast_1 locally took 6 ms
17/03/16 10:27:50 DEBUG BlockManager: Putting block broadcast_1 without replication took 6 ms
17/03/16 10:27:50 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.2 KB, free 366.3 MB)
17/03/16 10:27:50 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.2.123:37303 (size: 3.2 KB, free: 366.3 MB)
17/03/16 10:27:50 DEBUG BlockManagerMaster: Updated info of block broadcast_1_piece0
17/03/16 10:27:50 DEBUG BlockManager: Told master about block broadcast_1_piece0
17/03/16 10:27:50 DEBUG BlockManager: Put block broadcast_1_piece0 locally took 2 ms
17/03/16 10:27:50 DEBUG BlockManager: Putting block broadcast_1_piece0 without replication took 2 ms
17/03/16 10:27:50 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996
17/03/16 10:27:50 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at filter at <console>:31)
17/03/16 10:27:50 DEBUG DAGScheduler: New pending partitions: Set(0, 1, 2, 3)
17/03/16 10:27:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
17/03/16 10:27:50 DEBUG TaskSetManager: Epoch for TaskSet 0.0: 0
17/03/16 10:27:50 DEBUG TaskSetManager: Valid locality levels for TaskSet 0.0: NO_PREF, ANY
17/03/16 10:27:50 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 0
17/03/16 10:27:50 DEBUG TaskSetManager: Valid locality levels for TaskSet 0.0: NO_PREF, ANY
17/03/16 10:27:51 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 5885 bytes)
17/03/16 10:27:51 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 5885 bytes)
17/03/16 10:27:51 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 5885 bytes)
17/03/16 10:27:51 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 5885 bytes)
17/03/16 10:27:51 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
17/03/16 10:27:51 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/03/16 10:27:51 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
17/03/16 10:27:51 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
17/03/16 10:27:51 DEBUG Executor: Task 0's epoch is 0
17/03/16 10:27:51 DEBUG Executor: Task 2's epoch is 0
17/03/16 10:27:51 DEBUG Executor: Task 3's epoch is 0
17/03/16 10:27:51 DEBUG Executor: Task 1's epoch is 0
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_1
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_1
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_1
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_1
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_0
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_0
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_0
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_0
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 908 bytes result sent to driver
17/03/16 10:27:51 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 999 bytes result sent to driver
17/03/16 10:27:51 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 912 bytes result sent to driver
17/03/16 10:27:51 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 912 bytes result sent to driver
17/03/16 10:27:51 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 3
17/03/16 10:27:51 DEBUG TaskSetManager: No tasks for locality level NO_PREF, so moving to locality level ANY
17/03/16 10:27:51 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 2
17/03/16 10:27:51 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 1
17/03/16 10:27:51 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 0
17/03/16 10:27:51 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 165 ms on localhost (executor driver) (1/4)
17/03/16 10:27:51 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 180 ms on localhost (executor driver) (2/4)
17/03/16 10:27:51 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 249 ms on localhost (executor driver) (3/4)
17/03/16 10:27:51 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 186 ms on localhost (executor driver) (4/4)
17/03/16 10:27:51 INFO DAGScheduler: ResultStage 0 (collect at <console>:31) finished in 0.264 s
17/03/16 10:27:51 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/03/16 10:27:51 DEBUG DAGScheduler: After removal of stage 0, remaining stages = 0
17/03/16 10:27:51 INFO DAGScheduler: Job 0 finished: collect at <console>:31, took 0.381615 s
res1: Array[Int] = Array(1, 2, 3)
The same is the case with the broadcast variable.
When you broadcast, the data is cached on all the nodes, so when you perform an action (collect, saveAsTextFile, first) the broadcast values are already available to all the worker nodes.
But if you do not broadcast the value, each task has to fetch the data from the driver again, because it is shipped as part of the task closure.
First off, it is a Spark thing, not a Scala one.
The difference is that plain closure values are shipped every time they are used, whereas explicit broadcast variables are cached:
"Broadcast variables are created from a variable v by calling
SparkContext.broadcast(v). The broadcast variable is a wrapper around
v, and its value can be accessed by calling the value method ... After the broadcast variable is created, it should
be used instead of the value v in any functions run on the cluster so
that v is not shipped to the nodes more than once"

Scala UDF runs fine in the Spark shell but gives an NPE when used in Spark SQL

I have created a Spark UDF. When I run it in spark-shell it works perfectly fine, but when I register it and use it in my Spark SQL query it throws a NullPointerException.
scala> test_proc("1605","(#supp In (-1,118)")
16/03/07 10:35:04 INFO TaskSetManager: Finished task 0.0 in stage 21.0 (TID 220) in 62 ms on cdts1hdpdn01d.rxcorp.com (1/1)
16/03/07 10:35:04 INFO YarnScheduler: Removed TaskSet 21.0, whose tasks have all completed, from pool
16/03/07 10:35:04 INFO DAGScheduler: ResultStage 21 (first at :45) finished in 0.062 s
16/03/07 10:35:04 INFO DAGScheduler: Job 16 finished: first at :45, took 2.406408 s
res14: Int = 1
scala>
But when I register it and use it in my Spark SQL query, it gives an NPE:
scala> sqlContext.udf.register("store_proc", test_proc _)
scala> hiveContext.sql("select store_proc('1605' , '(#supp In (-1,118)')").first.getInt(0)
16/03/07 10:37:58 INFO ParseDriver: Parsing command: select store_proc('1605' , '(#supp In (-1,118)')
16/03/07 10:37:58 INFO ParseDriver: Parse Completed
16/03/07 10:37:58 INFO SparkContext: Starting job: first at :24
16/03/07 10:37:58 INFO DAGScheduler: Got job 17 (first at :24) with 1 output partitions
16/03/07 10:37:58 INFO DAGScheduler: Final stage: ResultStage 22(first at :24)
16/03/07 10:37:58 INFO DAGScheduler: Parents of final stage: List()
16/03/07 10:37:58 INFO DAGScheduler: Missing parents: List()
16/03/07 10:37:58 INFO DAGScheduler: Submitting ResultStage 22 (MapPartitionsRDD[86] at first at :24), which has no missing parents
16/03/07 10:37:58 INFO MemoryStore: ensureFreeSpace(10520) called with curMem=1472899, maxMem=2222739947
16/03/07 10:37:58 INFO MemoryStore: Block broadcast_30 stored as values in memory (estimated size 10.3 KB, free 2.1 GB)
16/03/07 10:37:58 INFO MemoryStore: ensureFreeSpace(4774) called with curMem=1483419, maxMem=2222739947
16/03/07 10:37:58 INFO MemoryStore: Block broadcast_30_piece0 stored as bytes in memory (estimated size 4.7 KB, free 2.1 GB)
16/03/07 10:37:58 INFO BlockManagerInfo: Added broadcast_30_piece0 in memory on 162.44.214.87:47564 (size: 4.7 KB, free: 2.1 GB)
16/03/07 10:37:58 INFO SparkContext: Created broadcast 30 from broadcast at DAGScheduler.scala:861
16/03/07 10:37:58 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 22 (MapPartitionsRDD[86] at first at :24)
16/03/07 10:37:58 INFO YarnScheduler: Adding task set 22.0 with 1 tasks
16/03/07 10:37:58 INFO TaskSetManager: Starting task 0.0 in stage 22.0 (TID 221, cdts1hdpdn02d.rxcorp.com, partition 0,PROCESS_LOCAL, 2155 bytes)
16/03/07 10:37:58 INFO BlockManagerInfo: Added broadcast_30_piece0 in memory on cdts1hdpdn02d.rxcorp.com:33678 (size: 4.7 KB, free: 6.7 GB)
16/03/07 10:37:58 WARN TaskSetManager: Lost task 0.0 in stage 22.0 (TID 221, cdts1hdpdn02d.rxcorp.com): java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveContext.parseSql(HiveContext.scala:291)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
at $line20.$read$iwC$iwC$iwC$iwC$iwC$iwC$iwC$iwC.test_proc(:41)
This is a sample of my test_proc:
def test_proc(x: String, y: String): Int = {
  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  val z: Int = hiveContext.sql("select 7").first.getInt(0)
  return z
}
Based on the output from a standalone call, it looks like test_proc is executing some kind of Spark action, and this cannot work inside a UDF because Spark doesn't support nested operations on distributed data structures. If test_proc is using a SQLContext, this will result in an NPE, since Spark contexts exist only on the driver.
If that's the case, you'll have to restructure your code to achieve the desired effect, either using local (most likely broadcast) variables or joins.
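A minimal sketch of that restructuring, under the assumption that whatever test_proc looks up can be computed once on the driver (the dummy query and variable names below are hypothetical):
// Run the inner query once on the driver, outside any UDF.
val lookup = hiveContext.sql("select 7").first.getInt(0)
val lookupBr = sc.broadcast(lookup)

// The registered UDF closes over plain/broadcast values only and never touches a SQLContext.
hiveContext.udf.register("store_proc", (x: String, y: String) => lookupBr.value)
hiveContext.sql("select store_proc('1605' , '(#supp In (-1,118)')").first.getInt(0)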

SparkUI is stopping after execution of code in IntelliJ IDEA

I am trying to run this simple Spark job in Scala using IntelliJ IDEA. However, the Spark UI stops completely once the object finishes executing. Is there something I am missing, or am I listening at the wrong location? Scala version 2.10.4 and Spark 1.6.0.
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "C:/spark-1.6.0-bin-hadoop2.6/spark-1.6.0-bin-hadoop2.6/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
16/02/24 01:24:39 INFO SparkContext: Running Spark version 1.6.0
16/02/24 01:24:40 INFO SecurityManager: Changing view acls to: Sivaram Konanki
16/02/24 01:24:40 INFO SecurityManager: Changing modify acls to: Sivaram Konanki
16/02/24 01:24:40 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Sivaram Konanki); users with modify permissions: Set(Sivaram Konanki)
16/02/24 01:24:41 INFO Utils: Successfully started service 'sparkDriver' on port 54881.
16/02/24 01:24:41 INFO Slf4jLogger: Slf4jLogger started
16/02/24 01:24:42 INFO Remoting: Starting remoting
16/02/24 01:24:42 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.1.15:54894]
16/02/24 01:24:42 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 54894.
16/02/24 01:24:42 INFO SparkEnv: Registering MapOutputTracker
16/02/24 01:24:42 INFO SparkEnv: Registering BlockManagerMaster
16/02/24 01:24:42 INFO DiskBlockManager: Created local directory at C:\Users\Sivaram Konanki\AppData\Local\Temp\blockmgr-dad99e77-f3a6-4a1d-88d8-3b030be0bd0a
16/02/24 01:24:42 INFO MemoryStore: MemoryStore started with capacity 2.4 GB
16/02/24 01:24:42 INFO SparkEnv: Registering OutputCommitCoordinator
16/02/24 01:24:42 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/02/24 01:24:42 INFO SparkUI: Started SparkUI at http://192.168.1.15:4040
16/02/24 01:24:42 INFO Executor: Starting executor ID driver on host localhost
16/02/24 01:24:43 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54913.
16/02/24 01:24:43 INFO NettyBlockTransferService: Server created on 54913
16/02/24 01:24:43 INFO BlockManagerMaster: Trying to register BlockManager
16/02/24 01:24:43 INFO BlockManagerMasterEndpoint: Registering block manager localhost:54913 with 2.4 GB RAM, BlockManagerId(driver, localhost, 54913)
16/02/24 01:24:43 INFO BlockManagerMaster: Registered BlockManager
16/02/24 01:24:44 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 127.4 KB)
16/02/24 01:24:44 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 141.3 KB)
16/02/24 01:24:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:54913 (size: 13.9 KB, free: 2.4 GB)
16/02/24 01:24:44 INFO SparkContext: Created broadcast 0 from textFile at SimpleApp.scala:11
16/02/24 01:24:45 WARN : Your hostname, OSG-E5450-42 resolves to a loopback/non-reachable address: fe80:0:0:0:d9ff:4f93:5643:703d%wlan3, but we couldn't find any external IP address!
16/02/24 01:24:46 INFO FileInputFormat: Total input paths to process : 1
16/02/24 01:24:46 INFO SparkContext: Starting job: count at SimpleApp.scala:12
16/02/24 01:24:46 INFO DAGScheduler: Got job 0 (count at SimpleApp.scala:12) with 2 output partitions
16/02/24 01:24:46 INFO DAGScheduler: Final stage: ResultStage 0 (count at SimpleApp.scala:12)
16/02/24 01:24:46 INFO DAGScheduler: Parents of final stage: List()
16/02/24 01:24:46 INFO DAGScheduler: Missing parents: List()
16/02/24 01:24:46 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12), which has no missing parents
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 144.5 KB)
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1886.0 B, free 146.3 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:54913 (size: 1886.0 B, free: 2.4 GB)
16/02/24 01:24:46 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/02/24 01:24:46 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12)
16/02/24 01:24:46 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/02/24 01:24:46 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
16/02/24 01:24:46 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/02/24 01:24:46 INFO CacheManager: Partition rdd_1_1 not found, computing it
16/02/24 01:24:46 INFO CacheManager: Partition rdd_1_0 not found, computing it
16/02/24 01:24:46 INFO HadoopRDD: Input split: file:/C:/spark-1.6.0-bin-hadoop2.6/spark-1.6.0-bin-hadoop2.6/README.md:1679+1680
16/02/24 01:24:46 INFO HadoopRDD: Input split: file:/C:/spark-1.6.0-bin-hadoop2.6/spark-1.6.0-bin-hadoop2.6/README.md:0+1679
16/02/24 01:24:46 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/02/24 01:24:46 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/02/24 01:24:46 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/02/24 01:24:46 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/02/24 01:24:46 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/02/24 01:24:46 INFO MemoryStore: Block rdd_1_1 stored as values in memory (estimated size 4.7 KB, free 151.0 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added rdd_1_1 in memory on localhost:54913 (size: 4.7 KB, free: 2.4 GB)
16/02/24 01:24:46 INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 5.4 KB, free 156.5 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added rdd_1_0 in memory on localhost:54913 (size: 5.4 KB, free: 2.4 GB)
16/02/24 01:24:46 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2662 bytes result sent to driver
16/02/24 01:24:46 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2662 bytes result sent to driver
16/02/24 01:24:46 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 170 ms on localhost (1/2)
16/02/24 01:24:46 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 143 ms on localhost (2/2)
16/02/24 01:24:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/02/24 01:24:46 INFO DAGScheduler: ResultStage 0 (count at SimpleApp.scala:12) finished in 0.187 s
16/02/24 01:24:46 INFO DAGScheduler: Job 0 finished: count at SimpleApp.scala:12, took 0.303861 s
16/02/24 01:24:46 INFO SparkContext: Starting job: count at SimpleApp.scala:13
16/02/24 01:24:46 INFO DAGScheduler: Got job 1 (count at SimpleApp.scala:13) with 2 output partitions
16/02/24 01:24:46 INFO DAGScheduler: Final stage: ResultStage 1 (count at SimpleApp.scala:13)
16/02/24 01:24:46 INFO DAGScheduler: Parents of final stage: List()
16/02/24 01:24:46 INFO DAGScheduler: Missing parents: List()
16/02/24 01:24:46 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13), which has no missing parents
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.1 KB, free 159.6 KB)
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1888.0 B, free 161.5 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:54913 (size: 1888.0 B, free: 2.4 GB)
16/02/24 01:24:46 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/02/24 01:24:46 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13)
16/02/24 01:24:46 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
16/02/24 01:24:46 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, partition 1,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
16/02/24 01:24:46 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
16/02/24 01:24:46 INFO BlockManager: Found block rdd_1_0 locally
16/02/24 01:24:46 INFO BlockManager: Found block rdd_1_1 locally
16/02/24 01:24:46 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 2082 bytes result sent to driver
16/02/24 01:24:46 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 2082 bytes result sent to driver
16/02/24 01:24:46 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 34 ms on localhost (1/2)
16/02/24 01:24:46 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 37 ms on localhost (2/2)
Lines with a: 58, Lines with b: 26
16/02/24 01:24:46 INFO DAGScheduler: ResultStage 1 (count at SimpleApp.scala:13) finished in 0.040 s
16/02/24 01:24:46 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/02/24 01:24:46 INFO DAGScheduler: Job 1 finished: count at SimpleApp.scala:13, took 0.068350 s
16/02/24 01:24:46 INFO SparkContext: Invoking stop() from shutdown hook
16/02/24 01:24:46 INFO SparkUI: Stopped Spark web UI at http://192.168.1.15:4040
16/02/24 01:24:46 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/02/24 01:24:46 INFO MemoryStore: MemoryStore cleared
16/02/24 01:24:46 INFO BlockManager: BlockManager stopped
16/02/24 01:24:46 INFO BlockManagerMaster: BlockManagerMaster stopped
16/02/24 01:24:46 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/02/24 01:24:46 INFO SparkContext: Successfully stopped SparkContext
16/02/24 01:24:46 INFO ShutdownHookManager: Shutdown hook called
16/02/24 01:24:46 INFO ShutdownHookManager: Deleting directory C:\Users\Sivaram Konanki\AppData\Local\Temp\spark-861b5aef-6732-45e4-a4f4-6769370c555e
You can add a
Thread.sleep(1000000) // about 1000 seconds, or more
at the bottom of your Spark job; this keeps the application alive so you can inspect the web UI from IDEs like IntelliJ while the job is running.
This is expected behavior. The Spark UI is maintained by the SparkContext, so it cannot stay active after the application has finished and the context has been destroyed.
In standalone mode the information is preserved by the cluster web UI; on Mesos or YARN you can use the history server; but in local mode the only option I am aware of is to keep the application running.
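For the history-server route mentioned above, a minimal sketch of enabling event logging so a finished run can be inspected afterwards (the directory is a placeholder; the config keys are the standard spark.eventLog.* settings):
// In the application, before creating the SparkContext:
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[*]")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "file:///tmp/spark-events")  // placeholder path; the directory must exist
val sc = new SparkContext(conf)
// Afterwards, start a history server (sbin/start-history-server.sh) with
// spark.history.fs.logDirectory pointing at the same directory and open its UI (port 18080 by default).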

Subclassing SparkException doesn't allow for message passing between Master and Workers

I need to build an application where a master node distributes a large dataset to a number of worker nodes for parallel processing. I'm running this application on a single machine and JVM, therefore I've called setMaster("local[4]") on my SparkConf object. I'm using Spark 1.5.2 and Scala 2.10.5 through IntelliJ.
If a certain condition occurs in the portions of the dataset handled by the executors, I need the master node to be notified and perform some action. In addition to that, I need the other executors to die. To that end, I looked around the Scala Spark API and realized that SparkException allows me to do the first portion of what I'm looking for, by propagating the exception (which is Serializable, by the way) to the driver. I have verified this experimentally, as follows:
def main(args: Array[String]) = {
  val conf = new SparkConf().setAppName("Spark Exceptions").setMaster("local[4]")
  val sc = new SparkContext(conf)
  val l = Range(1, 5000)
  val parl = sc.parallelize(l, 8)
  val mappedRDD = parl.map(func)
  try {
    val res = mappedRDD.collect()
    println(res)
  } catch {
    case s: SparkException => println("A worker threw an exception.")
    case t: Throwable => throw t
  }
}

def func(i: Int) = {
  if (i == 1 || i == 4000)
    throw new SparkException("Bad number detected.")
  else
    Math.pow(i, 2)
}
If you look closely at the example above, you will note that since the original Range contains both 1 and 4000, two failures are guaranteed in the worker nodes. Indeed, I see two executors failing in stderr, while my stdout is populated with:
A worker threw an exception.
Process finished with exit code 0
Unfortunately, the SparkException thrown does not kill the other executors, since, as mentioned before, I can see both executors failing in stderr, while two other executors complete their tasks successfully. So my first question is: is there any way I can immediately kill the other executors once this exception is caught by the driver program?
My second question is a little more subtle: I'd like some information to be exchanged from the executors to the master node about what piece of data caused the error. Sure, I could write to and read from a file, particularly since I'm on the same filesystem, but I'd like a faster and more elegant solution. So I thought I'd subclass SparkException in order to add a field that describes what piece of data caused the error:
import org.apache.spark.SparkException

// Custom exception that carries the offending piece of data back to the driver
class WorkerViolation(msg: String, data: Any) extends SparkException(msg) {
  override def toString = "A worker violation occurred: " + msg
  def getData = data
  def this(dat: Any) = this("Error at worker.", dat)
}
The goal is to be able to use the getData accessor to retrieve some information. To that end, I tried modifying the program above, as follows:
...
  catch {
    case w: WorkerViolation => println("A worker threw an exception, with data: " + w.getData)
    case t: Throwable => throw t
  }
}

def func(i: Int) = {
  if (i == 1 || i == 4000)
    throw new WorkerViolation("Bad number detected.", i)
  else
    Math.pow(i, 2)
}
Note that this time I'm both throwing and catching WorkerViolations. Unfortunately, this particular exception seems to kill the driver node as well. The full trace is of course gigantic, but I copy it here for completeness:
15/12/07 18:31:17 WARN util.Utils: Your hostname, debian resolves to a loopback address: 127.0.1.1; using 192.168.2.222 instead (on interface eth0)
15/12/07 18:31:17 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/12/07 18:31:17 INFO spark.SecurityManager: Changing view acls to: jason
15/12/07 18:31:17 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jason)
15/12/07 18:31:17 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/12/07 18:31:17 INFO Remoting: Starting remoting
15/12/07 18:31:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@192.168.2.222:33572]
15/12/07 18:31:17 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@192.168.2.222:33572]
15/12/07 18:31:17 INFO spark.SparkEnv: Registering MapOutputTracker
15/12/07 18:31:17 INFO spark.SparkEnv: Registering BlockManagerMaster
15/12/07 18:31:17 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20151207183117-4300
15/12/07 18:31:17 INFO storage.MemoryStore: MemoryStore started with capacity 2.1 GB.
15/12/07 18:31:17 INFO network.ConnectionManager: Bound socket to port 34704 with id = ConnectionManagerId(192.168.2.222,34704)
15/12/07 18:31:17 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/12/07 18:31:17 INFO storage.BlockManagerInfo: Registering block manager 192.168.2.222:34704 with 2.1 GB RAM
15/12/07 18:31:17 INFO storage.BlockManagerMaster: Registered BlockManager
15/12/07 18:31:17 INFO spark.HttpServer: Starting HTTP Server
15/12/07 18:31:17 INFO server.Server: jetty-8.1.14.v20131031
15/12/07 18:31:17 INFO server.AbstractConnector: Started SocketConnector#0.0.0.0:42426
15/12/07 18:31:17 INFO broadcast.HttpBroadcast: Broadcast server started at http://192.168.2.222:42426
15/12/07 18:31:17 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-0ae72587-14c5-4bfe-a151-2bcafc889ee8
15/12/07 18:31:17 INFO spark.HttpServer: Starting HTTP Server
15/12/07 18:31:17 INFO server.Server: jetty-8.1.14.v20131031
15/12/07 18:31:17 INFO server.AbstractConnector: Started SocketConnector#0.0.0.0:55556
15/12/07 18:31:17 INFO server.Server: jetty-8.1.14.v20131031
15/12/07 18:31:17 INFO server.AbstractConnector: Started SelectChannelConnector#0.0.0.0:4040
15/12/07 18:31:17 INFO ui.SparkUI: Started SparkUI at http://192.168.2.222:4040
15/12/07 18:31:18 INFO spark.SparkContext: Starting job: collect at SparkExceptions.scala:16
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Got job 0 (collect at SparkExceptions.scala:16) with 8 output partitions (allowLocal=false)
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Final stage: Stage 0(collect at SparkExceptions.scala:16)
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Missing parents: List()
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at SparkExceptions.scala:14), which has no missing parents
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Submitting 8 missing tasks from Stage 0 (MappedRDD[1] at map at SparkExceptions.scala:14)
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 8 tasks
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 1350 bytes in 4 ms
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as 1350 bytes in 0 ms
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:2 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:2 as 1350 bytes in 0 ms
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:3 as TID 3 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:3 as 1350 bytes in 1 ms
15/12/07 18:31:18 INFO executor.Executor: Running task ID 3
15/12/07 18:31:18 INFO executor.Executor: Running task ID 1
15/12/07 18:31:18 INFO executor.Executor: Running task ID 0
15/12/07 18:31:18 INFO executor.Executor: Running task ID 2
15/12/07 18:31:18 ERROR executor.Executor: Exception in task ID 0
A worker violation occurred: Bad number detected.
at SparkExceptions$.func(SparkExceptions.scala:26)
at SparkExceptions$$anonfun$1.apply$mcDI$sp(SparkExceptions.scala:14)
at SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
at SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/12/07 18:31:18 INFO executor.Executor: Serialized size of result for 2 is 5565
15/12/07 18:31:18 INFO executor.Executor: Serialized size of result for 1 is 5565
15/12/07 18:31:18 INFO executor.Executor: Sending result for 2 directly to driver
15/12/07 18:31:18 INFO executor.Executor: Sending result for 1 directly to driver
15/12/07 18:31:18 INFO executor.Executor: Serialized size of result for 3 is 5565
15/12/07 18:31:18 INFO executor.Executor: Finished task ID 2
15/12/07 18:31:18 INFO executor.Executor: Finished task ID 1
15/12/07 18:31:18 INFO executor.Executor: Sending result for 3 directly to driver
15/12/07 18:31:18 INFO executor.Executor: Finished task ID 3
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:4 as TID 4 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:4 as 1350 bytes in 0 ms
15/12/07 18:31:18 INFO executor.Executor: Running task ID 4
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:5 as TID 5 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:5 as 1350 bytes in 1 ms
15/12/07 18:31:18 INFO executor.Executor: Running task ID 5
15/12/07 18:31:18 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
15/12/07 18:31:18 INFO executor.Executor: Serialized size of result for 4 is 5565
15/12/07 18:31:18 INFO executor.Executor: Sending result for 4 directly to driver
15/12/07 18:31:18 INFO executor.Executor: Finished task ID 4
15/12/07 18:31:18 WARN scheduler.TaskSetManager: Loss was due to helpers.WorkerViolation
A worker violation occurred: Bad number detected.
at SparkExceptions$.func(SparkExceptions.scala:26)
at SparkExceptions$$anonfun$1.apply$mcDI$sp(SparkExceptions.scala:14)
at SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
at SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/12/07 18:31:18 INFO executor.Executor: Serialized size of result for 5 is 5565
15/12/07 18:31:18 INFO executor.Executor: Sending result for 5 directly to driver
15/12/07 18:31:18 INFO executor.Executor: Finished task ID 5
15/12/07 18:31:18 ERROR scheduler.TaskSetManager: Task 0.0:0 failed 1 times; aborting job
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Finished TID 2 in 27 ms on localhost (progress: 1/8)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Finished TID 1 in 30 ms on localhost (progress: 2/8)
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Stage 0 was cancelled
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Finished TID 4 in 11 ms on localhost (progress: 3/8)
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Failed to run collect at SparkExceptions.scala:16
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 1 times, most recent failure: Exception failure in TID 0 on host localhost: A worker violation occurred: Bad number detected.
SparkExceptions$.func(SparkExceptions.scala:26)
SparkExceptions$$anonfun$1.apply$mcDI$sp(SparkExceptions.scala:14)
SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Finished TID 5 in 11 ms on localhost (progress: 4/8)
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Finished TID 3 in 34 ms on localhost (progress: 5/8)
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
Process finished with exit code 1
So my second question is: why does throwing an exception of a class derived from SparkException also kill the driver program? Is there a different strategy I can use for executor-driver communication?
FWIW, I have decided that, to allow for a higher degree of message passing between nodes, dropping down to the level of Akka actors is the preferred way to go.
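Short of dropping to Akka, a simpler strategy is to return failures as values rather than throwing inside the task. The reason the custom exception takes the driver down is that Spark gives up on a task after spark.task.maxFailures attempts (a single attempt when running locally), aborts the job, and the failing action (collect here) rethrows the result on the driver as a SparkException; unless main catches it, the process exits. A minimal sketch, with a stand-in for the func and helpers.WorkerViolation shown in the trace above:

import scala.util.{Failure, Success, Try}

// Stand-in for the asker's SparkExceptions.func / helpers.WorkerViolation
// (their definitions aren't shown); it throws on one "bad" record.
def func(i: Int): Double =
  if (i == 42) throw new IllegalArgumentException("Bad number detected") else i * 2.0

// Wrap the per-record call in Try so a bad record becomes data on the executor
// instead of an exception that fails the task and, eventually, the whole job.
val results = sc.parallelize(1 to 100).map { i =>
  Try(func(i)) match {
    case Success(v)  => Right(v)
    case Failure(ex) => Left(s"record $i failed: ${ex.getMessage}")
  }
}

// The driver can inspect the failures without the stage being cancelled.
val failures = results.collect { case Left(msg) => msg }.collect()
val values   = results.collect { case Right(v) => v }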

Using addFile with pipe on a yarn cluster

I've been using pyspark with my YARN cluster with success. The work I'm doing involves using the RDD's pipe command to send data through a binary I've made. I can do this easily in pyspark like so (assuming 'sc' is already defined):
sc.addFile("./dumb_prog")
t = sc.parallelize(range(10))
u = t.pipe("dumb_prog")
u.take(10) # Gives expected result
However, if I do the same thing in Scala, the pipe command gets a 'Cannot run program "dumb_prog": error=2, No such file or directory' error. Here's the code in the Scala shell:
sc.addFile("./dumb_prog")
val t = sc.parallelize(0 until 10)
val u = t.pipe("dumb_prog")
u.take(10)
Why does this only work in Python and not in Scala? Is there a way I can get it to work in Scala?
Here is the full error message from the Scala side:
14/09/29 13:07:47 INFO SparkContext: Starting job: take at <console>:17
14/09/29 13:07:47 INFO DAGScheduler: Got job 3 (take at <console>:17) with 1 output partitions (allowLocal=true)
14/09/29 13:07:47 INFO DAGScheduler: Final stage: Stage 3(take at <console>:17)
14/09/29 13:07:47 INFO DAGScheduler: Parents of final stage: List()
14/09/29 13:07:47 INFO DAGScheduler: Missing parents: List()
14/09/29 13:07:47 INFO DAGScheduler: Submitting Stage 3 (PipedRDD[3] at pipe at <console>:14), which has no missing parents
14/09/29 13:07:47 INFO MemoryStore: ensureFreeSpace(2136) called with curMem=7453, maxMem=278302556
14/09/29 13:07:47 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.1 KB, free 265.4 MB)
14/09/29 13:07:47 INFO MemoryStore: ensureFreeSpace(1389) called with curMem=9589, maxMem=278302556
14/09/29 13:07:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1389.0 B, free 265.4 MB)
14/09/29 13:07:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 10.10.0.20:37574 (size: 1389.0 B, free: 265.4 MB)
14/09/29 13:07:47 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
14/09/29 13:07:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 3 (PipedRDD[3] at pipe at <console>:14)
14/09/29 13:07:47 INFO YarnClientClusterScheduler: Adding task set 3.0 with 1 tasks
14/09/29 13:07:47 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 6, SERVERNAME, PROCESS_LOCAL, 1201 bytes)
14/09/29 13:07:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on SERVERNAME:57118 (size: 1389.0 B, free: 530.3 MB)
14/09/29 13:07:47 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, SERVERNAME): java.io.IOException: Cannot run program "dumb_prog": error=2, No such file or directory
java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
I ran into a similar issue in Spark 1.3.0 in YARN client mode. When I looked in the app cache directory, the file never got pushed to the executors, even when using --files. But when I added the below, it did get pushed to each executor:
sc.addFile("dumb_prog", true)
t.pipe("./dumb_prog")
I think it is a bug, but the above got me past the issue.
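For completeness, a consolidated sketch of that workaround in the Scala shell (assuming dumb_prog is an executable file next to the driver and sc is a YARN-client SparkContext); the two changes versus the question's version are the second addFile argument and the "./" prefix:

// Ship the binary to every executor; the second argument is the recursive flag
// used in the workaround above.
sc.addFile("dumb_prog", true)

val t = sc.parallelize(0 until 10)

// "./dumb_prog" resolves against the task's working directory, where the shipped
// copy lands, whereas a bare "dumb_prog" is looked up on the PATH, which is where
// the original error=2 (No such file or directory) came from.
val u = t.pipe("./dumb_prog")
u.take(10)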