Using addFile with pipe on a YARN cluster - Scala

I've been using pyspark with my YARN cluster with success. The work I'm
doing involves using the RDD's pipe command to send data through a binary
I've made. I can do this easily in pyspark like so (assuming 'sc' is
already defined):
sc.addFile("./dumb_prog")
t = sc.parallelize(range(10))
u = t.pipe("dumb_prog")
u.take(10)  # Gives the expected result
However, if I do the same thing in Scala, the pipe command gets a 'Cannot
run program "dumb_prog": error=2, No such file or directory' error. Here's
the code in the Scala shell:
sc.addFile("./dumb_prog")
val t = sc.parallelize(0 until 10)
val u = t.pipe("dumb_prog")
u.take(10)
Why does this only work in Python and not in Scala? Is there a way I can
get it to work in Scala?
Here is the full error message from the Scala side:
14/09/29 13:07:47 INFO SparkContext: Starting job: take at <console>:17
14/09/29 13:07:47 INFO DAGScheduler: Got job 3 (take at <console>:17) with 1
output partitions (allowLocal=true)
14/09/29 13:07:47 INFO DAGScheduler: Final stage: Stage 3(take at
<console>:17)
14/09/29 13:07:47 INFO DAGScheduler: Parents of final stage: List()
14/09/29 13:07:47 INFO DAGScheduler: Missing parents: List()
14/09/29 13:07:47 INFO DAGScheduler: Submitting Stage 3 (PipedRDD[3] at pipe
at <console>:14), which has no missing parents
14/09/29 13:07:47 INFO MemoryStore: ensureFreeSpace(2136) called with
curMem=7453, maxMem=278302556
14/09/29 13:07:47 INFO MemoryStore: Block broadcast_3 stored as values in
memory (estimated size 2.1 KB, free 265.4 MB)
14/09/29 13:07:47 INFO MemoryStore: ensureFreeSpace(1389) called with
curMem=9589, maxMem=278302556
14/09/29 13:07:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes
in memory (estimated size 1389.0 B, free 265.4 MB)
14/09/29 13:07:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
on 10.10.0.20:37574 (size: 1389.0 B, free: 265.4 MB)
14/09/29 13:07:47 INFO BlockManagerMaster: Updated info of block
broadcast_3_piece0
14/09/29 13:07:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 3
(PipedRDD[3] at pipe at <console>:14)
14/09/29 13:07:47 INFO YarnClientClusterScheduler: Adding task set 3.0 with
1 tasks
14/09/29 13:07:47 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
6, SERVERNAME, PROCESS_LOCAL, 1201 bytes)
14/09/29 13:07:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
on SERVERNAME:57118 (size: 1389.0 B, free: 530.3 MB)
14/09/29 13:07:47 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6,
SERVERNAME): java.io.IOException: Cannot run program "dumb_prog": error=2,
No such file or directory
java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)

I ran into a similar issue with Spark 1.3.0 in YARN client mode. When I looked in the app cache directory, the file was never pushed to the executors, even when using --files. But when I added the below, it was pushed to each executor:
sc.addFile("dumb_prog", true)
t.pipe("./dumb_prog")
I think it is a bug, but the above got me past the issue.
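Putting the workaround together, a minimal sketch for the Scala shell (assuming an executable dumb_prog sits in the driver's working directory and sc is already defined; the recursive=true overload of addFile and the "./" prefix are the two changes that mattered here):

```scala
// Ship dumb_prog to every executor's working directory. The boolean
// second argument (recursive) is what made the file actually appear
// in the executors' app cache in the report above.
sc.addFile("dumb_prog", true)

val t = sc.parallelize(0 until 10)

// The executor's working directory is not on PATH, so the program
// must be invoked with an explicit "./" prefix.
val u = t.pipe("./dumb_prog")
u.take(10)
```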

Related

What happens when I use a global map variable in Scala without broadcasting?

In Scala, what happens when I use a global map variable without broadcasting it?
E.g. if I build a variable using collect* (such as collectAsMap), it appears to be a global variable, and I can use it in all RDD.mapValues() functions without explicitly broadcasting it.
BUT I know Spark works in a distributed fashion, and it should not be able to read a variable stored in driver memory without broadcasting it. So, what happened?
Code example (this code computes tf-idf over text, where df is stored in a Map):
// dfMap is a String -> Int Map in driver memory
// Array[(String, Int)] = Array((B,2), (A,3), (C,1))
val dfMap = dfrdd.collectAsMap
// tfrdd is an RDD, and I can use dfMap in its mapValues function
// tfrdd: Array((doc1,Map(A -> 3.0)), (doc2,Map(A -> 2.0, B -> 1.0)))
val tfidfrdd = tfrdd.mapValues(e => e.map(x => x._1 -> x._2 * lineNum / dfMap.getOrElse(x._1, 1)))
tfidfrdd.saveAsTextFile("/somedir/result/")
The code works just fine. My question is: what happened there? Does the driver send dfMap to all workers just as broadcasting would, or something else?
What's the difference if I broadcast explicitly, like this:
val dfMap = sc.broadcast(dfrdd.collectAsMap)
val tfidfrdd = tfrdd.mapValues(e => e.map(x => x._1 -> x._2 * lineNum / dfMap.value.getOrElse(x._1, 1)))
I've checked more resources, aggregated others' answers, and put it in order. The difference between using an external variable DIRECTLY (my so-called "global variable") and BROADCASTING a variable with sc.broadcast() is as follows:
1) When using an external variable directly, Spark sends a copy of the serialized variable with each TASK, whereas with sc.broadcast the variable is sent once per EXECUTOR. The number of tasks is normally around 10 times the number of executors.
So when the variable (say a map) is large enough (more than about 20 KB), the former approach can spend a lot of time on network transfer and cause frequent GC, which slows Spark down. Hence it is suggested that large variables (> 20 KB) be broadcast explicitly.
2) When using an external variable directly, the variable is not persisted; it dies with the task and thus cannot be reused. With sc.broadcast() the variable is automatically persisted in the executors' memory and lasts until you explicitly unpersist it. Thus an sc.broadcast variable is available across tasks and stages.
So if the variable is expected to be used multiple times, sc.broadcast() is suggested.
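To illustrate point 2, a hedged sketch of the broadcast lifecycle, reusing dfrdd, tfrdd, and lineNum from the question (the variable name dfB is illustrative):

```scala
// One serialized copy per executor, cached until explicitly released.
val dfB = sc.broadcast(dfrdd.collectAsMap)

// dfB.value can now be read by any number of tasks and stages
// without re-shipping the map from the driver each time.
val tfidfrdd = tfrdd.mapValues(e =>
  e.map(x => x._1 -> x._2 * lineNum / dfB.value.getOrElse(x._1, 1)))

// Release the cached copies once the variable is no longer needed.
dfB.unpersist()  // drop executor copies (can be lazily re-broadcast)
dfB.destroy()    // irreversibly remove all copies, including the driver's
```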
There is no difference between a global map variable and a broadcast variable. If we use a global variable in a map function of an RDD, it will be broadcast to all nodes. For example:
scala> val list = List(1,2,3)
list: List[Int] = List(1, 2, 3)
scala> val rdd = sc.parallelize(List(1,2,3,4))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24
scala> rdd.filter(elem => list.contains(elem)).collect
17/03/16 10:21:53 INFO SparkContext: Starting job: collect at <console>:29
17/03/16 10:21:53 INFO DAGScheduler: Got job 3 (collect at <console>:29) with 4 output partitions
17/03/16 10:21:53 INFO DAGScheduler: Final stage: ResultStage 3 (collect at <console>:29)
17/03/16 10:21:53 INFO DAGScheduler: Parents of final stage: List()
17/03/16 10:21:53 INFO DAGScheduler: Missing parents: List()
17/03/16 10:21:53 DEBUG DAGScheduler: submitStage(ResultStage 3)
17/03/16 10:21:53 DEBUG DAGScheduler: missing: List()
17/03/16 10:21:53 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[5] at filter at <console>:29), which has no missing parents
17/03/16 10:21:53 DEBUG DAGScheduler: submitMissingTasks(ResultStage 3)
17/03/16 10:21:53 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 5.0 KB, free 366.3 MB)
17/03/16 10:21:53 DEBUG BlockManager: Put block broadcast_4 locally took 1 ms
17/03/16 10:21:53 DEBUG BlockManager: Putting block broadcast_4 without replication took 1 ms
17/03/16 10:21:53 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 2.5 KB, free 366.3 MB)
17/03/16 10:21:53 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.2.123:37645 (size: 2.5 KB, free: 366.3 MB)
17/03/16 10:21:53 DEBUG BlockManagerMaster: Updated info of block broadcast_4_piece0
17/03/16 10:21:53 DEBUG BlockManager: Told master about block broadcast_4_piece0
17/03/16 10:21:53 DEBUG BlockManager: Put block broadcast_4_piece0 locally took 2 ms
17/03/16 10:21:53 DEBUG ContextCleaner: Got cleaning task CleanBroadcast(1)
17/03/16 10:21:53 DEBUG BlockManager: Putting block broadcast_4_piece0 without replication took 2 ms
17/03/16 10:21:53 DEBUG ContextCleaner: Cleaning broadcast 1
17/03/16 10:21:53 DEBUG TorrentBroadcast: Unpersisting TorrentBroadcast 1
17/03/16 10:21:53 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:996
17/03/16 10:21:53 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 3 (MapPartitionsRDD[5] at filter at <console>:29)
17/03/16 10:21:53 DEBUG DAGScheduler: New pending partitions: Set(0, 1, 2, 3)
17/03/16 10:21:53 INFO TaskSchedulerImpl: Adding task set 3.0 with 4 tasks
17/03/16 10:21:53 DEBUG TaskSetManager: Epoch for TaskSet 3.0: 0
17/03/16 10:21:53 DEBUG TaskSetManager: Valid locality levels for TaskSet 3.0: NO_PREF, ANY
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 0
17/03/16 10:21:53 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 12, localhost, executor driver, partition 0, PROCESS_LOCAL, 5886 bytes)
17/03/16 10:21:53 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 13, localhost, executor driver, partition 1, PROCESS_LOCAL, 5886 bytes)
17/03/16 10:21:53 INFO TaskSetManager: Starting task 2.0 in stage 3.0 (TID 14, localhost, executor driver, partition 2, PROCESS_LOCAL, 5886 bytes)
17/03/16 10:21:53 INFO TaskSetManager: Starting task 3.0 in stage 3.0 (TID 15, localhost, executor driver, partition 3, PROCESS_LOCAL, 5886 bytes)
17/03/16 10:21:53 INFO Executor: Running task 0.0 in stage 3.0 (TID 12)
17/03/16 10:21:53 DEBUG Executor: Task 12's epoch is 0
17/03/16 10:21:53 DEBUG BlockManager: Getting local block broadcast_4
17/03/16 10:21:53 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:21:53 INFO Executor: Running task 2.0 in stage 3.0 (TID 14)
17/03/16 10:21:53 INFO Executor: Running task 1.0 in stage 3.0 (TID 13)
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: removing broadcast 1
17/03/16 10:21:53 DEBUG BlockManager: Removing broadcast 1
17/03/16 10:21:53 DEBUG BlockManager: Removing block broadcast_1
17/03/16 10:21:53 INFO Executor: Running task 3.0 in stage 3.0 (TID 15)
17/03/16 10:21:53 DEBUG Executor: Task 13's epoch is 0
17/03/16 10:21:53 DEBUG MemoryStore: Block broadcast_1 of size 5112 dropped from memory (free 384072627)
17/03/16 10:21:53 DEBUG BlockManager: Removing block broadcast_1_piece0
17/03/16 10:21:53 DEBUG MemoryStore: Block broadcast_1_piece0 of size 2535 dropped from memory (free 384075162)
17/03/16 10:21:53 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.2.123:37645 in memory (size: 2.5 KB, free: 366.3 MB)
17/03/16 10:21:53 DEBUG BlockManagerMaster: Updated info of block broadcast_1_piece0
17/03/16 10:21:53 DEBUG BlockManager: Told master about block broadcast_1_piece0
17/03/16 10:21:53 DEBUG BlockManager: Getting local block broadcast_4
17/03/16 10:21:53 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:21:53 DEBUG Executor: Task 14's epoch is 0
17/03/16 10:21:53 DEBUG BlockManager: Getting local block broadcast_4
17/03/16 10:21:53 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:21:53 DEBUG Executor: Task 15's epoch is 0
17/03/16 10:21:53 DEBUG BlockManager: Getting local block broadcast_4
17/03/16 10:21:53 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: Done removing broadcast 1, response is 0
17/03/16 10:21:53 DEBUG ContextCleaner: Cleaned broadcast 1
17/03/16 10:21:53 DEBUG ContextCleaner: Got cleaning task CleanBroadcast(3)
17/03/16 10:21:53 DEBUG ContextCleaner: Cleaning broadcast 3
17/03/16 10:21:53 DEBUG TorrentBroadcast: Unpersisting TorrentBroadcast 3
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: removing broadcast 3
17/03/16 10:21:53 DEBUG BlockManager: Removing broadcast 3
17/03/16 10:21:53 DEBUG BlockManager: Removing block broadcast_3_piece0
17/03/16 10:21:53 DEBUG MemoryStore: Block broadcast_3_piece0 of size 3309 dropped from memory (free 384078471)
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: Sent response: 0 to 192.168.2.123:40909
17/03/16 10:21:53 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 192.168.2.123:37645 in memory (size: 3.2 KB, free: 366.3 MB)
17/03/16 10:21:53 DEBUG BlockManagerMaster: Updated info of block broadcast_3_piece0
17/03/16 10:21:53 DEBUG BlockManager: Told master about block broadcast_3_piece0
17/03/16 10:21:53 DEBUG BlockManager: Removing block broadcast_3
17/03/16 10:21:53 DEBUG MemoryStore: Block broadcast_3 of size 6904 dropped from memory (free 384085375)
17/03/16 10:21:53 INFO Executor: Finished task 1.0 in stage 3.0 (TID 13). 912 bytes result sent to driver
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: Done removing broadcast 3, response is 0
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: Sent response: 0 to 192.168.2.123:40909
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 3
17/03/16 10:21:53 DEBUG TaskSetManager: No tasks for locality level NO_PREF, so moving to locality level ANY
17/03/16 10:21:53 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 13) in 36 ms on localhost (executor driver) (1/4)
17/03/16 10:21:53 INFO Executor: Finished task 2.0 in stage 3.0 (TID 14). 912 bytes result sent to driver
17/03/16 10:21:53 DEBUG ContextCleaner: Cleaned broadcast 3
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 2
17/03/16 10:21:53 INFO Executor: Finished task 0.0 in stage 3.0 (TID 12). 912 bytes result sent to driver
17/03/16 10:21:53 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 14) in 36 ms on localhost (executor driver) (2/4)
17/03/16 10:21:53 INFO Executor: Finished task 3.0 in stage 3.0 (TID 15). 908 bytes result sent to driver
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 1
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 0
17/03/16 10:21:53 INFO TaskSetManager: Finished task 3.0 in stage 3.0 (TID 15) in 36 ms on localhost (executor driver) (3/4)
17/03/16 10:21:53 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 12) in 45 ms on localhost (executor driver) (4/4)
17/03/16 10:21:53 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
17/03/16 10:21:53 INFO DAGScheduler: ResultStage 3 (collect at <console>:29) finished in 0.045 s
17/03/16 10:21:53 DEBUG DAGScheduler: After removal of stage 3, remaining stages = 0
17/03/16 10:21:53 INFO DAGScheduler: Job 3 finished: collect at <console>:29, took 0.097564 s
res4: Array[Int] = Array(1, 2, 3)
In the above log we can clearly see that the global variable list is broadcast. The same is the case when we explicitly broadcast the list:
scala> val br = sc.broadcast(list)
17/03/16 10:26:40 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 160.0 B, free 366.3 MB)
17/03/16 10:26:40 DEBUG BlockManager: Put block broadcast_5 locally took 1 ms
17/03/16 10:26:40 DEBUG BlockManager: Putting block broadcast_5 without replication took 1 ms
17/03/16 10:26:40 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 227.0 B, free 366.3 MB)
17/03/16 10:26:40 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.2.123:37645 (size: 227.0 B, free: 366.3 MB)
17/03/16 10:26:40 DEBUG BlockManagerMaster: Updated info of block broadcast_5_piece0
17/03/16 10:26:40 DEBUG BlockManager: Told master about block broadcast_5_piece0
17/03/16 10:26:40 DEBUG BlockManager: Put block broadcast_5_piece0 locally took 1 ms
17/03/16 10:26:40 DEBUG BlockManager: Putting block broadcast_5_piece0 without replication took 1 ms
17/03/16 10:26:40 INFO SparkContext: Created broadcast 5 from broadcast at <console>:26
br: org.apache.spark.broadcast.Broadcast[List[Int]] = Broadcast(5)
scala> rdd.filter(elem => br.value.contains(elem)).collect
17/03/16 10:27:50 INFO SparkContext: Starting job: collect at <console>:31
17/03/16 10:27:50 INFO DAGScheduler: Got job 0 (collect at <console>:31) with 4 output partitions
17/03/16 10:27:50 INFO DAGScheduler: Final stage: ResultStage 0 (collect at <console>:31)
17/03/16 10:27:50 INFO DAGScheduler: Parents of final stage: List()
17/03/16 10:27:50 INFO DAGScheduler: Missing parents: List()
17/03/16 10:27:50 DEBUG DAGScheduler: submitStage(ResultStage 0)
17/03/16 10:27:50 DEBUG DAGScheduler: missing: List()
17/03/16 10:27:50 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at filter at <console>:31), which has no missing parents
17/03/16 10:27:50 DEBUG DAGScheduler: submitMissingTasks(ResultStage 0)
17/03/16 10:27:50 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 6.7 KB, free 366.3 MB)
17/03/16 10:27:50 DEBUG BlockManager: Put block broadcast_1 locally took 6 ms
17/03/16 10:27:50 DEBUG BlockManager: Putting block broadcast_1 without replication took 6 ms
17/03/16 10:27:50 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.2 KB, free 366.3 MB)
17/03/16 10:27:50 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.2.123:37303 (size: 3.2 KB, free: 366.3 MB)
17/03/16 10:27:50 DEBUG BlockManagerMaster: Updated info of block broadcast_1_piece0
17/03/16 10:27:50 DEBUG BlockManager: Told master about block broadcast_1_piece0
17/03/16 10:27:50 DEBUG BlockManager: Put block broadcast_1_piece0 locally took 2 ms
17/03/16 10:27:50 DEBUG BlockManager: Putting block broadcast_1_piece0 without replication took 2 ms
17/03/16 10:27:50 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996
17/03/16 10:27:50 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at filter at <console>:31)
17/03/16 10:27:50 DEBUG DAGScheduler: New pending partitions: Set(0, 1, 2, 3)
17/03/16 10:27:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
17/03/16 10:27:50 DEBUG TaskSetManager: Epoch for TaskSet 0.0: 0
17/03/16 10:27:50 DEBUG TaskSetManager: Valid locality levels for TaskSet 0.0: NO_PREF, ANY
17/03/16 10:27:50 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 0
17/03/16 10:27:50 DEBUG TaskSetManager: Valid locality levels for TaskSet 0.0: NO_PREF, ANY
17/03/16 10:27:51 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 5885 bytes)
17/03/16 10:27:51 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 5885 bytes)
17/03/16 10:27:51 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 5885 bytes)
17/03/16 10:27:51 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 5885 bytes)
17/03/16 10:27:51 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
17/03/16 10:27:51 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/03/16 10:27:51 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
17/03/16 10:27:51 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
17/03/16 10:27:51 DEBUG Executor: Task 0's epoch is 0
17/03/16 10:27:51 DEBUG Executor: Task 2's epoch is 0
17/03/16 10:27:51 DEBUG Executor: Task 3's epoch is 0
17/03/16 10:27:51 DEBUG Executor: Task 1's epoch is 0
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_1
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_1
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_1
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_1
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_0
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_0
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_0
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_0
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 908 bytes result sent to driver
17/03/16 10:27:51 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 999 bytes result sent to driver
17/03/16 10:27:51 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 912 bytes result sent to driver
17/03/16 10:27:51 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 912 bytes result sent to driver
17/03/16 10:27:51 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 3
17/03/16 10:27:51 DEBUG TaskSetManager: No tasks for locality level NO_PREF, so moving to locality level ANY
17/03/16 10:27:51 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 2
17/03/16 10:27:51 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 1
17/03/16 10:27:51 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 0
17/03/16 10:27:51 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 165 ms on localhost (executor driver) (1/4)
17/03/16 10:27:51 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 180 ms on localhost (executor driver) (2/4)
17/03/16 10:27:51 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 249 ms on localhost (executor driver) (3/4)
17/03/16 10:27:51 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 186 ms on localhost (executor driver) (4/4)
17/03/16 10:27:51 INFO DAGScheduler: ResultStage 0 (collect at <console>:31) finished in 0.264 s
17/03/16 10:27:51 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/03/16 10:27:51 DEBUG DAGScheduler: After removal of stage 0, remaining stages = 0
17/03/16 10:27:51 INFO DAGScheduler: Job 0 finished: collect at <console>:31, took 0.381615 s
res1: Array[Int] = Array(1, 2, 3)
The same is the case with a broadcast variable.
When you broadcast, the data is cached on all the nodes, so when you perform an action (collect, saveAsTextFile, head), the broadcast values are already available to all the worker nodes.
But if you do not broadcast the value, then for each action the variable is serialized into the task closures and re-shipped from the driver to the worker nodes.
First off, this is a Spark thing, not a Scala one.
The difference is that closure-captured values are shipped every time they are used, whereas explicit broadcasts are cached on the executors.
"Broadcast variables are created from a variable v by calling
SparkContext.broadcast(v). The broadcast variable is a wrapper around
v, and its value can be accessed by calling the value method ... After the broadcast variable is created, it should
be used instead of the value v in any functions run on the cluster so
that v is not shipped to the nodes more than once"
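Condensing the quoted guidance into one hedged side-by-side sketch, using the same list and rdd as in the REPL session above:

```scala
val list = List(1, 2, 3)

// Closure capture: `list` is serialized into every task that uses it.
val viaClosure = rdd.filter(elem => list.contains(elem))

// Explicit broadcast: one copy per executor, cached across stages.
val listB = sc.broadcast(list)
val viaBroadcast = rdd.filter(elem => listB.value.contains(elem))
```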

Spark-Submit: Failed to open native connection to Cassandra at {10.0.0.5, 10.0.0.4}:9042

I am trying to submit the job using the command:
spark-submit --class it.polimi.dice.spark.WordCount --master yarn-master --conf spark.cassandra.connection.host=10.0.0.5 --num-executors 1 --deploy-mode client --driver-memory 512m --executor-memory 512m /home/useruser/temp/spark-cassandra-example/target/scala-2.10/spark-cassandra-exmaple-assembly-1.0.jar
But I am getting an error, although I have tried the same thing using spark-shell and it works, which suggests that the Spark-Cassandra connector version 1.6.0-M1 (Spark 1.6.0, Scala 2.10.5, Cassandra 3.3.0) and the other configuration are fine. Here is the result I get after using the spark-submit command:
16/11/21 09:39:03 INFO Client: Application report for application_1479668866076_0014 (state: ACCEPTED)
16/11/21 09:39:04 INFO Client: Application report for application_1479668866076_0014 (state: ACCEPTED)
16/11/21 09:39:05 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null)
16/11/21 09:39:05 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> sandbox.hortonworks.com, PROXY_URI_BASES -> http://sandbox.hortonworks.com:8088/proxy/application_1479668866076_0014), /proxy/application_1479668866076_0014
16/11/21 09:39:05 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
16/11/21 09:39:05 INFO Client: Application report for application_1479668866076_0014 (state: RUNNING)
16/11/23 10:10:38 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it
16/11/23 10:10:38 INFO Cluster: New Cassandra host /10.0.0.4:9042 added
16/11/23 10:10:38 INFO Cluster: New Cassandra host /10.0.0.5:9042 added
16/11/23 10:10:38 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
16/11/23 10:10:39 INFO SparkContext: Starting job: fold at WordCount.scala:17
16/11/23 10:10:39 INFO DAGScheduler: Got job 0 (fold at WordCount.scala:17) with 2 output partitions
16/11/23 10:10:39 INFO DAGScheduler: Final stage: ResultStage 0 (fold at WordCount.scala:17)
16/11/23 10:10:39 INFO DAGScheduler: Parents of final stage: List()
16/11/23 10:10:39 INFO DAGScheduler: Missing parents: List()
16/11/23 10:10:39 INFO DAGScheduler: Submitting ResultStage 0 (CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15), which has no missing parents
16/11/23 10:10:39 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 7.1 KB, free 7.1 KB)
16/11/23 10:10:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.7 KB, free 10.9 KB)
16/11/23 10:10:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.0.6:58236 (size: 3.7 KB, free: 143.6 MB)
16/11/23 10:10:39 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/11/23 10:10:39 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (CassandraTableScanRDD[1] at RDD at CassandraRDD.scala:15)
16/11/23 10:10:39 INFO YarnScheduler: Adding task set 0.0 with 2 tasks
16/11/23 10:10:39 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, sandbox.hortonworks.com, partition 0,RACK_LOCAL, 29218 bytes)
16/11/23 10:10:40 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on sandbox.hortonworks.com:51013 (size: 3.7 KB, free: 143.6 MB)
16/11/23 10:10:43 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, sandbox.hortonworks.com, partition 1,RACK_LOCAL, 29156 bytes)
16/11/23 10:10:43 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, sandbox.hortonworks.com): java.io.IOException: Failed to open native connection to Cassandra at {10.0.0.4, 10.0.0.5}:9042
16/11/23 10:10:45 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 2, sandbox.hortonworks.com, partition 0,RACK_LOCAL, 29218 bytes)
16/11/23 10:10:45 INFO TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on executor sandbox.hortonworks.com: java.io.IOException (Failed to open native connection to Cassandra at {10.0.0.4, 10.0.0.5}:9042) [duplicate 1]
Can anyone kindly help me with how I can fix this issue?

Scala UDF runs fine on Spark shell but gives NPE when using it in sparkSQL

I have created a Spark UDF. When I run it on spark-shell it runs perfectly fine. But when I register it and use it in my sparkSQL query, it gives a NullPointerException.
scala> test_proc("1605","(#supp In (-1,118)")
16/03/07 10:35:04 INFO TaskSetManager: Finished task 0.0 in stage 21.0 (TID 220) in 62 ms on cdts1hdpdn01d.rxcorp.com (1/1)
16/03/07 10:35:04 INFO YarnScheduler: Removed TaskSet 21.0, whose tasks have all completed, from pool
16/03/07 10:35:04 INFO DAGScheduler: ResultStage 21 (first at <console>:45) finished in 0.062 s
16/03/07 10:35:04 INFO DAGScheduler: Job 16 finished: first at <console>:45, took 2.406408 s
res14: Int = 1
scala>
But when I register it and use it in my sparkSQL query, it gives NPE.
scala> sqlContext.udf.register("store_proc", test_proc _)
scala> hiveContext.sql("select store_proc('1605' , '(#supp In (-1,118)')").first.getInt(0)
16/03/07 10:37:58 INFO ParseDriver: Parsing command: select store_proc('1605' , '(#supp In (-1,118)')
16/03/07 10:37:58 INFO ParseDriver: Parse Completed
16/03/07 10:37:58 INFO SparkContext: Starting job: first at <console>:24
16/03/07 10:37:58 INFO DAGScheduler: Got job 17 (first at <console>:24) with 1 output partitions
16/03/07 10:37:58 INFO DAGScheduler: Final stage: ResultStage 22 (first at <console>:24)
16/03/07 10:37:58 INFO DAGScheduler: Parents of final stage: List()
16/03/07 10:37:58 INFO DAGScheduler: Missing parents: List()
16/03/07 10:37:58 INFO DAGScheduler: Submitting ResultStage 22 (MapPartitionsRDD[86] at first at <console>:24), which has no missing parents
16/03/07 10:37:58 INFO MemoryStore: ensureFreeSpace(10520) called with curMem=1472899, maxMem=2222739947
16/03/07 10:37:58 INFO MemoryStore: Block broadcast_30 stored as values in memory (estimated size 10.3 KB, free 2.1 GB)
16/03/07 10:37:58 INFO MemoryStore: ensureFreeSpace(4774) called with curMem=1483419, maxMem=2222739947
16/03/07 10:37:58 INFO MemoryStore: Block broadcast_30_piece0 stored as bytes in memory (estimated size 4.7 KB, free 2.1 GB)
16/03/07 10:37:58 INFO BlockManagerInfo: Added broadcast_30_piece0 in memory on 162.44.214.87:47564 (size: 4.7 KB, free: 2.1 GB)
16/03/07 10:37:58 INFO SparkContext: Created broadcast 30 from broadcast at DAGScheduler.scala:861
16/03/07 10:37:58 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 22 (MapPartitionsRDD[86] at first at :24)
16/03/07 10:37:58 INFO YarnScheduler: Adding task set 22.0 with 1 tasks
16/03/07 10:37:58 INFO TaskSetManager: Starting task 0.0 in stage 22.0 (TID 221, cdts1hdpdn02d.rxcorp.com, partition 0,PROCESS_LOCAL, 2155 bytes)
16/03/07 10:37:58 INFO BlockManagerInfo: Added broadcast_30_piece0 in memory on cdts1hdpdn02d.rxcorp.com:33678 (size: 4.7 KB, free: 6.7 GB)
16/03/07 10:37:58 WARN TaskSetManager: Lost task 0.0 in stage 22.0 (TID 221, cdts1hdpdn02d.rxcorp.com): java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveContext.parseSql(HiveContext.scala:291)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
at $line20.$read$iwC$iwC$iwC$iwC$iwC$iwC$iwC$iwC.test_proc(<console>:41)
This is a sample of my test_proc:
def test_proc(x: String, y: String): Int = {
  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  val z: Int = hiveContext.sql("select 7").first.getInt(0)
  return z
}
Based on the output from a standalone call, it looks like test_proc is executing some kind of Spark action, and this cannot work inside a UDF because Spark doesn't support nested operations on distributed data structures. If test_proc is using SQLContext, this will result in an NPE, since Spark contexts exist only on the driver.
If that's the case, you'll have to restructure your code to achieve the desired effect, either using local (most likely broadcast) variables or joins.
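As a hedged sketch of that restructuring (names are illustrative): run the inner query once on the driver, where the contexts exist, and keep the registered function free of any SparkContext/SQLContext use:

```scala
// Execute the nested query once, on the driver.
val seven = hiveContext.sql("select 7").first.getInt(0)
val sevenB = sc.broadcast(seven)

// The UDF now only reads local/broadcast data, so it is safe to
// evaluate inside tasks on the executors.
sqlContext.udf.register("store_proc", (x: String, y: String) => sevenB.value)
```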

SparkUI is stopping after execution of code in IntelliJ IDEA

I am trying to run this simple Spark job using IntelliJ IDEA in Scala. However, the Spark UI stops completely after the object finishes executing. Is there something I am missing, or am I listening at the wrong location? Scala version: 2.10.4, Spark: 1.6.0.
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "C:/spark-1.6.0-bin-hadoop2.6/spark-1.6.0-bin-hadoop2.6/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
16/02/24 01:24:39 INFO SparkContext: Running Spark version 1.6.0
16/02/24 01:24:40 INFO SecurityManager: Changing view acls to: Sivaram Konanki
16/02/24 01:24:40 INFO SecurityManager: Changing modify acls to: Sivaram Konanki
16/02/24 01:24:40 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Sivaram Konanki); users with modify permissions: Set(Sivaram Konanki)
16/02/24 01:24:41 INFO Utils: Successfully started service 'sparkDriver' on port 54881.
16/02/24 01:24:41 INFO Slf4jLogger: Slf4jLogger started
16/02/24 01:24:42 INFO Remoting: Starting remoting
16/02/24 01:24:42 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.1.15:54894]
16/02/24 01:24:42 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 54894.
16/02/24 01:24:42 INFO SparkEnv: Registering MapOutputTracker
16/02/24 01:24:42 INFO SparkEnv: Registering BlockManagerMaster
16/02/24 01:24:42 INFO DiskBlockManager: Created local directory at C:\Users\Sivaram Konanki\AppData\Local\Temp\blockmgr-dad99e77-f3a6-4a1d-88d8-3b030be0bd0a
16/02/24 01:24:42 INFO MemoryStore: MemoryStore started with capacity 2.4 GB
16/02/24 01:24:42 INFO SparkEnv: Registering OutputCommitCoordinator
16/02/24 01:24:42 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/02/24 01:24:42 INFO SparkUI: Started SparkUI at http://192.168.1.15:4040
16/02/24 01:24:42 INFO Executor: Starting executor ID driver on host localhost
16/02/24 01:24:43 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54913.
16/02/24 01:24:43 INFO NettyBlockTransferService: Server created on 54913
16/02/24 01:24:43 INFO BlockManagerMaster: Trying to register BlockManager
16/02/24 01:24:43 INFO BlockManagerMasterEndpoint: Registering block manager localhost:54913 with 2.4 GB RAM, BlockManagerId(driver, localhost, 54913)
16/02/24 01:24:43 INFO BlockManagerMaster: Registered BlockManager
16/02/24 01:24:44 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 127.4 KB)
16/02/24 01:24:44 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 141.3 KB)
16/02/24 01:24:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:54913 (size: 13.9 KB, free: 2.4 GB)
16/02/24 01:24:44 INFO SparkContext: Created broadcast 0 from textFile at SimpleApp.scala:11
16/02/24 01:24:45 WARN : Your hostname, OSG-E5450-42 resolves to a loopback/non-reachable address: fe80:0:0:0:d9ff:4f93:5643:703d%wlan3, but we couldn't find any external IP address!
16/02/24 01:24:46 INFO FileInputFormat: Total input paths to process : 1
16/02/24 01:24:46 INFO SparkContext: Starting job: count at SimpleApp.scala:12
16/02/24 01:24:46 INFO DAGScheduler: Got job 0 (count at SimpleApp.scala:12) with 2 output partitions
16/02/24 01:24:46 INFO DAGScheduler: Final stage: ResultStage 0 (count at SimpleApp.scala:12)
16/02/24 01:24:46 INFO DAGScheduler: Parents of final stage: List()
16/02/24 01:24:46 INFO DAGScheduler: Missing parents: List()
16/02/24 01:24:46 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12), which has no missing parents
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 144.5 KB)
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1886.0 B, free 146.3 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:54913 (size: 1886.0 B, free: 2.4 GB)
16/02/24 01:24:46 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/02/24 01:24:46 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12)
16/02/24 01:24:46 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/02/24 01:24:46 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
16/02/24 01:24:46 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/02/24 01:24:46 INFO CacheManager: Partition rdd_1_1 not found, computing it
16/02/24 01:24:46 INFO CacheManager: Partition rdd_1_0 not found, computing it
16/02/24 01:24:46 INFO HadoopRDD: Input split: file:/C:/spark-1.6.0-bin-hadoop2.6/spark-1.6.0-bin-hadoop2.6/README.md:1679+1680
16/02/24 01:24:46 INFO HadoopRDD: Input split: file:/C:/spark-1.6.0-bin-hadoop2.6/spark-1.6.0-bin-hadoop2.6/README.md:0+1679
16/02/24 01:24:46 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/02/24 01:24:46 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/02/24 01:24:46 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/02/24 01:24:46 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/02/24 01:24:46 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/02/24 01:24:46 INFO MemoryStore: Block rdd_1_1 stored as values in memory (estimated size 4.7 KB, free 151.0 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added rdd_1_1 in memory on localhost:54913 (size: 4.7 KB, free: 2.4 GB)
16/02/24 01:24:46 INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 5.4 KB, free 156.5 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added rdd_1_0 in memory on localhost:54913 (size: 5.4 KB, free: 2.4 GB)
16/02/24 01:24:46 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2662 bytes result sent to driver
16/02/24 01:24:46 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2662 bytes result sent to driver
16/02/24 01:24:46 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 170 ms on localhost (1/2)
16/02/24 01:24:46 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 143 ms on localhost (2/2)
16/02/24 01:24:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/02/24 01:24:46 INFO DAGScheduler: ResultStage 0 (count at SimpleApp.scala:12) finished in 0.187 s
16/02/24 01:24:46 INFO DAGScheduler: Job 0 finished: count at SimpleApp.scala:12, took 0.303861 s
16/02/24 01:24:46 INFO SparkContext: Starting job: count at SimpleApp.scala:13
16/02/24 01:24:46 INFO DAGScheduler: Got job 1 (count at SimpleApp.scala:13) with 2 output partitions
16/02/24 01:24:46 INFO DAGScheduler: Final stage: ResultStage 1 (count at SimpleApp.scala:13)
16/02/24 01:24:46 INFO DAGScheduler: Parents of final stage: List()
16/02/24 01:24:46 INFO DAGScheduler: Missing parents: List()
16/02/24 01:24:46 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13), which has no missing parents
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.1 KB, free 159.6 KB)
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1888.0 B, free 161.5 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:54913 (size: 1888.0 B, free: 2.4 GB)
16/02/24 01:24:46 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/02/24 01:24:46 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13)
16/02/24 01:24:46 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
16/02/24 01:24:46 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, partition 1,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
16/02/24 01:24:46 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
16/02/24 01:24:46 INFO BlockManager: Found block rdd_1_0 locally
16/02/24 01:24:46 INFO BlockManager: Found block rdd_1_1 locally
16/02/24 01:24:46 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 2082 bytes result sent to driver
16/02/24 01:24:46 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 2082 bytes result sent to driver
16/02/24 01:24:46 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 34 ms on localhost (1/2)
16/02/24 01:24:46 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 37 ms on localhost (2/2)
Lines with a: 58, Lines with b: 26
16/02/24 01:24:46 INFO DAGScheduler: ResultStage 1 (count at SimpleApp.scala:13) finished in 0.040 s
16/02/24 01:24:46 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/02/24 01:24:46 INFO DAGScheduler: Job 1 finished: count at SimpleApp.scala:13, took 0.068350 s
16/02/24 01:24:46 INFO SparkContext: Invoking stop() from shutdown hook
16/02/24 01:24:46 INFO SparkUI: Stopped Spark web UI at http://192.168.1.15:4040
16/02/24 01:24:46 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/02/24 01:24:46 INFO MemoryStore: MemoryStore cleared
16/02/24 01:24:46 INFO BlockManager: BlockManager stopped
16/02/24 01:24:46 INFO BlockManagerMaster: BlockManagerMaster stopped
16/02/24 01:24:46 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/02/24 01:24:46 INFO SparkContext: Successfully stopped SparkContext
16/02/24 01:24:46 INFO ShutdownHookManager: Shutdown hook called
16/02/24 01:24:46 INFO ShutdownHookManager: Deleting directory C:\Users\Sivaram Konanki\AppData\Local\Temp\spark-861b5aef-6732-45e4-a4f4-6769370c555e
You can add a
Thread.sleep(1000000) // 1000 seconds or more
at the bottom of your Spark job; this will allow you to inspect the web UI in IDEs like IntelliJ while the job is still running.
This is expected behavior. The Spark UI is maintained by the SparkContext, so it cannot stay active after the application has finished and the context has been destroyed.
In standalone mode the information is preserved by the cluster web UI; on Mesos or YARN you can use the history server; but in local mode the only option I am aware of is to keep the application running.
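As a variant of the Thread.sleep approach, you can block until a key press before stopping the context. This is a sketch of the tail end of main, assuming `sc` is the SparkContext created above:

```scala
// Keep the driver (and thus the UI at http://localhost:4040) alive
// until the user presses Enter, then shut down cleanly.
println("Job finished; Spark UI still available at http://localhost:4040")
Console.readLine() // on Scala 2.11+ prefer scala.io.StdIn.readLine()
sc.stop()          // the UI goes down together with the context
```

Unlike a fixed sleep, this lets you inspect the UI for as long as you need and then release the resources deliberately.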

spark import apache library (math)

I am trying to run a simple application with Spark.
This is my Scala file:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.commons.math3.random.RandomDataGenerator

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/home/donbeo/Applications/spark/spark-1.1.0/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    println("A random number")
    val randomData = new RandomDataGenerator()
    println(randomData.nextLong(0, 100))
  }
}
and this is my sbt file
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
libraryDependencies += "org.apache.commons" % "commons-math3" % "3.3"
When I try to run the code I get this error
donbeo@donbeo-HP-EliteBook-Folio-9470m:~/Applications/spark/spark-1.1.0$ ./bin/spark-submit --class "SimpleApp" --master local[4] /home/donbeo/Documents/scala_code/simpleApp/target/scala-2.10/simple-project_2.10-1.0.jar
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/02/04 17:42:41 WARN Utils: Your hostname, donbeo-HP-EliteBook-Folio-9470m resolves to a loopback address: 127.0.1.1; using 192.168.1.45 instead (on interface wlan0)
15/02/04 17:42:41 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/02/04 17:42:41 INFO SecurityManager: Changing view acls to: donbeo,
15/02/04 17:42:41 INFO SecurityManager: Changing modify acls to: donbeo,
15/02/04 17:42:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(donbeo, ); users with modify permissions: Set(donbeo, )
15/02/04 17:42:42 INFO Slf4jLogger: Slf4jLogger started
15/02/04 17:42:42 INFO Remoting: Starting remoting
15/02/04 17:42:42 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.1.45:45935]
15/02/04 17:42:42 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@192.168.1.45:45935]
15/02/04 17:42:42 INFO Utils: Successfully started service 'sparkDriver' on port 45935.
15/02/04 17:42:42 INFO SparkEnv: Registering MapOutputTracker
15/02/04 17:42:42 INFO SparkEnv: Registering BlockManagerMaster
15/02/04 17:42:42 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20150204174242-bbb1
15/02/04 17:42:42 INFO Utils: Successfully started service 'Connection manager for block manager' on port 55674.
15/02/04 17:42:42 INFO ConnectionManager: Bound socket to port 55674 with id = ConnectionManagerId(192.168.1.45,55674)
15/02/04 17:42:42 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/02/04 17:42:42 INFO BlockManagerMaster: Trying to register BlockManager
15/02/04 17:42:42 INFO BlockManagerMasterActor: Registering block manager 192.168.1.45:55674 with 265.4 MB RAM
15/02/04 17:42:42 INFO BlockManagerMaster: Registered BlockManager
15/02/04 17:42:42 INFO HttpFileServer: HTTP File server directory is /tmp/spark-49443053-833e-4596-9073-d74075483d35
15/02/04 17:42:42 INFO HttpServer: Starting HTTP Server
15/02/04 17:42:42 INFO Utils: Successfully started service 'HTTP file server' on port 41309.
15/02/04 17:42:42 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/02/04 17:42:42 INFO SparkUI: Started SparkUI at http://192.168.1.45:4040
15/02/04 17:42:42 INFO SparkContext: Added JAR file:/home/donbeo/Documents/scala_code/simpleApp/target/scala-2.10/simple-project_2.10-1.0.jar at http://192.168.1.45:41309/jars/simple-project_2.10-1.0.jar with timestamp 1423071762914
15/02/04 17:42:42 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.1.45:45935/user/HeartbeatReceiver
15/02/04 17:42:43 INFO MemoryStore: ensureFreeSpace(32768) called with curMem=0, maxMem=278302556
15/02/04 17:42:43 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 32.0 KB, free 265.4 MB)
15/02/04 17:42:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/04 17:42:43 WARN LoadSnappy: Snappy native library not loaded
15/02/04 17:42:43 INFO FileInputFormat: Total input paths to process : 1
15/02/04 17:42:43 INFO SparkContext: Starting job: count at SimpleApp.scala:13
15/02/04 17:42:43 INFO DAGScheduler: Got job 0 (count at SimpleApp.scala:13) with 2 output partitions (allowLocal=false)
15/02/04 17:42:43 INFO DAGScheduler: Final stage: Stage 0(count at SimpleApp.scala:13)
15/02/04 17:42:43 INFO DAGScheduler: Parents of final stage: List()
15/02/04 17:42:43 INFO DAGScheduler: Missing parents: List()
15/02/04 17:42:43 INFO DAGScheduler: Submitting Stage 0 (FilteredRDD[2] at filter at SimpleApp.scala:13), which has no missing parents
15/02/04 17:42:43 INFO MemoryStore: ensureFreeSpace(2616) called with curMem=32768, maxMem=278302556
15/02/04 17:42:43 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.6 KB, free 265.4 MB)
15/02/04 17:42:43 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (FilteredRDD[2] at filter at SimpleApp.scala:13)
15/02/04 17:42:43 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/02/04 17:42:43 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1283 bytes)
15/02/04 17:42:43 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1283 bytes)
15/02/04 17:42:43 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/02/04 17:42:43 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/02/04 17:42:43 INFO Executor: Fetching http://192.168.1.45:41309/jars/simple-project_2.10-1.0.jar with timestamp 1423071762914
15/02/04 17:42:43 INFO Utils: Fetching http://192.168.1.45:41309/jars/simple-project_2.10-1.0.jar to /tmp/fetchFileTemp3120003338190168194.tmp
15/02/04 17:42:43 INFO Executor: Adding file:/tmp/spark-ec5e14c2-9e58-4132-a4c9-2569d237a407/simple-project_2.10-1.0.jar to class loader
15/02/04 17:42:43 INFO CacheManager: Partition rdd_1_0 not found, computing it
15/02/04 17:42:43 INFO CacheManager: Partition rdd_1_1 not found, computing it
15/02/04 17:42:43 INFO HadoopRDD: Input split: file:/home/donbeo/Applications/spark/spark-1.1.0/README.md:0+2405
15/02/04 17:42:43 INFO HadoopRDD: Input split: file:/home/donbeo/Applications/spark/spark-1.1.0/README.md:2405+2406
15/02/04 17:42:43 INFO MemoryStore: ensureFreeSpace(7512) called with curMem=35384, maxMem=278302556
15/02/04 17:42:43 INFO MemoryStore: Block rdd_1_1 stored as values in memory (estimated size 7.3 KB, free 265.4 MB)
15/02/04 17:42:43 INFO BlockManagerInfo: Added rdd_1_1 in memory on 192.168.1.45:55674 (size: 7.3 KB, free: 265.4 MB)
15/02/04 17:42:43 INFO BlockManagerMaster: Updated info of block rdd_1_1
15/02/04 17:42:43 INFO MemoryStore: ensureFreeSpace(8352) called with curMem=42896, maxMem=278302556
15/02/04 17:42:43 INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 8.2 KB, free 265.4 MB)
15/02/04 17:42:43 INFO BlockManagerInfo: Added rdd_1_0 in memory on 192.168.1.45:55674 (size: 8.2 KB, free: 265.4 MB)
15/02/04 17:42:43 INFO BlockManagerMaster: Updated info of block rdd_1_0
15/02/04 17:42:43 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2300 bytes result sent to driver
15/02/04 17:42:43 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2300 bytes result sent to driver
15/02/04 17:42:43 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 179 ms on localhost (1/2)
15/02/04 17:42:43 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 176 ms on localhost (2/2)
15/02/04 17:42:43 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/02/04 17:42:43 INFO DAGScheduler: Stage 0 (count at SimpleApp.scala:13) finished in 0.198 s
15/02/04 17:42:43 INFO SparkContext: Job finished: count at SimpleApp.scala:13, took 0.292364402 s
15/02/04 17:42:43 INFO SparkContext: Starting job: count at SimpleApp.scala:14
15/02/04 17:42:43 INFO DAGScheduler: Got job 1 (count at SimpleApp.scala:14) with 2 output partitions (allowLocal=false)
15/02/04 17:42:43 INFO DAGScheduler: Final stage: Stage 1(count at SimpleApp.scala:14)
15/02/04 17:42:43 INFO DAGScheduler: Parents of final stage: List()
15/02/04 17:42:43 INFO DAGScheduler: Missing parents: List()
15/02/04 17:42:43 INFO DAGScheduler: Submitting Stage 1 (FilteredRDD[3] at filter at SimpleApp.scala:14), which has no missing parents
15/02/04 17:42:43 INFO MemoryStore: ensureFreeSpace(2616) called with curMem=51248, maxMem=278302556
15/02/04 17:42:43 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 265.4 MB)
15/02/04 17:42:43 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (FilteredRDD[3] at filter at SimpleApp.scala:14)
15/02/04 17:42:43 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
15/02/04 17:42:43 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, ANY, 1283 bytes)
15/02/04 17:42:43 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, ANY, 1283 bytes)
15/02/04 17:42:43 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
15/02/04 17:42:43 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
15/02/04 17:42:43 INFO BlockManager: Found block rdd_1_1 locally
15/02/04 17:42:43 INFO BlockManager: Found block rdd_1_0 locally
15/02/04 17:42:43 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1731 bytes result sent to driver
15/02/04 17:42:43 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1731 bytes result sent to driver
15/02/04 17:42:43 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 14 ms on localhost (1/2)
15/02/04 17:42:43 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 17 ms on localhost (2/2)
15/02/04 17:42:43 INFO DAGScheduler: Stage 1 (count at SimpleApp.scala:14) finished in 0.017 s
15/02/04 17:42:43 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/02/04 17:42:43 INFO SparkContext: Job finished: count at SimpleApp.scala:14, took 0.034833058 s
Lines with a: 83, Lines with b: 38
A random number
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomDataGenerator
at SimpleApp$.main(SimpleApp.scala:20)
at SimpleApp.main(SimpleApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.math3.random.RandomDataGenerator
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 9 more
donbeo@donbeo-HP-EliteBook-Folio-9470m:~/Applications/spark/spark-1.1.0$
I think I am doing something wrong when I import the math3 library.
Here is a detailed explanation of how I have installed Spark, how I built the project, and how I submit tasks to Spark.
You need to specify the commons-math3 jar's path; this can be done using the --jars option:
./bin/spark-submit --class "SimpleApp" \
--master local[4] \
--jars <specify-path-of-commons-math3-jar> \
/home/donbeo/Documents/scala_code/simpleApp/target/scala-2.10/simple-project_2.10-1.0.jar
Alternatively, you can build an assembly jar which contains all the dependencies.
EDIT:
How to build the assembly jar:
In file build.sbt:
import AssemblyKeys._
import sbtassembly.Plugin._
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"
libraryDependencies += "org.apache.commons" % "commons-math3" % "3.3"
// This statement includes the assembly plugin capabilities
assemblySettings
// Configure the jar name used with the assembly plug-in
jarName in assembly := "simple-app-assembly.jar"

// A special option to exclude Scala itself from our assembly jar, since Spark
// already bundles Scala.
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
In file project/assembly.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")
Then make an assembly jar as follows:
sbt assembly
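With the assembly jar in place, the earlier spark-submit invocation no longer needs --jars (the jar path below assumes the default sbt output location):

```shell
# Submit the fat jar; commons-math3 is bundled inside it, while
# spark-core is marked "provided" and comes from the Spark install.
./bin/spark-submit --class "SimpleApp" \
  --master local[4] \
  target/scala-2.10/simple-app-assembly.jar
```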