RDD.map function hangs in Spark - scala

I am trying to concatenate two Spark DataFrames of equal length, like this:
DF1 -
| A |
| 1 |
| 2 |
| 3 |
| 4 |
DF2 -
| B |
| a |
| b |
| c |
| d |
Result DF -
| A | B |
| 1 | a |
| 2 | b |
| 3 | c |
| 4 | d |
For this, I am using the code below:
val combinedRow = df1.rdd.zip(df2.select("B").rdd).map {
  case (df1Data, df2Data) =>
    Row.fromSeq(df1Data.toSeq ++ df2Data.toSeq)
}
val combinedSchema = StructType(df1.schema.fields ++ df2.select("B").schema.fields)
val resultDF = spark.sqlContext.createDataFrame(combinedRow, combinedSchema)
But the code does not make any progress. It does not throw any exception either; it is just stuck.
Any suggestions as to what may be wrong here? Thanks in advance.
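For reference, the element-wise pairing I am after can be sketched with plain Scala collections (illustration only, no Spark involved):

```scala
// Illustration only: the element-wise concatenation the Spark code above intends,
// using plain Scala collections instead of RDDs of Rows.
object ZipSketch extends App {
  val df1Rows = Seq(Seq("1"), Seq("2"), Seq("3"), Seq("4")) // rows of DF1
  val df2Rows = Seq(Seq("a"), Seq("b"), Seq("c"), Seq("d")) // rows of DF2
  // zip pairs the i-th row of each side; ++ concatenates the row values
  val combined = df1Rows.zip(df2Rows).map { case (r1, r2) => r1 ++ r2 }
  combined.foreach(r => println(r.mkString("| ", " | ", " |")))
}
```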
EDIT:
Below are the logs generated after the most recent statement executed successfully.
[main] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 13.848847 ms
[broadcast-exchange-0] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 14.323824 ms
[broadcast-exchange-0] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_35 stored as values in memory (estimated size 1024.1 KB, free 871.5 MB)
[broadcast-exchange-0] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_35_piece0 stored as bytes in memory (estimated size 417.0 B, free 871.5 MB)
[dispatcher-event-loop-3] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_35_piece0 in memory on 192.168.20.181:38202 (size: 417.0 B, free: 872.9 MB)
[broadcast-exchange-0] INFO org.apache.spark.SparkContext - Created broadcast 35 from run at ThreadPoolExecutor.java:1142
[main] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 27.697751 ms
[main] INFO org.apache.spark.SparkContext - Starting job: show at Train.scala:180
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 19 (show at Train.scala:180) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 31 (show at Train.scala:180)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 31 (MapPartitionsRDD[106] at show at Train.scala:180), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_36 stored as values in memory (estimated size 14.3 KB, free 871.5 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_36_piece0 stored as bytes in memory (estimated size 6.4 KB, free 871.5 MB)
[dispatcher-event-loop-2] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_36_piece0 in memory on 192.168.20.181:38202 (size: 6.4 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 36 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 31 (MapPartitionsRDD[106] at show at Train.scala:180)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 31.0 with 1 tasks
[dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 31.0 (TID 1267, localhost, executor driver, partition 0, PROCESS_LOCAL, 5961 bytes)
[Executor task launch worker for task 1267] INFO org.apache.spark.executor.Executor - Running task 0.0 in stage 31.0 (TID 1267)
[Executor task launch worker for task 1267] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 32.758147 ms
[Executor task launch worker for task 1267] INFO org.apache.spark.SparkContext - Starting job: head at Train.scala:161
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 20 (head at Train.scala:161) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 32 (head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 32 (MapPartitionsRDD[110] at head at Train.scala:161), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_37 stored as values in memory (estimated size 26.9 KB, free 871.4 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_37_piece0 stored as bytes in memory (estimated size 12.7 KB, free 871.4 MB)
[dispatcher-event-loop-3] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_37_piece0 in memory on 192.168.20.181:38202 (size: 12.7 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 37 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 32 (MapPartitionsRDD[110] at head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 32.0 with 1 tasks
[dispatcher-event-loop-2] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 32.0 (TID 1268, localhost, executor driver, partition 0, PROCESS_LOCAL, 5813 bytes)
[Executor task launch worker for task 1268] INFO org.apache.spark.executor.Executor - Running task 0.0 in stage 32.0 (TID 1268)
[Executor task launch worker for task 1268] INFO org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD - closed connection
[Executor task launch worker for task 1268] INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 32.0 (TID 1268). 1979 bytes result sent to driver
[task-result-getter-3] INFO org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 32.0 (TID 1268) in 132 ms on localhost (executor driver) (1/1)
[task-result-getter-3] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 32.0, whose tasks have all completed, from pool
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - ResultStage 32 (head at Train.scala:161) finished in 0.128 s
[Executor task launch worker for task 1267] INFO org.apache.spark.scheduler.DAGScheduler - Job 20 finished: head at Train.scala:161, took 0.140223 s
[Executor task launch worker for task 1267] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 8.366053 ms
[Executor task launch worker for task 1267] INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 31.0 (TID 1267). 1501 bytes result sent to driver
[task-result-getter-0] INFO org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 31.0 (TID 1267) in 393 ms on localhost (executor driver) (1/1)
[task-result-getter-0] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 31.0, whose tasks have all completed, from pool
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - ResultStage 31 (show at Train.scala:180) finished in 0.393 s
[main] INFO org.apache.spark.scheduler.DAGScheduler - Job 19 finished: show at Train.scala:180, took 0.413534 s
[main] INFO org.apache.spark.SparkContext - Starting job: show at Train.scala:180
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 21 (show at Train.scala:180) with 4 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 33 (show at Train.scala:180)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 33 (MapPartitionsRDD[106] at show at Train.scala:180), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_38 stored as values in memory (estimated size 14.3 KB, free 871.4 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_38_piece0 stored as bytes in memory (estimated size 6.4 KB, free 871.4 MB)
[dispatcher-event-loop-2] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_38_piece0 in memory on 192.168.20.181:38202 (size: 6.4 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 38 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 4 missing tasks from ResultStage 33 (MapPartitionsRDD[106] at show at Train.scala:180)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 33.0 with 4 tasks
[dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 33.0 (TID 1269, localhost, executor driver, partition 1, PROCESS_LOCAL, 5961 bytes)
[dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 1.0 in stage 33.0 (TID 1270, localhost, executor driver, partition 2, PROCESS_LOCAL, 5961 bytes)
[dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 2.0 in stage 33.0 (TID 1271, localhost, executor driver, partition 3, PROCESS_LOCAL, 5961 bytes)
[dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 3.0 in stage 33.0 (TID 1272, localhost, executor driver, partition 4, PROCESS_LOCAL, 5961 bytes)
[Executor task launch worker for task 1269] INFO org.apache.spark.executor.Executor - Running task 0.0 in stage 33.0 (TID 1269)
[Executor task launch worker for task 1271] INFO org.apache.spark.executor.Executor - Running task 2.0 in stage 33.0 (TID 1271)
[Executor task launch worker for task 1272] INFO org.apache.spark.executor.Executor - Running task 3.0 in stage 33.0 (TID 1272)
[Executor task launch worker for task 1270] INFO org.apache.spark.executor.Executor - Running task 1.0 in stage 33.0 (TID 1270)
[Executor task launch worker for task 1269] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 55.127045 ms
[Executor task launch worker for task 1271] INFO org.apache.spark.SparkContext - Starting job: head at Train.scala:161
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 22 (head at Train.scala:161) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 34 (head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 34 (MapPartitionsRDD[117] at head at Train.scala:161), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_39 stored as values in memory (estimated size 26.9 KB, free 871.4 MB)
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned shuffle 10
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 31267
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 31268
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25303
[dispatcher-event-loop-3] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_34_piece0 on 192.168.20.181:38202 in memory (size: 22.8 KB, free: 872.9 MB)
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25298
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25304
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 31269
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned shuffle 11
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25299
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25301
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25300
[dispatcher-event-loop-2] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_37_piece0 on 192.168.20.181:38202 in memory (size: 12.7 KB, free: 872.9 MB)
[dispatcher-event-loop-0] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_33_piece0 on 192.168.20.181:38202 in memory (size: 22.6 KB, free: 872.9 MB)
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25305
[dispatcher-event-loop-3] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_36_piece0 on 192.168.20.181:38202 in memory (size: 6.4 KB, free: 873.0 MB)
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25302
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_39_piece0 stored as bytes in memory (estimated size 12.7 KB, free 871.6 MB)
[dispatcher-event-loop-0] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_39_piece0 in memory on 192.168.20.181:38202 (size: 12.7 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 39 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 34 (MapPartitionsRDD[117] at head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 34.0 with 1 tasks
[Executor task launch worker for task 1272] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 92.57204 ms
[Executor task launch worker for task 1269] INFO org.apache.spark.SparkContext - Starting job: head at Train.scala:161
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 23 (head at Train.scala:161) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 35 (head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 35 (MapPartitionsRDD[122] at head at Train.scala:161), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_40 stored as values in memory (estimated size 26.9 KB, free 871.6 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_40_piece0 stored as bytes in memory (estimated size 12.7 KB, free 871.5 MB)
[dispatcher-event-loop-1] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_40_piece0 in memory on 192.168.20.181:38202 (size: 12.7 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 40 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 35 (MapPartitionsRDD[122] at head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 35.0 with 1 tasks
[Executor task launch worker for task 1270] INFO org.apache.spark.SparkContext - Starting job: head at Train.scala:161
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 24 (head at Train.scala:161) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 36 (head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 36 (MapPartitionsRDD[124] at head at Train.scala:161), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_41 stored as values in memory (estimated size 26.9 KB, free 871.5 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_41_piece0 stored as bytes in memory (estimated size 12.7 KB, free 871.5 MB)
[dispatcher-event-loop-0] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_41_piece0 in memory on 192.168.20.181:38202 (size: 12.7 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 41 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 36 (MapPartitionsRDD[124] at head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 36.0 with 1 tasks
[Executor task launch worker for task 1272] INFO org.apache.spark.SparkContext - Starting job: head at Train.scala:161
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 25 (head at Train.scala:161) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 37 (head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 37 (MapPartitionsRDD[126] at head at Train.scala:161), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_42 stored as values in memory (estimated size 26.9 KB, free 871.5 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_42_piece0 stored as bytes in memory (estimated size 12.7 KB, free 871.5 MB)
[dispatcher-event-loop-1] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_42_piece0 in memory on 192.168.20.181:38202 (size: 12.7 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 42 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 37 (MapPartitionsRDD[126] at head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 37.0 with 1 tasks
And it is stuck here.

I can't replicate any error in your code; it should work fine in your case as well.
You can also simply join the two dataframes on an id assigned to both:
df1.withColumn("id", monotonically_increasing_id())
  .join(df2.withColumn("id", monotonically_increasing_id()), "id")
  .drop("id")
Note that monotonically_increasing_id only guarantees increasing, unique ids, not consecutive ones, so the ids line up across the two dataframes only when both have the same number of partitions and the same number of rows per partition.
Hope this helps!
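The idea behind the id-join can be sketched with plain Scala collections (no Spark; the names here are illustrative only):

```scala
// Illustration only: assign an index to each row, join on the index, drop it.
// This mirrors withColumn("id", monotonically_increasing_id()) + join + drop.
object IdJoinSketch extends App {
  val a = Seq("1", "2", "3", "4")
  val b = Seq("a", "b", "c", "d")
  val withId1 = a.zipWithIndex.map(_.swap).toMap // id -> value of A
  val withId2 = b.zipWithIndex.map(_.swap).toMap // id -> value of B
  // "Join" on the shared id, in id order, then drop the id
  val joined = withId1.keys.toSeq.sorted.map(id => (withId1(id), withId2(id)))
  joined.foreach(println) // prints (1,a) through (4,d), one per line
}
```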

You can also use zipWithIndex on both RDDs:
val df1 = sc.parallelize(Seq("1", "2", "3", "4")).toDF("A")
val df2 = sc.parallelize(Seq("a", "b", "c", "d")).toDF("B")
// Key each row by its index, then join on the index.
val zip1 = df1.rdd.zipWithIndex.map { case (row, idx) => (idx, row.mkString) }
val zip2 = df2.rdd.zipWithIndex.map { case (row, idx) => (idx, row.mkString) }
// join does not guarantee order, so sort by the index before dropping it.
zip1.join(zip2).sortByKey().map { case (_, v) => v }.collect()

Related

what happens when I use a global map variable in scala without broadcasting

In Scala, what happens when I use a global map variable without broadcasting it?
E.g., if I get a variable using collect* (such as collectAsMap), it appears to behave like a global variable, and I can use it in all RDD.mapValues() functions without explicitly broadcasting it.
BUT I know Spark works in a distributed fashion, and it should not be able to access a variable held in driver memory without broadcasting it. So what happened?
Code example (this code computes tf-idf over text, where the document frequencies are stored in a Map):
// dfMap is a String -> Int map held in driver memory
// Array[(String, Int)] = Array((B,2), (A,3), (C,1))
val dfMap = dfrdd.collectAsMap
// tfrdd is an RDD, and dfMap can be used inside its mapValues function
// tfrdd: Array((doc1, Map(A -> 3.0)), (doc2, Map(A -> 2.0, B -> 1.0)))
val tfidfrdd = tfrdd.mapValues(e => e.map(x => x._1 -> x._2 * lineNum / dfMap.getOrElse(x._1, 1)))
tfidfrdd.saveAsTextFile("/somedir/result/")
The code works just fine. My question is: what happened there? Does the driver send dfMap to all workers, just as broadcasting would, or something else?
What is the difference if I broadcast explicitly, like this:
val dfMap = sc.broadcast(dfrdd.collectAsMap)
val tfidfrdd = tfrdd.mapValues(e => e.map(x => x._1 -> x._2 * lineNum / dfMap.value.getOrElse(x._1, 1)))
I've checked more resources, aggregated others' answers, and put them in order. The difference between using an external variable DIRECTLY (my so-called "global variable") and BROADCASTING a variable with sc.broadcast() is as follows:
1) When an external variable is used directly, Spark sends a copy of the serialized variable along with each TASK, whereas with sc.broadcast the variable is sent once per EXECUTOR. The number of tasks is typically an order of magnitude larger than the number of executors.
So when the variable (say, a map) is large enough (more than 20 KB), the former can spend a lot of time on network transfer and cause frequent GC, which slows Spark down. Hence it is suggested that large variables (>20 KB) be broadcast explicitly.
2) When an external variable is used directly, it is not persisted; it ends with the task and thus cannot be reused. With sc.broadcast(), the variable is automatically persisted in the executors' memory and lasts until you explicitly unpersist it, so a broadcast variable is available across tasks and stages.
So if the variable is expected to be used multiple times, sc.broadcast() is suggested.
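To put rough numbers on point 1, here is a back-of-envelope sketch; the executor and task counts are hypothetical, chosen only for illustration:

```scala
// Hypothetical back-of-envelope: total data shipped for a 20 KB captured variable.
object TransferCost extends App {
  val sizeKB    = 20    // size of the captured map
  val executors = 10    // hypothetical executor count
  val tasks     = 1000  // hypothetical task count over the job's lifetime
  // Direct closure capture: one serialized copy shipped per task
  val perTaskKB = sizeKB * tasks
  // sc.broadcast: one copy shipped per executor
  val perExecutorKB = sizeKB * executors
  println(s"closure capture: $perTaskKB KB, broadcast: $perExecutorKB KB")
}
```

With these (made-up) numbers, closure capture ships 20000 KB in total versus 200 KB for broadcast, a 100x difference.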
There is no difference between a global map variable and a broadcast variable. If we use a global variable in a map function of an RDD, it will be broadcast to all nodes. For example:
scala> val list = List(1,2,3)
list: List[Int] = List(1, 2, 3)
scala> val rdd = sc.parallelize(List(1,2,3,4))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:24
scala> rdd.filter(elem => list.contains(elem)).collect
17/03/16 10:21:53 INFO SparkContext: Starting job: collect at <console>:29
17/03/16 10:21:53 INFO DAGScheduler: Got job 3 (collect at <console>:29) with 4 output partitions
17/03/16 10:21:53 INFO DAGScheduler: Final stage: ResultStage 3 (collect at <console>:29)
17/03/16 10:21:53 INFO DAGScheduler: Parents of final stage: List()
17/03/16 10:21:53 INFO DAGScheduler: Missing parents: List()
17/03/16 10:21:53 DEBUG DAGScheduler: submitStage(ResultStage 3)
17/03/16 10:21:53 DEBUG DAGScheduler: missing: List()
17/03/16 10:21:53 INFO DAGScheduler: Submitting ResultStage 3 (MapPartitionsRDD[5] at filter at <console>:29), which has no missing parents
17/03/16 10:21:53 DEBUG DAGScheduler: submitMissingTasks(ResultStage 3)
17/03/16 10:21:53 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 5.0 KB, free 366.3 MB)
17/03/16 10:21:53 DEBUG BlockManager: Put block broadcast_4 locally took 1 ms
17/03/16 10:21:53 DEBUG BlockManager: Putting block broadcast_4 without replication took 1 ms
17/03/16 10:21:53 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 2.5 KB, free 366.3 MB)
17/03/16 10:21:53 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.2.123:37645 (size: 2.5 KB, free: 366.3 MB)
17/03/16 10:21:53 DEBUG BlockManagerMaster: Updated info of block broadcast_4_piece0
17/03/16 10:21:53 DEBUG BlockManager: Told master about block broadcast_4_piece0
17/03/16 10:21:53 DEBUG BlockManager: Put block broadcast_4_piece0 locally took 2 ms
17/03/16 10:21:53 DEBUG ContextCleaner: Got cleaning task CleanBroadcast(1)
17/03/16 10:21:53 DEBUG BlockManager: Putting block broadcast_4_piece0 without replication took 2 ms
17/03/16 10:21:53 DEBUG ContextCleaner: Cleaning broadcast 1
17/03/16 10:21:53 DEBUG TorrentBroadcast: Unpersisting TorrentBroadcast 1
17/03/16 10:21:53 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:996
17/03/16 10:21:53 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 3 (MapPartitionsRDD[5] at filter at <console>:29)
17/03/16 10:21:53 DEBUG DAGScheduler: New pending partitions: Set(0, 1, 2, 3)
17/03/16 10:21:53 INFO TaskSchedulerImpl: Adding task set 3.0 with 4 tasks
17/03/16 10:21:53 DEBUG TaskSetManager: Epoch for TaskSet 3.0: 0
17/03/16 10:21:53 DEBUG TaskSetManager: Valid locality levels for TaskSet 3.0: NO_PREF, ANY
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 0
17/03/16 10:21:53 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 12, localhost, executor driver, partition 0, PROCESS_LOCAL, 5886 bytes)
17/03/16 10:21:53 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID 13, localhost, executor driver, partition 1, PROCESS_LOCAL, 5886 bytes)
17/03/16 10:21:53 INFO TaskSetManager: Starting task 2.0 in stage 3.0 (TID 14, localhost, executor driver, partition 2, PROCESS_LOCAL, 5886 bytes)
17/03/16 10:21:53 INFO TaskSetManager: Starting task 3.0 in stage 3.0 (TID 15, localhost, executor driver, partition 3, PROCESS_LOCAL, 5886 bytes)
17/03/16 10:21:53 INFO Executor: Running task 0.0 in stage 3.0 (TID 12)
17/03/16 10:21:53 DEBUG Executor: Task 12's epoch is 0
17/03/16 10:21:53 DEBUG BlockManager: Getting local block broadcast_4
17/03/16 10:21:53 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:21:53 INFO Executor: Running task 2.0 in stage 3.0 (TID 14)
17/03/16 10:21:53 INFO Executor: Running task 1.0 in stage 3.0 (TID 13)
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: removing broadcast 1
17/03/16 10:21:53 DEBUG BlockManager: Removing broadcast 1
17/03/16 10:21:53 DEBUG BlockManager: Removing block broadcast_1
17/03/16 10:21:53 INFO Executor: Running task 3.0 in stage 3.0 (TID 15)
17/03/16 10:21:53 DEBUG Executor: Task 13's epoch is 0
17/03/16 10:21:53 DEBUG MemoryStore: Block broadcast_1 of size 5112 dropped from memory (free 384072627)
17/03/16 10:21:53 DEBUG BlockManager: Removing block broadcast_1_piece0
17/03/16 10:21:53 DEBUG MemoryStore: Block broadcast_1_piece0 of size 2535 dropped from memory (free 384075162)
17/03/16 10:21:53 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.2.123:37645 in memory (size: 2.5 KB, free: 366.3 MB)
17/03/16 10:21:53 DEBUG BlockManagerMaster: Updated info of block broadcast_1_piece0
17/03/16 10:21:53 DEBUG BlockManager: Told master about block broadcast_1_piece0
17/03/16 10:21:53 DEBUG BlockManager: Getting local block broadcast_4
17/03/16 10:21:53 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:21:53 DEBUG Executor: Task 14's epoch is 0
17/03/16 10:21:53 DEBUG BlockManager: Getting local block broadcast_4
17/03/16 10:21:53 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:21:53 DEBUG Executor: Task 15's epoch is 0
17/03/16 10:21:53 DEBUG BlockManager: Getting local block broadcast_4
17/03/16 10:21:53 DEBUG BlockManager: Level for block broadcast_4 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: Done removing broadcast 1, response is 0
17/03/16 10:21:53 DEBUG ContextCleaner: Cleaned broadcast 1
17/03/16 10:21:53 DEBUG ContextCleaner: Got cleaning task CleanBroadcast(3)
17/03/16 10:21:53 DEBUG ContextCleaner: Cleaning broadcast 3
17/03/16 10:21:53 DEBUG TorrentBroadcast: Unpersisting TorrentBroadcast 3
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: removing broadcast 3
17/03/16 10:21:53 DEBUG BlockManager: Removing broadcast 3
17/03/16 10:21:53 DEBUG BlockManager: Removing block broadcast_3_piece0
17/03/16 10:21:53 DEBUG MemoryStore: Block broadcast_3_piece0 of size 3309 dropped from memory (free 384078471)
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: Sent response: 0 to 192.168.2.123:40909
17/03/16 10:21:53 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 192.168.2.123:37645 in memory (size: 3.2 KB, free: 366.3 MB)
17/03/16 10:21:53 DEBUG BlockManagerMaster: Updated info of block broadcast_3_piece0
17/03/16 10:21:53 DEBUG BlockManager: Told master about block broadcast_3_piece0
17/03/16 10:21:53 DEBUG BlockManager: Removing block broadcast_3
17/03/16 10:21:53 DEBUG MemoryStore: Block broadcast_3 of size 6904 dropped from memory (free 384085375)
17/03/16 10:21:53 INFO Executor: Finished task 1.0 in stage 3.0 (TID 13). 912 bytes result sent to driver
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: Done removing broadcast 3, response is 0
17/03/16 10:21:53 DEBUG BlockManagerSlaveEndpoint: Sent response: 0 to 192.168.2.123:40909
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 3
17/03/16 10:21:53 DEBUG TaskSetManager: No tasks for locality level NO_PREF, so moving to locality level ANY
17/03/16 10:21:53 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 13) in 36 ms on localhost (executor driver) (1/4)
17/03/16 10:21:53 INFO Executor: Finished task 2.0 in stage 3.0 (TID 14). 912 bytes result sent to driver
17/03/16 10:21:53 DEBUG ContextCleaner: Cleaned broadcast 3
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 2
17/03/16 10:21:53 INFO Executor: Finished task 0.0 in stage 3.0 (TID 12). 912 bytes result sent to driver
17/03/16 10:21:53 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 14) in 36 ms on localhost (executor driver) (2/4)
17/03/16 10:21:53 INFO Executor: Finished task 3.0 in stage 3.0 (TID 15). 908 bytes result sent to driver
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 1
17/03/16 10:21:53 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_3.0, runningTasks: 0
17/03/16 10:21:53 INFO TaskSetManager: Finished task 3.0 in stage 3.0 (TID 15) in 36 ms on localhost (executor driver) (3/4)
17/03/16 10:21:53 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 12) in 45 ms on localhost (executor driver) (4/4)
17/03/16 10:21:53 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
17/03/16 10:21:53 INFO DAGScheduler: ResultStage 3 (collect at <console>:29) finished in 0.045 s
17/03/16 10:21:53 DEBUG DAGScheduler: After removal of stage 3, remaining stages = 0
17/03/16 10:21:53 INFO DAGScheduler: Job 3 finished: collect at <console>:29, took 0.097564 s
res4: Array[Int] = Array(1, 2, 3)
In the above log we can clearly see that the global variable list is broadcast. The same happens when we explicitly broadcast the list:
scala> val br = sc.broadcast(list)
17/03/16 10:26:40 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 160.0 B, free 366.3 MB)
17/03/16 10:26:40 DEBUG BlockManager: Put block broadcast_5 locally took 1 ms
17/03/16 10:26:40 DEBUG BlockManager: Putting block broadcast_5 without replication took 1 ms
17/03/16 10:26:40 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 227.0 B, free 366.3 MB)
17/03/16 10:26:40 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.2.123:37645 (size: 227.0 B, free: 366.3 MB)
17/03/16 10:26:40 DEBUG BlockManagerMaster: Updated info of block broadcast_5_piece0
17/03/16 10:26:40 DEBUG BlockManager: Told master about block broadcast_5_piece0
17/03/16 10:26:40 DEBUG BlockManager: Put block broadcast_5_piece0 locally took 1 ms
17/03/16 10:26:40 DEBUG BlockManager: Putting block broadcast_5_piece0 without replication took 1 ms
17/03/16 10:26:40 INFO SparkContext: Created broadcast 5 from broadcast at <console>:26
br: org.apache.spark.broadcast.Broadcast[List[Int]] = Broadcast(5)
scala> rdd.filter(elem => br.value.contains(elem)).collect
17/03/16 10:27:50 INFO SparkContext: Starting job: collect at <console>:31
17/03/16 10:27:50 INFO DAGScheduler: Got job 0 (collect at <console>:31) with 4 output partitions
17/03/16 10:27:50 INFO DAGScheduler: Final stage: ResultStage 0 (collect at <console>:31)
17/03/16 10:27:50 INFO DAGScheduler: Parents of final stage: List()
17/03/16 10:27:50 INFO DAGScheduler: Missing parents: List()
17/03/16 10:27:50 DEBUG DAGScheduler: submitStage(ResultStage 0)
17/03/16 10:27:50 DEBUG DAGScheduler: missing: List()
17/03/16 10:27:50 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at filter at <console>:31), which has no missing parents
17/03/16 10:27:50 DEBUG DAGScheduler: submitMissingTasks(ResultStage 0)
17/03/16 10:27:50 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 6.7 KB, free 366.3 MB)
17/03/16 10:27:50 DEBUG BlockManager: Put block broadcast_1 locally took 6 ms
17/03/16 10:27:50 DEBUG BlockManager: Putting block broadcast_1 without replication took 6 ms
17/03/16 10:27:50 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.2 KB, free 366.3 MB)
17/03/16 10:27:50 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.2.123:37303 (size: 3.2 KB, free: 366.3 MB)
17/03/16 10:27:50 DEBUG BlockManagerMaster: Updated info of block broadcast_1_piece0
17/03/16 10:27:50 DEBUG BlockManager: Told master about block broadcast_1_piece0
17/03/16 10:27:50 DEBUG BlockManager: Put block broadcast_1_piece0 locally took 2 ms
17/03/16 10:27:50 DEBUG BlockManager: Putting block broadcast_1_piece0 without replication took 2 ms
17/03/16 10:27:50 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996
17/03/16 10:27:50 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at filter at <console>:31)
17/03/16 10:27:50 DEBUG DAGScheduler: New pending partitions: Set(0, 1, 2, 3)
17/03/16 10:27:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
17/03/16 10:27:50 DEBUG TaskSetManager: Epoch for TaskSet 0.0: 0
17/03/16 10:27:50 DEBUG TaskSetManager: Valid locality levels for TaskSet 0.0: NO_PREF, ANY
17/03/16 10:27:50 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 0
17/03/16 10:27:50 DEBUG TaskSetManager: Valid locality levels for TaskSet 0.0: NO_PREF, ANY
17/03/16 10:27:51 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 5885 bytes)
17/03/16 10:27:51 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 5885 bytes)
17/03/16 10:27:51 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 5885 bytes)
17/03/16 10:27:51 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 5885 bytes)
17/03/16 10:27:51 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
17/03/16 10:27:51 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/03/16 10:27:51 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
17/03/16 10:27:51 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
17/03/16 10:27:51 DEBUG Executor: Task 0's epoch is 0
17/03/16 10:27:51 DEBUG Executor: Task 2's epoch is 0
17/03/16 10:27:51 DEBUG Executor: Task 3's epoch is 0
17/03/16 10:27:51 DEBUG Executor: Task 1's epoch is 0
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_1
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_1
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_1
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_1
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_1 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_0
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_0
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_0
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 DEBUG BlockManager: Getting local block broadcast_0
17/03/16 10:27:51 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(disk, memory, deserialized, 1 replicas)
17/03/16 10:27:51 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 908 bytes result sent to driver
17/03/16 10:27:51 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 999 bytes result sent to driver
17/03/16 10:27:51 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 912 bytes result sent to driver
17/03/16 10:27:51 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 912 bytes result sent to driver
17/03/16 10:27:51 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 3
17/03/16 10:27:51 DEBUG TaskSetManager: No tasks for locality level NO_PREF, so moving to locality level ANY
17/03/16 10:27:51 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 2
17/03/16 10:27:51 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 1
17/03/16 10:27:51 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_0.0, runningTasks: 0
17/03/16 10:27:51 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 165 ms on localhost (executor driver) (1/4)
17/03/16 10:27:51 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 180 ms on localhost (executor driver) (2/4)
17/03/16 10:27:51 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 249 ms on localhost (executor driver) (3/4)
17/03/16 10:27:51 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 186 ms on localhost (executor driver) (4/4)
17/03/16 10:27:51 INFO DAGScheduler: ResultStage 0 (collect at <console>:31) finished in 0.264 s
17/03/16 10:27:51 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/03/16 10:27:51 DEBUG DAGScheduler: After removal of stage 0, remaining stages = 0
17/03/16 10:27:51 INFO DAGScheduler: Job 0 finished: collect at <console>:31, took 0.381615 s
res1: Array[Int] = Array(1, 2, 3)
The same is the case with an explicit broadcast variable.
When you broadcast, the data is cached on all the nodes, so when you perform an action (collect, saveAsTextFile, head) the broadcast values are already available on every worker node.
But if you do not broadcast the value, it is serialized into the task closure and shipped from the driver to the workers with every task that uses it.
First off, this is a Spark thing, not a Scala one.
The difference is that closure-captured values are shipped every time they are used, whereas explicit broadcast values are cached on the executors.
"Broadcast variables are created from a variable v by calling
SparkContext.broadcast(v). The broadcast variable is a wrapper around
v, and its value can be accessed by calling the value method ... After the broadcast variable is created, it should
be used instead of the value v in any functions run on the cluster so
that v is not shipped to the nodes more than once"
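The filtering logic itself is identical in both cases; only how `list` reaches the executors differs. A minimal Spark-free sketch of what each task computes once it has the lookup values locally (plain Scala, no cluster involved; names are illustrative):

```scala
object BroadcastFilterSketch {
  def main(args: Array[String]): Unit = {
    // Stand-ins: `lookup` plays the role of the (broadcast) list,
    // `partition` the role of one RDD partition's elements.
    val lookup = List(1, 2, 3)
    val partition = (1 to 8).toList
    // The per-element predicate from rdd.filter(elem => br.value.contains(elem)).
    val kept = partition.filter(elem => lookup.contains(elem))
    println(kept.mkString(", ")) // 1, 2, 3
  }
}
```

Whether `lookup` arrived via a broadcast or via the task closure, the executor-side computation is the same; broadcasting only avoids re-shipping it per task.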

Can't write to cluster if replication_factor is greater than 1

I'm using Spark 1.6.1, Cassandra 2.2.3 and the Spark-Cassandra connector 1.6.
I have already written to a multi-node cluster successfully, but only with replication_factor: 1.
Now I'm trying to write to a 6-node cluster (one seed node) with a keyspace that has replication_factor > 1, but Spark is not responding and refuses to do it.
As I mentioned, it works when I write to the coordinator with a keyspace whose replication_factor is set to 1.
This is the log I'm getting; it always stops here, or after half an hour it starts cleaning accumulators and then hangs again.
16/08/16 17:07:03 INFO NettyUtil: Found Netty's native epoll transport in the classpath, using it
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.1:9042 added
16/08/16 17:07:04 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.1 (datacenter1)
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.2:9042 added
16/08/16 17:07:04 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.2 (datacenter1)
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.3:9042 added
16/08/16 17:07:04 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.3 (datacenter1)
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.4:9042 added
16/08/16 17:07:04 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.4 (datacenter1)
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.5:9042 added
16/08/16 17:07:04 INFO LocalNodeFirstLoadBalancingPolicy: Added host 127.0.0.5 (datacenter1)
16/08/16 17:07:04 INFO Cluster: New Cassandra host /127.0.0.6:9042 added
16/08/16 17:07:04 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
16/08/16 17:07:05 INFO SparkContext: Starting job: take at CassandraRDD.scala:121
16/08/16 17:07:05 INFO DAGScheduler: Got job 3 (take at CassandraRDD.scala:121) with 1 output partitions
16/08/16 17:07:05 INFO DAGScheduler: Final stage: ResultStage 4 (take at CassandraRDD.scala:121)
16/08/16 17:07:05 INFO DAGScheduler: Parents of final stage: List()
16/08/16 17:07:05 INFO DAGScheduler: Missing parents: List()
16/08/16 17:07:05 INFO DAGScheduler: Submitting ResultStage 4 (CassandraTableScanRDD[17] at RDD at CassandraRDD.scala:18), which has no missing parents
16/08/16 17:07:05 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 8.3 KB, free 170.5 KB)
16/08/16 17:07:05 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 4.2 KB, free 174.7 KB)
16/08/16 17:07:05 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:43680 (size: 4.2 KB, free: 756.4 MB)
16/08/16 17:07:05 INFO SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1006
16/08/16 17:07:05 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (CassandraTableScanRDD[17] at RDD at CassandraRDD.scala:18)
16/08/16 17:07:05 INFO TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
16/08/16 17:07:05 INFO TaskSetManager: Starting task 0.0 in stage 4.0 (TID 204, localhost, partition 0,NODE_LOCAL, 22553 bytes)
16/08/16 17:07:05 INFO Executor: Running task 0.0 in stage 4.0 (TID 204)
16/08/16 17:07:06 INFO Executor: Finished task 0.0 in stage 4.0 (TID 204). 2074 bytes result sent to driver
16/08/16 17:07:06 INFO TaskSetManager: Finished task 0.0 in stage 4.0 (TID 204) in 1267 ms on localhost (1/1)
16/08/16 17:07:06 INFO DAGScheduler: ResultStage 4 (take at CassandraRDD.scala:121) finished in 1.276 s
16/08/16 17:07:06 INFO TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
16/08/16 17:07:06 INFO DAGScheduler: Job 3 finished: take at CassandraRDD.scala:121, took 1.310929 s
16/08/16 17:07:06 INFO SparkContext: Starting job: take at CassandraRDD.scala:121
16/08/16 17:07:06 INFO DAGScheduler: Got job 4 (take at CassandraRDD.scala:121) with 4 output partitions
16/08/16 17:07:06 INFO DAGScheduler: Final stage: ResultStage 5 (take at CassandraRDD.scala:121)
16/08/16 17:07:06 INFO DAGScheduler: Parents of final stage: List()
16/08/16 17:07:06 INFO DAGScheduler: Missing parents: List()
16/08/16 17:07:06 INFO DAGScheduler: Submitting ResultStage 5 (CassandraTableScanRDD[17] at RDD at CassandraRDD.scala:18), which has no missing parents
16/08/16 17:07:06 INFO MemoryStore: Block broadcast_8 stored as values in memory (estimated size 8.4 KB, free 183.1 KB)
16/08/16 17:07:06 INFO MemoryStore: Block broadcast_8_piece0 stored as bytes in memory (estimated size 4.2 KB, free 187.3 KB)
16/08/16 17:07:06 INFO BlockManagerInfo: Added broadcast_8_piece0 in memory on localhost:43680 (size: 4.2 KB, free: 756.3 MB)
16/08/16 17:07:06 INFO SparkContext: Created broadcast 8 from broadcast at DAGScheduler.scala:1006
16/08/16 17:07:06 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 5 (CassandraTableScanRDD[17] at RDD at CassandraRDD.scala:18)
16/08/16 17:07:06 INFO TaskSchedulerImpl: Adding task set 5.0 with 4 tasks
16/08/16 17:07:06 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 205, localhost, partition 1,NODE_LOCAL, 22553 bytes)
16/08/16 17:07:06 INFO Executor: Running task 0.0 in stage 5.0 (TID 205)
16/08/16 17:07:07 INFO Executor: Finished task 0.0 in stage 5.0 (TID 205). 2074 bytes result sent to driver
16/08/16 17:07:07 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 205) in 706 ms on localhost (1/4)
16/08/16 17:07:14 INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
16/08/16 17:32:40 INFO BlockManagerInfo: Removed broadcast_7_piece0 on localhost:43680 in memory (size: 4.2 KB, free: 756.4 MB)
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 14
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 13
16/08/16 17:32:40 INFO BlockManagerInfo: Removed broadcast_5_piece0 on localhost:43680 in memory (size: 7.1 KB, free: 756.4 MB)
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 12
16/08/16 17:32:40 INFO ContextCleaner: Cleaned shuffle 0
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 11
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 10
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 9
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 8
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 7
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 6
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 5
16/08/16 17:32:40 INFO ContextCleaner: Cleaned accumulator 4
16/08/16 17:32:40 INFO BlockManagerInfo: Removed broadcast_4_piece0 on localhost:43680 in memory (size: 13.8 KB, free: 756.4 MB)
16/08/16 20:45:06 INFO SparkContext: Invoking stop() from shutdown hook
EDIT
This is a snippet of the code showing exactly what I am doing:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.types.{StructType, StructField, DateType, IntegerType}

object ff {
  def main(string: Array[String]) {
    val conf = new SparkConf()
      .set("spark.cassandra.connection.host", "127.0.0.1")
      .setMaster("local[4]")
      .setAppName("ff")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true") // Use first line of all files as header
      .option("inferSchema", "true")
      .load("test.csv")
    df.registerTempTable("ff_table")
    //df.printSchema()
    df.count
    time {
      df.write
        .format("org.apache.spark.sql.cassandra")
        .options(Map("table" -> "ff_table", "keyspace" -> "traffic"))
        .save()
    }
    def time[A](f: => A) = {
      val s = System.nanoTime
      val ret = f
      println("time: " + (System.nanoTime - s) / 1e6 + "ms")
      ret
    }
  }
}
Also, if I run nodetool describecluster I get these results:
Cluster Information:
Name: Test Cluster
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
bf6c3ae7-5c8b-3e5d-9794-8e34bee9278f: [127.0.0.1, 127.0.0.2, 127.0.0.3, 127.0.0.4, 127.0.0.5, 127.0.0.6]
My keyspace configuration:
CREATE KEYSPACE traffic WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'} AND durable_writes = true;
I tried inserting a row from the CLI into the replication_factor: 3 keyspace and it works, so every node can see the others.
Why can't Spark insert anything then? Any ideas?
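One setting worth checking (an assumption on my part, not something visible in the logs above): the Spark-Cassandra connector writes with a quorum-based consistency level by default, so with replication_factor > 1 each write waits for acknowledgements from several replicas, whereas with replication_factor: 1 a single node suffices. Lowering the output consistency to ONE is a quick way to test whether the hang is consistency-related; a hypothetical spark-defaults.conf fragment (verify the key against your connector version's reference):

```
# spark-defaults.conf -- sketch, values are placeholders
spark.cassandra.connection.host           127.0.0.1
spark.cassandra.output.consistency.level  ONE
```

If writes succeed with ONE but hang with the default, the problem is likely that some replicas are unreachable from the Spark driver's network location.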

Scala UDF runs fine on Spark shell but gives NPE when using it in sparkSQL

I have created a Spark UDF. When I run it in the spark-shell it works perfectly fine. But when I register it and use it in my Spark SQL query, it throws a NullPointerException.
scala> test_proc("1605","(#supp In (-1,118)")
16/03/07 10:35:04 INFO TaskSetManager: Finished task 0.0 in stage 21.0 (TID 220) in 62 ms on cdts1hdpdn01d.rxcorp.com (1/1)
16/03/07 10:35:04 INFO YarnScheduler: Removed TaskSet 21.0, whose tasks have all completed, from pool
16/03/07 10:35:04 INFO DAGScheduler: ResultStage 21 (first at :45) finished in 0.062 s
16/03/07 10:35:04 INFO DAGScheduler: Job 16 finished: first at :45, took 2.406408 s
res14: Int = 1
scala>
But when I register it and use it in my sparkSQL query, it gives NPE.
scala> sqlContext.udf.register("store_proc", test_proc _)
scala> hiveContext.sql("select store_proc('1605' , '(#supp In (-1,118)')").first.getInt(0)
16/03/07 10:37:58 INFO ParseDriver: Parsing command: select store_proc('1605' , '(#supp In (-1,118)')
16/03/07 10:37:58 INFO ParseDriver: Parse Completed
16/03/07 10:37:58 INFO SparkContext: Starting job: first at :24
16/03/07 10:37:58 INFO DAGScheduler: Got job 17 (first at :24) with 1 output partitions
16/03/07 10:37:58 INFO DAGScheduler: Final stage: ResultStage 22 (first at :24)
16/03/07 10:37:58 INFO DAGScheduler: Parents of final stage: List()
16/03/07 10:37:58 INFO DAGScheduler: Missing parents: List()
16/03/07 10:37:58 INFO DAGScheduler: Submitting ResultStage 22 (MapPartitionsRDD[86] at first at :24), which has no missing parents
16/03/07 10:37:58 INFO MemoryStore: ensureFreeSpace(10520) called with curMem=1472899, maxMem=2222739947
16/03/07 10:37:58 INFO MemoryStore: Block broadcast_30 stored as values in memory (estimated size 10.3 KB, free 2.1 GB)
16/03/07 10:37:58 INFO MemoryStore: ensureFreeSpace(4774) called with curMem=1483419, maxMem=2222739947
16/03/07 10:37:58 INFO MemoryStore: Block broadcast_30_piece0 stored as bytes in memory (estimated size 4.7 KB, free 2.1 GB)
16/03/07 10:37:58 INFO BlockManagerInfo: Added broadcast_30_piece0 in memory on 162.44.214.87:47564 (size: 4.7 KB, free: 2.1 GB)
16/03/07 10:37:58 INFO SparkContext: Created broadcast 30 from broadcast at DAGScheduler.scala:861
16/03/07 10:37:58 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 22 (MapPartitionsRDD[86] at first at :24)
16/03/07 10:37:58 INFO YarnScheduler: Adding task set 22.0 with 1 tasks
16/03/07 10:37:58 INFO TaskSetManager: Starting task 0.0 in stage 22.0 (TID 221, cdts1hdpdn02d.rxcorp.com, partition 0,PROCESS_LOCAL, 2155 bytes)
16/03/07 10:37:58 INFO BlockManagerInfo: Added broadcast_30_piece0 in memory on cdts1hdpdn02d.rxcorp.com:33678 (size: 4.7 KB, free: 6.7 GB)
16/03/07 10:37:58 WARN TaskSetManager: Lost task 0.0 in stage 22.0 (TID 221, cdts1hdpdn02d.rxcorp.com): java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveContext.parseSql(HiveContext.scala:291) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725) at $line20.$read$iwC$iwC$iwC$iwC$iwC$iwC$iwC$iwC.test_proc(:41)
This is a sample of my 'test_proc':
def test_proc(x: String, y: String): Int = {
  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  val z: Int = hiveContext.sql("select 7").first.getInt(0)
  return z
}
Based on the output from a standalone call, it looks like test_proc executes some kind of Spark action, and this cannot work inside a UDF because Spark doesn't support nested operations on distributed data structures. If test_proc uses a SQLContext it will result in an NPE, since Spark contexts exist only on the driver.
If that's the case, you'll have to restructure your code to achieve the desired effect, either using local (most likely broadcast) variables or joins.
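By contrast, a UDF body that only transforms its arguments, without touching any SparkContext or SQLContext, is safe to register. A hypothetical sketch; `safe_proc` and its logic are made up for illustration, not the poster's actual stored-procedure lookup:

```scala
object UdfSketch {
  // A UDF must be a pure, serializable function of its inputs.
  // It must NOT create or use a SQLContext/HiveContext internally,
  // because those exist only on the driver.
  def safe_proc(x: String, y: String): Int =
    x.length + y.length // placeholder logic

  def main(args: Array[String]): Unit = {
    // Would be registered as: sqlContext.udf.register("safe_proc", safe_proc _)
    println(safe_proc("1605", "(#supp In (-1,118)")) // 22
  }
}
```

Anything that genuinely needs another query's result has to be computed on the driver first and passed in (or broadcast), rather than queried from inside the UDF.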

SparkUI is stopping after execution of code in IntelliJ IDEA

I am trying to run this simple Spark job in Scala using IntelliJ IDEA. However, the Spark UI stops completely as soon as the job finishes executing. Is there something I am missing, or am I listening at the wrong location? Scala version 2.10.4 and Spark 1.6.0.
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "C:/spark-1.6.0-bin-hadoop2.6/spark-1.6.0-bin-hadoop2.6/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
16/02/24 01:24:39 INFO SparkContext: Running Spark version 1.6.0
16/02/24 01:24:40 INFO SecurityManager: Changing view acls to: Sivaram Konanki
16/02/24 01:24:40 INFO SecurityManager: Changing modify acls to: Sivaram Konanki
16/02/24 01:24:40 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Sivaram Konanki); users with modify permissions: Set(Sivaram Konanki)
16/02/24 01:24:41 INFO Utils: Successfully started service 'sparkDriver' on port 54881.
16/02/24 01:24:41 INFO Slf4jLogger: Slf4jLogger started
16/02/24 01:24:42 INFO Remoting: Starting remoting
16/02/24 01:24:42 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem#192.168.1.15:54894]
16/02/24 01:24:42 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 54894.
16/02/24 01:24:42 INFO SparkEnv: Registering MapOutputTracker
16/02/24 01:24:42 INFO SparkEnv: Registering BlockManagerMaster
16/02/24 01:24:42 INFO DiskBlockManager: Created local directory at C:\Users\Sivaram Konanki\AppData\Local\Temp\blockmgr-dad99e77-f3a6-4a1d-88d8-3b030be0bd0a
16/02/24 01:24:42 INFO MemoryStore: MemoryStore started with capacity 2.4 GB
16/02/24 01:24:42 INFO SparkEnv: Registering OutputCommitCoordinator
16/02/24 01:24:42 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/02/24 01:24:42 INFO SparkUI: Started SparkUI at http://192.168.1.15:4040
16/02/24 01:24:42 INFO Executor: Starting executor ID driver on host localhost
16/02/24 01:24:43 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54913.
16/02/24 01:24:43 INFO NettyBlockTransferService: Server created on 54913
16/02/24 01:24:43 INFO BlockManagerMaster: Trying to register BlockManager
16/02/24 01:24:43 INFO BlockManagerMasterEndpoint: Registering block manager localhost:54913 with 2.4 GB RAM, BlockManagerId(driver, localhost, 54913)
16/02/24 01:24:43 INFO BlockManagerMaster: Registered BlockManager
16/02/24 01:24:44 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 127.4 KB)
16/02/24 01:24:44 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 141.3 KB)
16/02/24 01:24:44 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:54913 (size: 13.9 KB, free: 2.4 GB)
16/02/24 01:24:44 INFO SparkContext: Created broadcast 0 from textFile at SimpleApp.scala:11
16/02/24 01:24:45 WARN : Your hostname, OSG-E5450-42 resolves to a loopback/non-reachable address: fe80:0:0:0:d9ff:4f93:5643:703d%wlan3, but we couldn't find any external IP address!
16/02/24 01:24:46 INFO FileInputFormat: Total input paths to process : 1
16/02/24 01:24:46 INFO SparkContext: Starting job: count at SimpleApp.scala:12
16/02/24 01:24:46 INFO DAGScheduler: Got job 0 (count at SimpleApp.scala:12) with 2 output partitions
16/02/24 01:24:46 INFO DAGScheduler: Final stage: ResultStage 0 (count at SimpleApp.scala:12)
16/02/24 01:24:46 INFO DAGScheduler: Parents of final stage: List()
16/02/24 01:24:46 INFO DAGScheduler: Missing parents: List()
16/02/24 01:24:46 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12), which has no missing parents
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 144.5 KB)
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1886.0 B, free 146.3 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:54913 (size: 1886.0 B, free: 2.4 GB)
16/02/24 01:24:46 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/02/24 01:24:46 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12)
16/02/24 01:24:46 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/02/24 01:24:46 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
16/02/24 01:24:46 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/02/24 01:24:46 INFO CacheManager: Partition rdd_1_1 not found, computing it
16/02/24 01:24:46 INFO CacheManager: Partition rdd_1_0 not found, computing it
16/02/24 01:24:46 INFO HadoopRDD: Input split: file:/C:/spark-1.6.0-bin-hadoop2.6/spark-1.6.0-bin-hadoop2.6/README.md:1679+1680
16/02/24 01:24:46 INFO HadoopRDD: Input split: file:/C:/spark-1.6.0-bin-hadoop2.6/spark-1.6.0-bin-hadoop2.6/README.md:0+1679
16/02/24 01:24:46 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/02/24 01:24:46 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/02/24 01:24:46 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/02/24 01:24:46 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/02/24 01:24:46 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/02/24 01:24:46 INFO MemoryStore: Block rdd_1_1 stored as values in memory (estimated size 4.7 KB, free 151.0 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added rdd_1_1 in memory on localhost:54913 (size: 4.7 KB, free: 2.4 GB)
16/02/24 01:24:46 INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 5.4 KB, free 156.5 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added rdd_1_0 in memory on localhost:54913 (size: 5.4 KB, free: 2.4 GB)
16/02/24 01:24:46 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2662 bytes result sent to driver
16/02/24 01:24:46 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2662 bytes result sent to driver
16/02/24 01:24:46 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 170 ms on localhost (1/2)
16/02/24 01:24:46 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 143 ms on localhost (2/2)
16/02/24 01:24:46 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/02/24 01:24:46 INFO DAGScheduler: ResultStage 0 (count at SimpleApp.scala:12) finished in 0.187 s
16/02/24 01:24:46 INFO DAGScheduler: Job 0 finished: count at SimpleApp.scala:12, took 0.303861 s
16/02/24 01:24:46 INFO SparkContext: Starting job: count at SimpleApp.scala:13
16/02/24 01:24:46 INFO DAGScheduler: Got job 1 (count at SimpleApp.scala:13) with 2 output partitions
16/02/24 01:24:46 INFO DAGScheduler: Final stage: ResultStage 1 (count at SimpleApp.scala:13)
16/02/24 01:24:46 INFO DAGScheduler: Parents of final stage: List()
16/02/24 01:24:46 INFO DAGScheduler: Missing parents: List()
16/02/24 01:24:46 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13), which has no missing parents
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.1 KB, free 159.6 KB)
16/02/24 01:24:46 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1888.0 B, free 161.5 KB)
16/02/24 01:24:46 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:54913 (size: 1888.0 B, free: 2.4 GB)
16/02/24 01:24:46 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/02/24 01:24:46 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13)
16/02/24 01:24:46 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
16/02/24 01:24:46 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, partition 1,PROCESS_LOCAL, 2172 bytes)
16/02/24 01:24:46 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
16/02/24 01:24:46 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
16/02/24 01:24:46 INFO BlockManager: Found block rdd_1_0 locally
16/02/24 01:24:46 INFO BlockManager: Found block rdd_1_1 locally
16/02/24 01:24:46 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 2082 bytes result sent to driver
16/02/24 01:24:46 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 2082 bytes result sent to driver
16/02/24 01:24:46 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 34 ms on localhost (1/2)
16/02/24 01:24:46 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 37 ms on localhost (2/2)
Lines with a: 58, Lines with b: 26
16/02/24 01:24:46 INFO DAGScheduler: ResultStage 1 (count at SimpleApp.scala:13) finished in 0.040 s
16/02/24 01:24:46 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/02/24 01:24:46 INFO DAGScheduler: Job 1 finished: count at SimpleApp.scala:13, took 0.068350 s
16/02/24 01:24:46 INFO SparkContext: Invoking stop() from shutdown hook
16/02/24 01:24:46 INFO SparkUI: Stopped Spark web UI at http://192.168.1.15:4040
16/02/24 01:24:46 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/02/24 01:24:46 INFO MemoryStore: MemoryStore cleared
16/02/24 01:24:46 INFO BlockManager: BlockManager stopped
16/02/24 01:24:46 INFO BlockManagerMaster: BlockManagerMaster stopped
16/02/24 01:24:46 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/02/24 01:24:46 INFO SparkContext: Successfully stopped SparkContext
16/02/24 01:24:46 INFO ShutdownHookManager: Shutdown hook called
16/02/24 01:24:46 INFO ShutdownHookManager: Deleting directory C:\Users\Sivaram Konanki\AppData\Local\Temp\spark-861b5aef-6732-45e4-a4f4-6769370c555e
You can add a
Thread.sleep(1000000) // 1,000 seconds, or more
at the bottom of your Spark job; this will let you inspect the web UI in IDEs like IntelliJ while the job is still running.
This is expected behavior. The Spark UI is maintained by the SparkContext, so it cannot stay active after the application has finished and the context has been destroyed.
In standalone mode the information is preserved by the cluster web UI; on Mesos or YARN you can use the history server; but in local mode the only option I am aware of is to keep the application running.
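For the local-mode case there is one more workaround beyond keeping the application alive: enable event logging and browse the finished run through the history server afterwards. A minimal sketch; the directory paths are placeholders to adapt to your setup:

```
# spark-defaults.conf -- sketch
spark.eventLog.enabled         true
spark.eventLog.dir             file:/tmp/spark-events
spark.history.fs.logDirectory  file:/tmp/spark-events
```

Then start it with `./sbin/start-history-server.sh` and open its UI (port 18080 by default) to see the completed application's stages and tasks.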

Using addFile with pipe on a yarn cluster

I've been using pyspark with my YARN cluster with success. The work I'm
doing involves using the RDD's pipe command to send data through a binary
I've made. I can do this easily in pyspark like so (assuming 'sc' is
already defined):
sc.addFile("./dumb_prog")
t= sc.parallelize(range(10))
t.pipe("dumb_prog")
t.take(10) # Gives expected result
However, if I do the same thing in Scala, the pipe command gets a 'Cannot run program "dumb_prog": error=2, No such file or directory' error. Here's the code in the Scala shell:
sc.addFile("./dumb_prog")
val t = sc.parallelize(0 until 10)
val u = t.pipe("dumb_prog")
u.take(10)
Why does this only work in Python and not in Scala? Is there a way I can get it to work in Scala?
Here is the full error message from the Scala side:
14/09/29 13:07:47 INFO SparkContext: Starting job: take at <console>:17
14/09/29 13:07:47 INFO DAGScheduler: Got job 3 (take at <console>:17) with 1 output partitions (allowLocal=true)
14/09/29 13:07:47 INFO DAGScheduler: Final stage: Stage 3(take at <console>:17)
14/09/29 13:07:47 INFO DAGScheduler: Parents of final stage: List()
14/09/29 13:07:47 INFO DAGScheduler: Missing parents: List()
14/09/29 13:07:47 INFO DAGScheduler: Submitting Stage 3 (PipedRDD[3] at pipe at <console>:14), which has no missing parents
14/09/29 13:07:47 INFO MemoryStore: ensureFreeSpace(2136) called with curMem=7453, maxMem=278302556
14/09/29 13:07:47 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.1 KB, free 265.4 MB)
14/09/29 13:07:47 INFO MemoryStore: ensureFreeSpace(1389) called with curMem=9589, maxMem=278302556
14/09/29 13:07:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1389.0 B, free 265.4 MB)
14/09/29 13:07:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 10.10.0.20:37574 (size: 1389.0 B, free: 265.4 MB)
14/09/29 13:07:47 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
14/09/29 13:07:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 3 (PipedRDD[3] at pipe at <console>:14)
14/09/29 13:07:47 INFO YarnClientClusterScheduler: Adding task set 3.0 with 1 tasks
14/09/29 13:07:47 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 6, SERVERNAME, PROCESS_LOCAL, 1201 bytes)
14/09/29 13:07:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on SERVERNAME:57118 (size: 1389.0 B, free: 530.3 MB)
14/09/29 13:07:47 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, SERVERNAME): java.io.IOException: Cannot run program "dumb_prog": error=2, No such file or directory
java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
I ran into a similar issue with Spark 1.3.0 in YARN client mode. Looking in the application cache directory, the file never got pushed to the executors, even when using --files. But when I added the lines below, it was pushed to each executor:
sc.addFile("dumb_prog", true)
t.pipe("./dumb_prog")
I think it is a bug, but the above got me past the issue.
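Putting the two changes together, a minimal sketch of the working pattern (assuming "dumb_prog" is an executable in the driver's working directory, as in the question; the helper name is made up for illustration):

```scala
import org.apache.spark.{SparkContext, SparkFiles}

// Ship the binary to every executor and pipe an RDD through it.
// Two details matter here: addFile's second argument (true), which is
// what unblocked the answer above on YARN, and the "./" prefix so the
// executor resolves the file in its local working directory rather
// than searching PATH.
def pipeThroughBinary(sc: SparkContext): Array[String] = {
  sc.addFile("dumb_prog", true)
  val t = sc.parallelize(0 until 10)
  t.pipe("./dumb_prog").take(10)
}
```

If the relative path still fails in your environment, SparkFiles.get("dumb_prog") returns the absolute local path of a distributed file on an executor, which can be passed to pipe instead.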