pyspark files loading errors shows

pyspark files loading errors shows - pyspark

I am trying to load file to my spark and here is my code.
lines = sc.textFile('file:///Users/zhangqing198573/data/weblog_lab.csv')
and the error shows
17/03/04 22:16:32 INFO MemoryStore: Block broadcast_0 stored as values
in memory (estimated size 127.4 KB, free 127.4 KB) 17/03/04 22:16:32
INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory
(estimated size 13.9 KB, free 141.3 KB) 17/03/04 22:16:32 INFO
BlockManagerInfo: Added broadcast_0_piece0 in memory on
localhost:53383 (size: 13.9 KB, free: 511.1 MB) 17/03/04 22:16:32 INFO
SparkContext: Created broadcast 0 from textFile at
NativeMethodAccessorImpl.java:-2.
I dont know whats wrong

Related

RDD.map function hangs in Spark

I am trying to concatenate two spark dataframes of equal length like -
DF1 -
| A |
| 1 |
| 2 |
| 3 |
| 4 |
DF2 -
| B |
| a |
| b |
| c |
| d |
Result DF -
| A | B |
| 1 | a |
| 2 | b |
| 3 | c |
| 4 | d |
For this, I am using below code -
val combinedRow = df1.rdd.zip(df2.select("B").rdd). map({
case (df1Data, df2Data) => {
Row.fromSeq(df1Data.toSeq ++ df2Data.toSeq)
}
})
val combinedschema = StructType(df1.schema.fields ++ df2.select("B").schema.fields)
val resultDF = spark.sqlContext.createDataFrame(combinedRow, combinedschema)
But the code is not making any progress. Its not showing any exception also. Its just stuck.
Any suggestions what may be wrong here ? Thanks in advance.
EDIT -
Logs generated after successful execution of the latest statement.
[main] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 13.848847 ms
[broadcast-exchange-0] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 14.323824 ms
[broadcast-exchange-0] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_35 stored as values in memory (estimated size 1024.1 KB, free 871.5 MB)
[broadcast-exchange-0] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_35_piece0 stored as bytes in memory (estimated size 417.0 B, free 871.5 MB)
[dispatcher-event-loop-3] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_35_piece0 in memory on 192.168.20.181:38202 (size: 417.0 B, free: 872.9 MB)
[broadcast-exchange-0] INFO org.apache.spark.SparkContext - Created broadcast 35 from run at ThreadPoolExecutor.java:1142
[main] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 27.697751 ms
[main] INFO org.apache.spark.SparkContext - Starting job: show at Train.scala:180
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 19 (show at Train.scala:180) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 31 (show at Train.scala:180)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 31 (MapPartitionsRDD[106] at show at Train.scala:180), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_36 stored as values in memory (estimated size 14.3 KB, free 871.5 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_36_piece0 stored as bytes in memory (estimated size 6.4 KB, free 871.5 MB)
[dispatcher-event-loop-2] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_36_piece0 in memory on 192.168.20.181:38202 (size: 6.4 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 36 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 31 (MapPartitionsRDD[106] at show at Train.scala:180)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 31.0 with 1 tasks
[dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 31.0 (TID 1267, localhost, executor driver, partition 0, PROCESS_LOCAL, 5961 bytes)
[Executor task launch worker for task 1267] INFO org.apache.spark.executor.Executor - Running task 0.0 in stage 31.0 (TID 1267)
[Executor task launch worker for task 1267] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 32.758147 ms
[Executor task launch worker for task 1267] INFO org.apache.spark.SparkContext - Starting job: head at Train.scala:161
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 20 (head at Train.scala:161) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 32 (head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 32 (MapPartitionsRDD[110] at head at Train.scala:161), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_37 stored as values in memory (estimated size 26.9 KB, free 871.4 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_37_piece0 stored as bytes in memory (estimated size 12.7 KB, free 871.4 MB)
[dispatcher-event-loop-3] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_37_piece0 in memory on 192.168.20.181:38202 (size: 12.7 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 37 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 32 (MapPartitionsRDD[110] at head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 32.0 with 1 tasks
[dispatcher-event-loop-2] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 32.0 (TID 1268, localhost, executor driver, partition 0, PROCESS_LOCAL, 5813 bytes)
[Executor task launch worker for task 1268] INFO org.apache.spark.executor.Executor - Running task 0.0 in stage 32.0 (TID 1268)
[Executor task launch worker for task 1268] INFO org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD - closed connection
[Executor task launch worker for task 1268] INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 32.0 (TID 1268). 1979 bytes result sent to driver
[task-result-getter-3] INFO org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 32.0 (TID 1268) in 132 ms on localhost (executor driver) (1/1)
[task-result-getter-3] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 32.0, whose tasks have all completed, from pool
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - ResultStage 32 (head at Train.scala:161) finished in 0.128 s
[Executor task launch worker for task 1267] INFO org.apache.spark.scheduler.DAGScheduler - Job 20 finished: head at Train.scala:161, took 0.140223 s
[Executor task launch worker for task 1267] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 8.366053 ms
[Executor task launch worker for task 1267] INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 31.0 (TID 1267). 1501 bytes result sent to driver
[task-result-getter-0] INFO org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 31.0 (TID 1267) in 393 ms on localhost (executor driver) (1/1)
[task-result-getter-0] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 31.0, whose tasks have all completed, from pool
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - ResultStage 31 (show at Train.scala:180) finished in 0.393 s
[main] INFO org.apache.spark.scheduler.DAGScheduler - Job 19 finished: show at Train.scala:180, took 0.413534 s
[main] INFO org.apache.spark.SparkContext - Starting job: show at Train.scala:180
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 21 (show at Train.scala:180) with 4 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 33 (show at Train.scala:180)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 33 (MapPartitionsRDD[106] at show at Train.scala:180), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_38 stored as values in memory (estimated size 14.3 KB, free 871.4 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_38_piece0 stored as bytes in memory (estimated size 6.4 KB, free 871.4 MB)
[dispatcher-event-loop-2] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_38_piece0 in memory on 192.168.20.181:38202 (size: 6.4 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 38 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 4 missing tasks from ResultStage 33 (MapPartitionsRDD[106] at show at Train.scala:180)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 33.0 with 4 tasks
[dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 33.0 (TID 1269, localhost, executor driver, partition 1, PROCESS_LOCAL, 5961 bytes)
[dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 1.0 in stage 33.0 (TID 1270, localhost, executor driver, partition 2, PROCESS_LOCAL, 5961 bytes)
[dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 2.0 in stage 33.0 (TID 1271, localhost, executor driver, partition 3, PROCESS_LOCAL, 5961 bytes)
[dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 3.0 in stage 33.0 (TID 1272, localhost, executor driver, partition 4, PROCESS_LOCAL, 5961 bytes)
[Executor task launch worker for task 1269] INFO org.apache.spark.executor.Executor - Running task 0.0 in stage 33.0 (TID 1269)
[Executor task launch worker for task 1271] INFO org.apache.spark.executor.Executor - Running task 2.0 in stage 33.0 (TID 1271)
[Executor task launch worker for task 1272] INFO org.apache.spark.executor.Executor - Running task 3.0 in stage 33.0 (TID 1272)
[Executor task launch worker for task 1270] INFO org.apache.spark.executor.Executor - Running task 1.0 in stage 33.0 (TID 1270)
[Executor task launch worker for task 1269] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 55.127045 ms
[Executor task launch worker for task 1271] INFO org.apache.spark.SparkContext - Starting job: head at Train.scala:161
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 22 (head at Train.scala:161) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 34 (head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 34 (MapPartitionsRDD[117] at head at Train.scala:161), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_39 stored as values in memory (estimated size 26.9 KB, free 871.4 MB)
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned shuffle 10
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 31267
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 31268
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25303
[dispatcher-event-loop-3] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_34_piece0 on 192.168.20.181:38202 in memory (size: 22.8 KB, free: 872.9 MB)
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25298
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25304
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 31269
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned shuffle 11
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25299
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25301
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25300
[dispatcher-event-loop-2] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_37_piece0 on 192.168.20.181:38202 in memory (size: 12.7 KB, free: 872.9 MB)
[dispatcher-event-loop-0] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_33_piece0 on 192.168.20.181:38202 in memory (size: 22.6 KB, free: 872.9 MB)
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25305
[dispatcher-event-loop-3] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_36_piece0 on 192.168.20.181:38202 in memory (size: 6.4 KB, free: 873.0 MB)
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25302
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_39_piece0 stored as bytes in memory (estimated size 12.7 KB, free 871.6 MB)
[dispatcher-event-loop-0] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_39_piece0 in memory on 192.168.20.181:38202 (size: 12.7 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 39 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 34 (MapPartitionsRDD[117] at head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 34.0 with 1 tasks
[Executor task launch worker for task 1272] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 92.57204 ms
[Executor task launch worker for task 1269] INFO org.apache.spark.SparkContext - Starting job: head at Train.scala:161
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 23 (head at Train.scala:161) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 35 (head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 35 (MapPartitionsRDD[122] at head at Train.scala:161), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_40 stored as values in memory (estimated size 26.9 KB, free 871.6 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_40_piece0 stored as bytes in memory (estimated size 12.7 KB, free 871.5 MB)
[dispatcher-event-loop-1] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_40_piece0 in memory on 192.168.20.181:38202 (size: 12.7 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 40 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 35 (MapPartitionsRDD[122] at head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 35.0 with 1 tasks
[Executor task launch worker for task 1270] INFO org.apache.spark.SparkContext - Starting job: head at Train.scala:161
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 24 (head at Train.scala:161) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 36 (head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 36 (MapPartitionsRDD[124] at head at Train.scala:161), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_41 stored as values in memory (estimated size 26.9 KB, free 871.5 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_41_piece0 stored as bytes in memory (estimated size 12.7 KB, free 871.5 MB)
[dispatcher-event-loop-0] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_41_piece0 in memory on 192.168.20.181:38202 (size: 12.7 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 41 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 36 (MapPartitionsRDD[124] at head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 36.0 with 1 tasks
[Executor task launch worker for task 1272] INFO org.apache.spark.SparkContext - Starting job: head at Train.scala:161
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 25 (head at Train.scala:161) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 37 (head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 37 (MapPartitionsRDD[126] at head at Train.scala:161), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_42 stored as values in memory (estimated size 26.9 KB, free 871.5 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_42_piece0 stored as bytes in memory (estimated size 12.7 KB, free 871.5 MB)
[dispatcher-event-loop-1] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_42_piece0 in memory on 192.168.20.181:38202 (size: 12.7 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 42 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 37 (MapPartitionsRDD[126] at head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 37.0 with 1 tasks
And its stuck here.

I can't replicate any error in your code it should work fine in your case as well.
You can also simply join two dataframes with a id assigned to both dataframes
df1.withColumn("id", monotonically_increasing_id())
.join(df2.withColumn("id", monotonically_increasing_id()), "id").drop("id")
Hope this helps!

You can also use zipWithIndex on both RDDs:
val df1 = sc.parallelize(Seq("A", "1", "2", "3", "4")).toDF("A")
val df2 = sc.parallelize(Seq("B", "a", "b", "c", "d")).toDF("B")
val zip1 = df1.rdd.zipWithIndex.map { case (k, v) => (v, k.mkString)}
val zip2 = df2.rdd.zipWithIndex.map { case (k, v) => (v, k.mkString)}
zip1.join(zip2).map{ case (k, v) => v }.collect()

UnknownHostException with Mesos + Spark and custom Jar

I am receiving an UnknownHostException when running a custom jar with Spark on Mesos. The issue does not happen when running spark-shell.
My spark-env.sh contains the following:
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
export HADOOP_CONF_DIR=/hadoop-2.7.1/etc/hadoop/
My spark-defaults.conf contains the following:
spark.master mesos://zk://172.31.0.81:2181,172.31.16.81:2181,172.31.32.81:2181/mesos
spark.mesos.executor.home /spark-1.5.0-bin-hadoop2.6/
These settings are on all masters and slaves.
Starting spark-shell as follows and running the following line works correctly:
/spark-1.5.0-bin-hadoop2.6/bin/spark-shell
sc.textFile("/tmp/Input").collect.foreach(println)
Log for spark-shell:
15/09/28 20:04:49 INFO storage.MemoryStore: ensureFreeSpace(88528) called with curMem=0, maxMem=556038881
15/09/28 20:04:49 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 86.5 KB, free 530.2 MB)
15/09/28 20:04:49 INFO storage.MemoryStore: ensureFreeSpace(20236) called with curMem=88528, maxMem=556038881
15/09/28 20:04:49 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.8 KB, free 530.2 MB)
15/09/28 20:04:49 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.31.21.104:49048 (size: 19.8 KB, free: 530.3 MB)
15/09/28 20:04:49 INFO spark.SparkContext: Created broadcast 0 from textFile at <console>:22
15/09/28 20:04:49 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/28 20:04:49 INFO spark.SparkContext: Starting job: collect at <console>:22
15/09/28 20:04:49 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:22) with 3 output partitions
15/09/28 20:04:49 INFO scheduler.DAGScheduler: Final stage: ResultStage 0(collect at <console>:22)
15/09/28 20:04:49 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/09/28 20:04:49 INFO scheduler.DAGScheduler: Missing parents: List()
15/09/28 20:04:49 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:22), which has no missing parents
15/09/28 20:04:49 INFO storage.MemoryStore: ensureFreeSpace(3120) called with curMem=108764, maxMem=556038881
15/09/28 20:04:49 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.0 KB, free 530.2 MB)
15/09/28 20:04:49 INFO storage.MemoryStore: ensureFreeSpace(1784) called with curMem=111884, maxMem=556038881
15/09/28 20:04:49 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1784.0 B, free 530.2 MB)
15/09/28 20:04:49 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.31.21.104:49048 (size: 1784.0 B, free: 530.3 MB)
15/09/28 20:04:49 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:861
15/09/28 20:04:49 INFO scheduler.DAGScheduler: Submitting 3 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at textFile at <console>:22)
15/09/28 20:04:49 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
15/09/28 20:04:49 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-172-31-37-82.us-west-2.compute.internal, NODE_LOCAL, 2142 bytes)
15/09/28 20:04:49 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, ip-172-31-21-104.us-west-2.compute.internal, NODE_LOCAL, 2142 bytes)
15/09/28 20:04:49 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, ip-172-31-4-4.us-west-2.compute.internal, NODE_LOCAL, 2142 bytes)
15/09/28 20:04:52 INFO storage.BlockManagerMasterEndpoint: Registering block manager ip-172-31-4-4.us-west-2.compute.internal:50648 with 530.3 MB RAM, BlockManagerId(20150928-190245-1358962604-5050-11297-S2, ip-172-31-4-4.us-west-2.compute.internal, 50648)
15/09/28 20:04:52 INFO storage.BlockManagerMasterEndpoint: Registering block manager ip-172-31-37-82.us-west-2.compute.internal:52624 with 530.3 MB RAM, BlockManagerId(20150928-190245-1358962604-5050-11297-S1, ip-172-31-37-82.us-west-2.compute.internal, 52624)
15/09/28 20:04:52 INFO storage.BlockManagerMasterEndpoint: Registering block manager ip-172-31-21-104.us-west-2.compute.internal:56628 with 530.3 MB RAM, BlockManagerId(20150928-190245-1358962604-5050-11297-S0, ip-172-31-21-104.us-west-2.compute.internal, 56628)
15/09/28 20:04:52 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-37-82.us-west-2.compute.internal:52624 (size: 1784.0 B, free: 530.3 MB)
15/09/28 20:04:52 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-21-104.us-west-2.compute.internal:56628 (size: 1784.0 B, free: 530.3 MB)
15/09/28 20:04:52 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-4-4.us-west-2.compute.internal:50648 (size: 1784.0 B, free: 530.3 MB)
15/09/28 20:04:52 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-37-82.us-west-2.compute.internal:52624 (size: 19.8 KB, free: 530.3 MB)
15/09/28 20:04:52 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-21-104.us-west-2.compute.internal:56628 (size: 19.8 KB, free: 530.3 MB)
15/09/28 20:04:52 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-4-4.us-west-2.compute.internal:50648 (size: 19.8 KB, free: 530.3 MB)
15/09/28 20:04:53 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 3907 ms on ip-172-31-37-82.us-west-2.compute.internal (1/3)
15/09/28 20:04:53 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 3884 ms on ip-172-31-4-4.us-west-2.compute.internal (2/3)
15/09/28 20:04:53 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 3907 ms on ip-172-31-21-104.us-west-2.compute.internal (3/3)
15/09/28 20:04:53 INFO scheduler.DAGScheduler: ResultStage 0 (collect at <console>:22) finished in 3.940 s
15/09/28 20:04:53 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/09/28 20:04:53 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:22, took 4.019454 s
pepsi
cocacola
The following sample code compiled into a Jar fails
Sample code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
sc.textFile("/tmp/Input").collect.foreach(println)
}
}
Run via:
/spark-1.5.0-bin-hadoop2.6/bin/spark-submit --class "SimpleApp" /home/hdfs/test_2.10-0.1.jar
Log for spark-submit:
java.lang.IllegalArgumentException: java.net.UnknownHostException: affinio
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:665)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:601)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436)
at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:409)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1007)
at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1007)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
at scala.Option.map(Option.scala:145)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:220)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: affinio
... 35 more
hdfs-site.xml
<property>
<name>dfs.nameservices</name>
<value>affinio</value>
</property>
<property>
<name>dfs.ha.namenodes.affinio</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.affinio.nn1</name>
<value>172.31.16.81:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.affinio.nn2</name>
<value>172.31.32.81:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.affinio.nn1</name>
<value>172.31.16.81:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.affinio.nn2</name>
<value>172.31.32.81:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>file:///nfs/dfs/ha-name-dir-shared</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.affinio</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/hdfs/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/namenode</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>268435456</value>
</property>
<property>
<name>dfs.namenode.handler.count</name>
<value>100</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data/hdfs</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>172.31.16.81:2181,172.31.32.81:2181,172.31.0.81:2181</value>
</property>
</configuration>
core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://affinio</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
</configuration>
spark-shell conf.toDebugString
spark.app.id=20150929-173220-1361059756-5050-16026-0005
spark.app.name=Spark shell
spark.driver.host=172.31.25.67
spark.driver.port=37613
spark.executor.id=driver
spark.externalBlockStore.folderName=spark-d4bf255f-f1f3-4026-83bf-b377a24f5f2c
spark.fileserver.uri=http://172.31.25.67:54526
spark.jars=
spark.master=mesos://zk://172.31.0.81:2181,172.31.16.81:2181,172.31.32.81:2181/mesos
spark.mesos.executor.home=/spark-1.5.0-bin-hadoop2.6/
spark.repl.class.uri=http://172.31.25.67:45553
spark.submit.deployMode=client
spark-submit conf.toDebugString
spark.app.id=20150929-173220-1361059756-5050-16026-0004
spark.app.name=Simple Application
spark.driver.host=172.31.25.67
spark.driver.port=47968
spark.executor.id=driver
spark.externalBlockStore.folderName=spark-846de0d9-8bb1-414b-8b81-f2d6646a58d3
spark.fileserver.uri=http://172.31.25.67:45283
spark.jars=file:/home/hdfs/./test_2.10-0.1.jar
spark.master=mesos://zk://172.31.0.81:2181,172.31.16.81:2181,172.31.32.81:2181/mesos
spark.mesos.executor.home=/spark-1.5.0-bin-hadoop2.6/
spark.submit.deployMode=client
I am able to make it work if I run it as follows:
spark-submit --files /hadoop-2.7.1/etc/hadoop/hdfs-site.xml,/hadoop-2.7.1/etc/hadoop/core-site.xml ./test_2.10-0.1.jar
So, the configurations are not being loaded by default, even though I set HADOOP_CONF_DIR on all machines to /hadoop-2.7.1/etc/hadoop/ in the /spark-1.5.0-bin-hadoop2.6/conf/spark-env.sh as well as in the user profile settings:
cat /etc/profile.d/hadoop.sh
# Set path for hadoop
export HADOOP_CONF_DIR=/hadoop-2.7.1/etc/hadoop/
export PATH=$PATH:/hadoop-2.7.1/bin
Output of the --verbose switch
System properties:
spark.local.dir -> /data/spark/
SPARK_SUBMIT -> true
spark.files -> file:///hadoop-2.7.1/etc/hadoop/hdfs-site.xml,file:///hadoop-2.7.1/etc/hadoop/core-site.xml
spark.app.name -> SimpleApp
spark.jars -> file:/home/hdfs/./test_2.10-0.1.jar
spark.submit.deployMode -> client
spark.mesos.executor.home -> /spark-1.5.0-bin-hadoop2.6
spark.master -> mesos://zk://172.31.0.81:2181,172.31.16.81:2181,172.31.32.81:2181/mesos
Classpath elements:
file:/home/hdfs/./test_2.10-0.1.jar
I also made the app print the environment variables from the executor
sc.parallelize(Array(1)).flatMap( v=>System.getenv ).collect.foreach(v=>println(s"${v._1}=${v._2}"))
Output:
LIBPROCESS_PORT=0
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
SPARK_EXECUTOR_MEMORY=1024m
SHLVL=1
MESOS_EXECUTOR_ID=20150930-115952-1361059756-5050-15990-S1
XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
MESOS_DIRECTORY=/data/slaves/20150930-115952-1361059756-5050-15990-S1/frameworks/20150930-115952-1361059756-5050-15990-0008/executors/20150930-115952-1361059756-5050-15990-S1/runs/2baa786a-be89-4823-a248-bb35034bb2fa
MESOS_SLAVE_PID=slave(1)#172.31.32.118:5051
_SPARK_ASSEMBLY=/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar
SPARK_HOME=/spark-1.5.0-bin-hadoop2.6
MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos-0.24.0.so
SPARK_SCALA_VERSION=2.10
SPARK_USER=hdfs
PWD=/data/slaves/20150930-115952-1361059756-5050-15990-S1/frameworks/20150930-115952-1361059756-5050-15990-0008/executors/20150930-115952-1361059756-5050-15990-S1/runs/2baa786a-be89-4823-a248-bb35034bb2fa
SPARK_ENV_LOADED=1
MESOS_FRAMEWORK_ID=20150930-115952-1361059756-5050-15990-0008
MESOS_SLAVE_ID=20150930-115952-1361059756-5050-15990-S1
MESOS_CHECKPOINT=0
HADOOP_CONF_DIR=/hadoop-2.7.1/etc/hadoop/
SPARK_EXECUTOR_OPTS=
NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
So we can see the executors have HADOOP_CONF_DIR in their environment, but it's still not working without using the spark.files
UPDATE:
Downgrading to spark-1.3.1 and the issue goes away. Something in spark-1.5 breaks the classpath
Spark-1.3.1 outputs:
System properties:
SPARK_SUBMIT -> true
spark.app.name -> SimpleApp
spark.jars -> file:/home/hdfs/./test_2.10-0.1.jar
spark.mesos.executor.home -> /spark-1.3.1-bin-hadoop2.6
spark.master -> mesos://zk://172.31.0.81:2181,172.31.16.81:2181,172.31.32.81:2181/mesos
Classpath elements:
file:/home/hdfs/./test_2.10-0.1.jar
Executor Environment:
LIBPROCESS_PORT=0
MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
SPARK_EXECUTOR_MEMORY=512m
SHLVL=1
MESOS_EXECUTOR_ID=20150930-115952-1361059756-5050-15990-S2
CLASSPATH=/spark-1.3.1-bin-hadoop2.6/conf:/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/spark-1.3.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/spark-1.3.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/spark-1.3.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/hadoop-2.7.1/etc/hadoop
XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
MESOS_DIRECTORY=/data/slaves/20150930-115952-1361059756-5050-15990-S2/frameworks/20150930-115952-1361059756-5050-15990-0013/executors/20150930-115952-1361059756-5050-15990-S2/runs/23c38710-14d7-4550-b3f7-2879576ce1d2
MESOS_SLAVE_PID=slave(1)#172.31.18.189:5051
PYTHONPATH=/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip:/spark-1.3.1-bin-hadoop2.6/python:
SPARK_HOME=/spark-1.3.1-bin-hadoop2.6
SPARK_CONF_DIR=/spark-1.3.1-bin-hadoop2.6/conf
MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos-0.24.0.so
SPARK_SCALA_VERSION=2.10
SPARK_USER=hdfs
PWD=/data/slaves/20150930-115952-1361059756-5050-15990-S2/frameworks/20150930-115952-1361059756-5050-15990-0013/executors/20150930-115952-1361059756-5050-15990-S2/runs/23c38710-14d7-4550-b3f7-2879576ce1d2
SPARK_ENV_LOADED=1
MESOS_FRAMEWORK_ID=20150930-115952-1361059756-5050-15990-0013
MESOS_SLAVE_ID=20150930-115952-1361059756-5050-15990-S2
MESOS_CHECKPOINT=0
HADOOP_CONF_DIR=/hadoop-2.7.1/etc/hadoop
SPARK_EXECUTOR_OPTS=
NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat

Spark: java.lang.OutOfMemoryError: GC overhead limit exceeded

I have the following code to converts the I read the data from my input files and create a pairedrdd, which is then converted to a Map for future lookups. I then map this broadcast variable. This is the map that is few GB. Is there a way to do collectAsMap() in a more efficient manner or to replace it with some other call?
val result_paired_rdd = prods_user_flattened.collectAsMap()
sc.broadcast(result_paired_rdd)
I get the following error. I also tried the following param: --executor-memory 7G with spark-submit command.
15/08/31 08:29:51 INFO BlockManagerInfo: Removed taskresult_48 on host3:48924 in memory (size: 11.4 MB, free: 3.6 GB)
15/08/31 08:29:51 INFO BlockManagerInfo: Added taskresult_50 in memory on host3:48924 (size: 11.6 MB, free: 3.6 GB)
15/08/31 08:29:52 INFO BlockManagerInfo: Added taskresult_51 in memory on host2:60182 (size: 11.6 MB, free: 3.6 GB)
15/08/31 08:30:02 ERROR Utils: Uncaught exception in thread task-result-getter-0
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at com.esotericsoftware.kryo.io.Input.readString(Input.java:448)
at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157)
at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:146)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:173)
at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:621)
at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:379)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

From the logs it looks like the driver is running out of memory.
For certain actions like collect, rdd data from all workers is transferred to the driver JVM.
Check your driver JVM settings
Avoid collecting so much data onto driver JVM

Apache Spark spilling to disk

When running my program locally on a 16Gb MBP I get the following occurrences:
15/04/10 20:07:50 INFO BlockManagerMaster: Updated info of block rdd_12_3
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/04/10 20:07:50 INFO BlockManagerInfo: Added rdd_12_6 in memory on 192.168.1.4:60005 (size: 854.0 KB, free: 682.9 MB)
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 8 non-empty blocks out of 8 blocks
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/04/10 20:07:50 INFO BlockManagerMaster: Updated info of block rdd_12_6
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 8 non-empty blocks out of 8 blocks
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 67 spilling in-memory batch of 7.9 MB to disk (1 times so far)
15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (1 times so far)
15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 8.0 MB to disk (1 times so far)
15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (2 timess so far)
15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory batch of 5.8 MB to disk (1 times so far)
15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 67 spilling in-memory batch of 5.2 MB to disk (2 timess so far)
15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 5.6 MB to disk (2 timess so far)
15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (3 timess so far)
15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory batch of 5.0 MB to disk (2 timess so far)
15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 61 spilling in-memory batch of 24.3 MB to disk (1 times so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 67 spilling in-memory batch of 5.0 MB to disk (3 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 5.0 MB to disk (3 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (4 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory batch of 5.3 MB to disk (3 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 5.0 MB to disk (4 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.2 MB to disk (5 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 67 spilling in-memory batch of 5.8 MB to disk (4 timess so far)
15/04/10 20:07:53 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory batch of 35.6 MB to disk (1 times so far)
15/04/10 20:07:53 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory batch of 5.0 MB to disk (4 timess so far)
15/04/10 20:07:53 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 5.0 MB to disk (5 timess so far)
15/04/10 20:07:53 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (6 timess so far)
15/04/10 20:07:53 INFO MemoryStore: ensureFreeSpace(872616) called with curMem=1345765155, maxMem=2061647216
15/04/10 20:07:53 INFO MemoryStore: Block rdd_12_2 stored as values in memory (estimated size 852.2 KB, free 681.9 MB)
15/04/10 20:07:53 INFO BlockManagerInfo: Added rdd_12_2 in memory on 192.168.1.4:60005 (size: 852.2 KB, free: 682.0 MB)
15/04/10 20:07:53 INFO BlockManagerMaster: Updated info of block rdd_12_2
My understanding is, is it has free memory, most of the memory is free in fact; given by:
15/04/10 20:07:50 INFO BlockManagerInfo: Added rdd_12_6 in memory on 192.168.1.4:60005 (size: 854.0 KB, free: 682.9 MB)
And yet it is spilling to disk? I'm using a ~265Mb dataset, so it really shouldn't need to be spilled to disk?
For what it's worth:
15/04/10 20:06:50 INFO MemoryStore: MemoryStore started with capacity 1966.1 MB
With all this spilling to disk it's taking ~5 minutes to run through my program once.
Why is this occurring?

I found that one of my columns had nulls throughout causing a skew which resulted in constant spills.

There are different memory arenas in play. For caching Spark uses spark.storage.memoryFraction (defaults to 60%) of the heap. This is what most of the "free memory" messages are about. It uses spark.shuffle.memoryFraction (defaults to 20%) of the heap for shuffle. I think this is what the spill messages are about. You can disable shuffle spill entirely by setting spark.shuffle.spill to false (defaults to true).
I don't know if this explains all of what you are seeing. See http://spark.apache.org/docs/latest/configuration.html for the description of all such parameters.

Using addFile with pipe on a yarn cluster

I've been using pyspark with my YARN cluster with success. The work I'm
doing involves using the RDD's pipe command to send data through a binary
I've made. I can do this easily in pyspark like so (assuming 'sc' is
already defined):
sc.addFile("./dumb_prog")
t= sc.parallelize(range(10))
t.pipe("dumb_prog")
t.take(10) # Gives expected result
However, if I do the same thing in Scala, the pipe command gets a 'Cannot
run program "dumb_prog": error=2, No such file or directory' error. Here's
the code in the Scala shell:
sc.addFile("./dumb_prog")
val t = sc.parallelize(0 until 10)
val u = t.pipe("dumb_prog")
u.take(10)
Why does this only work in Python and not in Scala? Is there a way I can
get it to work in Scala?
Here is the full error message from the scala side:
[59/3965]
14/09/29 13:07:47 INFO SparkContext: Starting job: take at <console>:17
14/09/29 13:07:47 INFO DAGScheduler: Got job 3 (take at <console>:17) with 1
output partitions (allowLocal=true)
14/09/29 13:07:47 INFO DAGScheduler: Final stage: Stage 3(take at
<console>:17)
14/09/29 13:07:47 INFO DAGScheduler: Parents of final stage: List()
14/09/29 13:07:47 INFO DAGScheduler: Missing parents: List()
14/09/29 13:07:47 INFO DAGScheduler: Submitting Stage 3 (PipedRDD[3] at pipe
at <console>:14), which has no missing parents
14/09/29 13:07:47 INFO MemoryStore: ensureFreeSpace(2136) called with
curMem=7453, maxMem=278302556
14/09/29 13:07:47 INFO MemoryStore: Block broadcast_3 stored as values in
memory (estimated size 2.1 KB, free 265.4 MB)
14/09/29 13:07:47 INFO MemoryStore: ensureFreeSpace(1389) called with
curMem=9589, maxMem=278302556
14/09/29 13:07:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes
in memory (estimated size 1389.0 B, free 265.4 MB)
14/09/29 13:07:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
on 10.10.0.20:37574 (size: 1389.0 B, free: 265.4 MB)
14/09/29 13:07:47 INFO BlockManagerMaster: Updated info of block
broadcast_3_piece0
14/09/29 13:07:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 3
(PipedRDD[3] at pipe at <console>:14)
14/09/29 13:07:47 INFO YarnClientClusterScheduler: Adding task set 3.0 with
1 tasks
14/09/29 13:07:47 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
6, SERVERNAME, PROCESS_LOCAL, 1201 bytes)
14/09/29 13:07:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
on SERVERNAME:57118 (size: 1389.0 B, free: 530.3 MB)
14/09/29 13:07:47 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6,
SERVERNAME): java.io.IOException: Cannot run program "dumb_prog": error=2,
No such file or directory
java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)

I ran into a similar issue in spark 1.3.0 in Yarn client mode. When I look in the app cache directory, the file never gets pushed to the executors even when using --files. But when I added the below, it did push to each executor:
sc.addFile("dumb_prog",true)
t.pipe("./dumb_prog")
I think it is a bug, but the above got me past the issue.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

pyspark files loading errors shows - pyspark

Related

RDD.map function hangs in Spark

UnknownHostException with Mesos + Spark and custom Jar

Spark: java.lang.OutOfMemoryError: GC overhead limit exceeded

Apache Spark spilling to disk

Using addFile with pipe on a yarn cluster

Categories

Resources