Apache Spark spilling to disk - scala

When running my program locally on a 16Gb MBP I get the following occurrences:
15/04/10 20:07:50 INFO BlockManagerMaster: Updated info of block rdd_12_3
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/04/10 20:07:50 INFO BlockManagerInfo: Added rdd_12_6 in memory on 192.168.1.4:60005 (size: 854.0 KB, free: 682.9 MB)
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 8 non-empty blocks out of 8 blocks
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/04/10 20:07:50 INFO BlockManagerMaster: Updated info of block rdd_12_6
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 8 non-empty blocks out of 8 blocks
15/04/10 20:07:50 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 67 spilling in-memory batch of 7.9 MB to disk (1 times so far)
15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (1 times so far)
15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 8.0 MB to disk (1 times so far)
15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (2 timess so far)
15/04/10 20:07:50 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory batch of 5.8 MB to disk (1 times so far)
15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 67 spilling in-memory batch of 5.2 MB to disk (2 timess so far)
15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 5.6 MB to disk (2 timess so far)
15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (3 timess so far)
15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory batch of 5.0 MB to disk (2 timess so far)
15/04/10 20:07:51 INFO ExternalAppendOnlyMap: Thread 61 spilling in-memory batch of 24.3 MB to disk (1 times so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 67 spilling in-memory batch of 5.0 MB to disk (3 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 5.0 MB to disk (3 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (4 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory batch of 5.3 MB to disk (3 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 5.0 MB to disk (4 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.2 MB to disk (5 timess so far)
15/04/10 20:07:52 INFO ExternalAppendOnlyMap: Thread 67 spilling in-memory batch of 5.8 MB to disk (4 timess so far)
15/04/10 20:07:53 INFO ExternalAppendOnlyMap: Thread 63 spilling in-memory batch of 35.6 MB to disk (1 times so far)
15/04/10 20:07:53 INFO ExternalAppendOnlyMap: Thread 65 spilling in-memory batch of 5.0 MB to disk (4 timess so far)
15/04/10 20:07:53 INFO ExternalAppendOnlyMap: Thread 66 spilling in-memory batch of 5.0 MB to disk (5 timess so far)
15/04/10 20:07:53 INFO ExternalAppendOnlyMap: Thread 95 spilling in-memory batch of 5.0 MB to disk (6 timess so far)
15/04/10 20:07:53 INFO MemoryStore: ensureFreeSpace(872616) called with curMem=1345765155, maxMem=2061647216
15/04/10 20:07:53 INFO MemoryStore: Block rdd_12_2 stored as values in memory (estimated size 852.2 KB, free 681.9 MB)
15/04/10 20:07:53 INFO BlockManagerInfo: Added rdd_12_2 in memory on 192.168.1.4:60005 (size: 852.2 KB, free: 682.0 MB)
15/04/10 20:07:53 INFO BlockManagerMaster: Updated info of block rdd_12_2
My understanding is, is it has free memory, most of the memory is free in fact; given by:
15/04/10 20:07:50 INFO BlockManagerInfo: Added rdd_12_6 in memory on 192.168.1.4:60005 (size: 854.0 KB, free: 682.9 MB)
And yet it is spilling to disk? I'm using a ~265Mb dataset, so it really shouldn't need to be spilled to disk?
For what it's worth:
15/04/10 20:06:50 INFO MemoryStore: MemoryStore started with capacity 1966.1 MB
With all this spilling to disk it's taking ~5 minutes to run through my program once.
Why is this occurring?

I found that one of my columns had nulls throughout causing a skew which resulted in constant spills.

There are different memory arenas in play. For caching Spark uses spark.storage.memoryFraction (defaults to 60%) of the heap. This is what most of the "free memory" messages are about. It uses spark.shuffle.memoryFraction (defaults to 20%) of the heap for shuffle. I think this is what the spill messages are about. You can disable shuffle spill entirely by setting spark.shuffle.spill to false (defaults to true).
I don't know if this explains all of what you are seeing. See http://spark.apache.org/docs/latest/configuration.html for the description of all such parameters.

Related

Scala/Spark Select Column very slow

I am new to scala/spark (about a week now)
The following code is being run on my 8 core laptop, 64bit, Win10
The dataframe has 1700 rows.
ONE select takes over ten seconds.
Watching the console shows the main hang is at this point:
17/09/02 12:23:46 INFO FileSourceStrategy: Pruning directories with:
The Code
{
val major:String =name.substring(0,name.indexOf("_SCORE"))+"_idx1"
println(major)
val majors = dfMergedDroppedDeleted
.select(col(major))
.collect().toSeq
println(s"got majors ${majors.size}")
}
This should take milliseconds (based on experience with hibernate,r,mysql etc)
I am assuming there is something wrong with my configuration of spark?
Any suggestions would be most welcome.
The full console output up to the hang is as follows:
1637_1636_1716_idx1
1637_1636_1716_idx2
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 765
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 763
17/09/02 12:23:08 INFO BlockManagerInfo: Removed broadcast_51_piece0 on 192.168.0.13:62246 in memory (size: 113.7 KB, free: 901.6 MB)
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 761
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 764
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 762
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 766
17/09/02 12:23:08 INFO BlockManagerInfo: Removed broadcast_50_piece0 on 192.168.0.13:62246 in memory (size: 20.7 KB, free: 901.6 MB)
17/09/02 12:23:08 INFO FileSourceStrategy: Pruning directories with:
Putting the dataframe in cache makes a big difference.
val dfMergedDroppedDeletedCached:DataFrame=dfMergedDroppedDeleted.cache()
However, the caching process itself is slow, so this only pays off if you are performing multiple operations
UPDATE
Credit Ramesh Maharjan to who wrote in a comment:
the time consuming part is not select. select is distributed in nature and would be executed in every local data in executors. The time consuming part is the collect. Collect function collects all the data in the driver node. And that takes a lot of time. Thats why collect is always recommended not to be used and if necessary to use it the minimum.
I have changed the query to be as follows:
val majorstr:String = dfMergedDroppedDeletedCached.filter(dfMergedDroppedDeletedCached(major).isNotNull)
.select(col(major))
.limit(1)
.first().getString(0)
Not exactly Oracle speeds but much faster than using collect

RDD.map function hangs in Spark

I am trying to concatenate two spark dataframes of equal length like -
DF1 -
| A |
| 1 |
| 2 |
| 3 |
| 4 |
DF2 -
| B |
| a |
| b |
| c |
| d |
Result DF -
| A | B |
| 1 | a |
| 2 | b |
| 3 | c |
| 4 | d |
For this, I am using below code -
val combinedRow = df1.rdd.zip(df2.select("B").rdd). map({
case (df1Data, df2Data) => {
Row.fromSeq(df1Data.toSeq ++ df2Data.toSeq)
}
})
val combinedschema = StructType(df1.schema.fields ++ df2.select("B").schema.fields)
val resultDF = spark.sqlContext.createDataFrame(combinedRow, combinedschema)
But the code is not making any progress. Its not showing any exception also. Its just stuck.
Any suggestions what may be wrong here ? Thanks in advance.
EDIT -
Logs generated after successful execution of the latest statement.
[main] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 13.848847 ms
[broadcast-exchange-0] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 14.323824 ms
[broadcast-exchange-0] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_35 stored as values in memory (estimated size 1024.1 KB, free 871.5 MB)
[broadcast-exchange-0] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_35_piece0 stored as bytes in memory (estimated size 417.0 B, free 871.5 MB)
[dispatcher-event-loop-3] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_35_piece0 in memory on 192.168.20.181:38202 (size: 417.0 B, free: 872.9 MB)
[broadcast-exchange-0] INFO org.apache.spark.SparkContext - Created broadcast 35 from run at ThreadPoolExecutor.java:1142
[main] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 27.697751 ms
[main] INFO org.apache.spark.SparkContext - Starting job: show at Train.scala:180
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 19 (show at Train.scala:180) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 31 (show at Train.scala:180)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 31 (MapPartitionsRDD[106] at show at Train.scala:180), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_36 stored as values in memory (estimated size 14.3 KB, free 871.5 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_36_piece0 stored as bytes in memory (estimated size 6.4 KB, free 871.5 MB)
[dispatcher-event-loop-2] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_36_piece0 in memory on 192.168.20.181:38202 (size: 6.4 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 36 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 31 (MapPartitionsRDD[106] at show at Train.scala:180)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 31.0 with 1 tasks
[dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 31.0 (TID 1267, localhost, executor driver, partition 0, PROCESS_LOCAL, 5961 bytes)
[Executor task launch worker for task 1267] INFO org.apache.spark.executor.Executor - Running task 0.0 in stage 31.0 (TID 1267)
[Executor task launch worker for task 1267] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 32.758147 ms
[Executor task launch worker for task 1267] INFO org.apache.spark.SparkContext - Starting job: head at Train.scala:161
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 20 (head at Train.scala:161) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 32 (head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 32 (MapPartitionsRDD[110] at head at Train.scala:161), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_37 stored as values in memory (estimated size 26.9 KB, free 871.4 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_37_piece0 stored as bytes in memory (estimated size 12.7 KB, free 871.4 MB)
[dispatcher-event-loop-3] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_37_piece0 in memory on 192.168.20.181:38202 (size: 12.7 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 37 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 32 (MapPartitionsRDD[110] at head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 32.0 with 1 tasks
[dispatcher-event-loop-2] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 32.0 (TID 1268, localhost, executor driver, partition 0, PROCESS_LOCAL, 5813 bytes)
[Executor task launch worker for task 1268] INFO org.apache.spark.executor.Executor - Running task 0.0 in stage 32.0 (TID 1268)
[Executor task launch worker for task 1268] INFO org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD - closed connection
[Executor task launch worker for task 1268] INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 32.0 (TID 1268). 1979 bytes result sent to driver
[task-result-getter-3] INFO org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 32.0 (TID 1268) in 132 ms on localhost (executor driver) (1/1)
[task-result-getter-3] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 32.0, whose tasks have all completed, from pool
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - ResultStage 32 (head at Train.scala:161) finished in 0.128 s
[Executor task launch worker for task 1267] INFO org.apache.spark.scheduler.DAGScheduler - Job 20 finished: head at Train.scala:161, took 0.140223 s
[Executor task launch worker for task 1267] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 8.366053 ms
[Executor task launch worker for task 1267] INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 31.0 (TID 1267). 1501 bytes result sent to driver
[task-result-getter-0] INFO org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 31.0 (TID 1267) in 393 ms on localhost (executor driver) (1/1)
[task-result-getter-0] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 31.0, whose tasks have all completed, from pool
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - ResultStage 31 (show at Train.scala:180) finished in 0.393 s
[main] INFO org.apache.spark.scheduler.DAGScheduler - Job 19 finished: show at Train.scala:180, took 0.413534 s
[main] INFO org.apache.spark.SparkContext - Starting job: show at Train.scala:180
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 21 (show at Train.scala:180) with 4 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 33 (show at Train.scala:180)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 33 (MapPartitionsRDD[106] at show at Train.scala:180), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_38 stored as values in memory (estimated size 14.3 KB, free 871.4 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_38_piece0 stored as bytes in memory (estimated size 6.4 KB, free 871.4 MB)
[dispatcher-event-loop-2] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_38_piece0 in memory on 192.168.20.181:38202 (size: 6.4 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 38 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 4 missing tasks from ResultStage 33 (MapPartitionsRDD[106] at show at Train.scala:180)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 33.0 with 4 tasks
[dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 33.0 (TID 1269, localhost, executor driver, partition 1, PROCESS_LOCAL, 5961 bytes)
[dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 1.0 in stage 33.0 (TID 1270, localhost, executor driver, partition 2, PROCESS_LOCAL, 5961 bytes)
[dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 2.0 in stage 33.0 (TID 1271, localhost, executor driver, partition 3, PROCESS_LOCAL, 5961 bytes)
[dispatcher-event-loop-0] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 3.0 in stage 33.0 (TID 1272, localhost, executor driver, partition 4, PROCESS_LOCAL, 5961 bytes)
[Executor task launch worker for task 1269] INFO org.apache.spark.executor.Executor - Running task 0.0 in stage 33.0 (TID 1269)
[Executor task launch worker for task 1271] INFO org.apache.spark.executor.Executor - Running task 2.0 in stage 33.0 (TID 1271)
[Executor task launch worker for task 1272] INFO org.apache.spark.executor.Executor - Running task 3.0 in stage 33.0 (TID 1272)
[Executor task launch worker for task 1270] INFO org.apache.spark.executor.Executor - Running task 1.0 in stage 33.0 (TID 1270)
[Executor task launch worker for task 1269] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 55.127045 ms
[Executor task launch worker for task 1271] INFO org.apache.spark.SparkContext - Starting job: head at Train.scala:161
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 22 (head at Train.scala:161) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 34 (head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 34 (MapPartitionsRDD[117] at head at Train.scala:161), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_39 stored as values in memory (estimated size 26.9 KB, free 871.4 MB)
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned shuffle 10
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 31267
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 31268
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25303
[dispatcher-event-loop-3] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_34_piece0 on 192.168.20.181:38202 in memory (size: 22.8 KB, free: 872.9 MB)
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25298
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25304
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 31269
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned shuffle 11
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25299
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25301
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25300
[dispatcher-event-loop-2] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_37_piece0 on 192.168.20.181:38202 in memory (size: 12.7 KB, free: 872.9 MB)
[dispatcher-event-loop-0] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_33_piece0 on 192.168.20.181:38202 in memory (size: 22.6 KB, free: 872.9 MB)
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25305
[dispatcher-event-loop-3] INFO org.apache.spark.storage.BlockManagerInfo - Removed broadcast_36_piece0 on 192.168.20.181:38202 in memory (size: 6.4 KB, free: 873.0 MB)
[Spark Context Cleaner] INFO org.apache.spark.ContextCleaner - Cleaned accumulator 25302
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_39_piece0 stored as bytes in memory (estimated size 12.7 KB, free 871.6 MB)
[dispatcher-event-loop-0] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_39_piece0 in memory on 192.168.20.181:38202 (size: 12.7 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 39 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 34 (MapPartitionsRDD[117] at head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 34.0 with 1 tasks
[Executor task launch worker for task 1272] INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - Code generated in 92.57204 ms
[Executor task launch worker for task 1269] INFO org.apache.spark.SparkContext - Starting job: head at Train.scala:161
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 23 (head at Train.scala:161) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 35 (head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 35 (MapPartitionsRDD[122] at head at Train.scala:161), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_40 stored as values in memory (estimated size 26.9 KB, free 871.6 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_40_piece0 stored as bytes in memory (estimated size 12.7 KB, free 871.5 MB)
[dispatcher-event-loop-1] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_40_piece0 in memory on 192.168.20.181:38202 (size: 12.7 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 40 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 35 (MapPartitionsRDD[122] at head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 35.0 with 1 tasks
[Executor task launch worker for task 1270] INFO org.apache.spark.SparkContext - Starting job: head at Train.scala:161
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 24 (head at Train.scala:161) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 36 (head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 36 (MapPartitionsRDD[124] at head at Train.scala:161), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_41 stored as values in memory (estimated size 26.9 KB, free 871.5 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_41_piece0 stored as bytes in memory (estimated size 12.7 KB, free 871.5 MB)
[dispatcher-event-loop-0] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_41_piece0 in memory on 192.168.20.181:38202 (size: 12.7 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 41 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 36 (MapPartitionsRDD[124] at head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 36.0 with 1 tasks
[Executor task launch worker for task 1272] INFO org.apache.spark.SparkContext - Starting job: head at Train.scala:161
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 25 (head at Train.scala:161) with 1 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 37 (head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 37 (MapPartitionsRDD[126] at head at Train.scala:161), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_42 stored as values in memory (estimated size 26.9 KB, free 871.5 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.memory.MemoryStore - Block broadcast_42_piece0 stored as bytes in memory (estimated size 12.7 KB, free 871.5 MB)
[dispatcher-event-loop-1] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_42_piece0 in memory on 192.168.20.181:38202 (size: 12.7 KB, free: 872.9 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 42 from broadcast at DAGScheduler.scala:996
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 37 (MapPartitionsRDD[126] at head at Train.scala:161)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 37.0 with 1 tasks
And its stuck here.
I can't replicate any error in your code it should work fine in your case as well.
You can also simply join two dataframes with a id assigned to both dataframes
df1.withColumn("id", monotonically_increasing_id())
.join(df2.withColumn("id", monotonically_increasing_id()), "id").drop("id")
Hope this helps!
You can also use zipWithIndex on both RDDs:
val df1 = sc.parallelize(Seq("A", "1", "2", "3", "4")).toDF("A")
val df2 = sc.parallelize(Seq("B", "a", "b", "c", "d")).toDF("B")
val zip1 = df1.rdd.zipWithIndex.map { case (k, v) => (v, k.mkString)}
val zip2 = df2.rdd.zipWithIndex.map { case (k, v) => (v, k.mkString)}
zip1.join(zip2).map{ case (k, v) => v }.collect()

pyspark files loading errors shows

I am trying to load file to my spark and here is my code.
lines = sc.textFile('file:///Users/zhangqing198573/data/weblog_lab.csv')
and the error shows
17/03/04 22:16:32 INFO MemoryStore: Block broadcast_0 stored as values
in memory (estimated size 127.4 KB, free 127.4 KB) 17/03/04 22:16:32
INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory
(estimated size 13.9 KB, free 141.3 KB) 17/03/04 22:16:32 INFO
BlockManagerInfo: Added broadcast_0_piece0 in memory on
localhost:53383 (size: 13.9 KB, free: 511.1 MB) 17/03/04 22:16:32 INFO
SparkContext: Created broadcast 0 from textFile at
NativeMethodAccessorImpl.java:-2.
I dont know whats wrong

Spark: java.lang.OutOfMemoryError: GC overhead limit exceeded

I have the following code to converts the I read the data from my input files and create a pairedrdd, which is then converted to a Map for future lookups. I then map this broadcast variable. This is the map that is few GB. Is there a way to do collectAsMap() in a more efficient manner or to replace it with some other call?
val result_paired_rdd = prods_user_flattened.collectAsMap()
sc.broadcast(result_paired_rdd)
I get the following error. I also tried the following param: --executor-memory 7G with spark-submit command.
15/08/31 08:29:51 INFO BlockManagerInfo: Removed taskresult_48 on host3:48924 in memory (size: 11.4 MB, free: 3.6 GB)
15/08/31 08:29:51 INFO BlockManagerInfo: Added taskresult_50 in memory on host3:48924 (size: 11.6 MB, free: 3.6 GB)
15/08/31 08:29:52 INFO BlockManagerInfo: Added taskresult_51 in memory on host2:60182 (size: 11.6 MB, free: 3.6 GB)
15/08/31 08:30:02 ERROR Utils: Uncaught exception in thread task-result-getter-0
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at com.esotericsoftware.kryo.io.Input.readString(Input.java:448)
at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157)
at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:146)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:173)
at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:621)
at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:379)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
From the logs it looks like the driver is running out of memory.
For certain actions like collect, rdd data from all workers is transferred to the driver JVM.
Check your driver JVM settings
Avoid collecting so much data onto driver JVM

Apache spark message understanding

Request help to understand this message..
INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 2 is **2202921** bytes
what does 2202921 mean here?
My job does a shuffle operation and while reading shuffle files from previous stage, it gives the message first and then after sometime it fails with below error:
14/11/12 11:09:46 WARN scheduler.TaskSetManager: Lost task 224.0 in stage 4.0 (TID 13938, ip-xx-xxx-xxx-xx.ec2.internal): FetchFailed(BlockManagerId(11, ip-xx-xxx-xxx-xx.ec2.internal, 48073, 0), shuffleId=2, mapId=7468, reduceId=224)
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Marking Stage 4 (coalesce at <console>:49) as failed due to a fetch failure from Stage 3 (map at <console>:42)
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Stage 4 (coalesce at <console>:49) failed in 213.446 s
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Resubmitting Stage 3 (map at <console>:42) and Stage 4 (coalesce at <console>:49) due to fetch failure
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Executor lost: 11 (epoch 2)
14/11/12 11:09:46 INFO storage.BlockManagerMasterActor: Trying to remove executor 11 from BlockManagerMaster.
14/11/12 11:09:46 INFO storage.BlockManagerMaster: Removed 11 successfully in removeExecutor
14/11/12 11:09:46 INFO scheduler.Stage: Stage 3 is now unavailable on executor 11 (11893/12836, false)
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Resubmitting failed stages
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Submitting Stage 3 (MappedRDD[13] at map at <console>:42), which has no missing parents
14/11/12 11:09:46 INFO storage.MemoryStore: ensureFreeSpace(25472) called with curMem=474762, maxMem=11113699737
14/11/12 11:09:46 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 24.9 KB, free 10.3 GB)
14/11/12 11:09:46 INFO storage.MemoryStore: ensureFreeSpace(5160) called with curMem=500234, maxMem=11113699737
14/11/12 11:09:46 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 5.0 KB, free 10.3 GB)
14/11/12 11:09:46 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on ip-xx.ec2.internal:35571 (size: 5.0 KB, free: 10.4 GB)
14/11/12 11:09:46 INFO storage.BlockManagerMaster: Updated info of block broadcast_6_piece0
14/11/12 11:09:46 INFO scheduler.DAGScheduler: Submitting 943 missing tasks from Stage 3 (MappedRDD[13] at map at <console>:42)
14/11/12 11:09:46 INFO cluster.YarnClientClusterScheduler: Adding task set 3.1 with 943 tasks
My code looks like this,
(rdd1 ++ rdd2).map { t => ((t.id), t) }.groupByKey(1280).map {
case ((id), sequence) =>
val newrecord = sequence.maxBy {
case Fact(id, key, type, day, group, c_key, s_key, plan_id,size,
is_mom, customer_shipment_id, customer_shipment_item_id, asin, company_key, product_line_key, dw_last_updated, measures) => dw_last_updated.toLong
}
((PARTITION_KEY + "=" + newrecord.day.toString + "/part"), (newrecord))
}.coalesce(2048,true).saveAsTextFile("s3://myfolder/PT/test20nodes/")```
I derived 1280 as I have 20 nodes each having 32 cores. I derived it like 2*32*20.
For a Shuffle stage, it will create some ShuffleMapTasks which output the intermediate results to the disk. The location information will be stored in MapStatuses and sent to the MapOutputTrackerMaster(the driver).
Then when the next stage starts to run, it needs these location statuses. So executors will ask MapOutputTrackerMaster to fetch them. MapOutputTrackerMaster will serialize these status to bytes and send them to executors. Here is the size of these status in bytes.
These status will be sent via Akka. And Akka has a limitation to the max message size. You can set it via spark.akka.frameSize:
Maximum message size to allow in "control plane" communication (for serialized tasks and task results), in MB. Increase this if your tasks need to send back large results to the driver (e.g. using collect() on a large dataset).
If the size is greater than spark.akka.frameSize, Akka will refuse to deliver the message and your job will fail. Therefore it can help you adjust spark.akka.frameSize to a best one.