Spark: java.lang.OutOfMemoryError: GC overhead limit exceeded - scala

I read the data from my input files and create a paired RDD, which is then converted to a Map for future lookups. I then broadcast this map. The map is a few GB in size. Is there a way to do collectAsMap() in a more efficient manner or to replace it with some other call?
val result_paired_rdd = prods_user_flattened.collectAsMap()
sc.broadcast(result_paired_rdd)
I get the following error. I also tried passing --executor-memory 7G to the spark-submit command.
15/08/31 08:29:51 INFO BlockManagerInfo: Removed taskresult_48 on host3:48924 in memory (size: 11.4 MB, free: 3.6 GB)
15/08/31 08:29:51 INFO BlockManagerInfo: Added taskresult_50 in memory on host3:48924 (size: 11.6 MB, free: 3.6 GB)
15/08/31 08:29:52 INFO BlockManagerInfo: Added taskresult_51 in memory on host2:60182 (size: 11.6 MB, free: 3.6 GB)
15/08/31 08:30:02 ERROR Utils: Uncaught exception in thread task-result-getter-0
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at com.esotericsoftware.kryo.io.Input.readString(Input.java:448)
at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:157)
at com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.read(DefaultSerializers.java:146)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:173)
at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
at org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:621)
at org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:379)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:51)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:50)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

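From the logs it looks like the driver is running out of memory: the exception is thrown in the task-result-getter thread, which runs on the driver and deserializes task results there.
For certain actions like collect, RDD data from all workers is transferred to the driver JVM. Note that --executor-memory does not change the driver's heap.
Check your driver JVM settings (for example, --driver-memory with spark-submit).
Avoid collecting so much data onto the driver JVM.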
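One way to avoid collecting a multi-GB map on the driver is to keep the lookup data distributed and join against it instead of broadcasting it. The following is only a minimal sketch under assumptions: prodsUserFlattened stands in for the question's prods_user_flattened, while the events RDD, its key and value types, and the sample data are hypothetical and not taken from the original code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object JoinInsteadOfBroadcast {
  def main(args: Array[String]): Unit = {
    // local[*] is used only so the sketch can run on a single machine.
    val conf = new SparkConf().setAppName("join-instead-of-broadcast").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Stand-in for prods_user_flattened from the question (hypothetical contents).
    val prodsUserFlattened: RDD[(String, String)] =
      sc.parallelize(Seq("p1" -> "u1", "p2" -> "u2"))

    // Hypothetical RDD whose records need the lookup, keyed the same way.
    val events: RDD[(String, Int)] =
      sc.parallelize(Seq("p1" -> 10, "p2" -> 20, "p3" -> 30))

    // The lookup is done by a shuffle join on the executors,
    // so the full map is never materialized on the driver.
    val enriched: RDD[(String, (Int, Option[String]))] =
      events.leftOuterJoin(prodsUserFlattened)

    // Only a small sample is brought back to the driver.
    enriched.take(10).foreach(println)

    sc.stop()
  }
}
```

The join costs a shuffle, so it is usually slower per lookup than a broadcast map, but it removes the driver as the bottleneck. If the map must be broadcast, the driver heap has to be large enough to hold it, which is what --driver-memory controls.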

Related

Losing executors when saving parquet file

I have loaded a dataset of roughly 20 GB. The cluster has about 1 TB available, so memory shouldn't be an issue.
It is no problem for me to save the original data which consists only of strings:
df_data.write.parquet(os.path.join(DATA_SET_BASE, 'concatenated.parquet'), mode='overwrite')
However, as I transform the data:
df_transformed = df_data.drop('bri').join(
    df_data[['docId', 'bri']].rdd\
        .map(lambda x: (x.docId, json.loads(x.bri))
             if x.bri is not None else (x.docId, dict()))\
        .toDF()\
        .withColumnRenamed('_1', 'docId')\
        .withColumnRenamed('_2', 'bri'),
    ['docId']
)
and then save it:
df_transformed.write.parquet(os.path.join(DATA_SET_BASE, 'concatenated.parquet'), mode='overwrite')
The log output will tell me that the memory limit was exceeded:
18/03/08 10:23:09 WARN TaskSetManager: Lost task 17.0 in stage 18.3 (TID 2866, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/03/08 10:23:09 WARN TaskSetManager: Lost task 29.0 in stage 18.3 (TID 2878, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/03/08 10:23:09 WARN TaskSetManager: Lost task 65.0 in stage 18.3 (TID 2914, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I'm not quite sure what the problem is. Even setting the executor's memory to 60GB RAM each does not solve the problem.
So, obviously the problem comes with the transformation. Any idea what exactly causes this problem?

Scala/Spark Select Column very slow

I am new to scala/spark (about a week now)
The following code is being run on my 8 core laptop, 64bit, Win10
The dataframe has 1700 rows.
ONE select takes over ten seconds.
Watching the console shows the main hang is at this point:
17/09/02 12:23:46 INFO FileSourceStrategy: Pruning directories with:
The Code
{
  val major: String = name.substring(0, name.indexOf("_SCORE")) + "_idx1"
  println(major)
  val majors = dfMergedDroppedDeleted
    .select(col(major))
    .collect().toSeq
  println(s"got majors ${majors.size}")
}
This should take milliseconds (based on experience with Hibernate, R, MySQL, etc.).
I am assuming there is something wrong with my configuration of spark?
Any suggestions would be most welcome.
The full console output up to the hang is as follows:
1637_1636_1716_idx1
1637_1636_1716_idx2
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 765
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 763
17/09/02 12:23:08 INFO BlockManagerInfo: Removed broadcast_51_piece0 on 192.168.0.13:62246 in memory (size: 113.7 KB, free: 901.6 MB)
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 761
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 764
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 762
17/09/02 12:23:08 INFO ContextCleaner: Cleaned accumulator 766
17/09/02 12:23:08 INFO BlockManagerInfo: Removed broadcast_50_piece0 on 192.168.0.13:62246 in memory (size: 20.7 KB, free: 901.6 MB)
17/09/02 12:23:08 INFO FileSourceStrategy: Pruning directories with:
Putting the dataframe in cache makes a big difference.
val dfMergedDroppedDeletedCached:DataFrame=dfMergedDroppedDeleted.cache()
However, the caching process itself is slow, so this only pays off if you are performing multiple operations.
UPDATE
Credit to Ramesh Maharjan, who wrote in a comment:
The time-consuming part is not select. select is distributed in nature and is executed on the local data in every executor. The time-consuming part is the collect. The collect function gathers all the data on the driver node, and that takes a lot of time. That's why it is recommended to avoid collect and, if it is necessary, to use it as little as possible.
I have changed the query to be as follows:
val majorstr: String = dfMergedDroppedDeletedCached
  .filter(dfMergedDroppedDeletedCached(major).isNotNull)
  .select(col(major))
  .limit(1)
  .first().getString(0)
Not exactly Oracle speeds but much faster than using collect

Not enough space to cache broadcast in memory

I use HDInsight and I run an iterative algorithm with Spark over a binary tree. I broadcast in each node and execute only one action, isEmpty(), in the root node.
For a large graph I run into the following error:
17/05/08 00:05:36 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block broadcast_1066 in memory.
17/05/08 00:05:36 WARN MemoryStore: Not enough space to cache broadcast_1066 in memory! (computed 384.0 B so far)
17/05/08 00:05:36 INFO MemoryStore: Memory use = 479.2 KB (blocks) + 0.0 B (scratch space shared across 0 tasks(s)) = 479.2 KB. Storage limit = 564.2 KB.
17/05/08 00:05:36 ERROR Utils: Exception encountered
java.util.NoSuchElementException
at org.apache.spark.util.collection.PrimitiveVector$$anon$1.next(PrimitiveVector.scala:58)
at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:697)
at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:178)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:178)
at scala.Option.map(Option.scala:146)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1276)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:174)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:65)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:65)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:89)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:72)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/08 00:05:36 INFO ExecutionMemoryPool: TID 84793 waiting for at least 1/2N of on-heap execution pool to be free
My broadcast variables aren't large. I run the application on 5 workers, and this error occurs in an executor.
Thanks!

PySpark file loading shows errors

I am trying to load a file into Spark, and here is my code.
lines = sc.textFile('file:///Users/zhangqing198573/data/weblog_lab.csv')
and the error shows
17/03/04 22:16:32 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 127.4 KB)
17/03/04 22:16:32 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 141.3 KB)
17/03/04 22:16:32 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:53383 (size: 13.9 KB, free: 511.1 MB)
17/03/04 22:16:32 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2.
I don't know what's wrong.

Spark SQL + Cassandra: bad performance

I'm just starting to use Spark SQL + Cassandra and am probably missing something important, but one simple query takes ~45 seconds. I'm using the spark-cassandra-connector library and run a local web server that also hosts Spark. My setup is roughly like this:
In sbt:
"org.apache.spark" %% "spark-core" % "1.4.1" excludeAll(ExclusionRule(organization = "org.slf4j")),
"org.apache.spark" %% "spark-sql" % "1.4.1" excludeAll(ExclusionRule(organization = "org.slf4j")),
"com.datastax.spark" %% "spark-cassandra-connector" % "1.4.0-M3" excludeAll(ExclusionRule(organization = "org.slf4j"))
In code I have a singleton that hosts the SparkContext and CassandraSQLContext. It's then called from the servlet. Here's what the singleton code looks like:
object SparkModel {
  val conf =
    new SparkConf()
      .setAppName("core")
      .setMaster("local")
      .set("spark.cassandra.connection.host", "127.0.0.1")
  val sc = new SparkContext(conf)
  val sqlC = new CassandraSQLContext(sc)
  sqlC.setKeyspace("core")
  val df: DataFrame = sqlC.cassandraSql(
    "SELECT email, target_entity_id, target_entity_type " +
    "FROM tracking_events " +
    "LEFT JOIN customers " +
    "WHERE entity_type = 'User' AND entity_id = customer_id")
}
And here is how I use it:
get("/spark") {
  SparkModel.df.collect().map(r => TrackingEvent(r.getString(0), r.getString(1), r.getString(2))).toList
}
Cassandra, Spark and the web app run on the same host in a virtual machine on my MacBook Pro with decent specs. Cassandra queries by themselves take 10-20 milliseconds.
When I call this endpoint for the first time, it takes 70-80 seconds to return the result. Subsequent queries take ~45 seconds. The log of the subsequent operation looks like this:
12:48:50 INFO org.apache.spark.SparkContext - Starting job: collect at V1Servlet.scala:1146
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Got job 1 (collect at V1Servlet.scala:1146) with 1 output partitions (allowLocal=false)
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Final stage: ResultStage 1(collect at V1Servlet.scala:1146)
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Parents of final stage: List()
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Missing parents: List()
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Submitting ResultStage 1 (MapPartitionsRDD[29] at collect at V1Servlet.scala:1146), which has no missing parents
12:48:50 INFO org.apache.spark.storage.MemoryStore - ensureFreeSpace(18696) called with curMem=26661, maxMem=825564856
12:48:50 INFO org.apache.spark.storage.MemoryStore - Block broadcast_1 stored as values in memory (estimated size 18.3 KB, free 787.3 MB)
12:48:50 INFO org.apache.spark.storage.MemoryStore - ensureFreeSpace(8345) called with curMem=45357, maxMem=825564856
12:48:50 INFO org.apache.spark.storage.MemoryStore - Block broadcast_1_piece0 stored as bytes in memory (estimated size 8.1 KB, free 787.3 MB)
12:48:50 INFO o.a.spark.storage.BlockManagerInfo - Added broadcast_1_piece0 in memory on localhost:56289 (size: 8.1 KB, free: 787.3 MB)
12:48:50 INFO org.apache.spark.SparkContext - Created broadcast 1 from broadcast at DAGScheduler.scala:874
12:48:50 INFO o.a.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[29] at collect at V1Servlet.scala:1146)
12:48:50 INFO o.a.s.scheduler.TaskSchedulerImpl - Adding task set 1.0 with 1 tasks
12:48:50 INFO o.a.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 1.0 (TID 1, localhost, NODE_LOCAL, 59413 bytes)
12:48:50 INFO org.apache.spark.executor.Executor - Running task 0.0 in stage 1.0 (TID 1)
12:48:50 INFO com.datastax.driver.core.Cluster - New Cassandra host localhost/127.0.0.1:9042 added
12:48:50 INFO c.d.s.c.cql.CassandraConnector - Connected to Cassandra cluster: Super Cluster
12:49:11 INFO o.a.spark.storage.BlockManagerInfo - Removed broadcast_0_piece0 on localhost:56289 in memory (size: 8.0 KB, free: 787.3 MB)
12:49:35 INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 1.0 (TID 1). 6124 bytes result sent to driver
12:49:35 INFO o.a.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 1.0 (TID 1) in 45199 ms on localhost (1/1)
12:49:35 INFO o.a.s.scheduler.TaskSchedulerImpl - Removed TaskSet 1.0, whose tasks have all completed, from pool
12:49:35 INFO o.a.spark.scheduler.DAGScheduler - ResultStage 1 (collect at V1Servlet.scala:1146) finished in 45.199 s
As you can see from the log, the longest pauses are between these 3 lines (21 + 24 seconds):
12:48:50 INFO c.d.s.c.cql.CassandraConnector - Connected to Cassandra cluster: Super Cluster
12:49:11 INFO o.a.spark.storage.BlockManagerInfo - Removed broadcast_0_piece0 on localhost:56289 in memory (size: 8.0 KB, free: 787.3 MB)
12:49:35 INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 1.0 (TID 1). 6124 bytes result sent to driver
Apparently, I'm doing something wrong. What's that? How can I improve this?
EDIT: Important addition: the size of the tables is tiny (~200 entries for tracking_events, ~20 for customers), so reading them in their entirety into memory shouldn't take any significant time. And it's a local Cassandra installation: no cluster, no networking is involved.
"SELECT email, target_entity_id, target_entity_type " +
"FROM tracking_events " +
"LEFT JOIN customers " +
"WHERE entity_type = 'User' AND entity_id = customer_id")
This query will read all of the data from both the tracking_events and customers tables. I would compare the performance to just doing a SELECT COUNT(*) on both tables. If it is significantly different, then there may be an issue, but my guess is that this is just the amount of time it takes to read both tables entirely into memory.
There are a few knobs for tuning how reads are done, and since the defaults are oriented towards a much bigger dataset, you may want to change these.
spark.cassandra.input.split.size_in_mb: approximate amount of data to be fetched into a Spark partition (default: 64 MB)
spark.cassandra.input.fetch.size_in_rows: number of CQL rows fetched per driver request (default: 1000)
I would make sure you are generating at least as many tasks as you have cores so you can take advantage of all of your resources. To do this, shrink spark.cassandra.input.split.size_in_mb.
The fetch size controls how many rows are paged at a time by an executor core so increasing this can increase speed in some use cases.
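As a rough illustration of where these settings go, they can be added to the SparkConf from the question. This is a sketch, not a tested configuration: the values "1" and "5000" are made-up starting points rather than recommendations, and switching the master to local[*] (so that more than one core is used) is an assumption that the question's setup does not confirm.

```scala
import org.apache.spark.SparkConf

object TunedCassandraConf {
  // Illustrative values only; measure before and after changing them.
  val conf: SparkConf =
    new SparkConf()
      .setAppName("core")
      // Assumption: "local" runs with a single worker thread, local[*] uses all cores.
      .setMaster("local[*]")
      .set("spark.cassandra.connection.host", "127.0.0.1")
      // Smaller splits produce more Spark partitions, hence more tasks.
      .set("spark.cassandra.input.split.size_in_mb", "1")
      // More CQL rows fetched per request to the Cassandra driver.
      .set("spark.cassandra.input.fetch.size_in_rows", "5000")
}
```

With tables of a few hundred rows, these settings may make little difference, so the SELECT COUNT(*) comparison suggested above is still worth running first.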