Spark Caching: RDD Only 8% cached - scala

For my code snippet below:

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val levelsFile = sc.textFile(levelsFilePath)
val levelsSplitedFile = levelsFile.map(line => line.split(fileDelimiter, -1))
val levelPairRddtemp = levelsSplitedFile
  .filter(linearr => linearr(pogIndex).length != 0)
  .map(linearr => (linearr(pogIndex).toLong,
    levelsIndexes.map(x => linearr(x))
      .filter(value => !value.equalsIgnoreCase("") && !value.equalsIgnoreCase(" ") && !value.equalsIgnoreCase("null"))))
  .mapValues(value => value.mkString(","))
  .partitionBy(new HashPartitioner(24))
  .persist(StorageLevel.MEMORY_ONLY_SER)

levelPairRddtemp.count // just to trigger RDD creation
Info
The size of the file is ~ 4G
I am using 2 executors (5 GB each) and 12 cores.
Spark version: 1.5.2
Problem
When I look at the Storage tab in the Spark UI, I see that the RDD is only 8% cached.
Looking inside the RDD, it seems only 2 out of 24 partitions are cached.
Is there any explanation for this behavior, and how can I fix it?
EDIT 1: I just tried with 60 partitions for HashPartitioner as:
..
.partitionBy(new HashPartitioner(60))
..
And it worked: now the entire RDD gets cached. Any guess what might have happened here? Can data skew cause this behavior?
EDIT 2: Logs containing BlockManagerInfo when I ran again with 24 partitions. This time 3 of 24 partitions were cached:
16/03/17 14:15:28 INFO BlockManagerInfo: Added rdd_294_14 in memory on ip-10-1-34-66.ec2.internal:47526 (size: 107.3 MB, free: 2.6 GB)
16/03/17 14:15:30 INFO BlockManagerInfo: Added rdd_294_17 in memory on ip-10-1-34-65.ec2.internal:57300 (size: 107.3 MB, free: 2.6 GB)
16/03/17 14:15:30 INFO BlockManagerInfo: Added rdd_294_21 in memory on ip-10-1-34-65.ec2.internal:57300 (size: 107.4 MB, free: 2.5 GB)

I believe this happens because the memory limit is reached, or, more to the point, the memory options you use don't let your job utilize all of its resources.
Increasing the number of partitions decreases the size of every task (and of every cached block), which might explain the behavior. A sketch of that approach follows.
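
As a minimal sketch of that idea (reusing the question's names levelsFilePath, fileDelimiter, pogIndex and levelsIndexes, which are assumed to be defined elsewhere), you can increase the partition count so each serialized block is small enough to fit in the executors' storage memory:

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

// Filtering of empty/null values omitted for brevity.
val levelPairRdd = sc.textFile(levelsFilePath)
  .map(_.split(fileDelimiter, -1))
  .filter(arr => arr(pogIndex).nonEmpty)
  .map(arr => (arr(pogIndex).toLong, levelsIndexes.map(i => arr(i)).mkString(",")))
  .partitionBy(new HashPartitioner(60))   // 60 instead of 24: smaller blocks per partition
  .persist(StorageLevel.MEMORY_ONLY_SER)  // blocks that do not fit are skipped and recomputed, not spilled

levelPairRdd.count() // materialize the cache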

Related

Facing large data spills for small datasets on spark

I am trying to run some Spark SQL on the NOAA datasets available here:
https://www.ncei.noaa.gov/data/global-summary-of-the-day/access/2021/
I am trying to run a query which involves grouping and sorting.
df
.groupBy("COUNTRY_FULL")
.agg(max("rank"), last("consecutive").as("consecutive"))
.withColumn("maxDays", maxDaysTornodoUdf(col("consecutive")))
.sort(col("maxDays").desc)
.limit(1)
.show()
The input is just 50 MB of zipped CSVs and I am running this locally (4 cores).
These are the settings I use.
spark.driver.memory: 14g
spark.sql.windowExec.buffer.in.memory.threshold: 20000
spark.sql.windowExec.buffer.spill.threshold : 20000
spark.sql.shuffle.partitions : 400
I see too many disk spills for such small data:
21/08/16 10:23:13 INFO UnsafeExternalSorter: Thread 54 spilling sort data of 416.0 MB to disk (371 times so far)
21/08/16 10:23:14 INFO UnsafeExternalSorter: Thread 79 spilling sort data of 416.0 MB to disk (130 times so far)
21/08/16 10:23:14 INFO UnsafeExternalSorter: Thread 53 spilling sort data of 400.0 MB to disk (240 times so far)
21/08/16 10:23:14 INFO UnsafeExternalSorter: Thread 69 spilling sort data of 400.0 MB to disk (24 times so far)
21/08/16 10:23:16 INFO UnsafeExternalSorter: Thread 54 spilling sort data of 416.0 MB to disk (372 times so far)
21/08/16 10:23:16 INFO UnsafeExternalSorter: Thread 79 spilling sort data of 416.0 MB to disk (131 times so far)
However, when I check the Spark UI, the spill doesn't seem to be that much.
Eventually the Spark job terminates with a "not enough memory" error.
I do not understand what is happening.
You are using 400 for spark.sql.shuffle.partitions, which is too high for the amount of data you are dealing with.
Having more shuffle partitions for a smaller amount of data creates more partitions/tasks than necessary and reduces performance. Read the best practices for configuring shuffle partitions here.
Try reducing the shuffle partitions; you can set it to spark.sparkContext.defaultParallelism, as in the sketch below.
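
A small sketch, assuming a local SparkSession like the one implied in the question (the app name is made up); the point is simply to lower spark.sql.shuffle.partitions before running the groupBy/sort so it does not fan out into 400 tiny, spill-prone tasks:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[4]")
  .appName("noaa-gsod") // hypothetical app name
  .getOrCreate()

// Match shuffle parallelism to the available cores instead of the oversized 400.
spark.conf.set("spark.sql.shuffle.partitions",
  spark.sparkContext.defaultParallelism.toString)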

What is the expected behavior for in-memory data-structures in Spark executors?

I want to understand whether I am expecting the following behavior correctly.
Let's say I have 100 executors, each with 4 cores (meaning threads).
I am processing a very large RDD, and the rows inside contain a some_class that could be invalid; if it is, I don't want to process that row.
I don't want to use a broadcast, since the invalid rows are only determined to be invalid on the fly (during the RDD processing).
I thought of using an in-memory set; in the worst-case scenario, each executor will process a "bad" row once, and I am OK with that.
Am I expecting the behavior correctly, or am I missing something?
import scala.collection.mutable

val some_set = mutable.HashSet[String]()

some_rdd
  .filterNot(r => some_set.contains(r.some_class.id))
  .map(some_row => {
    try {
      some_def(some_row)
    } catch {
      case e: Throwable =>
        some_set.add(some_row.some_class.id)
        log.info("some error")
    }
  })
In your example, some_set will be serialized and sent to the executors together with the task code. Considering the case in which some_set holds 10,000 entries, the task size of your Spark program will be approximately 200 KB (10,000 x 20 chars). That stays within the currently recommended maximum task size of about 1 MB. If the task size grows far beyond that limit, you should expect a warning similar to:
Stage 1 contains a task of very large size (1024 MB). The maximum
recommended task size is 1000 KB.
If for some reason the size of some_set grows beyond that limit in the future, consider using a broadcast variable instead, as sketched below.
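
A hedged sketch of the broadcast alternative, reusing the hypothetical names from the question (some_rdd, some_class, some_def) and a made-up helper collectKnownBadIds for the case where the bad ids are known up front:

val badIds: Set[String] = collectKnownBadIds() // hypothetical helper producing the known-bad ids
val badIdsBc = sc.broadcast(badIds)            // shipped once per executor, not once per task

val processed = some_rdd
  .filter(r => !badIdsBc.value.contains(r.some_class.id))
  .map(some_row => some_def(some_row))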
Similar questions
How to resolve : Very large size tasks in spark
Spark using python: How to resolve Stage x contains a task of very large size (xxx KB). The maximum recommended task size is 100 KB

Spark Container running beyond physical limits

I've been searching a lot for a solution for the following issue. I'm using Scala 2.11.8 and Spark 2.1.0.
Application application_1489191400413_3294 failed 1 times due to AM Container for appattempt_1489191400413_3294_000001 exited with exitCode: -104
For more detailed output, check application tracking page:http://ip-172-31-17-35.us-west-2.compute.internal:8088/cluster/app/application_1489191400413_3294Then, click on links to logs of each attempt.
Diagnostics: Container [pid=23372,containerID=container_1489191400413_3294_01_000001] is running beyond physical memory limits.
Current usage: 1.4 GB of 1.4 GB physical memory used; 3.5 GB of 6.9 GB virtual memory used. Killing container.
Note that I've allotted a lot more than the 1.4 GB being reported in the error here. Since I see none of my executors failing, my read of this error was that the driver needs more memory. However, my settings don't seem to be propagating.
I'm setting the job parameters for YARN as follows:
val conf = new SparkConf()
.setAppName(jobName)
.set("spark.hadoop.mapred.output.committer.class", "com.company.path.DirectOutputCommitter")
additionalSparkConfSettings.foreach { case (key, value) => conf.set(key, value) }
// this is the implicit that we pass around
implicit val sparkSession = SparkSession
.builder()
.appName(jobName)
.config(conf)
.getOrCreate()
where the memory provisioning parameters in additionalSparkConfSettings were set with the following snippet:
HashMap[String, String](
"spark.driver.memory" -> "8g",
"spark.executor.memory" -> "8g",
"spark.executor.cores" -> "5",
"spark.driver.cores" -> "2",
"spark.yarn.maxAppAttempts" -> "1",
"spark.yarn.driver.memoryOverhead" -> "8192",
"spark.yarn.executor.memoryOverhead" -> "2048"
)
Are my settings really not propagating? Or am I misinterpreting the logs?
Thanks!
The overhead memory needs to be set up for both the executor and the driver, and it should be a fraction of the executor and driver memory:
spark.yarn.executor.memoryOverhead = executorMemory * 0.10, with minimum of 384
The amount of off-heap memory (in megabytes) to be allocated per
executor. This is memory that accounts for things like VM overheads,
interned strings, other native overheads, etc. This tends to grow with
the executor size (typically 6-10%).
spark.yarn.driver.memoryOverhead = driverMemory * 0.10, with minimum of 384.
The amount of off-heap memory (in megabytes) to be allocated per
driver in cluster mode. This is memory that accounts for things like
VM overheads, interned strings, other native overheads, etc. This
tends to grow with the container size (typically 6-10%).
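
As a quick arithmetic check against the question's settings (8g driver and executor memory), the default rule of thumb works out roughly as follows; the explicit overrides of 8192 MB and 2048 MB in the question are well above these minimums:

// Default YARN overhead rule of thumb: max(memoryMB * 0.10, 384)
def defaultOverheadMB(memoryMB: Int): Int = math.max((memoryMB * 0.10).toInt, 384)

val executorOverheadMB = defaultOverheadMB(8192) // ~819 MB for an 8g executor
val driverOverheadMB   = defaultOverheadMB(8192) // ~819 MB for an 8g driver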
To learn more about memory optimizations, please see the Memory Management Overview.
Also see the following SO thread: Container is running beyond memory limits.
Cheers!
The problem in my case was a simple, albeit easy-to-miss, one.
Setting driver-level parameters inside the code does not work, because by then it is apparently already too late and the configuration is ignored. I confirmed this with a few tests when I solved it months ago.
Executor parameters, however, can be set in code. Keep parameter precedence rules in mind if you end up setting the same parameter in different places. A sketch of that split is below.
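
A hedged sketch of that split (not the original job's exact setup; the class and jar names are made up): driver-side memory settings go on the spark-submit command line because they must be known before the driver JVM starts, while executor-side settings can still be applied through SparkConf in code.

// Driver-side settings at submit time, e.g.:
//   spark-submit \
//     --driver-memory 8g \
//     --conf spark.yarn.driver.memoryOverhead=8192 \
//     --class com.company.path.MainClass app.jar   // hypothetical class and jar names
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Executor-side settings can still be set in code before the session is created.
val conf = new SparkConf()
  .setAppName("jobName")
  .set("spark.executor.memory", "8g")
  .set("spark.executor.cores", "5")
  .set("spark.yarn.executor.memoryOverhead", "2048")

val sparkSession = SparkSession.builder().config(conf).getOrCreate()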

increase task size spark [duplicate]

This question already has answers here:
Spark using python: How to resolve Stage x contains a task of very large size (xxx KB). The maximum recommended task size is 100 KB
(3 answers)
Closed 2 years ago.
I ran into a problem when executing my code in spark-shell.
[Stage 1:> (0 + 0) / 16]
17/01/13 06:09:24 WARN TaskSetManager: Stage 1 contains a task of very large size (1057 KB). The maximum recommended task size is 100 KB.
[Stage 1:> (0 + 4) / 16]
After this warning the execution blocks.
How can I solve it?
I tried this, but it doesn't solve the problem:
val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("local[*]")
  .set("spark.driver.maxResultSize", "3g")
  .set("spark.executor.memory", "3g")
val sc = new SparkContext(conf)
I had a similar error:
scheduler.TaskSetManager: Stage 2 contains a task of very large size
(34564 KB). The maximum recommended task size is 100 KB
My input data was ~150 MB with 4 partitions (i.e., each partition was ~30 MB). That explains the 34564 KB task size mentioned in the error message above.
Reason:
A task is the smallest unit of work in Spark and acts on a partition of your input data. Hence, if Spark says that a task's size is larger than the recommended size, it means the partition it is handling has too much data.
Solution that worked for me:
Reduce the task size => reduce the data it handles => increase
numPartitions to break the data into smaller chunks.
So, I increased the number of partitions and got rid of the error.
You can check the number of partitions in a dataframe via df.rdd.getNumPartitions.
To increase the partitions: df.repartition(100). A short sketch follows.
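
A minimal sketch of that check-and-repartition step; df stands in for whatever DataFrame is feeding the oversized tasks, and 100 is just an example value to tune:

// Inspect the current partitioning.
val currentPartitions = df.rdd.getNumPartitions
println(s"current partitions: $currentPartitions")

// Break the data into more, smaller partitions so each task carries less data.
val repartitioned = df.repartition(100)
repartitioned.rdd.getNumPartitions // should now report 100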
It's most likely because of the large memory requirements of the variables in one of your tasks.
The accepted answer to this question should help you.

Spark 1.5.2 Shuffle/Serialization - running out of memory

I am working with a several-hundred-GB dataset (around 2B rows). One of the operations is to reduce the RDD of Scala case objects (containing doubles, maps, sets) into a single entity per key. Initially the operation was performing a groupByKey, but it was slow and doing heavy GC, so I tried converting it to aggregateByKey and later even to reduceByKey, hoping to avoid the high user memory allocation, shuffle activity, and GC issues I was encountering with groupBy.
Application resources: 23 GB executor memory + 4 GB overhead, 20 instances with 6 cores each. Played with the shuffle fraction from 0.2 to 0.4.
Available cluster resources: 10 nodes, 600 GB total for YARN, 32 GB max container size.
2016-05-02 22:38:53,595 INFO [sparkDriver-akka.actor.default-dispatcher-14] org.apache.spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to hdn2.mycorp:45993
2016-05-02 22:38:53,832 INFO [sparkDriver-akka.actor.default-dispatcher-14] org.apache.spark.storage.BlockManagerInfo: Removed broadcast_4_piece0 on 10.250.70.117:52328 in memory (size: 2.1 KB, free: 15.5 MB)
2016-05-02 22:39:03,704 WARN [New I/O worker #5] org.jboss.netty.channel.DefaultChannelPipeline: An exception was thrown by a user handler while handling an exception event ([id: 0xa8147f0c, /10.250.70.110:48056 => /10.250.70.117:38300] EXCEPTION: java.lang.OutOfMemoryError: Java heap space)
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
at org.jboss.netty.buffer.CompositeChannelBuffer.toByteBuffer(CompositeChannelBuffer.java:649)
at org.jboss.netty.buffer.AbstractChannelBuffer.toByteBuffer(AbstractChannelBuffer.java:530)
at org.jboss.netty.channel.socket.nio.SocketSendBufferPool.acquire(SocketSendBufferPool.java:77)
at org.jboss.netty.channel.socket.nio.SocketSendBufferPool.acquire(SocketSendBufferPool.java:46)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.write0(AbstractNioWorker.java:194)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.writeFromTaskLoop(AbstractNioWorker.java:152)
at org.jboss.netty.channel.socket.nio.AbstractNioChannel$WriteTask.run(AbstractNioChannel.java:335)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:366)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:290)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
2016-05-02 22:39:05,783 ERROR [sparkDriver-akka.actor.default-dispatcher-14] org.apache.spark.rpc.akka.ErrorMonitor: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply$mcV$sp(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129)
at akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:36)
at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.remote.EndpointWriter.serializeMessage(Endpoint.scala:842)
at akka.remote.EndpointWriter.writeSend(Endpoint.scala:743)
at akka.remote.EndpointWriter$$anonfun$2.applyOrElse(Endpoint.scala:718)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:411)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2016-05-02 22:39:05,783 ERROR [sparkDriver-akka.actor.default-dispatcher-2] akka.actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply$mcV$sp(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
About the job
Read an input dataset having around 20 fields and 1B-2B rows. Create an output dataset aggregating over 10 unique fields, which basically become the query criteria. Out of those 10, 3 fields represent various combinations of their values, so that we don't have to query multiple records to get a set. Out of those 3 fields, let's say a, b, and c have 11, 2, and 2 possible values respectively, so we could get at most (2^11 - 1) * (2^2 - 1) * (2^2 - 1) combinations for a given key.
// pseudo code where I use aggregateByKey
case class UserDataSet(salary: Double, members: Int, clicks: Map[Int, Long],
                       businesses: Map[Int, Set[Int]], ...) // about 10 fields, 5 of them maps

def main() = {
  // create a combinationRDD of type (String, Set[Set]) from the input dataset, representing all combinations
  // create a joinedRdd of type (String, UserDataSet) - the key at this point is already the final key
  //   containing the 10 unique fields; the value is a UserDataSet

  // this is where things fail
  val finalDataSet = joinedRdd.aggregateByKey(UserDataSet.getInstance())(processDataSeq, processDataMerge)
}

private def processDataMerge(map1: UserDataSet, map2: UserDataSet) = {
  map1.clicks ++= map2.clicks // deep merge, of course, to avoid overwriting map keys
  map1.salary += map2.salary
  map1
}
So the issue was indeed the driver running out of memory, not an executor; hence the error was in the driver logs (duh), although it wasn't very clear from the logs. The driver ran out because 1) it was using the default of -Xmx900m, and 2) the Spark driver relies on the Akka libs, and Akka relies on the stubborn JavaSerializer, which uses a byte array instead of a stream to serialize objects. As a temporary solution I increased spark.driver.memory to 4096m in my case, and I haven't seen a memory error since. Thanks everyone for the insights into the problem space, though.
To be able to help, you should post the code and also give an explanation of the input data.
Why the data?
When aggregating by key, to achieve optimal parallelism and avoid issues, it's important to have an idea of what the key distribution looks like and also the cardinality.
Let me explain what they are and why they are important.
Let's say you're aggregating by country... there are about 250 countries on Earth, so the cardinality of the key is around 250.
Cardinality is important because low cardinality may stifle your parallelism. For instance, if 90% of your data is for the US, and you have 250 nodes, one node will be processing 90% of the data.
That leads to the concept of distribution, that is, when you're grouping by key, how many values you have per key is your value distribution. For optimal parallelism, you ideally want roughly the same number of values for every key.
Now, if the cardinality of your data is very high, but the value distribution is not optimal, statistically things should even out.
For instance, let's say you have Apache logs, where most users only visit a few pages, but some visit many (as is the case with robots).
If the number of users is much greater than the number of your nodes, the users with lots of data get distributed around the nodes so parallelism is not that impacted.
Problems usually arise when you use keys with low cardinality.
If the distribution of the values is not good, it causes issues, not unlike an unbalanced washing machine. A quick way to check this is sketched below.
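
A small sketch of that check, assuming a pair RDD shaped like the question's joinedRdd (RDD[(String, UserDataSet)]); it samples the data to get a rough view of key cardinality and skew before committing to an aggregation strategy:

// Rough, sampled view of key cardinality and skew.
val sampledKeyCounts = joinedRdd
  .sample(withReplacement = false, fraction = 0.01) // avoid scanning billions of rows just to profile
  .map { case (key, _) => (key, 1L) }
  .reduceByKey(_ + _)

val approxCardinality = sampledKeyCounts.count()               // distinct keys seen in the sample
val heaviestKeys = sampledKeyCounts.top(10)(Ordering.by(_._2)) // most skewed keys in the sample

println(s"distinct keys in sample: $approxCardinality")
heaviestKeys.foreach { case (key, count) => println(s"$key -> $count") }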
Last but not least, it also depends greatly on what you're doing in the aggregateByKey. You can exhaust memory easily if you're leaking objects in either the map or reduce phase of the processing; a sketch of a merge that reuses its accumulators follows.
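
As a hedged illustration (the accumulator below is a simplified stand-in for the question's UserDataSet, keeping only a few of its fields), here is an aggregateByKey whose seqOp and combOp mutate a single per-key accumulator rather than allocating new objects on every record:

import scala.collection.mutable

// Simplified, mutable accumulator so seqOp/combOp can update in place.
case class UserAgg(var salary: Double = 0.0,
                   var members: Int = 0,
                   clicks: mutable.Map[Int, Long] = mutable.Map.empty)

def seqOp(acc: UserAgg, row: UserDataSet): UserAgg = {
  acc.salary += row.salary
  acc.members += row.members
  row.clicks.foreach { case (k, v) => acc.clicks(k) = acc.clicks.getOrElse(k, 0L) + v }
  acc
}

def combOp(a: UserAgg, b: UserAgg): UserAgg = {
  a.salary += b.salary
  a.members += b.members
  b.clicks.foreach { case (k, v) => a.clicks(k) = a.clicks.getOrElse(k, 0L) + v }
  a
}

// joinedRdd: RDD[(String, UserDataSet)] as in the question's pseudo code.
val finalDataSet = joinedRdd.aggregateByKey(UserAgg())(seqOp, combOp)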