Spark doesn't let me count my joined dataframes - Scala

I'm new to Spark jobs and I have the following problem.
When I run a count on either of the newly joined dataframes, the job runs for ages and spills memory to disk. Is there a logic error in here?
// pass spark configuration
val conf = new SparkConf()
  .setMaster(threadMaster)
  .setAppName(appName)
// Create a new spark context
val sc = new SparkContext(conf)
// Specify a SQL context and pass in the spark context we created
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create three dataframes for sent, clicked and failed files. Mark them as raw, since they will be renamed
val dfSentRaw = sqlContext.read.parquet(inputPathSent)
val dfClickedRaw = sqlContext.read.parquet(inputPathClicked)
val dfFailedRaw = sqlContext.read.parquet(inputPathFailed)
// Rename the columns to avoid ambiguity when accessing the fields later
val dfSent = dfSentRaw.withColumnRenamed("customer_id", "sent__customer_id")
  .withColumnRenamed("campaign_id", "sent__campaign_id")
  .withColumnRenamed("ced_email", "sent__ced_email")
  .withColumnRenamed("event_captured_dt", "sent__event_captured_dt")
  .withColumnRenamed("riid", "sent__riid")
val dfClicked = dfClickedRaw.withColumnRenamed("customer_id", "clicked__customer_id")
  .withColumnRenamed("event_captured_dt", "clicked__event_captured_dt")
val dfFailed = dfFailedRaw.withColumnRenamed("customer_id", "failed__customer_id")
// LEFT join with CLICKED on two fields, customer_id and campaign_id
val dfSentClicked = dfSent.join(dfClicked,
  dfSent("sent__customer_id") === dfClicked("clicked__customer_id")
    && dfSent("sent__campaign_id") === dfClicked("campaign_id"), "left")
dfSentClicked.count() // THIS WILL NOT WORK
val dfJoined = dfSentClicked.join(dfFailed,
  dfSentClicked("sent__customer_id") === dfFailed("failed__customer_id")
    && dfSentClicked("sent__campaign_id") === dfFailed("campaign_id"), "left")
Why can't these two/three dataframes be counted anymore? Did I mess up some indexing by renaming?
Thank you!

That count call is the only actual materialization of your Spark job here, so it's not really count that is the problem but the shuffle being done for the join right before it. You don't have enough memory to do the join without spilling to disk. Spilling to disk during a shuffle is a very easy way to make your Spark jobs take forever =).
One thing that really helps prevent spilling during shuffles is having more partitions, so that less data moves through the shuffle at any given time. You can set spark.sql.shuffle.partitions, which controls the number of partitions used by Spark SQL in a join or aggregation. It defaults to 200, so you can try a higher setting. http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
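For example (an illustrative sketch; 800 is just a made-up starting point, not a recommendation from this answer), the setting can be raised on the SQLContext you already have before running the joins:
// Hypothetical tuning: more, smaller shuffle partitions for the two joins
sqlContext.setConf("spark.sql.shuffle.partitions", "800")
// ...then build dfSentClicked / dfJoined as before and materialize them
dfSentClicked.count()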
You could also increase the heap size of your local Spark allocation and/or increase the fraction of memory usable for shuffles by increasing spark.shuffle.memoryFraction (defaults to 0.2) and decreasing spark.storage.memoryFraction (defaults to 0.6). The storage fraction is used, for example, when you make a .cache call, and you might not care about that here.
If you want to avoid the spills outright, you can turn off spilling by setting spark.shuffle.spill to false. I believe this will throw an exception if you run out of memory and need to spill, instead of silently taking forever, and could help you tune your memory allocation faster.
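As a rough sketch of where those knobs would go (the values here are made up and only meant to show the mechanics; tune them for your own workload), they are set on the SparkConf before the SparkContext is created:
// Illustrative values only; not recommendations
val conf = new SparkConf()
  .setMaster(threadMaster)
  .setAppName(appName)
  .set("spark.shuffle.memoryFraction", "0.5")  // give more memory to shuffles
  .set("spark.storage.memoryFraction", "0.3")  // taken from the storage/cache pool
  .set("spark.shuffle.spill", "false")         // optional: fail fast instead of spilling
val sc = new SparkContext(conf)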

Related

Spark streaming slows down when using large broadcast objects in UDF

I do stream processing from Event Hub using Spark and I am facing the following problem. For each incoming message I need to do some calculations (stateless). The calculation algorithm is written in Scala and is extremely efficient, but it needs some data structures constructed in advance. The object size is about 50 MB, but in the future it could be larger. In order not to send the object to the workers each time, I broadcast it and then register a UDF. But it doesn't help: batch duration grows significantly beyond the latency we can tolerate. I figured out that batch duration depends solely on the object size. For testing purposes I tried making the object smaller while keeping the computational complexity the same, and the batch duration decreased. Also, when the object is large, the Spark UI marks GC in red (more than 10% of the work is spent on garbage collection). This contradicts my understanding of broadcasting: when an object is broadcast, it should be downloaded into the workers' memory once and kept there without additional overhead.
I managed to write a business-domain-agnostic example. Here, when n is small, batch duration is about 0.3 seconds, but when n = 6000 (144 MB) the batch duration becomes 1.5 seconds (5x), and 4 seconds when n = 10000. Yet the computational complexity doesn't depend on the size of the object, which means that using the broadcast object has a huge overhead. Please help me find a solution.
// emulate a large precalculated object
val n = 10000
val obj = (1 to n).map(i => (1 to n).toArray).toArray

// broadcast it to the workers (should reduce overhead during execution)
val objBd = sc.broadcast(obj)

// register a UDF that looks into the broadcast object
val myUdf = spark.udf.register("myUdf", (num: Int) => {
  // emulate a very efficient algorithm that requires the large data structure
  val i = (num + 1) / (num + 1)
  objBd.value(i)(i)
})

// do stream processing
import spark.implicits._ // for the $"value" column syntax
spark.readStream
  .format("rate")
  .option("rowsPerSecond", 300)
  .load()
  .withColumn("result", myUdf($"value"))
  .writeStream
  .format("memory")
  .queryName("locations")
  .start()

Why does Spark repeat transformations after persist operations?

I have the following code. I call count to trigger the persist operation and pin the transformations above it. But I noticed that the DAGs and stages for the two different count jobs call the first persist twice (whereas I expected the second persist to be the one used by the second count call).
val df = sparkSession.read
  .parquet(bigData)
  .filter(row => dateRange(row.getLong(5), lowerTimeBound, upperTimeBound))
  .as[SafegraphRawData]
  // repartition here to be able to perform shuffle operations later
  // (plus other transformations and minor filtering)
  .repartition(nrInputPartitions)
  // first persist here, since the objects do not fit in memory (Persist 67)
  .persist(StorageLevel.MEMORY_AND_DISK)
LOG.info(s"First count = ${df.count}")

val filter: BaseFilter = new BaseFilter()
LOG.info(s"Number of partitions: ${df.rdd.getNumPartitions}")

val rddPoints = df
  .map(parse)
  .filter(filter.IsValid(_, deviceStageMetricService, providerdevicelist, sparkSession))
  .map(convert)

// since we will perform count and partitionBy actions, persist here as well (second persist)
val dsPoints = rddPoints.persist(StorageLevel.MEMORY_AND_DISK)
val totalPoints = dsPoints.count()
LOG.info(s"Second count = $totalPoints")
When you specify StorageLevel.MEMORY_AND_DISK, Spark tries to fit all the data into memory, and if it doesn't fit it spills to disk.
Now, you are doing multiple persists here. In Spark the memory cache is LRU, so later persists can evict data that was cached earlier.
Even if you specify StorageLevel.MEMORY_AND_DISK, when data is evicted from the memory cache by other cached data, Spark doesn't spill it to disk. So when you run the next count, it needs to re-evaluate the DAG so that it can recompute the partitions which are no longer present in the cache.
I would suggest using StorageLevel.DISK_ONLY to avoid such re-computation.
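For example, a minimal sketch of that suggestion (reusing the names from the question; only the storage level changes):
import org.apache.spark.storage.StorageLevel

// Cache on disk only, so a partition can never be evicted from memory
// and force the DAG to be recomputed
val dfDisk = sparkSession.read
  .parquet(bigData)
  .repartition(nrInputPartitions)
  .persist(StorageLevel.DISK_ONLY)
LOG.info(s"First count = ${dfDisk.count}")  // materializes the disk cache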
Here is the whole scenario.
persist and cache are lazy, like transformations in Spark: after calling either of them, you need to run an action in order to actually cache an RDD or DataFrame in memory.
Secondly, the unit of cache or persist is the partition. When cache or persist is executed, it saves only those partitions that can be held in memory. For the remaining partitions that could not be kept in memory, the whole DAG will be executed again whenever a new action is encountered.
Try this:
val df = sparkSession.read
  .parquet(bigData)
  .filter(row => dateRange(row.getLong(5), lowerTimeBound, upperTimeBound))
  .as[SafegraphRawData]
  // repartition here to be able to perform shuffle operations later
  // (plus other transformations and minor filtering)
  .repartition(nrInputPartitions)

// first persist here, since the objects do not fit in memory (Persist 67)
df.persist(StorageLevel.MEMORY_AND_DISK)
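As a side note (a small diagnostic sketch, not part of the original answer), SparkContext.getRDDStorageInfo can show how many partitions of the persisted data actually made it into the cache once an action such as the first count has run:
// Log what is actually cached after the first action
sparkSession.sparkContext.getRDDStorageInfo.foreach { info =>
  LOG.info(s"cached ${info.numCachedPartitions}/${info.numPartitions} partitions, " +
    s"mem=${info.memSize} bytes, disk=${info.diskSize} bytes")
}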

Partial unpersist of a Spark DataFrame in Scala

I am performing a join by bucketizing one of my dataframes and, given the frequent use of the dataframes, I have persisted them to cache. However, I see that the total time spent in GC has increased to more than 10% of execution time.
Is it possible in Scala to perform a partial unpersist of a dataframe, to actively prune cached data as it becomes obsolete for my use case?
Example:
srcDF.persist()
srcDF.count()
val df1 = srcDF.filter(col("bucket_id") === lit(1))
val result = df1.join(otherDF, Seq("field1"))
Given this execution, is it possible for me to do a partial unpersist() to evict everything with "bucket_id" = 1?
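As far as I know there is no partial unpersist for a single cached DataFrame, but as a rough workaround sketch (not from the original thread; the bucket ids are made up), each bucket can be cached as its own DataFrame so that it can be dropped independently:
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical workaround: cache each bucket separately instead of caching srcDF as a whole
val bucketIds = Seq(1, 2, 3)  // illustrative
val buckets = bucketIds.map(id => id -> srcDF.filter(col("bucket_id") === lit(id)).persist()).toMap

val result = buckets(1).join(otherDF, Seq("field1"))
result.count()          // materialize while bucket 1 is cached
buckets(1).unpersist()  // evicts only bucket 1; the other buckets stay cached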

Stack overflow error when loading a large table from MongoDB into Spark

all,
I have a table in MongoDB which is about 1 TB. I tried to load it into Spark using the Mongo connector, but I keep getting a stack overflow after 18 minutes of execution.
java.lang.StackOverflowError:
at scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
....
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
16/06/29 08:42:22 INFO YarnAllocator: Driver requested a total number of 54692 executor(s).
16/06/29 08:42:22 INFO YarnAllocator: Will request 46501 executor containers, each with 4 cores and 5068 MB memory including 460 MB overhead
Is it because I didn't provide enough memory? Or should I provide more storage?
I have tried adding a checkpoint, but it doesn't help.
I have changed some values in my code because they relate to my company's database, but the code as a whole is still valid for this question.
val sqlContext = new SQLContext(sc)
val builder = MongodbConfigBuilder(Map(Host -> List("mymongodurl:mymongoport"), Database -> "mymongoddb", Collection ->"mymongocollection", SamplingRatio -> 0.01, WriteConcern -> "normal"))
val readConfig = builder.build()
val mongoRDD = sqlContext.fromMongoDB(readConfig)
mongoRDD.registerTempTable("mytable")
val dataFrame = sqlContext.sql("SELECT u_at, c_at FROM mytable")
val deltaCollect = dataFrame.filter("u_at is not null and c_at is not null and u_at != c_at").rdd
val mapDelta = deltaCollect.map {
  case Row(u_at: Date, c_at: Date) => {
    if (u_at.getTime == c_at.getTime) {
      (0.toString, 0L)
    }
    else {
      val delta = (u_at.getTime - c_at.getTime) / 1000 / 60 / 60 / 24
      (delta.toString, 1L)
    }
  }
}
val reduceRet = mapDelta.reduceByKey(_+_)
val OUTPUT_PATH = s"./dump"
reduceRet.saveAsTextFile(OUTPUT_PATH)
As you know, Apache Spark does in-memory processing while executing a job, i.e. it loads the data to be worked on into memory. As per your question and comments, you have a dataset as large as 1 TB, while the memory available to Spark is around 8 GB per core, so your Spark executors will always run out of memory in this scenario.
To avoid this you can follow either of the two options below:
Change your RDD storage level to MEMORY_AND_DISK. That way Spark will not load the full data into memory; rather it will try to spill the extra data to disk. However, performance will decrease because of the traffic between memory and disk. Check out RDD persistence (there is a short sketch after these two options).
Increase your executor memory so that Spark can load even 1 TB of data fully into memory. That way performance will be good, but the infrastructure cost will increase.
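For the first option, a minimal sketch against the question's code (only the persist call is new; everything else stays unchanged):
import org.apache.spark.storage.StorageLevel

// Persist the filtered RDD with MEMORY_AND_DISK so partitions that do not fit
// in memory are spilled to local disk instead of being recomputed
val deltaCollect = dataFrame
  .filter("u_at is not null and c_at is not null and u_at != c_at")
  .rdd
  .persist(StorageLevel.MEMORY_AND_DISK)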
I added another Java option, "-Xss32m", to the Spark driver to raise the stack size of every thread, and this exception is not thrown any more. How stupid of me, I should have tried it earlier. But another problem has shown up, which I will have to investigate further. Still, great thanks for your help.

When to persist and when to unpersist RDD in Spark

Let's say I have the following:
val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK)
val dataset3 = dataset2.map(.....)
If you do a transformation on dataset2, do you then have to persist it, pass it to dataset3, and unpersist the previous one, or not?
I am trying to figure out when to persist and unpersist RDDs. With every new RDD that is created, do I have to persist it?
Thanks
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
Reference: http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
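As a small illustrative sketch of that pattern (dataset1 is the RDD from the question; the map is a placeholder):
import org.apache.spark.storage.StorageLevel

val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK)
val dataset3 = dataset2.map(x => x)  // placeholder for the real transformation

// run every action that reuses dataset2 ...
dataset3.count()

// ... and only then drop it from the cache explicitly
dataset2.unpersist()
In general you only need to persist a dataset that more than one action will reuse; persisting every newly created RDD is usually unnecessary.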