Partial unpersist of Spark DataFrame in Scala

I am performing a join by bucketizing one of my dataframes and, given the frequent use of the dataframes, I have persisted them to cache. However, I see that the total time spent in GC has increased to more than 10% of the execution time.
Is it possible in Scala to perform a partial unpersist of a dataframe, so that I can actively prune cached data as and when it becomes obsolete for my use case?
EX:
srcDF.persist()
srcDF.count()
val df1 = srcDF.filter(col("bucket_id") === lit(1))
val result = df1.join(otherDF, Seq("field1"))
Given this execution, is it possible to do a partial unpersist() that excludes everything with "bucket_id" = 1?
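As far as I know there is no partial-unpersist API on a DataFrame; the workaround usually suggested is to drop the whole cached DataFrame and re-cache only the rows that are still needed. A minimal sketch of that idea, reusing srcDF from the example above (the =!= not-equal Column operator assumes Spark 2.x):
// Sketch only: no partial unpersist exists, so drop the full cache
// and re-cache the subset that is still relevant.
val remainingDF = srcDF.filter(col("bucket_id") =!= lit(1))

srcDF.unpersist()       // releases all cached partitions of srcDF
remainingDF.persist()   // cache only the rows that are still needed
remainingDF.count()     // materialize the new cache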

Related

Spark - parallel computation for different dataframes

A premise: this question might sound idiotic, but I guess I fell into confusion and/or ignorance.
The question is: does Spark already optimize its physical plan so that computations on unrelated dataframes are executed in parallel? If not, would it be advisable to try and parallelize such processes? Example below.
Let's assume I have the following scenario:
val df1 = read table into dataframe
val df2 = read another table into dataframe
val aTransformationOnDf1 = df1.filter(condition).doSomething
val aSubSetOfTransformationOnDf1 = aTransformationOnDf1.doSomeOperations
// Push to Kafka
aSubSetOfTransformationOnDf1.toJSON.pushToKafkaTopic
val anotherTransformationOnDf1WithDf2 = df1.filter(anotherCondition).join(df2).doSomethingElse
val yetAnotherTransformationOnDf1WithDf2 = df1.filter(aThirdCondition).join(df2).doAnotherThing
val unionAllTransformation = aTransformationOnDf1
.union(anotherTransformationOnDf1WithDf2)
.union(yetAnotherTransformationOnDf1WithDf2)
unionAllTransformation.write.mode(whatever).partitionBy(partitionColumn).save(wherever)
Basically I have two initial dataframes. One is an event log with past events and new events to process. As an example:
a subset of these new events must be processed and pushed to Kafka.
a subset of the past events could have updates, so they must be processed alone
another subset of the past events could have another kind of updates, so they must be processed alone
In the end, all processed events are unified in one dataframe to be written back to the events' log table.
Question: does Spark process the different subsets in parallel, or sequentially (with only the computation within each individual dataframe performed in a distributed fashion)?
If not, could we enforce parallel computation of each individual subset before the union? I know Scala has a Future construct, though I have never used it.
Something like:
import scala.concurrent.{ExecutionContext, Future}
import scala.util.{Failure, Success}
import org.apache.spark.sql.DataFrame

implicit val ec: ExecutionContext = ExecutionContext.global

def unionAllDataframes(df1: DataFrame, df2: DataFrame, df3: DataFrame): Future[DataFrame] =
  Future { df1.union(df2).union(df3) }

// At the end
val finalDf = unionAllDataframes(
  aTransformationOnDf1,
  anotherTransformationOnDf1WithDf2,
  yetAnotherTransformationOnDf1WithDf2)

finalDf.onComplete {
  case Success(df)        => df.write.mode(whatever).partitionBy(partitionColumn).save(wherever)
  case Failure(exception) => handleException(exception)
}
Sorry for the horrendous design and the probably incorrect usage of Future. Once again, I am a bit confused by this scenario and I am trying to micro-optimize this step (if possible).
Thanks a lot in advance!
Cheers
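For what it's worth, a hedged sketch of how independent actions could be submitted concurrently from the driver. Building the union is lazy, so wrapping only the union in a Future gains little; the work happens at the actions (the Kafka push and the final write), and jobs submitted from separate threads can run concurrently, subject to cluster resources and the scheduler. The names below are the placeholders from the pseudocode above, so this is only an illustration:
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

implicit val ec: ExecutionContext = ExecutionContext.global

// Sketch only: each Future wraps one independent *action*, so the two jobs
// can be scheduled concurrently instead of one after the other.
val pushJob = Future {
  aSubSetOfTransformationOnDf1.toJSON.pushToKafkaTopic   // placeholder action from the question
}
val writeJob = Future {
  unionAllTransformation.write.mode(whatever).partitionBy(partitionColumn).save(wherever)
}

// Block until both jobs finish (or handle them with onComplete as above).
Await.result(Future.sequence(Seq(pushJob, writeJob)), Duration.Inf)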

Iterative caching vs checkpointing in Spark

I have an iterative application running on Spark that I simplified to the following code:
var anRDD: org.apache.spark.rdd.RDD[Int] = sc.parallelize((0 to 1000))
var c: Long = Int.MaxValue
var iteration: Int = 0
while (c > 0) {
  iteration += 1
  // Manipulate the RDD and cache the new RDD
  anRDD = anRDD.zipWithIndex.filter(t => t._2 % 2 == 1).map(_._1).cache() //.localCheckpoint()
  // Actually compute the RDD and spawn a new job
  c = anRDD.count()
  println(s"Iteration: $iteration, Values: $c")
}
What happens to the memory allocation across subsequent jobs?
Does the current anRDD "override" the previous ones, or are they all kept in memory? In the long run, this could throw a memory exception.
Do localCheckpoint and cache have different behaviors? If localCheckpoint is used in place of cache, then since localCheckpoint truncates the RDD lineage, I would expect the previous RDDs to be overridden.
Unfortunately, it seems that Spark is not a good fit for patterns like this.
Your original implementation is not viable because on each iteration the newer RDD holds an internal reference to the older one, so all the RDDs pile up in memory.
localCheckpoint is an approximation of what you are trying to achieve: it truncates the RDD's lineage, but you lose fault tolerance. This is clearly stated in the documentation for the method.
checkpoint is also an option. It is safe, but it would dump the data to HDFS on each iteration.
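For illustration, a sketch of how the checkpoint variant could look for the loop in the question (the checkpoint directory is just a placeholder path):
// Sketch only: reliable checkpointing instead of cache; the directory is a placeholder.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

while (c > 0) {
  iteration += 1
  anRDD = anRDD.zipWithIndex.filter(t => t._2 % 2 == 1).map(_._1)
  anRDD.checkpoint()   // the data is written to the checkpoint dir and the lineage is cut
  c = anRDD.count()    // the action triggers the job; the checkpoint is materialized right after it
  println(s"Iteration: $iteration, Values: $c")
}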
Consider redesigning the approach. Such hacks could bite sooner or later.
RDDs are immutable, so each transformation will return a new RDD.
All the anRDD instances will be kept in memory. If you run two iterations of your code, the id is different for each cached RDD.
So yes, in the long run this can throw a memory exception, and you should unpersist an RDD once you are done processing it.
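For illustration, a sketch of that unpersist pattern applied to the loop in the question, keeping a handle on the previous RDD:
// Sketch only: release the previous iteration's cache once the new RDD is materialized.
while (c > 0) {
  iteration += 1
  val previous = anRDD
  anRDD = anRDD.zipWithIndex.filter(t => t._2 % 2 == 1).map(_._1).cache()
  c = anRDD.count()      // materializes the new cache
  previous.unpersist()   // the old cached data is no longer needed
  println(s"Iteration: $iteration, Values: $c")
}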
localCheckpoint has a different use case than cache: it is used to truncate the lineage of an RDD. It does not write the RDD out to reliable storage; it improves performance but decreases fault tolerance in return.

Why does Spark repeat transformations after persist operations?

I have the following code. I call count to trigger the persist operation and materialize the transformations above it. But I noticed in the DAG and stages for the two different count jobs that the first persist is hit twice, whereas I expected the second persist to be used by the second count call.
val df = sparkSession.read
  .parquet(bigData)
  .filter(row => dateRange(row.getLong(5), lowerTimeBound, upperTimeBound))
  .as[SafegraphRawData]
  // Repartition here to be able to perform shuffle operations later
  // (other transformations and minor filtering)
  .repartition(nrInputPartitions)
  // First persist here since the objects do not fit in memory (Persist 67)
  .persist(StorageLevel.MEMORY_AND_DISK)

LOG.info(s"First count = ${df.count}")

val filter: BaseFilter = new BaseFilter()
LOG.info(s"Number of partitions: ${df.rdd.getNumPartitions}")

val rddPoints = df
  .map(parse)
  .filter(filter.IsValid(_, deviceStageMetricService, providerdevicelist, sparkSession))
  .map(convert)

// Since we will perform count and partitionBy actions, compute all of the transformations above (second persist)
val dsPoints = rddPoints.persist(StorageLevel.MEMORY_AND_DISK)
val totalPoints = dsPoints.count()
LOG.info(s"Second count = $totalPoints")
When you specify StorageLevel.MEMORY_AND_DISK, Spark tries to fit all the data into memory, and whatever does not fit is spilled to disk.
You are doing multiple persists here. In Spark the memory cache is LRU, so later persists will overwrite previously cached data.
Even if you specify StorageLevel.MEMORY_AND_DISK, when data is evicted from cache memory by other cached data, Spark does not spill it to disk. So when you run the next count it needs to re-evaluate the DAG so that it can recompute the partitions that are no longer present in the cache.
I would suggest using StorageLevel.DISK_ONLY to avoid such re-computation.
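For illustration, a minimal sketch of what that suggestion changes in the snippet from the question (only the storage level of the two persist calls differs):
import org.apache.spark.storage.StorageLevel

// Sketch only: cache to disk so evictions from the memory cache cannot force
// a recomputation of the upstream DAG.
df.persist(StorageLevel.DISK_ONLY)
LOG.info(s"First count = ${df.count}")   // materializes the on-disk cache

val dsPoints = rddPoints.persist(StorageLevel.DISK_ONLY)
val totalPoints = dsPoints.count()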
Here is the whole scenario.
persist and cache are lazy, like transformations, in Spark: after applying either of them, you still need to run an action in order to actually cache an RDD or DataFrame in memory.
Secondly, the unit of cache or persist is the partition. When cache or persist executes, it saves only those partitions that can be held in memory. For the remaining partitions that could not be kept in memory, the whole DAG is executed again whenever a new action is encountered.
try
val df = sparkSession.read
  .parquet(bigData)
  .filter(row => dateRange(row.getLong(5), lowerTimeBound, upperTimeBound))
  .as[SafegraphRawData]
  // Repartition here to be able to perform shuffle operations later
  // (other transformations and minor filtering)
  .repartition(nrInputPartitions)

// First persist here since the objects do not fit in memory (Persist 67)
df.persist(StorageLevel.MEMORY_AND_DISK)
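Following the point above that persist is lazy, a hedged sketch of what would come after that snippet: an action to actually populate the cache, and a release once the cached data is no longer needed:
// Sketch only: an action is still required to fill the cache,
// and the cache can be released explicitly once it is no longer needed.
LOG.info(s"First count = ${df.count}")   // fills the MEMORY_AND_DISK cache

// ... the rest of the pipeline (parse/filter/convert, second persist, second count) ...

df.unpersist()   // free the cached partitions when they are no longer needed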

Coalesce reduces parallelism of entire stage (spark)

Sometimes Spark "optimizes" a dataframe plan in an inefficient way. Consider the following example in Spark 2.1 (can also be reproduced in Spark 1.6):
import org.apache.spark.sql.functions.udf
import spark.implicits._   // assumes a SparkSession named spark; needed for toDF and $"value"

val df = sparkContext.parallelize((1 to 500).map(i => scala.util.Random.nextDouble), 100).toDF("value")
val expensiveUDF = udf((d: Double) => { Thread.sleep(100); d })
val df_result = df
  .withColumn("udfResult", expensiveUDF($"value"))
df_result
  .coalesce(1)
  .write
  .saveAsTable(tablename)
In this example I want to write 1 file after an expensive transformation of a dataframe (this is just an example to demonstrate the issue). Spark moves the coalesce(1) up such that the UDF is only applied to a dataframe containing 1 partition, thus destroying parallelism (interestingly repartition(1) does not behave this way).
To generalize, this behavior occurs when I want to increase parallelism in a certain part of my transformation, but decrease parallelism thereafter.
I've found one workaround which consists of caching the dataframe and then triggering the complete evaluation of the dataframe:
val df = sparkContext.parallelize((1 to 500).map(i => scala.util.Random.nextDouble), 100).toDF("value")
val expensiveUDF = udf((d: Double) => { Thread.sleep(100); d })
val df_result = df
  .withColumn("udfResult", expensiveUDF($"value"))
  .cache
df_result.rdd.count // trigger computation
df_result
  .coalesce(1)
  .write
  .saveAsTable(tablename)
My question is: is there another way to tell Spark not to decrease parallelism in such cases?
Actually it is not because of Spark SQL's optimization; Spark SQL does not change the position of the Coalesce operator, as the executed plan shows:
Coalesce 1
+- *Project [value#2, UDF(value#2) AS udfResult#11]
   +- *SerializeFromObject [input[0, double, false] AS value#2]
      +- Scan ExternalRDDScan[obj#1]
I quote a paragraph from the coalesce API's description (note: this paragraph was added by the JIRA ticket SPARK-19399, so it is not found in the 2.0 API docs):
However, if you're doing a drastic coalesce, e.g. to numPartitions =
1, this may result in your computation taking place on fewer nodes
than you like (e.g. one node in the case of numPartitions = 1). To
avoid this, you can call repartition. This will add a shuffle step,
but means the current upstream partitions will be executed in parallel
(per whatever the current partitioning is).
The coalesce API does not perform a shuffle, but results in a narrow dependency between the previous RDD and the current RDD. Since RDDs are lazily evaluated, the computation actually runs on the coalesced partitions.
To prevent this, you should use the repartition API instead.
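For illustration, a sketch of that alternative applied to the example from the question (same names as above):
// Sketch only: repartition(1) adds a shuffle boundary, so the expensive UDF still
// runs across the original 100 partitions and only the shuffled result is written
// out as a single partition.
df_result
  .repartition(1)
  .write
  .saveAsTable(tablename)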

Spark doesn't let me count my joined dataframes

I am new to Spark jobs and I have the following problem.
When I run a count on any of the newly joined dataframes, the job runs for ages and spills memory to disk. Is there any logic error in here?
// pass spark configuration
val conf = new SparkConf()
  .setMaster(threadMaster)
  .setAppName(appName)

// Create a new spark context
val sc = new SparkContext(conf)

// Specify a SQL context and pass in the spark context we created
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create three dataframes for sent and clicked files. Mark them as raw, since they will be renamed
val dfSentRaw = sqlContext.read.parquet(inputPathSent)
val dfClickedRaw = sqlContext.read.parquet(inputPathClicked)
val dfFailedRaw = sqlContext.read.parquet(inputPathFailed)

// Rename the columns to avoid ambiguity when accessing the fields later
val dfSent = dfSentRaw
  .withColumnRenamed("customer_id", "sent__customer_id")
  .withColumnRenamed("campaign_id", "sent__campaign_id")
  .withColumnRenamed("ced_email", "sent__ced_email")
  .withColumnRenamed("event_captured_dt", "sent__event_captured_dt")
  .withColumnRenamed("riid", "sent__riid")

val dfClicked = dfClickedRaw
  .withColumnRenamed("customer_id", "clicked__customer_id")
  .withColumnRenamed("event_captured_dt", "clicked__event_captured_dt")

val dfFailed = dfFailedRaw.withColumnRenamed("customer_id", "failed__customer_id")

// LEFT Join with CLICKED on two fields, customer_id and campaign_id
val dfSentClicked = dfSent.join(dfClicked,
  dfSent("sent__customer_id") === dfClicked("clicked__customer_id")
    && dfSent("sent__campaign_id") === dfClicked("campaign_id"), "left")
dfSentClicked.count() // THIS WILL NOT WORK

val dfJoined = dfSentClicked.join(dfFailed,
  dfSentClicked("sent__customer_id") === dfFailed("failed__customer_id")
    && dfSentClicked("sent__campaign_id") === dfFailed("campaign_id"), "left")
Why can't these two/three dataframes be counted anymore? Did I mess up some indexing by renaming?
Thank you!
That count call is the only actual materialization of your Spark job here, so it is not really count that is the problem but the shuffle being done for the join right before it. You don't have enough memory to do the join without spilling to disk, and spilling to disk in a shuffle is a very easy way to make your Spark jobs take forever =).
One thing that really helps prevent spilling in shuffles is having more partitions, so that less data moves through the shuffle at any given time. You can set spark.sql.shuffle.partitions, which controls the number of partitions used by Spark SQL in a join or aggregation. It defaults to 200, so you can try a higher setting. http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
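For illustration, a sketch of how that setting could be raised for this job (the value 800 below is an arbitrary example, not a recommendation):
// Sketch only: more shuffle partitions means smaller shuffle blocks per task.
sqlContext.setConf("spark.sql.shuffle.partitions", "800")

// Alternatively, set it on the SparkConf before creating the context:
// val conf = new SparkConf().set("spark.sql.shuffle.partitions", "800")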
You could increase the heap size of your local Spark allocation and/or increase the fraction of memory usable for shuffles by increasing spark.shuffle.memoryFraction (defaults to 0.4) and decreasing spark.storage.memoryFraction (defaults to 0.6). The storage fraction is used, for example, when you make a .cache call, and you might not care about that.
If you are inclined to avoid the spills outright, you can turn off spilling by setting spark.shuffle.spill to false. I believe this will throw an exception if you run out of memory and need to spill, instead of silently taking forever, and it could help you tune your memory allocation faster.