Why does Spark repeat transformations after persist operations? - scala

I have the following code. I call count to trigger the persist operation and materialize the transformations above it. However, the DAG and stages for the two count jobs show the first persist being executed twice, whereas I expected the second count to read from the second persist.
val df = sparkSession.read
  .parquet(bigData)
  .filter(row => dateRange(row.getLong(5), lowerTimeBound, upperTimeBound))
  .as[SafegraphRawData]
  // Repartition here so that shuffle operations can be performed later
  // (other transformations and minor filtering omitted)
  .repartition(nrInputPartitions)
  // First persist, since the objects do not fit in memory (Persist 67)
  .persist(StorageLevel.MEMORY_AND_DISK)
LOG.info(s"First count = ${df.count}")
val filter: BaseFilter = new BaseFilter()
LOG.info(s"Number of partitions: ${df.rdd.getNumPartitions}")
val rddPoints = df
  .map(parse)
  .filter(filter.IsValid(_, deviceStageMetricService, providerdevicelist, sparkSession))
  .map(convert)
// Second persist: count and partitionBy both run later, so compute the transformations above once
val dsPoints = rddPoints.persist(StorageLevel.MEMORY_AND_DISK)
val totalPoints = dsPoints.count()
LOG.info(s"Second count = $totalPoints")

When you specify StorageLevel.MEMORY_AND_DISK, Spark tries to fit all the data in memory and spills whatever does not fit to disk.
You are doing multiple persists here. Spark's in-memory cache is LRU, so later persists can evict previously cached data.
Even with StorageLevel.MEMORY_AND_DISK, when cached data is evicted from memory by other cached data, Spark doesn't spill it to disk. So on the next count it has to re-evaluate the DAG to recompute the partitions that are no longer in the cache.
I would suggest using StorageLevel.DISK_ONLY to avoid this recomputation.

Here is the whole scenario.
persist and cache are lazy, just like transformations in Spark. After calling either one, you must run an action for the RDD or DataFrame to actually be cached.
Secondly, the unit of caching is the partition. When cache or persist is executed, only the partitions that fit in memory are saved. For the remaining partitions, the whole DAG is executed again whenever a new action is encountered.
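A minimal sketch of the behaviour described above, assuming a local SparkSession and a synthetic dataset built with spark.range. The object name PersistSketch, the column doubled, and the partition count are hypothetical and not taken from the question. It shows that persist only marks the data, that the following action fills the cache, and that the first cache can be released explicitly once the second one has been materialized.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Returns (first count, second count) for n synthetic rows.
  def run(spark: SparkSession, n: Long): (Long, Long) = {
    // persist() is lazy: nothing is cached at this point.
    val raw = spark.range(0, n).repartition(8).persist(StorageLevel.MEMORY_AND_DISK)
    val firstCount = raw.count() // the action fills the first cache

    val points = raw
      .withColumn("doubled", col("id") * 2)
      .filter(col("doubled") % 3 === 0)
      .persist(StorageLevel.MEMORY_AND_DISK)
    val secondCount = points.count() // reads the first cache, fills the second

    // The first cache is no longer needed, so free it explicitly
    // instead of waiting for it to be evicted.
    raw.unpersist()
    (firstCount, secondCount)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local mode, for illustration only
      .appName("persist-sketch")
      .getOrCreate()
    println(run(spark, 1000000L))
    spark.stop()
  }
}
```

Releasing the first cache after the second count keeps the two persisted datasets from competing for the same storage memory for longer than necessary.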

Try:
val df = sparkSession.read
  .parquet(bigData)
  .filter(row => dateRange(row.getLong(5), lowerTimeBound, upperTimeBound))
  .as[SafegraphRawData]
  // Repartition here so that shuffle operations can be performed later
  // (other transformations and minor filtering omitted)
  .repartition(nrInputPartitions)
// First persist, since the objects do not fit in memory (Persist 67)
df.persist(StorageLevel.MEMORY_AND_DISK)

Related

Spark streaming slow down when using large broadcast objects in UDF

I do stream processing from Event Hub using Spark and have run into the following problem. For each incoming message I need to do some stateless calculations. The calculation algorithm is written in Scala and is extremely efficient, but it needs some data structures constructed in advance. The object size is about 50MB, and it could be larger in the future. To avoid sending the object to the workers each time, I broadcast it and then register a UDF. But it doesn't help: the batch duration grows significantly beyond the latency we can tolerate. I figured out that the batch duration depends solely on the object size. For testing purposes I made the object smaller while keeping the computational complexity the same, and the batch duration decreased. Also, when the object is large, the Spark UI marks GC in red (more than 10% of the work is due to garbage collection). This contradicts my understanding of broadcasting: when an object is broadcast, it should be downloaded into the workers' memory and kept there without additional overhead.
I managed to write a business-domain-agnostic example. Here, when n is small, the batch duration is about 0.3 seconds, but when n = 6000 (144MB) it becomes 1.5 seconds (5x), and 4 seconds when n = 10000. The computational complexity doesn't depend on the size of the object, so it seems that using a broadcast object has a huge overhead. Please help me find a solution.
// Emulate a large precalculated object
val n = 10000
val obj = (1 to n).map(i => (1 to n).toArray).toArray
// Broadcast it to the workers (should reduce overhead during execution)
val objBd = sc.broadcast(obj)
// Register the UDF
val myUdf = spark.udf.register("myUdf", (num: Int) => {
  // Emulate a very efficient algorithm that requires a large data structure
  val i = (num + 1) / (num + 1)
  objBd.value(i)(i)
})
// Do the stream processing
spark.readStream
  .format("rate")
  .option("rowsPerSecond", 300)
  .load()
  .withColumn("result", myUdf($"value"))
  .writeStream
  .format("memory")
  .queryName("locations")
  .start()

Do skipped stages have any performance impact on a Spark job?

I am running a Spark Structured Streaming job that creates an empty DataFrame and updates it with each micro-batch, as shown below. With every micro-batch execution, the number of stages increases by 4. To avoid recomputation, I persist the updated staticDF in memory after each update inside the loop. This lets Spark skip the additional stages that are created with every new micro-batch.
My questions:
1) The number of completed stages stays the same because the additional stages are always skipped, but can this still cause a performance issue, given that there could eventually be millions of skipped stages?
2) What happens when part or all of the cached RDD becomes unavailable (node or executor failure)? The Spark documentation says that it doesn't materialise the whole data received from multiple micro-batches so far, so does that mean it will need to read all events from Kafka again to regenerate staticDF?
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{desc, max, row_number}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{LongType, StructType}

// One-time creation of an empty static (non-streaming) DataFrame
val staticDF_schema = new StructType()
  .add("product_id", LongType)
  .add("created_at", LongType)
var staticDF = sparkSession
  .createDataFrame(sparkSession.sparkContext.emptyRDD[Row], staticDF_schema)
// Note: streamingDF was created from a Kafka source
streamingDF.writeStream
  .trigger(Trigger.ProcessingTime(10000L))
  .foreachBatch { (micro_batch_DF: DataFrame, batchId: Long) =>
    // Fetch the max created_at for each product_id in the current micro-batch
    val staging_df = micro_batch_DF.groupBy("product_id")
      .agg(max("created_at").alias("created_at"))
    // Update staticDF using the current micro-batch
    staticDF = staticDF.unionByName(staging_df)
    staticDF = staticDF
      .withColumn("rnk",
        row_number().over(Window.partitionBy("product_id").orderBy(desc("created_at"))))
      .filter("rnk = 1")
      .drop("rnk")
      .cache()
  }
  .start()
Even though the skipped stages don't need any computation, my job started failing after a certain number of batches. The DAG grew with every batch execution until it became unmanageable and threw a stack overflow exception.
To avoid this, I had to break the Spark lineage so that the number of stages doesn't increase with every run (even if they are skipped).
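The answer does not say how the lineage was broken. One possible way, sketched here under stated assumptions, is to call Dataset.checkpoint() on the accumulated DataFrame after each update, which materializes the data and returns a DataFrame whose plan no longer contains the earlier batches. The object name, the temporary checkpoint directory, and the synthetic batches that stand in for the Kafka micro-batches are all hypothetical.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max}

object LineageTruncationSketch {
  // Folds `batches` synthetic micro-batches into one state DataFrame and
  // returns the final product_id -> created_at mapping.
  def run(spark: SparkSession, batches: Int): Map[Long, Long] = {
    import spark.implicits._
    // Assumption: a throwaway local directory; a real job would use reliable storage.
    spark.sparkContext.setCheckpointDir(
      Files.createTempDirectory("spark-checkpoints").toString)

    var state: DataFrame = Seq.empty[(Long, Long)].toDF("product_id", "created_at")
    for (batch <- 1 to batches) {
      // Stand-in for a micro-batch read from the stream.
      val microBatch = Seq((1L, batch.toLong), (2L, batch * 10L))
        .toDF("product_id", "created_at")
      val updated = state.unionByName(microBatch)
        .groupBy("product_id")
        .agg(max(col("created_at")).alias("created_at"))
      // checkpoint() is eager by default: it computes `updated`, writes it to the
      // checkpoint directory, and returns a DataFrame with a truncated plan.
      state = updated.checkpoint()
    }
    state.collect().map(r => r.getLong(0) -> r.getLong(1)).toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local mode, for illustration only
      .appName("lineage-truncation-sketch")
      .getOrCreate()
    println(run(spark, 5))
    spark.stop()
  }
}
```

Dataset.localCheckpoint() is a cheaper alternative that skips reliable storage, at the cost of fault tolerance.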

Iterative caching vs checkpointing in Spark

I have an iterative application running on Spark that I simplified to the following code:
var anRDD: org.apache.spark.rdd.RDD[Int] = sc.parallelize(0 to 1000)
var c: Long = Int.MaxValue
var iteration: Int = 0
while (c > 0) {
  iteration += 1
  // Manipulate the RDD and cache the new RDD
  anRDD = anRDD.zipWithIndex.filter(t => t._2 % 2 == 1).map(_._1).cache() //.localCheckpoint()
  // Actually compute the RDD and spawn a new job
  c = anRDD.count()
  println(s"Iteration: $iteration, Values: $c")
}
What happens to the memory allocation across consecutive jobs?
Does the current anRDD "override" the previous ones, or are they all kept in memory? In the long run, this could throw a memory exception.
Do localCheckpoint and cache behave differently? If localCheckpoint is used in place of cache, then, since localCheckpoint truncates the RDD lineage, I would expect the previous RDDs to be overridden.
Unfortunately, it seems that Spark is not well suited to things like that.
Your original implementation is not viable because on each iteration the newer RDD holds an internal reference to the older one, so all the RDDs pile up in memory.
localCheckpoint is an approximation of what you are trying to achieve. It does truncate the RDD's lineage, but you lose fault tolerance. This is clearly stated in the documentation for the method.
checkpoint is also an option. It is safe, but it dumps the data to HDFS on each iteration.
Consider redesigning the approach. Such hacks could bite sooner or later.
RDDs are immutable, so each transformation returns a new RDD.
All the anRDD instances will be kept in memory. Running two iterations of your code shows that the id is different for each RDD.
So yes, in the long run this can throw a memory exception, and you should unpersist an RDD once you are done processing it.
localCheckpoint has a different use case than cache. It is used to truncate the lineage of an RDD. It does not write the RDD to reliable storage such as HDFS but keeps it in the executors' local storage, which improves performance but reduces fault tolerance.
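The unpersist advice could look like the following sketch, which is an assumption about how to apply it to the loop in the question rather than code from either answer. The object and variable names are hypothetical. The new RDD is materialized while the previous one is still cached, and only then is the previous cache released.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object IterativeUnpersistSketch {
  // Runs the halving loop from the question and returns the number of iterations.
  def run(sc: SparkContext): Int = {
    var current: RDD[Int] = sc.parallelize(0 to 1000).cache()
    var remaining: Long = current.count()
    var iteration = 0
    while (remaining > 0) {
      iteration += 1
      val next = current.zipWithIndex.filter(t => t._2 % 2 == 1).map(_._1).cache()
      // Materialize `next` while `current` is still cached, so the count
      // reads from the cache instead of recomputing earlier iterations.
      remaining = next.count()
      // The previous iteration's cached blocks are no longer needed.
      current.unpersist()
      current = next
    }
    current.unpersist()
    iteration
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local mode, for illustration only
      .appName("iterative-unpersist-sketch")
      .getOrCreate()
    println(s"Iterations: ${run(spark.sparkContext)}")
    spark.stop()
  }
}
```

unpersist frees the cached blocks but does not shorten the lineage, so a loop with many iterations would still need checkpoint or localCheckpoint from time to time.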

When to persist and when to unpersist RDD in Spark

Let's say I have the following:
val dataset2 = dataset1.persist(StorageLevel.MEMORY_AND_DISK)
val dataset3 = dataset2.map(.....)
If I apply a transformation to dataset2, do I have to persist the result before passing it on as dataset3, and unpersist the previous one?
I am trying to figure out when to persist and unpersist RDDs. Do I have to persist every new RDD that is created?
Thanks
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
Reference: http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence

Spark doesn't let me count my joined DataFrames

I am new to Spark jobs and have the following problem.
When I run a count on any of the newly joined DataFrames, the job runs for ages and spills memory to disk. Is there a logic error in here?
// Pass the Spark configuration
val conf = new SparkConf()
  .setMaster(threadMaster)
  .setAppName(appName)
// Create a new Spark context
val sc = new SparkContext(conf)
// Specify a SQL context and pass in the Spark context we created
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create three DataFrames for the sent, clicked, and failed files. Mark them as raw, since they will be renamed
val dfSentRaw = sqlContext.read.parquet(inputPathSent)
val dfClickedRaw = sqlContext.read.parquet(inputPathClicked)
val dfFailedRaw = sqlContext.read.parquet(inputPathFailed)
// Rename the columns to avoid ambiguity when accessing the fields later
val dfSent = dfSentRaw.withColumnRenamed("customer_id", "sent__customer_id")
  .withColumnRenamed("campaign_id", "sent__campaign_id")
  .withColumnRenamed("ced_email", "sent__ced_email")
  .withColumnRenamed("event_captured_dt", "sent__event_captured_dt")
  .withColumnRenamed("riid", "sent__riid")
val dfClicked = dfClickedRaw.withColumnRenamed("customer_id", "clicked__customer_id")
  .withColumnRenamed("event_captured_dt", "clicked__event_captured_dt")
val dfFailed = dfFailedRaw.withColumnRenamed("customer_id", "failed__customer_id")
// LEFT join with CLICKED on two fields, customer_id and campaign_id
val dfSentClicked = dfSent.join(dfClicked,
  dfSent("sent__customer_id") === dfClicked("clicked__customer_id")
    && dfSent("sent__campaign_id") === dfClicked("campaign_id"), "left")
dfSentClicked.count() // THIS WILL NOT WORK
val dfJoined = dfSentClicked.join(dfFailed,
  dfSentClicked("sent__customer_id") === dfFailed("failed__customer_id")
    && dfSentClicked("sent__campaign_id") === dfFailed("campaign_id"), "left")
Why can't these two or three DataFrames be counted anymore? Did I mess up some indexing by renaming?
Thank you!
That count call is the only actual materialization in your Spark job, so the problem is not really count but the shuffle performed for the join right before it. You don't have enough memory to do the join without spilling to disk. Spilling to disk in a shuffle is a very easy way to make your Spark jobs take forever =).
One thing that really helps prevent spilling in shuffles is having more partitions, so that less data moves through the shuffle at any given time. You can set spark.sql.shuffle.partitions, which controls the number of partitions used by Spark SQL in a join or aggregation. It defaults to 200, so you can try a higher setting. http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
You could increase the heap size of your local Spark allocation and/or increase the fraction of memory usable for shuffles by increasing spark.shuffle.memoryFraction (defaults to 0.2) and decreasing spark.storage.memoryFraction (defaults to 0.6). The storage fraction is used, for example, when you make a .cache call, and you might not care about that.
If you want to avoid spills outright, you can turn off spilling by setting spark.shuffle.spill to false. I believe this will throw an exception if you run out of memory and need to spill, instead of silently taking forever, and it could help you tune your memory allocation faster.
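As a rough sketch of the first suggestion, the partition count can be raised on the SparkConf before the context is created. The object name, master, application name, and the value 800 used in the usage line are illustrative assumptions, not recommendations from the answer; the right number depends on the data size and the available memory.

```scala
import org.apache.spark.SparkConf

object ShufflePartitionsConf {
  // Builds a SparkConf that overrides the default of 200 shuffle partitions.
  def build(shufflePartitions: Int): SparkConf =
    new SparkConf()
      .setMaster("local[4]")           // hypothetical master
      .setAppName("join-count-sketch") // hypothetical application name
      .set("spark.sql.shuffle.partitions", shufflePartitions.toString)
}
```

For example, ShufflePartitionsConf.build(800) could be passed to new SparkContext(...) in place of the conf in the question. The same setting can also be changed on an existing context with sqlContext.setConf("spark.sql.shuffle.partitions", "800"). Note that spark.shuffle.memoryFraction, spark.storage.memoryFraction and spark.shuffle.spill are legacy settings from the memory model used before Spark 1.6, so they may have no effect on newer versions.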