Why does calling cache take a long time on a Spark Dataset?

Why does calling cache take a long time on a Spark Dataset? - scala

I'm loading large datasets and then caching them for reference throughout my code. The code looks something like this:
val conversations = sqlContext.read
.format("com.databricks.spark.redshift")
.option("url", jdbcUrl)
.option("tempdir", tempDir)
.option("forward_spark_s3_credentials","true")
.option("query", "SELECT * FROM my_table "+
"WHERE date <= '2017-06-03' "+
"AND date >= '2017-03-06' ")
.load()
.cache()
If I leave off the cache, the code executes quickly because Datasets are evaluated lazily. But if I put on the cache(), the block takes a long time to run.
From the online Spark UI's Event Timeline, it appears that the SQL table is being transmitted to the worker nodes and then cached on the worker nodes.
Why is cache executing immediately? The source code appears to only mark it for caching when the data is computed:
The source code for Dataset calls through to this code in CacheManager.scala when cache or persist is called:
/**
* Caches the data produced by the logical representation of the given [[Dataset]].
* Unlike `RDD.cache()`, the default storage level is set to be `MEMORY_AND_DISK` because
* recomputing the in-memory columnar representation of the underlying table is expensive.
*/
def cacheQuery(
query: Dataset[_],
tableName: Option[String] = None,
storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = writeLock {
val planToCache = query.logicalPlan
if (lookupCachedData(planToCache).nonEmpty) {
logWarning("Asked to cache already cached data.")
} else {
val sparkSession = query.sparkSession
cachedData.add(CachedData(
planToCache,
InMemoryRelation(
sparkSession.sessionState.conf.useCompression,
sparkSession.sessionState.conf.columnBatchSize,
storageLevel,
sparkSession.sessionState.executePlan(planToCache).executedPlan,
tableName)))
}
}
Which only appears to mark for caching rather than actually caching the data. And I would expect caching to return immediately based on other answers on Stack Overflow as well.
Has anyone else seen caching happening immediately before an action is performed on the dataset? Why does this happen?

cache is one of those operators that causes execution of a dataset. Spark will materialize that entire dataset to memory. If you invoke cache on an intermediate dataset that is quite big, this may take a long time.
What might be problematic is that the cached dataset is only stored in memory. When it no longer fits, partitions of the dataset get evicted and are re-calculated as needed (see https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence). With too little memory present, your program could spend a lot of time on re-calculations.
To speed things up with caching, you could give the application more memory, or you can try to use persist(MEMORY_AND_DISK) instead of cache.

I now believe that, as Erik van Oosten answers, the cache() command causes the query to execute.
A close look at the code in my OP does indeed appear to show that the command is being cached. There are two key lines where I think the caching is occurring:
cachedData.add(CachedData(...))
This line creates a new CachedData object, which is added to a cachedData collection of some sort. While the cached data object may be a placeholder to hold cached data later on, it seems more likely that the CachedData object truly holds cached data.
And more importantly, this line:
sparkSession.sessionState.executePlan(planToCache).executedPlan
appears to actually execute the plan. So based on my experience, Erik van Oosten gut feeling about what's going on here, and the source code, I believe that calling cache() causes a Spark Dataset's plan to be executed.

Related

Spark - Stategy for persisting derived dataframes when parent DF is already persisted

I have not found a clear answer to this question yet, even though there are multiple similar questions in SO.
I don't fill-in all the details for the code below, as the actual transformations are not important for my questions.
// Adding _corrupt_record to have records that are not valid json
val inputDf = spark.read.schema(someSchema.add("_corrupt_record", StringType)).json(path)
/**
* The following lazy-persists the DF and does not return a new DF. Since
* Spark>=2.3 the queries from raw JSON/CSV files are disallowed when the
* referenced columns only include the internal corrupt record column
* (named _corrupt_record by default). Caching is the workaround.
*/
inputDf.persist
val uncorruptedDf = inputDf.filter($"_corrupt_record".isNull)
val corruptedDf = inputDf.filter($"_corrupt_record".isNotNull)
// Doing a count on both derived DFs - corruptedDf will also be output for further investigation
log.info("Not corrupted records: " + uncorruptedDf.count)
log.info("Corrupted records: " + corruptedDf.count)
corruptedDf.write.json(corruptedOutputPath)
// Not corrupted data will be used for some complicated transformations
val finalDf = uncorruptedDf.grouby(...).agg(...)
log.info("Finally chosen records: " + finalDf.count)
finalDf.write.json(outputPath)
As you can see, I marked the input dataframe inputDf for persistence (see the reason here), but never did a count on it. Then I derived two dataframes, to both of which I did a count.
Question 1: When I do uncorruptedDf.count, what does it do to the parent dataframe inputdf? Does it trigger caching of the whole inputDf, the part of it that corresponds to uncorruptedDf.count, or nothing? RDD Documentation says that:
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it).
Question 2: Does it make sense at this point (before the two count) to persist the derived dataframes corruptedDf and uncorruptedDf and unpersist inputDf? Since there are two actions happening on each derived dataframe, I would say yes, but I am not sure. If so.. what is the correct place to unpersist the parent DF below? (A), (B), or (C)?
uncorruptedDf.persist
corruptedDf.persist
// (A) I don't think I should inputDf.unpersist here, since derived DFs are not yet persisted
log.info("Not corrupted records: " + uncorruptedDf.count)
log.info("Corrupted records: " + corruptedDf.count)
// (B) This seems a reasonable place, to free some memory
val finalDf = uncorruptedDf.grouby(...).agg(...)
log.info("Finally chosen records: " + finalDf.count)
finalDf.write.json(outputPath)
// (C) Is there any value from unpersisting here?
Question 3: Same as previous question but for finalDf vs corruptedDf. As can be seen I perform two actions on the finalDf: count and write.
Thanks in advance!

For question 1:
Yes it would persist the inputdf when the first count is called which is uncorrupted.count() but it won't persist any transformation that you do on the inputdf. On next count it won't read the data from the json file but it would read it from the partitions that it cached.
For question 2:
I think you should not persist the inputdf as there is nothing that you gain. Persisting the corrupted and uncorrupted of makes sense as you are performing multiple actions on it. You are just performing transformation on the inputdf to filter corrupt and uncorrupt records and spark is smart enough to combine it as one step during its physical planning stage.To conclude you should not persist inputdf and in that way you do not have to worry about unpersisting it.
For question 3:
You should not persist final dataframe as you are only performing one action on it of writing it to physical path as json file.
PS: don't try to cache/ persist each dataframe as caching itself has performance impact and have to do additional work to keep the data in memory or save to disk based on the storage level that you specify. If there are less transformations and are not complex it better to avoid caching. You can use explain command on the dataframe to see the physical ans logical plans.

How to keep RDD persisted and consistent?

I have the following code (simplification for a complex situation):
val newRDD = prevRDD.flatMap{a =>
Array.fill[Int](scala.util.Random.nextInt(10)){scala.util.Random.nextInt(2)})
}.persist()
val a = newRDD.count
val b = newRDD.count
and even that the RDD supposed to be persisted (and therefore consistent), a and b are not identical in most cases.
Is there a way to keep the results of the first action consistent, so when the second "action" will be called, the results of the first action will be returned?
* Edit *
The issue that I have is apparently caused by zipWithIndex method exists in my code - which creates indices higher than the count. I'll ask about it in a different thread. Thanks

There is no way to make sure 100% consistent.
When you call persist it will try to cache all of partitions on memory if it fits.
Otherwise, It will recompute partitions which are not fit on memory.

How persistence works in Spark

I am persisting some dataframes which are stored in var. Now when values to that var changes, how does persistence works? For example:
var checkedBefore_c = AddressValidation.validateAddressInAI(inputAddressesDF, addressDimTablePath, target_o, target_c, autoSeqColName).distinct.filter(col(CommonConstants.API_QUALITY_RATING) >= minQualityThreshold)
checkedBefore_c.persist(StorageLevel.MEMORY_AND_DISK_SER)
var pre_checkedBefore_c = checkedBefore_c.except(checkedBefore_o)
pre_checkedBefore_c.persist(StorageLevel.MEMORY_AND_DISK_SER)
checkedBefore_c = pre_checkedBefore_c.drop(target_o).drop(autoSeqColName)
.withColumn(target_o, pre_checkedBefore_c(target_c))
.withColumn(CommonConstants.API_STATUS, lit("AI-INSERT"))
.withColumn(CommonConstants.API_ERROR_MESSAGE, lit(""))
checkedBefore_c = CommonUtils.addAutoIncremetColumn(checkedBefore_c, autoSeqColName)
checkedBefore_c = checkedBefore_c.select(addDimWithLoggingSchema.head, addDimWithLoggingSchema.tail: _*)
checkedBefore_c.persist(StorageLevel.MEMORY_AND_DISK_SER)

You are trying to persist checkedBefore_c DataFrame, but in your code you have not called any action.
Brief explanation
Spark has two type of operation, transformation and action.
Transformation: Transformations are lazy eventuated like, map, reduceByKey, etc.
Action: Actions are eager eventuated, like foreach, count, save etc.
persist and cache are also lazy operation, so till the time you invoke action persist and cache will not be performed.
For more details please refer Action in Spark. You could also refer this.
Now how persist work.
In persist, spark store partition in in memory or disk or both.
Their are various options, for all option refer, org.apache.spark.storage.StorageLevel source code.
Each executors will be responsible to store their partition, if in memory option is given, first it will try to fit all partition if it does not fit then it will clean old cache data based(It is LRY cache). If still all partition does not fit in memory it will cache partitions which fits in memory and it will leave the rest.
If memory with disk option is selected then, first it will perform all step mentioned above and then left partition will be stored in local disk.
If replication factor is two then, each partition will be cached in two different executors.
In you case you have passed MEMORY_AND_DISK_SER, which means all object will be serialized before caching. By default Java serialization is used, but you could override it and use Kyro serialization, which is recommended.

Dataset.unpersist() unexpectedly affects count of other RDD's

I ran into a strange problem where calling unpersist() on one Dataset affects the count of another Dataset in the same block of code. Unfortunately this happens during a complex long-running job with many Datasets so I can't summarize the whole thing here. I know this makes for a difficult question, however let me try to sketch it out. What I'm looking for is some confirmation that this behavior is unexpected and any ideas about why it may be occurring or how we can avoid it.
Edit: This problem as reported occurs on Spark 2.1.1, but does not occur on 2.1.0. The problem is 100% repeatable but only in my project with 1000's of lines of code and data, I'm working to try to distill it down to a concise example but have not yet been able, I will post any updates or re-submit my question if I find something. The fact that the exact same code and data works in 2.1.0 but not 2.1.1 leads me to believe it is due to something within Spark.
val claims:Dataset = // read claims from file
val accounts:Dataset = // read accounts from file
val providers:Dataset = // read providers from file
val payers:Dataset = // read payers from file
val claimsWithAccount:Dataset = // join claims and accounts
val claimsWithProvider:Dataset = // join claims and providers
val claimsWithPayer:Dataset = // join claimsWithProvider and payers
claimsWithPayer.persist(StorageLevel.MEMORY_AND_DISK)
log.info("claimsWithPayer = " + claimsWithPayer.count()) // 46
// This is considered unnecessary intermediate data and can leave the cache
claimsWithAccount.unpersist()
log.info("claimsWithPayer = " + claimsWithPayer.count()) // 41
Essentially, calling unpersist() on one of the intermediate data sets in a series of joins affects the number of rows in one of the later data sets, as reported by Dataset.count().
My understanding is that unpersist() should remove data from the cache but that it shouldn't affect the count or contents of other data sets? This is especially surprising since I explicitly persist claimsWithPayer before I unpersist the other data.

I believe the behaviour you experience is related to the change that is for "UNCACHE TABLE should un-cache all cached plans that refer to this table".
I think you may find more information in SPARK-21478 Unpersist a DF also unpersists related DFs where Xiao Li said:
This is by design. We do not want to use the invalid cached data.

Recursive Dataframe operations

In my spark application I would like to do operations on a dataframe in a loop and write the result to hdfs.
pseudocode:
var df = emptyDataframe
for n = 1 to 200000{
someDf=read(n)
df = df.mergeWith(somedf)
}
df.writetohdfs
In the above example I get good results when "mergeWith" does a unionAll.
However, when in "mergeWith" I do a (simple) join, the job gets really slow (>1h with 2 executors with 4 cores each) and never finishes (job aborts itself).
In my scenario I throw in ~50 iterations with files that just contain ~1mb of text data.
Because order of merges is important in my case, I'm suspecting this is due to the DAG generation, causing the whole thing to be run at the moment I store away the data.
Right now I'm attempting to use a .persist on the merged data frame but that also seems to go rather slowly.
EDIT:
As the job was running i noticed (even though I did a count and .persist) the dataframe in memory didn't look like a static dataframe.
It looked like a stringed together path to all the merges it had been doing, effectively slowing down the job linearly.
Am I right to assume the var df is the culprit of this?
breakdown of the issue as I see it:
dfA = empty
dfC = dfA.increment(dfB)
dfD = dfC.increment(dfN)....
When I would expect DF' A C and D are object, spark things differently and does not care if I persist or repartition or not.
to Spark it looks like this:
dfA = empty
dfC = dfA incremented with df B
dfD = ((dfA incremented with df B) incremented with dfN)....
Update2
To get rid of the persisting not working on DF's I could "break" the lineage when converting the DF to and RDD and back again.
This has a little bit of an overhead but an acceptable one (job finishes in minutes rather than hours/never)
I'll run some more tests on the persisting and formulate an answer in the form of a workaround.
Result:
This only seems to fix these issues on the surface. In reality I'm back at square one and get OOM exceptionsjava.lang.OutOfMemoryError: GC overhead limit exceeded

If you have code like this:
var df = sc.parallelize(Seq(1)).toDF()
for(i<- 1 to 200000) {
val df_add = sc.parallelize(Seq(i)).toDF()
df = df.unionAll(df_add)
}
Then df will have 400000 partitions afterwards, which makes the following actions inefficient (because you have 1 tasks for each partition).
Try to reduce the number of partitions to e.g. 200 before persisiting the dataframe (using e.g. df.coalesce(200).write.saveAsTable(....))

So the following is what I ended up using. It's performant enough for my usecase, it works and does not need persisting.
It is very much a workaround rather than a fix.
val mutableBufferArray = ArrayBuffer[DataFrame]()
mutableBufferArray.append(hiveContext.emptyDataframe())
for loop {
val interm = mergeDataFrame(df, mutableBufferArray.last)
val intermSchema = interm.schema
val intermRDD = interm.rdd.repartition(8)
mutableBufferArray.append(hiveContext.createDataFrame(intermRDD, intermSchema))
mutableBufferArray.remove(0)
}
This is how I wrestle tungsten into compliance.
By going from a DF to an RDD and back I end up with a real object rather than a whole tungsten generated process pipe from front to back.
In my code I iterate a few times before writing out to disk (50-150 iterations seem to work best). That's where I clear out the bufferArray again to start over fresh.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Why does calling cache take a long time on a Spark Dataset? - scala

Related

Spark - Stategy for persisting derived dataframes when parent DF is already persisted

How to keep RDD persisted and consistent?

How persistence works in Spark

Dataset.unpersist() unexpectedly affects count of other RDD's

Recursive Dataframe operations

Categories

Resources