Spark shuffling data before insert - scala

CalcDf().show results in 13 stages (0-12), plus 1 more (13) for the show itself.
When I try to write the result to a table, I assumed there would only be 13 stages (0-12); instead I see an additional stage (13). Where does it come from and what does it do? I'm not performing any repartition or other operation that would require a shuffle. As far as I understand, Spark should just write 1100 files into the table, but that's not what's happening.
CalcDf()
  .write
  .mode(SaveMode.Overwrite)
  .insertInto("tn")
CalcDf() logic
val dim = spark.sparkContext.broadcast(
  spark.table("dim")
    .as[Dim]
    .map(r => r.id -> r.col)
    .collect().toMap
)

spark.table("table")
  .as[CustomCC]
  .groupByKey(_.id)
  .flatMapGroups { case (k, iterator) => CustomCC.mapRows(iterator, dim) }
  .withColumn("time_key", lit("2021-07-01"))

Stage #12 ends with a shuffle write, so any subsequent stage has to read that data via a shuffle read (which is what you see in #13).
Why is there an additional stage?
Because stage #12 performs a shuffle write rather than producing output directly.
To understand stage #12 itself, please add information about how CalcDf is built.
EDIT
groupByKey performs a shuffle write so that all rows with the same id end up in a single executor JVM.
Stage #13 reads this shuffled data and computes the map operation afterwards.
The difference in task count can be attributed to the action.
show() doesn't read the whole shuffled data, presumably because it only displays 20 rows (the default),
whereas insertInto(...) operates on the whole data set, so it reads everything.
Stage #13 isn't there just because files are being written; it is actually doing computation.
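If you want to see this for yourself, the physical plan makes the shuffle explicit: groupByKey introduces an Exchange on the grouping key, and everything downstream of that exchange runs as the next stage. A minimal check, assuming the CalcDf() from the question is in scope:

// Print the logical and physical plans; look for an "Exchange hashpartitioning(...)" node.
// That exchange is the shuffle written by stage #12 and read by the stage that follows it.
CalcDf().explain(true)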


Spark - Strategy for persisting derived dataframes when parent DF is already persisted

I have not found a clear answer to this question yet, even though there are multiple similar questions on SO.
I haven't filled in all the details in the code below, as the actual transformations are not important for my questions.
// Adding _corrupt_record to have records that are not valid json
val inputDf = spark.read.schema(someSchema.add("_corrupt_record", StringType)).json(path)
/**
* The following lazy-persists the DF and does not return a new DF. Since
* Spark>=2.3 the queries from raw JSON/CSV files are disallowed when the
* referenced columns only include the internal corrupt record column
* (named _corrupt_record by default). Caching is the workaround.
*/
inputDf.persist
val uncorruptedDf = inputDf.filter($"_corrupt_record".isNull)
val corruptedDf = inputDf.filter($"_corrupt_record".isNotNull)
// Doing a count on both derived DFs - corruptedDf will also be output for further investigation
log.info("Not corrupted records: " + uncorruptedDf.count)
log.info("Corrupted records: " + corruptedDf.count)
corruptedDf.write.json(corruptedOutputPath)
// Not corrupted data will be used for some complicated transformations
val finalDf = uncorruptedDf.groupBy(...).agg(...)
log.info("Finally chosen records: " + finalDf.count)
finalDf.write.json(outputPath)
As you can see, I marked the input dataframe inputDf for persistence (see the reason here), but never did a count on it. Then I derived two dataframes, to both of which I did a count.
Question 1: When I do uncorruptedDf.count, what does it do to the parent dataframe inputDf? Does it trigger caching of the whole inputDf, only the part of it that corresponds to uncorruptedDf.count, or nothing? The RDD documentation says that:
When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it).
Question 2: Does it make sense at this point (before the two count) to persist the derived dataframes corruptedDf and uncorruptedDf and unpersist inputDf? Since there are two actions happening on each derived dataframe, I would say yes, but I am not sure. If so.. what is the correct place to unpersist the parent DF below? (A), (B), or (C)?
uncorruptedDf.persist
corruptedDf.persist
// (A) I don't think I should inputDf.unpersist here, since derived DFs are not yet persisted
log.info("Not corrupted records: " + uncorruptedDf.count)
log.info("Corrupted records: " + corruptedDf.count)
// (B) This seems a reasonable place, to free some memory
val finalDf = uncorruptedDf.groupBy(...).agg(...)
log.info("Finally chosen records: " + finalDf.count)
finalDf.write.json(outputPath)
// (C) Is there any value from unpersisting here?
Question 3: Same as previous question but for finalDf vs corruptedDf. As can be seen I perform two actions on the finalDf: count and write.
Thanks in advance!
For question 1:
Yes, it will persist inputDf when the first count is called (uncorruptedDf.count), but it won't persist any transformation that you apply on top of inputDf. On the next count it won't read the data from the JSON file again; it will read it from the partitions that it cached.
For question 2:
I think you should not persist inputDf, as there is nothing to gain from it. Persisting corruptedDf and uncorruptedDf makes sense, as you are performing multiple actions on them. You are only applying transformations on inputDf to filter corrupt and non-corrupt records, and Spark is smart enough to combine this into one step during its physical planning stage. To conclude: you should not persist inputDf, and that way you do not have to worry about unpersisting it.
For question 3:
You should not persist the final dataframe, as you are only performing one action on it: writing it to a physical path as a JSON file.
PS: Don't try to cache/persist every dataframe. Caching itself has a performance impact, and Spark has to do additional work to keep the data in memory or spill it to disk, depending on the storage level you specify. If the transformations are few and not complex, it is better to avoid caching. You can use the explain command on a dataframe to see its physical and logical plans.
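For instance, a quick way to confirm that Spark reuses the same scan for both filters (a minimal sketch reusing the uncorruptedDf and corruptedDf from the question):

// Prints the parsed, analyzed and optimized logical plans plus the physical plan.
// Both derived DataFrames should show the same JSON scan, differing only in the Filter node
// (plus an InMemoryRelation once inputDf has been marked for caching).
uncorruptedDf.explain(true)
corruptedDf.explain(true)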

Scala Spark isin broadcast list

I'm trying to perform an isin filter as optimized as possible. Is there a way to broadcast collList using the Scala API?
Edit: I'm not looking for an alternative, I know them, but I need isin so my RelationProviders will push down the values.
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
// collList.size == 200000
val retTable = df.filter(col("col1").isin(collList: _*))
The list I'm passing to the isin method has up to ~200,000 unique elements.
I know this doesn't look like the best option and a join sounds better, but I need those elements pushed down into the filters. It makes a huge difference when reading (my storage is Kudu, but it also applies to HDFS+Parquet): the base data is too big and queries work on around 1% of it. I already measured everything, and it saved me around 30 minutes of execution time :). Plus, my method already takes care of the case where the isin list is larger than 200,000.
My problem is that I'm getting some Spark "task is too big" (~8MB per task) warnings. Everything works fine, so it's not a big deal, but I'm looking to remove them and also to optimize.
I've tried the following, which does nothing, as I still get the warning (since the broadcast variable gets resolved in Scala and passed to the varargs, I guess):
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList).value: _*))
And this one which doesn't compile:
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList: _*).value))
And this one, which doesn't work (the task-too-big warning still appears):
val broadcastedList = df.sparkSession.sparkContext.broadcast(collList.map(lit(_).expr))
val filterBroadcasted = In(col("col1").expr, broadcastedList.value)
val retTable = df.filter(new Column(filterBroadcasted))
Any ideas on how to broadcast this variable? (Hacks allowed.) Any alternative to isin that allows filter pushdown is also valid. I've seen some people doing it in PySpark, but the API is not the same.
PS: Changes to the storage are not possible. I know partitioning (the data is already partitioned, but not by that field) and the like could help, but user inputs are totally random and the data is accessed and changed by many clients.
I'd opt for a dataframe broadcast hash join in this case instead of a broadcast variable.
Prepare a dataframe with the collectedDf("col1") collection list you want to filter with isin, and then
use a join between the two dataframes to filter the matching rows.
I think it would be more efficient than isin since you have 200k entries to be filtered. spark.sql.autoBroadcastJoinThreshold is the property you need to set to an appropriate size (10MB by default). AFAIK you can go up to around 200MB or 300MB based on your requirements.
See this BHJ explanation of how it works.
Further reading: Spark efficiently filtering entries from big dataframe that exist in a small dataframe
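A minimal sketch of the join approach described above, reusing df and collList from the question (the broadcast hint forces a broadcast hash join regardless of the threshold; note that, unlike isin, this will not push the predicate down into the source, which is the constraint stated in the question):

import org.apache.spark.sql.functions.broadcast

val spark = df.sparkSession
import spark.implicits._

// Turn the filter values into a one-column dataframe and broadcast it to the executors;
// the inner join then plays the role of the isin filter.
val keysDf = collList.toSeq.toDF("col1")
val filtered = df.join(broadcast(keysDf), Seq("col1"), "inner")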
I'll just live with the big tasks, since I only use this twice (but it saves a lot of time) in my program and I can afford it, but if someone else needs it badly... well, this seems to be the path.
Best alternatives I found for having big arrays pushed down:
Change your relation provider so it broadcasts big lists when pushing down In filters. This will probably leave some broadcast garbage behind, but as long as your app is not a streaming one it shouldn't be a problem; or you can keep the broadcasts in a global list and clean them up after a while.
Add a filter to Spark (I wrote something at https://issues.apache.org/jira/browse/SPARK-31417) which allows broadcasted pushdown all the way to your relation provider. You would have to add your custom predicate, then implement your custom "pushdown" (you can do this by adding a new rule), and then rewrite your RDD/relation provider so it can exploit the fact that the variable is broadcasted.
Use coalesce(X) after reading to decrease the number of tasks; this can work sometimes, depending on how the RelationProvider/RDD is implemented.

Recursive Dataframe operations

In my Spark application I would like to do operations on a dataframe in a loop and write the result to HDFS.
pseudocode:
var df = emptyDataframe
for (n <- 1 to 200000) {
  someDf = read(n)
  df = df.mergeWith(someDf)
}
df.writeToHdfs
In the above example I get good results when "mergeWith" does a unionAll.
However, when "mergeWith" does a (simple) join instead, the job gets really slow (>1h with 2 executors with 4 cores each) and never finishes (the job aborts itself).
In my scenario I run ~50 iterations with files that each contain just ~1 MB of text data.
Because the order of the merges is important in my case, I suspect this is due to the DAG generation, which causes the whole thing to be run at the moment I store the data away.
Right now I'm attempting to use .persist on the merged dataframe, but that also seems to go rather slowly.
EDIT:
While the job was running I noticed (even though I did a count and .persist) that the dataframe in memory didn't look like a static dataframe.
It looked like a strung-together path of all the merges it had been doing, effectively slowing down the job linearly.
Am I right to assume the var df is the culprit of this?
Breakdown of the issue as I see it:
dfA = empty
dfC = dfA.increment(dfB)
dfD = dfC.increment(dfN)....
Where I would expect DFs A, C and D to be objects, Spark thinks differently and does not care whether I persist or repartition or not.
To Spark it looks like this:
dfA = empty
dfC = dfA incremented with df B
dfD = ((dfA incremented with df B) incremented with dfN)....
Update 2
To get around persisting not working on the DFs, I could "break" the lineage by converting the DF to an RDD and back again.
This has a little bit of overhead, but an acceptable one (the job finishes in minutes rather than hours/never).
I'll run some more tests on the persisting and formulate an answer in the form of a workaround.
Result:
This only seems to fix these issues on the surface. In reality I'm back at square one and get OOM exceptions: java.lang.OutOfMemoryError: GC overhead limit exceeded.
If you have code like this:
var df = sc.parallelize(Seq(1)).toDF()
for (i <- 1 to 200000) {
  val df_add = sc.parallelize(Seq(i)).toDF()
  df = df.unionAll(df_add)
}
Then df will have 400000 partitions afterwards, which makes the following actions inefficient (because you have one task for each partition).
Try to reduce the number of partitions to e.g. 200 before persisting the dataframe (using e.g. df.coalesce(200).write.saveAsTable(....)).
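You can check the blow-up directly on the df built in the loop above and collapse it before the write (a small sketch; "some_table" is just a placeholder name):

// Each unionAll concatenates the partition lists of its inputs, so the partition count
// grows linearly with the number of iterations.
println(df.rdd.getNumPartitions)
// Reduce the partition count before the expensive action so there are ~200 tasks instead of ~400000.
df.coalesce(200).write.saveAsTable("some_table")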
So the following is what I ended up using. It's performant enough for my use case; it works and does not need persisting.
It is very much a workaround rather than a fix.
import scala.collection.mutable.ArrayBuffer

// Keep only the latest merged dataframe in the buffer; rebuilding it from an RDD on every
// iteration cuts the accumulated lineage.
val mutableBufferArray = ArrayBuffer[DataFrame]()
mutableBufferArray.append(hiveContext.emptyDataFrame)
for loop {
  val interm = mergeDataFrame(df, mutableBufferArray.last)
  val intermSchema = interm.schema
  val intermRDD = interm.rdd.repartition(8)
  mutableBufferArray.append(hiveContext.createDataFrame(intermRDD, intermSchema))
  mutableBufferArray.remove(0)
}
This is how I wrestle Tungsten into compliance.
By going from a DF to an RDD and back, I end up with a real object rather than a whole Tungsten-generated process pipe from front to back.
In my code I iterate a few times before writing out to disk (50-150 iterations seem to work best). That's where I clear out the bufferArray again to start fresh.
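For reference, more recent Spark versions have a built-in way to cut the lineage that achieves the same effect as this DF-to-RDD round-trip; this is an alternative to the workaround above, not what it uses (sketch, with a placeholder checkpoint directory):

// checkpoint() materializes the dataframe and truncates its logical plan, so each
// iteration starts from stored data instead of the whole accumulated merge history.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints") // placeholder path
df = df.checkpoint()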

Why does RDD.groupBy return an empty RDD if the initial RDD wasn't empty?

I have an RDD that I've used to load binary files. Each file is broken into multiple parts and processed. After the processing step, each entry is:
(filename, List[Results])
Since the files are broken into several parts, the filename is the same for several entries in the RDD. I'm trying to put the results for each part back together using reduceByKey. However, when I attempt to run a count on this RDD it returns 0:
val reducedResults = my_rdd.reduceByKey((resultsA, resultsB) => resultsA ++ resultsB)
reducedResults.count() // 0
I've tried changing the key it uses with no success. Even with extremely simple attempts to group the results I don't get any output.
val singleGroup = my_rdd.groupBy { case (k, v) => 1 }
singleGroup.count() // 0
On the other hand, if I simply collect the results, then I can group them outside of Spark and everything works fine. However, I still have additional processing that I need to do on the collected results, so that isn't a good option.
What could cause the groupBy/reduceBy commands to return empty RDDs if the initial RDD isn't empty?
Turns out there was a bug in how I was generating the Spark configuration for that particular job. Instead of setting the spark.default.parallelism field to something reasonable, it was being set to 0.
From the Spark documentation on spark.default.parallelism:
Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
So while an operation like collect() worked perfectly fine, any attempt to reshuffle the data without specifying the number of partitions gave me an empty RDD. That'll teach me to trust old configuration code.
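If you cannot control the configuration, the shuffle operators also accept an explicit partition count, which sidesteps a bad spark.default.parallelism (a sketch reusing my_rdd from the question; the 8 is arbitrary):

// Pass numPartitions explicitly instead of relying on spark.default.parallelism.
val reducedResults = my_rdd.reduceByKey((resultsA, resultsB) => resultsA ++ resultsB, 8)
reducedResults.count()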

Apache Spark RDD - not updating

I create a PairRDD which contains a Vector.
var newRDD = oldRDD.mapValues(listOfItemsAndRatings => Vector(Array.fill(2){math.random}))
Later on I update the RDD:
newRDD.lookup(ratingObject.user)(0) += 0.2 * (errorRate(rating) * myVector)
However, although it outputs an updated Vector (as shown in the console), when I next call newRDD I can see the Vector value has changed. Through testing I have concluded that it has changed to something given by math.random, as the Vector changes every time I call newRDD. I understand there is a lineage graph and maybe that has something to do with it. I need to update the Vector held in the RDD to new values, and I need to do this repeatedly.
Thanks.
RDDs are immutable structures meant to distribute operations on data over a cluster.
There are two elements playing a role in the behavior you are observing here:
The RDD lineage may be recomputed every time. In this case, it means that an action on newRDD might trigger the lineage computation, re-applying the Vector(Array.fill(2){math.random}) transformation and resulting in new values each time. The lineage can be broken using cache, in which case the value of the transformation will be kept in memory and/or on disk after the first time it is applied.
This results in:
val randomVectorRDD = oldRDD.mapValues(listOfItemsAndRatings => Vector(Array.fill(2){math.random}))
randomVectorRDD.cache()
The second aspect that needs further consideration is the in-place mutation:
newRDD.lookup(ratingObject.user)(0) += 0.2 * (errorRate(rating) * myVector)
Although this might work on a single machine because all Vector references are local, it will not scale to a cluster, as lookup references will be serialized and the mutations will not be preserved. It therefore raises the question of why use Spark for this at all.
To be implemented on Spark, this algorithm needs to be re-designed so that it is expressed in terms of transformations instead of point lookups/mutations.
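A rough sketch of what that re-design looks like (newRDD and ratingObject come from the question; updateValue is a hypothetical helper standing in for the 0.2 * errorRate(rating) * myVector adjustment):

// Express the update as a transformation that produces a new RDD, instead of mutating
// the result of lookup() on the driver; the original RDD stays immutable.
val updatedRDD = newRDD.map { case (user, value) =>
  if (user == ratingObject.user) (user, updateValue(value)) // updateValue: hypothetical helper
  else (user, value)
}
updatedRDD.cache() // keep the new values so later actions don't recompute the random lineage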