Reusing a joined dataframe in Spark - scala

I am running HDFS and Spark locally and trying to understand how Spark persistence works. My objective is to store a joined dataset in memory and then run queries against it on the fly. However, my queries seem to be redoing the join rather than simply scanning through the persisted pre-joined dataset.
I have created and persisted two dataframes, let's say df1 and df2, by loading in two CSV files from HDFS. I persist a join of the two dataframes in memory:
val result = df1.join(df2, "USERNAME")
result.persist()
result.count()
I then define some operations on top of result:
val result2 = result.select("FOO", "BAR").groupBy("FOO").sum("BAR")
result2.show()
'result2' does not piggy back on the persisted result and redoes the join on its own. Here are the physical plans for result and result2:
== Physical Plan for result ==
InMemoryColumnarTableScan [...], (InMemoryRelation [...], true, 10000, StorageLevel(true, true, false, true, 1), (TungstenProject [...]), None)
== Physical Plan for result2 ==
TungstenAggregate(key=[FOO#2], functions=[(sum(cast(BAR#10 as double)),mode=Final,isDistinct=false)], output=[FOO#2,sum(BAR)#837])
TungstenExchange hashpartitioning(FOO#2)
TungstenAggregate(key=[FOO#2], functions=[(sum(cast(BAR#10 as double)),mode=Partial,isDistinct=false)], output=[FOO#2,currentSum#1311])
InMemoryColumnarTableScan [FOO#2,BAR#10], (InMemoryRelation [...], true, 10000, StorageLevel(true, true, false, true, 1), (TungstenProject [...]), None)
I would naively assume that since the join is already done and partitioned in memory, the second operation would simply consist of aggregation operations on each partition. It should be more expensive to redo the join from scratch. Am I assuming incorrectly or doing something wrong? Also, is this the right pattern for retaining a joined dataset for later querying?
Edit: For the record, the second query became a lot more performant after I turned down the number of shuffle partitions. By default, spark.sql.shuffle.partitions is set to 200. Simply setting it to one on my local instance considerably improved performance.

If we look at the the plan, we'll see that Spark actually is making use of the cached data and not redoing the join. Starting from the bottom up:
This is Spark reading the data from your cache:
InMemoryColumnarTableScan [FOO#2,BAR#10], (InMemoryRelation ...
This is Spark aggregating BAR by FOO in each partition - look for mode=Partial
TungstenAggregate(key=[FOO#2], functions=[(sum(cast(BAR#10 as double)),mode=Partial ...
This is Spark shuffling the data from each partition of the previous step:
TungstenExchange hashpartitioning(FOO#2)
This is Spark aggregating the shuffled partition sums - look for mode=Final
TungstenAggregate(key=[FOO#2], functions=[(sum(cast(BAR#10 as double)),mode=Final ...
Reading these plans is a bit of a pain so if you have access to the SQL tab of the Spark UI (I think 1.5+), I'd recommend using that instead.

Related

databricks partitioning w/ relation to predicate pushdown

I have searched a lot for a succinct answer, hopefully someone can help me with some clarity on databricks partitioning..
assume i have a data frame with columns: Year, Month, Day, SalesAmount, StoreNumber
I want to store this partitioned by Year, & Month.. so i can run the following command:
df.write.partitionBy('Year', 'Month').format('csv').save('/mnt/path/', header='true')
This will output data in the format of: /path/Year=2019/Month=05/<file-0000x>.csv
If i then load it back again, such as:
spark.read.format('csv').options(header='true').load('/mnt/path/').createOrReplaceTempView("temp1")
Q1: This has not yet actually 'read' the data yet, right? i.e. i could have billions of records.. but until i actually query temp1, nothing is executed against the source?
Q2-A: Subsequently, when querying this data using temp1, it is my assumption that if i include the items that were used in the partitioning in the where clause, a smart filtering on the actual files that are read off the disk will be applied?
%sql
select * from temp1 where Year = 2019 and Month = 05 -- OPTIMAL
whereas the following would not do any file filtering as it has no context of which partitions to look in:
%sql
select * from temp1 where StoreNum = 152 and SalesAmount > 10000 -- SUB-OPTIMAL
Q2-B: Finally, if i stored the files in parquet format (rather than *.csv).. would both of the queries above 'push down' in to the actual data stored.. but in perhaps different ways?
I.e. the first would still use the partitions, but the second (where StoreNum = 152 and SalesAmount > 10000) will now use columnar storage of parquet? While *.csv does not have that optimisation?
Can anyone please clarify my thinking / understanding around this?
links to resources would be great also..
A1: You are right about the evaluation of createOrReplaceTempView. This will be evaluated lazily for the current Spark session. In other word if you terminate Spark session without accessing it the data will never be transfered into temp1.
A2: Let's examine the case through an example using your code. First let's save your data with:
df.write.mode("overwrite").option("header", "true")
.partitionBy("Year", "Month")
.format("csv")
.save("/tmp/partition_test1/")
And then load it with:
val df1 = spark.read.option("header", "true")
.csv("/tmp/partition_test1/")
.where($"Year" === 2019 && $"Month" === 5)
Executing df1.explain will return:
== Physical Plan ==
*(1) FileScan csv [Day#328,SalesAmount#329,StoreNumber#330,Year#331,Month#332] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/tmp/partition_test1], PartitionCount: 0, Partition
Filters: [isnotnull(Year#331), isnotnull(Month#332), (Year#331 = 2019), (Month#332 = 5)], PushedFilters: [], ReadSchema: struct<Day:string,SalesAmount:string,StoreNumber:string>
As you can see the PushedFilters: [] array is empty although the PartitionFilters[] is not, indicating that Spark was able to apply filtering on partitions and therefore pruning the partitions that do not satisfy the where statement.
If we slightly change the Spark query to:
df1.where($"StoreNumber" === 1 && $"Year" === 2011 && $"Month" === 11).explain
== Physical Plan ==
*(1) Project [Day#462, SalesAmount#463, StoreNumber#464, Year#465, Month#466]
+- *(1) Filter (isnotnull(StoreNumber#464) && (cast(StoreNumber#464 as int) = 1))
+- *(1) FileScan csv [Day#462,SalesAmount#463,StoreNumber#464,Year#465,Month#466] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/tmp/partition_test1], PartitionCount: 1, Par
titionFilters: [isnotnull(Month#466), isnotnull(Year#465), (Year#465 = 2011), (Month#466 = 11)], PushedFilters: [IsNotNull(StoreNumber)], ReadSchema: struct<Day:string,SalesAmount:string,Store
Number:string>
Now both PartitionFilters and PushedFilters will take place minimizing Spark workload. As you can see Spark leverages both filters first by recognizing the existing partitions through PartitionFilters and then applying the predicate pushdown.
Exactly the same applies for parquet files with the big difference that parquet will utilize the predicate pushdown filters even more combining them with its internal columnar based system (as you already mentioned), which keeps metrics and stats over the data. So the difference with CSV files is that in the case of CSVs the predicate pushdown will take place when Spark is reading/scanning the CSV files excluding records that do not satisfy the predicate pushdown condition. When for parquet the predicate pushdown filter will be propagated to the parquet internal system resulting to even larger pruning of data.
In your case loading data from createOrReplaceTempView will not differ and the execution plan will remain the same.
Some useful links:
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
https://www.waitingforcode.com/apache-spark-sql/predicate-pushdown-spark-sql/read
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-SparkStrategy-FileSourceStrategy.html
Q1, when you read csv files without providing a schema then it has to infer the schema and a read happens immediately of all files (possibly it filter the partition at this point if it can).
If you were to provide a schema your assumptions on filtering are correct as are the execution event assumptions.
Q2. Not sure I follow. When you say two queries do you mean above or below? On the below one does a write, how do you expect filtering to work on a write?
If you are referring to the first two queries in parquet then the first will eliminate most files and be very quick. The second will hopefully skip some data by using statistics on in the files to show that it doesn’t need to read them. But it will still touch every file.
You may find this useful https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example

Reading Hive table from Spark as a Dataset

I am trying to read a hive table in spark as a strongly typed Dataset, and I am noticing that the partitions are not being pruned as opposed to doing a Spark SQL on a dataframe from the same hive table.
case class States(state: String, country: String)
val hiveDS = spark.table("db1.states").as[States]
//no partition pruning
hiveDS.groupByKey(x=>x.country).count().filter(x=>x._1 == "US")
states is partitioned by country, so when I do a count on the above Dataset, the query scans all the partitions. However if I read it as such -
val hiveDF = spark.table("db1.states")
//correct partition pruning
hiveDF.groupByKey("country").count().filter(x=>x._1 == "US")
The partitions are pruned correctly. Can anyone explain why partition information is lost when you map a table to a case class?
TL;DR Lack of partition pruning in the first case is the expected behavior.
It happens because any operation on an object, unlike operations used with DataFrame DSL / SQL, is a black box, from the the optimizer perspective. To be able to optimize function like x=> x._1 == "US" or x => x.country Spark would have to apply complex and unreliable static analysis, and functionality like this is neither present nor (as far as I know) planned for the future.
The second case shouldn't compile (there is no groupByKey variant which takes strings), so it is not possible to tell, but in general it shouldn't prune either, unless you meant:
hiveDF.groupBy($"country").count().filter($"country" =!= "US")
See also my answer to to Spark 2.0 Dataset vs DataFrame.

Caching Large Dataframes in Spark Effectively

I am currently working on 11,000 files. Each file will generate a data frame which will be Union with the previous one. Below is the code:
var df1 = sc.parallelize(Array(("temp",100 ))).toDF("key","value").withColumn("Filename", lit("Temp") )
files.foreach( filename => {
val a = filename.getPath.toString()
val m = a.split("/")
val name = m(6)
println("FILENAME: " + name)
if (name == "_SUCCESS") {
println("Cannot Process '_SUCCSS' Filename")
} else {
val freqs=doSomething(a).toDF("key","value").withColumn("Filename", lit(name) )
df1=df1.unionAll(freqs)
}
})
First, i got an error of java.lang.StackOverFlowError on 11,000 files. Then, i add a following line after df1=df1.unionAll(freqs):
df1=df1.cache()
It resolves the problem but after each iteration, it is getting slower. Can somebody please suggest me what should be done to avoid StackOverflowError with no decrease in time.
Thanks!
The issue is that spark manages a dataframe as a set of transformations. It begins with the "toDF" of the first dataframe, then perform the transformations on it (e.g. withColumn), then unionAll with the previous dataframe etc.
The unionAll is just another such transformation and the tree becomes very long (with 11K unionAll you have an execution tree of depth 11K). The unionAll when building the information can get to a stack overflow situation.
The caching doesn't solve this, however, I imagine you are adding some action along the way (otherwise nothing would run besides building the transformations). When you perform caching, spark might skip some of the steps and therefor the stack overflow would simply arrive later.
You can go back to RDD for iterative process (your example actually is not iterative but purely parallel, you can simply save each separate dataframe along the way and then convert to RDD and use RDD union).
Since your case seems to be join unioning a bunch of dataframes without true iterations, you can also do the union in a tree manner (i.e. union pairs, then union pairs of pairs etc.) this would change the depth from O(N) to O(log N) where N is the number of unions.
Lastly, you can read and write the dataframe to/from disk. The idea is that after every X (e.g. 20) unions, you would do df1.write.parquet(filex) and then df1 = spark.read.parquet(filex). When you read the lineage of a single dataframe would be the file reading itself. The cost of course would be the writing and reading of the file.

Spark DataFrame row count is inconsistent between runs

When I am running my spark job (version 2.1.1) on EMR, each run counts a different amount of rows on a dataframe. I first read data from s3 to 4 different dataframes, these counts are always consistent an then after joining the dataframes, the result of the join have different counts. afterwards I also filter the result and that also has a different count on each run. The variations are small, 1-5 rows difference but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using filter method instead of where, but with same results
Any thought?
Thanks
Nir
is it possible that one of the data sources changes over over time?
since impressionsJoined is not cached, spark will reevaluate it from scratch on every action, and that includes reading the data again from the source.
try caching impressionsJoined after the join.

How is the performance impact of select statements on Spark DataFrames?

Using many select statements or expressions on Spark DataFrames, I wonder what their performance impact is on subsequent transformations once triggered by an action.
Given a dataframe df with 10 columns a to j.
How is the influence if I use as for column renaming on each column?
df.select( df("a").as("1"), ..., df("j").as("10"))
What if I select a subset (e.g. 5 columns)
val df2 = df.select( df("a"), ..., df("e") )
b. How handles Spark this projection? Is df still kept (as df2 is a projection) so df could serve as kind of reference? Or is instead df2 created freshly and df discarded? (neglecting any persist here)
How is the influence of general Column expressions used in select?
Are performance tests for the above cases available? And are performance measurements in general somewhere available? If not, how to measure the performance best?