databricks partitioning w/ relation to predicate pushdown - pyspark

I have searched a lot for a succinct answer; hopefully someone can give me some clarity on Databricks partitioning.
Assume I have a data frame with columns: Year, Month, Day, SalesAmount, StoreNumber.
I want to store this partitioned by Year and Month, so I can run the following command:
df.write.partitionBy('Year', 'Month').format('csv').save('/mnt/path/', header='true')
This will output data in the format of: /mnt/path/Year=2019/Month=05/<file-0000x>.csv
If I then load it back again, such as:
spark.read.format('csv').options(header='true').load('/mnt/path/').createOrReplaceTempView("temp1")
Q1: This has not actually 'read' the data yet, right? I.e. I could have billions of records, but until I actually query temp1, nothing is executed against the source?
Q2-A: Subsequently, when querying this data using temp1, my assumption is that if I include the columns used for partitioning in the where clause, smart filtering will be applied to the actual files read off disk?
%sql
select * from temp1 where Year = 2019 and Month = 05 -- OPTIMAL
whereas the following would not do any file filtering as it has no context of which partitions to look in:
%sql
select * from temp1 where StoreNum = 152 and SalesAmount > 10000 -- SUB-OPTIMAL
Q2-B: Finally, if I stored the files in parquet format (rather than *.csv), would both of the queries above 'push down' into the actual stored data, but perhaps in different ways?
I.e. the first would still use the partitions, but the second (where StoreNum = 152 and SalesAmount > 10000) would now use parquet's columnar storage? While *.csv does not have that optimisation?
Can anyone please clarify my thinking / understanding around this?
Links to resources would be great also.

A1: You are right about the evaluation of createOrReplaceTempView. It is evaluated lazily for the current Spark session. In other words, if you terminate the Spark session without accessing the view, the data is never transferred into temp1.
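To make the laziness concrete, a minimal sketch (paths taken from your example):
// Registering the view triggers no job by itself (aside from any
// schema-inference scan Spark may do for CSV without an explicit schema).
spark.read.option("header", "true").csv("/mnt/path/").createOrReplaceTempView("temp1")
// Only an action such as show() or count() actually executes against the source:
spark.sql("select * from temp1 where Year = 2019 and Month = 05").show()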
A2: Let's examine the case through an example using your code. First let's save your data with:
df.write.mode("overwrite").option("header", "true")
  .partitionBy("Year", "Month")
  .format("csv")
  .save("/tmp/partition_test1/")
And then load it with:
val df1 = spark.read.option("header", "true")
  .csv("/tmp/partition_test1/")
  .where($"Year" === 2019 && $"Month" === 5)
Executing df1.explain will return:
== Physical Plan ==
*(1) FileScan csv [Day#328,SalesAmount#329,StoreNumber#330,Year#331,Month#332] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/tmp/partition_test1], PartitionCount: 0, PartitionFilters: [isnotnull(Year#331), isnotnull(Month#332), (Year#331 = 2019), (Month#332 = 5)], PushedFilters: [], ReadSchema: struct<Day:string,SalesAmount:string,StoreNumber:string>
As you can see, the PushedFilters: [] array is empty although PartitionFilters is not, indicating that Spark was able to apply filtering at the partition level and therefore prune the partitions that do not satisfy the where statement.
If we slightly change the Spark query to:
df1.where($"StoreNumber" === 1 && $"Year" === 2011 && $"Month" === 11).explain
== Physical Plan ==
*(1) Project [Day#462, SalesAmount#463, StoreNumber#464, Year#465, Month#466]
+- *(1) Filter (isnotnull(StoreNumber#464) && (cast(StoreNumber#464 as int) = 1))
   +- *(1) FileScan csv [Day#462,SalesAmount#463,StoreNumber#464,Year#465,Month#466] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/tmp/partition_test1], PartitionCount: 1, PartitionFilters: [isnotnull(Month#466), isnotnull(Year#465), (Year#465 = 2011), (Month#466 = 11)], PushedFilters: [IsNotNull(StoreNumber)], ReadSchema: struct<Day:string,SalesAmount:string,StoreNumber:string>
Now both PartitionFilters and PushedFilters take effect, minimizing Spark's workload. As you can see, Spark leverages both filters: it first recognizes the relevant partitions through PartitionFilters and then applies the predicate pushdown within them.
Exactly the same applies for parquet files, with the big difference that parquet utilizes predicate pushdown even further by combining it with its internal columnar storage (as you already mentioned), which keeps metrics and statistics over the data. So the difference from CSV is that with CSVs the predicate pushdown takes place while Spark is reading/scanning the files, excluding records that do not satisfy the condition; with parquet, the pushdown filter is propagated into parquet's internal layer, where row-group statistics allow even larger chunks of data to be pruned without being read.
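To see this, a quick sketch of the same experiment with parquet (the plan fragment in the comments is illustrative of what to expect, not verbatim output):
df.write.mode("overwrite").partitionBy("Year", "Month").parquet("/tmp/partition_test2/")
val pq = spark.read.parquet("/tmp/partition_test2/")
pq.where($"StoreNumber" === 152 && $"SalesAmount" > 10000).explain()
// Expect something like:
//   PushedFilters: [IsNotNull(StoreNumber), IsNotNull(SalesAmount),
//                   EqualTo(StoreNumber,152), GreaterThan(SalesAmount,10000.0)]
// The parquet reader can evaluate these against row-group min/max statistics
// and skip whole row groups, which a CSV scan cannot do.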
In your case, loading the data through createOrReplaceTempView will not differ and the execution plan will remain the same.
Some useful links:
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
https://www.waitingforcode.com/apache-spark-sql/predicate-pushdown-spark-sql/read
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-SparkStrategy-FileSourceStrategy.html

Q1: when you read CSV files without providing a schema, Spark has to infer the schema, so a read of all the files happens immediately (possibly filtering on the partitions at that point if it can).
If you were to provide a schema, your assumptions about filtering are correct, as are your assumptions about execution.
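For example, a sketch of providing an explicit schema up front (the column types here are assumed) so the load stays lazy:
import org.apache.spark.sql.types._

// Year and Month are recovered as partition columns from the directory names
val schema = StructType(Seq(
  StructField("Day", IntegerType),
  StructField("SalesAmount", DoubleType),
  StructField("StoreNumber", IntegerType),
  StructField("Year", IntegerType),
  StructField("Month", IntegerType)
))
val df = spark.read.schema(schema).option("header", "true").csv("/mnt/path/")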
Q2: Not sure I follow. When you say two queries, do you mean the ones above or below? The one below does a write; how do you expect filtering to work on a write?
If you are referring to the first two queries in parquet, then the first will eliminate most files and be very quick. The second will hopefully skip some data by using the statistics in the files to show that it doesn't need to read them, but it will still touch every file.
You may find this useful https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example

Related

Read file content per row of Spark DataFrame

We have an AWS S3 bucket with millions of documents in a complex hierarchy, and a CSV file with (among other data) links to a subset of those files; I estimate this file will be about 1,000 to 10,000 rows. I need to join the data from the CSV file with the contents of the documents for further processing in Spark. In case it matters, we're using Scala and Spark 2.4.4 on an Amazon EMR 6.0.0 cluster.
I can think of two ways to do this. First is to add a transformation on the CSV DataFrame that adds the content as a new column:
val df = spark.read.format("csv").load("<csv file>")
val attempt1 = df.withColumn("raw_content", spark.sparkContext.textFile($"document_url"))
or variations thereof (for example, wrapping it in a udf) don't seem to work, I think because sparkContext.textFile returns an RDD, so I'm not sure it's even supposed to work this way? And even if I got it working, would that be the performant way to do it in Spark?
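For reference, I suspect a per-row read would have to happen on the executors rather than via sparkContext (which only exists on the driver), e.g. via the Hadoop FileSystem API inside a udf. A rough sketch I have not fully verified (error handling omitted; assumes the cluster's S3 filesystem is configured, as on EMR):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.functions.udf
import scala.io.Source

val readContent = udf { url: String =>
  val path = new Path(url)
  // FileSystem.get resolves the right implementation for the scheme (s3://, hdfs://, ...)
  val fs = FileSystem.get(path.toUri, new Configuration())
  val in = fs.open(path)
  try Source.fromInputStream(in).mkString finally in.close()
}

val attempt1b = df.withColumn("raw_content", readContent($"document_url"))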
An alternative I tried to think of is to use spark.sparkContext.wholeTextFiles upfront and then join the two dataframes together:
val df = spark.read.format("csv").load("<csv file>")
val contents = spark.sparkContext.wholeTextFiles("<s3 bucket>").toDF("document_url", "raw_content");
val attempt2 = df.join(contents, df("document_url") === contents("document_url"), "left")
but wholeTextFiles doesn't go into subdirectories and the needed paths are hard to predict, and I'm also unsure of the performance impact of trying to build an RDD of the entire bucket of millions of files if I only need a small fraction of it, since the S3 API probably doesn't make it very fast to list all the objects in the bucket.
Any ideas? Thanks!
I did figure out a solution in the end:
import spark.implicits._ // needed for .map on the DataFrame and .toDF

val df = spark.read.format("csv").load("<csv file>")
// Pull every document URL down to the driver...
val allS3Links = df.map(row => row.getAs[String]("document_url")).collect()
// ...and hand them to wholeTextFiles as a single comma-separated path list
val joined = allS3Links.mkString(",")
val contentsDF = spark.sparkContext.wholeTextFiles(joined).toDF("document_url", "raw_content")
The downside to this solution is that it pulls all the urls to the driver, but it's workable in my case (100,000 * ~100 char length strings is not that much) and maybe even unavoidable.

Can't write ordered data to parquet in spark

I am working with Apache Spark to generate parquet files. I can partition them by date with no problems, but internally I cannot seem to lay out the data in the correct order.
The order seems to get lost during processing, which means the parquet metadata is not right (specifically, I want to ensure that the parquet row groups reflect sorted order so that queries specific to my use case can filter efficiently via the metadata).
Consider the following example:
// note: hbase source is a registered temp table generated from hbase
val transformed = sqlContext.sql(s"SELECT id, sampleTime, ... , toDate(sampleTime) as date FROM hbaseSource")
// Repartition the input set by the date column (in my source there should be 2 distinct dates)
val sorted = transformed.repartition($"date").sortWithinPartitions("id", "sampleTime")
sorted.coalesce(1).write.partitionBy("date").parquet(s"/outputFiles")
With this approach, I do get the right parquet partition structure ( by date). And even better, for each date partition, I see a single large parquet file.
/outputFiles/date=2018-01-01/part-00000-4f14286c-6e2c-464a-bd96-612178868263.snappy.parquet
However, when I query the file I see the contents out of order. To be specific, "out of order" seems more like several ordered data-frame partitions have been merged into the file.
The parquet row group metadata shows that the sorted fields are actually overlapping (a specific id, for example, could be located in many row groups):
id: :[min: 54, max: 65012, num_nulls: 0]
sampleTime: :[min: 1514764810000000, max: 1514851190000000, num_nulls: 0]
id: :[min: 827, max: 65470, num_nulls: 0]
sampleTime: :[min: 1514764810000000, max: 1514851190000000, num_nulls: 0]
id: :[min: 1629, max: 61412, num_nulls: 0]
I want the data to be properly ordered inside each file so the metadata min/max in each row group are non-overlapping.
For example, this is the pattern I want to see:
RG 0: id: :[min: 54, max: 100, num_nulls: 0]
RG 1: id: :[min: 100, max: 200, num_nulls: 0]
... where RG = "row group". If I wanted id = 75, the query could find it in one row group.
I have tried many variations of the above code. For example with and without coalesce (I know coalesce is bad, but my idea was to use it to prevent shuffling). I have also tried sort instead of sortWithinPartitions (sort should create a total ordered sort, but will result in many partitions). For example:
val sorted = transformed.repartition($"date").sort("id", "sampleTime")
sorted.write.partitionBy("date").parquet(s"/outputFiles")
Gives me 200 files, which is too many, and they are still not sorted correctly. I can reduce the file count by adjusting the shuffle size, but I would have expected sort to be processed in order during the write (I was under the impression that writes did not shuffle the input). The order I see is as follows (other fields omitted for brevity):
+----------+----------------+
|        id|      sampleTime|
+----------+----------------+
|     56868|1514840220000000|
|     57834|1514785180000000|
|     56868|1514840220000000|
|     57834|1514785180000000|
|     56868|1514840220000000|
Which looks like it's interleaved sorted partitions. So I think repartition buys me nothing here, and sort seems to be incapable of preserving order on the write step.
I've read that what I want to do should be possible. I've even tried the approach outlined in the presentation "Parquet performance tuning: The missing guide" by Ryan Blue (unfortunately it is behind the O'Reilly paywall). That involves using insertInto. In that case, Spark seemed to use an old version of parquet-mr which corrupted the metadata, and I am not sure how to upgrade it.
I am not sure what I am doing wrong. My feeling is that I am misunderstanding the way repartition($"date") and sort work and/or interact.
I would appreciate any ideas. Apologies for the essay. :)
Edit:
Also note that if I do a show(n) on transformed.sort("id", "sampleTime"), the data is sorted correctly. So it seems like the problem occurs during the write stage. As noted above, it does seem like the output of the sort is shuffled during the write.
The problem is that when saving to a file format, Spark requires a certain ordering, and if that ordering is not satisfied, Spark will sort the data itself during the save according to the requirement and will forget your sort. To be more specific, Spark requires this ordering (taken directly from the Spark source code of Spark 2.4.4):
val requiredOrdering = partitionColumns ++ bucketIdExpression ++ sortColumns
where partitionColumns are the columns by which you partition the data. You are not using bucketing, so bucketIdExpression and sortColumns are not relevant in this example and the requiredOrdering will be only the partitionColumns. So if this is your code:
val sorted = transformed.repartition($"date").sortWithinPartitions("id", "sampleTime")
sorted.write.partitionBy("date").parquet(s"/outputFiles")
Spark will check whether the data is sorted by date, which it is not, so Spark will forget your sort and will sort it by date. On the other hand, if you instead do it like this:
val sorted = transformed.repartition($"date").sortWithinPartitions("date", "id", "sampleTime")
sorted.write.partitionBy("date").parquet(s"/outputFiles")
Spark will check again whether the data is sorted by date, and this time it is (the requirement is satisfied), so Spark will preserve this order and induce no more sorts while saving the data. So I believe it should work this way.
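If you want to verify the effect (a suggestion beyond the answer itself), you can dump the footer metadata of a written file, e.g. with parquet-tools meta /outputFiles/date=2018-01-01/<part-file>.snappy.parquet, and check that the id min/max ranges of successive row groups no longer overlap.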
Just an idea: sort after the coalesce, i.e. .coalesce(1).sortWithinPartitions(...). Also, the expected result looks strange: why is ordered data inside the parquet file required? Sorting after reading seems more appropriate.

Caching Large Dataframes in Spark Effectively

I am currently working with 11,000 files. Each file generates a data frame which is unioned with the previous one. Below is the code:
// seed DataFrame; each file's result is unioned onto it below
var df1 = sc.parallelize(Array(("temp", 100))).toDF("key", "value").withColumn("Filename", lit("Temp"))
files.foreach(filename => {
  val a = filename.getPath.toString()
  val m = a.split("/")
  val name = m(6)
  println("FILENAME: " + name)
  if (name == "_SUCCESS") {
    println("Cannot Process '_SUCCESS' Filename")
  } else {
    val freqs = doSomething(a).toDF("key", "value").withColumn("Filename", lit(name))
    df1 = df1.unionAll(freqs)
  }
})
First, I got a java.lang.StackOverflowError on 11,000 files. Then I added the following line after df1 = df1.unionAll(freqs):
df1 = df1.cache()
It resolves the problem, but after each iteration it gets slower. Can somebody please suggest what should be done to avoid a StackOverflowError without the slowdown?
Thanks!
The issue is that Spark manages a dataframe as a set of transformations. It begins with the "toDF" of the first dataframe, then performs the transformations on it (e.g. withColumn), then unionAll with the previous dataframe, etc.
The unionAll is just another such transformation and the tree becomes very long (with 11K unionAll calls you have an execution tree of depth 11K). Building that tree is what can reach a stack overflow situation.
The caching doesn't solve this; however, I imagine you are adding some action along the way (otherwise nothing would run besides building the transformations). When you perform caching, Spark might skip some of the steps, and therefore the stack overflow would simply arrive later.
You can go back to RDDs for an iterative process (your example is actually not iterative but purely parallel; you can simply save each separate dataframe along the way, then convert to RDDs and use RDD union).
Since your case seems to be unioning a bunch of dataframes without true iteration, you can also do the union in a tree manner (i.e. union pairs, then union pairs of pairs, etc.); this changes the depth from O(N) to O(log N), where N is the number of unions. See the sketch below.
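A minimal sketch of the pairwise idea, assuming the per-file dataframes are first collected into a local Seq (treeUnion is a hypothetical helper; unionAll matches your code, it is union on Spark 2+):
import org.apache.spark.sql.DataFrame

// Union a non-empty Seq of DataFrames pairwise, so the resulting plan
// has depth O(log N) instead of O(N).
def treeUnion(dfs: Seq[DataFrame]): DataFrame =
  if (dfs.size == 1) dfs.head
  else treeUnion(dfs.grouped(2).map {
    case Seq(a, b) => a.unionAll(b)
    case Seq(a)    => a
  }.toSeq)

// e.g. build one small DataFrame per file first, then:
// val df1 = treeUnion(perFileDfs)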
Lastly, you can read and write the dataframe to/from disk. The idea is that after every X (e.g. 20) unions, you would do df1.write.parquet(filex) and then df1 = spark.read.parquet(filex). When you read a single dataframe back, its lineage is just the file read itself. The cost, of course, is the writing and reading of the file. A sketch follows.
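Something like this inside your loop (the path and the interval of 20 are illustrative; use sqlContext.read on Spark 1.x):
// with var count = 0 declared before the loop, after df1 = df1.unionAll(freqs):
count += 1
if (count % 20 == 0) {
  // Materialize to disk and re-read: the re-read dataframe's lineage
  // is just the parquet scan, so the plan stops growing.
  df1.write.parquet(s"/tmp/checkpoint_$count")
  df1 = spark.read.parquet(s"/tmp/checkpoint_$count")
}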

Spark DataFrame row count is inconsistent between runs

When I run my Spark job (version 2.1.1) on EMR, each run counts a different number of rows in a dataframe. I first read data from S3 into 4 different dataframes; these counts are always consistent. Then, after joining the dataframes, the result of the join has a different count on each run. Afterwards I also filter the result, and that too has a different count on each run. The variations are small, 1-5 rows of difference, but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using the filter method instead of where, but with the same results.
Any thought?
Thanks
Nir
Is it possible that one of the data sources changes over time?
Since impressionsJoined is not cached, Spark will re-evaluate it from scratch on every action, and that includes reading the data again from the source.
Try caching impressionsJoined after the join, as sketched below.
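A minimal sketch of that suggestion (names from your snippet):
val impressionsJoinedCached = impressionsJoined.cache()
impressionsJoinedCached.count() // forces materialization of the cache
// Subsequent counts and filters now reuse one consistent snapshot
// instead of re-reading the S3 sources on every action.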

Reusing a joined dataframe in Spark

I am running HDFS and Spark locally and trying to understand how Spark persistence works. My objective is to store a joined dataset in memory and then run queries against it on the fly. However, my queries seem to be redoing the join rather than simply scanning through the persisted pre-joined dataset.
I have created and persisted two dataframes, let's say df1 and df2, by loading in two CSV files from HDFS. I persist a join of the two dataframes in memory:
val result = df1.join(df2, "USERNAME")
result.persist()
result.count()
I then define some operations on top of result:
val result2 = result.select("FOO", "BAR").groupBy("FOO").sum("BAR")
result2.show()
'result2' does not piggy back on the persisted result and redoes the join on its own. Here are the physical plans for result and result2:
== Physical Plan for result ==
InMemoryColumnarTableScan [...], (InMemoryRelation [...], true, 10000, StorageLevel(true, true, false, true, 1), (TungstenProject [...]), None)
== Physical Plan for result2 ==
TungstenAggregate(key=[FOO#2], functions=[(sum(cast(BAR#10 as double)),mode=Final,isDistinct=false)], output=[FOO#2,sum(BAR)#837])
 TungstenExchange hashpartitioning(FOO#2)
  TungstenAggregate(key=[FOO#2], functions=[(sum(cast(BAR#10 as double)),mode=Partial,isDistinct=false)], output=[FOO#2,currentSum#1311])
   InMemoryColumnarTableScan [FOO#2,BAR#10], (InMemoryRelation [...], true, 10000, StorageLevel(true, true, false, true, 1), (TungstenProject [...]), None)
I would naively assume that since the join is already done and partitioned in memory, the second operation would simply consist of aggregation operations on each partition. It should be more expensive to redo the join from scratch. Am I assuming incorrectly or doing something wrong? Also, is this the right pattern for retaining a joined dataset for later querying?
Edit: For the record, the second query became a lot more performant after I turned down the number of shuffle partitions. By default, spark.sql.shuffle.partitions is set to 200. Simply setting it to 1 on my local instance considerably improved performance.
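For reference, that setting can be changed per session; sqlContext.setConf is the Spark 1.x form (spark.conf.set on 2.x):
// Spark 1.x (matches the Tungsten-era plans above)
sqlContext.setConf("spark.sql.shuffle.partitions", "1")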
If we look at the plan, we'll see that Spark actually is making use of the cached data and is not redoing the join. Starting from the bottom up:
This is Spark reading the data from your cache:
InMemoryColumnarTableScan [FOO#2,BAR#10], (InMemoryRelation ...
This is Spark aggregating BAR by FOO in each partition - look for mode=Partial
TungstenAggregate(key=[FOO#2], functions=[(sum(cast(BAR#10 as double)),mode=Partial ...
This is Spark shuffling the data from each partition of the previous step:
TungstenExchange hashpartitioning(FOO#2)
This is Spark aggregating the shuffled partition sums - look for mode=Final
TungstenAggregate(key=[FOO#2], functions=[(sum(cast(BAR#10 as double)),mode=Final ...
Reading these plans is a bit of a pain, so if you have access to the SQL tab of the Spark UI (I think 1.5+), I'd recommend using that instead.