Spark SQL doesn't seem to be able to push down LIMIT operators on inner joined tables. This is an issue when joining large tables to extract a small subset of rows. I'm testing on Spark 2.2.1 (most recent release).
Below is a contrived example, which runs in the spark-shell (Scala):
First, set up the tables:
case class Customer(id: Long, name: String, email: String, zip: String)
case class Order(id: Long, customer: Long, date: String, amount: Long)
val customers = Seq(
Customer(0, "George Washington", "gwashington#usa.gov", "22121"),
Customer(1, "John Adams", "gwashington#usa.gov", "02169"),
Customer(2, "Thomas Jefferson", "gwashington#usa.gov", "22902"),
Customer(3, "James Madison", "gwashington#usa.gov", "22960"),
Customer(4, "James Monroe", "gwashington#usa.gov", "22902")
)
val orders = Seq(
Order(1, 1, "07/04/1776", 23456),
Order(2, 3, "03/14/1760", 7850),
Order(3, 2, "05/23/1784", 12400),
Order(4, 3, "09/03/1790", 6550),
Order(5, 4, "07/21/1795", 2550),
Order(6, 0, "11/27/1787", 1440)
)
import spark.implicits._
val customerTable = spark.sparkContext.parallelize(customers).toDS()
customerTable.createOrReplaceTempView("customer")
val orderTable = spark.sparkContext.parallelize(orders).toDS()
orderTable.createOrReplaceTempView("order")
Now run the following join query, with a LIMIT and an arbitrary filter for each joined table:
scala> val join = spark.sql("SELECT c.* FROM customer c JOIN order o ON c.id = o.customer WHERE c.id > 1 AND o.amount > 5000 LIMIT 1")
Then print the corresponding optimized execution plan:
scala> println(join.queryExecution.sparkPlan.toString)
CollectLimit 1
+- Project [id#5L, name#6, email#7, zip#8]
+- SortMergeJoin [id#5L], [customer#17L], Inner
:- Filter (id#5L > 1)
: +- SerializeFromObject [assertnotnull(input[0, $line14.$read$$iw$$iw$Customer, true]).id AS id#5L, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line14.$read$$iw$$iw$Customer, true]).name, true) AS name#6, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line14.$read$$iw$$iw$Customer, true]).email, true) AS email#7, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line14.$read$$iw$$iw$Customer, true]).zip, true) AS zip#8]
: +- Scan ExternalRDDScan[obj#4]
+- Project [customer#17L]
+- Filter ((amount#19L > 5000) && (customer#17L > 1))
+- SerializeFromObject [assertnotnull(input[0, $line15.$read$$iw$$iw$Order, true]).id AS id#16L, assertnotnull(input[0, $line15.$read$$iw$$iw$Order, true]).customer AS customer#17L, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, $line15.$read$$iw$$iw$Order, true]).date, true) AS date#18, assertnotnull(input[0, $line15.$read$$iw$$iw$Order, true]).amount AS amount#19L]
+- Scan ExternalRDDScan[obj#15]
and you can immediately see that both tables are sorted in their entirety before being joined (although for these small example tables the explicit Sort step is not shown before the SortMergeJoin), and only afterward is the LIMIT applied.
If one of the databases contains billions of rows, this query becomes extremely slow and resource intensive, regardless of the LIMIT size.
Is Spark capable of optimizing such a query? Or can I work around the issue without mangling my SQL beyond recognition?
Spark 3.0 adds LimitPushDown so from that version it will be able to do that.
Is Spark capable of optimizing such a query
In short, it is not.
Using old nomenclature, join is a wide a transformation, where each output partition depends on every upstream partition. As a result, both parent datasets have to be fully scanned, to compute even a single partition of the child.
It is not impossible that some optimizations will be included in the future, however if you goal is to:
extract a small subset of rows.
then you should consider using database, not Apache Spark.
Not sure whether this will be helpful in your use case, but did you consider adding the filter and limit clause in order table itself.
val orderTableLimited = orderTable.filter($"customer" >
1).filter($"amount" > 5000).limit(1)
Related
I have a DataFrame with the columns:
field1, field1_name, field3, field5, field4, field2, field6
I am selecting it so that I only keep field1, field2, field3, field4. Note that there is no field5 after the select.
After that, I have a filter that uses field5 and I would expect it to throw an analysis error since the column is not there, but instead it is filtering the original DataFrame (before the select) because it is pushing down the filter, as shown here:
== Parsed Logical Plan ==
'Filter ('field5 = 22)
+- Project [field1#43, field2#48, field3#45, field4#47]
+- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv
== Analyzed Logical Plan ==
field1: string, field2: string, field3: string, field4: string
Project [field1#43, field2#48, field3#45, field4#47]
+- Filter (field5#46 = 22)
+- Project [field1#43, field2#48, field3#45, field4#47, field5#46]
+- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv
== Optimized Logical Plan ==
Project [field1#43, field2#48, field3#45, field4#47]
+- Filter (isnotnull(field5#46) && (field5#46 = 22))
+- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv
== Physical Plan ==
*Project [field1#43, field2#48, field3#45, field4#47]
+- *Filter (isnotnull(field5#46) && (field5#46 = 22))
+- *FileScan csv [field1#43,field3#45,field5#46,field4#47,field2#48] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/..., PartitionFilters: [], PushedFilters: [IsNotNull(field5), EqualTo(field5,22)], ReadSchema: struct<field1:string,field3:string,field5:string,field4:stri...
As you can see the physical plan has the filter before the project... Is this the expected behaviour? I would expect an analysis exception instead...
A reproducible example of the issue:
val df = Seq(
("", "", "")
).toDF("field1", "field2", "field3")
val selected = df.select("field1", "field2")
val shouldFail = selected.filter("field3 == 'dummy'") // I was expecting this filter to fail
shouldFail.show()
Output:
+------+------+
|field1|field2|
+------+------+
+------+------+
The documentation on the Dataset/Dataframe describes the reason for what you are observing quite well:
"Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. "
The important part is highlighted in bold. When applying select and filter statements it just gets added to a logical plan that gets only parsed by Spark when an action is applied. When parsing this full logical plan, the Catalyst Optimizer looks at the whole plan and one of the optimization rules is to push down filters, which is what you see in your example.
I think this is a great feature. Even though you are not interested in seeing this particular field in your final Dataframe, it understands that you are not interested in some of the original data.
That is the main benefit of Spark SQL engine as opposed to RDDs. It understands what you are trying to do without being told how to do it.
I'm trying to join two large Spark dataframes using Scala and I can't get it to perform well. I really hope someone can help me.
I have the following two text files:
dfPerson.txt (PersonId: String, GroupId: String) 2 million rows (100MB)
dfWorld.txt (PersonId: String, GroupId: String, PersonCharacteristic: String) 30 billion rows (1TB)
First I parse the text files to parquet and partition on GroupId, which has 50 distinct values and a rest group.
val dfPerson = spark.read.csv("input/dfPerson.txt")
dfPerson.write.partitionBy("GroupId").parquet("output/dfPerson")
val dfWorld = spark.read.csv("input/dfWorld.txt")
dfWorld.write.partitionBy("GroupId").parquet("output/dfWorld")
Note: a GroupId can contain 1 PersonId up to 6 billion PersonIds, so since it is skewed it might not be the best partition column but it is all I could think of.
Next I read the parquet files and join them, I took the following approaches:
Approach 1: Basic spark join operation
val dfPerson = spark.read.parquet("output/dfPerson")
val dfWorld = spark.read.parquet("output/dfWorld")
dfWorld.as("w").join(
dfPerson.as("p"),
$"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
"right"
)
.drop($"w.GroupId")
.drop($"w.PersonId")
This however didn't perform well and shuffled over 1 TB of data.
Approach 2: Broadcast hash join
Since dfPerson might be small enough to hold in memory I thought this approach might solve my problem
val dfPerson = spark.read.parquet("output/dfPerson")
val dfWorld = spark.read.parquet("output/dfWorld")
dfWorld.as("w").join(
broadcast(dfPerson).as("p"),
$"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
"right"
)
.drop($"w.GroupId")
.drop($"w.PersonId")
This also didn't perform well and also shuffled over 1 TB of data which makes me believe the broadcast didn't work?
Approach 3: Bucket and sort the dataframe
I first try to bucket and sort the dataframes before writing to parquet and then join:
val dfPersonInput = spark.read.csv("input/dfPerson.txt")
dfPersonInput
.write
.format("parquet")
.partitionBy("GroupId")
.bucketBy(4,"PersonId")
.sortBy("PersonId")
.mode("overwrite")
.option("path", "output/dfPerson")
.saveAsTable("dfPerson")
val dfPerson = spark.table("dfPerson")
val dfWorldInput = spark.read.csv("input/dfWorld.txt")
dfWorldInput
.write
.format("parquet")
.partitionBy("GroupId")
.bucketBy(4,"PersonId")
.sortBy("PersonId")
.mode("overwrite")
.option("path", "output/dfWorld")
.saveAsTable("dfWorld")
val dfWorld = spark.table("dfWorld")
dfWorld.as("w").join(
dfPerson.as("p"),
$"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
"right"
)
.drop($"w.GroupId")
.drop($"w.PersonId")
With the following execution plan:
== Physical Plan ==
*(5) Project [PersonId#743]
+- SortMergeJoin [GroupId#73, PersonId#71], [GroupId#745, PersonId#743], RightOuter
:- *(2) Sort [GroupId#73 ASC NULLS FIRST, PersonId#71 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(GroupId#73, PersonId#71, 200)
: +- *(1) Project [PersonId#71, PersonCharacteristic#72, GroupId#73]
: +- *(1) Filter isnotnull(PersonId#71)
: +- *(1) FileScan parquet default.dfWorld[PersonId#71,PersonCharacteristic#72,GroupId#73] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/F:/Output/dfWorld..., PartitionCount: 52, PartitionFilters: [isnotnull(GroupId#73)], PushedFilters: [IsNotNull(PersonId)], ReadSchema: struct<PersonId:string,PersonCharacteristic:string>, SelectedBucketsCount: 4 out of 4
+- *(4) Sort [GroupId#745 ASC NULLS FIRST, PersonId#743 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(GroupId#745, PersonId#743, 200)
+- *(3) FileScan parquet default.dfPerson[PersonId#743,GroupId#745] Batched: true, Format: Parquet, Location: CatalogFileIndex[file:/F:/Output/dfPerson], PartitionCount: 45, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<PersonId:string,GroupId:string>, SelectedBucketsCount: 4 out of 4
Also this didn't perform well.
To conclude
All approaches take approximately 150-200 hours (based on the progress on stages and tasks in the spark jobs after 24 hours) and follow the following strategy:
DAG visualization
I guess there is something I'm missing with either the partitioning, bucketing, sorting parquet, or all of them.
Any help would be greatly appreciated.
What is the goal you're trying to achieve?
Why do you need to have it joined?
Join for a sake of join will take you nowhere, unless you have enough memory/disk space to collect 1TB x 100MB worth of data
Edited based on response
If you only need records related to persons that are presented in dfPerson then you don't need right/left join, inner join would be what you want.
Broadcast will only work if your DF is less than broadcast settings in your Spark (10 Mb by default), it's ignored otherwise.
dfPerson.as("p").join(
dfWorld.select(
$"GroupId", $"PersonId",
$"<feature1YouNeed>", $"<feature2YouNeed>"
).as("w"),
Seq("GroupId", "PersonId")
)
This should give you feature you're up to
NB: Replace < feature1YouNeed > and < feature2YouNeed > with actual column names.
As far as I understand, this is the most efficient way to calculate average in Spark: Spark : Average of values instead of sum in reduceByKey using Scala.
My question is: if I use the high-level dataset with a groupby followed by Spark functions' avg(), will I get the same RDD under the hood? Can I trust Catalyst or I should use the low-level RDD? I mean, will writing low-level code yield better results than a dataset?
Example code:
employees
.groupBy($"employee")
.agg(
avg($"salary").as("avg_salary")
)
Versus:
employees
.mapValues(employee => (employee.salary, 1)) // map entry with a count of 1
.reduceByKey {
case ((sumL, countL), (sumR, countR)) =>
(sumL + sumR, countL + countR)
}
.mapValues {
case (sum , count) => sum / count
}
I don't see it a black-and-white question. In general, if you have a RDD, especially if it's a PairRDD, and need a result in RDD, it would make sense to settle with reduceByKey. On the other hand, given a DataFrame I would recommend going with groupBy/agg(avg).
A couple of things to consider:
Built-in optimization
While reduceByKey is relatively efficient compared to functions like groupByKey, it does induce stage boundaries since the operation requires repartitioning the data by keys. Depending on the RDD's partitions, the number of tasks in the derived stage may end up to be too small to take advantage of the available cpu cores, potentially resulting in a performance bottleneck. Such performance issue could be addressed, for instance, by manually assigning numPartition in reduceByKey:
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
But the point is that, to fully optimize RDD operations, one might need to put in some manual tweaking effort. In contrary, most operations for DataFrames are automatically optimized by the built-in Catalyst query optimizer.
Memory usage efficiency
Perhaps the other more significant factor to be looked at is related to memory usage for large dataset. When a RDD needs to be distributed across nodes or written to disk, Spark will serialize every row of data into objects, subject to costly Garbage Collection overhead. On the other hand, with knowledge of a DataFrame's schema, Spark doesn't need to serialize the data into objects. The Tungsten execution engine can leverage off-heap memory to store data in binary format for transformations, resulting in more efficient use of memory.
In conclusion, while there may be more knobs for tweaking using low-level code, that does not necessarily result in more performant code due to inadequate optimization, additional cost for serialization, etc.
We can conclude this from the plan generated by Spark.
This is the plan for DataFrame syntax-
val employees = spark.createDataFrame(Seq(("E1",100.0), ("E2",200.0),("E3",300.0))).toDF("employee","salary")
employees
.groupBy($"employee")
.agg(
avg($"salary").as("avg_salary")
).explain(true)
Plan -
== Parsed Logical Plan ==
'Aggregate ['employee], [unresolvedalias('employee, None), avg('salary) AS avg_salary#11]
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
== Analyzed Logical Plan ==
employee: string, avg_salary: double
Aggregate [employee#4], [employee#4, avg(salary#5) AS avg_salary#11]
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
== Optimized Logical Plan ==
Aggregate [employee#4], [employee#4, avg(salary#5) AS avg_salary#11]
+- LocalRelation [employee#4, salary#5]
== Physical Plan ==
*(2) HashAggregate(keys=[employee#4], functions=[avg(salary#5)], output=[employee#4, avg_salary#11])
+- Exchange hashpartitioning(employee#4, 10)
+- *(1) HashAggregate(keys=[employee#4], functions=[partial_avg(salary#5)], output=[employee#4, sum#17, count#18L])
+- LocalTableScan [employee#4, salary#5]
As the plan suggests first "HashAggregate" happened with partial average then "exchange hashpartitioning" happened for full average. The conclusion is that catalyst optimized the DataFrame operation as if we programmed with "reduceByKey" syntax. So we needn't take the burden of writing low level code.
Here is how RDD code and plan looks like.
employees
.map(employee => ("key",(employee.getAs[Double]("salary"), 1))) // map entry with a count of 1
.rdd.reduceByKey {
case ((sumL, countL), (sumR, countR)) =>
(sumL + sumR, countL + countR)
}
.mapValues {
case (sum , count) => sum / count
}.toDF().explain(true)
Plan -
== Parsed Logical Plan ==
SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#30, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#31]
+- ExternalRDD [obj#29]
== Analyzed Logical Plan ==
_1: string, _2: double
SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#30, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#31]
+- ExternalRDD [obj#29]
== Optimized Logical Plan ==
SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#30, assertnotnull(input[0, scala.Tuple2, true])._2 AS _2#31]
+- ExternalRDD [obj#29]
== Physical Plan ==
*(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#30, assertnotnull(input[0, scala.Tuple2, true])._2 AS _2#31]
+- Scan[obj#29]
The plan is optimized and also involves serialization of data into objects which means extra pressure of memory.
Conclusion
I would use daraframe syntax for its simplicity and possibly better performance.
Debug at println("done") , Go to http://localhost:4040/stages/ ,You will get the result.
val spark = SparkSession
.builder()
.master("local[*]")
.appName("example")
.getOrCreate()
val employees = spark.createDataFrame(Seq(("employee1",1000),("employee2",2000),("employee3",1500))).toDF("employee","salary")
import spark.implicits._
import org.apache.spark.sql.functions._
// Spark functions
employees
.groupBy("employee")
.agg(
avg($"salary").as("avg_salary")
).show()
// your low-level code
println("done")
When computing statistics for a simple parallelized collection in Spark 2.3.0 I'm getting some strange results:
val df = spark.sparkContext.parallelize(Seq("y")).toDF("y")
df.queryExecution.stringWithStats
== Optimized Logical Plan
== Project [value#7 AS y#9], Statistics(sizeInBytes=8.0 EB, hints=none)
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7], Statistics(sizeInBytes=8.0 EB, hints=none)
+- ExternalRDD [obj#6], Statistics(sizeInBytes=8.0 EB, hints=none)
That is 8.0 Exabytes of data.
If I do the equivalent without parallelizing
== Optimized Logical Plan
== LocalRelation [x#3], Statistics(sizeInBytes=20.0 B, hints=none)
Clearly when the collection is getting serialized there is a side effect that the query planner can't accurately figure out the size. Am I missing something here?
I have a Dataset that i created from a RDD and try to join it with another Dataset which is created from my Phoenix Table:
val dfToJoin = sparkSession.createDataset(rddToJoin)
val tableDf = sparkSession
.read
.option("table", "table")
.option("zkURL", "localhost")
.format("org.apache.phoenix.spark")
.load()
val joinedDf = dfToJoin.join(tableDf, "columnToJoinOn")
When i execute it, it seems that the whole database table is loaded to do the join.
Is there a way to do such a join so that the filtering is done on the database instead of in spark?
Also: dfToJoin is smaller than the table, i do not know if this is important.
Edit: Basically i want to join my Phoenix table with an Dataset created through spark, without fetching the whole table into the executor.
Edit2: Here is the physical plan:
*Project [FEATURE#21, SEQUENCE_IDENTIFIER#22, TAX_NUMBER#23,
WINDOW_NUMBER#24, uniqueIdentifier#5, readLength#6]
+- *SortMergeJoin [FEATURE#21], [feature#4], Inner
:- *Sort [FEATURE#21 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(FEATURE#21, 200)
: +- *Filter isnotnull(FEATURE#21)
: +- *Scan PhoenixRelation(FEATURES,localhost,false)
[FEATURE#21,SEQUENCE_IDENTIFIER#22,TAX_NUMBER#23,WINDOW_NUMBER#24]
PushedFilters: [IsNotNull(FEATURE)], ReadSchema:
struct<FEATURE:int,SEQUENCE_IDENTIFIER:string,TAX_NUMBER:int,
WINDOW_NUMBER:int>
+- *Sort [feature#4 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(feature#4, 200)
+- *Filter isnotnull(feature#4)
+- *SerializeFromObject [assertnotnull(input[0, utils.CaseClasses$QueryFeature, true], top level Product input object).feature AS feature#4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, utils.CaseClasses$QueryFeature, true], top level Product input object).uniqueIdentifier, true) AS uniqueIdentifier#5, assertnotnull(input[0, utils.CaseClasses$QueryFeature, true], top level Product input object).readLength AS readLength#6]
+- Scan ExternalRDDScan[obj#3]
As you can see the equals-filter is not contained in the pushed-filters list, so it is obvious that no predicate pushdown is happening.
Spark will fetch the Phoenix table records to appropriate executors(not the entire table to one executor)
As the is no direct filter on Phoenix table df, we see only *Filter isnotnull(FEATURE#21) in physical plan.
As you are mentioning Phoenix table data is less when you apply filter on it. You push the filter to phoenix table on feature column by finding feature_ids in other dataset.
//This spread across workers - fully distributed
val dfToJoin = sparkSession.createDataset(rddToJoin)
//This sits in driver - not distributed
val list_of_feature_ids = dfToJoin.dropDuplicates("feature")
.select("feature")
.map(r => r.getString(0))
.collect
.toList
//This spread across workers - fully distributed
val tableDf = sparkSession
.read
.option("table", "table")
.option("zkURL", "localhost")
.format("org.apache.phoenix.spark")
.load()
.filter($"FEATURE".isin(list_of_feature_ids:_*)) //added filter
//This spread across workers - fully distributed
val joinedDf = dfToJoin.join(tableDf, "columnToJoinOn")
joinedDf.explain()