Apache Spark - Does dataset.dropDuplicates() preserve partitioning? - scala

I know that there exist several transformations which preserve parent partitioning (if it was set before - e.g. mapValues) and some which do not preserve it (e.g. map).
I use Dataset API of Spark 2.2. My question is - does dropDuplicates transformation preserve partitioning? Imagine this code:
case class Item(one: Int, two: Int, three: Int)
import session.implicits._
val ds = session.createDataset(List(Item(1,2,3), Item(1,2,3)))
val repart = ds.repartition('one, 'two).cache()
repart.dropDuplicates(List("one", "two")) // will partitioning be preserved?

Generally, dropDuplicates does a shuffle (and thus does not preserve partitioning). In your special case, however, it does NOT do an additional shuffle, because you have already partitioned the dataset in a suitable form and the optimizer takes that into account:
repart.dropDuplicates(List("one","two")).explain()
== Physical Plan ==
*HashAggregate(keys=[one#3, two#4, three#5], functions=[])
+- *HashAggregate(keys=[one#3, two#4, three#5], functions=[])
+- InMemoryTableScan [one#3, two#4, three#5]
+- InMemoryRelation [one#3, two#4, three#5], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- Exchange hashpartitioning(one#3, two#4, 200)
+- LocalTableScan [one#3, two#4, three#5]
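The keyword to look for here is Exchange: the only one in this plan comes from the cached repartition, so dropDuplicates adds no shuffle of its own.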
But consider the following code where you first repartition the dataset using plain repartition():
val repart = ds.repartition(200).cache()
repart.dropDuplicates(List("one","two")).explain()
This will indeed trigger an additional shuffle (now you have two Exchange steps):
== Physical Plan ==
*HashAggregate(keys=[one#3, two#4], functions=[first(three#5, false)])
+- Exchange hashpartitioning(one#3, two#4, 200)
+- *HashAggregate(keys=[one#3, two#4], functions=[partial_first(three#5, false)])
+- InMemoryTableScan [one#3, two#4, three#5]
+- InMemoryRelation [one#3, two#4, three#5], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- Exchange RoundRobinPartitioning(200)
+- LocalTableScan [one#3, two#4, three#5]
NOTE: I checked this with Spark 2.1. It may be different in Spark 2.2, because the optimizer changed in that release (Cost-Based Optimizer).

No, dropDuplicates doesn't preserve partitions since it has a shuffle boundary, which doesn't guarantee order.
dropDuplicates is approximately:
ds.groupBy(columnId).agg(/* take the first value of each remaining column from any available partition */)
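The "group by the key, keep the first row" reading of dropDuplicates can be sketched without Spark, using plain Scala collections. This is only a model of the semantics, not Spark's implementation. The names DedupItem and DropDuplicatesSketch are hypothetical, and the key columns (one, two) are taken from the question's example.

```scala
// Plain-Scala model of dropDuplicates on a subset of columns (no Spark involved).
// DedupItem and DropDuplicatesSketch are illustrative names, not Spark APIs.
final case class DedupItem(one: Int, two: Int, three: Int)

object DropDuplicatesSketch {
  // Group rows by the dedup key and keep one row per group,
  // mirroring groupBy(key).agg(first(...)).
  def dropDuplicatesBy(items: Seq[DedupItem]): Seq[DedupItem] =
    items
      .groupBy(i => (i.one, i.two)) // all rows with the same key end up together
      .values
      .map(_.head)                  // keep a single row per key
      .toSeq
}
```

In this local model the kept row is the first one in input order. In Spark, which row survives depends on how rows arrive after the shuffle, so it should be treated as arbitrary.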

Related

Avoid shuffle when joining two tables with same partitions without bucketing strategy

I have two Hive tables A and B with:
same partitions (partition_1, partition_2)
an extra id field that is not sorted in partitions
When I join these two tables in PySpark, for example with:
df_A = spark.table("db.A")
df_B = spark.table("db.B")
df = df_A.join(df_B, how="inner", on=["partition_1", "partition_2", "id"])
I always end up with a shuffle:
+- == Initial Plan ==
Project (23)
+- SortMergeJoin Inner (22)
:- Sort (18)
: +- Exchange (17)
: +- Filter (16)
: +- Scan parquet db.A (15)
+- Sort (21)
+- Exchange (20)
+- Filter (19)
+- Scan parquet db.B (7)
I created two similar tables, but with a bucketing strategy this time:
df.write.partitionBy("partition_A", "partition_B").bucketBy(10, "id").saveAsTable(...)
And there is no more shuffle in the join:
+- == Initial Plan ==
Project (17)
+- SortMergeJoin Inner (16)
:- Sort (13)
: +- Filter (12)
: +- Scan parquet db.A (11)
+- Sort (15)
+- Filter (14)
+- Scan parquet db.B (5)
My questions are:
Can I avoid this shuffle in the join without having to re-create the tables with a bucketing strategy?
Does this shuffle operate on all the data? Or does it take into account that the partitions are the same and optimise the shuffle?
What I tried so far:
repartitioning on partitions (df.repartition("partition_A", "partition_B")) on both tables before joining
repartitioning on partitions and id field (df.repartition(numPartitions, "partition_A", "partition_B", "id"))
sorting data by id before joining
But the shuffle is still here.
I tried on both Databricks and EMR runtimes, with the same behaviour.
Thanks for your help

Aggregation after sort(), persist() and limit() in Spark

I'm trying to get the sum of a column of the top n rows in a persisted DataFrame. For some reason, the following doesn't work:
val df = df0.sort(col("colB").desc).persist()
df.limit(2).agg(sum("colB")).show()
It shows a random number which is clearly less than the sum of the top two. The number changes from run-to-run.
Calling show() on the limit()'ed DF does consistently show the correct top two values:
df.limit(2).show()
It is as if sort() doesn't apply before the aggregation. Is this a bug in Spark? I suppose it's kind of expected that persist() loses the sorting, but why does it work with show() and should this be documented somewhere?
See the query plans below. agg results in an exchange (4th line in physical plan) which removes the sorting, whereas show does not result in any exchange, so sorting is maintained.
scala> df.limit(2).agg(sum("colB")).explain()
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(colB#4 as bigint))])
+- *(2) HashAggregate(keys=[], functions=[partial_sum(cast(colB#4 as bigint))])
+- *(2) GlobalLimit 2
+- Exchange SinglePartition, true, [id=#95]
+- *(1) LocalLimit 2
+- *(1) ColumnarToRow
+- InMemoryTableScan [colB#4]
+- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Sort [colB#4 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(colB#4 DESC NULLS LAST, 200), true, [id=#7]
+- LocalTableScan [colB#4]
scala> df.limit(2).explain()
== Physical Plan ==
CollectLimit 2
+- *(1) ColumnarToRow
+- InMemoryTableScan [colB#4]
+- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Sort [colB#4 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(colB#4 DESC NULLS LAST, 200), true, [id=#7]
+- LocalTableScan [colB#4]
But if you persist the limited dataframe, there won't be any exchange for the aggregation, so that might do the trick:
val df1 = df.limit(2).persist()
scala> df1.agg(sum("colB")).explain()
== Physical Plan ==
*(1) HashAggregate(keys=[], functions=[sum(cast(colB#4 as bigint))])
+- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(colB#4 as bigint))])
+- *(1) ColumnarToRow
+- InMemoryTableScan [colB#4]
+- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
+- CollectLimit 2
+- *(1) ColumnarToRow
+- InMemoryTableScan [colB#4]
+- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Sort [colB#4 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(colB#4 DESC NULLS LAST, 200), true, [id=#7]
+- LocalTableScan [colB#4]
In any case, it's best to use window functions to assign row numbers and sum the rows if their row number meets a certain condition (e.g. row_number <= 2). This will result in a deterministic outcome. For example,
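import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, sum}
// Number the rows by descending colB, keep the top two, then sum them.
df0.withColumn(
"rn",
row_number().over(Window.orderBy(col("colB").desc))
).filter("rn <= 2").agg(sum("colB"))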
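To see why the limit-then-aggregate plan above can return a different sum on each run, here is a simplified plain-Scala model of the LocalLimit, Exchange SinglePartition, GlobalLimit sequence. It is a sketch under assumptions: the partition contents are made up, and the arrivalOrder parameter is a hypothetical stand-in for the order in which the single downstream task happens to read the shuffled blocks.

```scala
// Simplified model of LocalLimit -> Exchange SinglePartition -> GlobalLimit.
// LimitSketch, partitions and arrivalOrder are illustrative, not Spark APIs.
object LimitSketch {
  // Each inner Seq models one partition of the range-partitioned, sorted, cached data
  // (descending sort, so partition 0 holds the largest values).
  val partitions: Seq[Seq[Long]] = Seq(Seq(9L, 8L), Seq(6L, 5L), Seq(3L, 1L))

  // Take n rows per partition, concatenate them in the given arrival order,
  // then take the first n rows overall and sum them.
  def limitThenSum(n: Int, arrivalOrder: Seq[Int]): Long = {
    val localLimited = partitions.map(_.take(n))               // LocalLimit n
    val exchanged = arrivalOrder.flatMap(i => localLimited(i)) // Exchange SinglePartition
    exchanged.take(n).sum                                      // GlobalLimit n + sum
  }
}
```

With arrival order Seq(0, 1, 2) the model sums the true top two rows. With Seq(2, 0, 1) it sums the two smallest ones, which matches the kind of run-to-run variation described in the question.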

Scala spark: Sum all columns across all rows

I can do this quite easily with
df.groupBy().sum()
But I'm not sure whether the groupBy() adds a performance cost or is just bad style. I've seen it done with
df.agg( ("col1", "sum"), ("col2", "sum"), ("col3", "sum"))
This skips the groupBy (which I think is unnecessary) but has its own ugliness. What's the correct way to do this? Is there any under-the-hood difference between using .groupBy().<aggOp>() and using .agg?
If you check the physical plans for both queries, Spark produces the same plan internally, so you can use either of them.
I think df.groupBy().sum() is handier, as you don't need to specify all the column names.
Example:
val df=Seq((1,2,3),(4,5,6)).toDF("id","j","k")
scala> df.groupBy().sum().explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(id#7 as bigint)), sum(cast(j#8 as bigint)), sum(cast(k#9 as bigint))])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(id#7 as bigint)), partial_sum(cast(j#8 as bigint)), partial_sum(cast(k#9 as bigint))])
+- LocalTableScan [id#7, j#8, k#9]
scala> df.agg(sum("id"),sum("j"),sum("k")).explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(id#7 as bigint)), sum(cast(j#8 as bigint)), sum(cast(k#9 as bigint))])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(id#7 as bigint)), partial_sum(cast(j#8 as bigint)), partial_sum(cast(k#9 as bigint))])
+- LocalTableScan [id#7, j#8, k#9]

Does avg() on a dataset produce the most efficient RDD?

As far as I understand, this is the most efficient way to calculate average in Spark: Spark : Average of values instead of sum in reduceByKey using Scala.
My question is: if I use the high-level Dataset API with a groupBy followed by Spark's avg() function, will I get the same RDD under the hood? Can I trust Catalyst, or should I use the low-level RDD API? I mean, will writing low-level code yield better results than using a Dataset?
Example code:
employees
.groupBy($"employee")
.agg(
avg($"salary").as("avg_salary")
)
Versus:
employees
.mapValues(employee => (employee.salary, 1)) // map entry with a count of 1
.reduceByKey {
case ((sumL, countL), (sumR, countR)) =>
(sumL + sumR, countL + countR)
}
.mapValues {
case (sum , count) => sum / count
}
I don't see it as a black-and-white question. In general, if you have an RDD, especially a PairRDD, and need the result as an RDD, it makes sense to settle on reduceByKey. On the other hand, given a DataFrame I would recommend going with groupBy/agg(avg).
A couple of things to consider:
Built-in optimization
While reduceByKey is relatively efficient compared to functions like groupByKey, it does induce a stage boundary, since the operation requires repartitioning the data by key. Depending on the RDD's partitions, the number of tasks in the derived stage may end up too small to take advantage of the available CPU cores, potentially resulting in a performance bottleneck. Such a performance issue can be addressed, for instance, by manually setting numPartitions in reduceByKey:
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
But the point is that, to fully optimize RDD operations, one might need to put in some manual tweaking effort. In contrast, most operations on DataFrames are automatically optimized by the built-in Catalyst query optimizer.
Memory usage efficiency
Perhaps the more significant factor is memory usage for large datasets. When an RDD needs to be distributed across nodes or written to disk, Spark will serialize every row of data into objects, subject to costly garbage collection overhead. On the other hand, with knowledge of a DataFrame's schema, Spark doesn't need to serialize the data into objects. The Tungsten execution engine can leverage off-heap memory to store data in binary format for transformations, resulting in more efficient use of memory.
In conclusion, while there may be more knobs for tweaking using low-level code, that does not necessarily result in more performant code due to inadequate optimization, additional cost for serialization, etc.
We can confirm this from the plans generated by Spark.
This is the plan for the DataFrame syntax:
val employees = spark.createDataFrame(Seq(("E1",100.0), ("E2",200.0),("E3",300.0))).toDF("employee","salary")
employees
.groupBy($"employee")
.agg(
avg($"salary").as("avg_salary")
).explain(true)
Plan -
== Parsed Logical Plan ==
'Aggregate ['employee], [unresolvedalias('employee, None), avg('salary) AS avg_salary#11]
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
== Analyzed Logical Plan ==
employee: string, avg_salary: double
Aggregate [employee#4], [employee#4, avg(salary#5) AS avg_salary#11]
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
== Optimized Logical Plan ==
Aggregate [employee#4], [employee#4, avg(salary#5) AS avg_salary#11]
+- LocalRelation [employee#4, salary#5]
== Physical Plan ==
*(2) HashAggregate(keys=[employee#4], functions=[avg(salary#5)], output=[employee#4, avg_salary#11])
+- Exchange hashpartitioning(employee#4, 10)
+- *(1) HashAggregate(keys=[employee#4], functions=[partial_avg(salary#5)], output=[employee#4, sum#17, count#18L])
+- LocalTableScan [employee#4, salary#5]
As the plan shows, a first HashAggregate computes a partial average (a sum and a count) within each partition, then an Exchange hashpartitioning shuffles those partial results by key, and a final HashAggregate computes the full average. The conclusion is that Catalyst optimized the DataFrame operation as if we had written it with reduceByKey, so we needn't take on the burden of writing low-level code.
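The two-step aggregation in that plan can be modelled with plain Scala collections. This is a minimal sketch of the idea, not Spark's code: PartialAvgSketch, partialAvg and finalAvg are hypothetical names, and each inner Seq stands in for one partition.

```scala
// Plain-Scala model of partial_avg followed by the final avg (no Spark involved).
object PartialAvgSketch {
  // (sum, count) is the partial aggregation buffer, like sum#17 and count#18L in the plan.
  type Buffer = (Double, Long)

  // Map-side step: build one (sum, count) buffer per key inside a single partition.
  def partialAvg(partition: Seq[(String, Double)]): Map[String, Buffer] =
    partition.groupBy(_._1).map { case (key, rows) =>
      (key, (rows.map(_._2).sum, rows.size.toLong))
    }

  // Reduce-side step: after the exchange, merge the buffers per key, then divide.
  def finalAvg(partials: Seq[Map[String, Buffer]]): Map[String, Double] =
    partials.flatten.groupBy(_._1).map { case (key, buffers) =>
      val (total, count) = buffers.map(_._2).foldLeft((0.0, 0L)) {
        case ((sumL, countL), (sumR, countR)) => (sumL + sumR, countL + countR)
      }
      (key, total / count)
    }
}
```

The merge function inside foldLeft is the same (sum, count) combiner the question passes to reduceByKey, which is why the two approaches end up moving the same kind of partial results across the shuffle.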
Here is what the RDD code and its plan look like.
employees
.map(employee => ("key",(employee.getAs[Double]("salary"), 1))) // map entry with a count of 1
.rdd.reduceByKey {
case ((sumL, countL), (sumR, countR)) =>
(sumL + sumR, countL + countR)
}
.mapValues {
case (sum , count) => sum / count
}.toDF().explain(true)
Plan -
== Parsed Logical Plan ==
SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#30, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#31]
+- ExternalRDD [obj#29]
== Analyzed Logical Plan ==
_1: string, _2: double
SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#30, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#31]
+- ExternalRDD [obj#29]
== Optimized Logical Plan ==
SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#30, assertnotnull(input[0, scala.Tuple2, true])._2 AS _2#31]
+- ExternalRDD [obj#29]
== Physical Plan ==
*(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#30, assertnotnull(input[0, scala.Tuple2, true])._2 AS _2#31]
+- Scan[obj#29]
Catalyst cannot optimize the RDD part: the plan shows only a scan of the resulting RDD followed by SerializeFromObject. It also involves serializing the data into objects, which means extra memory pressure.
Conclusion
I would use the DataFrame syntax for its simplicity and possibly better performance.
Set a breakpoint at println("done"), then go to http://localhost:4040/stages/ and you will see the result.
val spark = SparkSession
.builder()
.master("local[*]")
.appName("example")
.getOrCreate()
val employees = spark.createDataFrame(Seq(("employee1",1000),("employee2",2000),("employee3",1500))).toDF("employee","salary")
import spark.implicits._
import org.apache.spark.sql.functions._
// Spark functions
employees
.groupBy("employee")
.agg(
avg($"salary").as("avg_salary")
).show()
// your low-level code
println("done")

Ensuring narrow dependency in Spark job when grouping on pre-partitioned data

I have a huge Spark Dataset with columns A, B, C, D, E. Question is if I initially repartition on column A, and subsequently do two 'within-partition' groupBy operations:
groupBy("A", "C")....map(....).groupBy("A", "E")....map(....)
is Spark 2.0 clever enough to bypass shuffling, since both groupBy operations are 'within-partition' with respect to the parent stage - i.e. column A is included in both groupBy column specs? If not, what can I do to ensure a narrow dependency throughout the chain of operations?
Spark does indeed support optimizations like this. You can check that by analyzing the execution plan:
val df = Seq(("a", 1, 2)).toDF("a", "b", "c")
df.groupBy("a").max().groupBy("a", "max(b)").sum().explain
== Physical Plan ==
*HashAggregate(keys=[a#42, max(b)#92], functions=[sum(cast(max(b)#92 as bigint)), sum(cast(max(c)#93 as bigint))])
+- *HashAggregate(keys=[a#42, max(b)#92], functions=[partial_sum(cast(max(b)#92 as bigint)), partial_sum(cast(max(c)#93 as bigint))])
+- *HashAggregate(keys=[a#42], functions=[max(b#43), max(c#44)])
+- Exchange hashpartitioning(a#42, 200)
+- *HashAggregate(keys=[a#42], functions=[partial_max(b#43), partial_max(c#44)])
+- LocalTableScan [a#42, b#43, c#44]
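As you can see, there is only one Exchange even though there are two aggregations: the partial and final HashAggregate steps of the second aggregation run without a further shuffle.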
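The reason no second shuffle is needed can be illustrated with plain Scala collections: once rows are hash-partitioned by column a, every group keyed by (a, c) lives entirely inside one partition. The sketch below is only a model. Rec and NarrowGroupBySketch are hypothetical names, and String.hashCode is used as a stand-in for Spark's own hash function.

```scala
// Plain-Scala model of grouping on pre-partitioned data (no Spark involved).
final case class Rec(a: String, c: Int, value: Long)

object NarrowGroupBySketch {
  // Hash-partition rows by column a only (assumption: hashCode instead of Spark's hash).
  def partitionByA(rows: Seq[Rec], numPartitions: Int): Seq[Seq[Rec]] = {
    val byPartition = rows.groupBy(r => java.lang.Math.floorMod(r.a.hashCode, numPartitions))
    (0 until numPartitions).map(p => byPartition.getOrElse(p, Seq.empty))
  }

  // Reference result: sum value per (a, c) over a whole collection of rows.
  def sumByAC(rows: Seq[Rec]): Map[(String, Int), Long] =
    rows.groupBy(r => (r.a, r.c)).map { case (key, rs) => (key, rs.map(_.value).sum) }

  // Aggregate inside each partition independently; no rows move between partitions.
  def sumWithinPartitions(parts: Seq[Seq[Rec]]): Map[(String, Int), Long] =
    parts.flatMap(p => sumByAC(p)).toMap
}
```

Because all rows sharing a value of a sit in the same partition, the per-partition results never collide on a key, and combining them gives the same answer as grouping the whole dataset at once.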