Aggregation after sort(), persist() and limit() in Spark

Aggregation after sort(), persist() and limit() in Spark - scala

I'm trying to get the sum of a column of the top n rows in a persisted DataFrame. For some reason, the following doesn't work:
val df = df0.sort(col("colB").desc).persist()
df.limit(2).agg(sum("colB")).show()
It shows a random number which is clearly less than the sum of the top two. The number changes from run-to-run.
Calling show() on the limit()'ed DF does consistently show the correct top two values:
df.limit(2).show()
It is as if sort() doesn't apply before the aggregation. Is this a bug in Spark? I suppose it's kind of expected that persist() loses the sorting, but why does it work with show() and should this be documented somewhere?

See the query plans below. agg results in an exchange (4th line in physical plan) which removes the sorting, whereas show does not result in any exchange, so sorting is maintained.
scala> df.limit(2).agg(sum("colB")).explain()
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(colB#4 as bigint))])
+- *(2) HashAggregate(keys=[], functions=[partial_sum(cast(colB#4 as bigint))])
+- *(2) GlobalLimit 2
+- Exchange SinglePartition, true, [id=#95]
+- *(1) LocalLimit 2
+- *(1) ColumnarToRow
+- InMemoryTableScan [colB#4]
+- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Sort [colB#4 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(colB#4 DESC NULLS LAST, 200), true, [id=#7]
+- LocalTableScan [colB#4]
scala> df.limit(2).explain()
== Physical Plan ==
CollectLimit 2
+- *(1) ColumnarToRow
+- InMemoryTableScan [colB#4]
+- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Sort [colB#4 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(colB#4 DESC NULLS LAST, 200), true, [id=#7]
+- LocalTableScan [colB#4]
But if you persist the limited dataframe, there won't be any exchange for the aggregation, so that might do the trick:
val df1 = df.limit(2).persist()
scala> df1.agg(sum("colB")).explain()
== Physical Plan ==
*(1) HashAggregate(keys=[], functions=[sum(cast(colB#4 as bigint))])
+- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(colB#4 as bigint))])
+- *(1) ColumnarToRow
+- InMemoryTableScan [colB#4]
+- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
+- CollectLimit 2
+- *(1) ColumnarToRow
+- InMemoryTableScan [colB#4]
+- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Sort [colB#4 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(colB#4 DESC NULLS LAST, 200), true, [id=#7]
+- LocalTableScan [colB#4]
In any case, it's best to use window functions to assign row numbers and sum the rows if their row number meets a certain condition (e.g. row_number <= 2). This will result in a deterministic outcome. For example,
df0.withColumn(
"rn",
row_number().over(Window.orderBy($"colB".desc))
).filter("rn <= 2").agg(sum("colB"))

Related

Pyspark Update, Insert records on LargeData parquet file

I have 70M+ Records(116MB) in my Data example columns
ID, TransactionDate, CreationDate
Here ID is primary Key column. I need to Update my data with New upcoming Parquet files data which is of size <50MB.
Sample Input
ID col1 col2
1 2021-01-01 2020-08-21
2 2021-02-02 2020-08-21
New Data
ID col1 col2
1 2021-02-01 2020-08-21
3 2021-02-02 2020-08-21
Output Rows of Data
1 2021-02-01 2020-08-21 (Updated)
3 2021-02-02 2020-08-21 (Inserted)
2 2021-02-02 2020-08-21 (Remains Same)
I have tried with Various approaches But none of them giving proper results with Less Shuffle Read & Write and Execution Time.
Few of my approaches.
Inner Join(Update Records), Left-Anti(Insert Records), Left-Anti(Remains Same records ) Joins
Taking 10Minutes to execute with 9.5GB Shuffle Read and 9.5 GB shuffle right.
I tried with some partitionBy on creationDate approach but unable to get how to read New data with appropriate partition.
Help me with better approach that takes less time. With less shuffle read and write in Pyspark
Thanks in Advance.

You cannot avoid some shuffle, but at least you can limit it by doing only one full outer join instead of one inner join and two anti joins.
You first add a new column updated to your new dataframe, to determine if joined row is updated or not, then you perform your full outer join, and finally you select value for each column from new or old data according to updated column. Code as follow, with old_data dataframe as current data and new_data dataframe as updated data:
from pyspark.sql import functions as F
join_columns = ['ID']
final_data = new_data \
.withColumn('updated', F.lit(True)) \
.join(old_data, join_columns, 'full_outer') \
.select(
[F.col(c) for c in join_columns] +
[F.when(F.col('updated'), new_data[c]).otherwise(old_data[c]).alias(c) for c in old_data.columns if c not in join_columns]
)
If you look at execution plan using .explain() method on final_data dataframe, you can see that you have only two shuffles (the Exchange step), one per joined dataframe:
== Physical Plan ==
*(5) Project [coalesce(ID#6L, ID#0L) AS ID#17L, CASE WHEN exists#12 THEN col1#7 ELSE col1#1 END AS col1#24, CASE WHEN exists#12 THEN col2#8 ELSE col2#2 END AS col2#25]
+- SortMergeJoin [ID#6L], [ID#0L], FullOuter
:- *(2) Sort [ID#6L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#27]
: +- *(1) Project [ID#6L, col1#7, col2#8, true AS exists#12]
: +- *(1) Scan ExistingRDD[ID#6L,col1#7,col2#8]
+- *(4) Sort [ID#0L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#32]
+- *(3) Scan ExistingRDD[ID#0L,col1#1,col2#2]
If you look at your one inner join and two anti join execution plan, you get six shuffles:
== Physical Plan ==
Union
:- *(5) Project [ID#0L, col1#7, col2#8]
: +- *(5) SortMergeJoin [ID#0L], [ID#6L], Inner
: :- *(2) Sort [ID#0L ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#177]
: : +- *(1) Project [ID#0L]
: : +- *(1) Filter isnotnull(ID#0L)
: : +- *(1) Scan ExistingRDD[ID#0L,col1#1,col2#2]
: +- *(4) Sort [ID#6L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#183]
: +- *(3) Filter isnotnull(ID#6L)
: +- *(3) Scan ExistingRDD[ID#6L,col1#7,col2#8]
:- SortMergeJoin [ID#0L], [ID#6L], LeftAnti
: :- *(7) Sort [ID#0L ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#192]
: : +- *(6) Scan ExistingRDD[ID#0L,col1#1,col2#2]
: +- *(9) Sort [ID#6L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#197]
: +- *(8) Project [ID#6L]
: +- *(8) Filter isnotnull(ID#6L)
: +- *(8) Scan ExistingRDD[ID#6L,col1#7,col2#8]
+- SortMergeJoin [ID#6L], [ID#0L], LeftAnti
:- *(11) Sort [ID#6L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#203]
: +- *(10) Scan ExistingRDD[ID#6L,col1#7,col2#8]
+- *(13) Sort [ID#0L ASC NULLS FIRST], false, 0
+- ReusedExchange [ID#0L], Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#177]

Scala spark: Sum all columns across all rows

I can do this quite easily with
df.groupBy().sum()
But I'm not sure if the groupBy() doesn't add additional performance impacts, or is just bad style. I've seen it done with
df.agg( ("col1", "sum"), ("col2", "sum"), ("col3", "sum"))
Which skips the (I think unnecessary groupBy), but has its own uglyness. What's the correct way to do this? Is there any under-the-hood difference between using .groupBy().<aggOp>() and using .agg?

If you check the Physical plan for both queries spark internally calls same plan so we can use either of them!
I think using df.groupBy().sum() will be handy as we don't need to specify all column names.
Example:
val df=Seq((1,2,3),(4,5,6)).toDF("id","j","k")
scala> df.groupBy().sum().explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(id#7 as bigint)), sum(cast(j#8 as bigint)), sum(cast(k#9 as bigint))])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(id#7 as bigint)), partial_sum(cast(j#8 as bigint)), partial_sum(cast(k#9 as bigint))])
+- LocalTableScan [id#7, j#8, k#9]
scala> df.agg(sum("id"),sum("j"),sum("k")).explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(id#7 as bigint)), sum(cast(j#8 as bigint)), sum(cast(k#9 as bigint))])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(id#7 as bigint)), partial_sum(cast(j#8 as bigint)), partial_sum(cast(k#9 as bigint))])
+- LocalTableScan [id#7, j#8, k#9]

Sorted hive tables sorting again after union in spark

I start spark-shell with spark 2.3.1 with these params:
--master='local[*]'
--executor-memory=6400M
--driver-memory=60G
--conf spark.sql.autoBroadcastJoinThreshold=209715200
--conf spark.sql.shuffle.partitions=1000
--conf spark.local.dir=/data/spark-temp
--conf spark.driver.extraJavaOptions='-Dderby.system.home=/data/spark-catalog/'
Then create two hive tables with sort and buckets
First table name - table1
Second table name - table2
val storagePath = "path_to_orc"
val storage = spark.read.orc(storagePath)
val tableName = "table1"
sql(s"DROP TABLE IF EXISTS $tableName")
storage.select($"group", $"id").write.bucketBy(bucketsCount, "id").sortBy("id").saveAsTable(tableName)
(the same code for table2)
I expected that when i join any of this tables with another df, there is not unnecessary Exchange step in query plan
Then i turn off broadcast to use SortMergeJoin
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1)
I take some df
val sample = spark.read.option("header", "true).option("delimiter", "\t").csv("path_to_tsv")
val m = spark.table("table1")
sample.select($"col" as "id").join(m, Seq("id")).explain()
== Physical Plan ==
*(4) Project [id#24, group#0]
+- *(4) SortMergeJoin [id#24], [id#1], Inner
:- *(2) Sort [id#24 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#24, 1000)
: +- *(1) Project [col#21 AS id#24]
: +- *(1) Filter isnotnull(col#21)
: +- *(1) FileScan csv [col#21] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/samples/sample-20K], PartitionFilters: [], PushedFilters: [IsNotNull(col)], ReadSchema: struct<col:string>
+- *(3) Project [group#0, id#1]
+- *(3) Filter isnotnull(id#1)
+- *(3) FileScan parquet default.table1[group#0,id#1] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/data/table1], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<group:string,id:string>
But when i use union for two tables before join
val m2 = spark.table("table2")
val mUnion = m union m2
sample.select($"col" as "id").join(mUnion, Seq("id")).explain()
== Physical Plan ==
*(6) Project [id#33, group#0]
+- *(6) SortMergeJoin [id#33], [id#1], Inner
:- *(2) Sort [id#33 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#33, 1000)
: +- *(1) Project [col#21 AS id#33]
: +- *(1) Filter isnotnull(col#21)
: +- *(1) FileScan csv [col#21] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/samples/sample-20K], PartitionFilters: [], PushedFilters: [IsNotNull(col)], ReadSchema: struct<col:string>
+- *(5) Sort [id#1 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#1, 1000)
+- Union
:- *(3) Project [group#0, id#1]
: +- *(3) Filter isnotnull(id#1)
: +- *(3) FileScan parquet default.membership_g043_append[group#0,id#1] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/data/table1], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<group:string,id:string>
+- *(4) Project [group#4, id#5]
+- *(4) Filter isnotnull(id#5)
+- *(4) FileScan parquet default.membership_g042[group#4,id#5] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/data/table2], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<group:string,id:string>
In this case appeared sort and partition (step 5)
How to union two hive tables without sorting and exchanging

As far as I know, spark does not consider sorting when joining but only partitions. So in order to get efficient joins, you must partition by the same column. This is because sorting does not guarantee that records with same key end up in the same partition. Spark has to make sure all keys with same values are shuffled to the same partition and on the same executor from multiple dataframes.

Spark count vs take and length

I'm using com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 when run zeppelin notebooks and don't understand the difference between two operations in spark. One operation takes a lot of time for computation, the second one executes immediately. Could someone explain to me the differences between two operations:
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._
case class SomeClass(val someField:String)
val timelineItems = spark.read.format("org.apache.spark.sql.cassandra").options(scala.collection.immutable.Map("spark.cassandra.connection.host" -> "127.0.0.1", "table" -> "timeline_items", "keyspace" -> "timeline" )).load()
//some simplified code:
val timelineRow = timelineItems
.map(x => {SomeClass("test")})
.filter(x => x != null)
.toDF()
.limit(4)
//first operation (takes a lot of time. It seems spark iterates through all items in Cassandra and doesn't use laziness with limit 4)
println(timelineRow.count()) //return: 4
//second operation (executes immediately); 300 - just random number which doesn't affect the result
println(timelineRow.take(300).length) //return: 4

What is you see is a difference between implementation of Limit (an transformation-like operation) and CollectLimit (an action-like operation). However the difference in timings is highly misleading, and not something you can expect in general case.
First let's create a MCVE
spark.conf.set("spark.sql.files.maxPartitionBytes", 500)
val ds = spark.read
.text("README.md")
.as[String]
.map{ x => {
Thread.sleep(1000)
x
}}
val dsLimit4 = ds.limit(4)
make sure we start with clean slate:
spark.sparkContext.statusTracker.getJobIdsForGroup(null).isEmpty
Boolean = true
invoke count:
dsLimit4.count()
and take a look at the execution plan (from Spark UI):
== Parsed Logical Plan ==
Aggregate [count(1) AS count#12L]
+- GlobalLimit 4
+- LocalLimit 4
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
+- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
+- DeserializeToObject cast(value#0 as string).toString, obj#5: java.lang.String
+- Relation[value#0] text
== Analyzed Logical Plan ==
count: bigint
Aggregate [count(1) AS count#12L]
+- GlobalLimit 4
+- LocalLimit 4
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
+- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
+- DeserializeToObject cast(value#0 as string).toString, obj#5: java.lang.String
+- Relation[value#0] text
== Optimized Logical Plan ==
Aggregate [count(1) AS count#12L]
+- GlobalLimit 4
+- LocalLimit 4
+- Project
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
+- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
+- DeserializeToObject value#0.toString, obj#5: java.lang.String
+- Relation[value#0] text
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#12L])
+- *(2) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#15L])
+- *(2) GlobalLimit 4
+- Exchange SinglePartition
+- *(1) LocalLimit 4
+- *(1) Project
+- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
+- *(1) MapElements <function1>, obj#6: java.lang.String
+- *(1) DeserializeToObject value#0.toString, obj#5: java.lang.String
+- *(1) FileScan text [value#0] Batched: false, Format: Text, Location: InMemoryFileIndex[file:/path/to/README.md], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
The core component is
+- *(2) GlobalLimit 4
+- Exchange SinglePartition
+- *(1) LocalLimit 4
which indicates that we can expect a wide operation with multiple stages. We can see a single job
spark.sparkContext.statusTracker.getJobIdsForGroup(null)
Array[Int] = Array(0)
with two stages
spark.sparkContext.statusTracker.getJobInfo(0).get.stageIds
Array[Int] = Array(0, 1)
with eight
spark.sparkContext.statusTracker.getStageInfo(0).get.numTasks
Int = 8
and one
spark.sparkContext.statusTracker.getStageInfo(1).get.numTasks
Int = 1
task respectively.
Now let's compare it to
dsLimit4.take(300).size
which generates following
== Parsed Logical Plan ==
GlobalLimit 300
+- LocalLimit 300
+- GlobalLimit 4
+- LocalLimit 4
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
+- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
+- DeserializeToObject cast(value#0 as string).toString, obj#5: java.lang.String
+- Relation[value#0] text
== Analyzed Logical Plan ==
value: string
GlobalLimit 300
+- LocalLimit 300
+- GlobalLimit 4
+- LocalLimit 4
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
+- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
+- DeserializeToObject cast(value#0 as string).toString, obj#5: java.lang.String
+- Relation[value#0] text
== Optimized Logical Plan ==
GlobalLimit 4
+- LocalLimit 4
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
+- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#6: java.lang.String
+- DeserializeToObject value#0.toString, obj#5: java.lang.String
+- Relation[value#0] text
== Physical Plan ==
CollectLimit 4
+- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true, false) AS value#7]
+- *(1) MapElements <function1>, obj#6: java.lang.String
+- *(1) DeserializeToObject value#0.toString, obj#5: java.lang.String
+- *(1) FileScan text [value#0] Batched: false, Format: Text, Location: InMemoryFileIndex[file:/path/to/README.md], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<value:string>
While both global and local limits still occur, there is no exchange in the middle. Therefore we can expect a single stage operation. Please note that planner narrowed down limit to more restrictive value.
As expected we see a single new job:
spark.sparkContext.statusTracker.getJobIdsForGroup(null)
Array[Int] = Array(1, 0)
which generated only one stage:
spark.sparkContext.statusTracker.getJobInfo(1).get.stageIds
Array[Int] = Array(2)
with only one task
spark.sparkContext.statusTracker.getStageInfo(2).get.numTasks
Int = 1
What does it mean for us?
In the count case Spark used wide transformation and actually applies LocalLimit on each partition and shuffles partial results to perform GlobalLimit.
In the take case Spark used narrow transformation and evaluated LocalLimit only on the first partition.
Obviously the latter approach won't work with number of values in the first partition is lower than the requested limit.
val dsLimit105 = ds.limit(105) // There are 105 lines
In such case the first count will use exactly the same logic as before (I encourage you to confirm that empirically), but take will take rather different path. So far we triggered only two jobs:
spark.sparkContext.statusTracker.getJobIdsForGroup(null)
Array[Int] = Array(1, 0)
Now if we execute
dsLimit105.take(300).size
you'll see that it required 3 more jobs:
spark.sparkContext.statusTracker.getJobIdsForGroup(null)
Array[Int] = Array(4, 3, 2, 1, 0)
So what's going on here? As noted before evaluating a single partition is not enough to satisfy limit in general case. In such case Spark iteratively evaluates LocalLimit on partitions, until GlobalLimit is satisfied, increasing number of partitions taken in each iteration.
Such strategy can have significant performance implications. Starting Spark jobs alone is not cheap and in cases, when upstream object is a result of wide transformation things can get quite ugly (in the best case scenario you can read shuffle files, but if these are lost for some reason, Spark might be forced to re-execute all the dependencies).
To summarize:
take is an action, and can short circuit in specific cases where upstream process is narrow, and LocalLimits can be satisfy GlobalLimits using the first few partitions.
limit is a transformation, and always evaluates all LocalLimits, as there is no iterative escape hatch.
While one can behave better than the other in specific cases, there not exchangeable and neither guarantees better performance in general.

Apache Spark - Does dataset.dropDuplicates() preserve partitioning?

I know that there exist several transformations which preserve parent partitioning (if it was set before - e.g. mapValues) and some which do not preserve it (e.g. map).
I use Dataset API of Spark 2.2. My question is - does dropDuplicates transformation preserve partitioning? Imagine this code:
case class Item(one: Int, two: Int, three: Int)
import session.implicits._
val ds = session.createDataset(List(Item(1,2,3), Item(1,2,3)))
val repart = ds.repartition('one, 'two).cache()
repart.dropDuplicates(List("one", "two")) // will be partitioning preserved?

generally, dropDuplicates does a shuffle (and thus not preserve partitioning), but in your special case it does NOT do an additional shuffle because you have already partitioned the dataset in a suitable form which is taken into account by the optimizer:
repart.dropDuplicates(List("one","two")).explain()
== Physical Plan ==
*HashAggregate(keys=[one#3, two#4, three#5], functions=[])
+- *HashAggregate(keys=[one#3, two#4, three#5], functions=[])
+- InMemoryTableScan [one#3, two#4, three#5]
+- InMemoryRelation [one#3, two#4, three#5], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- Exchange hashpartitioning(one#3, two#4, 200)
+- LocalTableScan [one#3, two#4, three#5]
the keyword to look for here is : Exchange
But consider the following code where you first repartition the dataset using plain repartition():
val repart = ds.repartition(200).cache()
repart.dropDuplicates(List("one","two")).explain()
This will indeed trigger an additional shuffle ( now you have 2 Exchange steps):
== Physical Plan ==
*HashAggregate(keys=[one#3, two#4], functions=[first(three#5, false)])
+- Exchange hashpartitioning(one#3, two#4, 200)
+- *HashAggregate(keys=[one#3, two#4], functions=[partial_first(three#5, false)])
+- InMemoryTableScan [one#3, two#4, three#5]
+- InMemoryRelation [one#3, two#4, three#5], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- Exchange RoundRobinPartitioning(200)
+- LocalTableScan [one#3, two#4, three#5]
NOTE: I checked that with Spark 2.1, it may be different in Spark 2.2 because the optimizer changed in Spark 2.2 (Cost-Based Optimizer)

No, dropDuplicates doesn't preserve partitions since it has a shuffle boundary, which doesn't guarantee order.
dropDuplicates is approximately:
ds.groupBy(columnId).agg(/* take first column from any available partition */)

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Aggregation after sort(), persist() and limit() in Spark - scala

Related

Pyspark Update, Insert records on LargeData parquet file

Scala spark: Sum all columns across all rows

Sorted hive tables sorting again after union in spark

Spark count vs take and length

Apache Spark - Does dataset.dropDuplicates() preserve partitioning?

Categories

Resources