Avoid shuffle when joining two tables with the same partitions without a bucketing strategy - pyspark

I have two Hive tables A and B with:
the same partition columns (partition_1, partition_2)
an extra id field that is not sorted within partitions
When I join these two tables in PySpark, for example with:
df_A = spark.table("db.A")
df_B = spark.table("db.B")
df = df_A.join(df_B, how="inner", on=["partition_1", "partition_2", "id"])
I always end up with a shuffle:
+- == Initial Plan ==
   Project (23)
   +- SortMergeJoin Inner (22)
      :- Sort (18)
      :  +- Exchange (17)
      :     +- Filter (16)
      :        +- Scan parquet db.A (15)
      +- Sort (21)
         +- Exchange (20)
            +- Filter (19)
               +- Scan parquet db.B (7)
I created two similar tables, but with a bucketing strategy this time:
df.write.partitionBy("partition_1", "partition_2").bucketBy(10, "id").saveAsTable(...)
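For reference, a fuller sketch of the bucketed writes for both tables (the target table names and the overwrite mode are assumptions; the key point is that both sides use the same bucket column and the same bucket count):

# Sketch, assuming df_A / df_B hold the source data; the target table names
# are placeholders. bucketBy requires saveAsTable, and both tables are written
# with the same bucket column and bucket count.
(df_A.write
    .partitionBy("partition_1", "partition_2")
    .bucketBy(10, "id")
    .mode("overwrite")
    .saveAsTable("db.A_bucketed"))

(df_B.write
    .partitionBy("partition_1", "partition_2")
    .bucketBy(10, "id")
    .mode("overwrite")
    .saveAsTable("db.B_bucketed"))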
And there is no longer a shuffle in the join:
+- == Initial Plan ==
   Project (17)
   +- SortMergeJoin Inner (16)
      :- Sort (13)
      :  +- Filter (12)
      :     +- Scan parquet db.A (11)
      +- Sort (15)
         +- Filter (14)
            +- Scan parquet db.B (5)
My questions are:
Can I avoid this shuffle in the join without having to re-create the tables with a bucketing strategy?
Does this shuffle operate on all of the data, or does Spark recognise that the partitions are the same and optimise the shuffle?
What I tried so far:
repartitioning both tables on the partition columns (df.repartition("partition_1", "partition_2")) before joining
repartitioning on the partition columns and the id field (df.repartition(numPartitions, "partition_1", "partition_2", "id"))
sorting the data by id before joining
But the shuffle is still there; these attempts are sketched below.
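A rough sketch of these attempts (column names follow the question; numPartitions and the use of sortWithinPartitions for the sorting step are assumptions):

# Sketch of the attempts described above; numPartitions is a placeholder.
numPartitions = 200

df_A = spark.table("db.A").repartition("partition_1", "partition_2")
df_B = spark.table("db.B").repartition("partition_1", "partition_2")

# variant: repartition on the partition columns plus the id field
df_A = spark.table("db.A").repartition(numPartitions, "partition_1", "partition_2", "id")
df_B = spark.table("db.B").repartition(numPartitions, "partition_1", "partition_2", "id")

# variant: sort by id before joining
df_A = df_A.sortWithinPartitions("id")
df_B = df_B.sortWithinPartitions("id")

df = df_A.join(df_B, how="inner", on=["partition_1", "partition_2", "id"])
df.explain()  # the Exchange nodes are still present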
I tried this on both Databricks and EMR runtimes, with the same behaviour.
Thanks for your help

Related

Pyspark Update, Insert records on LargeData parquet file

I have 70M+ records (116 MB) in my data, with example columns:
ID, TransactionDate, CreationDate
Here ID is the primary key column. I need to update my data with newly arriving Parquet files, which are < 50 MB in size.
Sample Input
ID col1 col2
1 2021-01-01 2020-08-21
2 2021-02-02 2020-08-21
New Data
ID col1 col2
1 2021-02-01 2020-08-21
3 2021-02-02 2020-08-21
Output Rows of Data
1 2021-02-01 2020-08-21 (Updated)
3 2021-02-02 2020-08-21 (Inserted)
2 2021-02-02 2020-08-21 (Remains Same)
I have tried various approaches, but none of them gives good results in terms of shuffle read/write and execution time.
A few of my approaches:
Inner join (updated records), left-anti join (inserted records), left-anti join (unchanged records), then a union (sketched below).
This takes 10 minutes to execute, with 9.5 GB shuffle read and 9.5 GB shuffle write.
I also tried a partitionBy on CreationDate approach, but I could not figure out how to read the new data with the appropriate partitions.
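Roughly, the join-based approach above looks like this (a sketch only; the old_data/new_data names are illustrative and the column references follow the sample data):

# Sketch of the inner join / left-anti / left-anti approach described above.
updated = old_data.join(new_data, "ID", "inner") \
    .select("ID", new_data["col1"], new_data["col2"])   # records updated from the new data
inserted = new_data.join(old_data, "ID", "left_anti")    # records only in the new data
unchanged = old_data.join(new_data, "ID", "left_anti")   # records untouched by the new data

result = updated.unionByName(inserted).unionByName(unchanged)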
Please help me with a better approach that takes less time, with less shuffle read and write, in PySpark.
Thanks in advance.
You cannot avoid some shuffle, but at least you can limit it by doing only one full outer join instead of one inner join and two anti joins.
You first add a new column updated to your new dataframe, to flag whether a joined row comes from the new data, then you perform the full outer join, and finally you select the value for each column from the new or old data according to the updated column. The code is as follows, with the old_data dataframe holding the current data and the new_data dataframe holding the updated data:
from pyspark.sql import functions as F

join_columns = ['ID']

final_data = new_data \
    .withColumn('updated', F.lit(True)) \
    .join(old_data, join_columns, 'full_outer') \
    .select(
        [F.col(c) for c in join_columns] +
        [F.when(F.col('updated'), new_data[c]).otherwise(old_data[c]).alias(c)
         for c in old_data.columns if c not in join_columns]
    )
If you look at the execution plan using the .explain() method on the final_data dataframe, you can see that there are only two shuffles (the Exchange steps), one per joined dataframe:
== Physical Plan ==
*(5) Project [coalesce(ID#6L, ID#0L) AS ID#17L, CASE WHEN exists#12 THEN col1#7 ELSE col1#1 END AS col1#24, CASE WHEN exists#12 THEN col2#8 ELSE col2#2 END AS col2#25]
+- SortMergeJoin [ID#6L], [ID#0L], FullOuter
   :- *(2) Sort [ID#6L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#27]
   :     +- *(1) Project [ID#6L, col1#7, col2#8, true AS exists#12]
   :        +- *(1) Scan ExistingRDD[ID#6L,col1#7,col2#8]
   +- *(4) Sort [ID#0L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#32]
         +- *(3) Scan ExistingRDD[ID#0L,col1#1,col2#2]
If you look at the execution plan of your one inner join plus two anti joins approach, you get six shuffles:
== Physical Plan ==
Union
:- *(5) Project [ID#0L, col1#7, col2#8]
:  +- *(5) SortMergeJoin [ID#0L], [ID#6L], Inner
:     :- *(2) Sort [ID#0L ASC NULLS FIRST], false, 0
:     :  +- Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#177]
:     :     +- *(1) Project [ID#0L]
:     :        +- *(1) Filter isnotnull(ID#0L)
:     :           +- *(1) Scan ExistingRDD[ID#0L,col1#1,col2#2]
:     +- *(4) Sort [ID#6L ASC NULLS FIRST], false, 0
:        +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#183]
:           +- *(3) Filter isnotnull(ID#6L)
:              +- *(3) Scan ExistingRDD[ID#6L,col1#7,col2#8]
:- SortMergeJoin [ID#0L], [ID#6L], LeftAnti
:  :- *(7) Sort [ID#0L ASC NULLS FIRST], false, 0
:  :  +- Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#192]
:  :     +- *(6) Scan ExistingRDD[ID#0L,col1#1,col2#2]
:  +- *(9) Sort [ID#6L ASC NULLS FIRST], false, 0
:     +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#197]
:        +- *(8) Project [ID#6L]
:           +- *(8) Filter isnotnull(ID#6L)
:              +- *(8) Scan ExistingRDD[ID#6L,col1#7,col2#8]
+- SortMergeJoin [ID#6L], [ID#0L], LeftAnti
   :- *(11) Sort [ID#6L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#203]
   :     +- *(10) Scan ExistingRDD[ID#6L,col1#7,col2#8]
   +- *(13) Sort [ID#0L ASC NULLS FIRST], false, 0
      +- ReusedExchange [ID#0L], Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#177]

Join two large spark dataframes persisted in parquet using Scala

I'm trying to join two large Spark dataframes using Scala and I can't get it to perform well. I really hope someone can help me.
I have the following two text files:
dfPerson.txt (PersonId: String, GroupId: String) 2 million rows (100MB)
dfWorld.txt (PersonId: String, GroupId: String, PersonCharacteristic: String) 30 billion rows (1TB)
First I parse the text files to parquet and partition on GroupId, which has 50 distinct values and a rest group.
val dfPerson = spark.read.csv("input/dfPerson.txt")
dfPerson.write.partitionBy("GroupId").parquet("output/dfPerson")
val dfWorld = spark.read.csv("input/dfWorld.txt")
dfWorld.write.partitionBy("GroupId").parquet("output/dfWorld")
Note: a GroupId can contain anywhere from 1 PersonId up to 6 billion PersonIds, so since it is skewed it might not be the best partition column, but it is all I could think of.
Next I read the parquet files and join them. I took the following approaches:
Approach 1: Basic spark join operation
val dfPerson = spark.read.parquet("output/dfPerson")
val dfWorld = spark.read.parquet("output/dfWorld")
dfWorld.as("w").join(
dfPerson.as("p"),
$"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
"right"
)
.drop($"w.GroupId")
.drop($"w.PersonId")
This however didn't perform well and shuffled over 1 TB of data.
Approach 2: Broadcast hash join
Since dfPerson might be small enough to hold in memory, I thought this approach might solve my problem:
val dfPerson = spark.read.parquet("output/dfPerson")
val dfWorld = spark.read.parquet("output/dfWorld")
dfWorld.as("w").join(
broadcast(dfPerson).as("p"),
$"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
"right"
)
.drop($"w.GroupId")
.drop($"w.PersonId")
This also didn't perform well and also shuffled over 1 TB of data which makes me believe the broadcast didn't work?
Approach 3: Bucket and sort the dataframe
I first try to bucket and sort the dataframes before writing to parquet and then join:
val dfPersonInput = spark.read.csv("input/dfPerson.txt")
dfPersonInput
  .write
  .format("parquet")
  .partitionBy("GroupId")
  .bucketBy(4, "PersonId")
  .sortBy("PersonId")
  .mode("overwrite")
  .option("path", "output/dfPerson")
  .saveAsTable("dfPerson")
val dfPerson = spark.table("dfPerson")

val dfWorldInput = spark.read.csv("input/dfWorld.txt")
dfWorldInput
  .write
  .format("parquet")
  .partitionBy("GroupId")
  .bucketBy(4, "PersonId")
  .sortBy("PersonId")
  .mode("overwrite")
  .option("path", "output/dfWorld")
  .saveAsTable("dfWorld")
val dfWorld = spark.table("dfWorld")

dfWorld.as("w").join(
    dfPerson.as("p"),
    $"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
    "right"
  )
  .drop($"w.GroupId")
  .drop($"w.PersonId")
With the following execution plan:
== Physical Plan ==
*(5) Project [PersonId#743]
+- SortMergeJoin [GroupId#73, PersonId#71], [GroupId#745, PersonId#743], RightOuter
   :- *(2) Sort [GroupId#73 ASC NULLS FIRST, PersonId#71 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(GroupId#73, PersonId#71, 200)
   :     +- *(1) Project [PersonId#71, PersonCharacteristic#72, GroupId#73]
   :        +- *(1) Filter isnotnull(PersonId#71)
   :           +- *(1) FileScan parquet default.dfWorld[PersonId#71,PersonCharacteristic#72,GroupId#73] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/F:/Output/dfWorld..., PartitionCount: 52, PartitionFilters: [isnotnull(GroupId#73)], PushedFilters: [IsNotNull(PersonId)], ReadSchema: struct<PersonId:string,PersonCharacteristic:string>, SelectedBucketsCount: 4 out of 4
   +- *(4) Sort [GroupId#745 ASC NULLS FIRST, PersonId#743 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(GroupId#745, PersonId#743, 200)
         +- *(3) FileScan parquet default.dfPerson[PersonId#743,GroupId#745] Batched: true, Format: Parquet, Location: CatalogFileIndex[file:/F:/Output/dfPerson], PartitionCount: 45, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<PersonId:string,GroupId:string>, SelectedBucketsCount: 4 out of 4
Also this didn't perform well.
To conclude
All approaches take approximately 150-200 hours (based on the progress of stages and tasks in the Spark jobs after 24 hours) and follow this strategy:
DAG visualization
I guess there is something I'm missing with the partitioning, bucketing, sorting, or parquet setup, or all of them.
Any help would be greatly appreciated.
What is the goal you're trying to achieve? Why do you need to have it joined?
A join for the sake of a join will take you nowhere, unless you have enough memory/disk space to collect 1 TB x 100 MB worth of data.
Edited based on response:
If you only need records related to persons that are present in dfPerson, then you don't need a right/left join; an inner join is what you want.
Broadcast will only work if your DF is smaller than the broadcast threshold in your Spark configuration (10 MB by default); it's ignored otherwise.
dfPerson.as("p").join(
dfWorld.select(
$"GroupId", $"PersonId",
$"<feature1YouNeed>", $"<feature2YouNeed>"
).as("w"),
Seq("GroupId", "PersonId")
)
This should give you the features you're after.
NB: replace <feature1YouNeed> and <feature2YouNeed> with the actual column names.

Apache Spark - Does dataset.dropDuplicates() preserve partitioning?

I know that there exist several transformations which preserve the parent partitioning (if it was set before, e.g. mapValues) and some which do not preserve it (e.g. map).
I use the Dataset API of Spark 2.2. My question is: does the dropDuplicates transformation preserve partitioning? Imagine this code:
case class Item(one: Int, two: Int, three: Int)
import session.implicits._
val ds = session.createDataset(List(Item(1,2,3), Item(1,2,3)))
val repart = ds.repartition('one, 'two).cache()
repart.dropDuplicates(List("one", "two")) // will be partitioning preserved?
Generally, dropDuplicates does a shuffle (and thus does not preserve partitioning), but in your special case it does NOT do an additional shuffle, because you have already partitioned the dataset in a suitable form, which is taken into account by the optimizer:
repart.dropDuplicates(List("one","two")).explain()
== Physical Plan ==
*HashAggregate(keys=[one#3, two#4, three#5], functions=[])
+- *HashAggregate(keys=[one#3, two#4, three#5], functions=[])
   +- InMemoryTableScan [one#3, two#4, three#5]
      +- InMemoryRelation [one#3, two#4, three#5], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- Exchange hashpartitioning(one#3, two#4, 200)
            +- LocalTableScan [one#3, two#4, three#5]
The keyword to look for here is: Exchange.
But consider the following code where you first repartition the dataset using plain repartition():
val repart = ds.repartition(200).cache()
repart.dropDuplicates(List("one","two")).explain()
This will indeed trigger an additional shuffle (now you have two Exchange steps):
== Physical Plan ==
*HashAggregate(keys=[one#3, two#4], functions=[first(three#5, false)])
+- Exchange hashpartitioning(one#3, two#4, 200)
   +- *HashAggregate(keys=[one#3, two#4], functions=[partial_first(three#5, false)])
      +- InMemoryTableScan [one#3, two#4, three#5]
         +- InMemoryRelation [one#3, two#4, three#5], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
            +- Exchange RoundRobinPartitioning(200)
               +- LocalTableScan [one#3, two#4, three#5]
NOTE: I checked this with Spark 2.1; it may be different in Spark 2.2 because the optimizer changed in Spark 2.2 (cost-based optimizer).
No, dropDuplicates doesn't preserve partitioning, since it involves a shuffle boundary, which doesn't guarantee ordering.
dropDuplicates is approximately:
ds.groupBy(columnId).agg(/* take first column from any available partition */)

Joining Two Datasets with Predicate Pushdown

I have a Dataset that I created from an RDD and I am trying to join it with another Dataset created from my Phoenix table:
val dfToJoin = sparkSession.createDataset(rddToJoin)
val tableDf = sparkSession
  .read
  .option("table", "table")
  .option("zkURL", "localhost")
  .format("org.apache.phoenix.spark")
  .load()
val joinedDf = dfToJoin.join(tableDf, "columnToJoinOn")
When I execute it, it seems that the whole database table is loaded to do the join.
Is there a way to do such a join so that the filtering is done on the database instead of in Spark?
Also: dfToJoin is smaller than the table; I do not know if this is important.
Edit: Basically I want to join my Phoenix table with a Dataset created through Spark, without fetching the whole table into the executors.
Edit2: Here is the physical plan:
*Project [FEATURE#21, SEQUENCE_IDENTIFIER#22, TAX_NUMBER#23, WINDOW_NUMBER#24, uniqueIdentifier#5, readLength#6]
+- *SortMergeJoin [FEATURE#21], [feature#4], Inner
   :- *Sort [FEATURE#21 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(FEATURE#21, 200)
   :     +- *Filter isnotnull(FEATURE#21)
   :        +- *Scan PhoenixRelation(FEATURES,localhost,false) [FEATURE#21,SEQUENCE_IDENTIFIER#22,TAX_NUMBER#23,WINDOW_NUMBER#24] PushedFilters: [IsNotNull(FEATURE)], ReadSchema: struct<FEATURE:int,SEQUENCE_IDENTIFIER:string,TAX_NUMBER:int,WINDOW_NUMBER:int>
   +- *Sort [feature#4 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(feature#4, 200)
         +- *Filter isnotnull(feature#4)
            +- *SerializeFromObject [assertnotnull(input[0, utils.CaseClasses$QueryFeature, true], top level Product input object).feature AS feature#4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, utils.CaseClasses$QueryFeature, true], top level Product input object).uniqueIdentifier, true) AS uniqueIdentifier#5, assertnotnull(input[0, utils.CaseClasses$QueryFeature, true], top level Product input object).readLength AS readLength#6]
               +- Scan ExternalRDDScan[obj#3]
As you can see, the equality filter is not contained in the pushed-filters list, so it is obvious that no predicate pushdown is happening.
Spark will fetch the Phoenix table records to the appropriate executors (not the entire table to one executor).
As there is no direct filter on the Phoenix table df, we see only *Filter isnotnull(FEATURE#21) in the physical plan.
Since you mention that the Phoenix table data is small once you apply a filter on it, you can push a filter on the feature column down to the Phoenix table by first collecting the feature ids from the other dataset.
// This is spread across workers - fully distributed
val dfToJoin = sparkSession.createDataset(rddToJoin)

// This sits in the driver - not distributed
val list_of_feature_ids = dfToJoin.dropDuplicates("feature")
  .select("feature")
  .map(r => r.getString(0))
  .collect
  .toList

// This is spread across workers - fully distributed
val tableDf = sparkSession
  .read
  .option("table", "table")
  .option("zkURL", "localhost")
  .format("org.apache.phoenix.spark")
  .load()
  .filter($"FEATURE".isin(list_of_feature_ids: _*)) // added filter

// This is spread across workers - fully distributed
val joinedDf = dfToJoin.join(tableDf, "columnToJoinOn")
joinedDf.explain()

Ensuring narrow dependency in Spark job when grouping on pre-partitioned data

I have a huge Spark Dataset with columns A, B, C, D, E. The question is: if I initially repartition on column A and subsequently do two 'within-partition' groupBy operations:
groupBy("A", "C")....map(....).groupBy("A", "E")....map(....)
is Spark 2.0 clever enough to bypass shuffling, since both groupBy operations are 'within-partition' with respect to the parent stage, i.e. column A is included in both groupBy column specs? If not, what can I do to ensure a narrow dependency throughout the chain of operations?
Spark indeed supports optimizations like this. You can check that by analyzing the execution plan:
val df = Seq(("a", 1, 2)).toDF("a", "b", "c")
df.groupBy("a").max().groupBy("a", "max(b)").sum().explain
== Physical Plan ==
*HashAggregate(keys=[a#42, max(b)#92], functions=[sum(cast(max(b)#92 as bigint)), sum(cast(max(c)#93 as bigint))])
+- *HashAggregate(keys=[a#42, max(b)#92], functions=[partial_sum(cast(max(b)#92 as bigint)), partial_sum(cast(max(c)#93 as bigint))])
   +- *HashAggregate(keys=[a#42], functions=[max(b#43), max(c#44)])
      +- Exchange hashpartitioning(a#42, 200)
         +- *HashAggregate(keys=[a#42], functions=[partial_max(b#43), partial_max(c#44)])
            +- LocalTableScan [a#42, b#43, c#44]
As you can see, there is only one Exchange but two aggregations (each split into partial and final HashAggregate steps), so the second groupBy does not trigger an additional shuffle.