PySpark: update and insert records in a large parquet file

I have 70M+ records (116 MB) in my data, with example columns
ID, TransactionDate, CreationDate
Here ID is the primary-key column. I need to update my data with new incoming parquet files, each smaller than 50 MB.
Sample input:
ID  col1        col2
1   2021-01-01  2020-08-21
2   2021-02-02  2020-08-21
New data:
ID  col1        col2
1   2021-02-01  2020-08-21
3   2021-02-02  2020-08-21
Expected output rows:
ID  col1        col2
1   2021-02-01  2020-08-21  (updated)
3   2021-02-02  2020-08-21  (inserted)
2   2021-02-02  2020-08-21  (unchanged)
I have tried various approaches, but none of them gives proper results with low shuffle read/write and execution time.
One of my approaches: an inner join (updated records), a left-anti join (inserted records) and a left-anti join (unchanged records). This takes about 10 minutes to execute, with 9.5 GB shuffle read and 9.5 GB shuffle write.
I also tried a partitionBy on CreationDate approach, but I could not figure out how to read the new data against the appropriate partitions.
Please suggest a better approach that takes less time, with less shuffle read and write, in PySpark.
Thanks in advance.

You cannot avoid some shuffle, but you can at least limit it by doing only one full outer join instead of one inner join and two anti joins.
First add a new column updated to your new dataframe, to flag whether a joined row comes from the new data; then perform the full outer join; finally, for each non-key column, select the value from the new or the old data according to the updated column. The code is as follows, with the old_data dataframe as the current data and the new_data dataframe as the incoming data:
from pyspark.sql import functions as F

join_columns = ['ID']

final_data = new_data \
    .withColumn('updated', F.lit(True)) \
    .join(old_data, join_columns, 'full_outer') \
    .select(
        # the join keys are coalesced into a single column by the equi-join on column names
        [F.col(c) for c in join_columns] +
        # rows present in new_data have updated = True, so their values win;
        # rows only present in old_data have updated = null and fall through to otherwise()
        [F.when(F.col('updated'), new_data[c]).otherwise(old_data[c]).alias(c)
         for c in old_data.columns if c not in join_columns]
    )
If you look at the execution plan, using the .explain() method on the final_data dataframe, you can see that there are only two shuffles (the Exchange steps), one per joined dataframe:
== Physical Plan ==
*(5) Project [coalesce(ID#6L, ID#0L) AS ID#17L, CASE WHEN exists#12 THEN col1#7 ELSE col1#1 END AS col1#24, CASE WHEN exists#12 THEN col2#8 ELSE col2#2 END AS col2#25]
+- SortMergeJoin [ID#6L], [ID#0L], FullOuter
:- *(2) Sort [ID#6L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#27]
: +- *(1) Project [ID#6L, col1#7, col2#8, true AS exists#12]
: +- *(1) Scan ExistingRDD[ID#6L,col1#7,col2#8]
+- *(4) Sort [ID#0L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#32]
+- *(3) Scan ExistingRDD[ID#0L,col1#1,col2#2]
If you look at the execution plan of your one inner join plus two anti joins, you get six shuffles:
== Physical Plan ==
Union
:- *(5) Project [ID#0L, col1#7, col2#8]
: +- *(5) SortMergeJoin [ID#0L], [ID#6L], Inner
: :- *(2) Sort [ID#0L ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#177]
: : +- *(1) Project [ID#0L]
: : +- *(1) Filter isnotnull(ID#0L)
: : +- *(1) Scan ExistingRDD[ID#0L,col1#1,col2#2]
: +- *(4) Sort [ID#6L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#183]
: +- *(3) Filter isnotnull(ID#6L)
: +- *(3) Scan ExistingRDD[ID#6L,col1#7,col2#8]
:- SortMergeJoin [ID#0L], [ID#6L], LeftAnti
: :- *(7) Sort [ID#0L ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#192]
: : +- *(6) Scan ExistingRDD[ID#0L,col1#1,col2#2]
: +- *(9) Sort [ID#6L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#197]
: +- *(8) Project [ID#6L]
: +- *(8) Filter isnotnull(ID#6L)
: +- *(8) Scan ExistingRDD[ID#6L,col1#7,col2#8]
+- SortMergeJoin [ID#6L], [ID#0L], LeftAnti
:- *(11) Sort [ID#6L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#203]
: +- *(10) Scan ExistingRDD[ID#6L,col1#7,col2#8]
+- *(13) Sort [ID#0L ASC NULLS FIRST], false, 0
+- ReusedExchange [ID#0L], Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#177]
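If you then need to materialise the merged result, note that Spark generally cannot overwrite the same parquet path it is still reading old_data from, so write to a separate location and switch readers (or rename directories) afterwards. A minimal sketch, where the output path and the shuffle-partition value are hypothetical and should be tuned for your data and cluster:

# optional: for ~116 MB of data, the default 200 shuffle partitions seen in the
# plans above may be more than needed; a smaller value often reduces overhead
spark.conf.set("spark.sql.shuffle.partitions", "64")

# write the merged result to a new location instead of the path old_data was read from
merged_path = "/data/transactions_merged"   # hypothetical path
final_data.write.mode("overwrite").parquet(merged_path)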

Related

Avoid shuffle when joining two tables with the same partitions, without a bucketing strategy

I have two Hive tables A and B with:
the same partitions (partition_1, partition_2)
an extra id field that is not sorted within the partitions
When I join these two tables in PySpark, for example with:
df_A = spark.table("db.A")
df_B = spark.table("db.B")
df = df_A.join(df_B, how="inner", on=["partition_1", "partition_2", "id"])
I always end up with a shuffle:
+- == Initial Plan ==
Project (23)
+- SortMergeJoin Inner (22)
:- Sort (18)
: +- Exchange (17)
: +- Filter (16)
: +- Scan parquet db.A (15)
+- Sort (21)
+- Exchange (20)
+- Filter (19)
+- Scan parquet db.B (7)
I created two similar tables, but with a bucketing strategy this time:
df.write.partitionBy("partition_1", "partition_2").bucketBy(10, "id").saveAsTable(...)
And there is no more shuffle in the join:
+- == Initial Plan ==
Project (17)
+- SortMergeJoin Inner (16)
:- Sort (13)
: +- Filter (12)
: +- Scan parquet db.A (11)
+- Sort (15)
+- Filter (14)
+- Scan parquet db.B (5)
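For reference, here is a minimal end-to-end sketch of that bucketed write and the subsequent join; the table names and the bucket count of 10 are assumptions, and the key point is that both sides are bucketed with the same bucket count on the join key id:

# hypothetical example: write both tables with an identical bucketing spec on id
(df_A.write
    .partitionBy("partition_1", "partition_2")
    .bucketBy(10, "id")              # same bucket count on both sides
    .mode("overwrite")
    .saveAsTable("db.A_bucketed"))   # hypothetical table name

(df_B.write
    .partitionBy("partition_1", "partition_2")
    .bucketBy(10, "id")
    .mode("overwrite")
    .saveAsTable("db.B_bucketed"))

# the join on the bucketed key should then plan without an Exchange, as shown above
spark.table("db.A_bucketed").join(
    spark.table("db.B_bucketed"),
    on=["partition_1", "partition_2", "id"],
    how="inner",
).explain()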
My questions are:
Can I avoid this shuffle in the join without having to re-create the tables with a bucketing strategy?
Does this shuffle operate on all the data? Or does it recognise that the partitions are the same and optimise the shuffle?
What I have tried so far:
repartitioning on the partition columns (df.repartition("partition_1", "partition_2")) on both tables before joining
repartitioning on the partition columns and the id field (df.repartition(numPartitions, "partition_1", "partition_2", "id"))
sorting the data by id before joining
But the shuffle is still there.
I tried on both the Databricks and EMR runtimes, with the same behaviour.
Thanks for your help.

Aggregation after sort(), persist() and limit() in Spark

I'm trying to get the sum of a column of the top n rows in a persisted DataFrame. For some reason, the following doesn't work:
val df = df0.sort(col("colB").desc).persist()
df.limit(2).agg(sum("colB")).show()
It shows a seemingly random number that is clearly less than the sum of the top two, and the number changes from run to run.
Calling show() on the limit()'ed DF does consistently show the correct top two values:
df.limit(2).show()
It is as if sort() doesn't apply before the aggregation. Is this a bug in Spark? I suppose it's kind of expected that persist() loses the sorting, but why does it work with show() and should this be documented somewhere?
See the query plans below. agg results in an exchange (4th line in physical plan) which removes the sorting, whereas show does not result in any exchange, so sorting is maintained.
scala> df.limit(2).agg(sum("colB")).explain()
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(colB#4 as bigint))])
+- *(2) HashAggregate(keys=[], functions=[partial_sum(cast(colB#4 as bigint))])
+- *(2) GlobalLimit 2
+- Exchange SinglePartition, true, [id=#95]
+- *(1) LocalLimit 2
+- *(1) ColumnarToRow
+- InMemoryTableScan [colB#4]
+- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Sort [colB#4 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(colB#4 DESC NULLS LAST, 200), true, [id=#7]
+- LocalTableScan [colB#4]
scala> df.limit(2).explain()
== Physical Plan ==
CollectLimit 2
+- *(1) ColumnarToRow
+- InMemoryTableScan [colB#4]
+- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Sort [colB#4 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(colB#4 DESC NULLS LAST, 200), true, [id=#7]
+- LocalTableScan [colB#4]
But if you persist the limited dataframe, there won't be any exchange for the aggregation, so that might do the trick:
val df1 = df.limit(2).persist()
scala> df1.agg(sum("colB")).explain()
== Physical Plan ==
*(1) HashAggregate(keys=[], functions=[sum(cast(colB#4 as bigint))])
+- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(colB#4 as bigint))])
+- *(1) ColumnarToRow
+- InMemoryTableScan [colB#4]
+- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
+- CollectLimit 2
+- *(1) ColumnarToRow
+- InMemoryTableScan [colB#4]
+- InMemoryRelation [colB#4], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Sort [colB#4 DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(colB#4 DESC NULLS LAST, 200), true, [id=#7]
+- LocalTableScan [colB#4]
In any case, it's best to use window functions to assign row numbers and sum the rows if their row number meets a certain condition (e.g. row_number <= 2). This will result in a deterministic outcome. For example,
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, sum}

df0.withColumn(
    "rn",
    row_number().over(Window.orderBy($"colB".desc))
  ).filter("rn <= 2").agg(sum("colB"))

Scala Spark: sum all columns across all rows

I can do this quite easily with
df.groupBy().sum()
But I'm not sure whether the groupBy() adds an extra performance impact, or is just bad style. I've seen it done with
df.agg( ("col1", "sum"), ("col2", "sum"), ("col3", "sum"))
which skips the (I think unnecessary) groupBy, but has its own ugliness. What's the correct way to do this? Is there any under-the-hood difference between using .groupBy().<aggOp>() and using .agg?
If you check the physical plan for both queries, Spark internally produces the same plan, so we can use either of them.
I think using df.groupBy().sum() is handy, as we don't need to specify all the column names.
Example:
val df=Seq((1,2,3),(4,5,6)).toDF("id","j","k")
scala> df.groupBy().sum().explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(id#7 as bigint)), sum(cast(j#8 as bigint)), sum(cast(k#9 as bigint))])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(id#7 as bigint)), partial_sum(cast(j#8 as bigint)), partial_sum(cast(k#9 as bigint))])
+- LocalTableScan [id#7, j#8, k#9]
scala> df.agg(sum("id"),sum("j"),sum("k")).explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(cast(id#7 as bigint)), sum(cast(j#8 as bigint)), sum(cast(k#9 as bigint))])
+- Exchange SinglePartition
+- *(1) HashAggregate(keys=[], functions=[partial_sum(cast(id#7 as bigint)), partial_sum(cast(j#8 as bigint)), partial_sum(cast(k#9 as bigint))])
+- LocalTableScan [id#7, j#8, k#9]
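If you prefer the agg form but don't want to spell out every column by hand, the list of aggregation expressions can also be built programmatically. A PySpark sketch of the same comparison (the equivalent idea works in Scala by mapping over df.columns):

from pyspark.sql import functions as F

# build sum(...) expressions for every column instead of listing them manually
df.agg(*[F.sum(c) for c in df.columns]).explain()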

Sorted Hive tables are sorted again after a union in Spark

I start spark-shell with Spark 2.3.1 and these parameters:
--master='local[*]'
--executor-memory=6400M
--driver-memory=60G
--conf spark.sql.autoBroadcastJoinThreshold=209715200
--conf spark.sql.shuffle.partitions=1000
--conf spark.local.dir=/data/spark-temp
--conf spark.driver.extraJavaOptions='-Dderby.system.home=/data/spark-catalog/'
Then I create two Hive tables with sorting and buckets.
First table name: table1
Second table name: table2
val storagePath = "path_to_orc"
val storage = spark.read.orc(storagePath)
val tableName = "table1"
sql(s"DROP TABLE IF EXISTS $tableName")
storage.select($"group", $"id").write.bucketBy(bucketsCount, "id").sortBy("id").saveAsTable(tableName)
(the same code for table2)
I expected that when I join either of these tables with another dataframe, there would be no unnecessary Exchange step in the query plan.
Then I turn off broadcasting to force a SortMergeJoin:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1)
I take some dataframe:
val sample = spark.read.option("header", "true").option("delimiter", "\t").csv("path_to_tsv")
val m = spark.table("table1")
sample.select($"col" as "id").join(m, Seq("id")).explain()
== Physical Plan ==
*(4) Project [id#24, group#0]
+- *(4) SortMergeJoin [id#24], [id#1], Inner
:- *(2) Sort [id#24 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#24, 1000)
: +- *(1) Project [col#21 AS id#24]
: +- *(1) Filter isnotnull(col#21)
: +- *(1) FileScan csv [col#21] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/samples/sample-20K], PartitionFilters: [], PushedFilters: [IsNotNull(col)], ReadSchema: struct<col:string>
+- *(3) Project [group#0, id#1]
+- *(3) Filter isnotnull(id#1)
+- *(3) FileScan parquet default.table1[group#0,id#1] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/data/table1], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<group:string,id:string>
But when I union the two tables before the join:
val m2 = spark.table("table2")
val mUnion = m union m2
sample.select($"col" as "id").join(mUnion, Seq("id")).explain()
== Physical Plan ==
*(6) Project [id#33, group#0]
+- *(6) SortMergeJoin [id#33], [id#1], Inner
:- *(2) Sort [id#33 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#33, 1000)
: +- *(1) Project [col#21 AS id#33]
: +- *(1) Filter isnotnull(col#21)
: +- *(1) FileScan csv [col#21] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/samples/sample-20K], PartitionFilters: [], PushedFilters: [IsNotNull(col)], ReadSchema: struct<col:string>
+- *(5) Sort [id#1 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#1, 1000)
+- Union
:- *(3) Project [group#0, id#1]
: +- *(3) Filter isnotnull(id#1)
: +- *(3) FileScan parquet default.membership_g043_append[group#0,id#1] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/data/table1], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<group:string,id:string>
+- *(4) Project [group#4, id#5]
+- *(4) Filter isnotnull(id#5)
+- *(4) FileScan parquet default.membership_g042[group#4,id#5] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/data/table2], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<group:string,id:string>
In this case an extra Sort and Exchange appear (step 5).
How can I union two Hive tables without the additional sorting and exchange?
As far as I know, Spark does not take sorting into account when planning a join, only partitioning. So in order to get an efficient join, both sides must be partitioned the same way on the join key. This is because sorting does not guarantee that records with the same key end up in the same partition: Spark has to make sure that all rows with the same key value, from both dataframes, are shuffled to the same partition on the same executor.
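One possible workaround, sketched here in PySpark for consistency with the main question (the Scala version is analogous, and sample, table1 and table2 are assumed to be as defined above): join each bucketed table to the small dataframe separately and union the join results. Each individual join can then use its table's bucketed layout, as in your first plan, and only the small csv side is shuffled once per join:

from pyspark.sql import functions as F

m = spark.table("table1")
m2 = spark.table("table2")
keyed_sample = sample.select(F.col("col").alias("id"))

# join each bucketed table on its own, then union the already-joined results
joined = keyed_sample.join(m, "id").union(keyed_sample.join(m2, "id"))
joined.explain()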

Spark SQL detects a normal inner join as a cross join [duplicate]

I want to join data twice as below:
rdd1 = spark.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['idx', 'val'])
rdd2 = spark.createDataFrame([(1, 2, 1), (1, 3, 0), (2, 3, 1)], ['key1', 'key2', 'val'])
res1 = rdd1.join(rdd2, on=[rdd1['idx'] == rdd2['key1']])
res2 = res1.join(rdd1, on=[res1['key2'] == rdd1['idx']])
res2.show()
Then I get an error:
pyspark.sql.utils.AnalysisException: u'Cartesian joins could be
prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true;'
But I don't think this is a cross join.
UPDATE:
res2.explain()
== Physical Plan ==
CartesianProduct
:- *SortMergeJoin [idx#0L, idx#0L], [key1#5L, key2#6L], Inner
: :- *Sort [idx#0L ASC, idx#0L ASC], false, 0
: : +- Exchange hashpartitioning(idx#0L, idx#0L, 200)
: : +- *Filter isnotnull(idx#0L)
: : +- Scan ExistingRDD[idx#0L,val#1]
: +- *Sort [key1#5L ASC, key2#6L ASC], false, 0
: +- Exchange hashpartitioning(key1#5L, key2#6L, 200)
: +- *Filter ((isnotnull(key2#6L) && (key2#6L = key1#5L)) && isnotnull(key1#5L))
: +- Scan ExistingRDD[key1#5L,key2#6L,val#7L]
+- Scan ExistingRDD[idx#40L,val#41]
This happens because you join structures that share the same lineage, and this leads to a trivially equal condition:
res2.explain()
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
Join Inner, ((idx#204L = key1#209L) && (key2#210L = idx#204L))
:- Filter isnotnull(idx#204L)
: +- LogicalRDD [idx#204L, val#205]
+- Filter ((isnotnull(key2#210L) && (key2#210L = key1#209L)) && isnotnull(key1#209L))
+- LogicalRDD [key1#209L, key2#210L, val#211L]
and
LogicalRDD [idx#235L, val#236]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
In cases like this you should use aliases:
from pyspark.sql.functions import col
rdd1 = spark.createDataFrame(...).alias('rdd1')
rdd2 = spark.createDataFrame(...).alias('rdd2')
res1 = rdd1.join(rdd2, col('rdd1.idx') == col('rdd2.key1')).alias('res1')
res1.join(rdd1, on=col('res1.key2') == col('rdd1.idx')).explain()
== Physical Plan ==
*SortMergeJoin [key2#297L], [idx#360L], Inner
:- *Sort [key2#297L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(key2#297L, 200)
: +- *SortMergeJoin [idx#290L], [key1#296L], Inner
: :- *Sort [idx#290L ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(idx#290L, 200)
: : +- *Filter isnotnull(idx#290L)
: : +- Scan ExistingRDD[idx#290L,val#291]
: +- *Sort [key1#296L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(key1#296L, 200)
: +- *Filter (isnotnull(key2#297L) && isnotnull(key1#296L))
: +- Scan ExistingRDD[key1#296L,key2#297L,val#298L]
+- *Sort [idx#360L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(idx#360L, 200)
+- *Filter isnotnull(idx#360L)
+- Scan ExistingRDD[idx#360L,val#361]
For details see SPARK-6459.
I was also successful when I persisted the dataframe before the second join.
Something like:
res1 = rdd1.join(rdd2, col('rdd1.idx') == col('rdd2.key1')).persist()
res1.join(rdd1, on=col('res1.key2') == col('rdd1.idx'))
Persisting did not work for me.
I got around it with aliases on the DataFrames:
from pyspark.sql.functions import col
df1.alias("buildings").join(df2.alias("managers"), col("managers.distinguishedName") == col("buildings.manager"))