I have 70M+ records (116 MB) in my data, with example columns:
ID, TransactionDate, CreationDate
Here ID is the primary key column. I need to update this data with newly arriving Parquet files, each smaller than 50 MB.
Sample Input
ID col1 col2
1 2021-01-01 2020-08-21
2 2021-02-02 2020-08-21
New Data
ID col1 col2
1 2021-02-01 2020-08-21
3 2021-02-02 2020-08-21
Output Rows of Data
1 2021-02-01 2020-08-21 (Updated)
3 2021-02-02 2020-08-21 (Inserted)
2 2021-02-02 2020-08-21 (Remains Same)
I have tried various approaches, but none of them gives proper results with low shuffle read/write and execution time.
A few of my approaches:
Inner join (updated records), left-anti join (inserted records), left-anti join (unchanged records), then a union of the three.
This takes 10 minutes to execute, with 9.5 GB shuffle read and 9.5 GB shuffle write.
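Roughly, a simplified sketch of that attempt (calling the existing data old_data and the incoming Parquet data new_data; columns trimmed to the sample above):

join_columns = ['ID']

# Keys present in both: take the new values ("updated")
updated = new_data.join(old_data.select(join_columns), join_columns, 'inner')

# Keys present only in the new data ("inserted")
inserted = new_data.join(old_data, join_columns, 'left_anti')

# Keys present only in the old data ("remains same")
unchanged = old_data.join(new_data, join_columns, 'left_anti')

final_data = updated.unionByName(inserted).unionByName(unchanged)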
I also tried a partitionBy on CreationDate approach, but I could not figure out how to read the new data against the appropriate partitions.
Please help me with a better approach in PySpark that takes less time, with less shuffle read and write.
Thanks in advance.
You cannot avoid some shuffle, but you can at least limit it by doing only one full outer join instead of one inner join and two anti joins.
You first add a new column, updated, to your new dataframe to flag whether a joined row comes from the new data, then you perform the full outer join, and finally you select the value for each column from the new or the old data according to the updated column. The code is as follows, with the old_data dataframe holding the current data and the new_data dataframe holding the incoming data:
from pyspark.sql import functions as F

join_columns = ['ID']

final_data = new_data \
    .withColumn('updated', F.lit(True)) \
    .join(old_data, join_columns, 'full_outer') \
    .select(
        [F.col(c) for c in join_columns] +
        [F.when(F.col('updated'), new_data[c]).otherwise(old_data[c]).alias(c)
         for c in old_data.columns if c not in join_columns]
    )
If you look at the execution plan using the .explain() method on the final_data dataframe, you can see that there are only two shuffles (the Exchange steps), one per joined dataframe:
== Physical Plan ==
*(5) Project [coalesce(ID#6L, ID#0L) AS ID#17L, CASE WHEN exists#12 THEN col1#7 ELSE col1#1 END AS col1#24, CASE WHEN exists#12 THEN col2#8 ELSE col2#2 END AS col2#25]
+- SortMergeJoin [ID#6L], [ID#0L], FullOuter
:- *(2) Sort [ID#6L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#27]
: +- *(1) Project [ID#6L, col1#7, col2#8, true AS exists#12]
: +- *(1) Scan ExistingRDD[ID#6L,col1#7,col2#8]
+- *(4) Sort [ID#0L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#32]
+- *(3) Scan ExistingRDD[ID#0L,col1#1,col2#2]
If you look at the execution plan for your one inner join plus two anti joins, you get six shuffles:
== Physical Plan ==
Union
:- *(5) Project [ID#0L, col1#7, col2#8]
: +- *(5) SortMergeJoin [ID#0L], [ID#6L], Inner
: :- *(2) Sort [ID#0L ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#177]
: : +- *(1) Project [ID#0L]
: : +- *(1) Filter isnotnull(ID#0L)
: : +- *(1) Scan ExistingRDD[ID#0L,col1#1,col2#2]
: +- *(4) Sort [ID#6L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#183]
: +- *(3) Filter isnotnull(ID#6L)
: +- *(3) Scan ExistingRDD[ID#6L,col1#7,col2#8]
:- SortMergeJoin [ID#0L], [ID#6L], LeftAnti
: :- *(7) Sort [ID#0L ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#192]
: : +- *(6) Scan ExistingRDD[ID#0L,col1#1,col2#2]
: +- *(9) Sort [ID#6L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#197]
: +- *(8) Project [ID#6L]
: +- *(8) Filter isnotnull(ID#6L)
: +- *(8) Scan ExistingRDD[ID#6L,col1#7,col2#8]
+- SortMergeJoin [ID#6L], [ID#0L], LeftAnti
:- *(11) Sort [ID#6L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(ID#6L, 200), ENSURE_REQUIREMENTS, [id=#203]
: +- *(10) Scan ExistingRDD[ID#6L,col1#7,col2#8]
+- *(13) Sort [ID#0L ASC NULLS FIRST], false, 0
+- ReusedExchange [ID#0L], Exchange hashpartitioning(ID#0L, 200), ENSURE_REQUIREMENTS, [id=#177]
I am using Spark 2.3.0 and I have two data frames.
The first one, df1, has the schema:
root
|-- time: long (nullable = true)
|-- channel: string (nullable = false)
The second one, df2, has the schema:
root
|-- pprChannel: string (nullable = true)
|-- ppr: integer (nullable = false)
I now try to do:
spark.sql("select a.channel as channel, a.time as time, b.ppr as ppr from df1 a inner join df2 b on a.channel = b.pprChannel")
But I get: Detected cartesian product for INNER join between logical plans.
When I try to recreate both data frames in a Spark shell with sc.parallelize and simple Seqs, it works.
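For reference, a minimal recreation along those lines (made-up sample data; purely an illustrative sketch) does not trigger the error:

import spark.implicits._

val df1 = sc.parallelize(Seq((1L, "a"), (2L, "b"))).toDF("time", "channel")
val df2 = sc.parallelize(Seq(("a", 1), ("b", 2))).toDF("pprChannel", "ppr")

df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

spark.sql("select a.channel as channel, a.time as time, b.ppr as ppr " +
  "from df1 a inner join df2 b on a.channel = b.pprChannel").show()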
What might be wrong here?
Followup
Here is what I get when I use df1.join(df2, 'channel === 'pprChannel, "inner").explain(true):
== Parsed Logical Plan ==
Join Inner, (channel#124 = pprChannel#136)
:- Project [time#113L AS time#127L, channel#124]
: +- Project [time#113L, unnamed AS channel#124]
: +- Project [time#113L]
: +- Project [channel#23, time#113L]
: +- Project [channel#23, t1#29L, pt1#82, t0#93L, pt0#98, clipDT#105L, if ((isnull(t0#93L) || isnull(t1#29L))) null else UDF(t0#93L, t1#29L) AS time#113L]
: +- Filter (clipDT#105L >= cast(50000000 as bigint))
: +- Project [channel#23, t1#29L, pt1#82, t0#93L, pt0#98, (t1#29L - t0#93L) AS clipDT#105L]
: +- Filter (((t0#93L >= cast(0 as bigint)) && (pt0#98 = 1)) && (pt1#82 = 2))
: +- Project [channel#23, t1#29L, pt1#82, t0#93L, pt0#98]
: +- Project [channel#23, t1#29L, pt1#82, t0#93L, pt0#98, pt0#98]
: +- Window [lag(pt1#82, 1, 0) windowspecdefinition(channel#23, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS pt0#98], [channel#23], [t1#29L ASC NULLS FIRST]
: +- Project [channel#23, t1#29L, pt1#82, t0#93L]
: +- Project [channel#23, t1#29L, pt1#82, t0#93L]
: +- Project [channel#23, t1#29L, pt1#82, t0#93L, t0#93L]
: +- Window [lag(t1#29L, 1, -1) windowspecdefinition(channel#23, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS t0#93L], [channel#23], [t1#29L ASC NULLS FIRST]
: +- Project [channel#23, t1#29L, pt1#82]
: +- Project [channel#23, t1#29L, pt1#82]
: +- Filter pt1#82 IN (1,2)
: +- Project [channel#23, t1#29L, dv1#58, t0#70L, dv0#75, if ((isnull(dv0#75) || isnull(dv1#58))) null else UDF(dv0#75, dv1#58) AS pt1#82]
: +- Filter ((t0#70L >= cast(0 as bigint)) && NOT isnan(dv0#75))
: +- Project [channel#23, t1#29L, dv1#58, t0#70L, dv0#75]
: +- Project [channel#23, t1#29L, dv1#58, t0#70L, dv0#75, dv0#75]
: +- Window [lag(dv1#58, 1, NaN) windowspecdefinition(channel#23, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS dv0#75], [channel#23], [t1#29L ASC NULLS FIRST]
: +- Project [channel#23, t1#29L, dv1#58, t0#70L]
: +- Project [channel#23, t1#29L, dv1#58, t0#70L]
: +- Project [channel#23, t1#29L, dv1#58, t0#70L, t0#70L]
: +- Window [lag(t1#29L, 1, -1) windowspecdefinition(channel#23, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS t0#70L], [channel#23], [t1#29L ASC NULLS FIRST]
: +- Project [channel#23, t1#29L, dv1#58]
: +- Project [channel#23, t1#29L, dv1#58]
: +- Project [_c0#10, _c1#11, t1#29L, v1#35, channel#23, t0#42L, v0#49, abs(if ((isnull(v0#49) || isnull(v1#35))) null else UDF(v0#49, v1#35)) AS dv1#58]
: +- Filter ((t0#42L >= cast(0 as bigint)) && NOT isnan(v0#49))
: +- Project [_c0#10, _c1#11, t1#29L, v1#35, channel#23, t0#42L, v0#49]
: +- Project [_c0#10, _c1#11, t1#29L, v1#35, channel#23, t0#42L, v0#49, v0#49]
: +- Window [lag(v1#35, 1, NaN) windowspecdefinition(channel#23, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS v0#49], [channel#23], [t1#29L ASC NULLS FIRST]
: +- Project [_c0#10, _c1#11, t1#29L, v1#35, channel#23, t0#42L]
: +- Project [_c0#10, _c1#11, t1#29L, v1#35, channel#23, t0#42L]
: +- Project [_c0#10, _c1#11, t1#29L, v1#35, channel#23, t0#42L, t0#42L]
: +- Window [lag(t1#29L, 1, -1) windowspecdefinition(channel#23, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS t0#42L], [channel#23], [t1#29L ASC NULLS FIRST]
: +- Project [_c0#10, _c1#11, t1#29L, v1#35, channel#23]
: +- Filter ((NOT isnull(t1#29L) && NOT isnull(v1#35)) && ((t1#29L >= cast(0 as bigint)) && NOT isnan(v1#35)))
: +- Project [_c0#10, _c1#11, t1#29L, value#18 AS v1#35, channel#23]
: +- Project [_c0#10, _c1#11, time#14L AS t1#29L, value#18, channel#23]
: +- Project [_c0#10, _c1#11, time#14L, value#18, unnamed AS channel#23]
: +- Project [_c0#10, _c1#11, time#14L, UDF(_c1#11) AS value#18]
: +- Project [_c0#10, _c1#11, UDF(_c0#10) AS time#14L]
: +- Relation[_c0#10,_c1#11] csv
+- Project [_1#133 AS pprChannel#136, _2#134 AS ppr#137]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#133, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#134]
+- ExternalRDD [obj#132]
== Analyzed Logical Plan ==
time: bigint, channel: string, pprChannel: string, ppr: int
Join Inner, (channel#124 = pprChannel#136)
:- Project [time#113L AS time#127L, channel#124]
: +- Project [time#113L, unnamed AS channel#124]
: +- Project [time#113L]
: +- Project [channel#23, time#113L]
: +- Project [channel#23, t1#29L, pt1#82, t0#93L, pt0#98, clipDT#105L, if ((isnull(t0#93L) || isnull(t1#29L))) null else if ((isnull(t0#93L) || isnull(t1#29L))) null else UDF(t0#93L, t1#29L) AS time#113L]
: +- Filter (clipDT#105L >= cast(50000000 as bigint))
: +- Project [channel#23, t1#29L, pt1#82, t0#93L, pt0#98, (t1#29L - t0#93L) AS clipDT#105L]
: +- Filter (((t0#93L >= cast(0 as bigint)) && (pt0#98 = 1)) && (pt1#82 = 2))
: +- Project [channel#23, t1#29L, pt1#82, t0#93L, pt0#98]
: +- Project [channel#23, t1#29L, pt1#82, t0#93L, pt0#98, pt0#98]
: +- Window [lag(pt1#82, 1, 0) windowspecdefinition(channel#23, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS pt0#98], [channel#23], [t1#29L ASC NULLS FIRST]
: +- Project [channel#23, t1#29L, pt1#82, t0#93L]
: +- Project [channel#23, t1#29L, pt1#82, t0#93L]
: +- Project [channel#23, t1#29L, pt1#82, t0#93L, t0#93L]
: +- Window [lag(t1#29L, 1, -1) windowspecdefinition(channel#23, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS t0#93L], [channel#23], [t1#29L ASC NULLS FIRST]
: +- Project [channel#23, t1#29L, pt1#82]
: +- Project [channel#23, t1#29L, pt1#82]
: +- Filter pt1#82 IN (1,2)
: +- Project [channel#23, t1#29L, dv1#58, t0#70L, dv0#75, if ((isnull(dv0#75) || isnull(dv1#58))) null else if ((isnull(dv0#75) || isnull(dv1#58))) null else UDF(dv0#75, dv1#58) AS pt1#82]
: +- Filter ((t0#70L >= cast(0 as bigint)) && NOT isnan(dv0#75))
: +- Project [channel#23, t1#29L, dv1#58, t0#70L, dv0#75]
: +- Project [channel#23, t1#29L, dv1#58, t0#70L, dv0#75, dv0#75]
: +- Window [lag(dv1#58, 1, NaN) windowspecdefinition(channel#23, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS dv0#75], [channel#23], [t1#29L ASC NULLS FIRST]
: +- Project [channel#23, t1#29L, dv1#58, t0#70L]
: +- Project [channel#23, t1#29L, dv1#58, t0#70L]
: +- Project [channel#23, t1#29L, dv1#58, t0#70L, t0#70L]
: +- Window [lag(t1#29L, 1, -1) windowspecdefinition(channel#23, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS t0#70L], [channel#23], [t1#29L ASC NULLS FIRST]
: +- Project [channel#23, t1#29L, dv1#58]
: +- Project [channel#23, t1#29L, dv1#58]
: +- Project [_c0#10, _c1#11, t1#29L, v1#35, channel#23, t0#42L, v0#49, abs(if ((isnull(v0#49) || isnull(v1#35))) null else if ((isnull(v0#49) || isnull(v1#35))) null else UDF(v0#49, v1#35)) AS dv1#58]
: +- Filter ((t0#42L >= cast(0 as bigint)) && NOT isnan(v0#49))
: +- Project [_c0#10, _c1#11, t1#29L, v1#35, channel#23, t0#42L, v0#49]
: +- Project [_c0#10, _c1#11, t1#29L, v1#35, channel#23, t0#42L, v0#49, v0#49]
: +- Window [lag(v1#35, 1, NaN) windowspecdefinition(channel#23, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS v0#49], [channel#23], [t1#29L ASC NULLS FIRST]
: +- Project [_c0#10, _c1#11, t1#29L, v1#35, channel#23, t0#42L]
: +- Project [_c0#10, _c1#11, t1#29L, v1#35, channel#23, t0#42L]
: +- Project [_c0#10, _c1#11, t1#29L, v1#35, channel#23, t0#42L, t0#42L]
: +- Window [lag(t1#29L, 1, -1) windowspecdefinition(channel#23, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS t0#42L], [channel#23], [t1#29L ASC NULLS FIRST]
: +- Project [_c0#10, _c1#11, t1#29L, v1#35, channel#23]
: +- Filter ((NOT isnull(t1#29L) && NOT isnull(v1#35)) && ((t1#29L >= cast(0 as bigint)) && NOT isnan(v1#35)))
: +- Project [_c0#10, _c1#11, t1#29L, value#18 AS v1#35, channel#23]
: +- Project [_c0#10, _c1#11, time#14L AS t1#29L, value#18, channel#23]
: +- Project [_c0#10, _c1#11, time#14L, value#18, unnamed AS channel#23]
: +- Project [_c0#10, _c1#11, time#14L, UDF(_c1#11) AS value#18]
: +- Project [_c0#10, _c1#11, UDF(_c0#10) AS time#14L]
: +- Relation[_c0#10,_c1#11] csv
+- Project [_1#133 AS pprChannel#136, _2#134 AS ppr#137]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#133, assertnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#134]
+- ExternalRDD [obj#132]
== Optimized Logical Plan ==
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
Project [UDF(t0#93L, t1#29L) AS time#127L, unnamed AS channel#124]
+- Filter ((isnotnull(pt0#98) && isnotnull(pt1#82)) && ((((t0#93L >= 0) && (pt0#98 = 1)) && (pt1#82 = 2)) && ((t1#29L - t0#93L) >= 50000000)))
+- Window [lag(t1#29L, 1, -1) windowspecdefinition(unnamed, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS t0#93L, lag(pt1#82, 1, 0) windowspecdefinition(unnamed, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS pt0#98], [unnamed], [t1#29L ASC NULLS FIRST]
+- Project [t1#29L, if ((isnull(dv0#75) || isnull(dv1#58))) null else if ((isnull(dv0#75) || isnull(dv1#58))) null else UDF(dv0#75, dv1#58) AS pt1#82]
+- Filter (((t0#70L >= 0) && NOT isnan(dv0#75)) && if ((isnull(dv0#75) || isnull(dv1#58))) null else if ((isnull(dv0#75) || isnull(dv1#58))) null else UDF(dv0#75, dv1#58) IN (1,2))
+- Window [lag(t1#29L, 1, -1) windowspecdefinition(unnamed, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS t0#70L, lag(dv1#58, 1, NaN) windowspecdefinition(unnamed, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS dv0#75], [unnamed], [t1#29L ASC NULLS FIRST]
+- Project [t1#29L, abs(UDF(v0#49, v1#35)) AS dv1#58]
+- Filter ((t0#42L >= 0) && NOT isnan(v0#49))
+- Window [lag(t1#29L, 1, -1) windowspecdefinition(unnamed, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS t0#42L, lag(v1#35, 1, NaN) windowspecdefinition(unnamed, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS v0#49], [unnamed], [t1#29L ASC NULLS FIRST]
+- Project [UDF(_c0#10) AS t1#29L, UDF(_c1#11) AS v1#35]
+- Filter ((UDF(_c0#10) >= 0) && NOT isnan(UDF(_c1#11)))
+- Relation[_c0#10,_c1#11] csv
and
Project [_1#133 AS pprChannel#136, _2#134 AS ppr#137]
+- Filter (isnotnull(_1#133) && (unnamed = _1#133))
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#133, assertnotnull(input[0, scala.Tuple2, true])._2 AS _2#134]
+- ExternalRDD [obj#132]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
Project [UDF(t0#93L, t1#29L) AS time#127L, unnamed AS channel#124]
+- Filter ((isnotnull(pt0#98) && isnotnull(pt1#82)) && ((((t0#93L >= 0) && (pt0#98 = 1)) && (pt1#82 = 2)) && ((t1#29L - t0#93L) >= 50000000)))
+- Window [lag(t1#29L, 1, -1) windowspecdefinition(unnamed, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS t0#93L, lag(pt1#82, 1, 0) windowspecdefinition(unnamed, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS pt0#98], [unnamed], [t1#29L ASC NULLS FIRST]
+- Project [t1#29L, if ((isnull(dv0#75) || isnull(dv1#58))) null else if ((isnull(dv0#75) || isnull(dv1#58))) null else UDF(dv0#75, dv1#58) AS pt1#82]
+- Filter (((t0#70L >= 0) && NOT isnan(dv0#75)) && if ((isnull(dv0#75) || isnull(dv1#58))) null else if ((isnull(dv0#75) || isnull(dv1#58))) null else UDF(dv0#75, dv1#58) IN (1,2))
+- Window [lag(t1#29L, 1, -1) windowspecdefinition(unnamed, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS t0#70L, lag(dv1#58, 1, NaN) windowspecdefinition(unnamed, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS dv0#75], [unnamed], [t1#29L ASC NULLS FIRST]
+- Project [t1#29L, abs(UDF(v0#49, v1#35)) AS dv1#58]
+- Filter ((t0#42L >= 0) && NOT isnan(v0#49))
+- Window [lag(t1#29L, 1, -1) windowspecdefinition(unnamed, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS t0#42L, lag(v1#35, 1, NaN) windowspecdefinition(unnamed, t1#29L ASC NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS v0#49], [unnamed], [t1#29L ASC NULLS FIRST]
+- Project [UDF(_c0#10) AS t1#29L, UDF(_c1#11) AS v1#35]
+- Filter ((UDF(_c0#10) >= 0) && NOT isnan(UDF(_c1#11)))
+- Relation[_c0#10,_c1#11] csv
and
Project [_1#133 AS pprChannel#136, _2#134 AS ppr#137]
+- Filter (isnotnull(_1#133) && (unnamed = _1#133))
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, assertnotnull(input[0, scala.Tuple2, true])._1, true, false) AS _1#133, assertnotnull(input[0, scala.Tuple2, true])._2 AS _2#134]
+- ExternalRDD [obj#132]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
Yes, df1 is the result of a fairly complex computation; that's why it is so big. df2 is a very small DF that always comes from a Map with at most about 50 to 100 entries, brought into Spark with sc.parallelize. So I could use crossJoin plus a where as a workaround, but I want to understand why Spark thinks this is a cartesian product.
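For completeness, the workaround I have in mind would look roughly like this (just a sketch):

import org.apache.spark.sql.functions.col

val res = df1
  .crossJoin(df2)
  .where(col("channel") === col("pprChannel"))
  .select(col("channel"), col("time"), col("ppr"))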
Followup 2
I am now using a different approach. Since the first DF is the huge one that results from a complex calculation, and the second one always originates from a small map, I changed my algorithm to use ordinary map operations instead:
// Broadcast the small map so each partition can do the lookup locally
val bDF2Data = sc.broadcast(df2Data)

val res =
  df1.
    as[(Long, String)].
    mapPartitions { iter =>
      val df2Data = bDF2Data.value
      iter.
        flatMap {
          case (time, channel) =>
            // Keep only rows whose channel is present in the map
            df2Data.get(channel).map(ppr => (time, channel, ppr))
        }
    }.
    toDF("time", "channel", "ppr").
    // More operations ...