spark join raises "Detected cartesian product for INNER join" - pyspark

I have a dataframe, and for each row I want to add new_col = max(some_column0), grouped by some other column1:
from pyspark.sql.functions import max  # without this import, Python's builtin max is used
maxs = df0.groupBy("catalog").agg(max("row_num").alias("max_num")).withColumnRenamed("catalog", "catalogid")
df0.join(maxs, df0.catalog == maxs.catalogid).take(4)
On the second line I get an error:
AnalysisException: u'Detected cartesian product for INNER join between
logical plans\nProject ... Use the CROSS JOIN syntax to allow
cartesian products between these relations.;'
What am I not understanding: why does Spark detect a cartesian product here?
A possible workaround for this error: I save the DF to a Hive table, then initialize the DF again as a select from that table (replacing these two lines with a Hive query makes no difference). But I don't want to save the DF.

As described in Why does spark think this is a cross/cartesian join, it may be caused by:
This happens because you join structures sharing the same lineage and this leads to a trivially equal condition.
As for how the cartesian product gets generated, you can refer to Identifying and Eliminating the Dreaded Cartesian Product.

Try persisting the dataframes before joining them. That worked for me.
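A minimal sketch of that suggestion, reusing df0 and maxs from the question (persist marks the plan for caching, and running an action materializes it; this answer reports that the error then goes away):
from pyspark.sql.functions import max
# recompute the aggregate, persist it, and run an action before the join
maxs = df0.groupBy("catalog").agg(max("row_num").alias("max_num")).withColumnRenamed("catalog", "catalogid").persist()
maxs.count()  # action forces materialization of the cached aggregate
df0.join(maxs, df0.catalog == maxs.catalogid).take(4)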

I've faced the same cartesian product problem with my join.
To overcome it, I used aliases on the DataFrames. See the example:
from pyspark.sql.functions import col
df1.alias("buildings").join(df2.alias("managers"), col("managers.distinguishedName") == col("buildings.manager"))

Related

What is Alteryx right outer self join equal to, in pyspark?

I am trying to migrate an Alteryx workflow to PySpark dataframes, and as part of this I came across a right outer self join on different columns (ph_id_1 and ph_id_2). When I do the same in PySpark I do not get the correct output; I have tried anti and left anti joins, and all give the same result. Any suggestion on how to do this the PySpark way or the SQL way?
Tried:
df_new = df_1.join(df_2,[df_1['ph_id_1'] == df_2['ph_id_2']],how='left_anti')
, and
df_new = df_1.filter(df_1['ph_id_1'] != df_2['ph_id_2'])
Both give the same result, which differs from the expected one.
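For reference, a right outer self join on those columns looks like this in PySpark (a sketch, assuming df_1 and df_2 are the two copies of the table from the question):
df_new = df_1.join(df_2, df_1['ph_id_1'] == df_2['ph_id_2'], how='right_outer')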

Spark scala join dataframe within a dataframe

I have a requirement where I need to join dataframes A and B, calculate a column, and use that calculated value in another join between the same two dataframes with different join conditions.
e.g.:
DF_Combined = A_DF.join(B_DF,'Join-Condition',"left_outer").withColumn(col1,'value')
After doing the above I need to do the same join, but use the value calculated in the previous join.
DF_Final = A_DF.join(B_DF, 'New join condition', "left_outer").withColumn(col2, DF_Combined.col1*vol1*10)
When I try to do this I get a Cartesian product issue.
You can't use a column that is not present in the dataframe. When you do A_DF.join(B_DF, ..., the resulting dataframe only has columns from A_DF and B_DF. If you want the new column, you need to use DF_Combined.
From your question I believe you don't need another join; you have 2 possible options:
1. When you do the first join, calculate vol1*10 at that point.
2. After the join, do DF_Combined.withColumn... (as sketched below).
But please remember: withColumn(name, expr) creates a column named name, setting its value to the result of expr. So .withColumn(DF_Combined.col1, vol1*10) does not make sense.
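A PySpark sketch of the second option (the question is in Scala, but the pattern is the same; join_condition, some_expression, and vol1 stand in for the unspecified values from the question):
from pyspark.sql.functions import col

DF_Combined = A_DF.join(B_DF, join_condition, "left_outer").withColumn("col1", some_expression)
# derive the second value from the already-computed column instead of re-joining
DF_Final = DF_Combined.withColumn("col2", col("col1") * vol1 * 10)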

save dropped duplicates in pyspark RDD

From here, Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame, we learned how to drop duplicated observations based on some specific variables. What if I want to keep those duplicate observations, in the form of an RDD; how should I do that? I guess rdd.subtract() may not be efficient if the RDD contains billions of observations. So besides rdd.subtract(), is there any other way I can use?
If you need both datasets, one having only the distinct values and the other having the duplicates, you should use subtract; that will provide an accurate result. In case you need only the duplicates, you can use SQL to get them (the window function result has to be filtered in an outer query):
df.createOrReplaceTempView('mydf')
df2 = spark.sql("select * from (select *, row_number() over(partition by <<list of columns used to identify duplicates>> order by <<any column/s not used to identify duplicates>>) as row_num from mydf) t where row_num > 1").drop('row_num')
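An equivalent sketch with the DataFrame API, assuming key_cols is the list of columns that identify duplicates and order_col is any other column:
from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

w = Window.partitionBy(*key_cols).orderBy(order_col)
# rows numbered 2 and higher within each partition are the duplicates
dupes = df.withColumn('row_num', row_number().over(w)).filter('row_num > 1').drop('row_num')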

Converting Pandas into Pyspark

So I'm trying to convert my Python algorithm to Spark-friendly code, and I'm having trouble with this one:
indexer = recordlinkage.SortedNeighbourhoodIndex \
(left_on=column1, right_on=column2, window=41)
pairs = indexer.index(df_1,df_2)
It basically compares one column against the other and generates index pairs for those likely to be the same (Record Matching).
My code:
df1 = spark.read.load("*.csv")
df2 = spark.read.load("*.csv")
func_udf = udf(index.indexer) ????
df = df.withColumn('column1',func_udf(df1.column1,df2.column2)) ???
I've been using udfs for transformations involving just one dataframe and one column, but how do I run a function that requires two arguments: one column from one dataframe and the other from another dataframe? I can't join both dataframes as they have different lengths.
That's not how udfs work. UserDefinedFunctions can operate only on data that comes from a single DataFrame:
Standard udf on data from a single row.
pandas_udf on data from a single partition or single group.
I can't join both dataframes as they have different lengths.
Join is exactly what you should do (standard or manual broadcast). There is no need for the objects to be of the same length; a Spark join is a relational join, not a row-wise merge.
For similarity joins you can use built-in approx join tools:
Efficient string matching in Apache Spark
Optimize Spark job that has to calculate each to each entry similarity and output top N similar items for each
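For the exact-match part, a minimal sketch of the join-based approach (broadcast is optional and only sensible if one side is small; column names are the ones from the question):
from pyspark.sql.functions import broadcast

# relational join: the two frames do not need the same number of rows
pairs = df1.join(broadcast(df2), df1.column1 == df2.column2, "inner")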

Spark Scala copy column from one dataframe to another

I have a modified version of the original dataframe, on which I did clustering.
Now I want to bring the predicted column back to the original DF (the index is fine, so it matches). How am I supposed to do this?
With this code I get an error.
println("Predicted:")
dfWithOutput.show
println("Original:")
originalDF = originalDF.withColumn("cluster", dfWithOutput.col("prediction"))
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) prediction#2121 missing from (list of columns in the original df)
You need to join the two dataframes and then select the columns you're interested in.
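A PySpark sketch of that join (the same pattern applies in Scala), assuming both frames carry a shared, hypothetical id index column:
# bring only the index and prediction over, then rename to the desired column
originalWithCluster = originalDF.join(dfWithOutput.select("id", "prediction"), on="id", how="left").withColumnRenamed("prediction", "cluster")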