Spark scala join dataframe within a dataframe - scala

I have a requirement where I need to join dataframes A and B, calculate a column, and then use that calculated value in another join between the same 2 dataframes with a different join condition.
e.g.:
DF_Combined = A_DF.join(B_DF, 'join condition', "left_outer").withColumn("col1", 'value')
After doing the above, I need to do the same join again, but using the value calculated in the previous join:
DF_Final = A_DF.join(B_DF, 'new join condition', "left_outer").withColumn("col2", DF_Combined.col1 * vol1 * 10)
When I try to do this I get a Cartesian product issue.

You can't use a column that is not present in the dataframe. When you do A_DF.join(B_DF, ..., the resulting dataframe only has columns from A_DF and B_DF. If you want to have the new column, you need to use DF_Combined.
From your question I believe you don't need another join; you have 2 possible options:
1. When you do the first join, calculate vol1*10 at that point.
2. After the join, do DF_Combined.withColumn(...).
But please remember - withColumn(name, expr) creates a column with the given name, setting its value to the result of expr. So .withColumn(DF_Combined.col1, vol1*10) does not make sense.
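For illustration, a minimal PySpark sketch of both options (the same calls exist in the Scala API); join_condition, computed_value and vol1 stand in for the placeholders from the question:
from pyspark.sql.functions import col

# Option 1: do the whole calculation inside the first (and only) join,
# chaining withColumn so that col2 can reuse col1 right away.
DF_Combined = (A_DF.join(B_DF, join_condition, "left_outer")
                   .withColumn("col1", computed_value)
                   .withColumn("col2", col("col1") * col("vol1") * 10))

# Option 2: keep the first join as it is and derive the new column afterwards,
# on DF_Combined itself. Note that the first argument of withColumn is the
# new column *name* (a string), not a column reference like DF_Combined.col1.
DF_Final = DF_Combined.withColumn("col2", col("col1") * col("vol1") * 10)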

Related

PySpark group by collect_list over a window

I have a data frame with multiple columns. I'm trying to aggregate a few columns using collect_list, grouped on id, over a window function. I'm trying something like this:
exprs = [(collect_list(x).over(window)).alias(f"{x}_list") for x in cols]
df = df.groupBy('id').agg(*exprs)
I'm getting the below error:
expression is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get
If I do the same for a single column instead of multiple columns, it works.
I found a way around this. I guess window functions won't work inside agg(*exprs) operations. So I modified the above to:
for col_name in cols:
    df = df.withColumn(col_name + "_list", collect_list(col(col_name)).over(window_spec))
This served my purpose.
Thank you.
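For reference, a self-contained sketch of that loop; the id/val/ts columns and the window ordered by ts are made-up stand-ins:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, collect_list

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a", 10), (1, "b", 20), (2, "c", 30)], ["id", "val", "ts"])

cols = ["val", "ts"]
# Window partitioned by id; rowsBetween makes every row see the full partition.
window_spec = (Window.partitionBy("id")
                     .orderBy("ts")
                     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

for col_name in cols:
    df = df.withColumn(col_name + "_list",
                       collect_list(col(col_name)).over(window_spec))

df.show()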

save dropped duplicates in pyspark RDD

From here, Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame, we learned how to drop duplicated observations based on some specific variables. What if I want to save those duplicate observations in the form of an RDD - how shall I do that? I guess rdd.subtract() may not be efficient if the RDD contains billions of observations. So besides using rdd.subtract(), is there any other way I can do this?
If you need both the datasets, one having only the distinct values and the other having the duplicates, you should use subtract. That will provide an accurate result. In case you need only the duplicates, you can use sql to get that.
df.createOrReplaceTempView('mydf')
df2 = spark.sql("select * from (select *, row_number() over (partition by <<list of columns used to identify duplicates>> order by <<any column/s not used to identify duplicates>>) as row_num from mydf) where row_num > 1").drop('row_num')
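The same thing can be sketched with the DataFrame API instead of SQL, assuming a placeholder key column key and an ordering column ts:
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Number the rows inside each group of identical keys; every row after the
# first one is a duplicate we want to keep aside.
w = Window.partitionBy("key").orderBy("ts")
duplicates = (df.withColumn("row_num", row_number().over(w))
                .where(col("row_num") > 1)
                .drop("row_num"))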

Converting Pandas into Pyspark

So I'm trying to convert my python algorithm to Spark friendly code and I'm having trouble with this one:
indexer = recordlinkage.SortedNeighbourhoodIndex(
    left_on=column1, right_on=column2, window=41)
pairs = indexer.index(df_1, df_2)
It basically compares one column against the other and generates index pairs for those likely to be the same (Record Matching).
My code:
df1 = spark.read.load("*.csv")
df2 = spark.read.load("*.csv")
func_udf = udf(index.indexer) ????
df = df.withColumn('column1',func_udf(df1.column1,df2.column2)) ???
I've been using udf for transformations involving just one dataframe and one column, but how do I run a function that requires two arguments, one column from one dataframe and other from other dataframe? I can't join both dataframes as they have different lengths.
That's not how udfs work. UserDefinedFunctions can operate only on data that comes from a single DataFrame:
A standard udf operates on data from a single row.
pandas_udf operates on data from a single partition or a single group.
I can't join both dataframes as they have different lengths.
Join is exactly what you should do (standard or manual broadcast join). There is no need for the objects to be of the same length - a Spark join is a relational join, not a row-wise merge.
For similarity joins you can use built-in approx join tools:
Efficient string matching in Apache Spark
Optimize Spark job that has to calculate each to each entry similarity and output top N similar items for each
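As a rough illustration of such a tool (not the recordlinkage API), here is a MinHashLSH sketch over character trigrams along the lines of the linked answers; the column names column1/column2 and the 0.6 distance threshold are assumptions:
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, MinHashLSH, NGram, RegexTokenizer

# Turn each string into character trigrams, hash them into sparse vectors,
# and index the vectors with MinHash for an approximate Jaccard join.
pipeline = Pipeline(stages=[
    RegexTokenizer(pattern="", inputCol="text", outputCol="chars", minTokenLength=1),
    NGram(n=3, inputCol="chars", outputCol="ngrams"),
    HashingTF(inputCol="ngrams", outputCol="vectors"),
    MinHashLSH(inputCol="vectors", outputCol="lsh", numHashTables=5),
])

left = df1.withColumnRenamed("column1", "text")
right = df2.withColumnRenamed("column2", "text")

model = pipeline.fit(left)

# Candidate pairs whose Jaccard distance is below 0.6 -- roughly the
# "likely the same" index pairs that recordlinkage produces.
pairs = model.stages[-1].approxSimilarityJoin(
    model.transform(left), model.transform(right), 0.6, distCol="jaccard")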

spark join raises "Detected cartesian product for INNER join"

I have a dataframe and I want to add, for each row, new_col = max(some_column0) grouped by some other column1:
maxs = df0.groupBy("catalog").agg(max("row_num").alias("max_num")).withColumnRenamed("catalog", "catalogid")
df0.join(maxs, df0.catalog == maxs.catalogid).take(4)
And on the second line I get an error:
AnalysisException: u'Detected cartesian product for INNER join between
logical plans\nProject ... Use the CROSS JOIN syntax to allow
cartesian products between these relations.;'
What I don't understand is why Spark finds a cartesian product here.
A possible way to get this error: I save the DF to a Hive table, then initialize the DF again as a select from the table. Or replace these 2 lines with a Hive query - it doesn't matter. But I don't want to save the DF.
As described in Why does spark think this is a cross/cartesian join, it may be caused by:
This happens because you join structures sharing the same lineage and this leads to a trivially equal condition.
As for how the cartesian product was generated, you can refer to Identifying and Eliminating the Dreaded Cartesian Product.
Try to persist the dataframes before joining them. Worked for me.
I faced the same cartesian product problem with my join.
To overcome it, I used aliases on the DataFrames. See the example:
from pyspark.sql.functions import col
df1.alias("buildings").join(df2.alias("managers"), col("managers.distinguishedName") == col("buildings.manager"))
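Putting the two suggestions above together against the question's code - a sketch only; whether persisting alone or aliasing alone is enough may depend on the Spark version:
from pyspark.sql.functions import col, max as max_

# Aggregate the per-catalog maximum and persist it so the join no longer
# shares a live lineage with df0.
maxs = (df0.groupBy("catalog")
           .agg(max_("row_num").alias("max_num"))
           .withColumnRenamed("catalog", "catalogid")
           .persist())

# Explicit aliases keep the two sides of the self-join distinguishable.
joined = df0.alias("d").join(maxs.alias("m"),
                             col("d.catalog") == col("m.catalogid"))
joined.take(4)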

How to fill missing values with values from other dataframes

I have one data frame with an ID:String column, a Type:Int column and a Name:String column.
This data frame has a lot of missing values in the Name column.
But I also have three other dataframes that contain an ID column and a Name column.
What I'd like to do is to fill the missing values in the first Dataframe with values from the others. The other dataframes do not contain all the IDs belonging to the first dataframe, plus they can also contain IDs that are not present in the first dataframe.
What is the right approach in this case? I know I can combine two DFs like:
df1.join(df2, df1("ID")===df2("ID"), "left_outer")
But since I know that all entries in the first dataframe where type=2 already have a name, I'd like to restrict this join to rows where type=1.
Any idea how I can retrieve Name values from the three DFs in order to fill the Name column in my original dataframe?
You can split, join the subset of interest and gather everything back:
df1
// Select ones that may require filling
.where($"type" === 1)
// Join
.join(df2, Seq("ID"), "left_outer")
// Replace NULL if needed
.select($"ID", $"Type", coalesce(df1("Name"), df2("Name")).alias("Name"))
// Union with subset which doesn't require filling
.union(df1.where($"type" === 2)) // Or =!= 1 as suggested by #AlbertoBonsanto
If the type column is nullable, you should cover that scenario separately with another union over df1.where($"type".isNull).
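Since the question mentions three lookup dataframes, here is a PySpark sketch of the same idea chained over all of them; df2, df3 and df4 are placeholder names for the three other dataframes:
from pyspark.sql.functions import coalesce, col

# Rename each lookup's Name column so the joined result stays unambiguous.
lookups = [df2.select("ID", col("Name").alias("name2")),
           df3.select("ID", col("Name").alias("name3")),
           df4.select("ID", col("Name").alias("name4"))]

to_fill = df1.where(col("Type") == 1)
for lookup in lookups:
    to_fill = to_fill.join(lookup, "ID", "left_outer")

# Take the first non-null name, keeping the original column when it is set.
filled = to_fill.select(
    "ID", "Type",
    coalesce(col("Name"), col("name2"), col("name3"), col("name4")).alias("Name"))

# Re-attach the rows that already had a Name (type = 2 per the question).
result = filled.union(df1.where(col("Type") == 2))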