Dynamic boolean join in pyspark - pyspark

I have two PySpark DataFrames with the same schema:
df_source:
id, name, age
df_target:
id, name, age
"id" is the primary key column in both tables; the rest are attribute columns.
I accept the primary and attribute column lists from the user, as below:
primary_columns = ["id"]
attribute_columns = ["name","age"]
I need to join the two DataFrames dynamically, like this:
df_update = df_source.join(df_target, (df_source["id"] == df_target["id"]) & ((df_source["name"] != df_target["name"]) | (df_source["age"] != df_target["age"])) ,how="inner").select([df_source[col] for col in df_source.columns])
How can I build this join condition dynamically in PySpark, since the number of attribute and primary key columns can change with the user input? Please help.

IIUC, you can achieve the desired output with an inner join on primary_columns and a where clause that loops over attribute_columns.
Since the two DataFrames have the same column names, use alias to differentiate column names after the join.
from functools import reduce
from pyspark.sql.functions import col

df_update = df_source.alias("s")\
    .join(df_target.alias("t"), on=primary_columns, how="inner")\
    .where(
        reduce(
            lambda a, b: a | b,
            [(col("s." + c) != col("t." + c)) for c in attribute_columns]
        )
    )\
    .select("s.*")
Use reduce to apply the bitwise OR operation on the columns in attribute_columns.
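One caveat with `!=`: in Spark SQL a comparison evaluates to NULL when either side is NULL, so a row whose only change is to or from NULL can be missed by the filter. Below is a minimal sketch that builds the same dynamic predicate as a SQL string using the null-safe equality operator `<=>`. The helper name `build_change_condition` and the aliases `s`/`t` are assumptions for illustration, not part of PySpark.

```python
def build_change_condition(attribute_columns, left="s", right="t"):
    """Build a Spark SQL predicate that is true when any attribute differs.

    Hypothetical helper: uses the null-safe operator <=> so that
    NULL vs. non-NULL counts as a difference.
    """
    if not attribute_columns:
        raise ValueError("attribute_columns must not be empty")
    parts = [f"NOT ({left}.{c} <=> {right}.{c})" for c in attribute_columns]
    return " OR ".join(parts)


condition = build_change_condition(["name", "age"])
print(condition)
```

Since `where` also accepts a SQL expression string, the result should be usable as `.where(condition)` in place of the `reduce` expression, as long as the DataFrames are aliased as "s" and "t".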

Related

How to retrieve column value by passing another column value with IN clause in spark

I need to read a column from one DataFrame, filtered by a condition on another column of the same DataFrame, and then pass the resulting values as an IN condition to select rows from a second DataFrame. How can I achieve this with Spark DataFrames?
In SQL it will be like:
select distinct(A.date) from table A where A.key in (select B.key from table B where cond='D');
I tried like below:
val Bkey: DataFrame = b_df.filter(col("cond")==="D").select(col("key"))
Table A's data is in the a_df DataFrame and table B's data is in the b_df DataFrame. How can I pass the Bkey values to the outer query in Spark?
You can do a semi join:
val result = a_df.join(b_df.filter(col("cond")==="D"), Seq("key"), "left_semi").select("date").distinct()
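To make the semantics of the semi join concrete, here is a plain-Python sketch (no Spark involved). It keeps each left-hand row whose key appears on the right-hand side, which is what the IN subquery asks for. The function name `left_semi_join` and the sample rows are made up for illustration.

```python
def left_semi_join(left_rows, right_rows, key):
    """Keep each left row whose key value appears in right_rows.

    Unlike an inner join, a left row is returned at most once,
    however many right rows share its key.
    """
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] in right_keys]


a_rows = [
    {"key": 1, "date": "2020-01-01"},
    {"key": 2, "date": "2020-01-02"},
    {"key": 3, "date": "2020-01-03"},
]
b_rows = [
    {"key": 1, "cond": "D"},
    {"key": 1, "cond": "D"},  # duplicate key on the right side
    {"key": 2, "cond": "X"},  # filtered out by cond = 'D'
    {"key": 3, "cond": "D"},
]

# Equivalent of: select distinct(A.date) ... where A.key in (select B.key ... where cond='D')
filtered_b = [row for row in b_rows if row["cond"] == "D"]
matched = left_semi_join(a_rows, filtered_b, "key")
result = sorted({row["date"] for row in matched})
print(result)
```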

Variable substitution in scala

I have two dataframes in scala both having data from two different tables but of same structure (srcdataframe and tgttable). I have to join these two based on composite primary key and select few columns and append two columns the code for which is as below:
for(i <- 2 until numCols) {
srcdataframe.as("A")
.join(tgttable.as("B"), $"A.INSTANCE_ID" === $"B.INSTANCE_ID" &&
$"A.CONTRACT_LINE_ID" === $"B.CONTRACT_LINE_ID", "inner")
.filter($"A." + srcColnm(i) =!= $"B." + srcColnm(i))
.select($"A.INSTANCE_ID",
$"A.CONTRACT_LINE_ID",
"$"+"\""+"A."+srcColnm(i)+"\""+","+"$"+"\""+"B."+srcColnm(i)+"\"")
.withColumn("MisMatchedCol",lit("\""+srcColnm(i)+"\""))
.withColumn("LastRunDate",current_timestamp.cast("long"))
.createOrReplaceTempView("IPF_1M_Mismatch");
hiveSQLContext.sql("Insert into table xxxx.f2f_Mismatch1 select t.* from (select * from IPF_1M_Mismatch) t");}
Here is what I am trying to do:
Inner join of srcdataframe and tgttable based on instance_id and contract_line_id.
Select only instance_id, contract_line_id, mismatched_col_values, hardcode of mismatched_col_nm, timestamp.
srcColnm(i) is an array of strings which contains the non-primary keys to be compared.
However, I am not able to resolve the variables inside the dataframe in the for loop. I tried looking up for solutions here and here. I got to know that it may be because of the way spark substitutes the variables only at compile time, in this case I'm not sure how to resolve it.
Instead of creating columns with $, you can simply use strings or the col() function. I would also recommend performing the join outside of the for loop, as it's an expensive operation. Here is slightly changed code; the main difference that solves your problem is in the select:
val df = srcdataframe.as("A")
  .join(tgttable.as("B"), Seq("INSTANCE_ID", "CONTRACT_LINE_ID"), "inner")
for (columnName <- srcColnm) {
  df.filter(col("A." + columnName) =!= col("B." + columnName))
    .select("INSTANCE_ID", "CONTRACT_LINE_ID", "A." + columnName, "B." + columnName)
    .withColumn("MisMatchedCol", lit(columnName))
    .withColumn("LastRunDate", current_timestamp().cast("long"))
    .createOrReplaceTempView("IPF_1M_Mismatch")
  // Hive command
}
Regarding the problem in select:
$ is short for the col() function; it selects a column in the DataFrame by name. The problem in the select is that the first two arguments, col("A.INSTANCE_ID") and col("A.CONTRACT_LINE_ID"), are columns ($ replaced by col() for clarity).
However, the next two arguments are strings. It is not possible to mix these two, either all arguments should be columns or all are strings. As you used "A."+srcColnm(i) to build up the column name $ can't be used, however, you could have used col("A."+srcColnm(i)).

Minus logic implementation not working with spark/scala

Minus Logic in Hive:
The below (Hive)query will return only records available in left side table ( Full_Table ft), but not in both.
Select ft.* from Full_Table ft left join Stage_Table stg where stg.primary_key1 IS null and stg.primary_key2 IS null
I tried to implement the same in Spark/Scala using the following method (to support both a single primary key and a composite key), but the joined result set does not have the key columns from the right table, so I am not able to apply the stg.primary_key2 IS null condition to it.
ft.join(stg, usingColumns, "left_outer") // usingColumns is a Seq, to support composite key joins
Please suggest how to implement minus logic in Spark/Scala.
Thanks,
Saravanan
https://www.linkedin.com/in/saravanan303/
If your tables have the same columns, you can use the except method of Dataset:
fullTable.except(stageTable)
If they don't, but you are interested only in a subset of columns that exists in both tables, you can first select those columns with the select transformation and then use except:
val fullTableSelectedColumns = fullTable.select(c1,c2,c3)
val stageTableSelectedColumns = stageTable.select(c1,c2,c3)
fullTableSelectedColumns.except(stageTableSelectedColumns)
Otherwise, you can use the join and filter transformations, keeping only the rows that have no match in the stage table:
fullTable
  .join(stageTable, fullTable("primary_key") === stageTable("primary_key"), "left")
  .filter(stageTable("primary_key").isNull)
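The key-based variant of the minus logic is an anti join: keep the left rows whose key has no match on the right. Spark also accepts a "left_anti" join type, which expresses this without the null filter. The plain-Python sketch below only illustrates the semantics for a composite key; `left_anti_join` and the sample rows are hypothetical.

```python
def left_anti_join(left_rows, right_rows, keys):
    """Keep the left rows whose (composite) key has no match in right_rows."""
    right_keys = {tuple(row[k] for k in keys) for row in right_rows}
    return [
        row for row in left_rows
        if tuple(row[k] for k in keys) not in right_keys
    ]


full_table = [
    {"primary_key1": 1, "primary_key2": "a", "value": 10},
    {"primary_key1": 1, "primary_key2": "b", "value": 20},
    {"primary_key1": 2, "primary_key2": "a", "value": 30},
]
stage_table = [
    {"primary_key1": 1, "primary_key2": "a", "value": 99},  # same key, different value
]

only_in_full = left_anti_join(full_table, stage_table, ["primary_key1", "primary_key2"])
print(only_in_full)
```

Note the difference from except: the first row is removed here because its key matches, even though its value differs. A whole-row set difference such as except would keep it.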

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items=[]
for j in range(len(cat_col)):
    var=cat_col[j]
    count_unique_items.append(data.select(var).distinct().rdd.map(lambda r:r[0]).count())
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?
Try using approxCountDistinct or countDistinct (since Spark 2.1 approxCountDistinct is deprecated in favor of approx_count_distinct):
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
Keep in mind that counting distinct elements is expensive either way.
You can do something like this, but as stated above, distinct element counting is expensive. The single * passes in each value as an argument, so the return value will be 1 row X N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct
distvals = df.agg(*(approxCountDistinct(col(c), rsd = 0.01).alias(c) for c in
df.columns))
You can get the frequent items of each column with
df.stat.freqItems([list of column names], [minimum frequency (default = 1%)])
This returns a DataFrame with the frequent values of each column, but if you want a DataFrame with just the distinct count of each column, use this:
from pyspark.sql.functions import countDistinct
df.select( [ countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()
The counting part is taken from: check number of unique values in each column of a matrix in spark
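The likely reason the single agg/select versions help is that the loop in the question launches a separate Spark job per column, whereas one aggregation over all columns computes every count in one pass over the data. The plain-Python sketch below illustrates that one-pass idea only. It is not how Spark implements it, and `count_distinct_per_column` is a made-up name.

```python
def count_distinct_per_column(rows, columns):
    """Count distinct non-null values for every column in a single pass."""
    seen = {c: set() for c in columns}
    for row in rows:  # one pass over the data, all columns at once
        for c in columns:
            value = row[c]
            if value is not None:  # mirrors countDistinct, which ignores NULLs
                seen[c].add(value)
    return {c: len(values) for c, values in seen.items()}


rows = [
    {"color": "red", "size": "S"},
    {"color": "blue", "size": "S"},
    {"color": "red", "size": None},
    {"color": "green", "size": "M"},
]
counts = count_distinct_per_column(rows, ["color", "size"])
print(counts)
```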

Join Dataframes in Spark

I have joined two Dataframes in spark using below code -
Dataframes are: expDataFrame, accountList
val expDetails = expDataFrame.as("fex").join(accountList.as("acctlist"),$"fex.acct_id" === $"acctlist.acct_id", "inner")
Now I am trying to show acct_id from both DataFrames.
I wrote the following:
expDetails.select($"fex.acct_id",$"acctlist.acct_id").show
but I get the same column name, acct_id, twice.
I want two unique column names, like fex_acct_id and acctlist_acct_id, to identify which DataFrame each column comes from.
You simply have to add an alias to the columns using the as or alias methods. This will do the job:
expDetails.select(
  $"fex.acct_id".as("fex_acct_id"),
  $"acctlist.acct_id".as("acctlist_acct_id")
).show
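When many columns collide, the alias names can be generated instead of typed by hand. Below is a minimal plain-Python sketch of the naming scheme; the helper `prefixed_aliases` is hypothetical, and the pairs it returns would still have to be turned into aliased column expressions, e.g. col(qualified).alias(name) in PySpark.

```python
def prefixed_aliases(alias_to_columns):
    """Map each 'alias.column' reference to a unique 'alias_column' output name."""
    pairs = []
    for alias, columns in alias_to_columns.items():
        for c in columns:
            pairs.append((f"{alias}.{c}", f"{alias}_{c}"))
    return pairs


pairs = prefixed_aliases({"fex": ["acct_id"], "acctlist": ["acct_id"]})
for qualified, name in pairs:
    print(qualified, "->", name)
```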