PySpark join on a secondary column when the primary column finds no match

The following code is intended to join on the secondary column only when the primary column finds no match, but it does not work: it still joins on both columns, yielding two rows when I should get only one.
df = df.join(
    msa,
    (msa.msa_join_key == df.metro_join_key)
    | (
        (msa.msa_join_key != df.metro_join_key)
        & (msa.msa_join_key == df.state_join_key)
    ),
    "left",
)
This code should work, so I am not sure why it does not. I have seen similar questions (for example, "pyspark - join with OR condition"), but their answers do not solve this either, since they too yield additional rows.
Is this a known bug in pyspark? Is there a different way to do this join?
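One possible reading of the extra row, sketched in plain Python rather than PySpark: a join condition is evaluated independently for every (left row, right row) pair, so a left row whose metro key matches one msa row and whose state key matches a different msa row satisfies the OR condition twice. The sample keys and the helper names `or_predicate` and `fallback_match` are made up for illustration.

```python
# Plain-Python model of the join predicate from the question (not PySpark).
# All keys and values below are hypothetical sample data.
msa = [
    {"msa_join_key": "NYC", "msa_name": "New York metro"},
    {"msa_join_key": "NY", "msa_name": "New York state"},
]
row = {"metro_join_key": "NYC", "state_join_key": "NY"}


def or_predicate(left, right):
    """The ON condition from the question, evaluated for one pair of rows."""
    key = right["msa_join_key"]
    return key == left["metro_join_key"] or (
        key != left["metro_join_key"] and key == left["state_join_key"]
    )


def fallback_match(left, right_rows):
    """Try the metro key first; use the state key only if nothing matched."""
    primary = [r for r in right_rows if r["msa_join_key"] == left["metro_join_key"]]
    if primary:
        return primary
    return [r for r in right_rows if r["msa_join_key"] == left["state_join_key"]]


# The pairwise OR condition accepts both msa rows for the same left row.
or_matches = [m for m in msa if or_predicate(row, m)]
# The fallback logic needs to know whether *any* primary match exists.
fallback_matches = fallback_match(row, msa)
print(len(or_matches), len(fallback_matches))
```

The fallback depends on whether any primary match exists for the whole left row, which a per-pair predicate cannot see. In PySpark this kind of logic is commonly expressed as two left joins (one per key) followed by `coalesce` over the joined columns, instead of a single OR condition.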

Related

Dynamic boolean join in pyspark

I have two PySpark DataFrames with the same schema, as below:
df_source:
id, name, age
df_target:
id, name, age
"id" is the primary column in both tables and the rest are attribute columns.
I am accepting the primary and attribute column lists from the user, as below:
primary_columns = ["id"]
attribute_columns = ["name","age"]
I need to join the two DataFrames dynamically, as below:
df_update = df_source.join(
    df_target,
    (df_source["id"] == df_target["id"])
    & (
        (df_source["name"] != df_target["name"])
        | (df_source["age"] != df_target["age"])
    ),
    how="inner",
).select([df_source[col] for col in df_source.columns])
How can I build this join condition dynamically in PySpark, given that the number of attribute and primary key columns can change with the user input? Please help.
IIUC, you can achieve the desired output with an inner join on primary_columns and a where clause that loops over attribute_columns.
Since the two DataFrames have the same column names, use alias to differentiate column names after the join.
from functools import reduce
from pyspark.sql.functions import col
df_update = (
    df_source.alias("s")
    .join(df_target.alias("t"), on=primary_columns, how="inner")
    .where(
        # OR together one "source != target" test per attribute column
        reduce(
            lambda a, b: a | b,
            [col("s." + c) != col("t." + c) for c in attribute_columns],
        )
    )
    # keep only the columns that came from df_source
    .select("s.*")
)
Use reduce to apply the bitwise OR operation on the columns in attribute_columns.
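The same reduce pattern can be tried without a Spark session by folding SQL fragments instead of Column objects. This is a minimal sketch, not part of the answer above: `build_mismatch_condition` is a hypothetical helper, and it assumes the DataFrames are aliased "s" and "t" as in the answer.

```python
from functools import reduce


def build_mismatch_condition(attribute_columns, left="s", right="t"):
    """Return a SQL expression that is true when any attribute column differs.

    Hypothetical helper: column names are used as-is, without escaping.
    """
    if not attribute_columns:
        raise ValueError("attribute_columns must not be empty")
    # One "left.col <> right.col" term per attribute column.
    terms = [f"({left}.{c} <> {right}.{c})" for c in attribute_columns]
    # Fold the terms pairwise with OR, mirroring reduce(lambda a, b: a | b, ...).
    return reduce(lambda a, b: f"{a} OR {b}", terms)


condition = build_mismatch_condition(["name", "age"])
print(condition)
```

Since `DataFrame.where` also accepts a SQL expression string, the result could be passed to `.where(condition)` in place of the Column-based reduce. With either form, a comparison involving NULL evaluates to NULL, so rows where an attribute is NULL on one side are filtered out; Spark SQL's null-safe operator `<=>` is one way to handle that if it matters.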

How to debug Spark dropDuplicates and join function calls?

I have a table with duplicated rows. I am trying to remove the duplicates and keep the row with the latest my_date (if several rows share the same my_date, it does not matter which one is kept).
import org.apache.spark.sql.functions.{min, max, grouping}
val dataFrame = readCsv()
  .dropDuplicates("my_id", "my_date")
  .withColumn("my_date_int", $"my_date".cast("bigint"))
val aggregated = dataFrame
  .groupBy(dataFrame("my_id").alias("g_my_id"))
  .agg(max(dataFrame("my_date_int")).alias("g_my_date_int"))
val output = dataFrame
  .join(aggregated, dataFrame("my_id") === aggregated("g_my_id") && dataFrame("my_date_int") === aggregated("g_my_date_int"))
  .drop("g_my_id", "g_my_date_int")
But after running this code, when I count the distinct my_id values I get about 3000 fewer than in the source table. What could the reason be?
How can I debug this situation?
After dropping duplicates, do an except between the resulting DataFrame and the original one; this should give some insight into which additional rows are being dropped. Most probably those columns contain null or empty values that are being treated as duplicates.
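A related possibility worth checking, modeled here in plain Python rather than Spark: if my_date_int is NULL for every row of some my_id (for example because a value could not be cast), the aggregated max is NULL too, and an equality join never matches NULL against NULL. The sample rows and the helper `sql_eq` are made up for illustration; None stands for SQL NULL.

```python
# Plain-Python model (not Spark) of the groupBy/max + inner join from the question.
# Sample rows are hypothetical; None stands for SQL NULL.
rows = [
    {"my_id": "a", "my_date_int": 10},
    {"my_id": "a", "my_date_int": 20},
    {"my_id": "b", "my_date_int": None},  # e.g. a date that failed to cast
]


def sql_eq(x, y):
    """SQL-style equality: the result is NULL (None) if either side is NULL."""
    return None if x is None or y is None else x == y


# max() per my_id, skipping NULLs the way SQL aggregates do.
latest = {}
for r in rows:
    current = latest.setdefault(r["my_id"], None)
    value = r["my_date_int"]
    if value is not None and (current is None or value > current):
        latest[r["my_id"]] = value

# An inner join keeps a pair only when the condition is exactly TRUE.
joined = [
    r for r in rows
    if sql_eq(r["my_date_int"], latest[r["my_id"]]) is True
]
surviving_ids = {r["my_id"] for r in joined}
print(sorted(surviving_ids))
```

In this model the id "b" disappears even though it exists in the source. The same applies to rows whose my_id itself is NULL, so counting NULLs in my_id and my_date_int before the join is a cheap first check.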

convert SQL join query to pyspark syntax

I'm trying to convert a known-working SQL query to PySpark, given two DataFrames, using methods such as .join, .where, .filter, etc.
Here are examples of SQL queries that work (only selecting r.id where I will normally select more columns):
# "invalid" records, where there is a matching `record_id` for rv_df
SELECT DISTINCT(r.id) FROM core_record AS r LEFT OUTER JOIN core_recordvalidation rv ON r.id = rv.record_id WHERE r.job_id = 41 AND rv.record_id is not null;
# "valid" records, where there is no matching `record_id` for rv_df
SELECT DISTINCT(r.id) FROM core_record AS r LEFT OUTER JOIN core_recordvalidation rv ON r.id = rv.record_id WHERE r.job_id = 41 AND rv.record_id is null;
I'm about 80% of the way there, but I'm having trouble wrapping my head around the last few steps and how to do this most efficiently.
I've got a DataFrame r_df with column id that I'd like to join with DataFrame rv_df on column record_id. As output, I'd like only distinct r.id values, and only columns from r_df, none from rv_df. Finally, I'd like two different calls: one for rows where there is a match (what will be "invalid" records for me), and one for rows where there is no match (what I consider "valid" records).
I have PySpark queries that get close, but I'm not clear on how to ensure that r_df.id is distinct and how to select only columns from r_df, none from rv_df.
Any help would be much appreciated!
Just had to walk away for a couple hours. Found a solution that works for my use case.
First, selecting only distinct record_id from rv_df:
rv_df = rv_df.select('record_id').distinct()
Then use that for intersection and disjoints:
# Intersection:
j_df = r_df.join(rv_df, r_df.id == rv_df.record_id, 'leftsemi').select(r_df['*'])
# Disjoint:
j_df = r_df.join(rv_df, r_df.id == rv_df.record_id, 'leftanti').select(r_df['*'])
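What the two join types compute can be modeled in plain Python (not PySpark) with a set membership test. The sample rows below are made up for illustration.

```python
# Plain-Python model of left-semi and left-anti joins (not PySpark).
# The sample rows are hypothetical.
records = [
    {"id": 1, "job_id": 41},
    {"id": 2, "job_id": 41},
    {"id": 3, "job_id": 41},
]
validations = [{"record_id": 2}, {"record_id": 2}, {"record_id": 3}]

# Comparable to rv_df.select('record_id').distinct()
failed_ids = {v["record_id"] for v in validations}

# Left semi: left rows that have at least one match; only left columns are kept.
left_semi = [r for r in records if r["id"] in failed_ids]
# Left anti: left rows that have no match at all.
left_anti = [r for r in records if r["id"] not in failed_ids]

print([r["id"] for r in left_semi], [r["id"] for r in left_anti])
```

Because a semi or anti join only tests for existence, each left row appears at most once even though record 2 has two validation rows, and no columns from the right side are returned.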

Variable substitution in scala

I have two DataFrames in Scala (srcdataframe and tgttable) that hold data from two different tables with the same structure. I have to join them on a composite primary key, select a few columns, and append two columns. The code is below:
for(i <- 2 until numCols) {
  srcdataframe.as("A")
    .join(tgttable.as("B"), $"A.INSTANCE_ID" === $"B.INSTANCE_ID" &&
      $"A.CONTRACT_LINE_ID" === $"B.CONTRACT_LINE_ID", "inner")
    .filter($"A." + srcColnm(i) =!= $"B." + srcColnm(i))
    .select($"A.INSTANCE_ID",
      $"A.CONTRACT_LINE_ID",
      "$"+"\""+"A."+srcColnm(i)+"\""+","+"$"+"\""+"B."+srcColnm(i)+"\"")
    .withColumn("MisMatchedCol",lit("\""+srcColnm(i)+"\""))
    .withColumn("LastRunDate",current_timestamp.cast("long"))
    .createOrReplaceTempView("IPF_1M_Mismatch");
  hiveSQLContext.sql("Insert into table xxxx.f2f_Mismatch1 select t.* from (select * from IPF_1M_Mismatch) t");
}
Here is what I am trying to do:
Inner join srcdataframe and tgttable on instance_id and contract_line_id.
Select only instance_id, contract_line_id, the mismatched column values, the hardcoded name of the mismatched column, and a timestamp.
srcColnm is an array of strings that contains the non-primary-key columns to be compared.
However, I am not able to resolve the variables inside the DataFrame expressions in the for loop. I tried looking for solutions here and here. I learned that it may be because Spark substitutes the variables only at compile time, and in that case I'm not sure how to resolve it.
Instead of creating columns with $, you can simply use strings or the col() function. I would also recommend performing the join outside the for loop, as it is an expensive operation. Below is slightly changed code; the main difference that solves your problem is in the select:
val df = srcdataframe.as("A")
  .join(tgttable.as("B"), Seq("INSTANCE_ID", "CONTRACT_LINE_ID"), "inner")
for(columnName <- srcColnm) {
  df.filter(col("A." + columnName) =!= col("B." + columnName))
    .select("INSTANCE_ID", "CONTRACT_LINE_ID", "A." + columnName, "B." + columnName)
    .withColumn("MisMatchedCol", lit(columnName))
    .withColumn("LastRunDate", current_timestamp().cast("long"))
    .createOrReplaceTempView("IPF_1M_Mismatch")
  // Hive command
}
Regarding the problem in select:
$ is short for the col() function: it selects a column in the DataFrame by name. The problem in the select is that the first two arguments, col("A.INSTANCE_ID") and col("A.CONTRACT_LINE_ID"), are columns ($ replaced by col() for clarity).
However, the next two arguments are strings. It is not possible to mix the two: either all arguments are columns or all are strings. Because you used "A."+srcColnm(i) to build the column name, $ can't be used; however, you could have used col("A."+srcColnm(i)).

Minus logic implementation not working with spark/scala

Minus Logic in Hive:
The Hive query below returns only the records that exist in the left-side table (Full_Table ft) and not in both tables.
Select ft.* from Full_Table ft left join Stage_Table stg where stg.primary_key1 IS null and stg.primary_key2 IS null
I tried to implement the same in Spark/Scala using the following method (to support both a single primary key and a composite key), but the joined result set does not contain the key columns from the right table, so I am not able to apply the stg.primary_key2 IS null condition to it.
ft.join(stg, usingColumns, "left_outer") // used Seq to support composite key column join
Please suggest how to implement minus logic in Spark/Scala.
Thanks,
Saravanan
https://www.linkedin.com/in/saravanan303/
If your tables have the same columns, you can use the except method of Dataset:
fullTable.except(stageTable)
If they don't, but you are interested only in a subset of columns that exist in both tables, you can first select those columns with the select transformation and then use except:
val fullTableSelectedColumns = fullTable.select(c1,c2,c3)
val stageTableSelectedColumns = stageTable.select(c1,c2,c3)
fullTableSelectedColumns.except(stageTableSelectedColumns)
Otherwise, you can use the join and filter transformations:
fullTable
  .join(stageTable, fullTable("primary_key") === stageTable("primary_key"), "left")
  // keep only the rows that found no match in stageTable
  .filter(stageTable("primary_key").isNull)