Spark scala join two dataframes on columns with different names - scala

I'm trying to join two dataframes on a column with two different names:
def joinTwoDataframes(X: DataFrame)(Y: DataFrame): DataFrame = {
  X.join(Y, X("ssp") === Y("impbus_ssp_value"))
}
The error is: "Cannot resolve column name "ssp" among (impbus_ssp_value, col2);"
Thanks!
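The error message shows that the column ssp is being looked up on the DataFrame whose columns are (impbus_ssp_value, col2), i.e. on X. A minimal sketch of a fix, assuming the DataFrame passed as X is the one holding impbus_ssp_value and Y is the one holding ssp (if it is the other way around, swap the arguments when calling instead):
def joinTwoDataframes(X: DataFrame)(Y: DataFrame): DataFrame = {
  // Column references swapped relative to the original; names taken from the error message
  X.join(Y, X("impbus_ssp_value") === Y("ssp"))
}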

Related

Join two dataframe using Spark Scala

I have this code:
val o = p_value.alias("d1").join(t_d.alias("d2"),
  (col("d1.origin_latitude") === col("d2.origin_latitude") &&
   col("d1.origin_longitude") === col("d2.origin_longitude")), "left")
  .filter(col("d2.origin_longitude").isNull)
val c = p_value2.alias("d3").join(o.alias("d4"),
  (col("d3.origin_latitude") === col("d4.origin_latitude") &&
   col("d3.origin_longitude") === col("d4.origin_longitude")), "left")
  .filter(col("d3.origin_longitude").isNull)
I get this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'd4.origin_latitude' is ambiguous, could be: d4.origin_latitude, d4.origin_latitude.;
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:101)
On this line
(col("d3.origin_latitude")===col("d4.origin_latitude") && col("d3.origin_longitude")===col("d4.origin_longitude")),"left").
Any idea?
Thank you.
You are aliasing the DataFrame, not the columns; the alias is only a name used to access or refer to the columns of that DataFrame.
So the first join results in another DataFrame that contains the same column names twice (origin_latitude as well as origin_longitude). As soon as you try to access one of these columns in the resulting DataFrame, you get the ambiguity error.
You need to make sure that the resulting DataFrame contains each column only once.
You can rewrite the first join as below:
p_value
  .join(t_d, Seq("origin_latitude", "origin_longitude"), "left")
  .filter(t_d.col("origin_longitude").isNull)
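If the goal is simply to keep the rows of p_value that have no match in t_d, a left anti join (a sketch of an alternative, assuming that is the intent) expresses the same thing in one step and sidesteps the null filter entirely:
// Keeps only p_value rows with no matching (origin_latitude, origin_longitude) in t_d
val o = p_value.join(t_d, Seq("origin_latitude", "origin_longitude"), "left_anti")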

Dynamic boolean join in pyspark

I have two pyspark dataframes with the same schema, as below:
df_source:
id, name, age
df_target:
id, name, age
"id" is the primary column in both tables and the rest are attribute columns.
I am accepting the primary and attribute column lists from the user as below:
primary_columns = ["id"]
attribute_columns = ["name","age"]
I need to join the above two dataframes dynamically, as below:
df_update = df_source.join(
    df_target,
    (df_source["id"] == df_target["id"]) &
    ((df_source["name"] != df_target["name"]) | (df_source["age"] != df_target["age"])),
    how="inner"
).select([df_source[col] for col in df_source.columns])
How can I achieve this join condition dynamically in pyspark, since the number of attribute and primary key columns can change as per the user input? Please help.
IIUC, you can achieve the desired output using an inner join on the primary_columns and a where clause that loops over the attribute_columns.
Since the two DataFrames have the same column names, use alias to differentiate column names after the join.
from functools import reduce
from pyspark.sql.functions import col

df_update = df_source.alias("s") \
    .join(df_target.alias("t"), on=primary_columns, how="inner") \
    .where(
        # e.g. (s.name != t.name) | (s.age != t.age) for the attribute columns above
        reduce(
            lambda a, b: a | b,
            [col("s." + c) != col("t." + c) for c in attribute_columns]
        )
    ) \
    .select("s.*")
Use reduce to apply the bitwise OR operation on the columns in attribute_columns.

Spark Scala Populate Dataframe from Select query result

Let's say I have two tables, one for students (tbl_students) and another for exams (tbl_exams). In vanilla SQL with a relational database, I can use an outer join to find the list of students who have missed a particular exam, since the student_id won't match any row in the exam table for that particular exam_id. I could also insert the result of this outer join query into another table using the SELECT INTO syntax.
With that background, can I achieve a similar result using spark sql and scala, where I can populate a dataframe using the result of an outer join? Example code is (the code is not tested and may not run as is):
//Create schema for single column
val schema = StructType(Seq(
  StructField("student_id", StringType, true)
))
//Create empty RDD of Row
var dataRDD = sc.emptyRDD[Row]
//Pass RDD and schema to create the dataframe
val joinDF = sqlContext.createDataFrame(dataRDD, schema)
joinDF.createOrReplaceTempView("tbl_students_missed_exam")
//Populate the tbl_students_missed_exam dataframe using the result of the outer join
sparkSession.sql(s"""
  SELECT tbl_students.student_id
  INTO tbl_students_missed_exam
  FROM tbl_students
  LEFT OUTER JOIN tbl_exams ON tbl_students.student_id = tbl_exams.exam_id""")
Thanks in advance for your input
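A minimal sketch of one way to do this, assuming tbl_students and tbl_exams are registered as temp views, that tbl_exams also has a student_id column (the column names here follow the question and are assumptions), and that a SparkSession named spark is in scope. Spark SQL has no SELECT INTO; the join result is itself a DataFrame, so it can be registered as a view or written to a table directly:
// Students with no matching exam row (left outer join + null check)
val missedExamDF = spark.sql("""
  SELECT s.student_id
  FROM tbl_students s
  LEFT OUTER JOIN tbl_exams e ON s.student_id = e.student_id
  WHERE e.student_id IS NULL
""")
// Instead of SELECT INTO, register (or save) the result
missedExamDF.createOrReplaceTempView("tbl_students_missed_exam")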

Variable substitution in scala

I have two dataframes in scala, both holding data from two different tables but of the same structure (srcdataframe and tgttable). I have to join these two based on a composite primary key, select a few columns, and append two columns; the code for this is as below:
for(i <- 2 until numCols) {
srcdataframe.as("A")
.join(tgttable.as("B"), $"A.INSTANCE_ID" === $"B.INSTANCE_ID" &&
$"A.CONTRACT_LINE_ID" === $"B.CONTRACT_LINE_ID", "inner")
.filter($"A." + srcColnm(i) =!= $"B." + srcColnm(i))
.select($"A.INSTANCE_ID",
$"A.CONTRACT_LINE_ID",
"$"+"\""+"A."+srcColnm(i)+"\""+","+"$"+"\""+"B."+srcColnm(i)+"\"")
.withColumn("MisMatchedCol",lit("\""+srcColnm(i)+"\""))
.withColumn("LastRunDate",current_timestamp.cast("long"))
.createOrReplaceTempView("IPF_1M_Mismatch");
hiveSQLContext.sql("Insert into table xxxx.f2f_Mismatch1 select t.* from (select * from IPF_1M_Mismatch) t");}
Here are the things I am trying to do:
Inner join of srcdataframe and tgttable based on instance_id and contract_line_id.
Select only instance_id, contract_line_id, mismatched_col_values, hardcode of mismatched_col_nm, timestamp.
srcColnm(i) is an array of strings which contains the non-primary keys to be compared.
However, I am not able to resolve the variables inside the dataframe in the for loop. I tried looking for solutions here and here. I learned that it may be because spark substitutes the variables only at compile time; in this case I'm not sure how to resolve it.
Instead of creating columns with $, you can simply use strings or the col() function. I would also recommend performing the join outside of the for loop, as it's an expensive operation. Here is slightly changed code; the main difference that solves your problem is in the select:
val df = srcdataframe.as("A")
  .join(tgttable.as("B"), Seq("INSTANCE_ID", "CONTRACT_LINE_ID"), "inner")

for (columnName <- srcColnm) {
  df.filter(col("A." + columnName) =!= col("B." + columnName))
    .select("INSTANCE_ID", "CONTRACT_LINE_ID", "A." + columnName, "B." + columnName)
    .withColumn("MisMatchedCol", lit(columnName))
    .withColumn("LastRunDate", current_timestamp().cast("long"))
    .createOrReplaceTempView("IPF_1M_Mismatch")
  // Hive command
}
Regarding the problem in the select:
$ is shorthand for the col() function; it selects a column in the dataframe by name. The problem in the select is that the first two arguments, col("A.INSTANCE_ID") and col("A.CONTRACT_LINE_ID"), are columns ($ replaced by col() for clarity).
However, the next two arguments are strings. It is not possible to mix the two: either all arguments should be columns or all should be strings. Since you used "A." + srcColnm(i) to build up the column name, $ can't be used; however, you could have used col("A." + srcColnm(i)).
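Concretely, the select inside the loop above could keep the A/B qualifiers by using col() for every argument (a sketch reusing columnName from the code above; the output aliases are only illustrative):
df.filter(col("A." + columnName) =!= col("B." + columnName))
  .select(
    col("INSTANCE_ID"),
    col("CONTRACT_LINE_ID"),
    col("A." + columnName).as("src_" + columnName),
    col("B." + columnName).as("tgt_" + columnName)
  )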

Join Dataframes in Spark

I have joined two Dataframes in spark using the below code:
Dataframes are: expDataFrame, accountList
val expDetails = expDataFrame.as("fex").join(accountList.as("acctlist"),$"fex.acct_id" === $"acctlist.acct_id", "inner")
Now I am trying to show acct_id from both dataframes.
I have written the below code:
expDetails.select($"fex.acct_id", $"acctlist.acct_id").show
but I am getting the same column name, acct_id, twice.
I want two unique column names, like fex_acct_id and acctlist_acct_id, to identify which dataframe each column comes from.
You simply have to add an alias to the columns using the as or alias methods. This will do the job:
expDetails.select(
  $"fex.acct_id".as("fex_acct_id"),
  $"acctlist.acct_id".as("acctlist_acct_id")
).show