Check condition for two columns in two different dataframes in Spark - Scala

Suppose there is one column in a dataframe and a column with a similar schema in another dataframe. How do I check whether the values in the two columns are the same without joining them, since there is no common attribute?
DF1
serial_nm
abc
mnc
pqr
DF2
ser_nm
hgf
mnc
uio
pqr
lok
And I want a third dataframe, DF3, as output:
DF3
mnc
pqr
I tried this
val DF3 = DF1.filter(DF1("serial_nm") === DF2("ser_nm"))
But it's not working.
Please Help
Thanks..!!

I believe you can use a join. Consider using it like this:
val DF3 = DF1.join(DF2, DF1("serial_nm") === DF2("ser_nm"))
or
val DF3 = DF1.join(DF2).where(DF1("serial_nm") === DF2("ser_nm"))
Both approaches are equivalent.
Note: To avoid problems with ambiguous columns when both dataframes share column names, one option is to rename them before the join, e.g.:
val df2_renamed = DF2.withColumnRenamed("ser_nm", "df2_ser_nm")

Related

How to concat two dataframes in which one has records and the other one is empty in PySpark?

I need help to concat two dataframes in which one is empty and the other one has data. Could you please tell me how to do this in PySpark?
In pandas I am using the following (suppose df2 is empty and df1 has some records):
df2 = pd.concat([df2, df1])
But how do I perform this operation in PySpark?
df1:
+--------------------+----------+---------+
| Programname|Projectnum| Drug|
+--------------------+----------+---------+
|Non-Oncology Phar...|SR0480-000|Invokamet|
+--------------------+----------+---------+
df2:
++
||
++
++
I tried many options; one worked for me.
To concat df2 with df1, I first need to create df2 with the same structure as df1, then use union for the concatenation.
df2 = sqlContext.createDataFrame(sc.emptyRDD(), df1.schema)
df2 = df2.union(df1)
result:
df2:
+--------------------+----------+---------+
| Programname|Projectnum| Drug|
+--------------------+----------+---------+
|Non-Oncology Phar...|SR0480-000|Invokamet|
+--------------------+----------+---------+
You can use the union method (assuming both dataframes have the same schema):
df = df1.union(df2)

Difference in SparkSQL Dataframe columns

How do I locate the difference between the columns of 2 dataframes?
This is causing issues when I join 2 dataframes.
df1_cols = df1.columns
df2_cols = df2.columns
This returns the columns of the 2 dataframes in 2 list variables.
Thanks
df.columns returns a list here, so you can use any tool in Python to compare it with another list, i.e. df2_cols. For example, you can use set to check the common columns in the two DataFrames:
df1_cols = df1.columns
df2_cols = df2.columns
set(df1_cols).intersection(set(df2_cols)) # check common columns
set(df1_cols) - set(df2_cols) # check columns in df1 but not in df2
set(df2_cols) - set(df1_cols) # check columns in df2 but not in df1

Drop a list of columns from a single dataframe in Spark

I have a dataframe resulting from a join of two dataframes, df1 and df2, into df3. All the columns found in df2 are also in df1, but their contents differ. I'd like to remove from the join all the df1 columns whose names are in df2.columns. Would there be a way to do this without using a var?
Currently I've done this
var ret = df3
df2.columns.foreach(coln => ret = ret.drop(df2(coln)))
but what I really want is just a shortcut for
df3.drop(df1(df2.columns(1))).drop(df1(df2.columns(2)))....
without using a var.
Passing a list of columns is not an option; I don't know if it's because I'm using Spark 2.2.
EDIT:
Important note: I don't know in advance the columns of df1 and df2
This is possible to achieve while you are performing the join itself. Please try the below code
val resultDf = df1.alias("frstdf")
  .join(broadcast(df2).alias("scndf"), $"frstdf.col1" === $"scndf.col1", "left_outer")
  .selectExpr("scndf.col1", "scndf.col2", ...) // or simply .selectExpr("scndf.*")
This would only contain the columns from the second data frame. Hope this helps
A shortcut would be:
val ret = df2.columns.foldLeft(df3)((acc,coln) => acc.drop(df2(coln)))
I would suggest removing the columns before the join (see the sketch after the snippet below). Alternatively, select only the columns from df3 which come from df2:
val ret = df3.select(df2.columns.map(col):_*)
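For the first suggestion (dropping the overlapping columns before the join), a minimal sketch, assuming the two dataframes are joined on a key column named id (a hypothetical name, since the real columns aren't known in advance):
// "id" is a hypothetical join key; replace it with the actual key column(s)
// drop from df1 every column that also exists in df2, except the join key
val overlapping = df1.columns.toSet.intersect(df2.columns.toSet) - "id"
val df1Trimmed = overlapping.foldLeft(df1)((acc, c) => acc.drop(c))
val df3 = df1Trimmed.join(df2, Seq("id"))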

Spark Scala: select column names from another dataframe

There are two JSONs; the first JSON has more columns and is always a superset.
val df1 = spark.read.json(sqoopJson)
val df2 = spark.read.json(kafkaJson)
Except operation:
I would like to apply the except operation on both df1 and df2, but df1 has 10 columns and df2 has only 8 columns.
If I manually drop 2 columns from df1, then except works. But I have 50+ tables/JSONs and need to do EXCEPT for all 50 sets of tables/JSONs.
Question:
How to select from df1 only the columns available in df2 (8 columns) and create a new df3? So df3 will have data from df1 with a limited set of columns, and it will match df2's columns.
For the question: how to select from df1 only the columns available in df2 (8 columns) and create a new df3?
import org.apache.spark.sql.functions.col
// get the 8 column names from df2 as Column objects
val columns = df2.schema.fieldNames.map(col(_))
// select only those columns from df1
val df3 = df1.select(columns :_*)
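Once df3 matches df2's columns, the except from the question should then work directly; a minimal sketch:
// rows of df3 (i.e. df1 restricted to df2's columns) that are not present in df2
val diff = df3.except(df2)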
Hope this helps!

Rename column names when selecting from a dataframe

I have 2 dataframes, df1 and df2, and I am left joining them on the id column and saving the result to another dataframe named df3. Below is the code that I am using, which works fine as expected.
val df3 = df1.alias("tab1").join(df2.alias("tab2"),Seq("id"),"left_outer").select("tab1.*","tab2.name","tab2.dept","tab2.descr");
I would like to rename the tab2.descr column to dept_full_description within the above statement.
I am aware that I could create a seq val like below and use toDF method
val columnsRenamed = Seq("id", "empl_name", "name","dept","dept_full_description") ;
df4 = df3.toDF(columnsRenamed: _*);
Is there any other way to do the aliasing in the first statement itself? My end goal is to avoid listing about 30-40 columns explicitly.
I'd rename before join:
df1.alias("tab1").join(
df2.withColumnRenamed("descr", "dept_full_description").alias("tab2"),
Seq("id"), "left_outer")