Comparing two dataframes to make a new one in PySpark

I'm new to PySpark and thus the question.
I have two dataframes df1 and df2 with columns A, B and C.
Only column C can have different values in these two dataframes.
How do I compare df1 and df2 and create df3 with columns A, B, and C which only has the rows where the value of C differs between df1 and df2?
Any help appreciated.

Inner join and filter
from pyspark.sql.functions import col
df1.alias("df1").join(df2.alias("df2"), ["a", "b"]).where(col("df1.c") != col("df2.c"))
If you also want to handle rows that are missing from one of the dataframes, use a full outer join with a null-safe comparison:
df1.alias("df1").join(df2.alias("df2"), ["a", "b"], "fullouter").where(
    ~col("df1.c").eqNullSafe(col("df2.c"))
)
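
For a self-contained illustration, here is a minimal runnable sketch of the inner-join variant; the sample rows and values are made up for demonstration only:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data: only the (2, "y") row differs in c
df1 = spark.createDataFrame([(1, "x", 10), (2, "y", 20)], ["a", "b", "c"])
df2 = spark.createDataFrame([(1, "x", 10), (2, "y", 99)], ["a", "b", "c"])

# keep only rows where c differs, carrying both versions of c
df3 = (
    df1.alias("df1")
    .join(df2.alias("df2"), ["a", "b"])
    .where(col("df1.c") != col("df2.c"))
    .select("a", "b", col("df1.c").alias("c_df1"), col("df2.c").alias("c_df2"))
)
df3.show()  # only the (2, "y") row survives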

Related

Difference in SparkSQL Dataframe columns

How do I locate the difference between the columns of 2 dataframes?
This is causing issues when I join the 2 dataframes.
df1_cols = df1.columns
df2_cols = df2.columns
This will return the columns of the 2 dataframes in 2 list variables.
Thanks
df1.columns returns a plain Python list here, so you can use any tool in Python to compare it with another list, i.e. df2_cols. For example, you can use set operations to check the common and differing columns of the two DataFrames:
df1_cols = df1.columns
df2_cols = df2.columns
set(df1_cols).intersection(set(df2_cols)) # check common columns
set(df1_cols) - set(df2_cols) # check columns in df1 but not in df2
set(df2_cols) - set(df1_cols) # check columns in df2 but not in df1
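
As a sketch of how these sets can be put to use, e.g. restricting both DataFrames to their shared columns before comparing them (the ordering is taken from df1 so the two selects line up):
# keep df1's ordering for the shared columns
common_cols = [c for c in df1.columns if c in set(df2.columns)]

df1_common = df1.select(common_cols)
df2_common = df2.select(common_cols)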

Joins in pyspark different columns

How do I join two PySpark dataframes on two differently named columns?
Cols in df1: ID, DATE
Cols in df2: user, DATE
I want to join on df1.ID == df2.user and df1.DATE == df2.DATE.
Joindf = df1.join(df2.withColumnRenamed("user", "ID"), ["ID", "DATE"])
should do it for you.
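Alternatively, if you would rather not rename anything, a sketch with an explicit join condition; note the result then carries both the user column and a duplicate DATE column, so they are dropped afterwards:
joined = (
    df1.join(df2, (df1["ID"] == df2["user"]) & (df1["DATE"] == df2["DATE"]))
    .drop(df2["user"])
    .drop(df2["DATE"])
)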

Convert datatypes for respective columns as per another dataframe

I have a PySpark dataframe with 100 cols:
df1=[(col1,string),(col2,double),(col3,bigint),..so on]
I have another PySpark dataframe df2 with the same col count and col names but different datatypes.
df2=[(col1,bigint),(col2,double),(col3,string),..so on]
How do I make the datatypes of all the cols in df2 the same as the ones present in df1 for their respective cols?
It should happen iteratively, and if the datatypes already match then the column should not change.
If, as you said, the column names and column counts match, then you can simply loop over the schema of df1 and cast each column of df2 to the corresponding dataType of df1:
from pyspark.sql import functions as F
df2 = df2.select([F.col(c.name).cast(c.dataType) for c in df1.schema])
You can use the cast function:
from pyspark.sql import functions as f

# get the (column, type) pairs for each DF
df1_schema = df1.dtypes
df2_schema = df2.dtypes

# iterate through the cols to cast the columns which differ in type
for (c1, d1), (c2, d2) in zip(df1_schema, df2_schema):
    # check if the datatypes are the same, otherwise cast to df1's type
    if d1 != d2:
        df2 = df2.withColumn(c2, f.col(c2).cast(d1))
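
As a quick sanity check after either approach, you can compare the two dtypes lists directly; this assumes, as the question states, that column names and order already match:
# dtypes returns (name, type) pairs, which should now be identical
assert df2.dtypes == df1.dtypes, "some columns were not cast as expected"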

How to join two dataframes in Scala and select on few columns from the dataframes by their index?

I have to join two dataframes, which is very similar to the task given here: Joining two DataFrames in Spark SQL and selecting columns of only one.
However, I want to select only the second column from df2. In my task, I am going to use the join function for two dataframes within a reduce function for a list of dataframes. In this list of dataframes, the column names will be different. However, in each case I would want to keep the second column of df2.
I could not find anywhere how to select a dataframe's column by its numeric index. Any help is appreciated!
EDIT:
ANSWER
I figured out the solution. Here is one way to do this:
def joinDFs(df1: DataFrame, df2: DataFrame): DataFrame = {
  val df2cols = df2.columns
  val desiredDf2Col = df2cols(1) // the second column
  val df3 = df1.as("df1").join(df2.as("df2"), $"df1.time" === $"df2.time")
    .select($"df1.*", $"df2.$desiredDf2Col")
  df3
}
And then I can apply this function in a reduce operation on a list of dataframes.
var listOfDFs: List[DataFrame] = List()
// Populate listOfDFs as you want here
val joinedDF = listOfDFs.reduceLeft((x, y) => {joinDFs(x, y)})
To select the second column in your dataframe you can simply do:
val df3 = df2.select(df2.columns(1))
This will first find the second column name and then select it.
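For readers following the PySpark questions above, the equivalent index-based selection is a one-liner; df2 here is assumed to be a PySpark DataFrame:
# columns is a plain Python list, so positional indexing works
df3 = df2.select(df2.columns[1])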
If the join and select that you want to apply in the reduce function are similar to Joining two DataFrames in Spark SQL and selecting columns of only one, then you should do the following:
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1) map d2.columns map col: _*)
You will have to remember that the name of the selected column, i.e. the one at index 1 in Seq(1), must not be the same as any of d1's column names, otherwise the reference becomes ambiguous.
You can select multiple columns as well, but remember the note above:
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1, 2) map d2.columns map col: _*)

Merging Dataframes in Spark

I have 2 Dataframes, say A and B. I would like to join them on a key column and create another Dataframe. When the keys match in A and B, I need to get the row from B, not from A.
For example:
DataFrame A:
Employee1, salary100
Employee2, salary50
Employee3, salary200
DataFrame B
Employee1, salary150
Employee2, salary100
Employee4, salary300
The resulting DataFrame should be:
DataFrame C:
Employee1, salary150
Employee2, salary100
Employee3, salary200
Employee4, salary300
How can I do this in Spark & Scala?
Try:
dfA.registerTempTable("dfA")
dfB.registerTempTable("dfB")
sqlContext.sql("""
  SELECT coalesce(dfA.employee, dfB.employee),
         coalesce(dfB.salary, dfA.salary)
  FROM dfA FULL OUTER JOIN dfB
  ON dfA.employee = dfB.employee""")
or
sqlContext.sql("""
  SELECT coalesce(dfA.employee, dfB.employee),
         CASE WHEN dfB.employee IS NOT NULL THEN dfB.salary
              ELSE dfA.salary
         END
  FROM dfA FULL OUTER JOIN dfB
  ON dfA.employee = dfB.employee""")
Assuming dfA and dfB each have the 2 columns emp and sal, you can use the following:
import org.apache.spark.sql.{functions => f}
import spark.implicits._ // needed for the 'symbol column syntax

val dfB1 = dfB
  .withColumnRenamed("sal", "salB")
  .withColumnRenamed("emp", "empB")

val joined = dfA
  .join(dfB1, 'emp === 'empB, "outer")
  .select(
    f.coalesce('empB, 'emp).as("emp"),
    f.coalesce('salB, 'sal).as("sal")
  )
NB: each Dataframe should have only one row per value of emp, otherwise the outer join will produce duplicate rows.
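
For completeness alongside the PySpark questions above, a minimal PySpark sketch of the same coalesce-over-full-outer-join pattern, assuming the same emp and sal column names:
from pyspark.sql import functions as f

# rename dfB's columns so both sides stay addressable after the join
dfB1 = dfB.withColumnRenamed("sal", "salB").withColumnRenamed("emp", "empB")

joined = (
    dfA.join(dfB1, dfA["emp"] == dfB1["empB"], "outer")
    .select(
        f.coalesce(dfB1["empB"], dfA["emp"]).alias("emp"),
        f.coalesce(dfB1["salB"], dfA["sal"]).alias("sal"),
    )
)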