Dynamically select multiple columns while joining different Dataframe in Scala Spark - scala

I have two Spark DataFrames, df1 and df2. Is there a way to select the output columns dynamically while joining them? The definition below outputs all columns from both df1 and df2 for an inner join.
def joinDF (df1: DataFrame, df2: DataFrame , joinExprs: Column, joinType: String): DataFrame = {
val dfJoinResult = df1.join(df2, joinExprs, joinType)
dfJoinResult
//.select()
}
Input data:
val df1 = List(("1","new","current"), ("2","closed","saving"), ("3","blocked","credit")).toDF("id","type","account")
val df2 = List(("1","7"), ("2","5"), ("5","8")).toDF("id","value")
Expected result:
val dfJoinResult = df1
.join(df2, df1("id") === df2("id"), "inner")
.select(df1("type"), df1("account"), df2("value"))
dfJoinResult.schema:
StructType(StructField(type,StringType,true),
StructField(account,StringType,true),
StructField(value,StringType,true))
I have looked at options like df.select(cols.head, cols.tail: _*), but that does not allow selecting columns from both DataFrames.
Is there a way to pass the columns to select dynamically, along with the DataFrame each one should come from, into my def? I'm using Spark 2.2.0.

It is possible to pass the select expression as a Seq[Column] to the method:
def joinDF(df1: DataFrame, df2: DataFrame , joinExpr: Column, joinType: String, selectExpr: Seq[Column]): DataFrame = {
val dfJoinResult = df1.join(df2, joinExpr, joinType)
dfJoinResult.select(selectExpr:_*)
}
To call the method use:
val joinExpr = df1.col("id") === df2.col("id")
val selectExpr = Seq(df1.col("type"), df1.col("account"), df2.col("value"))
val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)
This will give the desired result:
+------+-------+-----+
|  type|account|value|
+------+-------+-----+
|   new|current|    7|
|closed| saving|    5|
+------+-------+-----+
In the selectExpr above, it is necessary to specify which dataframe the columns are coming from. However, this can be further simplified if the following assumptions are true:
The columns to join on have the same name in both dataframes
The columns to be selected have unique names (the other dataframe does not have a column with the same name)
In this case, the joinExpr: Column can be changed to joinExpr: Seq[String] and selectExpr: Seq[Column] to selectExpr: Seq[String]:
def joinDF(df1: DataFrame, df2: DataFrame , joinExpr: Seq[String], joinType: String, selectExpr: Seq[String]): DataFrame = {
val dfJoinResult = df1.join(df2, joinExpr, joinType)
dfJoinResult.select(selectExpr.head, selectExpr.tail:_*)
}
Calling the method now looks cleaner:
val joinExpr = Seq("id")
val selectExpr = Seq("type", "account", "value")
val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)
Note: When the join is performed using a Seq[String], the column names of the resulting dataframe differ from those produced by a join expression: each join column appears only once rather than once per input dataframe. Columns that share a name can then no longer be selected separately afterwards.
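To illustrate the note above, here is a minimal sketch that compares the output columns of the two join styles on the df1 and df2 from the question. The local SparkSession settings and the names byExpr, byName, exprColumns and nameColumns are only assumptions for the example.

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("join-columns-sketch").getOrCreate()
import spark.implicits._

val df1 = List(("1","new","current"), ("2","closed","saving"), ("3","blocked","credit")).toDF("id","type","account")
val df2 = List(("1","7"), ("2","5"), ("5","8")).toDF("id","value")

// Joining on an expression keeps the join column from both sides.
val byExpr = df1.join(df2, df1("id") === df2("id"), "inner")
val exprColumns = byExpr.columns.toSeq // id, type, account, id, value

// Joining on a Seq of column names keeps a single copy of the join column.
val byName = df1.join(df2, Seq("id"), "inner")
val nameColumns = byName.columns.toSeq // id, type, account, value
```

With the expression join, df1("id") and df2("id") can still be selected individually. With the name-based join, only one id column is left.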

A slight modification of the solution above is to select the required columns from each DataFrame before performing the join. This may involve a little less overhead, since fewer columns take part in the join.
val dfJoinResult = df1.select("column1","column2").join(df2.select("col1"),joinExpr,joinType)
Remember to also select the columns used in the join condition: the selection happens first, and the join is then performed on whatever columns remain.

Related

Spark 2.3: subtract dataframes but preserve duplicate values (Scala)

Copying example from this question:
As a conceptual example, if I have two dataframes:
words = [the, quick, fox, a, brown, fox]
stopWords = [the, a]
then I want the output to be, in any order:
words - stopWords = [quick, brown, fox, fox]
ExceptAll can do this in 2.4 but I cannot upgrade. The answer in the linked question is specific to a dataframe:
words.join(stopwords, words("id") === stopwords("id"), "left_outer")
.where(stopwords("id").isNull)
.select(words("id")).show()
as in you need to know the pkey and the other columns.
Can anyone come up with an answer that will work on any dataframe?
Here is an implementation. I have tested it in Spark 2.4.2; it should work for 2.3 too (not 100% sure).
val df1 = spark.createDataset(Seq("the","quick","fox","a","brown","fox")).toDF("c1")
val df2 = spark.createDataset(Seq("the","a")).toDF("c1")
exceptAllCustom(df1, df2, Seq("c1")).show()
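import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
def exceptAllCustom(df1 : DataFrame, df2 : DataFrame, pks : Seq[String]): DataFrame = {
// true only when every key column from df2 is null, i.e. the df1 row found no match
val nullCondition = pks.foldLeft(lit(true))((column,cName) => column && df2(cName).isNull)
// equality on every key column
val joinCondition = pks.foldLeft(lit(true))((column,cName) => column && df2(cName) === df1(cName))
val result = df1.join(df2, joinCondition, "left_outer")
.where(nullCondition)
// drop the key columns that came from df2 (all null at this point)
pks.foldLeft(result)((df,cName) => df.drop(df2(cName)))
}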
Result -
+-----+
| c1|
+-----+
|quick|
| fox|
|brown|
| fox|
+-----+
Turns out it's easier to do df1.except(df2) and then join the results with df1 to get all the duplicates.
Full code:
def exceptAllCustom(df1: DataFrame, df2: DataFrame): DataFrame = {
val except = df1.except(df2)
val columns = df1.columns
val colExpr: Column = df1(columns.head) <=> except(columns.head)
val joinExpression = columns.tail.foldLeft(colExpr) { (colExpr, p) =>
colExpr && df1(p) <=> except(p)
}
val join = df1.join(except, joinExpression, "inner")
join.select(df1("*"))
}

Spark scala : select column name from other dataframe

There are two JSON files; the first always has more columns and is a superset of the second.
val df1 = spark.read.json(sqoopJson)
val df2 = spark.read.json(kafkaJson)
Except operation:
I would like to apply the except operation to df1 and df2, but df1 has 10 columns and df2 has only 8.
If I manually drop the 2 extra columns from df1, then except works. But I have 50+ tables/JSON files and need to do EXCEPT for all 50 sets.
Question:
How can I select from df1 only the columns that exist in df2 (8 columns) and create a new df3? df3 would then hold df1's data, restricted to the columns of df2.
For the question of how to select from df1 only the 8 columns available in df2 and create a new df3:
import org.apache.spark.sql.functions.col
//Get the 8 column names from df2
val columns = df2.schema.fieldNames.map(col(_))
//select only the columns from df2
val df3 = df1.select(columns :_*)
Hope this helps!
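As a hedged end-to-end sketch of the approach above, the following uses small inline DataFrames in place of the JSON files, so the data, the column names and the local SparkSession settings are all assumptions. It restricts df1 to df2's columns and then applies except.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for the sketch; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("select-common-columns-sketch").getOrCreate()
import spark.implicits._

// Stand-ins for the two JSON sources: df1 has one extra column.
val df1 = List(("1", "a", "x"), ("2", "b", "y")).toDF("id", "name", "extra")
val df2 = List(("1", "a")).toDF("id", "name")

// Keep only df2's columns, in df2's order, so both sides line up for except.
val df3 = df1.select(df2.columns.map(col(_)): _*)

// Rows of df3 that do not appear in df2.
val diff = df3.except(df2)
val diffRows = diff.collect().map(r => (r.getString(0), r.getString(1))).toSeq
```

Because except matches columns by position rather than by name, taking the column order from df2 also keeps the two sides aligned.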

How to join multiple dataFrames in spark with different column names and types without converting into RDD

My df1 has column of type Double, df2 has column of type Timestamp and df3 has column of type Integer.
I'm trying to achieve something like this:
df1 = ...
df2 = ...
df3 = ...
val df4 = df1.zip(df2).zip(df3)
However, there is no function like "zip". How can I achieve this result?
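There is no explicit zip for DataFrames. As a workaround, you can number the rows of each DataFrame and join on that number:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val df1Ordered = df1.withColumn("rowNr", row_number().over(Window.orderBy('someColumn)))
// the same for other DataFrames
// now join those DataFrames
val newDF = df1Ordered.join(df2Ordered, "rowNr").join(df3Ordered, "rowNr")
However, it will be quite slow: because the Window has no partitionBy, Spark moves all rows to a single partition to number them.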
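A self-contained sketch of the row-numbering workaround, limited to two small DataFrames for brevity. The data, the column names d and i, and the local SparkSession settings are assumptions for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Local session for the sketch; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("zip-by-row-number-sketch").getOrCreate()
import spark.implicits._

val doubles = List(1.5, 2.5, 3.5).toDF("d")
val ints = List(10, 20, 30).toDF("i")

// Number the rows of each DataFrame by an ordering column (no partitionBy, so one partition).
val doublesNumbered = doubles.withColumn("rowNr", row_number().over(Window.orderBy(col("d"))))
val intsNumbered = ints.withColumn("rowNr", row_number().over(Window.orderBy(col("i"))))

// Joining on the row number pairs up the n-th row of each side.
val zipped = doublesNumbered.join(intsNumbered, "rowNr").orderBy("rowNr")
val zippedRows = zipped.collect()
  .map(r => (r.getAs[Int]("rowNr"), r.getAs[Double]("d"), r.getAs[Int]("i")))
  .toSeq
```

The pairing depends entirely on the ordering column chosen for each side, so this only behaves like a zip when each DataFrame has a column that defines the intended row order.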

How to join two dataframes in Scala and select on few columns from the dataframes by their index?

I have to join two dataframes, which is very similar to the task described in "Joining two DataFrames in Spark SQL and selecting columns of only one".
However, I want to select only the second column from df2. In my task, I am going to use the join function for two dataframes within a reduce function for a list of dataframes. In this list of dataframes, the column names will be different. However, in each case I would want to keep the second column of df2.
I did not find anywhere how to select a dataframe's column by its numeric index. Any help is appreciated!
EDIT:
ANSWER
I figured out the solution. Here is one way to do this:
def joinDFs(df1: DataFrame, df2: DataFrame): DataFrame = {
val df2cols = df2.columns
val desiredDf2Col = df2cols(1) // the second column
val df3 = df1.as("df1").join(df2.as("df2"), $"df1.time" === $"df2.time")
.select($"df1.*",$"df2.$desiredDf2Col")
df3
}
And then I can apply this function in a reduce operation on a list of dataframes.
var listOfDFs: List[DataFrame] = List()
// Populate listOfDFs as you want here
val joinedDF = listOfDFs.reduceLeft((x, y) => {joinDFs(x, y)})
To select the second column in your dataframe you can simply do:
val df3 = df2.select(df2.columns(1))
This will first find the second column name and then select it.
If the join and select that you want to perform in the reduce function are similar to "Joining two DataFrames in Spark SQL and selecting columns of only one", then you should do the following:
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1) map d2.columns map col: _*)
Remember that the name of the selected second column, i.e. d2.columns(1), must not also appear among the other dataframe's column names; otherwise col cannot resolve it unambiguously.
You can select multiple columns as well, but remember the note above:
import org.apache.spark.sql.functions._
d1.as("d1").join(d2.as("d2"), $"d1.id" === $"d2.id").select(Seq(1, 2) map d2.columns map col: _*)

dynamically join two spark-scala dataframes on multiple columns without hardcoding join conditions

I would like to join two spark-scala dataframes on multiple columns dynamically. I would like to avoid hard-coding the column name comparisons, as in the following statement:
val joinRes = df1.join(df2, df1("col1") === df2("col1") and df1("col2") === df2("col2"))
A solution for this already exists for PySpark, in the following question:
PySpark DataFrame - Join on multiple columns dynamically
I would like to write the same thing in spark-scala.
In Scala you can do it in a similar way to Python, but you need to use the map and reduce functions:
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
val df1 = List(("a","b"), ("b","c"), ("c","d")).toDF("col1","col2")
val df2 = List(("1","2"), ("2","c"), ("3","4")).toDF("col1","col2")
val columnsdf1 = df1.columns
val columnsdf2 = df2.columns
val joinExprs = columnsdf1
.zip(columnsdf2)
.map{case (c1, c2) => df1(c1) === df2(c2)}
.reduce(_ && _)
val dfJoinRes = df1.join(df2,joinExprs)
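The same map and reduce idea can be wrapped in a small helper that takes explicit column pairs, which also covers the case where the key columns are named differently on each side. This is only a sketch: joinOn, the sample data and the column names k1 and k2 are hypothetical and not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Local session for the sketch; in spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("dynamic-join-sketch").getOrCreate()
import spark.implicits._

// Hypothetical helper: builds left(l) === right(r) for every pair and ANDs them together.
def joinOn(left: DataFrame, right: DataFrame, pairs: Seq[(String, String)]): DataFrame = {
  require(pairs.nonEmpty, "at least one column pair is needed")
  val condition = pairs.map { case (l, r) => left(l) === right(r) }.reduce(_ && _)
  left.join(right, condition)
}

val left = List(("a", "b"), ("b", "c")).toDF("col1", "col2")
val right = List(("b", "c"), ("x", "y")).toDF("k1", "k2")

// Only the row ("b", "c") matches on both column pairs.
val joined = joinOn(left, right, Seq("col1" -> "k1", "col2" -> "k2"))
val joinedRows = joined.collect().map(_.toSeq.mkString(",")).toSeq
```

When the key columns have identical names in both dataframes, df1.join(df2, Seq("col1", "col2")) is a shorter alternative that needs no explicit condition.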