Join multiple DataFrames in parallel using sparksql - scala

I have df1, df2, df3, ... up to df10. All of the DataFrames share a uniqueId column, and each has one additional column:
df1 => (uniqueId, firstName)
df2 => (uniqueId, middleName)
df3 => (uniqueId, lastName)
and so on.
Currently I am storing the data frames in two lists
df1 to df5 in list1 = ListBuffer[DataFrame]() and
df6 to df10 in list2 = ListBuffer[DataFrame]()
followed by
val finalDf1 = list1
  .reduceLeft((df1, df2) => df1.join(df2, uniqueIdentifier, FULL_OUTER))
val finalDf2 = list2
  .reduceLeft((df1, df2) => df1.join(df2, uniqueIdentifier, FULL_OUTER))
finalDf1
  .join(finalDf2, uniqueIdentifier, FULL_OUTER).show()
With this approach the joins of df1 to df5 are still sequential, and likewise df6 to df10. How can I avoid the sequential joins and parallelize as much as possible?
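Note that reduceLeft over the joins only builds a lazy plan; nothing runs until an action such as show(). A minimal sketch, assuming the uniqueId column and the two ListBuffers above, that builds the two halves concurrently on the driver using Futures (the join execution itself is still scheduled by Spark when the action runs):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Build the plan for each half of the list on its own driver thread.
val halves = Seq(list1, list2).map { dfs =>
  Future(dfs.reduceLeft((a, b) => a.join(b, Seq("uniqueId"), "full_outer")))
}
val Seq(finalDf1, finalDf2) = Await.result(Future.sequence(halves), Duration.Inf)

// The combined plan is still lazy; Spark executes everything at show().
finalDf1.join(finalDf2, Seq("uniqueId"), "full_outer").show()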

Related

how to replace distinct() with reducebykey

I have a scenario where the code below takes more than 10 hours overall for >2 billion records. Even with a 35-instance i3 cluster the performance was still poor. I am looking for a way to replace distinct() with reduceByKey() and for suggestions to improve the performance...
val df = spark.read.parquet(out)
val df1 = df.select($"ID", $"col2", $"suffix", $"date", $"year", $"codes")
val df2 = df1.repartition(
  List(col("ID"), col("col2"), col("suffix"), col("date"),
    col("year"), col("codes")): _*
).distinct()
val df3 = df2.withColumn("codes", expr("transform(codes, (c, s) -> struct(c as c, s as s))"))
df3.createOrReplaceTempView("df3")
val df4 = spark.sql(
  """SELECT
       ID, col2, suffix,
       d.s AS seq,
       d.c AS code,
       year, date
     FROM df3
     LATERAL VIEW explode(codes) exploded_table AS d
  """)
df4.repartition(600, List(col("year"), col("date")): _*)
  .write
  .mode("overwrite")
  .partitionBy("year", "date")
  .save(OutDir)
I think distinct() is implemented with reduceByKey(reduce), but if you want to implement it yourself, you could do something like this:
val array = List((1,2), (1,3), (1,5), (1,2), (2,2), (2,2), (3,2), (3,2), (4,1), (1,3))
val pairRDD = session.sparkContext.parallelize(array)
val distinctResult = pairRDD.map(x => (x, null)).reduceByKey((x, _) => x).keys
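Applied to a DataFrame, a rough sketch of the same idea (assuming the df1 from the question and an active spark session) works on the underlying RDD of Rows:
// Dedup rows via reduceByKey on (row, 1) pairs, then rebuild a DataFrame
// with the original schema.
val dedupedRdd = df1.rdd
  .map(row => (row, 1))
  .reduceByKey((v, _) => v)
  .keys
val dedupedDf = spark.createDataFrame(dedupedRdd, df1.schema)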

Spark Join of 2 dataframes which have 2 different column names in list

Is there a way to join two Spark Dataframes with different column names via 2 lists?
I know that if they had the same names in a list I could do the following:
val joindf = df1.join(df2, Seq("col_a", "col_b"), "left")
or if I knew the different column names I could do this:
df1.join(
df2,
df1("col_a") <=> df2("col_x")
&& df1("col_b") <=> df2("col_y"),
"left"
)
Since my method is expecting inputs of 2 lists which specify which columns are to be used for the join for each DF, I was wondering if Scala Spark had a way of doing this?
P.S
I'm looking for something like a python pandas merge:
joindf = pd.merge(df1, df2, left_on = list1, right_on = list2, how = 'left')
You can easily define such a method yourself:
def merge(left: DataFrame, right: DataFrame, left_on: Seq[String], right_on: Seq[String], how: String) = {
  import org.apache.spark.sql.functions.lit
  val joinExpr = left_on.zip(right_on).foldLeft(lit(true)) { case (acc, (lkey, rkey)) =>
    acc and (left(lkey) === right(rkey))
  }
  left.join(right, joinExpr, how)
}
val df1 = Seq((1, "a")).toDF("id1", "n1")
val df2 = Seq((1, "a")).toDF("id2", "n2")
val joindf = merge(df1, df2, left_on = Seq("id1", "n1"), right_on = Seq("id2", "n2"), how = "left")
If you expect two lists of strings:
val leftOn = Seq("col_a", "col_b")
val rightOn = Seq("col_x", "col_y")
Just zip and reduce:
import org.apache.spark.sql.functions.col
val on = leftOn.zip(rightOn)
  .map { case (x, y) => df1(x) <=> df2(y) }
  .reduce(_ and _)
df1.join(df2, on, "left")

Dynamically select multiple columns while joining different Dataframe in Scala Spark

I have two Spark DataFrames, df1 and df2. Is there a way to select the output columns dynamically while joining these two DataFrames? The definition below outputs all columns from df1 and df2 in case of an inner join.
def joinDF(df1: DataFrame, df2: DataFrame, joinExprs: Column, joinType: String): DataFrame = {
  val dfJoinResult = df1.join(df2, joinExprs, joinType)
  dfJoinResult
  //.select()
}
Input data:
val df1 = List(("1","new","current"), ("2","closed","saving"), ("3","blocked","credit")).toDF("id","type","account")
val df2 = List(("1","7"), ("2","5"), ("5","8")).toDF("id","value")
Expected result:
val dfJoinResult = df1
  .join(df2, df1("id") === df2("id"), "inner")
  .select(df1("type"), df1("account"), df2("value"))
dfJoinResult.schema:
StructType(StructField(type,StringType,true),
StructField(account,StringType,true),
StructField(value,StringType,true))
I have looked at options like df.select(cols.head, cols.tail: _*) but they do not allow selecting columns from both DataFrames.
Is there a way to pass selectExpr columns dynamically along with dataframe details that we want to select it from in my def? I'm using Spark 2.2.0.
It is possible to pass the select expression as a Seq[Column] to the method:
def joinDF(df1: DataFrame, df2: DataFrame, joinExpr: Column, joinType: String, selectExpr: Seq[Column]): DataFrame = {
  val dfJoinResult = df1.join(df2, joinExpr, joinType)
  dfJoinResult.select(selectExpr:_*)
}
To call the method use:
val joinExpr = df1.col("id") === df2.col("id")
val selectExpr = Seq(df1.col("type"), df1.col("account"), df2.col("value"))
val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)
This will give the desired result:
+------+-------+-----+
| type|account|value|
+------+-------+-----+
| new|current| 7|
|closed| saving| 5|
+------+-------+-----+
In the selectExpr above, it is necessary to specify which dataframe the columns are coming from. However, this can be further simplified if the following assumptions are true:
The columns to join on have the same name in both dataframes
The columns to be selected have unique names (the other dataframe does not have a column with the same name)
In this case, the joinExpr: Column can be changed to joinExpr: Seq[String] and selectExpr: Seq[Column] to selectExpr: Seq[String]:
def joinDF(df1: DataFrame, df2: DataFrame, joinExpr: Seq[String], joinType: String, selectExpr: Seq[String]): DataFrame = {
  val dfJoinResult = df1.join(df2, joinExpr, joinType)
  dfJoinResult.select(selectExpr.head, selectExpr.tail:_*)
}
Calling the method now looks cleaner:
val joinExpr = Seq("id")
val selectExpr = Seq("type", "account", "value")
val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)
Note: When the join is performed using a Seq[String] the column names of the resulting dataframe will be different as compared to using an expression. When there are columns with the same name present, there will be no way to separately select these afterwards.
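For instance, on the example dataframes above (just a sketch to illustrate the naming difference):
// Joining with Seq("id") keeps the key column once:
df1.join(df2, Seq("id"), "inner").columns
// -> Array(id, type, account, value)
// Joining with an expression keeps both id columns, distinguishable only
// through the parent DataFrame references:
df1.join(df2, df1("id") === df2("id"), "inner").columns
// -> Array(id, type, account, id, value)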
A slight modification of the solution above: select only the required columns from each DataFrame before performing the join, which has a little less overhead since fewer columns take part in the JOIN operation.
val dfJoinResult = df1.select("column1", "column2").join(df2.select("col1"), joinExpr, joinType)
But remember to also select the columns on which you will be joining, since the selection happens first and the join then operates on the selected data.
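On the example dataframes from the question, a minimal sketch of this could look like:
// Keep the join key plus the columns needed downstream, then join the
// slimmed-down frames and drop the key from the result.
val dfJoinResult = df1.select("id", "type", "account")
  .join(df2.select("id", "value"), Seq("id"), "inner")
  .select("type", "account", "value")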

dynamically join two spark-scala dataframes on multiple columns without hardcoding join conditions

I would like to join two Spark Scala DataFrames on multiple columns dynamically. I would like to avoid hardcoding the column name comparison as shown in the following statement:
val joinRes = df1.join(df2, df1("col1") === df2("col1") and df1("col2") === df2("col2"))
A solution for this already exists in the PySpark version, provided in the following link:
PySpark DataFrame - Join on multiple columns dynamically
I would like to write the same code using Spark Scala.
In Scala you do it in a similar way to Python, but you need to use the map and reduce functions:
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
val df1 = List(("a", "b"), ("b", "c"), ("c", "d")).toDF("col1", "col2")
val df2 = List(("1", "2"), ("2", "c"), ("3", "4")).toDF("col1", "col2")
val columnsdf1 = df1.columns
val columnsdf2 = df2.columns
val joinExprs = columnsdf1
  .zip(columnsdf2)
  .map { case (c1, c2) => df1(c1) === df2(c2) }
  .reduce(_ && _)
val dfJoinRes = df1.join(df2, joinExprs)
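Since the join condition is an expression, the result keeps df2's col1 and col2 as duplicate columns. One way to drop the right-hand copies afterwards (a sketch, assuming the columnsdf2 array above):
// Drop df2's copies of the join columns by referencing them through df2.
val dfJoinResSlim = columnsdf2.foldLeft(dfJoinRes)((df, c) => df.drop(df2(c)))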

Merging Dataframes in Spark

I have two DataFrames, say A and B. I would like to join them on a key column and create another DataFrame. When the keys match in A and B, I need to take the row from B, not from A.
For example:
DataFrame A:
Employee1, salary100
Employee2, salary50
Employee3, salary200
DataFrame B
Employee1, salary150
Employee2, salary100
Employee4, salary300
The resulting DataFrame should be:
DataFrame C:
Employee1, salary150
Employee2, salary100
Employee3, salary200
Employee4, salary300
How can I do this in Spark & Scala?
Try:
dfA.registerTempTable("dfA")
dfB.registerTempTable("dfB")
sqlContext.sql("""
  SELECT coalesce(dfA.employee, dfB.employee),
         coalesce(dfB.salary, dfA.salary)
  FROM dfA FULL OUTER JOIN dfB
  ON dfA.employee = dfB.employee""")
or
sqlContext.sql("""
  SELECT coalesce(dfA.employee, dfB.employee),
         CASE WHEN dfB.employee IS NOT NULL THEN dfB.salary
              ELSE dfA.salary
         END
  FROM dfA FULL OUTER JOIN dfB
  ON dfA.employee = dfB.employee""")
Assuming dfA and dfB each have two columns, emp and sal, you can use the following:
import org.apache.spark.sql.{functions => f}
val dfB1 = dfB
  .withColumnRenamed("sal", "salB")
  .withColumnRenamed("emp", "empB")
val joined = dfA
  .join(dfB1, 'emp === 'empB, "outer")
  .select(
    f.coalesce('empB, 'emp).as("emp"),
    f.coalesce('salB, 'sal).as("sal")
  )
NB: each DataFrame should have only one row per given value of emp.
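If that is not guaranteed, one option (a sketch, not part of the answer above) is to collapse duplicates before joining; dropDuplicates keeps an arbitrary row per key:
val dfADedup = dfA.dropDuplicates("emp")
val dfB1Dedup = dfB1.dropDuplicates("empB")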