Join Dataframes dynamically using Spark Scala when JOIN columns differ - scala

Dynamically select multiple columns while joining different Dataframe in scala spark
From the above link, I was able to get the join expression working. But when the column names differ, we cannot use Seq(columns) and have to build the join condition dynamically. Here left_ds and right_ds are the DataFrames I want to join.
Below, I want to join on id = acc_id and acc_no = number:
left_ds => id,acc_no,name,ph
right_ds => acc_id,number,location
val joinKeys="id,acc_id|acc_no,number"
val joinKeyPair: Array[(String, String)] = joinKeys.split("\\|").map(_.split(",")).map(x => x(0).toUpperCase -> x(1).toUpperCase)
val joinExpr: Column = joinKeyPair.map { case (ltable_col, rtable_col) =>left_ds.col(ltable_col) === right_ds.col(rtable_col)}.reduce(_ and _)
left_ds.join(right_ds, joinExpr, "left_outer")
Above is the join expression I tried, but it is not working. Is there a way to achieve this without using Seq when the join column names differ? If the number of join keys increases, the code should still work dynamically.

It should work fine with aliases:
import org.apache.spark.sql.functions.col
val conditionArrays = joinKeys.split("\\|").map(c => c.split(","))
val joinExpr = conditionArrays.map { case Array(a, b) => col("a." + a) === col("b." + b) }.reduce(_ and _)
left_ds.alias("a").join(right_ds.alias("b"), joinExpr, "left_outer")
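The `case Array(a, b)` pattern above throws a MatchError if any pair in the key string is malformed. As a minimal sketch in plain Scala (no Spark needed), the parsing step could be validated up front. The helper name `parseJoinKeys` is hypothetical and not part of Spark:

```scala
// Hypothetical helper: parse a spec such as "id,acc_id|acc_no,number"
// into (leftColumn, rightColumn) pairs, failing early on malformed input.
def parseJoinKeys(spec: String): Seq[(String, String)] =
  spec.split("\\|").toSeq.map { pair =>
    pair.split(",").map(_.trim) match {
      case Array(l, r) if l.nonEmpty && r.nonEmpty => (l, r)
      case _ =>
        throw new IllegalArgumentException(s"Malformed join key pair: '$pair'")
    }
  }

val pairs = parseJoinKeys("id,acc_id|acc_no,number")
```

The resulting pairs can then be mapped to `col("a." + l) === col("b." + r)` and reduced with `and`, exactly as in the answer.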

Related

How to change column type for a list of dataframe columns

I'm trying to change the type of a list of columns for a Dataframe in Spark 1.6.0.
All the examples found so far however only allow casting for a single column (df.withColumn) or for all the columns in the dataframe:
val castedDF = filteredDf.columns.foldLeft(filteredDf)((filteredDf, c) => filteredDf.withColumn(c, col(c).cast("String")))
Is there any efficient, batch way of doing this for a list of columns in the dataframe?
There is nothing wrong with withColumn* but you can use select if you prefer:
import org.apache.spark.sql.functions.col
val columnsToCast: Set[String] = ???  // fill in the names of the columns to cast
val outputType: String = "string"
df.select(df.columns map (
c => if(columnsToCast.contains(c)) col(c).cast(outputType) else col(c)
): _*)
* Execution plan will be the same for a single select as with chained withColumn.
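If you prefer the withColumn style, folding over only the listed columns is another option. This is a sketch, assuming Spark 2.x or later (for SparkSession) and a local session; the sample data and the column names `a`, `b`, `c` are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[1]").appName("cast-columns").getOrCreate()
import spark.implicits._

// Hypothetical sample data and list of columns to cast.
val df = Seq((1, 2.5, "x"), (2, 3.5, "y")).toDF("a", "b", "c")
val columnsToCast = Seq("a", "b")

// Fold over the listed columns only, replacing each one with its cast version.
// Columns not in the list are left untouched.
val casted = columnsToCast.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, col(c).cast("string"))
}
```

Because withColumn with an existing name replaces the column in place, the column order of `casted` stays the same as in `df`.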

Join two Dataframe without a common field in Spark-scala

I have two dataframes in Spark Scala, one of which consists of a single column. I have to join them, but they have no column in common. The number of rows is the same.
val userFriends=userJson.select($"friends",$"user_id")
val x = userFriends.select($"friends")
.rdd
.map(x => x.getList(0).toArray.map(_.toString))
val y = x.map(z=>z.count(z=>true)).toDF("friendCount")
I have to join userFriends with y
It's not possible to join them without a common field, unless you can rely on an ordering. In that case you can add a row number (with a window function) to both dataframes and join on the row number.
But in your case this does not seem necessary, just keep the user_id column in your dataframe, something like this should work:
val userFriends=userJson.select($"friends",$"user_id")
val result_df =
userFriends.select($"friends",$"user_id")
.rdd
.map(x => (x.getList(0).toArray.map(_.toString).count(z=>true), x.getInt(1)))
.toDF("friendsCount","user_id")
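For cases where the ordering really can be relied on, the row-number approach mentioned above could be sketched as follows. This assumes Spark 2.x or later and a local session; the sample data is hypothetical, and ordering by monotonically_increasing_id() is one common way to number rows in their current order, not the only one:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

val spark = SparkSession.builder().master("local[1]").appName("row-number-join").getOrCreate()
import spark.implicits._

// Hypothetical sample data: same number of rows, no shared column.
val left = Seq("u1", "u2", "u3").toDF("user_id")
val right = Seq(10, 20, 30).toDF("friendCount")

// Number the rows of each DataFrame in their current order.
// Note: a window without partitionBy moves all rows into a single partition.
val w = Window.orderBy(monotonically_increasing_id())
val leftNumbered = left.withColumn("rn", row_number().over(w))
val rightNumbered = right.withColumn("rn", row_number().over(w))

// Join on the generated row number, then drop it.
val joined = leftNumbered.join(rightNumbered, Seq("rn")).drop("rn")
```

Since the un-partitioned window collects everything into one partition, this is best kept for small inputs; keeping a real key column, as in the answer, avoids the problem entirely.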

dynamically join two spark-scala dataframes on multiple columns without hardcoding join conditions

I would like to join two spark-scala dataframes on multiple columns dynamically. I would like to avoid hard-coding the column name comparisons as shown in the following statement:
val joinRes = df1.join(df2, df1("col1") === df2("col1") and df1("col2") === df2("col2"))
A solution to this already exists for PySpark, provided in the following link:
PySpark DataFrame - Join on multiple columns dynamically
I would like to write the same code using spark-scala.
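In Scala you do it in a similar way to Python, but you need to use the map and reduce functions:
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
// toDF with two column names needs two-element tuples, not comma-separated strings
val df1 = List(("a", "b"), ("b", "c"), ("c", "d")).toDF("col1","col2")
val df2 = List(("1", "2"), ("2", "c"), ("3", "4")).toDF("col1","col2")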
val columnsdf1 = df1.columns
val columnsdf2 = df2.columns
val joinExprs = columnsdf1
.zip(columnsdf2)
.map{case (c1, c2) => df1(c1) === df2(c2)}
.reduce(_ && _)
val dfJoinRes = df1.join(df2,joinExprs)

Conditional Join in Spark DataFrame

I am trying to join two DataFrames conditionally.
I have two DataFrames, A and B.
A contains the id, m_cd and c_cd columns.
B contains the m_cd, c_cd and record columns.
The conditions are:
If m_cd is null, then join c_cd of A with B
If m_cd is not null, then join m_cd of A with B
We can use "when" and "otherwise()" in the withColumn() method of a DataFrame, so is there any way to do the same for a join?
I have already done this using a union, but wanted to know if any other option is available.
You can use the "when" / "otherwise" in the join condition:
case class Foo(m_cd: Option[Int], c_cd: Option[Int])
val dfA = spark.createDataset(Array(
Foo(Some(1), Some(2)),
Foo(Some(2), Some(3)),
Foo(None: Option[Int], Some(4))
))
val dfB = spark.createDataset(Array(
Foo(Some(1), Some(5)),
Foo(Some(2), Some(6)),
Foo(Some(10), Some(4))
))
val joinCondition = when($"a.m_cd".isNull, $"a.c_cd"===$"b.c_cd")
.otherwise($"a.m_cd"===$"b.m_cd")
dfA.as('a).join(dfB.as('b), joinCondition).show
It might still be more readable to use the union, though.
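For comparison, the union-based alternative might look roughly like the following. This is a sketch, assuming Spark 2.x or later (where `union` replaces `unionAll`) and a local session. The sample rows, including the `id` and `record` values, are hypothetical and only follow the column layout described in the question:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("conditional-join-union").getOrCreate()
import spark.implicits._

// Hypothetical data: A(id, m_cd, c_cd) and B(m_cd, c_cd, record).
val a = Seq[(Int, Option[Int], Option[Int])](
  (1, Some(1), Some(2)),
  (2, Some(2), Some(3)),
  (3, None, Some(4))
).toDF("id", "m_cd", "c_cd")
val b = Seq[(Option[Int], Option[Int], String)](
  (Some(1), Some(5), "r1"),
  (Some(2), Some(6), "r2"),
  (Some(10), Some(4), "r3")
).toDF("m_cd", "c_cd", "record")

// Rows of A with a non-null m_cd are joined on m_cd.
val byM = a.filter($"m_cd".isNotNull).as("a")
  .join(b.as("b"), $"a.m_cd" === $"b.m_cd")
  .select($"a.id", $"b.record")

// Rows of A with a null m_cd are joined on c_cd.
val byC = a.filter($"m_cd".isNull).as("a")
  .join(b.as("b"), $"a.c_cd" === $"b.c_cd")
  .select($"a.id", $"b.record")

val result = byM.union(byC)
```

Each branch is a plain equi-join, which keeps the two cases easy to read at the cost of scanning A twice.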
In case someone is trying to do it in PySpark, here's the syntax:
join_condition = when(df1.azure_resourcegroup.startswith('a_string'),df1.some_field == df2.somefield)\
.otherwise((df1.servicename == df2.type) &
(df1.resourcegroup == df2.esource_group) &
(df1.subscriptionguid == df2.subscription_id))
df1 = df1.join(df2,join_condition,how='left')

Joining multiple pairedrdds

I have a question regarding joining multiple RDDs simultaneously. I have about 8 paired RDDs of type RDD[(String, mutable.HashSet[String])]. I would like to join them by key. I can join two of them using Spark's join or cogroup.
However, is there a built-in way to join all of them? I can join two at a time and then join the resulting RDD with the next one, but if there is a better way, I would like to use it.
There is no built-in method to join multiple RDDs. Assuming this question is related to the previous one and you want to combine sets for each key you can simply use union followed by reduceByKey:
val rdds = Seq(rdd1, rdd2, ..., rdd8)
val combined: RDD[(String, mutable.HashSet[String])] = sc
.union(rdds)
.reduceByKey(_ ++ _)
If not you can try to reduce a collection of RDDs:
val combined: RDD[(String, Seq[mutable.HashSet[String]])] = rdds
.map(_.mapValues(s => Seq(s)))
.reduce((a, b) => a.join(b).mapValues{case (s1, s2) => s1 ++ s2})