I am trying to join two DataFrames with a condition.
I have two DataFrames, A and B.
A contains the columns id, m_cd and c_cd
B contains the columns m_cd, c_cd and record
The conditions are:
If m_cd is null, join A to B on c_cd
If m_cd is not null, join A to B on m_cd
We can use "when" and "otherwise()" in the withColumn() method of a DataFrame, so is there any way to do the same for a DataFrame join?
I have already done this using a union, but wanted to know if there is any other option available.
You can use "when" / "otherwise" in the join condition:
import org.apache.spark.sql.functions.when
import spark.implicits._

case class Foo(m_cd: Option[Int], c_cd: Option[Int])

val dfA = spark.createDataset(Array(
  Foo(Some(1), Some(2)),
  Foo(Some(2), Some(3)),
  Foo(None: Option[Int], Some(4))
))

val dfB = spark.createDataset(Array(
  Foo(Some(1), Some(5)),
  Foo(Some(2), Some(6)),
  Foo(Some(10), Some(4))
))

// Join on c_cd when m_cd is null, otherwise join on m_cd
val joinCondition = when($"a.m_cd".isNull, $"a.c_cd" === $"b.c_cd")
  .otherwise($"a.m_cd" === $"b.m_cd")

dfA.as('a).join(dfB.as('b), joinCondition).show
It might still be more readable to use the union, though.
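For reference, a minimal sketch of what that union alternative could look like with the dfA/dfB defined above (an illustrative assumption, not code from the question): split A on whether m_cd is null, join each part on its own key, then union the results.
// Union alternative (sketch): handle each case as its own join, then combine
val matchedOnM = dfA.as('a).filter($"a.m_cd".isNotNull)
  .join(dfB.as('b), $"a.m_cd" === $"b.m_cd")
val matchedOnC = dfA.as('a).filter($"a.m_cd".isNull)
  .join(dfB.as('b), $"a.c_cd" === $"b.c_cd")
val unionJoined = matchedOnM.union(matchedOnC)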
In case someone is trying to do it in PySpark, here's the syntax:
from pyspark.sql.functions import when

join_condition = when(df1.azure_resourcegroup.startswith('a_string'), df1.some_field == df2.somefield)\
    .otherwise((df1.servicename == df2.type) &
               (df1.resourcegroup == df2.esource_group) &
               (df1.subscriptionguid == df2.subscription_id))

df1 = df1.join(df2, join_condition, how='left')
Related
Dynamically select multiple columns while joining different Dataframe in scala spark
From the above link, I was able to get the join expression working, but what if the column names are different? Then we cannot use Seq(columns) and need to build the join dynamically. Here left_ds and right_ds are the dataframes which I want to join.
Below, I want to join on the columns id = acc_id and acc_no = number.
left_ds => id,acc_no,name,ph
right_ds => acc_id,number,location
val joinKeys = "id,acc_id|acc_no,number"
val joinKeyPair: Array[(String, String)] = joinKeys.split("\\|").map(_.split(",")).map(x => x(0).toUpperCase -> x(1).toUpperCase)
val joinExpr: Column = joinKeyPair.map { case (ltable_col, rtable_col) => left_ds.col(ltable_col) === right_ds.col(rtable_col) }.reduce(_ and _)
left_ds.join(right_ds, joinExpr, "left_outer")
Above is the join expression I was trying, but it is not working. Is there a way to achieve this when the join column names are different, without using Seq? If the number of join keys increases, the code should still work dynamically.
With aliases it should work fine:
import org.apache.spark.sql.functions.col
// Split "id,acc_id|acc_no,number" into pairs and AND together one equality per pair
val conditionArrays = joinKeys.split("\\|").map(c => c.split(","))
val joinExpr = conditionArrays.map { case Array(a, b) => col("a." + a) === col("b." + b) }.reduce(_ and _)
left_ds.alias("a").join(right_ds.alias("b"), joinExpr, "left_outer")
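For the joinKeys string in the question, this should expand to col("a.id") === col("b.acc_id") and col("a.acc_no") === col("b.number"), and it keeps working as more key pairs are appended to joinKeys.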
I have two dataframes in Spark Scala, but one of them consists of a single column. I have to join them, but they have no column in common. The number of rows is the same.
val userFriends = userJson.select($"friends", $"user_id")
val x = userFriends.select($"friends")
  .rdd
  .map(x => x.getList(0).toArray.map(_.toString))
val y = x.map(z => z.count(_ => true)).toDF("friendCount")
I have to join userFriends with y.
It's not possible to join them without common fields, unless you can rely on an ordering; in that case you can use row_number (with a window function) on both dataframes and join on the row number.
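A minimal sketch of that ordering-based approach (an assumption for illustration; note that a window with no partition moves all rows to a single partition):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Give each dataframe a positional index, then join on it
val w = Window.orderBy(monotonically_increasing_id())
val leftWithRn  = userFriends.withColumn("rn", row_number().over(w))
val rightWithRn = y.withColumn("rn", row_number().over(w))
val joinedByPosition = leftWithRn.join(rightWithRn, Seq("rn")).drop("rn")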
But in your case this does not seem necessary; just keep the user_id column in your dataframe. Something like this should work:
val userFriends = userJson.select($"friends", $"user_id")
val result_df =
  userFriends
    .rdd
    // Count the friends in each row and carry the user_id along
    .map(x => (x.getList(0).toArray.map(_.toString).count(_ => true), x.getInt(1)))
    .toDF("friendsCount", "user_id")
I would like to join two spark-scala dataframes on multiple columns dynamically. I would like to avoid hard-coding the column name comparisons, as in the following statement:
val joinRes = df1.join(df2, df1("col1") === df2("col1") and df1("col2") === df2("col2"))
A solution for this already exists in PySpark, provided in the following link:
PySpark DataFrame - Join on multiple columns dynamically
I would like to write the same code using spark-scala.
In Scala you do it in a similar way to Python, but you need to use the map and reduce functions:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._

val df1 = List(("a", "b"), ("b", "c"), ("c", "d")).toDF("col1", "col2")
val df2 = List(("1", "2"), ("2", "c"), ("3", "4")).toDF("col1", "col2")

// Pair up the column names of both dataframes and AND together one equality per pair
val columnsdf1 = df1.columns
val columnsdf2 = df2.columns
val joinExprs = columnsdf1
  .zip(columnsdf2)
  .map { case (c1, c2) => df1(c1) === df2(c2) }
  .reduce(_ && _)

val dfJoinRes = df1.join(df2, joinExprs)
I have two RDDs:
The first (User ID, Mov ID, Rating, Timestamp):
data_wo_header: RDD[String]
scala> data_wo_header.take(5).foreach(println)
1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727
1,50,3.5,1112484580
and the second (User ID, Mov ID):
data_test_wo_header: RDD[String]
scala> data_test_wo_header.take(5).foreach(println)
1,2
1,367
1,1009
1,1525
1,1750
I need to join the two RDDs such that the join removes from RDD1 the (UserID, MovID) entries that are common to both.
Can someone guide me on a scala-spark join for the two RDDs?
Also, I'll need a join where the new RDD derived from RDD1 has the common items only.
First convert your RDDs to DataFrames, because DataFrames have common SQL-like APIs such as join, select, etc.
To convert your RDD[String]s to DataFrames, first parse each line into a structured record (case classes below) instead of a raw String.
import sqlContext.implicits._
case class cs1(UserID: Int, MovID: Int, Rating: String, Timestamp: String)
case class cs2(UserID: Int, MovID: Int)

val df1 = data_wo_header.map(row => {
  val splits = row.split(",")
  cs1(splits(0).toInt, splits(1).toInt, splits(2), splits(3))
}).toDF("UserID", "MovID", "Rating", "Timestamp")

val df2 = data_test_wo_header.map(row => {
  val splits = row.split(",")
  cs2(splits(0).toInt, splits(1).toInt)
}).toDF("UserID", "MovID")
Now, add a marker column to df2 (lit comes from org.apache.spark.sql.functions):
val df2Prime = df2.withColumn("isPresent", lit(1))
Then left join df1 with df2Prime: rows where isPresent is null were not in df2 (the common entries removed), while rows where isPresent is 1 are the common items only. Finally, drop the temporary isPresent flag.
val temp = df1.join(df2Prime, Seq("UserID", "MovID"), "left")
val removedCommon = temp.filter(temp("isPresent").isNull).drop("isPresent")
val commonOnly = temp.filter(temp("isPresent") === 1).drop("isPresent")
A super easy approach would be to use subtractByKey. The following works for me:
// Key both RDDs by the (UserID, MovID) pair
val data_wo_header = dropheader(data).map(_.split(",")).map(x => ((x(0), x(1)), (x(2), x(3))))
val data_test_wo_header = dropheader(data_test).map(_.split(",")).map(x => ((x(0), x(1)), 1))
// Ratings whose (UserID, MovID) key is not in the test set
val ratings_train = data_wo_header.subtractByKey(data_test_wo_header)
// The remaining ratings, i.e. those whose key is in the test set
val ratings_test = data_wo_header.subtractByKey(ratings_train)
I've 2 Dataframes, say A & B. I would like to join them on a key column & create another Dataframe. When the keys match in A & B, I need to get the row from B, not from A.
For example:
DataFrame A:
Employee1, salary100
Employee2, salary50
Employee3, salary200
DataFrame B
Employee1, salary150
Employee2, salary100
Employee4, salary300
The resulting DataFrame should be:
DataFrame C:
Employee1, salary150
Employee2, salary100
Employee3, salary200
Employee4, salary300
How can I do this in Spark & Scala?
Try:
dfA.registerTempTable("dfA")
dfB.registerTempTable("dfB")
sqlContext.sql("""
SELECT coalesce(dfA.employee, dfB.employee),
coalesce(dfB.salary, dfA.salary) FROM dfA FULL OUTER JOIN dfB
ON dfA.employee = dfB.employee""")
or
sqlContext.sql("""
SELECT coalesce(dfA.employee, dfB.employee),
CASE WHEN dfB.employee IS NOT NULL THEN dfB.salary
     ELSE dfA.salary
END FROM dfA FULL OUTER JOIN dfB
ON dfA.employee = dfB.employee""")
Assuming dfA and dfB have the two columns emp and sal, you can use the following:
import org.apache.spark.sql.{functions => f}

// Rename dfB's columns so both sides can be told apart after the outer join
val dfB1 = dfB
  .withColumnRenamed("sal", "salB")
  .withColumnRenamed("emp", "empB")

// Full outer join, then prefer B's values and fall back to A's
val joined = dfA
  .join(dfB1, 'emp === 'empB, "outer")
  .select(
    f.coalesce('empB, 'emp).as("emp"),
    f.coalesce('salB, 'sal).as("sal")
  )
NB: you should have only one row per Dataframe for a given value of emp.