Merging Dataframes in Spark - scala

I have two DataFrames, A and B. I would like to join them on a key column and create another DataFrame. When a key exists in both A and B, I need the row from B, not from A.
For example:
DataFrame A:
Employee1, salary100
Employee2, salary50
Employee3, salary200
DataFrame B
Employee1, salary150
Employee2, salary100
Employee4, salary300
The resulting DataFrame should be:
DataFrame C:
Employee1, salary150
Employee2, salary100
Employee3, salary200
Employee4, salary300
How can I do this in Spark & Scala?

Try:
dfA.registerTempTable("dfA")
dfB.registerTempTable("dfB")
sqlContext.sql("""
  SELECT coalesce(dfA.employee, dfB.employee) AS employee,
         coalesce(dfB.salary, dfA.salary) AS salary
  FROM dfA FULL OUTER JOIN dfB
  ON dfA.employee = dfB.employee""")
or
sqlContext.sql("""
  SELECT coalesce(dfA.employee, dfB.employee) AS employee,
         CASE WHEN dfB.employee IS NOT NULL THEN dfB.salary
              ELSE dfA.salary
         END AS salary
  FROM dfA FULL OUTER JOIN dfB
  ON dfA.employee = dfB.employee""")

Assuming dfA and dfB each have two columns, emp and sal, you can use the following:
import org.apache.spark.sql.{functions => f}
// The 'emp symbol syntax needs the SQL implicits in scope (import sqlContext.implicits._)
val dfB1 = dfB
  .withColumnRenamed("sal", "salB")
  .withColumnRenamed("emp", "empB")
val joined = dfA
  .join(dfB1, 'emp === 'empB, "outer")
  .select(
    f.coalesce('empB, 'emp).as("emp"),
    f.coalesce('salB, 'sal).as("sal")
  )
NB: each DataFrame should contain at most one row for a given value of emp; otherwise the join multiplies the duplicated rows.
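To make the semantics of the full outer join plus coalesce concrete, here is a minimal sketch on plain Scala Maps. It does not use Spark; the object and method names (MergePreferB, merge) are made up for illustration, and it assumes each key appears at most once per side and that salaries are never null.

```scala
object MergePreferB {
  // For every key present on either side, take B's value when B has the key
  // and fall back to A's value otherwise. This is the same preference that
  // coalesce(dfB.salary, dfA.salary) expresses over a full outer join.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    (a.keySet ++ b.keySet).map(k => k -> b.getOrElse(k, a(k))).toMap

  def main(args: Array[String]): Unit = {
    val a = Map("Employee1" -> 100, "Employee2" -> 50, "Employee3" -> 200)
    val b = Map("Employee1" -> 150, "Employee2" -> 100, "Employee4" -> 300)
    // Prints:
    // Employee1, 150
    // Employee2, 100
    // Employee3, 200
    // Employee4, 300
    merge(a, b).toSeq.sortBy(_._1).foreach { case (emp, sal) => println(s"$emp, $sal") }
  }
}
```

The result matches DataFrame C from the question: keys found only in A or only in B are kept, and shared keys take B's salary.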

Related

Dynamically select multiple columns while joining different Dataframe in Scala Spark

I have two Spark DataFrames, df1 and df2. Is there a way to select the output columns dynamically while joining them? The definition below returns all columns from df1 and df2 for an inner join.
def joinDF(df1: DataFrame, df2: DataFrame, joinExprs: Column, joinType: String): DataFrame = {
  val dfJoinResult = df1.join(df2, joinExprs, joinType)
  dfJoinResult
  //.select()
}
Input data:
val df1 = List(("1","new","current"), ("2","closed","saving"), ("3","blocked","credit")).toDF("id","type","account")
val df2 = List(("1","7"), ("2","5"), ("5","8")).toDF("id","value")
Expected result:
val dfJoinResult = df1
  .join(df2, df1("id") === df2("id"), "inner")
  .select(df1("type"), df1("account"), df2("value"))
dfJoinResult.schema:
StructType(StructField(type,StringType,true),
  StructField(account,StringType,true),
  StructField(value,StringType,true))
I have looked at options like df.select(cols.head, cols.tail: _*), but that does not allow selecting columns from both DataFrames.
Is there a way to pass the columns to select dynamically, together with the DataFrame each one comes from, to my method? I'm using Spark 2.2.0.
It is possible to pass the select expression as a Seq[Column] to the method:
def joinDF(df1: DataFrame, df2: DataFrame, joinExpr: Column, joinType: String, selectExpr: Seq[Column]): DataFrame = {
  val dfJoinResult = df1.join(df2, joinExpr, joinType)
  dfJoinResult.select(selectExpr: _*)
}
To call the method use:
val joinExpr = df1.col("id") === df2.col("id")
val selectExpr = Seq(df1.col("type"), df1.col("account"), df2.col("value"))
val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)
This will give the desired result:
+------+-------+-----+
|  type|account|value|
+------+-------+-----+
|   new|current|    7|
|closed| saving|    5|
+------+-------+-----+
In the selectExpr above, it is necessary to specify which dataframe the columns are coming from. However, this can be further simplified if the following assumptions are true:
The columns to join on have the same name in both dataframes
The columns to be selected have unique names (the other DataFrame does not have a column with the same name)
In this case, the joinExpr: Column can be changed to joinExpr: Seq[String] and selectExpr: Seq[Column] to selectExpr: Seq[String]:
def joinDF(df1: DataFrame, df2: DataFrame, joinExpr: Seq[String], joinType: String, selectExpr: Seq[String]): DataFrame = {
  val dfJoinResult = df1.join(df2, joinExpr, joinType)
  dfJoinResult.select(selectExpr.head, selectExpr.tail: _*)
}
Calling the method now looks cleaner:
val joinExpr = Seq("id")
val selectExpr = Seq("type", "account", "value")
val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)
Note: When the join is performed using a Seq[String], the join columns appear only once in the resulting DataFrame, whereas joining on an expression keeps both copies. If other columns share a name across the two DataFrames, there is no way to select them separately afterwards.
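Before relying on plain string names, it can help to check the second assumption up front. The sketch below is a hypothetical helper (ColumnNameCheck.ambiguous is not part of Spark) that works on plain column-name lists, such as the ones returned by df1.columns and df2.columns, and reports names that would remain duplicated after joining on the given keys.

```scala
object ColumnNameCheck {
  // Names that occur on both sides and are not join keys. After a join on
  // `joinKeys` these names exist twice in the result, so selecting them by
  // a bare string would be ambiguous.
  def ambiguous(left: Seq[String], right: Seq[String], joinKeys: Seq[String]): Seq[String] =
    left.intersect(right).distinct.filterNot(joinKeys.contains)

  def main(args: Array[String]): Unit = {
    // Column lists from the question: only the join key "id" is shared.
    println(ambiguous(Seq("id", "type", "account"), Seq("id", "value"), Seq("id"))) // List()
    // If the second side also had a "type" column, it would be reported.
    println(ambiguous(Seq("id", "type", "account"), Seq("id", "type"), Seq("id")))  // List(type)
  }
}
```

An empty result suggests the Seq[String] variant is safe to use; otherwise the Seq[Column] variant, which names the source DataFrame explicitly, is the safer choice.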
A slight variation on the solution above is to select the required columns from each DataFrame before performing the join, which may reduce overhead because fewer columns take part in the join:
val dfJoinResult = df1.select("column1", "column2").join(df2.select("col1"), joinExpr, joinType)
Remember to keep the join columns in each selection, because the join runs on whatever columns remain after the select.

Join two Dataframe without a common field in Spark-scala

I have two DataFrames in Spark (Scala), one of which consists of a single column. I have to join them, but they have no column in common. The number of rows is the same.
val userFriends = userJson.select($"friends", $"user_id")
val x = userFriends.select($"friends")
  .rdd
  .map(x => x.getList(0).toArray.map(_.toString))
val y = x.map(z => z.length).toDF("friendCount")
I have to join userFriends with y.
It's not possible to join them without a common field, unless you can rely on the row ordering. In that case you can add a row number (with a window function) to both DataFrames and join on it.
But in your case this does not seem necessary. Just keep the user_id column in your DataFrame; something like this should work:
val userFriends = userJson.select($"friends", $"user_id")
val result_df = userFriends
  .rdd
  // friends is column 0, user_id is column 1
  .map(x => (x.getList(0).size, x.getInt(1)))
  .toDF("friendsCount", "user_id")

Merge multiple Dataframes into one Dataframe in Spark [duplicate]

I have two DataFrame a and b.
a is like
Column 1 | Column 2
abc | 123
cde | 23
b is like
Column 1
1
2
I want to zip a and b (or even more) DataFrames which becomes something like:
Column 1 | Column 2 | Column 3
abc | 123 | 1
cde | 23 | 2
How can I do it?
An operation like this is not supported by the DataFrame API. It is possible to zip two RDDs, but to make it work both RDDs must have the same number of partitions and the same number of elements in each partition. Assuming this is the case:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}
val a: DataFrame = sc.parallelize(Seq(
  ("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val b: DataFrame = sc.parallelize(Seq(Tuple1(1), Tuple1(2))).toDF("column_3")
// Merge rows
val rows = a.rdd.zip(b.rdd).map {
  case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)
}
// Merge schemas
val schema = StructType(a.schema.fields ++ b.schema.fields)
// Create new data frame
val ab: DataFrame = sqlContext.createDataFrame(rows, schema)
If the above conditions are not met, the only option that comes to mind is adding an index and joining on it:
def addIndex(df: DataFrame) = sqlContext.createDataFrame(
  // Add index
  df.rdd.zipWithIndex.map { case (r, i) => Row.fromSeq(r.toSeq :+ i) },
  // Create schema
  StructType(df.schema.fields :+ StructField("_index", LongType, false))
)
// Add indices
val aWithIndex = addIndex(a)
val bWithIndex = addIndex(b)
// Join and clean
val ab = aWithIndex
  .join(bWithIndex, Seq("_index"))
  .drop("_index")
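The index-and-join idea can be sketched without Spark on plain Scala sequences. This is only a model of the approach above: the names IndexJoin, addIndex and zipByIndex are chosen for illustration, and a Map keyed by position stands in for the _index column.

```scala
object IndexJoin {
  // Attach a 0-based position to every element, in the spirit of RDD.zipWithIndex.
  def addIndex[A](xs: Seq[A]): Map[Long, A] =
    xs.zipWithIndex.map { case (x, i) => i.toLong -> x }.toMap

  // Inner join on the position: positions missing on either side are dropped.
  def zipByIndex[A, B](left: Seq[A], right: Seq[B]): Seq[(A, B)] = {
    val l = addIndex(left)
    val r = addIndex(right)
    l.keys.toSeq.sorted.collect { case i if r.contains(i) => (l(i), r(i)) }
  }

  def main(args: Array[String]): Unit = {
    val a = Seq(("abc", 123), ("cde", 23))
    val b = Seq(1, 2)
    // Prints:
    // ((abc,123),1)
    // ((cde,23),2)
    zipByIndex(a, b).foreach(println)
  }
}
```

Because the join is an inner join, a side with extra rows simply loses them, so it is worth comparing row counts first when every row is expected to pair up.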
The DataFrame API has no direct way to concatenate two DataFrames column-wise. One workaround is to add an ID to each row of both DataFrames and then do an inner join on it. Note that monotonically increasing IDs are unique but not consecutive and depend on how each DataFrame is partitioned, so the IDs only line up when both DataFrames have the same partition layout. This is my stub code for this approach:
// Newer Spark versions name this function monotonically_increasing_id()
import org.apache.spark.sql.functions.monotonicallyIncreasingId
val a: DataFrame = sc.parallelize(Seq(("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val aWithId: DataFrame = a.withColumn("id", monotonicallyIncreasingId)
val b: DataFrame = sc.parallelize(Seq((1), (2))).toDF("column_3")
val bWithId: DataFrame = b.withColumn("id", monotonicallyIncreasingId)
aWithId.join(bWithId, "id")
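One caveat with joining on monotonically increasing IDs is that they depend on partitioning. The sketch below is a Spark-free model of the ID layout, assuming the scheme the Spark API documentation describes for the current implementation (partition index in the upper 31 bits, record number within the partition in the lower 33 bits). The names MonotonicIdLayout and ids are illustrative, and the layout is an implementation detail rather than a stable contract.

```scala
object MonotonicIdLayout {
  // Given the number of rows in each partition, return the ID every row
  // would receive under the assumed layout: (partitionIndex << 33) + offset.
  def ids(partitionSizes: Seq[Int]): Seq[Long] =
    partitionSizes.zipWithIndex.flatMap { case (size, p) =>
      (0 until size).map(offset => (p.toLong << 33) + offset)
    }

  def main(args: Array[String]): Unit = {
    // Two rows in a single partition.
    println(ids(Seq(2)))    // List(0, 1)
    // The same two rows spread over two partitions.
    println(ids(Seq(1, 1))) // List(0, 8589934592)
  }
}
```

With these two layouts, a join on the ID would match only the first row, which is why this shortcut is reliable only when both DataFrames are partitioned identically.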
What about pure SQL?
SELECT
  room_name,
  sender_nickname,
  message_id,
  row_number() OVER (PARTITION BY room_name ORDER BY message_id) AS message_index,
  row_number() OVER (PARTITION BY room_name, sender_nickname ORDER BY message_id) AS user_message_index
FROM messages
ORDER BY room_name, message_id
I know the OP was using Scala, but if, like me, you need to know how to do this in PySpark, then try the Python code below. Like @zero323's first solution, it relies on RDD.zip() and will therefore fail if the two DataFrames don't have the same number of partitions and the same number of rows in each partition.
from pyspark.sql import Row
from pyspark.sql.types import StructType
def zipDataFrames(left, right):
    CombinedRow = Row(*left.columns + right.columns)
    def flattenRow(row):
        # row is a (left_row, right_row) pair produced by RDD.zip()
        left = row[0]
        right = row[1]
        combinedVals = [left[col] for col in left.__fields__] + [right[col] for col in right.__fields__]
        return CombinedRow(*combinedVals)
    zippedRdd = left.rdd.zip(right.rdd).map(lambda row: flattenRow(row))
    combinedSchema = StructType(left.schema.fields + right.schema.fields)
    return zippedRdd.toDF(combinedSchema)
joined = zipDataFrames(a, b)

Conditional Join in Spark DataFrame

I am trying to join two DataFrames with a condition.
I have two DataFrames, A and B.
A contains the columns id, m_cd and c_cd.
B contains the columns m_cd, c_cd and record.
The conditions are:
If m_cd is null, then join c_cd of A with B
If m_cd is not null, then join m_cd of A with B
We can use when() and otherwise() in the withColumn() method of a DataFrame, so is there any way to do the same for a join?
I have already done this using a union, but wanted to know if there is any other option available.
You can use the "when" / "otherwise" in the join condition:
import org.apache.spark.sql.functions.when
import spark.implicits._
case class Foo(m_cd: Option[Int], c_cd: Option[Int])
val dfA = spark.createDataset(Array(
  Foo(Some(1), Some(2)),
  Foo(Some(2), Some(3)),
  Foo(None: Option[Int], Some(4))
))
val dfB = spark.createDataset(Array(
  Foo(Some(1), Some(5)),
  Foo(Some(2), Some(6)),
  Foo(Some(10), Some(4))
))
// Join on c_cd when a.m_cd is null, otherwise on m_cd
val joinCondition = when($"a.m_cd".isNull, $"a.c_cd" === $"b.c_cd")
  .otherwise($"a.m_cd" === $"b.m_cd")
dfA.as('a).join(dfB.as('b), joinCondition).show
It might still be more readable to use the union, though.
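To show which pairs the when/otherwise condition keeps, here is a small Spark-free model on plain Scala collections. It reuses the sample rows from the answer; the object ConditionalJoin and its helpers sqlEq, matches and join are illustrative names, and Option stands in for a nullable column under the assumption that a comparison involving NULL never matches.

```scala
object ConditionalJoin {
  final case class Foo(m_cd: Option[Int], c_cd: Option[Int])

  // SQL-style equality: a comparison involving a missing value never matches.
  private def sqlEq(x: Option[Int], y: Option[Int]): Boolean =
    x.isDefined && x == y

  // Mirrors the join condition: compare c_cd when a.m_cd is missing,
  // otherwise compare m_cd.
  def matches(a: Foo, b: Foo): Boolean =
    if (a.m_cd.isEmpty) sqlEq(a.c_cd, b.c_cd) else sqlEq(a.m_cd, b.m_cd)

  // Inner join: keep every (a, b) pair that satisfies the condition.
  def join(left: Seq[Foo], right: Seq[Foo]): Seq[(Foo, Foo)] =
    for (a <- left; b <- right if matches(a, b)) yield (a, b)

  def main(args: Array[String]): Unit = {
    val dfA = Seq(Foo(Some(1), Some(2)), Foo(Some(2), Some(3)), Foo(None, Some(4)))
    val dfB = Seq(Foo(Some(1), Some(5)), Foo(Some(2), Some(6)), Foo(Some(10), Some(4)))
    // Prints:
    // (Foo(Some(1),Some(2)),Foo(Some(1),Some(5)))
    // (Foo(Some(2),Some(3)),Foo(Some(2),Some(6)))
    // (Foo(None,Some(4)),Foo(Some(10),Some(4)))
    join(dfA, dfB).foreach(println)
  }
}
```

The first two rows pair up on m_cd, while the third row of A has no m_cd and therefore pairs up on c_cd.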
In case someone is trying to do it in PySpark, here's the syntax:
from pyspark.sql.functions import when
join_condition = when(df1.azure_resourcegroup.startswith('a_string'), df1.some_field == df2.somefield)\
    .otherwise((df1.servicename == df2.type) &
               (df1.resourcegroup == df2.esource_group) &
               (df1.subscriptionguid == df2.subscription_id))
df1 = df1.join(df2, join_condition, how='left')
