Scala LEFT JOIN on dataframes using two columns (case insensitive) - scala

I have created the method below, which takes two DataFrames, lhs and rhs, together with the first and second join column of each. The method should return the result of a left join between the two frames on the two columns provided for each dataframe, ignoring case.
The problem I am facing is that it behaves more like an inner join: it returns 3 times the number of rows in the lhs dataframe (due to duplicate values in rhs). As it is a left join, I expected that duplication and the number of rows in the rhs dataframe would not matter.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.upper

def leftJoinCaseInsensitive(lhs: DataFrame, rhs: DataFrame, leftTableColumn: String, rightTableColumn: String, leftTableColumn1: String, rightTableColumn1: String): DataFrame = {
  // Compare both key pairs in upper case so that the match ignores case.
  lhs.join(rhs, upper(lhs.col(leftTableColumn)) === upper(rhs.col(rightTableColumn)) && upper(lhs.col(leftTableColumn1)) === upper(rhs.col(rightTableColumn1)), "left")
}

If rhs contains duplicate key values, it is normal for lhs rows to be replicated. When the join-column values of one lhs row match several rhs rows, the joined dataframe contains one copy of that lhs row for each matching rhs row.
For example, given the lhs dataframe
+--------+--------+--------+
|col1left|col2left|col3left|
+--------+--------+--------+
|a       |1       |leftside|
+--------+--------+--------+
and the rhs dataframe
+---------+---------+---------+
|col1right|col2right|col3right|
+---------+---------+---------+
|a        |1        |rightside|
|a        |1        |rightside|
+---------+---------+---------+
the left join of lhs with rhs is
+--------+--------+--------+---------+---------+---------+
|col1left|col2left|col3left|col1right|col2right|col3right|
+--------+--------+--------+---------+---------+---------+
|a       |1       |leftside|a        |1        |rightside|
|a       |1       |leftside|a        |1        |rightside|
+--------+--------+--------+---------+---------+---------+
You can have more information here

"but as it is a left join the duplication and number of rows in rhs dataframe should not matter"
Not true. Your leftJoinCaseInsensitive method looks good to me. A left join still produces more rows than the left table has whenever the right table contains duplicated key values, as shown below:
val dfR = Seq(
(1, "a", "x"),
(1, "a", "y"),
(2, "b", "z")
).toDF("k1", "k2", "val")
val dfL = Seq(
(1, "a", "u"),
(2, "b", "v"),
(3, "c", "w")
).toDF("k1", "k2", "val")
leftJoinCaseInsensitive(dfL, dfR, "k1", "k1", "k2", "k2").show
+---+---+---+----+----+----+
| k1| k2|val| k1| k2| val|
+---+---+---+----+----+----+
| 1| a| u| 1| a| y|
| 1| a| u| 1| a| x|
| 2| b| v| 2| b| z|
| 3| c| w|null|null|null|
+---+---+---+----+----+----+
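The row multiplication shown above does not depend on Spark. As a rough model in plain Scala collections (not the Spark API; `Rec`, `leftJoin` and the sample values are names made up for this sketch), a left join emits one output row per matching right row, and exactly one row with an empty right side when nothing matches:

```scala
object LeftJoinSketch {
  // Hypothetical row type for this sketch: two key columns and a value.
  final case class Rec(k1: String, k2: String, value: String)

  // Model of a case-insensitive left join on (k1, k2).
  // Each left row is paired with every matching right row;
  // a left row without any match is kept once, with None on the right.
  def leftJoin(lhs: Seq[Rec], rhs: Seq[Rec]): Seq[(Rec, Option[Rec])] =
    lhs.flatMap { l =>
      val matches = rhs.filter { r =>
        l.k1.equalsIgnoreCase(r.k1) && l.k2.equalsIgnoreCase(r.k2)
      }
      if (matches.isEmpty) Seq((l, Option.empty[Rec]))
      else matches.map(r => (l, Option(r)))
    }

  val left: Seq[Rec]  = Seq(Rec("1", "a", "u"), Rec("2", "b", "v"), Rec("3", "c", "w"))
  val right: Seq[Rec] = Seq(Rec("1", "A", "x"), Rec("1", "a", "y"), Rec("2", "b", "z"))

  // 4 rows: key (1, a) matches twice, (2, b) once, (3, c) not at all.
  val joined: Seq[(Rec, Option[Rec])] = leftJoin(left, right)
}
```

If exactly one output row per lhs row is required, the rhs has to be reduced to one row per key before joining, for instance by de-duplicating it on the upper-cased key columns, at the cost of keeping an arbitrary rhs row for each key.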

Related

isin throws stackoverflow error in withcolumn function in spark

I am using Spark 2.3 in my Scala application. I have a dataframe created from Spark SQL, named sqlDF in the sample code below. I also have a list of strings with the following items:
stringList (List[String]) items
-9,-8,-7,-6
I want to replace every value that matches an item of this list with 0, in all columns of the dataframe.
Initial dataframe
column1 | column2 | column3
1 |1 |1
2 |-5 |1
6 |-6 |1
-7 |-8 |-7
It should be turned into
column1 | column2 | column3
1 |1 |1
2 |-5 |1
6 |0 |1
0 |0 |0
For this I am iterating the statement below over all columns (more than 500) in sqlDF.
sqlDF = sqlDF.withColumn(currColumnName, when(col(currColumnName).isin(stringList:_*), 0).otherwise(col(currColumnName)))
But I get the error below. If I run this for a single column it works, but iterating over all 500 columns fails:
Exception in thread "streaming-job-executor-0" java.lang.StackOverflowError
    at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
    at scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
    at scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
    at scala.collection.immutable.List.map(List.scala:285)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
What am I missing?
Here is a different approach that applies a left anti join between each column and X, where X is your list of items converted to a dataframe. The left anti join returns, for each column, the values not present in X. The per-column results are then combined through an outer join on an id assigned with monotonically_increasing_id, and the resulting nulls are filled with 0. The outer join can be replaced with a left join for better performance, but that would exclude records consisting entirely of zeros (here id == 3):
import org.apache.spark.sql.functions.{monotonically_increasing_id, col}
val df = Seq(
(1, 1, 1),
(2, -5, 1),
(6, -6, 1),
(-7, -8, -7))
.toDF("c1", "c2", "c3")
.withColumn("id", monotonically_increasing_id())
val exdf = Seq(-9, -8, -7, -6).toDF("x")
df.columns.map{ c =>
df.select("id", c).join(exdf, col(c) === $"x", "left_anti")
}
.reduce((df1, df2) => df1.join(df2, Seq("id"), "outer"))
.na.fill(0)
.show
Output:
+---+---+---+---+
| id| c1| c2| c3|
+---+---+---+---+
| 0| 1| 1| 1|
| 1| 2| -5| 1|
| 3| 0| 0| 0|
| 2| 6| 0| 1|
+---+---+---+---+
foldLeft works well for your case, as shown below:
val df = spark.sparkContext.parallelize(Seq(
(1, 1, 1),
(2, -5, 1),
(6, -6, 1),
(-7, -8, -7)
)).toDF("a", "b", "c")
val list = Seq(-7, -8, -9)
val resultDF = df.columns.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, when(col(name).isin(list: _*), 0).otherwise(col(name)))
}
resultDF.show(false)
Output:
+---+---+---+
|a |b |c |
+---+---+---+
|1 |1 |1 |
|2 |-5 |1 |
|6 |-6 |1 |
|0 |0 |0 |
+---+---+---+
I would suggest broadcasting the list of strings:
val stringList = sc.broadcast(<Your List[String]>)
Then use it like this:
sqlDF = sqlDF.withColumn(currColumnName, when(col(currColumnName).isin(stringList.value:_*), 0).otherwise(col(currColumnName)))
Make sure the values in currColumnName are also strings; the comparison should be string to string.

Speed up spark dataframe groupBy

I am fairly inexperienced in Spark, and need help with groupBy and aggregate functions on a dataframe. Consider the following dataframe:
val df = (Seq((1, "a", "1"),
(1,"b", "3"),
(1,"c", "6"),
(2, "a", "9"),
(2,"c", "10"),
(1,"b","8" ),
(2, "c", "3"),
(3,"r", "19")).toDF("col1", "col2", "col3"))
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a| 1|
| 1| b| 3|
| 1| c| 6|
| 2| a| 9|
| 2| c| 10|
| 1| b| 8|
| 2| c| 3|
| 3| r| 19|
+----+----+----+
I need to group by col1 and col2 and calculate the mean of col3, which I can do using:
val col1df = df.groupBy("col1").agg(round(mean("col3"),2).alias("mean_col1"))
val col2df = df.groupBy("col2").agg(round(mean("col3"),2).alias("mean_col2"))
However, on a large dataframe with a few million rows and tens of thousands of unique elements in the columns to group by, it takes a very long time. Besides, I have many more columns to group by and it takes insanely long, which I am looking to reduce. Is there a better way to do the groupBy followed by the aggregation?
You could use ideas from Multiple Aggregations: grouping sets might do everything in a single shuffle, which is the most expensive operation.
Example:
val df = (Seq((1, "a", "1"),
(1,"b", "3"),
(1,"c", "6"),
(2, "a", "9"),
(2,"c", "10"),
(1,"b","8" ),
(2, "c", "3"),
(3,"r", "19")).toDF("col1", "col2", "col3"))
df.createOrReplaceTempView("data")
val grpRes = spark.sql("""select grouping_id() as gid, col1, col2, round(mean(col3), 2) as res
from data group by col1, col2 grouping sets ((col1), (col2)) """)
grpRes.show(100, false)
Output:
+---+----+----+----+
|gid|col1|col2|res |
+---+----+----+----+
|1 |3 |null|19.0|
|2 |null|b |5.5 |
|2 |null|c |6.33|
|1 |1 |null|4.5 |
|2 |null|a |5.0 |
|1 |2 |null|7.33|
|2 |null|r |19.0|
+---+----+----+----+
gid is a bit awkward to use, as it is computed from a bit mask underneath. But if your grouping columns cannot contain nulls, then you can use it for selecting the correct groups.
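The binary calculation behind gid can be sketched in plain Scala. As documented for Spark's grouping_id, each column of the GROUP BY list contributes one bit, with the first column as the most significant, and the bit is 1 when the column is not part of the current grouping set. `groupingId` below is a hypothetical helper written for this illustration, not a Spark function:

```scala
object GroupingIdSketch {
  // groupByCols: columns in the GROUP BY clause, in order.
  // groupingSet: the columns actually grouped on for one grouping set.
  // A column that is aggregated away (absent from the set) contributes a 1 bit.
  def groupingId(groupByCols: Seq[String], groupingSet: Set[String]): Int =
    groupByCols.foldLeft(0) { (acc, c) =>
      (acc << 1) | (if (groupingSet.contains(c)) 0 else 1)
    }

  val cols: Seq[String] = Seq("col1", "col2")
  val gidCol1: Int = groupingId(cols, Set("col1")) // binary 01: col2 is aggregated away
  val gidCol2: Int = groupingId(cols, Set("col2")) // binary 10: col1 is aggregated away
}
```

This agrees with the output above, where gid 1 marks the rows grouped by col1 and gid 2 marks the rows grouped by col2.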
Execution Plan:
scala> grpRes.explain
== Physical Plan ==
*(2) HashAggregate(keys=[col1#111, col2#112, spark_grouping_id#108], functions=[avg(cast(col3#9 as double))])
+- Exchange hashpartitioning(col1#111, col2#112, spark_grouping_id#108, 200)
   +- *(1) HashAggregate(keys=[col1#111, col2#112, spark_grouping_id#108], functions=[partial_avg(cast(col3#9 as double))])
      +- *(1) Expand [List(col3#9, col1#109, null, 1), List(col3#9, null, col2#110, 2)], [col3#9, col1#111, col2#112, spark_grouping_id#108]
         +- LocalTableScan [col3#9, col1#109, col2#110]
As you can see, there is a single Exchange operation, which is the expensive shuffle.

How to compare two columns data in Spark Dataframes using Scala

I want to compare two columns in a Spark DataFrame: if the value of a column (attr_value) is found in values of another (attr_valuelist) I want only that value to be kept. Otherwise, the column value should be null.
For example, given the following input
id1  id2  attrname  attr_value  attr_valuelist
1    2    test      Yes         Yes, No
2    1    test1     No          Yes, No
3    2    test2     value1      val1, Value1,value2
I would expect the following output
id1  id2  attrname  attr_value  attr_valuelist
1    2    test      Yes         Yes
2    1    test1     No          No
3    2    test2     value1      Value1
I assume, given your sample input, that the column with the search item contains a string while the search target is a sequence of strings. Also, I assume you're interested in case-insensitive search.
This is going to be the input (I added a row that yields a null, to test the behavior of the UDF I wrote):
+---+---+--------+----------+----------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+----------------------+
|1 |2 |test |Yes |[Yes, No] |
|2 |1 |test1 |No |[Yes, No] |
|3 |2 |test2 |value1 |[val1, Value1, value2]|
|3 |2 |test2 |value1 |[val1, value2] |
+---+---+--------+----------+----------------------+
You can solve your problem with a very simple UDF.
val find = udf {
(item: String, collection: Seq[String]) =>
collection.find(_.toLowerCase == item.toLowerCase)
}
val df = spark.createDataFrame(Seq(
(1, 2, "test", "Yes", Seq("Yes", "No")),
(2, 1, "test1", "No", Seq("Yes", "No")),
(3, 2, "test2", "value1", Seq("val1", "Value1", "value2")),
(3, 2, "test2", "value1", Seq("val1", "value2"))
)).toDF("id1", "id2", "attrname", "attr_value", "attr_valuelist")
df.select(
$"id1", $"id2", $"attrname", $"attr_value",
find($"attr_value", $"attr_valuelist") as "attr_valuelist")
Showing the result of the last command yields the following output:
+---+---+--------+----------+--------------+
|id1|id2|attrname|attr_value|attr_valuelist|
+---+---+--------+----------+--------------+
| 1| 2| test| Yes| Yes|
| 2| 1| test1| No| No|
| 3| 2| test2| value1| Value1|
| 3| 2| test2| value1| null|
+---+---+--------+----------+--------------+
You can execute this code in any spark-shell. If you are using this from a job you are submitting to a cluster, remember to import spark.implicits._.
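One caveat with the UDF above: it calls toLowerCase on its inputs directly, so a null attr_value or attr_valuelist would presumably fail with a NullPointerException. A null-tolerant variant of the lookup can be sketched in plain Scala. `findIgnoreCase` is a name invented here, it compares with equalsIgnoreCase rather than lower-cased copies, and wrapping it with udf is left as in the answer:

```scala
object FindSketch {
  // Returns the first element of `collection` equal to `item` ignoring case.
  // Null inputs (the JVM representation of SQL NULL) yield None instead of throwing.
  def findIgnoreCase(item: String, collection: Seq[String]): Option[String] =
    for {
      i   <- Option(item)
      cs  <- Option(collection)
      hit <- cs.find(c => c != null && c.equalsIgnoreCase(i))
    } yield hit

  val found: Option[String]    = findIgnoreCase("value1", Seq("val1", "Value1", "value2"))
  val missing: Option[String]  = findIgnoreCase("value1", Seq("val1", "value2"))
  val nullItem: Option[String] = findIgnoreCase(null, Seq("Yes", "No"))
}
```

Here `found` holds the matching list element with its original casing, while `missing` and `nullItem` are both empty, which corresponds to a null cell in the output column.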
Can you try this code? I think it will work with a SQL CASE WHEN expression.
// Assumes attr_valuelist is a plain string column; LIKE does a case-sensitive substring match.
your_dataframe.createOrReplaceTempView("tbl")
val result = sqlContext.sql(
  """select id1, id2, attrname, attr_value,
    |       case when attr_valuelist like concat('%', attr_value, '%') then attr_value else null end as attr_valuelist
    |from tbl""".stripMargin)
result.show

Iterating on columns in dataframe

I have the following data frames
df1
+----------+----+----+----+-----+
| WEEK|DIM1|DIM2| T1| T2|
+----------+----+----+----+-----+
|2016-04-02| 14|NULL|9874| 880|
|2016-04-30| 14| FR|9875| 13|
|2017-06-10| 15| PQR|9867|57721|
+----------+----+----+----+-----+
df2
+----------+----+----+----+-----+
| WEEK|DIM1|DIM2| T1| T2|
+----------+----+----+----+-----+
|2016-04-02| 14|NULL|9879| 820|
|2016-04-30| 14| FR|9785| 9|
|2017-06-10| 15| XYZ|9967|57771|
+----------+----+----+----+-----+
I need to produce my output as following -
+----------+----+----+----+-----+----+-----+-------+-------+----------+------------+
| WEEK|DIM1|DIM2| T1| T2| T1| T2|t1_diff|t2_diff|pr_primary|pr_reference|
+----------+----+----+----+-----+----+-----+-------+-------+----------+------------+
|2016-04-02| 14|NULL|9874| 880|9879| 820| -5| 60| Y| Y|
|2017-06-10| 15| PQR|9867|57721|null| null| null| null| Y| N|
|2017-06-10| 15| XYZ|null| null|9967|57771| null| null| N| Y|
|2016-04-30| 14| FR|9875| 13|9785| 9| 90| 4| Y| Y|
+----------+----+----+----+-----+----+-----+-------+-------+----------+------------+
Here, t1_diff is the difference between the left T1 and the right T1, t2_diff is the difference between the left T2 and the right T2, pr_primary is Y if the row is present in df1 (N otherwise), and pr_reference is the same for df2.
I have generated the above with following piece of code
val df1 = Seq(
("2016-04-02", "14", "NULL", 9874, 880), ("2016-04-30", "14", "FR", 9875, 13), ("2017-06-10", "15", "PQR", 9867, 57721)
).toDF("WEEK", "DIM1", "DIM2","T1","T2")
val df2 = Seq(
("2016-04-02", "14", "NULL", 9879, 820), ("2016-04-30", "14", "FR", 9785, 9), ("2017-06-10", "15", "XYZ", 9967, 57771)
).toDF("WEEK", "DIM1", "DIM2","T1","T2")
import org.apache.spark.sql.functions._
val joined = df1.as("l").join(df2.as("r"), Seq("WEEK", "DIM1", "DIM2"), "fullouter")
val j1 = joined.withColumn("t1_diff",col(s"l.T1") - col(s"r.T1")).withColumn("t2_diff",col(s"l.T2") - col(s"r.T2"))
val isPresentSubstitution = udf( (x: String, y: String) => if (x == null && y == null) "N" else "Y")
j1.withColumn("pr_primary",isPresentSubstitution(col(s"l.T1"), col(s"l.T2"))).withColumn("pr_reference",isPresentSubstitution(col(s"r.T1"), col(s"r.T2"))).show
I want to generalize this to any number of columns, not just T1 and T2. Can someone suggest a better way to do this? I am running this in Spark.
To add any number of columns like t1_diff, each with its own expression calculating its values, we need some refactoring that allows withColumn to be used in a more generic manner.
First, we need to collect the target values: the names of the target columns and the expressions that calculate their contents. This can be done with a sequence of Tuples:
val diffColumns = Seq(
("t1_diff", col("l.T1") - col("r.T1")),
("t2_diff", col("l.T2") - col("r.T2"))
)
// or, to make it more readable, create a dedicated "case class DiffColumn(colName: String, expression: Column)"
Now we can use folding to produce the joined DataFrame from joined and the sequence above:
val joinedWithDiffCols =
diffColumns.foldLeft(joined) { case(df, diffTuple) =>
df.withColumn(diffTuple._1, diffTuple._2)
}
joinedWithDiffCols contains the same data as j1 from the question.
To append new columns, you now have to modify diffColumns sequence only. You can even put the calculation of pr_primary and pr_reference in this sequence (but rename the ref to appendedColumns in this case, to be more precise).
Update
To facilitate the creation of the tuples for diffColumns, it can also be generalized, for example:
// when both column names are same:
def generateDiff(column: String): (String, Column) = generateDiff(column, column)
// when left and right column names are different:
def generateDiff(leftCol: String, rightCol: String): (String, Column) =
(s"${leftCol}_diff", col("l." + leftCol) - col("r." + rightCol))
val diffColumns = Seq("T1", "T2").map(generateDiff)
End-of-update
Assuming the columns are named same in both df1 and df2, you can do something like:
val diffCols = df1.columns
.filter(_.matches("T\\d+"))
.map(c => col(s"l.$c") - col(s"r.$c") as (s"${c.toLowerCase}_diff") )
And then use it with joined like:
joined.select((col("*") +: diffCols): _*).show(false)
//+----------+----+----+----+-----+----+-----+-------+-------+
//|WEEK |DIM1|DIM2|T1 |T2 |T1 |T2 |t1_diff|t2_diff|
//+----------+----+----+----+-----+----+-----+-------+-------+
//|2016-04-02|14 |NULL|9874|880 |9879|820 |-5 |60 |
//|2017-06-10|15 |PQR |9867|57721|null|null |null |null |
//|2017-06-10|15 |XYZ |null|null |9967|57771|null |null |
//|2016-04-30|14 |FR |9875|13 |9785|9 |90 |4 |
//+----------+----+----+----+-----+----+-----+-------+-------+
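You can do it by adding a sequence number to each dataframe and then joining the two dataframes on that number. Note that this pairs rows by position rather than by key, and the generated IDs only line up if both dataframes have the same row order and partitioning.
val df3 = df1.withColumn("SeqNum", monotonically_increasing_id())
val df4 = df2.withColumn("SeqNum", monotonically_increasing_id())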
df3.as("l").join(df4.as("r"),"SeqNum").withColumn("t1_diff",col("l.T1") - col("r.T1")).withColumn("t2_diff",col("l.T2") - col("r.T2")).drop("SeqNum").show()

How to avoid duplicate columns after join?

I have two dataframes with the following columns:
df1.columns
// Array(ts, id, X1, X2)
and
df2.columns
// Array(ts, id, Y1, Y2)
After I do
val df_combined = df1.join(df2, Seq(ts,id))
I end up with the following columns: Array(ts, id, X1, X2, ts, id, Y1, Y2). I would expect the common columns to be dropped. Is there something additional that needs to be done?
The simple answer (from the Databricks FAQ on this matter) is to perform the join where the joined columns are expressed as an array of strings (or one string) instead of a predicate.
Below is an example adapted from the Databricks FAQ but with two join columns in order to answer the original poster's question.
Here is the left dataframe:
val llist = Seq(("bob", "b", "2015-01-13", 4), ("alice", "a", "2015-04-23",10))
val left = llist.toDF("firstname","lastname","date","duration")
left.show()
/*
+---------+--------+----------+--------+
|firstname|lastname| date|duration|
+---------+--------+----------+--------+
| bob| b|2015-01-13| 4|
| alice| a|2015-04-23| 10|
+---------+--------+----------+--------+
*/
Here is the right dataframe:
val right = Seq(("alice", "a", 100),("bob", "b", 23)).toDF("firstname","lastname","upload")
right.show()
/*
+---------+--------+------+
|firstname|lastname|upload|
+---------+--------+------+
| alice| a| 100|
| bob| b| 23|
+---------+--------+------+
*/
Here is an incorrect solution, where the join columns are defined as the predicate left("firstname")===right("firstname") && left("lastname")===right("lastname").
The incorrect result is that the firstname and lastname columns are duplicated in the joined data frame:
left.join(right, left("firstname")===right("firstname") &&
left("lastname")===right("lastname")).show
/*
+---------+--------+----------+--------+---------+--------+------+
|firstname|lastname| date|duration|firstname|lastname|upload|
+---------+--------+----------+--------+---------+--------+------+
| bob| b|2015-01-13| 4| bob| b| 23|
| alice| a|2015-04-23| 10| alice| a| 100|
+---------+--------+----------+--------+---------+--------+------+
*/
The correct solution is to define the join columns as an array of strings Seq("firstname", "lastname"). The output data frame does not have duplicated columns:
left.join(right, Seq("firstname", "lastname")).show
/*
+---------+--------+----------+--------+------+
|firstname|lastname| date|duration|upload|
+---------+--------+----------+--------+------+
| bob| b|2015-01-13| 4| 23|
| alice| a|2015-04-23| 10| 100|
+---------+--------+----------+--------+------+
*/
This is an expected behavior. DataFrame.join method is equivalent to SQL join like this
SELECT * FROM a JOIN b ON joinExprs
If you want to ignore duplicate columns, just drop them or select the columns of interest afterwards. If you want to disambiguate, you can access them through the parent DataFrames:
val a: DataFrame = ???
val b: DataFrame = ???
val joinExprs: Column = ???
a.join(b, joinExprs).select(a("id"), b("foo"))
// drop equivalent
a.alias("a").join(b.alias("b"), joinExprs).drop(b("id")).drop(a("foo"))
or use aliases:
// As for now aliases don't work with drop
a.alias("a").join(b.alias("b"), joinExprs).select($"a.id", $"b.foo")
For equi-joins there exists a special shortcut syntax which takes either a sequence of strings:
val usingColumns: Seq[String] = ???
a.join(b, usingColumns)
or a single string:
val usingColumn: String = ???
a.join(b, usingColumn)
Both keep only one copy of the columns used in the join condition.
I was stuck on this for a while, and only recently came up with a solution that is quite easy.
Say a is
scala> val a = Seq(("a", 1), ("b", 2)).toDF("key", "vala")
a: org.apache.spark.sql.DataFrame = [key: string, vala: int]
scala> a.show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
and
scala> val b = Seq(("a", 1)).toDF("key", "valb")
b: org.apache.spark.sql.DataFrame = [key: string, valb: int]
scala> b.show
+---+----+
|key|valb|
+---+----+
| a| 1|
+---+----+
and I can do this to select only the value in dataframe a:
scala> a.join(b, a("key") === b("key"), "left").select(a.columns.map(a(_)) : _*).show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
You can simply use this
df1.join(df2, Seq("ts","id"),"TYPE-OF-JOIN")
Here TYPE-OF-JOIN can be
left
right
inner
fullouter
For example, I have two dataframes like this:
// df1
word count1
w1 10
w2 15
w3 20
// df2
word count2
w1 100
w2 150
w5 200
If you do fullouter join then the result looks like this
df1.join(df2, Seq("word"),"fullouter").show()
word count1 count2
w1 10 100
w2 15 150
w3 20 null
w5 null 200
try this,
val df_combined = df1.join(df2, df1("ts") === df2("ts") && df1("id") === df2("id")).drop(df2("ts")).drop(df2("id"))
This is normal SQL behavior. What I do in this case:
Drop or rename the source columns
Do the join
Drop the renamed column, if any
Here I am replacing the "fullname" column.
Some code in Java:
this
.sqlContext
.read()
.parquet(String.format("hdfs:///user/blablacar/data/year=%d/month=%d/day=%d", year, month, day))
.drop("fullname")
.registerTempTable("data_original");
this
.sqlContext
.read()
.parquet(String.format("hdfs:///user/blablacar/data_v2/year=%d/month=%d/day=%d", year, month, day))
.registerTempTable("data_v2");
this
.sqlContext
.sql(etlQuery)
.repartition(1)
.write()
.mode(SaveMode.Overwrite)
.parquet(outputPath);
Where the query is:
SELECT
d.*,
concat_ws('_', product_name, product_module, name) AS fullname
FROM
{table_source} d
LEFT OUTER JOIN
{table_updates} u ON u.id = d.id
This is something you can do only with Spark I believe (drop column from list), very very helpful!
Inner join is the default join in Spark. Below is the simple syntax for it:
leftDF.join(rightDF, "Common Col Name")
For other join types, use the syntax below:
leftDF.join(rightDF, Seq("Common Columns comma separated"), "join type")
If the column names are not common, then:
leftDF.join(rightDF, leftDF.col("x") === rightDF.col("y"), "join type")
Best practice is to give the columns different names in the two DataFrames before joining them, and to drop columns accordingly afterwards.
df1.columns =[id, age, income]
df2.column=[id, age_group]
df1.join(df2, on=df1.id== df2.id,how='inner').write.saveAsTable('table_name')
will return an error because of the duplicate columns.
Try this instead:
df2_id_renamed = df2.withColumnRenamed('id','id_2')
df1.join(df2_id_renamed, on=df1.id== df2_id_renamed.id_2,how='inner').drop('id_2')
If anyone is using spark-SQL and wants to achieve the same thing then you can use USING clause in join query.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val df1 = List((1, 4, 3), (5, 2, 4), (7, 4, 5)).toDF("c1", "c2", "C3")
val df2 = List((1, 4, 3), (5, 2, 4), (7, 4, 10)).toDF("c1", "c2", "C4")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
spark.sql("select * from table1 inner join table2 using (c1, c2)").show(false)
/*
+---+---+---+---+
|c1 |c2 |C3 |C4 |
+---+---+---+---+
|1 |4 |3 |3 |
|5 |2 |4 |4 |
|7 |4 |5 |10 |
+---+---+---+---+
*/
After I've joined multiple tables together, I run them through a simple function to rename columns in the DF if it encounters duplicates. Alternatively, you could drop these duplicate columns too.
Where Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], the columns Id and Description will be duplicated after being joined.
Names = sparkSession.sql("SELECT * FROM Names")
Dates = sparkSession.sql("SELECT * FROM Dates")
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
NamesAndDates = deDupeDfCols(NamesAndDates, '_')
NamesAndDates.saveAsTable("...", format="parquet", mode="overwrite", path="...")
Where deDupeDfCols is defined as:
def deDupeDfCols(df, separator=''):
    newcols = []
    for col in df.columns:
        if col not in newcols:
            newcols.append(col)
        else:
            for i in range(2, 1000):
                if (col + separator + str(i)) not in newcols:
                    newcols.append(col + separator + str(i))
                    break
    return df.toDF(*newcols)
With the '_' separator passed in the call above, the resulting data frame will contain columns ['Id', 'Name', 'DateId', 'Description', 'Id_2', 'Date', 'Description_2'].
Apologies this answer is in Python - I'm not familiar with Scala, but this was the question that came up when I Googled this problem and I'm sure Scala code isn't too different.
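For Scala users, the renaming logic of deDupeDfCols can be sketched as a pure function over column names. This is an assumed port, not code from the answer: `dedupeNames` is a hypothetical name, and on a real DataFrame the result would be applied with `df.toDF(newNames: _*)`.

```scala
object DedupeSketch {
  // Renames repeated column names by appending separator + counter (2, 3, ...),
  // mirroring the Python deDupeDfCols above.
  def dedupeNames(cols: Seq[String], separator: String = ""): Seq[String] =
    cols.foldLeft(Vector.empty[String]) { (seen, c) =>
      if (!seen.contains(c)) seen :+ c
      else {
        // First suffix in 2..999 that is still free. The Python version
        // appends nothing if none is free; here the name is kept unchanged.
        val renamed = (2 until 1000).iterator
          .map(i => c + separator + i)
          .find(n => !seen.contains(n))
          .getOrElse(c)
        seen :+ renamed
      }
    }

  val joinedCols: Seq[String] =
    Seq("Id", "Name", "DateId", "Description", "Id", "Date", "Description")
  val renamed: Seq[String] = dedupeNames(joinedCols, "_")
}
```

With the "_" separator, the second Id and Description become Id_2 and Description_2, while the first occurrences keep their names.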