I have two DataFrames, recommendations and movies. The columns rec1 through rec3 in recommendations hold movie ids from the movies DataFrame.
val recommendations: DataFrame = List(
(0, 1, 2, 3),
(1, 2, 3, 4),
(2, 1, 3, 4)).toDF("id", "rec1", "rec2", "rec3")
val movies = List(
(1, "the Lord of the Rings"),
(2, "Star Wars"),
(3, "Star Trek"),
(4, "Pulp Fiction")).toDF("id", "name")
What I want:
+---+---------------------+---------+------------+
| id|                 rec1|     rec2|        rec3|
+---+---------------------+---------+------------+
|  0|the Lord of the Rings|Star Wars|   Star Trek|
|  1|            Star Wars|Star Trek|Pulp Fiction|
|  2|the Lord of the Rings|Star Trek|Pulp Fiction|
+---+---------------------+---------+------------+
We can also use the functions stack() and pivot() to arrive at your expected output, joining the two dataframes only once.
// First rename the 'id' column to 'ids' to avoid duplicate names further downstream
val moviesRenamed = movies.withColumnRenamed("id", "ids")
recommendations.select($"id", expr("stack(3, 'rec1', rec1, 'rec2', rec2, 'rec3', rec3) as (rec, movie_id)"))
.where("rec is not null")
.join(moviesRenamed, col("movie_id") === moviesRenamed.col("ids"))
.groupBy("id")
.pivot("rec")
.agg(first("name"))
.show()
+---+--------------------+---------+------------+
| id| rec1| rec2| rec3|
+---+--------------------+---------+------------+
| 0|the Lord of the R...|Star Wars| Star Trek|
| 1| Star Wars|Star Trek|Pulp Fiction|
| 2|the Lord of the R...|Star Trek|Pulp Fiction|
+---+--------------------+---------+------------+
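For reference, here is what the stack expression alone produces before the join and pivot; each wide rec column becomes one (rec, movie_id) row:
recommendations
  .select($"id", expr("stack(3, 'rec1', rec1, 'rec2', rec2, 'rec3', rec3) as (rec, movie_id)"))
  .show()
+---+----+--------+
| id| rec|movie_id|
+---+----+--------+
|  0|rec1|       1|
|  0|rec2|       2|
|  0|rec3|       3|
|  1|rec1|       2|
|  1|rec2|       3|
|  1|rec3|       4|
|  2|rec1|       1|
|  2|rec2|       3|
|  2|rec3|       4|
+---+----+--------+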
I figured it out. You should create aliases for your columns just like in SQL.
val joined = recommendations
  .join(movies.select(col("id").as("id1"), 'name.as("n1")), 'id1 === recommendations.col("rec1"))
  .join(movies.select(col("id").as("id2"), 'name.as("n2")), 'id2 === recommendations.col("rec2"))
  .join(movies.select(col("id").as("id3"), 'name.as("n3")), 'id3 === recommendations.col("rec3"))
  .select('id, 'n1, 'n2, 'n3)
joined.show()
The query results in:
+---+--------------------+---------+------------+
| id| n1| n2| n3|
+---+--------------------+---------+------------+
| 0|the Lord of the R...|Star Wars| Star Trek|
| 1| Star Wars|Star Trek|Pulp Fiction|
| 2|the Lord of the R...|Star Trek|Pulp Fiction|
+---+--------------------+---------+------------+
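If the number of recommendation columns grows, the repeated joins can be generalized with a foldLeft. This is a sketch along the same lines, not part of the original answer; recCols and the idN/nN aliases are just illustrative names:
import org.apache.spark.sql.functions._

// One join per rec column, aliasing the movies columns each time to avoid
// name clashes, then keeping id plus the resolved names.
val recCols = Seq("rec1", "rec2", "rec3")
val joinedAll = recCols.zipWithIndex.foldLeft(recommendations) {
  case (acc, (rec, i)) =>
    acc.join(
      movies.select(col("id").as(s"id$i"), col("name").as(s"n$i")),
      col(s"id$i") === acc.col(rec))
}.select((col("id") +: recCols.indices.map(i => col(s"n$i"))): _*)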
I have 2 dataframes in PySpark,
df1 = spark.createDataFrame([
("s1", "artist1"),
("s2", "artist2"),
("s3", "artist3"),
],
['song_id', 'artist'])
df1.show()
df2 = spark.createDataFrame([
("s1", "2"),
("s1", "3"),
("s4", "4"),
("s4", "5")
],
['song_id', 'duration'])
df2.show()
Output:
+-------+-------+
|song_id| artist|
+-------+-------+
| s1|artist1|
| s2|artist2|
| s3|artist3|
+-------+-------+
+-------+--------+
|song_id|duration|
+-------+--------+
|     s1|       2|
|     s1|       3|
|     s4|       4|
|     s4|       5|
+-------+--------+
I apply a right_outer join and a left join on these 2 dataframes, and they both seem to give me the same result:
df1.join(df2, on="song_id", how="right_outer").show()
df2.join(df1, on="song_id", how="left").show()
Output:
+-------+-------+--------+
|song_id| artist|duration|
+-------+-------+--------+
| s1|artist1| 2|
| s1|artist1| 3|
| s4| null| 4|
| s4| null| 5|
+-------+-------+--------+
+-------+--------+-------+
|song_id|duration| artist|
+-------+--------+-------+
| s1| 2|artist1|
| s1| 3|artist1|
| s4| 4| null|
| s4| 5| null|
+-------+--------+-------+
I am not sure how to use these 2 joins effectively.
What is the difference between these 2 joins?
Left and right joins give results based on the position of the tables relative to the join keyword.
left, leftouter, and left_outer are all the same join type: they return the whole left table plus the matching records of the right table.
right, rightouter, and right_outer are likewise all the same: they return the whole right table plus the matching records of the left table.
In the code
df1.join(df2, on="song_id", how="right_outer").show()
df1 is the left table (dataframe) and df2 is the right table, and the join type is right_outer, so it shows all the rows of df2 and the matching rows of df1.
Similarly in
df2.join(df1, on="song_id", how="left").show()
df2 is the left table and df1 is the right table, and the join type is left, so it shows all records of df2 and the matching records of df1.
Hence both queries show the same result.
df1.join(df2, on="song_id", how="right_outer").show()
df1.join(df2, on="song_id", how="left").show()
In the above code, I have placed df1 as the left table in both queries.
And here are the results:
+-------+-------+--------+
|song_id| artist|duration|
+-------+-------+--------+
|     s1|artist1|       2|
|     s1|artist1|       3|
|     s4|   null|       4|
|     s4|   null|       5|
+-------+-------+--------+

+-------+-------+--------+
|song_id| artist|duration|
+-------+-------+--------+
|     s1|artist1|       2|
|     s1|artist1|       3|
|     s2|artist2|    null|
|     s3|artist3|    null|
+-------+-------+--------+
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.join.html#pyspark.sql.DataFrame.join
You can use this for reference.
I am trying to filter a dataframe in Scala by comparing two of its columns (subject and stream in this case) to a list of tuples. If the column values equal one of the tuples, the row is kept.
val df = Seq(
(0, "Mark", "Maths", "Science"),
(1, "Tyson", "History", "Commerce"),
(2, "Gerald", "Maths", "Science"),
(3, "Katie", "Maths", "Commerce"),
(4, "Linda", "History", "Science")).toDF("id", "name", "subject", "stream")
Sample input:
+---+------+-------+--------+
| id| name|subject| stream|
+---+------+-------+--------+
| 0| Mark| Maths| Science|
| 1| Tyson|History|Commerce|
| 2|Gerald| Maths| Science|
| 3| Katie| Maths|Commerce|
| 4| Linda|History| Science|
+---+------+-------+--------+
The list of tuples on which the above df needs to be filtered:
val listOfTuples = List[(String, String)] (
("Maths" , "Science"),
("History" , "Commerce")
)
Expected result :
+---+------+-------+--------+
| id| name|subject| stream|
+---+------+-------+--------+
| 0| Mark| Maths| Science|
| 1| Tyson|History|Commerce|
| 2|Gerald| Maths| Science|
+---+------+-------+--------+
You can either do it with isin on structs (needs Spark 2.2+):
val df_filtered = df
.where(struct($"subject",$"stream").isin(listOfTuples.map(typedLit(_)):_*))
or with a leftsemi join:
val df_filtered = df
.join(listOfTuples.toDF("subject","stream"),Seq("subject","stream"),"leftsemi")
You can simply filter as
val resultDF = df.filter(row => {
List(
("Maths", "Science"),
("History", "Commerce")
).contains(
(row.getAs[String]("subject"), row.getAs[String]("stream")))
})
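Note that this row-level filter runs a Scala closure per row, so it is opaque to the Catalyst optimizer, unlike the isin/leftsemi variants above. As a small cleanup (my variation), the listOfTuples value defined in the question can be reused instead of repeating the literal list:
val resultDF = df.filter { row =>
  listOfTuples.contains(
    (row.getAs[String]("subject"), row.getAs[String]("stream")))
}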
Hope this helps!
I want to compare two columns in a Spark DataFrame: if the value of one column (attr_value) is found among the values of another (attr_valuelist), I want only that value to be kept. Otherwise, the column value should be null.
For example, given the following input
+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|     attr_valuelist|
+---+---+--------+----------+-------------------+
|  1|  2|    test|       Yes|            Yes, No|
|  2|  1|   test1|        No|            Yes, No|
|  3|  2|   test2|    value1|val1, Value1,value2|
+---+---+--------+----------+-------------------+
I would expect the following output
+---+---+--------+----------+--------------+
|id1|id2|attrname|attr_value|attr_valuelist|
+---+---+--------+----------+--------------+
|  1|  2|    test|       Yes|           Yes|
|  2|  1|   test1|        No|            No|
|  3|  2|   test2|    value1|        Value1|
+---+---+--------+----------+--------------+
I assume, given your sample input, that the column with the search item contains a string while the search target is a sequence of strings. Also, I assume you're interested in case-insensitive search.
This is going to be the input (I added a column that would have yielded a null to test the behavior of the UDF I wrote):
+---+---+--------+----------+----------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+----------------------+
|1 |2 |test |Yes |[Yes, No] |
|2 |1 |test1 |No |[Yes, No] |
|3 |2 |test2 |value1 |[val1, Value1, value2]|
|3 |2 |test2 |value1 |[val1, value2] |
+---+---+--------+----------+----------------------+
You can solve your problem with a very simple UDF.
val find = udf {
(item: String, collection: Seq[String]) =>
collection.find(_.toLowerCase == item.toLowerCase)
}
val df = spark.createDataFrame(Seq(
(1, 2, "test", "Yes", Seq("Yes", "No")),
(2, 1, "test1", "No", Seq("Yes", "No")),
(3, 2, "test2", "value1", Seq("val1", "Value1", "value2")),
(3, 2, "test2", "value1", Seq("val1", "value2"))
)).toDF("id1", "id2", "attrname", "attr_value", "attr_valuelist")
df.select(
$"id1", $"id2", $"attrname", $"attr_value",
find($"attr_value", $"attr_valuelist") as "attr_valuelist")
Showing the result of the last command yields the following output:
+---+---+--------+----------+--------------+
|id1|id2|attrname|attr_value|attr_valuelist|
+---+---+--------+----------+--------------+
| 1| 2| test| Yes| Yes|
| 2| 1| test1| No| No|
| 3| 2| test2| value1| Value1|
| 3| 2| test2| value1| null|
+---+---+--------+----------+--------------+
You can execute this code in any spark-shell. If you are using this from a job you are submitting to a cluster, remember to import spark.implicits._.
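For a submitted application (rather than spark-shell), the setup would look something like this; the app name is just a placeholder:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("attr-value-lookup") // hypothetical app name
  .getOrCreate()
// Brings the $"..." interpolator and toDF into scope.
import spark.implicits._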
You can try this code. I think it will work using a SQL CASE WHEN with LIKE.
import org.apache.spark.sql.Row

val emptyRDD = sc.emptyRDD[Row]
var emptyDataframe = sqlContext.createDataFrame(emptyRDD, your_dataframe.schema)
your_dataframe.createOrReplaceTempView("tbl")
emptyDataframe = sqlContext.sql(
  """select id1, id2, attrname, attr_value,
    |       case when attr_valuelist like concat('%', attr_value, '%')
    |            then attr_value else null end as attr_valuelist
    |from tbl""".stripMargin)
emptyDataframe.show
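For comparison, here is a sketch of the same case-when logic in the DataFrame API, assuming attr_valuelist is a plain string as in the question's sample. Like the LIKE above, contains is case-sensitive:
import org.apache.spark.sql.functions._

// Keep attr_value only when it occurs as a substring of attr_valuelist;
// when(...) without otherwise(...) yields null for non-matching rows.
val result = your_dataframe.withColumn(
  "attr_valuelist",
  when(col("attr_valuelist").contains(col("attr_value")), col("attr_value")))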
I would like to remap the values of a specific column in a dataframe to predefined categories.
Example:
val df = spark.createDataFrame(Seq(
(1, "apple"),
(2, "banana"),
(3, "avocado"),
(4, "potato"))).toDF("Id", "category")
+---+--------+
| Id|category|
+---+--------+
|  1|   apple|
|  2|  banana|
|  3| avocado|
|  4|  potato|
+---+--------+
Desired output:
val df_reduced = spark.createDataFrame(Seq(
(1, "fruit"),
(2, "fruit"),
(3, "vegetable"),
(4, "vegetable"))).toDF("Id", "category")
+---+---------+
| Id| category|
+---+---------+
|  1|    fruit|
|  2|    fruit|
|  3|vegetable|
|  4|vegetable|
+---+---------+
This is the solution I came up with:
df.withColumn("category", when(col("category") === "apple", regexp_replace(col("category"), "apple", "fruit"))
.otherwise(when(col("category") === "banana", regexp_replace(col("category"), "banana", "fruit"))
.otherwise(when(col("category") === "avocado", regexp_replace(col("category"), "avocado", "vegetable"))
.otherwise(when(col("category") === "potato", regexp_replace(col("category"), "potato", "vegetable"))
))))
.show
I don't really like this nested when-otherwise approach, so I would like to know: is there a better, more idiomatic solution for this task?
You can create a lookup dataframe as
val lookupDF = spark.createDataFrame(Seq(
("apple", "fruit"),
("banana", "fruit"),
("avocado", "vegetable"),
("potato", "vegetable"))).toDF("category", "category2")
// +--------+---------+
// |category|category2|
// +--------+---------+
// |apple |fruit |
// |banana |fruit |
// |avocado |vegetable|
// |potato |vegetable|
// +--------+---------+
Since the lookup dataframe is definitely going to be small, you can use the broadcast function for the join:
import org.apache.spark.sql.functions._
df.join(broadcast(lookupDF), Seq("category"), "left")
.select(col("Id"), col("category2").as("category"))
.show(false)
which should give you
+---+---------+
|Id |category |
+---+---------+
|1 |fruit |
|2 |fruit |
|3 |vegetable|
|4 |vegetable|
+---+---------+
I hope the answer is helpful
Updated
You've commented
what about missing values? if I have a category in the original df that is not present in the lookup df? I get null, advice on how to tackle it? I would prefer to keep the original value if no match is found in the lookup table, but I am unable to do it with joins
To handle that case you can use the when/otherwise functions:
import org.apache.spark.sql.functions._
df.join(broadcast(lookupDF), Seq("category"), "left")
.select(col("Id"), when(col("category2").isNotNull, col("category2")).otherwise(col("category")).as("category"))
.show(false)
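Equivalently, coalesce expresses the same fallback a bit more compactly (a minor variation on the line above, not from the original answer):
import org.apache.spark.sql.functions._

df.join(broadcast(lookupDF), Seq("category"), "left")
  .select(col("Id"), coalesce(col("category2"), col("category")).as("category"))
  .show(false)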
I think you should use a Map and a UDF, like below:
import org.apache.spark.sql.functions._
val map = Map("Apple" -> "fruit", "Mango" -> "fruit", "potato" -> "vegetable", "avocado" -> "vegetable", "Banana" -> "fruit")
val replaceUDF = udf((name: String) => map.getOrElse(name, name))
val outputdf = df.withColumn("new_category", replaceUDF(col("category")))
Sample Output:
+---+--------+------------+
| Id|category|new_category|
+---+--------+------------+
| 1| Apple| fruit|
| 2| Banana| fruit|
| 3| potato| vegetable|
| 4| avocado| vegetable|
| 5| Mango| fruit|
+---+--------+------------+
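If you'd rather avoid a UDF (UDFs are opaque to the Catalyst optimizer), the same Map can be folded into a single when/otherwise expression; a sketch reusing the map above:
import org.apache.spark.sql.functions._

// Fold the lookup map into nested when(...) calls, with the original
// column value as the innermost fallback when no key matches.
val categoryExpr = map.foldLeft(col("category")) {
  case (fallback, (from, to)) =>
    when(col("category") === from, to).otherwise(fallback)
}
val outputdf2 = df.withColumn("new_category", categoryExpr)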
I have created the method below, which takes two DataFrames, lhs and rhs, and their respective first and second join columns as input. The method should return the result of a left join between these two frames on the two columns provided for each dataframe (ignoring case).
The problem I am facing is that it behaves more like an inner join: it returns 3 times the number of rows in the lhs data frame (due to duplicate values in rhs), but as it is a left join, the duplication and number of rows in the rhs dataframe should not matter.
def leftJoinCaseInsensitive(lhs: DataFrame, rhs: DataFrame,
    leftTableColumn: String, rightTableColumn: String,
    leftTableColumn1: String, rightTableColumn1: String): DataFrame = {
  lhs.join(rhs,
    upper(lhs.col(leftTableColumn)) === upper(rhs.col(rightTableColumn)) &&
      upper(lhs.col(leftTableColumn1)) === upper(rhs.col(rightTableColumn1)),
    "left")
}
If there are duplicate values in rhs, then it is normal for lhs rows to get replicated. If the joining-column values of an lhs row match multiple rhs rows, the joined dataframe will contain one copy of that lhs row for each matching rhs row.
for example
lhs dataframe
+--------+--------+--------+
|col1left|col2left|col3left|
+--------+--------+--------+
|a |1 |leftside|
+--------+--------+--------+
And
rhs dataframe
+---------+---------+---------+
|col1right|col2right|col3right|
+---------+---------+---------+
|a |1 |rightside|
|a |1 |rightside|
+---------+---------+---------+
Then it is normal for the left join to produce
lhs left joined with rhs:
+--------+--------+--------+---------+---------+---------+
|col1left|col2left|col3left|col1right|col2right|col3right|
+--------+--------+--------+---------+---------+---------+
|a |1 |leftside|a |1 |rightside|
|a |1 |leftside|a |1 |rightside|
+--------+--------+--------+---------+---------+---------+
but as it is a left join the duplication and number of rows in rhs
dataframe should not matter
Not true. Your leftJoinCaseInsensitive method looks good to me. A left join can still produce more rows than the left table has if the right table contains duplicate key column(s), as shown below:
val dfR = Seq(
(1, "a", "x"),
(1, "a", "y"),
(2, "b", "z")
).toDF("k1", "k2", "val")
val dfL = Seq(
(1, "a", "u"),
(2, "b", "v"),
(3, "c", "w")
).toDF("k1", "k2", "val")
val res = leftJoinCaseInsensitive(dfL, dfR, "k1", "k1", "k2", "k2")
res.show
+---+---+---+----+----+----+
| k1| k2|val| k1| k2| val|
+---+---+---+----+----+----+
| 1| a| u| 1| a| y|
| 1| a| u| 1| a| x|
| 2| b| v| 2| b| z|
| 3| c| w|null|null|null|
+---+---+---+----+----+----+
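If the row multiplication is unwanted, one option (not part of the original answer) is to deduplicate the right side on its join keys before calling the method. Note this sketch deduplicates on the raw values, not the uppercased ones:
// Collapse duplicate join keys on the right side so each left row matches
// at most one right row, preserving the left table's row count.
val dfRDeduped = dfR.dropDuplicates("k1", "k2")
leftJoinCaseInsensitive(dfL, dfRDeduped, "k1", "k1", "k2", "k2").show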