Joining data in Scala using array_contains() method

I have the data below in Scala, in a Spark environment:
val abc = Seq(
(Array("A"),0.1),
(Array("B"),0.11),
(Array("C"),0.12),
(Array("A","B"),0.24),
(Array("A","C"),0.27),
(Array("B","C"),0.30),
(Array("A","B","C"),0.4)
).toDF("channel_set", "rate")
abc.show(false)
abc.createOrReplaceTempView("abc")
import org.apache.spark.sql.functions.{col, size}
// Add the number of channels in each set as a new column.
val df = abc.withColumn("totalChannels", size(col("channel_set")))
df.show()
+-----------+----+-------------+
|channel_set|rate|totalChannels|
+-----------+----+-------------+
| [A]| 0.1| 1|
| [B]|0.11| 1|
| [C]|0.12| 1|
| [A, B]|0.24| 2|
| [A, C]|0.27| 2|
| [B, C]| 0.3| 2|
| [A, B, C]| 0.4| 3|
+-----------+----+-------------+
val oneChannelDF = df.filter($"totalChannels" === 1)
oneChannelDF.show()
oneChannelDF.createOrReplaceTempView("oneChannelDF")
+-----------+----+-------------+
|channel_set|rate|totalChannels|
+-----------+----+-------------+
| [A]| 0.1| 1|
| [B]|0.11| 1|
| [C]|0.12| 1|
+-----------+----+-------------+
val twoChannelDF = df.filter($"totalChannels" === 2)
twoChannelDF.show()
twoChannelDF.createOrReplaceTempView("twoChannelDF")
+-----------+----+-------------+
|channel_set|rate|totalChannels|
+-----------+----+-------------+
| [A, B]|0.24| 2|
| [A, C]|0.27| 2|
| [B, C]| 0.3| 2|
+-----------+----+-------------+
I want to join the oneChannel and twoChannel dataframes so that the result looks like this:
+-----------+----+-------------+------------+-------+
|channel_set|rate|totalChannels|channel_set | rate |
+-----------+----+-------------+------------+-------+
| [A]| 0.1| 1| [A,B] | 0.24 |
| [A]| 0.1| 1| [A,C] | 0.27 |
| [B]|0.11| 1| [A,B] | 0.24 |
| [B]|0.11| 1| [B,C] | 0.30 |
| [C]|0.12| 1| [A,C] | 0.27 |
| [C]|0.12| 1| [B,C] | 0.30 |
+-----------+----+-------------+------------+-------+
Basically, I need all the rows where a record from the oneChannel dataframe is present in the twoChannel dataframe.
I have tried -
spark.sql("""select * from oneChannelDF one inner join twoChannelDF two on array_contains(one.channel_set,two.channel_set)""").show()
However, I am facing this error -
org.apache.spark.sql.AnalysisException: cannot resolve 'array_contains(one.`channel_set`, two.`channel_set`)' due to data type mismatch: Arguments must be an array followed by a value of same type as the array members; line 1 pos 62;

I think I figured out the error. array_contains() expects an array as its first argument and a single value of the array's element type as its second, but I was passing two arrays. Since every array in the channel_set column of oneChannelDF has size 1, I can pass its first element instead, and the code below gets me the correct data frame.
scala> spark.sql("""select * from oneChannelDF one inner join twoChannelDF two on array_contains(two.channel_set, one.channel_set[0])""").show()
+-----------+----+-------------+-----------+----+-------------+
|channel_set|rate|totalChannels|channel_set|rate|totalChannels|
+-----------+----+-------------+-----------+----+-------------+
| [A]| 0.1| 1| [A, B]|0.24| 2|
| [A]| 0.1| 1| [A, C]|0.27| 2|
| [B]|0.11| 1| [A, B]|0.24| 2|
| [B]|0.11| 1| [B, C]| 0.3| 2|
| [C]|0.12| 1| [A, C]|0.27| 2|
| [C]|0.12| 1| [B, C]| 0.3| 2|
+-----------+----+-------------+-----------+----+-------------+
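To sanity-check the expected rows without a Spark session, the join condition can be modelled in plain Python. This is only a sketch of what array_contains(two.channel_set, one.channel_set[0]) selects on the sample data, not Spark code, and the function name membership_join is made up for illustration.

```python
# Plain-Python model of the join predicate; this is not Spark code.
one_channel = [(["A"], 0.1), (["B"], 0.11), (["C"], 0.12)]
two_channel = [(["A", "B"], 0.24), (["A", "C"], 0.27), (["B", "C"], 0.30)]


def membership_join(singles, pairs):
    """Pair each single-channel row with every row whose set contains it."""
    return [
        (s_set, s_rate, p_set, p_rate)
        for s_set, s_rate in singles
        for p_set, p_rate in pairs
        # Same test as array_contains(p_set, s_set[0]) in the SQL query.
        if s_set[0] in p_set
    ]


joined = membership_join(one_channel, two_channel)
for row in joined:
    print(row)
```

Like the SQL version, this relies on every array in the single-channel data having exactly one element.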

Related

How to combine dataframes with no common columns?

I have 2 data frames
val df1 = Seq(("1","2","3"),("4","5","6")).toDF("A","B","C")
df1.show
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
+---+---+---+
and
val df2 = Seq(("11","22","33"),("44","55","66")).toDF("D","E","F")
df2.show
+---+---+---+
| D| E| F|
+---+---+---+
| 11| 22| 33|
| 44| 55| 66|
+---+---+---+
I need to combine the ones above to get
val df3 = Seq(("1","2","3","","",""),("4","5","6","","",""),("","","","11","22","33"),("","","","44","55","66"))
.toDF("A","B","C","D","E","F")
df3.show
+---+---+---+---+---+---+
| A| B| C| D| E| F|
+---+---+---+---+---+---+
| 1| 2| 3| | | |
| 4| 5| 6| | | |
| | | | 11| 22| 33|
| | | | 44| 55| 66|
+---+---+---+---+---+---+
Right now I'm creating the missing columns for all dataframes manually to get to a common structure and then using a union. This code is specific to these dataframes and is not scalable.
I'm looking for a solution that will work with x dataframes with y columns each.
You can manually create missing columns in the two data frames and then union them:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
// Union of the column names of both data frames.
val allCols = df1.columns.toSet.union(df2.columns.toSet).toArray
// Add each missing column as an empty string, then fix the column order.
val createMissingCols = (df: DataFrame, allCols: Array[String]) => allCols.foldLeft(df)(
(_df, _col) => if (_df.columns.contains(_col)) _df else _df.withColumn(_col, lit(""))
).select(allCols.head, allCols.tail: _*)
// select is needed to make sure the two data frames have the same order of columns
createMissingCols(df1, allCols).union(createMissingCols(df2, allCols)).show
+---+---+---+---+---+---+
| E| F| A| B| C| D|
+---+---+---+---+---+---+
| | | 1| 2| 3| |
| | | 4| 5| 6| |
| 22| 33| | | | 11|
| 55| 66| | | | 44|
+---+---+---+---+---+---+
A much simpler way of doing this is creating a full outer join and setting the join expression/condition to false:
val df1 = Seq(("1","2","3"),("4","5","6")).toDF("A","B","C")
val df2 = Seq(("11","22","33"),("44","55","66")).toDF("D","E","F")
val joined = df1.join(df2, lit(false), "full")
joined.show()
+----+----+----+----+----+----+
| A| B| C| D| E| F|
+----+----+----+----+----+----+
| 1| 2| 3|null|null|null|
| 4| 5| 6|null|null|null|
|null|null|null| 11| 22| 33|
|null|null|null| 44| 55| 66|
+----+----+----+----+----+----+
If you then want to set the null values to empty strings (na.fill("") only affects string columns, which is all of them here), you can add:
val withEmptyString = joined.na.fill("")
withEmptyString.show()
+---+---+---+---+---+---+
| A| B| C| D| E| F|
+---+---+---+---+---+---+
| 1| 2| 3| | | |
| 4| 5| 6| | | |
| | | | 11| 22| 33|
| | | | 44| 55| 66|
+---+---+---+---+---+---+
So, in summary, df1.join(df2, lit(false), "full").na.fill("") should do the trick.
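The question asks for something that works with x dataframes of y columns each. The idea behind both answers (collect every column name, fill what is missing, stack the rows) can be sketched in plain Python. This is a model of the logic rather than Spark code: tables are lists of dicts, and union_all_columns is a hypothetical helper name. On Spark 3.1 and later, unionByName with allowMissingColumns set to true may be worth a look for the same purpose, though it fills with null rather than an empty string.

```python
# Plain-Python model of "union with missing columns filled"; not Spark code.
def union_all_columns(tables, fill=""):
    """Stack row dicts from several tables, filling absent columns with `fill`."""
    # Collect column names in first-seen order so the output order is stable.
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Rebuild every row against the full column list.
    rows = [
        {name: row.get(name, fill) for name in columns}
        for table in tables
        for row in table
    ]
    return columns, rows


df1 = [{"A": "1", "B": "2", "C": "3"}, {"A": "4", "B": "5", "C": "6"}]
df2 = [{"D": "11", "E": "22", "F": "33"}, {"D": "44", "E": "55", "F": "66"}]

columns, rows = union_all_columns([df1, df2])
print(columns)
for row in rows:
    print([row[c] for c in columns])
```

Because the helper takes a list of tables, adding a third or fourth input needs no change to the function.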

manipulating data table joining in scala data frame

Given two data frames in Scala, T1 and T2, I want to keep only the rows of T1 whose ID exists in T2 but whose value does not exist in T2 for that ID. For instance:
T1
+---+---+
| ID|val|
+---+---+
| 1| x|
| 1| y|
| 1| z|
| 2| x|
| 2| y|
| 3| x|
| 3| y|
| 3| z|
| 3| k|
| 4| x|
| 4| y|
| 4| z|
| 5| x|
+---+---+
T2
+---+---+
| ID|val|
+---+---+
| 1| x|
| 1| y|
| 2| x|
| 3| x|
| 3| y|
| 4| x|
| 4| y|
| 4| z|
| 5| x|
+---+---+
For ID=1, values x and y exist in T2; for ID=2, value x exists in T2; for ID=3, values x and y exist in T2; for ID=4, values x, y and z exist in T2; and for ID=5, value x exists in T2. Excluding those values, we get
+---+---+
| ID|val|
+---+---+
| 1| z|
| 2| y|
| 3| z|
| 3| k|
+---+---+
as answer.
I'm trying something like
T1.join(T2, Seq("ID", "val")).filter(T1.col("ID")===T2.col("ID") && T1.col("val")===T2.col("val"))
but not sure if this is correct/efficient approach or not.
Any help would be greatly appreciated. I'm kinda new to scala.
There are several ways to achieve this.
The operation is the set difference between df1 and df2 (T1 and T2 in the question). With the DataFrame API you could use except, which returns the distinct rows of df1 that are not in df2:
val result2 = df1.except(df2)
result2.show()
+---+---+
| ID|val|
+---+---+
| 1| z|
| 3| z|
| 2| y|
| 3| k|
+---+---+
Another approach with the DataFrame API is a left anti join on both columns, which keeps the rows of df1 that have no match in df2:
val result3 = df1.join(df2, Seq("ID", "val"), "left_anti")
result3.show()
+---+---+
| ID|val|
+---+---+
| 1| z|
| 3| z|
| 2| y|
| 3| k|
+---+---+
Finally, you could collect both dataframes to the driver and compute the difference between the two sets of rows. This only suits data small enough to fit in driver memory:
val setting = (df1.collect().toSet -- df2.collect().toSet).toList
val result = sc
.parallelize(setting.map(r => (r(0).toString.toInt,r(1).toString)))
.toDF("ID","val")
result.show()
+---+---+
| ID|val|
+---+---+
| 2| y|
| 3| z|
| 3| k|
| 1| z|
+---+---+
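Neither except nor a left anti join checks the first half of the requirement, that the ID itself exists in T2. It makes no difference for the sample data, because every ID in T1 also appears in T2. The plain-Python sketch below models both conditions on the question's data. It is not Spark code, and missing_values is a made-up name.

```python
# Plain-Python model of "ID exists in T2, but (ID, val) does not"; not Spark code.
t1 = [(1, "x"), (1, "y"), (1, "z"), (2, "x"), (2, "y"),
      (3, "x"), (3, "y"), (3, "z"), (3, "k"),
      (4, "x"), (4, "y"), (4, "z"), (5, "x")]
t2 = [(1, "x"), (1, "y"), (2, "x"), (3, "x"), (3, "y"),
      (4, "x"), (4, "y"), (4, "z"), (5, "x")]


def missing_values(left, right):
    """Rows of `left` whose ID appears in `right` but whose (ID, val) pair does not."""
    right_rows = set(right)
    right_ids = {row_id for row_id, _ in right}
    return [
        (row_id, val)
        for row_id, val in left
        if row_id in right_ids and (row_id, val) not in right_rows
    ]


result = missing_values(t1, t2)
print(result)
```

A row such as (6, "x"), whose ID never occurs in T2, would be dropped here but kept by except or by the anti join alone.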

Pyspark Joining two dataframes with a collect_list

Suppose I have the following DataFrames.
How can I join them so that the resulting column (value_2) accumulates the values up to and including each record's rank, as in the final output below?
import pyspark.sql.functions as f
from pyspark.sql.window import Window
l =[( 9 , 1, 'A' ),
( 9 , 2, 'B' ),
( 9 , 3, 'C' ),
( 9 , 4, 'D' ),
( 10 , 1, 'A' ),
( 10 , 2, 'B' )]
df = spark.createDataFrame(l, ['prod','rank', 'value'])
+----+----+-----+
|prod|rank|value|
+----+----+-----+
| 9| 1| A|
| 9| 2| B|
| 9| 3| C|
| 9| 4| D|
| 10| 1| A|
| 10| 2| B|
+----+----+-----+
sh =[( 9 , ['A','B','C','D'] ),
( 10 , ['A','B'])]
sh = spark.createDataFrame(sh, ['prod', 'conc'])
+----+------------+
|prod| conc|
+----+------------+
| 9|[A, B, C, D]|
| 10| [A, B]|
+----+------------+
Final desired output:
+----+----+-----+---------+
|prod|rank|value| value_2 |
+----+----+-----+---------+
| 9| 1| A| A |
| 9| 2| B| A,B |
| 9| 3| C| A,B,C |
| 9| 4| D| A,B,C,D|
| 10| 1| A| A |
| 10| 2| B| A,B |
+----+----+-----+---------+
You can use a window function directly on df, without joining or aggregating first. In Spark 2.4+ (array_join was added in 2.4):
df.select('*',
f.array_join(
f.collect_list(df.value).over(Window.partitionBy('prod').orderBy('rank')),
','
).alias('value_2')
).show()
+----+----+-----+-------+
|prod|rank|value|value_2|
+----+----+-----+-------+
| 9| 1| A| A|
| 9| 2| B| A,B|
| 9| 3| C| A,B,C|
| 9| 4| D|A,B,C,D|
| 10| 1| A| A|
| 10| 2| B| A,B|
+----+----+-----+-------+
Or, if you don't need the array joined into a string:
df.select('*',
f.collect_list(df.value).over(Window.partitionBy('prod').orderBy('rank')).alias('value_2')
).show()
+----+----+-----+------------+
|prod|rank|value| value_2|
+----+----+-----+------------+
| 9| 1| A| [A]|
| 9| 2| B| [A, B]|
| 9| 3| C| [A, B, C]|
| 9| 4| D|[A, B, C, D]|
| 10| 1| A| [A]|
| 10| 2| B| [A, B]|
+----+----+-----+------------+
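The window in the answer is a running collection per prod, ordered by rank. A plain-Python sketch of that logic follows, assuming ranks are unique within each prod. With tied ranks, Spark's default window frame would give the tied rows the same list, which this model does not reproduce. The name running_concat is made up for illustration.

```python
# Plain-Python model of collect_list(...).over(partitionBy(prod).orderBy(rank)).
def running_concat(rows, sep=","):
    """rows are (prod, rank, value); append the running, joined value list."""
    seen = {}  # prod -> values collected so far
    out = []
    for prod, rank, value in sorted(rows, key=lambda r: (r[0], r[1])):
        seen.setdefault(prod, []).append(value)
        out.append((prod, rank, value, sep.join(seen[prod])))
    return out


data = [(9, 1, "A"), (9, 2, "B"), (9, 3, "C"), (9, 4, "D"),
        (10, 1, "A"), (10, 2, "B")]

result = running_concat(data)
for row in result:
    print(row)
```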

How to efficiently perform this column operation on a Spark Dataframe?

I have a dataframe as follows:
+---+---+---+
| F1| F2| F3|
+---+---+---+
| x| y| 1|
| x| z| 2|
| x| a| 4|
| x| a| 4|
| x| y| 1|
| t| y2| 6|
| t| y3| 4|
| t| y4| 5|
+---+---+---+
I want to add another column with value as (number of unique rows of "F1" and "F2" for each unique "F3" / total number of unique rows of "F1" and "F2").
For example, for the above table, below is the desired new dataframe:
+---+---+---+----+
| F1| F2| F3| F4|
+---+---+---+----+
| t| y4| 5| 1/6|
| x| y| 1| 1/6|
| x| y| 1| 1/6|
| x| z| 2| 1/6|
| t| y2| 6| 1/6|
| t| y3| 4| 2/6|
| x| a| 4| 2/6|
| x| a| 4| 2/6|
+---+---+---+----+
Note: for F3 = 4 there are only 2 unique (F1, F2) pairs, {(t, y3), (x, a)}. Therefore, for all occurrences of F3 = 4, F4 will be 2 / (total number of unique (F1, F2) pairs), and here there are 6 such pairs.
How to achieve the above transformation in Spark Scala?
While trying to solve your problem I learnt that you can't use distinct aggregate functions (such as countDistinct) over a window on a DataFrame.
So I created a temporary DataFrame and joined it with the original one to obtain the desired result:
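import org.apache.spark.sql.functions.{col, countDistinct}
case class Dog(F1:String, F2: String, F3: Int)
val df = Seq(Dog("x", "y", 1), Dog("x", "z", 2), Dog("x", "a", 4), Dog("x", "a", 4), Dog("x", "y", 1), Dog("t", "y2", 6), Dog("t", "y3", 4), Dog("t", "y4", 5)).toDF
// Total number of distinct (F1, F2) pairs in the whole DataFrame.
val unique_F1_F2 = df.select("F1", "F2").distinct.count
// Distinct (F1, F2) pairs per F3. Counting the two columns directly avoids the
// collisions that concat(F1, F2) can produce (for example "ab" + "c" and "a" + "bc").
val dd = df.groupBy("F3")
.agg(countDistinct(col("F1"), col("F2")).as("distinct_count"))
// Join the per-F3 counts back onto the original rows and compute the ratio.
val final_df = dd.join(df, "F3")
.withColumn("F4", col("distinct_count")/unique_F1_F2)
.drop("distinct_count")
final_df.show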
+---+---+---+-------------------+
| F3| F1| F2| F4|
+---+---+---+-------------------+
| 1| x| y|0.16666666666666666|
| 1| x| y|0.16666666666666666|
| 6| t| y2|0.16666666666666666|
| 5| t| y4|0.16666666666666666|
| 4| t| y3| 0.3333333333333333|
| 4| x| a| 0.3333333333333333|
| 4| x| a| 0.3333333333333333|
| 2| x| z|0.16666666666666666|
+---+---+---+-------------------+
I hope this is what you expected !
EDIT : I changed df.count to unique_F1_F2
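As a cross-check of the arithmetic, the same ratio can be computed in plain Python with exact fractions. This is a sketch of the calculation on the question's sample rows, not Spark code, and add_ratio is a made-up name.

```python
from fractions import Fraction

# Plain-Python model of the F4 calculation; not Spark code.
rows = [("x", "y", 1), ("x", "z", 2), ("x", "a", 4), ("x", "a", 4),
        ("x", "y", 1), ("t", "y2", 6), ("t", "y3", 4), ("t", "y4", 5)]


def add_ratio(rows):
    """Append (distinct (F1, F2) pairs for the row's F3) / (distinct pairs overall)."""
    all_pairs = {(f1, f2) for f1, f2, _ in rows}
    pairs_by_f3 = {}
    for f1, f2, f3 in rows:
        pairs_by_f3.setdefault(f3, set()).add((f1, f2))
    return [
        (f1, f2, f3, Fraction(len(pairs_by_f3[f3]), len(all_pairs)))
        for f1, f2, f3 in rows
    ]


result = add_ratio(rows)
for row in result:
    print(row)
```

Fraction reduces 2/6 to 1/3, which corresponds to the 0.3333 shown in the Spark output.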

Spark - How to apply rules defined in a dataframe to another dataframe

I'm trying to solve this kind of problem with Spark 2, but I can't find a solution.
I have a dataframe A :
+----+-------+------+
|id |COUNTRY| MONTH|
+----+-------+------+
| 1 | US | 1 |
| 2 | FR | 1 |
| 4 | DE | 1 |
| 5 | DE | 2 |
| 3 | DE | 3 |
+----+-------+------+
And a dataframe B :
+-------+------+------+
|COLUMN |VALUE | PRIO |
+-------+------+------+
|COUNTRY| US | 5 |
|COUNTRY| FR | 15 |
|MONTH | 3 | 2 |
+-------+------+------+
The idea is to apply "rules" of dataframe B on dataframe A in order to get this result :
dataframe A' :
+----+-------+------+------+
|id |COUNTRY| MONTH| PRIO |
+----+-------+------+------+
| 1 | US | 1 | 5 |
| 2 | FR | 1 | 15 |
| 4 | DE | 1 | 20 |
| 5 | DE | 2 | 20 |
| 3 | DE | 3 | 2 |
+----+-------+------+------+
I tried something like this:
dfB.collect.foreach( r =>
var dfAp = dfA.where(r.getAs("COLUMN") == r.getAs("VALUE"))
dfAp.withColumn("PRIO", lit(r.getAs("PRIO")))
)
But I'm sure it's not the right way.
What is the right strategy to solve this problem in Spark?
Assuming the set of rules is reasonably small, the simplest solution is to collect the rules into a local collection and map them to a SQL expression. The possible concerns are the size of the rule data and the size of the generated expression, which in the worst case can crash the planner:
import org.apache.spark.sql.functions.{coalesce, col, lit, when}
val df = Seq(
(1, "US", "1"), (2, "FR", "1"), (4, "DE", "1"),
(5, "DE", "2"), (3, "DE", "3")
).toDF("id", "COUNTRY", "MONTH")
val rules = Seq(
("COUNTRY", "US", 5), ("COUNTRY", "FR", 15), ("MONTH", "3", 2)
).toDF("COLUMN", "VALUE", "PRIO")
val prio = coalesce(rules.as[(String, String, Int)].collect.map {
case (c, v, p) => when(col(c) === v, p)
} :+ lit(20): _*)
df.withColumn("PRIO", prio)
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+
You can replace coalesce with least or greatest to apply the smallest or the largest matching value respectively.
With a larger set of rules you could:
1. Melt the data to convert it to a long format. Note that melt is not a built-in DataFrame method in Spark 2, so the call below assumes a user-defined helper with this signature:
val dfLong = df.melt(Seq("id"), df.columns.tail, "COLUMN", "VALUE")
2. Join with the rules by column and value.
3. Aggregate PRIO by id with an appropriate aggregation function (for example min):
val priorities = dfLong.join(rules, Seq("COLUMN", "VALUE"))
.groupBy("id")
.agg(min("PRIO").alias("PRIO"))
4. Outer join the output with df by id and fill in the default priority:
df.join(priorities, Seq("id"), "leftouter").na.fill(20)
+---+-------+-----+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+-----+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+-----+----+
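The coalesce expression in the first approach amounts to "the first matching rule wins, otherwise use the default of 20". A plain-Python sketch of that rule application on the question's data follows. It is a model of the logic, not Spark code, and apply_rules is a hypothetical name.

```python
# Plain-Python model of coalesce(when(...), when(...), ..., lit(20)); not Spark code.
records = [
    {"id": 1, "COUNTRY": "US", "MONTH": "1"},
    {"id": 2, "COUNTRY": "FR", "MONTH": "1"},
    {"id": 4, "COUNTRY": "DE", "MONTH": "1"},
    {"id": 5, "COUNTRY": "DE", "MONTH": "2"},
    {"id": 3, "COUNTRY": "DE", "MONTH": "3"},
]
# Each rule is (column name, value to match, priority), in rule order.
rules = [("COUNTRY", "US", 5), ("COUNTRY", "FR", 15), ("MONTH", "3", 2)]


def apply_rules(record, rules, default=20):
    """Return the priority of the first rule that matches, else `default`."""
    for column, value, prio in rules:
        if str(record[column]) == value:
            return prio
    return default


priorities = [apply_rules(r, rules) for r in records]
print(priorities)
```

Swapping the early return for a min or max over all matching priorities would mirror the aggregated variant described for larger rule sets.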
Let's assume the set of rules in dataframe B is limited.
I have created a dataframe "df" for the table below:
+---+-------+------+
| id|COUNTRY|MONTH|
+---+-------+------+
| 1| US| 1|
| 2| FR| 1|
| 4| DE| 1|
| 5| DE| 2|
| 3| DE| 3|
+---+-------+------+
Using a UDF with the rules hard-coded (MONTH is an integer column here):
val code = udf{(x:String,y:Int)=>if(x=="US") "5" else if (x=="FR") "15" else if (y==3) "2" else "20"}
df.withColumn("PRIO",code($"COUNTRY",$"MONTH")).show()
Output:
+---+-------+------+----+
| id|COUNTRY|MONTH|PRIO|
+---+-------+------+----+
| 1| US| 1| 5|
| 2| FR| 1| 15|
| 4| DE| 1| 20|
| 5| DE| 2| 20|
| 3| DE| 3| 2|
+---+-------+------+----+