How to efficiently perform this column operation on a Spark Dataframe? - scala

I have a dataframe as follows:
+---+---+---+
| F1| F2| F3|
+---+---+---+
| x| y| 1|
| x| z| 2|
| x| a| 4|
| x| a| 4|
| x| y| 1|
| t| y2| 6|
| t| y3| 4|
| t| y4| 5|
+---+---+---+
I want to add another column whose value is (the number of distinct (F1, F2) pairs sharing that row's F3) / (the total number of distinct (F1, F2) pairs).
For example, for the above table, below is the desired new dataframe:
+---+---+---+----+
| F1| F2| F3| F4|
+---+---+---+----+
| t| y4| 5| 1/6|
| x| y| 1| 1/6|
| x| y| 1| 1/6|
| x| z| 2| 1/6|
| t| y2| 6| 1/6|
| t| y3| 4| 2/6|
| x| a| 4| 2/6|
| x| a| 4| 2/6|
+---+---+---+----+
Note: for F3 = 4 there are only 2 distinct (F1, F2) pairs: {(t, y3), (x, a)}. Therefore, for every occurrence of F3 = 4, F4 is 2 divided by the total number of distinct (F1, F2) pairs (here there are 6 such pairs).
How to achieve the above transformation in Spark Scala?

While trying to solve your problem, I learned that distinct aggregate functions such as countDistinct cannot be used over a Window on DataFrames.
So what I did is create a temporary DataFrame and join it with the original one to obtain your desired result:
import org.apache.spark.sql.functions._
import spark.implicits._

case class Dog(F1: String, F2: String, F3: Int)
val df = Seq(Dog("x", "y", 1), Dog("x", "z", 2), Dog("x", "a", 4), Dog("x", "a", 4), Dog("x", "y", 1), Dog("t", "y2", 6), Dog("t", "y3", 4), Dog("t", "y4", 5)).toDF

// total number of distinct (F1, F2) pairs
val unique_F1_F2 = df.select("F1", "F2").distinct.count
// distinct (F1, F2) count per F3; concat_ws with a separator avoids
// accidental key collisions such as ("ab", "c") vs ("a", "bc")
val dd = df.withColumn("X1", concat_ws("|", col("F1"), col("F2")))
  .groupBy("F3")
  .agg(countDistinct(col("X1")).as("distinct_count"))
val final_df = dd.join(df, "F3")
  .withColumn("F4", col("distinct_count") / unique_F1_F2)
  .drop("distinct_count")
final_df.show
+---+---+---+-------------------+
| F3| F1| F2| F4|
+---+---+---+-------------------+
| 1| x| y|0.16666666666666666|
| 1| x| y|0.16666666666666666|
| 6| t| y2|0.16666666666666666|
| 5| t| y4|0.16666666666666666|
| 4| t| y3| 0.3333333333333333|
| 4| x| a| 0.3333333333333333|
| 4| x| a| 0.3333333333333333|
| 2| x| z|0.16666666666666666|
+---+---+---+-------------------+
I hope this is what you expected!
EDIT: I changed df.count to unique_F1_F2.
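If you would rather skip the extra join, a window can still be used: countDistinct is indeed rejected inside a Window, but collecting the distinct (F1, F2) structs per F3 partition and taking the size of that set works. A minimal sketch, assuming Spark 2.0+ and reusing df and unique_F1_F2 from above:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_set, size, struct}

val byF3 = Window.partitionBy("F3")
val viaWindow = df
  // number of distinct (F1, F2) pairs within each F3 partition
  .withColumn("distinct_count", size(collect_set(struct("F1", "F2")).over(byF3)))
  .withColumn("F4", col("distinct_count") / unique_F1_F2)
  .drop("distinct_count")
viaWindow.show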

Related

How to combine dataframes with no common columns?

I have 2 data frames
val df1 = Seq(("1","2","3"),("4","5","6")).toDF("A","B","C")
df1.show
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
+---+---+---+
and
val df2 = Seq(("11","22","33"),("44","55","66")).toDF("D","E","F")
df2.show
+---+---+---+
| D| E| F|
+---+---+---+
| 11| 22| 33|
| 44| 55| 66|
+---+---+---+
I need to combine the ones above to get
val df3 = Seq(("1","2","3","","",""),("4","5","6","","",""),("","","","11","22","33"),("","","","44","55","66"))
.toDF("A","B","C","D","E","F")
df3.show
+---+---+---+---+---+---+
| A| B| C| D| E| F|
+---+---+---+---+---+---+
| 1| 2| 3| | | |
| 4| 5| 6| | | |
| | | | 11| 22| 33|
| | | | 44| 55| 66|
+---+---+---+---+---+---+
Right now I'm creating the missing columns for all dataframes manually to get to a common structure and am then using a union. This code is specific to these dataframes and does not scale.
I'm looking for a solution that will work with x dataframes with y columns each.
You can manually create missing columns in the two data frames and then union them:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

val allCols = df1.columns.toSet.union(df2.columns.toSet).toArray
val createMissingCols = (df: DataFrame, allCols: Array[String]) => allCols.foldLeft(df)(
  (_df, _col) => if (_df.columns.contains(_col)) _df else _df.withColumn(_col, lit(""))
).select(allCols.head, allCols.tail: _*)
// select is needed to make sure the two data frames have the same order of columns
createMissingCols(df1, allCols).union(createMissingCols(df2, allCols)).show
+---+---+---+---+---+---+
| E| F| A| B| C| D|
+---+---+---+---+---+---+
| | | 1| 2| 3| |
| | | 4| 5| 6| |
| 22| 33| | | | 11|
| 55| 66| | | | 44|
+---+---+---+---+---+---+
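If you are on Spark 3.1 or later there is a shortcut that avoids building the missing columns yourself: unionByName takes an allowMissingColumns flag and fills absent columns with null. A sketch; the na.fill is only needed if you want empty strings rather than nulls:
// Spark 3.1+: columns missing on either side are added and filled with null
val combined = df1.unionByName(df2, allowMissingColumns = true).na.fill("")
combined.show()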
A much simpler way of doing this is to create a full outer join and set the join expression/condition to false:
import org.apache.spark.sql.functions.lit

val df1 = Seq(("1","2","3"),("4","5","6")).toDF("A","B","C")
val df2 = Seq(("11","22","33"),("44","55","66")).toDF("D","E","F")
val joined = df1.join(df2, lit(false), "full")
joined.show()
+----+----+----+----+----+----+
| A| B| C| D| E| F|
+----+----+----+----+----+----+
| 1| 2| 3|null|null|null|
| 4| 5| 6|null|null|null|
|null|null|null| 11| 22| 33|
|null|null|null| 44| 55| 66|
+----+----+----+----+----+----+
If you then want to actually set the null values to an empty string, you can just add:
val withEmptyString = joined.na.fill("")
withEmptyString.show()
+---+---+---+---+---+---+
| A| B| C| D| E| F|
+---+---+---+---+---+---+
| 1| 2| 3| | | |
| 4| 5| 6| | | |
| | | | 11| 22| 33|
| | | | 44| 55| 66|
+---+---+---+---+---+---+
So, in summary, df1.join(df2, lit(false), "full").na.fill("") should do the trick.
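Since the question asks for something that scales to x dataframes, the same trick chains naturally with reduce. A sketch, assuming all frames sit in a Seq and have mutually disjoint column names (as in the question); frames is just a placeholder name:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

val frames: Seq[DataFrame] = Seq(df1, df2) // add further dataframes here
// pairwise full joins on a false condition, then replace the nulls
val combinedAll = frames
  .reduce((left, right) => left.join(right, lit(false), "full"))
  .na.fill("")
combinedAll.show()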

manipulating data table joining in scala data frame

Given two data frames in Scala, T1 and T2, I want to keep only the rows of T1 whose ID exists in T2 but whose (ID, val) combination does not exist in T2. For instance
T1
+---+---+
| ID|val|
+---+---+
| 1| x|
| 1| y|
| 1| z|
| 2| x|
| 2| y|
| 3| x|
| 3| y|
| 3| z|
| 3| k|
| 4| x|
| 4| y|
| 4| z|
| 5| x|
+---+---+
T2
+---+---+
| ID|val|
+---+---+
| 1| x|
| 1| y|
| 2| x|
| 3| x|
| 3| y|
| 4| x|
| 4| y|
| 4| z|
| 5| x|
+---+---+
For ID=1, values x and y exist in T2; for ID=2, value x exists in T2; for ID=3, values x and y exist in T2; for ID=4, values x, y and z exist in T2; and for ID=5, value x exists in T2. Excluding those values, we get
+---+---+
| ID|val|
+---+---+
| 1| z|
| 2| y|
| 3| z|
| 3| k|
+---+---+
as answer.
I'm trying something like
T1.join(T2, Seq("ID", "val")).filter(T1.col("ID")===T2.col("ID") && T1.col("val")===T2.col("val"))
but not sure if this is correct/efficient approach or not.
Any help would be greatly appreciated. I'm kind of new to Scala.
You have several ways to achieve this operation.
This operation is the set difference between T1 and T2 (df1 and df2 below); you could try the following approach with the DataFrame API:
val result2 = df1.except(df2)
result2.show()
+---+---+
| ID|val|
+---+---+
| 1| z|
| 3| z|
| 2| y|
| 3| k|
+---+---+
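A caveat of my own, not from the original answer: except is a set operation, so it also drops duplicate rows within T1 itself. If duplicates need to survive, Spark 2.4+ offers exceptAll:
// Spark 2.4+: like except, but preserves duplicate rows from the left side
val resultAll = df1.exceptAll(df2)
resultAll.show()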
Another approach with the DataFrame API is a left anti-join. The condition below assumes df2's columns are first renamed so the two sides can be told apart (alternatively, df1.join(df2, Seq("ID", "val"), "left_anti") works without renaming anything):
import org.apache.spark.sql.functions.col
val df2Renamed = df2.withColumnRenamed("ID", "ID1").withColumnRenamed("val", "val1")
val joinCondition = col("ID") === col("ID1") and col("val") === col("val1")
val result3 = df1.join(df2Renamed, joinCondition, "left_anti")
result3.show()
+---+---+
| ID|val|
+---+---+
| 1| z|
| 3| z|
| 2| y|
| 3| k|
+---+---+
Finally, you could collect both DataFrames to the driver, compute the difference between the two sets of rows, and parallelize the result back into a DataFrame; this only works when the data fits in driver memory:
val setting = (df1.collect().toSet -- df2.collect().toSet).toList
val result = sc
  .parallelize(setting.map(r => (r(0).toString.toInt, r(1).toString)))
  .toDF("ID", "val")
result.show()
+---+---+
| ID|val|
+---+---+
| 2| y|
| 3| z|
| 3| k|
| 1| z|
+---+---+

spark aggregation count on condition

I'm trying to group a data frame and then, when aggregating rows with a count, apply a condition so that only rows matching it are counted.
Here is an example:
val test=Seq(("A","X"),("A","X"),("B","O"),("B","O"),("c","O"),("c","X"),("d","X"),("d","O")).toDF
test.show
+---+---+
| _1| _2|
+---+---+
| A| X|
| A| X|
| B| O|
| B| O|
| c| O|
| c| X|
| d| X|
| d| O|
+---+---+
In this example I want to group by column _1 and count the rows of column _2 whose value is 'X'.
Here is the expected result:
+---+-----------+
| _1| count(_2) |
+---+-----------+
| A| 2 |
| B| 0 |
| c| 1 |
| d| 1 |
+---+-----------+
Use when inside count to get this aggregation; count ignores nulls, so rows that fail the condition are simply not counted. A PySpark version first:
from pyspark.sql.functions import when, count, col
test.groupBy(col("_1")).agg(count(when(col("_2") == 'X', 1))).show()
And the Scala equivalent:
import spark.implicits._
import org.apache.spark.sql.functions._
val test=Seq(("A","X"),("A","X"),("B","O"),("B","O"),("c","O"),("c","X"),("d","X"),("d","O")).toDF
test.groupBy("_1").agg(count(when($"_2"==="X", 1)).as("count")).orderBy("_1").show
+---+-----+
| _1|count|
+---+-----+
| A| 2|
| B| 0|
| c| 1|
| d| 1|
+---+-----+
As an alternative in Scala, you can turn the condition into a 0/1 indicator column and sum it:
val counter1 = test.select(
  col("_1"),
  when(col("_2") === lit("X"), lit(1)).otherwise(lit(0)).as("_2"))
val agg1 = counter1.groupBy("_1").agg(sum("_2")).orderBy("_1")
agg1.show
which gives:
+---+-------+
| _1|sum(_2)|
+---+-------+
| A| 2|
| B| 0|
| c| 1|
| d| 1|
+---+-------+
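On newer Spark versions (count_if is available as a SQL function starting with Spark 3.0, to the best of my knowledge), the same aggregation can be written more directly through expr:
import org.apache.spark.sql.functions.expr

// count_if counts only the rows where the predicate holds
test.groupBy("_1")
  .agg(expr("count_if(_2 = 'X')").as("count"))
  .orderBy("_1")
  .show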

Dynamic column selection in Spark (based on another column's value)

With the given Spark DataFrame:
> df.show()
+---+-----+---+---+---+---+
| id|delay| p1| p2| p3| p4|
+---+-----+---+---+---+---+
| 1| 3| a| b| c| d|
| 2| 1| m| n| o| p|
| 3| 2| q| r| s| t|
+---+-----+---+---+---+---+
How can I select a column dynamically so that the new col column takes its value from the existing p{delay} column?
> df.withColumn("col", /* ??? */).show()
+---+-----+---+---+---+---+----+
| id|delay| p1| p2| p3| p4| col|
+---+-----+---+---+---+---+----+
| 1| 3| a| b| c| d| c| // col = p3
| 2| 1| m| n| o| p| m| // col = p1
| 3| 2| q| r| s| t| r| // col = p2
+---+-----+---+---+---+---+----+
The simplest solution I can think of is to use array with delay as an index:
import org.apache.spark.sql.functions.array
df.withColumn("col", array($"p1", $"p2", $"p3", $"p4")($"delay" - 1))
One option is to create a map from numbers to column names and then use foldLeft to fill the col column with the corresponding values:
val cols = (1 to 4).map(i => i -> s"p$i")
(cols.foldLeft(df.withColumn("col", lit(null))){
  case (df, (k, v)) => df.withColumn("col", when(df("delay") === k, df(v)).otherwise(df("col")))
}).show
+---+-----+---+---+---+---+---+
| id|delay| p1| p2| p3| p4|col|
+---+-----+---+---+---+---+---+
| 1| 3| a| b| c| d| c|
| 2| 1| m| n| o| p| m|
| 3| 2| q| r| s| t| r|
+---+-----+---+---+---+---+---+
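A variant of the foldLeft idea that avoids rebuilding the DataFrame in a loop is to chain the when branches with coalesce: each branch is null unless its delay value matches, and coalesce keeps the first non-null result. A sketch, assuming delay always falls in 1..4:
import org.apache.spark.sql.functions.{coalesce, col, when}

// one branch per candidate column; coalesce returns the first non-null value
val pick = coalesce((1 to 4).map(i => when(col("delay") === i, col(s"p$i"))): _*)
df.withColumn("col", pick).show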

spark dataframe groupby multiple times

val df = Seq((1, "a", "10"), (1, "b", "12"), (1, "c", "13"), (2, "a", "14"),
  (2, "c", "11"), (1, "b", "12"), (2, "c", "12"), (3, "r", "11"))
  .toDF("col1", "col2", "col3")
So I have a spark dataframe with 3 columns.
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a| 10|
| 1| b| 12|
| 1| c| 13|
| 2| a| 14|
| 2| c| 11|
| 1| b| 12|
| 2| c| 12|
| 3| r| 11|
+----+----+----+
My requirement is that I need to perform two levels of groupBy, as explained below.
Level1:
If I group by col1 and sum col3, I will get the two columns below:
1. col1
2. sum(col3)
I will lose col2 here.
Level2:
If I then group by both col1 and col2 and sum col3, I will get the 3 columns below:
1. col1
2. col2
3. sum(col3)
What I actually need is to perform both levels of groupBy and end up with both sums (sum(col3) of level 1 and sum(col3) of level 2) in a single final dataframe.
How can I do this? Can anyone explain?
spark : 1.6.2
Scala : 2.10
One option is to compute the two sums separately and then join them back:
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2"))
  .join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1"))
  .show)
+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
| 2| c| 23.0| 37.0|
| 2| a| 14.0| 37.0|
| 1| c| 13.0| 47.0|
| 1| b| 24.0| 47.0|
| 3| r| 11.0| 11.0|
| 1| a| 10.0| 47.0|
+----+----+----------+----------+
Another option is to use window functions, since sum_level1 is simply the sum of sum_level2 partitioned by col1:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"col1")
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
withColumn("sum_level1", sum($"sum_level2").over(w)).show)
+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
| 1| c| 13.0| 47.0|
| 1| b| 24.0| 47.0|
| 1| a| 10.0| 47.0|
| 3| r| 11.0| 11.0|
| 2| c| 23.0| 37.0|
| 2| a| 14.0| 37.0|
+----+----+----------+----------+
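If the reason for keeping both sums is to derive something like each (col1, col2) group's share of its col1 total (that end goal is my assumption, not stated in the question), it drops out directly from the windowed version above, reusing df and w:
// share = group-level sum divided by the col1-level sum
val withShare = df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2"))
  .withColumn("sum_level1", sum($"sum_level2").over(w))
  .withColumn("share", $"sum_level2" / $"sum_level1")
withShare.show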