Group by and find count before doing pivot spark - scala

I have a dataframe like below
A B C D
foo one small 1
foo one large 2
foo one large 2
foo two small 3
I need to groupBy on A and B, pivot on column C, and sum column D.
I am able to do this using
df.groupBy("A", "B").pivot("C").sum("D")
However, I also need to find the count after the groupBy. If I try something like
df.groupBy("A", "B").pivot("C").agg(sum("D"), count("D"))
I get an output like
A B large small large_count small_count
Is there a way to get only one count after the groupBy, before doing the pivot?

On the output, try
output.withColumn("count", $"large_count"+$"small_count").show
You can drop the two count columns if you want to.
To do it before the pivot, try
df.groupBy("A", "B").agg(count("C"))
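A minimal single-pass sketch (not from the answer above), assuming Spark 2.x and that "count" means the number of rows per (A, B) group: compute the count with a window function first, then carry it through the groupBy so the pivoted sums and the count land in one result without a join.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// per-(A, B) row count; it is constant within each group, so it can be part of the grouping key
val withCount = df.withColumn("count", count(lit(1)).over(Window.partitionBy("A", "B")))
withCount.groupBy("A", "B", "count").pivot("C").sum("D").show()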

Is this what you are expecting?
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
scala> val df2 = df.groupBy('A,'B).pivot("C").sum("D")
df2: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields]
scala> val df3 = df.groupBy('A as "A1",'B as "B1").agg(sum('D) as "sumd")
df3: org.apache.spark.sql.DataFrame = [A1: string, B1: string ... 1 more field]
scala> df3.join(df2, 'A === 'A1 and 'B === 'B1, "inner").select("A","B","sumd","large","small").show
+---+---+----+-----+-----+
| A| B|sumd|large|small|
+---+---+----+-----+-----+
|foo|one| 5| 4| 1|
|foo|two| 3| null| 3|
+---+---+----+-----+-----+

This won't require a join. Is this what you are looking for?
val df = Seq(("foo", "one", "small", 1),
("foo", "one", "large", 2),
("foo", "one", "large", 2),
("foo", "two", "small", 3)).toDF("A","B","C","D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
df.createOrReplaceTempView("dummy")
spark.sql("SELECT * FROM (SELECT A , B , C , sum(D) as D from dummy group by A,B,C grouping sets ((A,B,C) ,(A,B)) order by A nulls last , B nulls last , C nulls last) dummy pivot (first(D) for C in ('large' large ,'small' small , null total))").show
+---+---+-----+-----+-----+
| A| B|large|small|total|
+---+---+-----+-----+-----+
|foo|one| 4| 1| 5|
|foo|two| null| 3| 3|
+---+---+-----+-----+-----+

Related

How to apply a customized function with multiple parameters to each group of a dataframe and union the resulting dataframes in Scala Spark?

I have a customized function that looks like this, which returns a different dataframe as output:
def customizedfun(data : DataFrame, param1 : Boolean, param2 : String) : DataFrame = {...}
and I want to apply this function to each group of
df.groupBy("type")
then append the output dataframes from each type into one dataframe.
This is a little different from other questions about applying customized functions to grouped dataframes, in that this function also takes other inputs in addition to the dataframe in question, df.groupBy("type").
What's the best way to do this?
You can filter down the original df to the different groups, call customizedfun for each group and then union the results.
I assume that customizedfun is a function that simply adds the two parameters as a new column, but it could be any function:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit
def customizedfun(data : DataFrame, param1 : Boolean, param2 : String) : DataFrame =
  data.withColumn("newCol", lit(s"$param2 $param1"))
I need two helper functions that calculate the values of param1 and param2 depending on the value of type. In a real-world application, these functions could be, for example, a lookup into a dictionary.
def calcParam1(typ: Integer): Boolean = typ % 2 == 0
def calcParam2(typ: Integer): String = s"type is $typ"
Now the original df is filtered into the different groups, customizedfun is called and the result is unioned:
//create some test data
val df = Seq((1, "A", "a"), (1, "B", "b"), (1, "C", "c"), (2, "D", "d"), (2, "E", "e"), (3, "F", "f"))
.toDF("type", "val1", "val2")
//+----+----+----+
//|type|val1|val2|
//+----+----+----+
//| 1| A| a|
//| 1| B| b|
//| 1| C| c|
//| 2| D| d|
//| 2| E| e|
//| 3| F| f|
//+----+----+----+
//get the distinct values of column type
val distinctTypes = df.select("type").distinct().as[Integer].collect()
//call customizedfun for each group
val resultPerGroup = for (typ <- distinctTypes)
yield customizedfun( df.filter(s"type = $typ"), calcParam1(typ), calcParam2(typ))
//the final union
val result = resultPerGroup.tail.foldLeft(resultPerGroup.head)(_ union _)
//+----+----+----+---------------+
//|type|val1|val2| newCol|
//+----+----+----+---------------+
//| 1| A| a|type is 1 false|
//| 1| B| b|type is 1 false|
//| 1| C| c|type is 1 false|
//| 3| F| f|type is 3 false|
//| 2| D| d| type is 2 true|
//| 2| E| e| type is 2 true|
//+----+----+----+---------------+
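The final union could also be written a bit more compactly with reduce, which is equivalent to the foldLeft above as long as resultPerGroup is non-empty:
val result = resultPerGroup.reduce(_ union _)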

Spark: map columns of a dataframe to their ID of the distinct elements

I have the following dataframe with two columns of string type, A and B:
val df = (
spark
.createDataFrame(
Seq(
("a1", "b1"),
("a1", "b2"),
("a1", "b2"),
("a2", "b3")
)
)
).toDF("A", "B")
I create maps between the distinct elements of each column and a set of integers:
val mapColA = (
df
.select("A")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
val mapColB = (
df
.select("B")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
Now I want to create new columns in the dataframe by applying those maps to their corresponding columns. For one map only, this would be
df.select("A").map(x=>mapColA.get(x)).show()
However, I don't understand how to apply each map to its corresponding column and create two new columns (e.g. with withColumn). The expected result would be
val result = (
spark
.createDataFrame(
Seq(
("a1", "b1", 1, 1),
("a1", "b2", 1, 2),
("a1", "b2", 1, 2),
("a2", "b3", 2, 3)
)
)
).toDF("A", "B", "idA", "idB")
Could you help me?
If I understood correctly, this can be achieved using dense_rank:
import org.apache.spark.sql.expressions.Window
val df2 = df.withColumn("idA", dense_rank().over(Window.orderBy("A")))
.withColumn("idB", dense_rank().over(Window.orderBy("B")))
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 1|
| a1| b2| 1| 2|
| a1| b2| 1| 2|
| a2| b3| 2| 3|
+---+---+---+---+
If you want to stick with your original code, you can make some modifications:
val mapColA = df.select("A").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val mapColB = df.select("B").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val df2 = df.map(r => (r.getAs[String](0), r.getAs[String](1), mapColA.get(r.getAs[String](0)), mapColB.get(r.getAs[String](1)))).toDF("A","B", "idA", "idB")
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 2|
| a1| b2| 1| 0|
| a1| b2| 1| 0|
| a2| b3| 0| 1|
+---+---+---+---+
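If you specifically want to keep the collected maps and build the columns with withColumn, another possible sketch (not part of the answer above) is to broadcast the maps and wrap the lookups in UDFs. Note that the ids come from zipWithIndex, so they are 0-based and in arbitrary order rather than the 1-based ids shown in the expected result:
import org.apache.spark.sql.functions.udf

val bcA = spark.sparkContext.broadcast(mapColA)
val bcB = spark.sparkContext.broadcast(mapColB)

// -1L marks values missing from the map; both lookups return the Long index produced by zipWithIndex
val idA = udf((a: String) => bcA.value.getOrElse(a, -1L))
val idB = udf((b: String) => bcB.value.getOrElse(b, -1L))

df.withColumn("idA", idA($"A")).withColumn("idB", idB($"B")).show()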

Spark scala column level mismatches from 2 dataframes

I have 2 dataframes
val df1 = Seq((1, "1","6"), (2, "10","8"), (3, "6","4")).toDF("id", "value1","value2")
val df2 = Seq((1, "1","6"), (2, "5","4"), (4, "3","1")).toDF("id", "value1","value2")
and I want to find the column-level differences between them.
The output should look like:
id,value1_df1,value1_df2,diff_value1,value2_df1,value2_df2,diff_value2
1, 1 ,1 , 0 , 6 ,6 ,0
2, 10 ,5 , 5 , 8 ,4 ,4
3, 6 ,3 , 1 , 4 ,1 ,3
Likewise, I have hundreds of columns and want to compute the difference between the same columns in the two dataframes; the columns are dynamic.
Maybe this will help:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.IntegerType

val spark = SparkSession.builder.appName("Test").master("local[*]").getOrCreate()
import spark.implicits._
var df1 = Seq((1, "1", "6"), (2, "10", "8"), (3, "6", "4")).toDF("id", "value1", "value2")
var df2 = Seq((1, "1", "6"), (2, "5", "4"), (3, "3", "1")).toDF("id", "value1", "value2")
df1.columns.foreach(column => {
df1 = df1.withColumn(column, df1.col(column).cast(IntegerType))
})
df2.columns.foreach(column => {
df2 = df2.withColumn(column, df2.col(column).cast(IntegerType))
})
df1 = df1.withColumnRenamed("id", "df1_id")
df2 = df2.withColumnRenamed("id", "df2_id")
df1.show()
df2.show()
So far you have two dataframes with columns value1, value2, and so on:
df1:
+------+------+------+
|df1_id|value1|value2|
+------+------+------+
| 1| 1| 6|
| 2| 10| 8|
| 3| 6| 4|
+------+------+------+
df2:
+------+------+------+
|df2_id|value1|value2|
+------+------+------+
| 1| 1| 6|
| 2| 5| 4|
| 3| 3| 1|
+------+------+------+
Now we join them based on id:
var df3 = df1.alias("df1").join(df2.alias("df2"), $"df1.df1_id" === $"df2.df2_id")
And last, we take all the columns of df1/df2 except the id (it is important that both dataframes have the same columns) and create a new diff column for each:
df1.columns.tail.foreach(col => {
val new_col_name = s"${col}-diff"
val df_a_col = s"df1.${col}"
val df_b_col = s"df2.${col}"
df3 = df3.withColumn(new_col_name, df3.col(df_a_col) - df3.col(df_b_col))
})
df3.show()
Result:
+------+------+------+------+------+------+-----------+-----------+
|df1_id|value1|value2|df2_id|value1|value2|value1-diff|value2-diff|
+------+------+------+------+------+------+-----------+-----------+
| 1| 1| 6| 1| 1| 6| 0| 0|
| 2| 10| 8| 2| 5| 4| 5| 4|
| 3| 6| 4| 3| 3| 1| 3| 3|
+------+------+------+------+------+------+-----------+-----------+
This is the result, and it's dynamic, so you can add whatever valueX columns you want.
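If you only want the id and the diff columns in the final output, a small follow-up selection (a hypothetical addition, using the same df3 as above) could look like this:
val diffCols = df1.columns.tail.map(c => df3.col(s"$c-diff"))
df3.select((df3.col("df1_id") +: diffCols): _*).show()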

Grouping by values on a Spark Dataframe

I'm working on a Spark dataframe containing this kind of data:
A,1,2,3
B,1,2,3
C,1,2,3
D,4,2,3
I want to aggregate this data on the three last columns, so the output would be:
ABC,1,2,3
D,4,2,3
How can I do it in Scala? (This is not a big dataframe, so performance is secondary here.)
As mentioned in the comments, you can first use groupBy to group your columns and then use concat_ws on your first column. Here is one way of doing it:
//create you original DF
val df = Seq(("A",1,2,3),("B",1,2,3),("C",1,2,3),("D",4,2,3)).toDF("col1","col2","col3","col4")
df.show
//output
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| 1| 2| 3|
| B| 1| 2| 3|
| C| 1| 2| 3|
| D| 4| 2| 3|
+----+----+----+----+
//group by "col2","col3","col4" and store "col1" as list and then
//convert it to string
df.groupBy("col2","col3","col4")
.agg(collect_list("col1").as("col1"))
//you can change the string separator by concat_ws first arg
.select(concat_ws("", $"col1") as "col1",$"col2",$"col3",$"col4").show
//output
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| D| 4| 2| 3|
| ABC| 1| 2| 3|
+----+----+----+----+
Alternatively, you can key by c2, c3, c4 and then concatenate your values via reduceByKey. In the end, each row is formatted as needed through the last map. It should be something like the following:
val data=sc.parallelize(List(
("A", "1", "2", "3"),
("B", "1", "2", "3"),
("C", "1", "2", "3"),
("D", "4", "2", "3")))
val res = data.map{ case (c1, c2, c3, c4) => ((c2, c3, c4), String.valueOf(c1)) }
.reduceByKey((x, y) => x + y)
.map(v => v._2.toString + "," + v._1.productIterator.toArray.mkString(","))
.collect

Select column by name with multiple aggregate columns after pivot with Spark Scala

I am trying to aggregate multiple columns after a pivot in Scala Spark 2.0.1:
scala> val df = List((1, 2, 3, None), (1, 3, 4, Some(1))).toDF("a", "b", "c", "d")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 2 more fields]
scala> df.show
+---+---+---+----+
| a| b| c| d|
+---+---+---+----+
| 1| 2| 3|null|
| 1| 3| 4| 1|
+---+---+---+----+
scala> val pivoted = df.groupBy("a").pivot("b").agg(max("c"), max("d"))
pivoted: org.apache.spark.sql.DataFrame = [a: int, 2_max(`c`): int ... 3 more fields]
scala> pivoted.show
+---+----------+----------+----------+----------+
| a|2_max(`c`)|2_max(`d`)|3_max(`c`)|3_max(`d`)|
+---+----------+----------+----------+----------+
| 1| 3| null| 4| 1|
+---+----------+----------+----------+----------+
I am unable to select or rename those columns so far:
scala> pivoted.select("3_max(`d`)")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: 3_max(`d`);
scala> pivoted.select("`3_max(`d`)`")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: `3_max(`d`)`;
scala> pivoted.select("`3_max(d)`")
org.apache.spark.sql.AnalysisException: cannot resolve '`3_max(d)`' given input columns: [2_max(`c`), 3_max(`d`), a, 2_max(`d`), 3_max(`c`)];
There must be a simple trick here, any ideas? Thanks.
Seems like a bug; the backticks caused the problem. One fix would be to remove the backticks from the column names:
val pivotedNewName = pivoted.columns.foldLeft(pivoted)((df, col) =>
df.withColumnRenamed(col, col.replace("`", "")))
Now you can select by column names as normal:
pivotedNewName.select("2_max(c)").show
+--------+
|2_max(c)|
+--------+
| 3|
+--------+
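Another possible workaround (not from the answer above, assuming Spark 2.x pivot column naming): alias the aggregate expressions so the generated names contain no backticks or parentheses in the first place.
val pivotedAliased = df.groupBy("a").pivot("b").agg(max("c").as("max_c"), max("d").as("max_d"))
// expected columns: a, 2_max_c, 2_max_d, 3_max_c, 3_max_d
pivotedAliased.select("3_max_d").show()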