Spark Dataframe with pivot and different aggregation, based on the column value (measure_type) - Scala

Spark Dataframe with pivot and different aggregation, based on the column value (measure_type) - Scala - scala

I have a spark dataframe of this type:
scala> val data = Seq((1, "k1", "measureA", 2), (1, "k1", "measureA", 4), (1, "k1", "measureB", 5), (1, "k1", "measureB", 7), (1, "k1", "measureC", 7), (1, "k1", "measureC", 1), (2, "k1", "measureB", 8), (2, "k1", "measureC", 9), (2, "k2", "measureA", 5), (2, "k2", "measureC", 5), (2, "k2", "measureC", 8))
data: Seq[(Int, String, String, Int)] = List((1,k1,measureA,2), (1,k1,measureA,4), (1,k1,measureB,5), (1,k1,measureB,7), (1,k1,measureC,7), (1,k1,measureC,1), (2,k1,measureB,8), (2,k1,measureC,9), (2,k2,measureA,5), (2,k2,measureC,5), (2,k2,measureC,8))
scala> val rdd = spark.sparkContext.parallelize(data)
rdd: org.apache.spark.rdd.RDD[(Int, String, String, Int)] = ParallelCollectionRDD[22] at parallelize at <console>:27
scala> val df = rdd.toDF("ts","key","measure_type","value")
df: org.apache.spark.sql.DataFrame = [ts: int, key: string ... 2 more fields]
scala> df.show
+---+---+------------+-----+
| ts|key|measure_type|value|
+---+---+------------+-----+
| 1| k1| measureA| 2|
| 1| k1| measureA| 4|
| 1| k1| measureB| 5|
| 1| k1| measureB| 7|
| 1| k1| measureC| 7|
| 1| k1| measureC| 1|
| 2| k1| measureB| 8|
| 2| k1| measureC| 9|
| 2| k2| measureA| 5|
| 2| k2| measureC| 5|
| 2| k2| measureC| 8|
+---+---+------------+-----+
I want to pivot on measure_type and apply different aggregation types to the value, depending on measure_type:
measureA -> sum
measureB -> avg
measureC -> max
Then, get the following output dataframe:
+---+---+--------+--------+--------+
| ts|key|measureA|measureB|measureC|
+---+---+--------+--------+--------+
| 1| k1| 6| 6| 7|
| 2| k1| null| 8| 9|
| 2| k2| 5| null| 8|
+---+---+--------+--------+--------+
Thanks a lot.

val ddf = df.groupBy("ts", "key").agg(
sum(when(col("measure_type") === "measureA",col("value"))).as("measureA"),
avg(when(col("measure_type") === "measureB",col("value"))).as("measureB"),
max(when(col("measure_type") === "measureC",col("value"))).as("measureC"))
And results are
scala> ddf.show(false)
+---+---+--------+--------+--------+
|ts |key|measureA|measureB|measureC|
+---+---+--------+--------+--------+
|2 |k2 |5 |null |8 |
|2 |k1 |null |8.0 |9 |
|1 |k1 |6 |6.0 |7 |
+---+---+--------+--------+--------+

I think its tedious to do with traditional pivot function as it will only limit you to one particular aggregate function.
Here is what I would do by mapping a pre-defined list of aggregate functions that I need to perform and apply them on my dataframe giving me 3 extra columns for each aggregate functions and then create another column with value for the measure_type as you mentioned and then drop the 3 columns i created in previous step
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import spark.implicits._
val df = Seq((1, "k1", "measureA", 2), (1, "k1", "measureA", 4), (1, "k1", "measureB", 5), (1, "k1", "measureB", 7), (1, "k1", "measureC", 7), (1, "k1", "measureC", 1), (2, "k1", "measureB", 8), (2, "k1", "measureC", 9), (2, "k2", "measureA", 5), (2, "k2", "measureC", 5), (2, "k2", "measureC", 8)).toDF("ts","key","measure_type","value")
val mapping: Map[String, Column => Column] = Map(
"sum" -> sum, "avg" -> avg, "max" -> max)
val groupBy = Seq("ts","key","measure_type")
val aggregate = Seq("value")
val operations = Seq("sum", "avg", "max")
val exprs = aggregate.flatMap(c => operations .map(f => mapping(f)(col(c))))
val df2 = df.groupBy(groupBy.map(col): _*).agg(exprs.head, exprs.tail: _*)
val df3 = df2.withColumn("new_column",
when($"measure_type" === "measureA", $"sum(value)")
.when($"measure_type" === "measureB", $"avg(value)")
.otherwise($"max(value)"))
.drop("sum(value)")
.drop("avg(value)")
.drop("max(value)")
df3 is the dataframe that you need.

Related

Spark: map columns of a dataframe to their ID of the distinct elements

I have the following dataframe of two columns of string type A and B:
val df = (
spark
.createDataFrame(
Seq(
("a1", "b1"),
("a1", "b2"),
("a1", "b2"),
("a2", "b3")
)
)
).toDF("A", "B")
I create maps between distinct elements of each columns and a set of integers
val mapColA = (
df
.select("A")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
val mapColB = (
df
.select("B")
.distinct
.rdd
.zipWithIndex
.collectAsMap
)
Now I want to create a new columns in the dataframe applying those maps to their correspondent columns. For one map only this would be
df.select("A").map(x=>mapColA.get(x)).show()
However I don't understand how to apply each map to their correspondent columns and create two new columns (e.g. with withColumn). The expected result would be
val result = (
spark
.createDataFrame(
Seq(
("a1", "b1", 1, 1),
("a1", "b2", 1, 2),
("a1", "b2", 1, 2),
("a2", "b3", 2, 3)
)
)
).toDF("A", "B", "idA", "idB")
Could you help me?

If I understood correctly, this can be achieved using dense_rank:
import org.apache.spark.sql.expressions.Window
val df2 = df.withColumn("idA", dense_rank().over(Window.orderBy("A")))
.withColumn("idB", dense_rank().over(Window.orderBy("B")))
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 1|
| a1| b2| 1| 2|
| a1| b2| 1| 2|
| a2| b3| 2| 3|
+---+---+---+---+
If you want to stick with your original code, you can make some modifications:
val mapColA = df.select("A").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val mapColB = df.select("B").distinct().rdd.map(r=>r.getAs[String](0)).zipWithIndex.collectAsMap
val df2 = df.map(r => (r.getAs[String](0), r.getAs[String](1), mapColA.get(r.getAs[String](0)), mapColB.get(r.getAs[String](1)))).toDF("A","B", "idA", "idB")
df2.show
+---+---+---+---+
| A| B|idA|idB|
+---+---+---+---+
| a1| b1| 1| 2|
| a1| b2| 1| 0|
| a1| b2| 1| 0|
| a2| b3| 0| 1|
+---+---+---+---+

Cumulative product in Spark

I try to implement a cumulative product in Spark Scala, but I really don't know how to it. I have the following dataframe:
Input data:
+--+--+--------+----+
|A |B | date | val|
+--+--+--------+----+
|rr|gg|20171103| 2 |
|hh|jj|20171103| 3 |
|rr|gg|20171104| 4 |
|hh|jj|20171104| 5 |
|rr|gg|20171105| 6 |
|hh|jj|20171105| 7 |
+-------+------+----+
And I would like to have the following output:
Output data:
+--+--+--------+-----+
|A |B | date | val |
+--+--+--------+-----+
|rr|gg|20171105| 48 | // 2 * 4 * 6
|hh|jj|20171105| 105 | // 3 * 5 * 7
+-------+------+-----+

As long as the number are strictly positive (0 can be handled as well, if present, using coalesce) as in your example, the simplest solution is to compute the sum of logarithms and take the exponential:
import org.apache.spark.sql.functions.{exp, log, max, sum}
val df = Seq(
("rr", "gg", "20171103", 2), ("hh", "jj", "20171103", 3),
("rr", "gg", "20171104", 4), ("hh", "jj", "20171104", 5),
("rr", "gg", "20171105", 6), ("hh", "jj", "20171105", 7)
).toDF("A", "B", "date", "val")
val result = df
.groupBy("A", "B")
.agg(
max($"date").as("date"),
exp(sum(log($"val"))).as("val"))
Since this uses FP arithmetic the result won't be exact:
result.show
+---+---+--------+------------------+
| A| B| date| val|
+---+---+--------+------------------+
| hh| jj|20171105|104.99999999999997|
| rr| gg|20171105|47.999999999999986|
+---+---+--------+------------------+
but after rounding should good enough for majority of applications.
result.withColumn("val", round($"val")).show
+---+---+--------+-----+
| A| B| date| val|
+---+---+--------+-----+
| hh| jj|20171105|105.0|
| rr| gg|20171105| 48.0|
+---+---+--------+-----+
If that's not enough you can define an UserDefinedAggregateFunction or Aggregator (How to define and use a User-Defined Aggregate Function in Spark SQL?) or use functional API with reduceGroups:
import scala.math.Ordering
case class Record(A: String, B: String, date: String, value: Long)
df.withColumnRenamed("val", "value").as[Record]
.groupByKey(x => (x.A, x.B))
.reduceGroups((x, y) => x.copy(
date = Ordering[String].max(x.date, y.date),
value = x.value * y.value))
.toDF("key", "value")
.select($"value.*")
.show
+---+---+--------+-----+
| A| B| date|value|
+---+---+--------+-----+
| hh| jj|20171105| 105|
| rr| gg|20171105| 48|
+---+---+--------+-----+

You can solve this using either collect_list+UDF or an UDAF. UDAF may be more efficient, but harder to implement due to the local aggregation.
If you have a dataframe like this :
+---+---+
|key|val|
+---+---+
| a| 1|
| a| 2|
| a| 3|
| b| 4|
| b| 5|
+---+---+
You can invoke an UDF :
val prod = udf((vals:Seq[Int]) => vals.reduce(_ * _))
df
.groupBy($"key")
.agg(prod(collect_list($"val")).as("val"))
.show()
+---+---+
|key|val|
+---+---+
| b| 20|
| a| 6|
+---+---+

Since Spark 2.4, you could also compute this using the higher order function aggregate:
import org.apache.spark.sql.functions.{expr, max}
val df = Seq(
("rr", "gg", "20171103", 2),
("hh", "jj", "20171103", 3),
("rr", "gg", "20171104", 4),
("hh", "jj", "20171104", 5),
("rr", "gg", "20171105", 6),
("hh", "jj", "20171105", 7)
).toDF("A", "B", "date", "val")
val result = df
.groupBy("A", "B")
.agg(
max($"date").as("date"),
expr("""
aggregate(
collect_list(val),
cast(1 as bigint),
(acc, x) -> acc * x)""").alias("val")
)

Spark 3.2+
product(e: Column): Column
Aggregate function: returns the product of all numerical elements in a group.
Scala
import spark.implicits._
var df = Seq(
("rr", "gg", 20171103, 2),
("hh", "jj", 20171103, 3),
("rr", "gg", 20171104, 4),
("hh", "jj", 20171104, 5),
("rr", "gg", 20171105, 6),
("hh", "jj", 20171105, 7)
).toDF("A", "B", "date", "val")
df = df.groupBy("A", "B").agg(max($"date").as("date"), product($"val").as("val"))
df.show(false)
// +---+---+--------+-----+
// |A |B |date |val |
// +---+---+--------+-----+
// |hh |jj |20171105|105.0|
// |rr |gg |20171105|48.0 |
// +---+---+--------+-----+
PySpark
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
data = [('rr', 'gg', 20171103, 2),
('hh', 'jj', 20171103, 3),
('rr', 'gg', 20171104, 4),
('hh', 'jj', 20171104, 5),
('rr', 'gg', 20171105, 6),
('hh', 'jj', 20171105, 7)]
df = spark.createDataFrame(data, ['A', 'B', 'date', 'val'])
df = df.groupBy('A', 'B').agg(F.max('date').alias('date'), F.product('val').alias('val'))
df.show()
#+---+---+--------+-----+
#| A| B| date| val|
#+---+---+--------+-----+
#| hh| jj|20171105|105.0|
#| rr| gg|20171105| 48.0|
#+---+---+--------+-----+

Get the last element of a window in Spark 2.1.1

I have a dataframe in which I have subcategories, and want the last element of each of these subcategories.
val windowSpec = Window.partitionBy("name").orderBy("count")
sqlContext
.createDataFrame(
Seq[(String, Int)](
("A", 1),
("A", 2),
("A", 3),
("B", 10),
("B", 20),
("B", 30)
))
.toDF("name", "count")
.withColumn("firstCountOfName", first("count").over(windowSpec))
.withColumn("lastCountOfName", last("count").over(windowSpec))
.show()
returns me something strange:
+----+-----+----------------+---------------+
|name|count|firstCountOfName|lastCountOfName|
+----+-----+----------------+---------------+
| B| 10| 10| 10|
| B| 20| 10| 20|
| B| 30| 10| 30|
| A| 1| 1| 1|
| A| 2| 1| 2|
| A| 3| 1| 3|
+----+-----+----------------+---------------+
As we can see, the first value returned is correctly computed, but the last isn't, it's always the current value of the column.
Has someone a solution to do what I want?

According to the issue SPARK-20969, you should be able to get the expected results by defining adequate bounds to your window, as shown below.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val windowSpec = Window
.partitionBy("name")
.orderBy("count")
.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
sqlContext
.createDataFrame(
Seq[(String, Int)](
("A", 1),
("A", 2),
("A", 3),
("B", 10),
("B", 20),
("B", 30)
))
.toDF("name", "count")
.withColumn("firstCountOfName", first("count").over(windowSpec))
.withColumn("lastCountOfName", last("count").over(windowSpec))
.show()
Alternatively, if your are ordering on the same column you are computing first and last, you can change for min and max with a non-ordered window, then it should also work properly.

How to combine two spark data frames in sorted order

I want to combine two dataframes a and b into a dataframe c that is sorted on a column.
val a = Seq(("a", 1), ("c", 2), ("e", 3)).toDF("char", "num")
val b = Seq(("b", 4), ("d", 5)).toDF("char", "num")
val c = // how do I sort on char column?
Here is the result I want:
a.show() b.show() c.show()
+----+---+ +----+---+ +----+---+
|char|num| |char|num| |char|num|
+----+---+ +----+---+ +----+---+
| a| 1| | b| 4| | a| 1|
| c| 2| | d| 5| | b| 4|
| e| 3| +----+---+ | c| 2|
+----+---+ | d| 5|
| e| 3|
+----+---+

In simple, you can use sort() on each dataframe and union().
val a = Seq(("a", 1), ("c", 2), ("e", 3)).toDF("char", "num").sort($"char")
val b = Seq(("b", 4), ("d", 5)).toDF("char", "num").sort($"char")
val c = a.union(b).sort($"char")

if you want to do union for multiple dataframes we can try this way.
val df1 = sc.parallelize(List(
(50, 2, "arjun"),
(34, 4, "bob")
)).toDF("age", "children","name")
val df2 = sc.parallelize(List(
(51, 3, "jane"),
(35, 5, "bob")
)).toDF("age", "children","name")
val df3 = sc.parallelize(List(
(50, 2,"arjun"),
(34, 4,"bob")
)).toDF("age", "children","name")
val result= Seq(df1, df2, df3)
val res_union=result.reduce(_ union _).sort($"age",$"name",$"children")
res_union.show()

How to update column based on a condition (a value in a group)?

I have the following df:
+---+----+-----+
|sno|dept|color|
+---+----+-----+
| 1| fn| red|
| 2| fn| blue|
| 3| fn|green|
+---+----+-----+
If any of the color column values is red, then I all values of the color column should be updated to be red, as below:
+---+----+-----+
|sno|dept|color|
+---+----+-----+
| 1| fn| red|
| 2| fn| red|
| 3| fn| red|
+---+----+-----+
I could not figure it out. Please help; I have tried following code:
val gp=jdbcDF.filter($"dept".contains("fn"))
//.withColumn("newone",when($"dept"==="fn","RED").otherwise("NULL"))
gp.show()
gp.map(
row=>{
val row1=row.getAs[String](1)
var row2=row.getAs[String](2)
val make=if(row1 =="fn") row2="red"
Row(row(0),row(1),make)
}
).collect().foreach(println)

Given:
val df = Seq(
(1, "fn", "red"),
(2, "fn", "blue"),
(3, "fn", "green"),
(4, "aa", "blue"),
(5, "aa", "green"),
(6, "bb", "red"),
(7, "bb", "red"),
(8, "aa", "blue")
).toDF("id", "fn", "color")
Do the calculation:
val redOrNot = df.groupBy("fn")
.agg(collect_set('color) as "values")
.withColumn("hasRed", array_contains('values, "red"))
// gives null for no option
val colorPicker = when('hasRed, "red")
val result = df.join(redOrNot, "fn")
.withColumn("resultColor", colorPicker)
.withColumn("color", coalesce('resultColor, 'color)) // skips nulls that leads to the answer
.select('id, 'fn, 'color)
The result looks as follows (that seems to be an answer):
scala> result.show
+---+---+-----+
| id| fn|color|
+---+---+-----+
| 1| fn| red|
| 2| fn| red|
| 3| fn| red|
| 4| aa| blue|
| 5| aa|green|
| 6| bb| red|
| 7| bb| red|
| 8| aa| blue|
+---+---+-----+
You can chain when operators and have a default value with otherwise. Consult the scaladoc of when operator.
I think you could do something very similar (and perhaps more efficient) using windowed operators or user-defined aggregate functions (UDAF), but...well...don't currently know how to do it. Leaving the comment here to inspire others ;-)
p.s. Learnt a lot! Thanks for the idea!

Efficient solution which doesn't require expensive grouping:
// All groups with `red`
df.where($"color" === "red").select($"fn".alias("fn_")).distinct
// Join with input
.join(df.as("df"), $"fn" === $"fn_", "rightouter")
// Replace `color`
.withColumn("color", when($"fn_"isNull, $"color").otherwise(lit("red")))
.drop("fn_")

You are conditionally updating the DataFrame if it satisfies a certain property. In this case the property is "the color column contains 'red'". The idiomatic way to express this is to filter with the desired predicate and then determine whether any rows satisfy it. There is no need for a join.
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.DataFrame
def makeAllRedIfAnyAreRed(df: DataFrame) = {
val containsRed = df.filter(df("color") === "red").count() > 0
if (containsRed) df.withColumn("color", lit("red")) else df
}

Spark 2.2.0:
Sample Dataframe ( taken from above examples)
val df = Seq(
(1, "fn", "red"),
(2, "fn", "blue"),
(3, "fn", "green"),
(4, "aa", "blue"),
(5, "aa", "green"),
(6, "bb", "red"),
(7, "bb", "red"),
(8, "aa", "blue")
).toDF("id", "dept", "color")
created a UDF to perform the update by checking the condition.
val replace_val = udf((x: String,y:String) => if (Option(x).getOrElse("").equalsIgnoreCase("fn")&&(!y.equalsIgnoreCase("red"))) "red" else y)
val final_df = df.withColumn("color", replace_val($"dept",$"color"))
final_df.show()
output:
spark 1.6:
val conf = new SparkConf().setMaster("local").setAppName("My app")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// For implicit conversions like converting RDDs to DataFrames
val df = sc.parallelize(Seq(
(1, "fn", "red"),
(2, "fn", "blue"),
(3, "fn", "green"),
(4, "aa", "blue"),
(5, "aa", "green"),
(6, "bb", "red"),
(7, "bb", "red"),
(8, "aa", "blue")
) ).toDF("id","dept","color")
val replace_val = udf((x: String,y:String) => if (Option(x).getOrElse("").equalsIgnoreCase("fn")&&(!y.equalsIgnoreCase("red"))) "red" else y)
val final_df = df.withColumn("color", replace_val($"dept",$"color"))
final_df.show()

As there could be few rows in filtered dataframe I'm adding solution with isin() and .withColumn() combination.
Sample DataFrame
val df = Seq(
(1, "fn", "red"),
(2, "fn", "blue"),
(3, "fn", "green"),
(4, "aa", "blue"),
(5, "aa", "green"),
(6, "bb", "red"),
(7, "bb", "red"),
(8, "aa", "blue")
).toDF("id", "dept", "color")
Now Let's pick only depts which have at least one red color row and place it in broadcast variable like below.
val depts = sc.broadcast(df.filter($"color" === "red").select(collect_set("dept")).first.getSeq[String](0)))
Update red color for filtered depts records.
isin() takes a vararg so convert list to vararg (depts.value:_*)
//creating new column by giving diff name (clr) to see the diff
val result = df.withColumn("clr", when($"dept".isin(depts.value:_*),lit("red"))
.otherwise($"color"))
result.show()
+---+----+-----+-----+
| id|dept|color| clr|
+---+----+-----+-----+
| 1| fn| red| red|
| 2| fn| blue| red|
| 3| fn|green| red|
| 4| aa| blue| blue|
| 5| aa|green|green|
| 6| bb| red| red|
| 7| bb| red| red|
| 8| aa| blue| blue|
+---+----+-----+-----+

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Spark Dataframe with pivot and different aggregation, based on the column value (measure_type) - Scala - scala

Related

Spark: map columns of a dataframe to their ID of the distinct elements

Cumulative product in Spark

Get the last element of a window in Spark 2.1.1

How to combine two spark data frames in sorted order

How to update column based on a condition (a value in a group)?

Categories

Resources