How do I parameterize the Spark function below? The groupBy and pivot values are constant; it is the aggregations and their aliases that I need to parameterize.
var final_df_transpose = df_transpose.groupBy("_id").pivot("Type").agg(first("Value").alias("Value"), first("OType").alias("OType"), first("DateTime").alias("DateTime"))
I am unable to pass the aliases dynamically in the above scenario. I build a map of aggregate expressions:
var agg_Map = collection.mutable.Map[String, String]()
for (aggDataCol <- fin_agg_col) {
  agg_Map += (aggDataCol -> "first")
}
agg_Map: scala.collection.mutable.Map[String,String] = Map(OType -> first, Type -> first, Value -> first, DateTime -> first)
df_transpose.groupBy("_id").pivot("Type").agg(agg_Map.toMap).show
I can think of two ways, but I'm not happy with either.
In the first, define your aggregates as a list of Columns. The annoying thing here is that to satisfy the method signature, you need to add a dummy column and then remove it after the aggregation:
scala> val in = spark.read.option("header", true).csv("""_id,Type,Value,OType,DateTime
| 0,a,b,c,d
| 1,aaa,bbb,ccc,ddd""".split("\n").toSeq.toDS)
in: org.apache.spark.sql.DataFrame = [_id: string, Type: string ... 3 more fields]
scala> in.show
+---+----+-----+-----+--------+
|_id|Type|Value|OType|DateTime|
+---+----+-----+-----+--------+
| 0| a| b| c| d|
| 1| aaa| bbb| ccc| ddd|
+---+----+-----+-----+--------+
scala> val aggColumns = Seq("Value", "OType", "DateTime").map{c => first(c).alias(c)}
aggColumns: Seq[org.apache.spark.sql.Column] = List(first(Value, false) AS `Value`, first(OType, false) AS `OType`, first(DateTime, false) AS `DateTime`)
scala> val df_intermediate = in.groupBy("_id").pivot("Type").agg(lit("dummy"), aggColumns : _*)
df_intermediate: org.apache.spark.sql.DataFrame = [_id: string, a_dummy: string ... 7 more fields]
scala> df_intermediate.show
+---+-------+-------+-------+----------+---------+---------+---------+------------+
|_id|a_dummy|a_Value|a_OType|a_DateTime|aaa_dummy|aaa_Value|aaa_OType|aaa_DateTime|
+---+-------+-------+-------+----------+---------+---------+---------+------------+
| 0| dummy| b| c| d| dummy| null| null| null|
| 1| dummy| null| null| null| dummy| bbb| ccc| ddd|
+---+-------+-------+-------+----------+---------+---------+---------+------------+
scala> val df_final = df_intermediate.drop(df_intermediate.schema.collect{case c if c.name.endsWith("_dummy") => c.name} : _*)
df_final: org.apache.spark.sql.DataFrame = [_id: string, a_Value: string ... 5 more fields]
scala> df_final.show
+---+-------+-------+----------+---------+---------+------------+
|_id|a_Value|a_OType|a_DateTime|aaa_Value|aaa_OType|aaa_DateTime|
+---+-------+-------+----------+---------+---------+------------+
| 0| b| c| d| null| null| null|
| 1| null| null| null| bbb| ccc| ddd|
+---+-------+-------+----------+---------+---------+------------+
In the second, continue using a Map of agg expressions, then use a regex to find the renamed columns and change them back:
scala> val aggExprs = Map(("OType" -> "first"), ("Value" -> "first"), "DateTime" -> "first")
aggExprs: scala.collection.immutable.Map[String,String] = Map(OType -> first, Value -> first, DateTime -> first)
scala> val df_intermediate = in.groupBy("_id").pivot("Type").agg(aggExprs)
df_intermediate: org.apache.spark.sql.DataFrame = [_id: string, a_first(OType, false): string ... 5 more fields]
scala> df_intermediate.show
+---+---------------------+---------------------+------------------------+-----------------------+-----------------------+--------------------------+
|_id|a_first(OType, false)|a_first(Value, false)|a_first(DateTime, false)|aaa_first(OType, false)|aaa_first(Value, false)|aaa_first(DateTime, false)|
+---+---------------------+---------------------+------------------------+-----------------------+-----------------------+--------------------------+
| 0| c| b| d| null| null| null|
| 1| null| null| null| ccc| bbb| ddd|
+---+---------------------+---------------------+------------------------+-----------------------+-----------------------+--------------------------+
scala> val regex = "^(.*)_first\\((.*), false\\)$".r
regex: scala.util.matching.Regex = ^(.*)_first\((.*), false\)$
scala> val replacements = df_intermediate.schema.collect{ case c if regex.findFirstMatchIn(c.name).isDefined =>
| val regex(pivotVal, colName) = c.name
| c.name -> s"${pivotVal}_$colName"
| }.toMap
replacements: scala.collection.immutable.Map[String,String] = Map(a_first(DateTime, false) -> a_DateTime, aaa_first(DateTime, false) -> aaa_DateTime, aaa_first(OType, false) -> aaa_OType, a_first(Value, false) -> a_Value, a_first(OType, false) -> a_OType, aaa_first(Value, false) -> aaa_Value)
scala> val df_final = replacements.foldLeft(df_intermediate){(df, c) => df.withColumnRenamed(c._1, c._2)}
df_final: org.apache.spark.sql.DataFrame = [_id: string, a_OType: string ... 5 more fields]
scala> df_final.show
+---+-------+-------+----------+---------+---------+------------+
|_id|a_OType|a_Value|a_DateTime|aaa_OType|aaa_Value|aaa_DateTime|
+---+-------+-------+----------+---------+---------+------------+
| 0| c| b| d| null| null| null|
| 1| null| null| null| ccc| bbb| ddd|
+---+-------+-------+----------+---------+---------+------------+
Take your pick, but both involve some unneeded steps.
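For what it's worth, a third variant worth trying is to split the Seq[Column] into head and tail so it satisfies the agg(Column, Column*) signature directly; a sketch against the same in DataFrame as above:
import org.apache.spark.sql.functions.first

// Same aggregate columns as in the first approach
val aggColumns = Seq("Value", "OType", "DateTime").map(c => first(c).alias(c))

val df_final = in.groupBy("_id")
  .pivot("Type")
  .agg(aggColumns.head, aggColumns.tail: _*)
If that works on your Spark version, it should give the a_Value / a_OType / a_DateTime columns directly, with no dummy column to drop and no renaming pass.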
I have a sequence of strings:
val listOfString : Seq[String] = Seq("a","b","c")
How can I write a transform like the following?
def addColumn(example: Seq[String]): DataFrame => DataFrame = {
  // some code which returns a transform that adds these strings as columns to the dataframe
}
input
+---+
| id|
+---+
|  1|
+---+
output
+---+---+---+---+
| id|  a|  b|  c|
+---+---+---+---+
|  1|  0|  0|  0|
+---+---+---+---+
I am only interested in writing it as a transform.
You can use the transform method of the datasets together with a single select statement:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

def addColumns(extraCols: Seq[String])(df: DataFrame): DataFrame = {
  val selectCols = df.columns.map(col(_)) ++ extraCols.map(c => lit(0).as(c))
  df.select(selectCols: _*)
}
// usage example
val yourExtraColumns : Seq[String] = Seq("a","b","c")
df.transform(addColumns(yourExtraColumns))
Resources
https://towardsdatascience.com/dataframe-transform-spark-function-composition-eb8ec296c108
https://mungingdata.com/apache-spark/chaining-custom-dataframe-transformations/
Use .toDF() and pass your listOfString.
Example:
//sample dataframe
df.show()
//+---+---+---+
//| _1| _2| _3|
//+---+---+---+
//| 0| 0| 0|
//+---+---+---+
df.toDF(listOfString:_*).show()
//+---+---+---+
//| a| b| c|
//+---+---+---+
//| 0| 0| 0|
//+---+---+---+
UPDATE:
Use foldLeft to add the columns to the existing dataframe with values.
val df=Seq(("1")).toDF("id")
val listOfString : Seq[String] = Seq("a","b","c")
val new_df=listOfString.foldLeft(df){(df,colName) => df.withColumn(colName,lit("0"))}
//+---+---+---+---+
//| id| a| b| c|
//+---+---+---+---+
//| 1| 0| 0| 0|
//+---+---+---+---+
//or creating a function
import org.apache.spark.sql.DataFrame
def addColumns(extraCols: Seq[String], df: DataFrame): DataFrame =
  extraCols.foldLeft(df){ (df, colName) => df.withColumn(colName, lit("0")) }
addColumns(listOfString,df).show()
//+---+---+---+---+
//| id| a| b| c|
//+---+---+---+---+
//| 1| 0| 0| 0|
//+---+---+---+---+
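Since the question specifically asks for a transform, the foldLeft version can also be curried so that it plugs into df.transform. A sketch combining the two answers above (the addColumnsAsTransform name is mine):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Curried so it can be passed to DataFrame.transform
def addColumnsAsTransform(extraCols: Seq[String])(df: DataFrame): DataFrame =
  extraCols.foldLeft(df){ (acc, colName) => acc.withColumn(colName, lit("0")) }

df.transform(addColumnsAsTransform(listOfString)).show()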
Given the code below, how would I go about adding a count column (e.g. count("*").as("count"))?
Final output to look like something like this:
+---+------+------+-----------------------------+-----+
| id|sum(d)|max(b)|concat_ws(,, collect_list(s))|count|
+---+------+------+-----------------------------+-----+
|  1|   1.0|  true|                            a|    1|
|  2|   4.0|  true|                          b,b|    2|
|  3|   3.0|  true|                            c|    1|
+---+------+------+-----------------------------+-----+
Current code is below:
val df = Seq(
  (1, 1.0, true, "a"),
  (2, 2.0, false, "b"),
  (3, 3.0, false, "b"),
  (2, 2.0, false, "c")
).toDF("id", "d", "b", "s")
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, collect_list, concat_ws, count, max, sum}
import org.apache.spark.sql.types.{BooleanType, DataType, DoubleType, StringType}

val dataTypes: Map[String, DataType] = df.schema.map(sf => (sf.name, sf.dataType)).toMap

def genericAgg(c: String) = {
  dataTypes(c) match {
    case DoubleType  => sum(col(c))
    case StringType  => concat_ws(",", collect_list(col(c))) // "append"
    case BooleanType => max(col(c))
  }
}
val aggExprs: Seq[Column] = df.columns.filterNot(_ == "id").map(c => genericAgg(c))

df
  .groupBy("id")
  .agg(aggExprs.head, aggExprs.tail: _*)
  .show()
You can simply append count("*").as("count") to aggExprs.tail in your agg, as shown below:
df.
  groupBy("id").agg(aggExprs.head, aggExprs.tail :+ count("*").as("count"): _*).
  show
// +---+------+------+-----------------------------+-----+
// | id|sum(d)|max(b)|concat_ws(,, collect_list(s))|count|
// +---+------+------+-----------------------------+-----+
// | 1| 1.0| true| a| 1|
// | 3| 3.0| false| b| 1|
// | 2| 4.0| false| b,c| 2|
// +---+------+------+-----------------------------+-----+
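Equivalently, you could fold the count into the aggregate expressions themselves before calling agg, which keeps the agg call unchanged. Just a sketch of the same idea, reusing the genericAgg helper above:
val aggExprsWithCount: Seq[Column] =
  df.columns.filterNot(_ == "id").map(c => genericAgg(c)) :+ count("*").as("count")

df.groupBy("id")
  .agg(aggExprsWithCount.head, aggExprsWithCount.tail: _*)
  .show()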
I have a dataframe with a high volume of data and "n" columns.
df_avg_calc: org.apache.spark.sql.DataFrame = [col1: double, col2: double ... 4 more fields]
+------------------+-----------------+------------------+-----------------+-----+-----+
| col1| col2| col3| col4| col5| col6|
+------------------+-----------------+------------------+-----------------+-----+-----+
| null| null| null| null| null| null|
| 14.0| 5.0| 73.0| null| null| null|
| null| null| 28.25| null| null| null|
| null| null| null| null| null| null|
|33.723333333333336|59.78999999999999|39.474999999999994|82.09666666666666|101.0|53.43|
| 26.25| null| null| 2.0| null| null|
| null| null| null| null| null| null|
| 54.46| 89.475| null| null| null| null|
| null| 12.39| null| null| null| null|
| null| 58.0| 19.45| 1.0| 1.33|158.0|
+------------------+-----------------+------------------+-----------------+-----+-----+
I need to compute a row-wise average, ignoring any null cells (they should not count in the denominator). This needs to be implemented in Spark / Scala.
What I have tried so far, referring to "Calculate row mean, ignoring NAs in Spark Scala":
val df = df_raw.schema.fieldNames.filter(f => f.contains("colname"))
val rowMeans = df_raw.select(df.map(f => col(f)).reduce(_ + _) / lit(df.length) as "row_mean")
Here df_raw contains the columns which need to be aggregated (row-wise, of course); there are more than 80 of them. They arbitrarily contain data and nulls, and the number of nulls needs to be ignored in the denominator when calculating the average. This works fine when all the columns contain data, but a single null in any column makes the result null.
Edit:
I've tried to adjust this answer by Terry Dactyl
def average(l: Seq[Double]): Option[Double] = {
  val nonNull = l.flatMap(i => Option(i))
  if (nonNull.isEmpty) None else Some(nonNull.reduce(_ + _).toDouble / nonNull.size.toDouble)
}
val avgUdf = udf(average(_: Seq[Double]))
val rowAvgDF = df_avg_calc.select(avgUdf(array($"col1", $"col2", $"col3", $"col4", $"col5", $"col6")).as("row_avg"))
rowAvgDF: org.apache.spark.sql.DataFrame = [row_avg: double]
rowAvgDF.show(10, false)
+------------------+
|row_avg |
+------------------+
|0.0 |
|15.333333333333334|
|4.708333333333333 |
|0.0 |
|61.58583333333333 |
|4.708333333333333 |
|0.0 |
|23.989166666666666|
|2.065 |
|39.63 |
+------------------+
Spark >= 2.4
It is possible to use aggregate:
val row_mean = expr("""aggregate(
CAST(array(_1, _2, _3) AS array<double>),
-- Initial value
-- Note that aggregate is picky about types
CAST((0.0 as sum, 0.0 as n) AS struct<sum: double, n: double>),
-- Merge function
(acc, x) -> (
acc.sum + coalesce(x, 0.0),
acc.n + CASE WHEN x IS NULL THEN 0.0 ELSE 1.0 END),
-- Finalize function
acc -> acc.sum / acc.n)""")
Usage:
df.withColumn("row_mean", row_mean).show
Result:
+----+----+----+--------+
| _1| _2| _3|row_mean|
+----+----+----+--------+
|null|null|null| null|
| 2.0|null|null| 2.0|
|50.0|34.0|null| 42.0|
| 1.0| 2.0| 3.0| 2.0|
+----+----+----+--------+
Version independent
Compute sum and count of NOT NULL columns and divide one over another:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def row_mean(cols: Column*) = {
  // Sum of values ignoring nulls
  val sum = cols
    .map(c => coalesce(c, lit(0)))
    .foldLeft(lit(0))(_ + _)

  // Count of not null values
  val cnt = cols
    .map(c => when(c.isNull, 0).otherwise(1))
    .foldLeft(lit(0))(_ + _)

  sum / cnt
}
Example data:
val df = Seq(
(None, None, None),
(Some(2.0), None, None),
(Some(50.0), Some(34.0), None),
(Some(1.0), Some(2.0), Some(3.0))
).toDF
Result:
df.withColumn("row_mean", row_mean($"_1", $"_2", $"_3")).show
+----+----+----+--------+
| _1| _2| _3|row_mean|
+----+----+----+--------+
|null|null|null| null|
| 2.0|null|null| 2.0|
|50.0|34.0|null| 42.0|
| 1.0| 2.0| 3.0| 2.0|
+----+----+----+--------+
def average(l: Seq[Integer]): Option[Double] = {
  val nonNull = l.flatMap(i => Option(i))
  if (nonNull.isEmpty) None else Some(nonNull.reduce(_ + _).toDouble / nonNull.size.toDouble)
}
val avgUdf = udf(average(_: Seq[Integer]))
val df = List((Some(1),Some(2)), (Some(1), None), (None, None)).toDF("a", "b")
val avgDf = df.select(avgUdf(array(df.schema.map(c => col(c.name)): _*)).as("average"))
avgDf.collect
res0: Array[org.apache.spark.sql.Row] = Array([1.5], [1.0], [null])
Testing on the data you supplied gives the correct result:
val df = List(
(Some(10),Some(5), Some(5), None, None),
(None, Some(5), Some(5), None, Some(5)),
(Some(2), Some(8), Some(5), Some(1), Some(2)),
(None, None, None, None, None)
).toDF("col1", "col2", "col3", "col4", "col5")
Array[org.apache.spark.sql.Row] = Array([6.666666666666667], [5.0], [3.6], [null])
Note: if you have columns you do not want included, make sure they are filtered out when populating the array passed to the UDF.
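For example, something along these lines (a sketch; the id column here is just a stand-in for whatever column you want to exclude):
import org.apache.spark.sql.functions.{array, col}

// Only pass the columns that should contribute to the average
val avgCols = df.columns.filterNot(_ == "id").map(c => col(c))
val avgDf = df.select(avgUdf(array(avgCols: _*)).as("average"))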
Finally:
val df = List(
(Some(14), Some(5), Some(73), None.asInstanceOf[Option[Integer]], None.asInstanceOf[Option[Integer]], None.asInstanceOf[Option[Integer]])
).toDF("col1", "col2", "col3", "col4", "col5", "col6")
Array[org.apache.spark.sql.Row] = Array([30.666666666666668])
Which again is the correct result.
If you want to use Doubles...
def average(l: Seq[java.lang.Double]): Option[java.lang.Double] = {
  val nonNull = l.flatMap(i => Option(i))
  if (nonNull.isEmpty) None else Some(nonNull.reduce(_ + _) / nonNull.size.toDouble)
}
val avgUdf = udf(average(_: Seq[java.lang.Double]))
val df = List(
(Some(14.0), Some(5.0), Some(73.0), None.asInstanceOf[Option[java.lang.Double]], None.asInstanceOf[Option[java.lang.Double]], None.asInstanceOf[Option[java.lang.Double]])
).toDF("col1", "col2", "col3", "col4", "col5", "col6")
val avgDf = df.select(avgUdf(array(df.schema.map(c => col(c.name)): _*)).as("average"))
avgDf.collect
Array[org.apache.spark.sql.Row] = Array([30.666666666666668])
I am curious to learn how to drop duplicate words within strings that are contained in a dataframe column. I would like to accomplish it using scala.
By way of example, below you can find a dataframe I would like to transform.
dataframe:
val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
result:
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+
Using PySpark I have used the following code to get the above result, but I could not rewrite it in Scala. Do you have any suggestions? Thank you in advance, and I wish you a nice day.
pyspark code:
# dataframe
l = [("66", "a,b,c,a", "4"),("67", "a,f,g,t", "0"),("70", "b,b,b,d", "4")]
#spark.createDataFrame(l).show()
df1 = spark.createDataFrame(l, ['KEY1', 'KEY2','ID'])
# function
import re
import numpy as np
# drop duplicates in a row
def drop_duplicates(row):
    # split the string on ',', drop duplicates and join back
    words = re.split(',', row)
    return ', '.join(np.unique(words))
# drop duplicates
from pyspark.sql.functions import udf
drop_duplicates_udf = udf(drop_duplicates)
dataset2 = df1.withColumn('KEY2', drop_duplicates_udf(df1.KEY2))
dataset2.show()
Dataframe solution
scala> val df = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
df: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> val distinct :String => String = _.split(",").toSet.mkString(",")
distinct: String => String = <function1>
scala> val distinct_id = udf (distinct)
distinct_id: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.select('key1,distinct_id('key2).as("distinct"),'id).show
+----+--------+---+
|key1|distinct| id|
+----+--------+---+
| 66| a,b,c| 4|
| 67| a,f,g,t| 0|
| 70| b,d| 4|
+----+--------+---+
There could be a more optimized solution but this could help you.
val rdd2 = dataset1.rdd.map(x => x(1).toString.split(",").distinct.mkString(", "))
// and then transform it to dataset
// or
val distinctUDF = spark.udf.register("distinctUDF", (s: String) => s.split(",").distinct.mkString(", "))
dataset1.createTempView("dataset1")
spark.sql("Select KEY1, distinctUDF(KEY2), ID from dataset1").show
import org.apache.spark.sql._
val dfUpdated = dataset1.rdd.map{
  case Row(x: String, y: String, z: String) => (x, y.split(",").distinct.mkString(", "), z)
}.toDF(dataset1.columns: _*)
In spark-shell:
scala> val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
dataset1: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dataset1.show
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
scala> val dfUpdated = dataset1.rdd.map{
case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
}.toDF(dataset1.columns:_*)
dfUpdated: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dfUpdated.show
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+
I have an Array[DataFrame] and I want to check, for each row of each data frame, if there is any change in the values by column. Say I have the first row of three data frames, like:
(0,1.0,0.4,0.1)
(0,3.0,0.2,0.1)
(0,5.0,0.4,0.1)
The first column is the ID, and my ideal output for this ID would be:
(0, 1, 1, 0)
meaning that the second and third columns changed while the fourth did not.
I attach here a bit of data to replicate my setting
val rdd = sc.parallelize(Array((0,1.0,0.4,0.1),
(1,0.9,0.3,0.3),
(2,0.2,0.9,0.2),
(3,0.9,0.2,0.2),
(4,0.3,0.5,0.5)))
val rdd2 = sc.parallelize(Array((0,3.0,0.2,0.1),
(1,0.9,0.3,0.3),
(2,0.2,0.5,0.2),
(3,0.8,0.1,0.1),
(4,0.3,0.5,0.5)))
val rdd3 = sc.parallelize(Array((0,5.0,0.4,0.1),
(1,0.5,0.3,0.3),
(2,0.3,0.3,0.5),
(3,0.3,0.3,0.1),
(4,0.3,0.5,0.5)))
val df = rdd.toDF("id", "prop1", "prop2", "prop3")
val df2 = rdd2.toDF("id", "prop1", "prop2", "prop3")
val df3 = rdd3.toDF("id", "prop1", "prop2", "prop3")
val result:Array[DataFrame] = new Array[DataFrame](3)
result.update(0, df)
result.update(1,df2)
result.update(2,df3)
How can I map over the array and get my output?
You can use countDistinct with groupBy:
import org.apache.spark.sql.functions.{countDistinct}
val exprs = Seq("prop1", "prop2", "prop3")
.map(c => (countDistinct(c) > 1).cast("integer").alias(c))
val combined = result.reduce(_ unionAll _)
val aggregatedViaGroupBy = combined
.groupBy($"id")
.agg(exprs.head, exprs.tail: _*)
aggregatedViaGroupBy.show
// +---+-----+-----+-----+
// | id|prop1|prop2|prop3|
// +---+-----+-----+-----+
// | 0| 1| 1| 0|
// | 1| 1| 0| 0|
// | 2| 1| 1| 1|
// | 3| 1| 1| 1|
// | 4| 0| 0| 0|
// +---+-----+-----+-----+
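Note that unionAll is deprecated in Spark 2.x; if you are on 2.0 or later (an assumption), the same combination can be written with union:
// Same as above, but using the non-deprecated union (Spark 2.0+)
val combined = result.reduce(_ union _)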
First we need to join all the DataFrames together.
val combined = result.reduceLeft((a,b) => a.join(b,"id"))
To compare all the columns of the same label (e.g., "prod1"), I found it easier (at least for me) to operate on the RDD level. We first transform the data into (id, Seq[Double]).
val finalResults = combined.rdd.map{ x =>
  (x.getInt(0), x.toSeq.tail.map(_.asInstanceOf[Double]))
}.map{ case (i, d) =>
  def checkAllEqual(l: Seq[Double]) = if (l.toSet.size == 1) 0 else 1
  val g = d.grouped(3).toList
  val g1 = checkAllEqual(g.map(x => x(0)))
  val g2 = checkAllEqual(g.map(x => x(1)))
  val g3 = checkAllEqual(g.map(x => x(2)))
  (i, g1, g2, g3)
}.toDF("id", "prod1", "prod2", "prod3")
finalResults.show()
This will print:
+---+-----+-----+-----+
| id|prod1|prod2|prod3|
+---+-----+-----+-----+
| 0| 1| 1| 0|
| 1| 1| 0| 0|
| 2| 1| 1| 1|
| 3| 1| 1| 1|
| 4| 0| 0| 0|
+---+-----+-----+-----+