Operate within a group by and populate additional columns - Scala

I have a dataframe as below:
+------+------+---+------+
|field1|field2|id |Amount|
+------+------+---+------+
|A     |B     |002|10.0  |
|A     |B     |003|12.0  |
|A     |B     |005|15.0  |
|C     |B     |002|20.0  |
|C     |B     |003|22.0  |
|C     |B     |005|25.0  |
+------+------+---+------+
I need to convert it to:
+------+------+---+-------+---+-------+---+-------+
|field1|field2|002|002_Amt|003|003_Amt|005|005_Amt|
+------+------+---+-------+---+-------+---+-------+
|A     |B     |002|10.0   |003|12.0   |005|15.0   |
|C     |B     |002|20.0   |003|22.0   |005|25.0   |
+------+------+---+-------+---+-------+---+-------+
Please advise!

Your final dataframe columns depend on the id column, so you need to store the distinct ids in a separate array.
import scala.collection.mutable
import org.apache.spark.sql.functions._

val distinctIds = df.select(collect_list("id")).rdd.first().get(0)
  .asInstanceOf[mutable.WrappedArray[String]].distinct
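A slightly simpler way to collect the same distinct ids, if you prefer to stay in the DataFrame API (a sketch; the .sorted is only there to make the final column order deterministic):
// equivalent, without going through the RDD API and the WrappedArray cast
val distinctIds = df.select("id").distinct().collect().map(_.getString(0)).sorted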
The next step is to filter the dataframe for each of the distinctIds, rename the columns, and join the results:
val first = distinctIds.head
var finalDF = df.filter($"id" === first)
  .withColumnRenamed("id", first)
  .withColumnRenamed("Amount", first + "_Amt")

for (str <- distinctIds.tail) {
  val tempDF = df.filter($"id" === str)
    .withColumnRenamed("id", str)
    .withColumnRenamed("Amount", str + "_Amt")
  finalDF = finalDF.join(tempDF, Seq("field1", "field2"), "left")
}
finalDF.show(false)
You should get your desired output:
+------+------+---+-------+---+-------+---+-------+
|field1|field2|002|002_Amt|003|003_Amt|005|005_Amt|
+------+------+---+-------+---+-------+---+-------+
|A     |B     |002|10.0   |003|12.0   |005|15.0   |
|C     |B     |002|20.0   |003|22.0   |005|25.0   |
+------+------+---+-------+---+-------+---+-------+
Using var is discouraged in Scala, so you can express the same logic as a recursive function, as below:
import org.apache.spark.sql.DataFrame

def getFinalDF(first: Boolean, array: List[String], df: DataFrame, tdf: DataFrame): DataFrame = array match {
  case head :: tail =>
    if (first) {
      getFinalDF(false, tail, df,
        df.filter($"id" === head).withColumnRenamed("id", head).withColumnRenamed("Amount", head + "_Amt"))
    } else {
      val tempDF = df.filter($"id" === head)
        .withColumnRenamed("id", head)
        .withColumnRenamed("Amount", head + "_Amt")
      getFinalDF(false, tail, df, tdf.join(tempDF, Seq("field1", "field2"), "left"))
    }
  case Nil => tdf
}
and call the recursive function as
getFinalDF(true, distinctIds.toList, df, df).show(false)
You should get the same output.
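For reference, a more compact alternative is groupBy with pivot (a sketch; note the resulting column names come out in the form 002_id, 002_Amt, and so on, rather than exactly 002, 002_Amt):
import org.apache.spark.sql.functions._

val pivoted = df.groupBy("field1", "field2")
  .pivot("id")
  .agg(first("id").as("id"), first("Amount").as("Amt"))

pivoted.show(false)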

Related

Scala -- apply a custom if-then on a dataframe

I have this kind of dataset:
val cols = Seq("col_1", "col_2")
val data = List(
  ("a", 1),
  ("b", 1),
  ("a", 2),
  ("c", 3),
  ("a", 3)
)
val df = spark.createDataFrame(data).toDF(cols: _*)
+-----+-----+
|col_1|col_2|
+-----+-----+
|a    |1    |
|b    |1    |
|a    |2    |
|c    |3    |
|a    |3    |
+-----+-----+
I want to add an if-then column based on the existing columns.
df.withColumn("col_new",
  when(col("col_2").isin(2, 5), "str_1")
    .when(col("col_2").isin(4, 6), "str_2")
    .when(col("col_2").isin(1) && col("col_1").contains("a"), "str_3")
    .when(col("col_2").isin(3) && col("col_1").contains("b"), "str_1")
    .when(col("col_2").isin(1, 2, 3), "str_4")
    .otherwise(lit("other")))
Instead of the list of when/then statements, I would prefer to apply a custom function. In Python I would use a lambda and map.
Thank you!
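One way to get the lambda-and-map feel in Scala is to put the rules in an ordinary function and wrap it in a UDF. Below is a sketch mirroring the when/otherwise chain above; note that a UDF is opaque to Catalyst, so the built-in when chain will usually optimize better.
import org.apache.spark.sql.functions.{col, udf}

// the match cases reproduce the when(...) rules above, in the same order
val classify = udf { (c1: String, c2: Int) =>
  (c1, c2) match {
    case (_, 2) | (_, 5)           => "str_1"
    case (_, 4) | (_, 6)           => "str_2"
    case (a, 1) if a.contains("a") => "str_3"
    case (a, 3) if a.contains("b") => "str_1"
    case (_, 1) | (_, 2) | (_, 3)  => "str_4"
    case _                         => "other"
  }
}

df.withColumn("col_new", classify(col("col_1"), col("col_2")))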

How to efficiently map over DF and use combination of outputs?

Given a DF, let's say I have 3 classes each with a method addCol that will use the columns in the DF to create and append a new column to the DF (based on different calculations).
What is the best way to get a resulting df that will contain the original df A and the 3 added columns?
val df = Seq((1, 2), (2, 5), (3, 7)).toDF("num1", "num2")

// each addCol below lives in a different class ("action")
def addCol(df: DataFrame): DataFrame = {
  df.withColumn("method1", col("num1") / col("num2"))
}

def addCol(df: DataFrame): DataFrame = {
  df.withColumn("method2", col("num1") * col("num2"))
}

def addCol(df: DataFrame): DataFrame = {
  df.withColumn("method3", col("num1") + col("num2"))
}
One option is actions.foldLeft(df) { (df, action) => action.addCol(df) }. The end result is the DF I want, with columns num1, num2, method1, method2, and method3. But from my understanding this will not make use of distributed evaluation, and each addCol will happen sequentially. What is the more efficient way to do this?
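For context, here is a sketch of how the setup described in the question could be wired up end to end; the ColumnAdder trait and the Method1/Method2/Method3 names are assumptions, not from the original post:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

trait ColumnAdder { def addCol(df: DataFrame): DataFrame }

object Method1 extends ColumnAdder {
  def addCol(df: DataFrame): DataFrame = df.withColumn("method1", col("num1") / col("num2"))
}
object Method2 extends ColumnAdder {
  def addCol(df: DataFrame): DataFrame = df.withColumn("method2", col("num1") * col("num2"))
}
object Method3 extends ColumnAdder {
  def addCol(df: DataFrame): DataFrame = df.withColumn("method3", col("num1") + col("num2"))
}

val actions: Seq[ColumnAdder] = Seq(Method1, Method2, Method3)
val result = actions.foldLeft(df)((acc, action) => action.addCol(acc))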
The efficient way to do this is with select.
A single select is generally faster than a foldLeft of withColumn calls on very wide or very large data, because it builds one projection instead of a long chain of nested ones.
You can build the required expressions and use them inside select; check the code below.
scala> df.show(false)
+----+----+
|num1|num2|
+----+----+
|1   |2   |
|2   |5   |
|3   |7   |
+----+----+
scala> val colExpr = Seq(
         $"num1",
         $"num2",
         ($"num1" / $"num2").as("method1"),
         ($"num1" * $"num2").as("method2"),
         ($"num1" + $"num2").as("method3")
       )
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1            |method2|method3|
+----+----+-------------------+-------+-------+
|1   |2   |0.5                |2      |3      |
|2   |5   |0.4                |10     |7      |
|3   |7   |0.42857142857142855|21     |10     |
+----+----+-------------------+-------+-------+
Update
Return a Column instead of a DataFrame. Try using higher-order functions; all three of your functions can be replaced with the single function below.
scala> import org.apache.spark.sql.Column

scala> def add(
         num1: Column, // you could also use varargs here if you want
         num2: Column,
         f: (Column, Column) => Column
       ): Column = f(num1, num2)
For example, here it is with varargs; when invoking this version you pass the required columns after the function.
def add(f: (Column, Column) => Column, cols: Column*): Column = cols.reduce(f)
Invoking the add function (the three-argument version here):
scala> val colExpr = Seq(
         $"num1",
         $"num2",
         add($"num1", $"num2", (_ / _)).as("method1"),
         add($"num1", $"num2", (_ * _)).as("method2"),
         add($"num1", $"num2", (_ + _)).as("method3")
       )
Final Output
scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1            |method2|method3|
+----+----+-------------------+-------+-------+
|1   |2   |0.5                |2      |3      |
|2   |5   |0.4                |10     |7      |
|3   |7   |0.42857142857142855|21     |10     |
+----+----+-------------------+-------+-------+
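If you go with the varargs variant instead, the function comes first and the columns come after it; a sketch of the equivalent invocation:
val colExpr = Seq(
  $"num1",
  $"num2",
  add((a: Column, b: Column) => a / b, $"num1", $"num2").as("method1"),
  add((a: Column, b: Column) => a * b, $"num1", $"num2").as("method2"),
  add((a: Column, b: Column) => a + b, $"num1", $"num2").as("method3")
)

df.select(colExpr: _*).show(false)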

Add identical rows to a Spark Dataframe using an integer

Assuming the following DataFrame df1:
df1:
+---------+--------+-------+
|A        |B       |C      |
+---------+--------+-------+
|toto     |tata    |titi   |
+---------+--------+-------+
I have an integer N = 3 which I want to use to create 3 duplicates of this row in a DataFrame df2, using df1:
df2:
+---------+--------+-------+
|A        |B       |C      |
+---------+--------+-------+
|toto     |tata    |titi   |
|toto     |tata    |titi   |
|toto     |tata    |titi   |
+---------+--------+-------+
Any ideas?
From Spark 2.4+, use the arrays_zip + array_repeat + explode functions for this case.
import org.apache.spark.sql.functions._

val df = Seq(("toto", "tata", "titi")).toDF("A", "B", "C")

df.withColumn("arr", explode(array_repeat(arrays_zip(array("A"), array("B"), array("C")), 3)))
  .drop("arr")
  .show(false)

// or the dynamic way, over all columns
val cols = df.columns.map(x => col(x))

df.withColumn("arr", explode(array_repeat(arrays_zip(array(cols: _*)), 3)))
  .drop("arr")
  .show(false)
//+----+----+----+
//|A   |B   |C   |
//+----+----+----+
//|toto|tata|titi|
//|toto|tata|titi|
//|toto|tata|titi|
//+----+----+----+
You can use foldLeft along with DataFrame's union:
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinDataFrames {
  def main(args: Array[String]): Unit = {
    // obtain a SparkSession (the original answer used a custom Constant.getSparkSess helper)
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = List(("toto", "tata", "titi")).toDF("A", "B", "C")
    val N = 3
    // union the original DataFrame onto itself N - 1 times
    val resultDf = (1 until N).foldLeft(df)((dfInner: DataFrame, count: Int) => {
      df.union(dfInner)
    })
    resultDf.show()
  }
}
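Another compact sketch that works on any Spark 2.x: explode a literal array with N entries and drop the helper column afterwards.
import org.apache.spark.sql.functions._

val n = 3
df.withColumn("dup", explode(array((0 until n).map(lit): _*)))
  .drop("dup")
  .show(false)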

Scala: How to group an Iterable[T] by timestamp into an Iterable[T]

I would like to write code that groups an input iterator inputs: Iterator[InputRow] into unique items (by unit and eventName), keeping for each group only the row with the latest eventTime, where InputRow is defined as
case class InputRow(unit:Int, eventName: String, eventTime:java.sql.Timestamp, value: Int)
Example data before grouping:
+-------------------+----+---------+-----+
|eventTime          |unit|eventName|value|
+-------------------+----+---------+-----+
|2018-06-02 16:05:11|2   |B        |1    |
|2018-06-02 16:05:12|1   |A        |2    |
|2018-06-02 16:05:13|2   |A        |2    |
|2018-06-02 16:05:14|1   |A        |3    |
|2018-06-02 16:05:15|2   |A        |3    |
+-------------------+----+---------+-----+
After:
+-------------------+----+---------+-----+
|eventTime          |unit|eventName|value|
+-------------------+----+---------+-----+
|2018-06-02 16:05:11|2   |B        |1    |
|2018-06-02 16:05:14|1   |A        |3    |
|2018-06-02 16:05:15|2   |A        |3    |
+-------------------+----+---------+-----+
What is a good approach to writing the above code in Scala?
Good news: your question already contains the verbs that correspond to the functional calls to be used in the code: group by, sort by (latest timestamp).
To sort InputRow by latest timestamp, we'll need an implicit ordering:
implicit val rowSortByTimestamp: Ordering[InputRow] =
(r1: InputRow, r2: InputRow) => r1.eventTime.compareTo(r2.eventTime)
// or shorter:
// implicit val rowSortByTimestamp: Ordering[InputRow] =
// _.eventTime compareTo _.eventTime
And now, having
val input: Iterator[InputRow] = // input data
Let's group them by (unit, eventName)
val result = input.toSeq.groupBy(row => (row.unit, row.eventName))
then extract the one with the latest timestamp
.map { case (gr, rows) => rows.sorted.last }
and sort from earliest to latest
.toSeq.sorted
The result is
InputRow(2,B,2018-06-02 16:05:11.0,1)
InputRow(1,A,2018-06-02 16:05:14.0,3)
InputRow(2,A,2018-06-02 16:05:15.0,3)
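Putting the steps above together in one place (the trailing .iterator is only needed if you want an Iterator[InputRow] back):
val result = input.toSeq
  .groupBy(row => (row.unit, row.eventName))
  .map { case (_, rows) => rows.sorted.last } // latest timestamp per (unit, eventName)
  .toSeq
  .sorted                                     // earliest to latest overall
  .iterator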
You can use the struct built-in function to combine the eventTime and value columns into a single struct, so that the max by eventTime (the latest) can be taken when grouping by unit and eventName and aggregating. This should give you your desired output:
import org.apache.spark.sql.functions._

df.withColumn("struct", struct("eventTime", "value"))
  .groupBy("unit", "eventName")
  .agg(max("struct").as("struct"))
  .select(col("struct.eventTime"), col("unit"), col("eventName"), col("struct.value"))
which gives
+-------------------+----+---------+-----+
|eventTime          |unit|eventName|value|
+-------------------+----+---------+-----+
|2018-06-02 16:05:14|1   |A        |3    |
|2018-06-02 16:05:11|2   |B        |1    |
|2018-06-02 16:05:15|2   |A        |3    |
+-------------------+----+---------+-----+
You can accomplish that with a foldLeft and a map:
val grouped: Map[(Int, String), InputRow] =
  rows
    .foldLeft(Map.empty[(Int, String), Seq[InputRow]])({ case (acc, row) =>
      val key = (row.unit, row.eventName)
      // Get from the accumulator the Seq that already exists, or Nil if
      // this key has never been seen before
      val value = acc.getOrElse(key, Nil)
      // Update the accumulator
      acc + (key -> (value :+ row))
    })
    // Get the last element from the list of rows when grouped by unit and event.
    .map({ case (k, v) => k -> v.last })
This assumes that the eventTimes are already stored in sorted order. If this is not a safe assumption, you can define an implicit Ordering for java.sql.Timestamp and replace v.last with v.maxBy(_.eventTime).
Edit
Or use .groupBy(row => (row.unit, row.eventName)) instead of the foldLeft:
import java.sql.Timestamp

implicit val ordering: Ordering[Timestamp] = _ compareTo _

val grouped = rows.toSeq // materialise first if rows is an Iterator (Iterator has no groupBy)
  .groupBy(row => (row.unit, row.eventName))
  .values
  .map(_.maxBy(_.eventTime))
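If the data is already in a DataFrame, a window function is another common way to keep only the latest row per (unit, eventName) group; a sketch:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("unit", "eventName").orderBy(col("eventTime").desc)

df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")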

How to selectively return multiple rows from one row in Scala

There is a DataFrame, "rawDF" and its columns are
time     |id1|id2|...|avg_value|max_value|min_value|std_value|range_value|...
10/1/2015|1  |3  |...|0.0      |0.2      |null     |null     |null       |...
10/2/2015|2  |3  |...|null     |null     |0.3      |0.4      |null       |...
10/3/2015|3  |5  |...|null     |null     |null     |0.4      |0.5        |...
For each row, I'd like to return multiple rows based on these five "value" columns (avg, max, min, std, range). But if a value is null, I'd like to skip it.
So the output should be:
10/1/2015|1 |3 |...|0.0
10/1/2015|1 |3 |...|0.2
10/2/2015|2 |3 |...|0.3
10/2/2015|2 |3 |...|0.4
10/3/2015|3 |5 |...|0.4
10/3/2015|3 |5 |...|0.5
I'm not much familiar with Scala, so, I'm struggling with this.
val procRDD = rawDF.flatMap( x =>
  for (valInd <- 10 to 14) yield {
    if (x.get(valInd) != null) { ... }
  }
)
This code still includes the null returns.
So, can you give me some ideas?
It is a slightly strange requirement, but as long as you don't need information about the source column and all values are of the same type, you can simply explode and drop the nulls:
import org.apache.spark.sql.functions.{array, col, explode}

val toExpand = Seq(
  "avg_value", "max_value", "min_value", "std_value", "range_value"
)

// Collect the *_value columns into a single array and explode
val expanded = df.withColumn("value", explode(array(toExpand.map(col): _*)))

val result = toExpand
  .foldLeft(expanded)((df, c) => df.drop(c)) // Drop obsolete columns
  .na.drop(Seq("value"))                     // Drop rows with null value
Here is my solution. If you have a better one, let me know.
val procRDD = rawDF.flatMap( x =>
  for (valInd <- 10 to 14) yield { // valInd represents the column number
    if (x.get(valInd) != null) {
      try { Some( .. ) }
      catch { case e: Exception => None }
    } else None
  })
  .filter { case Some(y) => true; case None => false }
  .map(_.get)
Actually, I was looking for filter and map and how to put the commands inside them.