How to do cumulative sum based on conditions in spark scala - scala

I have the below data and final_column is the exact output what I am trying to get. I am trying to do cumulative sum of flag and want to rest if flag is 0 then set value to 0 as below data
cola date flag final_column
a 2021-10-01 0 0
a 2021-10-02 1 1
a 2021-10-03 1 2
a 2021-10-04 0 0
a 2021-10-05 0 0
a 2021-10-06 0 0
a 2021-10-07 1 1
a 2021-10-08 1 2
a 2021-10-09 1 3
a 2021-10-10 0 0
b 2021-10-01 0 0
b 2021-10-02 1 1
b 2021-10-03 1 2
b 2021-10-04 0 0
b 2021-10-05 0 0
b 2021-10-06 1 1
b 2021-10-07 1 2
b 2021-10-08 1 3
b 2021-10-09 1 4
b 2021-10-10 0 0
I have tried like
import org.apache.spark.sql.functions._
df.withColumn("final_column",expr("sum(flag) over(partition by cola order date asc)"))
I have tried to add condition like case when flag = 0 then 0 else 1 end inside sum function but not working.

You can define a column group using conditional sum on flag, then using row_number with a Window partitioned by cola and group gives the result you want:
import org.apache.spark.sql.expressions.Window
val result = df.withColumn(
"group",
sum(when(col("flag") === 0, 1).otherwise(0)).over(Window.partitionBy("cola").orderBy("date"))
).withColumn(
"final_column",
row_number().over(Window.partitionBy("cola", "group").orderBy("date")) - 1
).drop("group")
result.show
//+----+-----+----+------------+
//|cola| date|flag|final_column|
//+----+-----+----+------------+
//| b|44201| 0| 0|
//| b|44202| 1| 1|
//| b|44203| 1| 2|
//| b|44204| 0| 0|
//| b|44205| 0| 0|
//| b|44206| 1| 1|
//| b|44207| 1| 2|
//| b|44208| 1| 3|
//| b|44209| 1| 4|
//| b|44210| 0| 0|
//| a|44201| 0| 0|
//| a|44202| 1| 1|
//| a|44203| 1| 2|
//| a|44204| 0| 0|
//| a|44205| 0| 0|
//| a|44206| 0| 0|
//| a|44207| 1| 1|
//| a|44208| 1| 2|
//| a|44209| 1| 3|
//| a|44210| 0| 0|
//+----+-----+----+------------+
row_number() - 1 in this case is just equivalent to sum(col("flag")) as flag values are always 0 or 1. So the above final_column can also be written as:
.withColumn(
"final_column",
sum(col("flag")).over(Window.partitionBy("cola", "group").orderBy("date"))
)

Related

Window function with PySpark

I have a PySpark Dataframe and my goal is to create a Flag column whose value depends on the value of the Amount column.
Basically, for each Group, I want to know if in any of the first three months, there is an amount greater than 0 and if that is the case, the value of the Flag column will be 1 for all the group, otherwise the value will be 0.
I will include an example to clarify a bit better.
Initial PySpark Dataframe:
Group
Month
Amount
A
1
0
A
2
0
A
3
35
A
4
0
A
5
0
B
1
0
B
2
0
C
1
0
C
2
0
C
3
0
C
4
13
D
1
0
D
2
24
D
3
0
Final PySpark Dataframe:
Group
Month
Amount
Flag
A
1
0
1
A
2
0
1
A
3
35
1
A
4
0
1
A
5
0
1
B
1
0
0
B
2
0
0
C
1
0
0
C
2
0
0
C
3
0
0
C
4
13
0
D
1
0
1
D
2
24
1
D
3
0
1
Basically, what I want is for each group, to sum the amount of the first 3 months. If that sum is greater than 0, the flag is 1 for all the elements of the group, and otherwise is 0.
You can create the flag column by applying a Window function. Create a psuedo-column which becomes 1 if the criteria is met and then finally sum over the psuedo-column and if it's greater than 0, then there was atleast once row that met the criteria and set the flag to 1.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
data = [("A", 1, 0, ),
("A", 2, 0, ),
("A", 3, 35, ),
("A", 4, 0, ),
("A", 5, 0, ),
("B", 1, 0, ),
("B", 2, 0, ),
("C", 1, 0, ),
("C", 2, 0, ),
("C", 3, 0, ),
("C", 4, 13, ),
("D", 1, 0, ),
("D", 2, 24, ),
("D", 3, 0, ), ]
df = spark.createDataFrame(data, ("Group", "Month", "Amount", ))
ws = W.partitionBy("Group").orderBy("Month").rowsBetween(W.unboundedPreceding, W.unboundedFollowing)
criteria = F.when((F.col("Month") < 4) & (F.col("Amount") > 0), F.lit(1)).otherwise(F.lit(0))
(df.withColumn("flag", F.when(F.sum(criteria).over(ws) > 0, F.lit(1)).otherwise(F.lit(0)))
).show()
"""
+-----+-----+------+----+
|Group|Month|Amount|flag|
+-----+-----+------+----+
| A| 1| 0| 1|
| A| 2| 0| 1|
| A| 3| 35| 1|
| A| 4| 0| 1|
| A| 5| 0| 1|
| B| 1| 0| 0|
| B| 2| 0| 0|
| C| 1| 0| 0|
| C| 2| 0| 0|
| C| 3| 0| 0|
| C| 4| 13| 0|
| D| 1| 0| 1|
| D| 2| 24| 1|
| D| 3| 0| 1|
+-----+-----+------+----+
"""
You can use Window function with count and when.
w = Window.partitionBy('Group')
df = df.withColumn('Flag', F.count(
F.when((F.col('Month') < 4) & (F.col('Amount') > 0), True)).over(w))
.withColumn('Flag', F.when(F.col('Flag') > 0, 1).otherwise(0))

Spark Scala create a new column which contains addition of previous balance amount for each cid

Initial DF:
cid transAmt trasnDate
1 10 2-Aug
1 20 3-Aug
1 30 3-Aug
2 40 2-Aug
2 50 3-Aug
3 60 4-Aug
Output DF:
cid transAmt trasnDate sumAmt
1 10 2-Aug **10**
1 20 3-Aug **30**
1 30 3-Aug **60**
2 40 2-Aug **40**
2 50 3-Aug **90**
3 60 4-Aug **60**
I need a new column as sumAmt which has the addition for each cid
Use window sum function to get the cumulative sum.
Example:
df.show()
//+---+------+----------+
//|cid|Amount|transnDate|
//+---+------+----------+
//| 1| 10| 2-Aug|
//| 1| 20| 3-Aug|
//| 2| 40| 2-Aug|
//| 2| 50| 3-Aug|
//| 3| 60| 4-Aug|
//+---+------+----------+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions._
val w= Window.partitionBy("cid").orderBy("Amount","transnDate")
df.withColumn("sumAmt",sum(col("Amount")).over(w)).show()
//+---+------+----------+------+
//|cid|Amount|transnDate|sumAmt|
//+---+------+----------+------+
//| 1| 10| 2-Aug| 10|
//| 1| 20| 3-Aug| 30|
//| 3| 60| 4-Aug| 60|
//| 2| 40| 2-Aug| 40|
//| 2| 50| 3-Aug| 90|
//+---+------+----------+------+
Just use a simple window indicating rows between.
Window.unboundedPreceding meaning no lower limit
Window.currentRow meaning current row (pretty obvious)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val cidCategory = Window.partitionBy("cid")
.orderBy("transDate")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
val result = df.withColumn("sumAmt", sum($"transAmt").over(cidCategory))
OUTPUT

How to compute cumulative sum on multiple float columns?

I have 100 float columns in a Dataframe which are ordered by date.
ID Date C1 C2 ....... C100
1 02/06/2019 32.09 45.06 99
1 02/04/2019 32.09 45.06 99
2 02/03/2019 32.09 45.06 99
2 05/07/2019 32.09 45.06 99
I need to get C1 to C100 in the cumulative sum based on id and date.
Target dataframe should look like this:
ID Date C1 C2 ....... C100
1 02/04/2019 32.09 45.06 99
1 02/06/2019 64.18 90.12 198
2 02/03/2019 32.09 45.06 99
2 05/07/2019 64.18 90.12 198
I want to achieve this without looping from C1- C100.
Initial code for one column:
var DF1 = DF.withColumn("CumSum_c1", sum("C1").over(
Window.partitionBy("ID")
.orderBy(col("date").asc)))
I found a similar question here but he manually did it for two columns : Cumulative sum in Spark
Its a classical use for foldLeft. Let's generate some data first :
import org.apache.spark.sql.expressions._
val df = spark.range(1000)
.withColumn("c1", 'id + 3)
.withColumn("c2", 'id % 2 + 1)
.withColumn("date", monotonically_increasing_id)
.withColumn("id", 'id % 10 + 1)
// We will select the columns we want to compute the cumulative sum of.
val columns = df.drop("id", "date").columns
val w = Window.partitionBy(col("id")).orderBy(col("date").asc)
val results = columns.foldLeft(df)((tmp_, column) => tmp_.withColumn(s"cum_sum_$column", sum(column).over(w)))
results.orderBy("id", "date").show
// +---+---+---+-----------+----------+----------+
// | id| c1| c2| date|cum_sum_c1|cum_sum_c2|
// +---+---+---+-----------+----------+----------+
// | 1| 3| 1| 0| 3| 1|
// | 1| 13| 1| 10| 16| 2|
// | 1| 23| 1| 20| 39| 3|
// | 1| 33| 1| 30| 72| 4|
// | 1| 43| 1| 40| 115| 5|
// | 1| 53| 1| 8589934592| 168| 6|
// | 1| 63| 1| 8589934602| 231| 7|
Here is another way using simple select expression :
val w = Window.partitionBy($"id").orderBy($"date".asc).rowsBetween(Window.unboundedPreceding, Window.currentRow)
// get columns you want to sum
val columnsToSum = df.drop("ID", "Date").columns
// map over those columns and create new sum columns
val selectExpr = Seq(col("ID"), col("Date")) ++ columnsToSum.map(c => sum(col(c)).over(w).alias(c)).toSeq
df.select(selectExpr:_*).show()
Gives:
+---+----------+-----+-----+----+
| ID| Date| C1| C2|C100|
+---+----------+-----+-----+----+
| 1|02/04/2019|32.09|45.06| 99|
| 1|02/06/2019|64.18|90.12| 198|
| 2|02/03/2019|32.09|45.06| 99|
| 2|05/07/2019|64.18|90.12| 198|
+---+----------+-----+-----+----+

Recursively Generate Hierarchy Data set using spark scala functional programming

I am bit new to functional programming. How to generate the below sequence of data.
Below is the input dataset of the following columns:
INPUT
ID PARENT_ID AMT NAME
1 none 1000 A
2 1 -5000 B
3 2 -2000 C
5 3 7000 D
6 4 -7000 E
4 none 7000 F
OUTPUT
ID PARENT_ID AMT AMT_1 AMT_2 AMT_3 NAME_1 ...
1 none 1000 none none none none
2 1 -5000 1000 none none A
3 2 -2000 -5000 1000 none B
4 none 7000 none none none none
5 3 7000 -2000 -5000 1000 C
6 4 -7000 7000 none none D
Here's one way to perform the recursive join up to a specific level:
import org.apache.spark.sql.functions._
val df = Seq(
(Some(1), None, Some(1000), Some("A")),
(Some(2), Some(1), Some(-5000), Some("B")),
(Some(3), Some(2), Some(-2000), Some("C")),
(Some(4), None, Some(7000), Some("D")),
(Some(5), Some(3), Some(7000), Some("E")),
(Some(6), Some(4), Some(-7000), Some("F"))
).toDF("id", "parent_id", "amt", "name")
val nestedLevel = 3
(1 to nestedLevel).foldLeft( df.as("d0") ){ (accDF, i) =>
val j = i - 1
accDF.join(df.as(s"d$i"), col(s"d$j.parent_id") === col(s"d$i.id"), "left_outer")
}.
select(
col("d0.id") :: col("d0.parent_id") ::
col("d0.amt").as("amt") :: col("d0.name").as("name") :: (
(1 to nestedLevel).toList.map(i => col(s"d$i.amt").as(s"amt_$i")) :::
(1 to nestedLevel).toList.map(i => col(s"d$i.name").as(s"name_$i"))
): _*
).
show
// +---+---------+-----+----+-----+-----+-----+------+------+------+
// | id|parent_id| amt|name|amt_1|amt_2|amt_3|name_1|name_2|name_3|
// +---+---------+-----+----+-----+-----+-----+------+------+------+
// | 1| null| 1000| A| null| null| null| null| null| null|
// | 2| 1|-5000| B| 1000| null| null| A| null| null|
// | 3| 2|-2000| C|-5000| 1000| null| B| A| null|
// | 4| null| 7000| D| null| null| null| null| null| null|
// | 5| 3| 7000| E|-2000|-5000| 1000| C| B| A|
// | 6| 4|-7000| F| 7000| null| null| D| null| null|
// +---+---------+-----+----+-----+-----+-----+------+------+------+

Find Most Common Value and Corresponding Count Using Spark Groupby Aggregates

I am trying to use Spark (Scala) dataframes to do groupby aggregates for mode and the corresponding count.
For example,
Suppose we have the following dataframe:
Category Color Number Letter
1 Red 4 A
1 Yellow Null B
3 Green 8 C
2 Blue Null A
1 Green 9 A
3 Green 8 B
3 Yellow Null C
2 Blue 9 B
3 Blue 8 B
1 Blue Null Null
1 Red 7 C
2 Green Null C
1 Yellow 7 Null
3 Red Null B
Now we want to group by Category, then Color, and then find the size of the grouping, count of number non-nulls, the total size of number, the mean of number, the mode of number, and the corresponding mode count. For letter I'd like the count of non-nulls and the corresponding mode and mode count (no mean since this is a string).
So the output would ideally be:
Category Color CountNumber(Non-Nulls) Size MeanNumber ModeNumber ModeCountNumber CountLetter(Non-Nulls) ModeLetter ModeCountLetter
1 Red 2 2 5.5 4 (or 7)
1 Yellow 1 2 7 7
1 Green 1 1 9 9
1 Blue 1 1 - -
2 Blue 1 2 9 9 etc
2 Green - 1 - -
3 Green 2 2 8 8
3 Yellow - 1 - -
3 Blue 1 1 8 8
3 Red - 1 - -
This is easy to do for the count and mean but more tricky for everything else. Any advice would be appreciated.
Thanks.
As far as I know - there's no simple way to compute mode - you have to count the occurrences of each value and then join the result with the maximum (per key) of that result. The rest of the computations are rather straight-forward:
// count occurrences of each number in its category and color
val numberCounts = df.groupBy("Category", "Color", "Number").count().cache()
// compute modes for Number - joining counts with the maximum count per category and color:
val modeNumbers = numberCounts.as("base").join(numberCounts.groupBy("Category", "Color").agg(max("count") as "_max").as("max"),
$"base.Category" === $"max.Category" and
$"base.Color" === $"max.Color" and
$"base.count" === $"max._max")
.select($"base.Category", $"base.Color", $"base.Number", $"_max")
.groupBy("Category", "Color")
.agg(first($"Number", ignoreNulls = true) as "ModeNumber", first("_max") as "ModeCountNumber")
.where($"ModeNumber".isNotNull)
// now compute Size, Count and Mean (simple) and join to add Mode:
val result = df.groupBy("Category", "Color").agg(
count("Color") as "Size", // counting a key column -> includes nulls
count("Number") as "CountNumber", // does not include nulls
mean("Number") as "MeanNumber"
).join(modeNumbers, Seq("Category", "Color"), "left")
result.show()
// +--------+------+----+-----------+----------+----------+---------------+
// |Category| Color|Size|CountNumber|MeanNumber|ModeNumber|ModeCountNumber|
// +--------+------+----+-----------+----------+----------+---------------+
// | 3|Yellow| 1| 0| null| null| null|
// | 1| Green| 1| 1| 9.0| 9| 1|
// | 1| Red| 2| 2| 5.5| 7| 1|
// | 2| Green| 1| 0| null| null| null|
// | 3| Blue| 1| 1| 8.0| 8| 1|
// | 1|Yellow| 2| 1| 7.0| 7| 1|
// | 2| Blue| 2| 1| 9.0| 9| 1|
// | 3| Green| 2| 2| 8.0| 8| 2|
// | 1| Blue| 1| 0| null| null| null|
// | 3| Red| 1| 0| null| null| null|
// +--------+------+----+-----------+----------+----------+---------------+
As you can imagine - this might be slow, as it has 4 groupBys and two joins - all requiring shuffles...
As for the Letter column statistics - I'm afraid you'll have to repeat this for that column separately and add another join.