How to do cumulative sum based on conditions in spark scala

How to do cumulative sum based on conditions in spark scala - scala

I have the below data and final_column is the exact output what I am trying to get. I am trying to do cumulative sum of flag and want to rest if flag is 0 then set value to 0 as below data
cola date flag final_column
a 2021-10-01 0 0
a 2021-10-02 1 1
a 2021-10-03 1 2
a 2021-10-04 0 0
a 2021-10-05 0 0
a 2021-10-06 0 0
a 2021-10-07 1 1
a 2021-10-08 1 2
a 2021-10-09 1 3
a 2021-10-10 0 0
b 2021-10-01 0 0
b 2021-10-02 1 1
b 2021-10-03 1 2
b 2021-10-04 0 0
b 2021-10-05 0 0
b 2021-10-06 1 1
b 2021-10-07 1 2
b 2021-10-08 1 3
b 2021-10-09 1 4
b 2021-10-10 0 0
I have tried like
import org.apache.spark.sql.functions._
df.withColumn("final_column",expr("sum(flag) over(partition by cola order date asc)"))
I have tried to add condition like case when flag = 0 then 0 else 1 end inside sum function but not working.

You can define a column group using conditional sum on flag, then using row_number with a Window partitioned by cola and group gives the result you want:
import org.apache.spark.sql.expressions.Window
val result = df.withColumn(
"group",
sum(when(col("flag") === 0, 1).otherwise(0)).over(Window.partitionBy("cola").orderBy("date"))
).withColumn(
"final_column",
row_number().over(Window.partitionBy("cola", "group").orderBy("date")) - 1
).drop("group")
result.show
//+----+-----+----+------------+
//|cola| date|flag|final_column|
//+----+-----+----+------------+
//| b|44201| 0| 0|
//| b|44202| 1| 1|
//| b|44203| 1| 2|
//| b|44204| 0| 0|
//| b|44205| 0| 0|
//| b|44206| 1| 1|
//| b|44207| 1| 2|
//| b|44208| 1| 3|
//| b|44209| 1| 4|
//| b|44210| 0| 0|
//| a|44201| 0| 0|
//| a|44202| 1| 1|
//| a|44203| 1| 2|
//| a|44204| 0| 0|
//| a|44205| 0| 0|
//| a|44206| 0| 0|
//| a|44207| 1| 1|
//| a|44208| 1| 2|
//| a|44209| 1| 3|
//| a|44210| 0| 0|
//+----+-----+----+------------+
row_number() - 1 in this case is just equivalent to sum(col("flag")) as flag values are always 0 or 1. So the above final_column can also be written as:
.withColumn(
"final_column",
sum(col("flag")).over(Window.partitionBy("cola", "group").orderBy("date"))
)

Related

Window function with PySpark

I have a PySpark Dataframe and my goal is to create a Flag column whose value depends on the value of the Amount column.
Basically, for each Group, I want to know if in any of the first three months, there is an amount greater than 0 and if that is the case, the value of the Flag column will be 1 for all the group, otherwise the value will be 0.
I will include an example to clarify a bit better.
Initial PySpark Dataframe:
Group
Month
Amount
A
1
0
A
2
0
A
3
35
A
4
0
A
5
0
B
1
0
B
2
0
C
1
0
C
2
0
C
3
0
C
4
13
D
1
0
D
2
24
D
3
0
Final PySpark Dataframe:
Group
Month
Amount
Flag
A
1
0
1
A
2
0
1
A
3
35
1
A
4
0
1
A
5
0
1
B
1
0
0
B
2
0
0
C
1
0
0
C
2
0
0
C
3
0
0
C
4
13
0
D
1
0
1
D
2
24
1
D
3
0
1
Basically, what I want is for each group, to sum the amount of the first 3 months. If that sum is greater than 0, the flag is 1 for all the elements of the group, and otherwise is 0.

You can create the flag column by applying a Window function. Create a psuedo-column which becomes 1 if the criteria is met and then finally sum over the psuedo-column and if it's greater than 0, then there was atleast once row that met the criteria and set the flag to 1.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
data = [("A", 1, 0, ),
("A", 2, 0, ),
("A", 3, 35, ),
("A", 4, 0, ),
("A", 5, 0, ),
("B", 1, 0, ),
("B", 2, 0, ),
("C", 1, 0, ),
("C", 2, 0, ),
("C", 3, 0, ),
("C", 4, 13, ),
("D", 1, 0, ),
("D", 2, 24, ),
("D", 3, 0, ), ]
df = spark.createDataFrame(data, ("Group", "Month", "Amount", ))
ws = W.partitionBy("Group").orderBy("Month").rowsBetween(W.unboundedPreceding, W.unboundedFollowing)
criteria = F.when((F.col("Month") < 4) & (F.col("Amount") > 0), F.lit(1)).otherwise(F.lit(0))
(df.withColumn("flag", F.when(F.sum(criteria).over(ws) > 0, F.lit(1)).otherwise(F.lit(0)))
).show()
"""
+-----+-----+------+----+
|Group|Month|Amount|flag|
+-----+-----+------+----+
| A| 1| 0| 1|
| A| 2| 0| 1|
| A| 3| 35| 1|
| A| 4| 0| 1|
| A| 5| 0| 1|
| B| 1| 0| 0|
| B| 2| 0| 0|
| C| 1| 0| 0|
| C| 2| 0| 0|
| C| 3| 0| 0|
| C| 4| 13| 0|
| D| 1| 0| 1|
| D| 2| 24| 1|
| D| 3| 0| 1|
+-----+-----+------+----+
"""

You can use Window function with count and when.
w = Window.partitionBy('Group')
df = df.withColumn('Flag', F.count(
F.when((F.col('Month') < 4) & (F.col('Amount') > 0), True)).over(w))
.withColumn('Flag', F.when(F.col('Flag') > 0, 1).otherwise(0))

Spark Scala create a new column which contains addition of previous balance amount for each cid

Initial DF:
cid transAmt trasnDate
1 10 2-Aug
1 20 3-Aug
1 30 3-Aug
2 40 2-Aug
2 50 3-Aug
3 60 4-Aug
Output DF:
cid transAmt trasnDate sumAmt
1 10 2-Aug **10**
1 20 3-Aug **30**
1 30 3-Aug **60**
2 40 2-Aug **40**
2 50 3-Aug **90**
3 60 4-Aug **60**
I need a new column as sumAmt which has the addition for each cid

Use window sum function to get the cumulative sum.
Example:
df.show()
//+---+------+----------+
//|cid|Amount|transnDate|
//+---+------+----------+
//| 1| 10| 2-Aug|
//| 1| 20| 3-Aug|
//| 2| 40| 2-Aug|
//| 2| 50| 3-Aug|
//| 3| 60| 4-Aug|
//+---+------+----------+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions._
val w= Window.partitionBy("cid").orderBy("Amount","transnDate")
df.withColumn("sumAmt",sum(col("Amount")).over(w)).show()
//+---+------+----------+------+
//|cid|Amount|transnDate|sumAmt|
//+---+------+----------+------+
//| 1| 10| 2-Aug| 10|
//| 1| 20| 3-Aug| 30|
//| 3| 60| 4-Aug| 60|
//| 2| 40| 2-Aug| 40|
//| 2| 50| 3-Aug| 90|
//+---+------+----------+------+

Just use a simple window indicating rows between.
Window.unboundedPreceding meaning no lower limit
Window.currentRow meaning current row (pretty obvious)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val cidCategory = Window.partitionBy("cid")
.orderBy("transDate")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
val result = df.withColumn("sumAmt", sum($"transAmt").over(cidCategory))
OUTPUT

How to compute cumulative sum on multiple float columns?

I have 100 float columns in a Dataframe which are ordered by date.
ID Date C1 C2 ....... C100
1 02/06/2019 32.09 45.06 99
1 02/04/2019 32.09 45.06 99
2 02/03/2019 32.09 45.06 99
2 05/07/2019 32.09 45.06 99
I need to get C1 to C100 in the cumulative sum based on id and date.
Target dataframe should look like this:
ID Date C1 C2 ....... C100
1 02/04/2019 32.09 45.06 99
1 02/06/2019 64.18 90.12 198
2 02/03/2019 32.09 45.06 99
2 05/07/2019 64.18 90.12 198
I want to achieve this without looping from C1- C100.
Initial code for one column:
var DF1 = DF.withColumn("CumSum_c1", sum("C1").over(
Window.partitionBy("ID")
.orderBy(col("date").asc)))
I found a similar question here but he manually did it for two columns : Cumulative sum in Spark

Its a classical use for foldLeft. Let's generate some data first :
import org.apache.spark.sql.expressions._
val df = spark.range(1000)
.withColumn("c1", 'id + 3)
.withColumn("c2", 'id % 2 + 1)
.withColumn("date", monotonically_increasing_id)
.withColumn("id", 'id % 10 + 1)
// We will select the columns we want to compute the cumulative sum of.
val columns = df.drop("id", "date").columns
val w = Window.partitionBy(col("id")).orderBy(col("date").asc)
val results = columns.foldLeft(df)((tmp_, column) => tmp_.withColumn(s"cum_sum_$column", sum(column).over(w)))
results.orderBy("id", "date").show
// +---+---+---+-----------+----------+----------+
// | id| c1| c2| date|cum_sum_c1|cum_sum_c2|
// +---+---+---+-----------+----------+----------+
// | 1| 3| 1| 0| 3| 1|
// | 1| 13| 1| 10| 16| 2|
// | 1| 23| 1| 20| 39| 3|
// | 1| 33| 1| 30| 72| 4|
// | 1| 43| 1| 40| 115| 5|
// | 1| 53| 1| 8589934592| 168| 6|
// | 1| 63| 1| 8589934602| 231| 7|

Here is another way using simple select expression :
val w = Window.partitionBy($"id").orderBy($"date".asc).rowsBetween(Window.unboundedPreceding, Window.currentRow)
// get columns you want to sum
val columnsToSum = df.drop("ID", "Date").columns
// map over those columns and create new sum columns
val selectExpr = Seq(col("ID"), col("Date")) ++ columnsToSum.map(c => sum(col(c)).over(w).alias(c)).toSeq
df.select(selectExpr:_*).show()
Gives:
+---+----------+-----+-----+----+
| ID| Date| C1| C2|C100|
+---+----------+-----+-----+----+
| 1|02/04/2019|32.09|45.06| 99|
| 1|02/06/2019|64.18|90.12| 198|
| 2|02/03/2019|32.09|45.06| 99|
| 2|05/07/2019|64.18|90.12| 198|
+---+----------+-----+-----+----+

Recursively Generate Hierarchy Data set using spark scala functional programming

I am bit new to functional programming. How to generate the below sequence of data.
Below is the input dataset of the following columns:
INPUT
ID PARENT_ID AMT NAME
1 none 1000 A
2 1 -5000 B
3 2 -2000 C
5 3 7000 D
6 4 -7000 E
4 none 7000 F
OUTPUT
ID PARENT_ID AMT AMT_1 AMT_2 AMT_3 NAME_1 ...
1 none 1000 none none none none
2 1 -5000 1000 none none A
3 2 -2000 -5000 1000 none B
4 none 7000 none none none none
5 3 7000 -2000 -5000 1000 C
6 4 -7000 7000 none none D

Here's one way to perform the recursive join up to a specific level:
import org.apache.spark.sql.functions._
val df = Seq(
(Some(1), None, Some(1000), Some("A")),
(Some(2), Some(1), Some(-5000), Some("B")),
(Some(3), Some(2), Some(-2000), Some("C")),
(Some(4), None, Some(7000), Some("D")),
(Some(5), Some(3), Some(7000), Some("E")),
(Some(6), Some(4), Some(-7000), Some("F"))
).toDF("id", "parent_id", "amt", "name")
val nestedLevel = 3
(1 to nestedLevel).foldLeft( df.as("d0") ){ (accDF, i) =>
val j = i - 1
accDF.join(df.as(s"d$i"), col(s"d$j.parent_id") === col(s"d$i.id"), "left_outer")
}.
select(
col("d0.id") :: col("d0.parent_id") ::
col("d0.amt").as("amt") :: col("d0.name").as("name") :: (
(1 to nestedLevel).toList.map(i => col(s"d$i.amt").as(s"amt_$i")) :::
(1 to nestedLevel).toList.map(i => col(s"d$i.name").as(s"name_$i"))
): _*
).
show
// +---+---------+-----+----+-----+-----+-----+------+------+------+
// | id|parent_id| amt|name|amt_1|amt_2|amt_3|name_1|name_2|name_3|
// +---+---------+-----+----+-----+-----+-----+------+------+------+
// | 1| null| 1000| A| null| null| null| null| null| null|
// | 2| 1|-5000| B| 1000| null| null| A| null| null|
// | 3| 2|-2000| C|-5000| 1000| null| B| A| null|
// | 4| null| 7000| D| null| null| null| null| null| null|
// | 5| 3| 7000| E|-2000|-5000| 1000| C| B| A|
// | 6| 4|-7000| F| 7000| null| null| D| null| null|
// +---+---------+-----+----+-----+-----+-----+------+------+------+

Find Most Common Value and Corresponding Count Using Spark Groupby Aggregates

I am trying to use Spark (Scala) dataframes to do groupby aggregates for mode and the corresponding count.
For example,
Suppose we have the following dataframe:
Category Color Number Letter
1 Red 4 A
1 Yellow Null B
3 Green 8 C
2 Blue Null A
1 Green 9 A
3 Green 8 B
3 Yellow Null C
2 Blue 9 B
3 Blue 8 B
1 Blue Null Null
1 Red 7 C
2 Green Null C
1 Yellow 7 Null
3 Red Null B
Now we want to group by Category, then Color, and then find the size of the grouping, count of number non-nulls, the total size of number, the mean of number, the mode of number, and the corresponding mode count. For letter I'd like the count of non-nulls and the corresponding mode and mode count (no mean since this is a string).
So the output would ideally be:
Category Color CountNumber(Non-Nulls) Size MeanNumber ModeNumber ModeCountNumber CountLetter(Non-Nulls) ModeLetter ModeCountLetter
1 Red 2 2 5.5 4 (or 7)
1 Yellow 1 2 7 7
1 Green 1 1 9 9
1 Blue 1 1 - -
2 Blue 1 2 9 9 etc
2 Green - 1 - -
3 Green 2 2 8 8
3 Yellow - 1 - -
3 Blue 1 1 8 8
3 Red - 1 - -
This is easy to do for the count and mean but more tricky for everything else. Any advice would be appreciated.
Thanks.

As far as I know - there's no simple way to compute mode - you have to count the occurrences of each value and then join the result with the maximum (per key) of that result. The rest of the computations are rather straight-forward:
// count occurrences of each number in its category and color
val numberCounts = df.groupBy("Category", "Color", "Number").count().cache()
// compute modes for Number - joining counts with the maximum count per category and color:
val modeNumbers = numberCounts.as("base").join(numberCounts.groupBy("Category", "Color").agg(max("count") as "_max").as("max"),
$"base.Category" === $"max.Category" and
$"base.Color" === $"max.Color" and
$"base.count" === $"max._max")
.select($"base.Category", $"base.Color", $"base.Number", $"_max")
.groupBy("Category", "Color")
.agg(first($"Number", ignoreNulls = true) as "ModeNumber", first("_max") as "ModeCountNumber")
.where($"ModeNumber".isNotNull)
// now compute Size, Count and Mean (simple) and join to add Mode:
val result = df.groupBy("Category", "Color").agg(
count("Color") as "Size", // counting a key column -> includes nulls
count("Number") as "CountNumber", // does not include nulls
mean("Number") as "MeanNumber"
).join(modeNumbers, Seq("Category", "Color"), "left")
result.show()
// +--------+------+----+-----------+----------+----------+---------------+
// |Category| Color|Size|CountNumber|MeanNumber|ModeNumber|ModeCountNumber|
// +--------+------+----+-----------+----------+----------+---------------+
// | 3|Yellow| 1| 0| null| null| null|
// | 1| Green| 1| 1| 9.0| 9| 1|
// | 1| Red| 2| 2| 5.5| 7| 1|
// | 2| Green| 1| 0| null| null| null|
// | 3| Blue| 1| 1| 8.0| 8| 1|
// | 1|Yellow| 2| 1| 7.0| 7| 1|
// | 2| Blue| 2| 1| 9.0| 9| 1|
// | 3| Green| 2| 2| 8.0| 8| 2|
// | 1| Blue| 1| 0| null| null| null|
// | 3| Red| 1| 0| null| null| null|
// +--------+------+----+-----------+----------+----------+---------------+
As you can imagine - this might be slow, as it has 4 groupBys and two joins - all requiring shuffles...
As for the Letter column statistics - I'm afraid you'll have to repeat this for that column separately and add another join.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to do cumulative sum based on conditions in spark scala - scala

Related

Window function with PySpark

Spark Scala create a new column which contains addition of previous balance amount for each cid

How to compute cumulative sum on multiple float columns?

Recursively Generate Hierarchy Data set using spark scala functional programming

Find Most Common Value and Corresponding Count Using Spark Groupby Aggregates

Categories

Resources