I have a requirement to aggregate data in a Spark DataFrame in Scala.
I have a Spark DataFrame with two columns.
mo_id sales
201601 10
201602 20
201603 30
201604 40
201605 50
201606 60
201607 70
201608 80
201609 90
201610 100
201611 110
201612 120
As shown above, the DataFrame has two columns, 'mo_id' and 'sales'.
I want to add a new column (agg_sales) to the DataFrame that holds the cumulative sum of sales up to the current month, as shown below.
mo_id sales agg_sales
201601 10 10
201602 20 30
201603 30 60
201604 40 100
201605 50 150
201606 60 210
201607 70 280
201608 80 360
201609 90 450
201610 100 550
201611 110 660
201612 120 780
Description:
For the month 201603, agg_sales will be the sum of sales from 201601 to 201603.
For the month 201604, agg_sales will be the sum of sales from 201601 to 201604.
And so on.
Can anyone please help me do this?
Versions used: Spark 1.6.2 and Scala 2.10.
You are looking for a cumulative sum, which can be accomplished with a window function:
scala> val df = sc.parallelize(Seq((201601, 10), (201602, 20), (201603, 30), (201604, 40), (201605, 50), (201606, 60), (201607, 70), (201608, 80), (201609, 90), (201610, 100), (201611, 110), (201612, 120))).toDF("id","sales")
df: org.apache.spark.sql.DataFrame = [id: int, sales: int]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val ordering = Window.orderBy("id")
ordering: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@75d454a4
scala> df.withColumn("agg_sales", sum($"sales").over(ordering)).show
16/12/27 21:11:35 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+------+-----+---------+
|    id|sales|agg_sales|
+------+-----+---------+
|201601|   10|       10|
|201602|   20|       30|
|201603|   30|       60|
|201604|   40|      100|
|201605|   50|      150|
|201606|   60|      210|
|201607|   70|      280|
|201608|   80|      360|
|201609|   90|      450|
|201610|  100|      550|
|201611|  110|      660|
|201612|  120|      780|
+------+-----+---------+
Note that I defined the ordering on the ids; you would probably want some sort of timestamp to order the summation.
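If you prefer to spell out the cumulative frame instead of relying on the default frame an ordered window gives you, a minimal sketch (Spark 1.6 syntax, where Long.MinValue stands in for an unbounded preceding bound) is shown below; adding a partitionBy on a grouping column, if your data has one, also avoids the single-partition warning:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Explicit frame: from the first row of the ordering up to the current row.
val cumulative = Window.orderBy("id").rowsBetween(Long.MinValue, 0)

// Same cumulative sum as above, just with the frame written out.
val withAgg = df.withColumn("agg_sales", sum($"sales").over(cumulative))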
I want a 3-month average of sales in PySpark.
Input:
Product Date Sales
A 01/04/2020 50
A 02/04/2020 60
A 01/05/2020 70
A 05/05/2020 80
A 10/06/2020 100
A 13/06/2020 150
A 25/07/2020 160
Output:
Product Date Sales 3-month Avg Sales
A 01/04/2020 50 36.67
A 02/04/2020 60 36.67
A 01/05/2020 70 86.67
A 05/05/2020 80 86.67
A 10/06/2020 100 170
A 13/06/2020 150 170
A 25/07/2020 160 186.67
The average for July is the sum of sales for May, June, and July divided by 3: 560/3 = 186.67.
Sometimes dense_rank is quite expensive, so I calculated a custom month index instead and followed steps similar to @Cena's answer.
from pyspark.sql import Window
from pyspark.sql.functions import *
w = Window.partitionBy('Product').orderBy('index').rangeBetween(-2, 0)
df.withColumn('Date', to_date('Date', 'dd/MM/yyyy')) \
  .withColumn('index', (year('Date') - 2020) * 12 + month('Date')) \
  .withColumn('avg', sum('Sales').over(w) / 3) \
  .show()
+-------+----------+-----+-----+------------------+
|Product| Date|Sales|index| avg|
+-------+----------+-----+-----+------------------+
| A|2020-04-01| 50| 4|36.666666666666664|
| A|2020-04-02| 60| 4|36.666666666666664|
| A|2020-05-01| 70| 5| 86.66666666666667|
| A|2020-05-05| 80| 5| 86.66666666666667|
| A|2020-06-10| 100| 6| 170.0|
| A|2020-06-13| 150| 6| 170.0|
| A|2020-07-25| 160| 7|186.66666666666666|
+-------+----------+-----+-----+------------------+
You can use dense_rank() over the month column to compute the moving average. Cast the date and extract the month from it; dense_rank() over the month gives you consecutive ranks.
For the moving average, you can use rangeBetween(-2, 0) to look back 2 months from the current month. Sum the sales and divide by 3 for the output.
Your df:
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.functions import *
from pyspark.sql.window import Window
row = Row("Product", "Date", "Sales")
df = sc.parallelize([row("A", "01/04/2020", 50),row("A", "02/04/2020", 60),row("A", "01/05/2020", 70),row("A", "05/05/2020", 80),row("A", "10/06/2020", 100),row("A", "13/06/2020", 150),row("A", "25/07/2020", 160)]).toDF()
df = df.withColumn('date_cast', from_unixtime(unix_timestamp('Date', 'dd/MM/yyyy')).cast(DateType()))
df = df.withColumn('month', month("date_cast"))
w = Window().partitionBy("Product").orderBy("month")
df = df.withColumn('rank', F.dense_rank().over(w))
w2 = (Window().partitionBy(col("Product")).orderBy("rank").rangeBetween(-2, 0))
df.select(col("*"), ((F.sum("Sales").over(w2))/3).alias("mean"))\
.drop("date_cast", "month", "rank").show()
Output:
+-------+----------+-----+------------------+
|Product| Date|Sales| mean|
+-------+----------+-----+------------------+
| A|01/04/2020| 50|36.666666666666664|
| A|02/04/2020| 60|36.666666666666664|
| A|01/05/2020| 70| 86.66666666666667|
| A|05/05/2020| 80| 86.66666666666667|
| A|10/06/2020| 100| 170.0|
| A|13/06/2020| 150| 170.0|
| A|25/07/2020| 160|186.66666666666666|
+-------+----------+-----+------------------+
I have an input DataFrame and have to produce an output DataFrame.
In the input DataFrame, I have to group by several columns, and if the sum of another column for that group equals some value, then I have to update one column to x for each member of that group.
So I will get several groups and have to update one of their columns to x; for rows that don't fall into any such group, the value in that column must not be changed.
Like:
Job id, job name, department, age, old.
The first 3 columns are grouped; if sum(age) = 100, then old gets x for all rows in the group.
And there will be several groups.
The output DataFrame will have the same number of rows as the input one.
val dfIn = job id , job name , department , age , old
24 Dev Sales 30 0
24 Dev Sales 40 0
24 Dev Sales 20 0
24 Dev Sales 10 0
24 Dev HR 30 0
24 Dev HR 20 0
24 Dev Retail 50 0
24 Dev Retail 50 0
val dfOut= job id , job name , department , age , old
24 Dev Sales 30 x
24 Dev Sales 40 x
24 Dev Sales 20 x
24 Dev Sales 10 x
24 Dev HR 30 0
24 Dev HR 20 0
24 Dev Retail 50 x
24 Dev Retail 50 x
Just calculate sum_age using a window function, then use when/otherwise to assign X to the old column when sum_age = 100 and otherwise keep the existing value 0.
import org.apache.spark.sql.expressions.Window
val df = Seq(
(24, "Dev", "Sales", 30, "0"), (24, "Dev", "Sales", 40, "0"),
(24, "Dev", "Sales", 20, "0"), (24, "Dev", "Sales", 10, "0"),
(24, "Dev", "HR", 30, "0"), (24, "Dev", "HR", 20, "0"),
(24, "Dev", "Retail", 50, "0"), (24, "Dev", "Retail", 50, "0")
).toDF("job_id", "job_name", "department", "age", "old")
val w = Window.partitionBy($"job_id", $"job_name", $"department").orderBy($"job_id")
val dfOut = df.withColumn("sum_age", sum(col("age")).over(w))
.withColumn("old", when($"sum_age" === lit(100), lit("X")).otherwise($"old"))
.drop($"sum_age")
dfOut.show()
+------+--------+----------+---+---+
|job_id|job_name|department|age|old|
+------+--------+----------+---+---+
| 24| Dev| HR| 30| 0|
| 24| Dev| HR| 20| 0|
| 24| Dev| Retail| 50| X|
| 24| Dev| Retail| 50| X|
| 24| Dev| Sales| 30| X|
| 24| Dev| Sales| 40| X|
| 24| Dev| Sales| 20| X|
| 24| Dev| Sales| 10| X|
+------+--------+----------+---+---+
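If you would rather avoid a window entirely, a sketch of the same logic as an aggregate-and-join (assuming the same df, column names, and threshold as above) could look like this:
// Sum the ages per group, then join the totals back and flag matching rows.
val sums = df.groupBy("job_id", "job_name", "department")
  .agg(sum("age").as("sum_age"))

val dfOut2 = df.join(sums, Seq("job_id", "job_name", "department"))
  .withColumn("old", when($"sum_age" === 100, lit("X")).otherwise($"old"))
  .drop("sum_age")

dfOut2.show()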
I want to compare previous data with current data month by month. I have data like below.
Data-set 1 : (Prev) Data-set 2 : (Latest)
Year-month Sum-count Year-Month Sum-count
-- -- 201808 48
201807 30 201807 22
201806 20 201806 20
201805 35 201805 20
201804 12 201804 9
201803 15 -- --
I have data sets as shown above. I want to compare both data sets based on the year-month column and sum-count, and I need to find the difference as a percentage.
I am using Spark 2.3.0 and Scala 2.11.
Here is my code:
import org.apache.spark.sql.functions.lag
val mdf = spark.read.format("csv").
option("InferSchema","true").
option("header","true").
option("delimiter",",").
option("charset","utf-8").
load("c:\\test.csv")
mdf.createOrReplaceTempView("test")
val res = spark.sql("select `Year-month`, SUM(`Sum-count`) as SUM_AMT from test group by `Year-month`")
val win = org.apache.spark.sql.expressions.Window.orderBy("Year-month")
val res1 = res.withColumn("Prev_month", lag("SUM_AMT", 1, 0).over(win))
  .withColumn("percentage", col("Prev_month") / sum("SUM_AMT").over())
  .show()
I need output like this:
If the percentage is more than 10%, then I need to set the flag to F.
set1 cnt set2 cnt output(Percentage) Flag
201807 30 201807 22 7% T
201806 20 201806 20 0% T
201805 35 201805 20 57% F
Please help me with this.
It can be done in the following way:
val data1 = List(
("201807", 30),
("201806", 20),
("201805", 35),
("201804", 12),
("201803", 15)
)
val data2 = List(
("201808", 48),
("201807", 22),
("201806", 20),
("201805", 20),
("201804", 9)
)
val df1 = data1.toDF("Year-month", "Sum-count")
val df2 = data2.toDF("Year-month", "Sum-count")
val joined = df1.alias("df1").join(df2.alias("df2"), "Year-month")
joined
.withColumn("output(Percentage)", abs($"df1.Sum-count" - $"df2.Sum-count").divide($"df1.Sum-count"))
.withColumn("Flag", when($"output(Percentage)" > 0.1, "F").otherwise("T"))
.show(false)
Output:
+----------+---------+---------+-------------------+----+
|Year-month|Sum-count|Sum-count|output(Percentage) |Flag|
+----------+---------+---------+-------------------+----+
|201807 |30 |22 |0.26666666666666666|F |
|201806 |20 |20 |0.0 |T |
|201805 |35 |20 |0.42857142857142855|F |
|201804 |12 |9 |0.25 |F |
+----------+---------+---------+-------------------+----+
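If you want the difference shown as a rounded percent (as in the desired output) rather than a fraction, a small sketch on the same joined DataFrame could be:
// Sketch only: express the difference as a percent and keep the same flag rule.
joined
  .withColumn("output(Percentage)",
    round(abs($"df1.Sum-count" - $"df2.Sum-count").divide($"df1.Sum-count") * 100, 2))
  .withColumn("Flag", when($"output(Percentage)" > 10, "F").otherwise("T"))
  .show(false)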
Here's my solution:
val values1 = List(
  List("201807", "30"),
  List("201806", "20"),
  List("201805", "35"),
  List("201804", "12"),
  List("201803", "15")
).map(x => (x(0), x(1)))
val values2 = List(
  List("201808", "48"),
  List("201807", "22"),
  List("201806", "20"),
  List("201805", "20"),
  List("201804", "9")
).map(x => (x(0), x(1)))
import spark.implicits._
import org.apache.spark.sql.functions._
val df1 = values1.toDF
val df2 = values2.toDF
df1.join(df2, Seq("_1"), "full").toDF("set", "cnt1", "cnt2")
.withColumn("percentage1", col("cnt1")/sum("cnt1").over() * 100)
.withColumn("percentage2", col("cnt2")/sum("cnt2").over() * 100)
.withColumn("percentage", abs(col("percentage2") - col("percentage1")))
.withColumn("flag", when(col("percentage") > 10, "F").otherwise("T")).na.drop().show()
Here's the result:
+------+----+----+------------------+------------------+------------------+----+
|   set|cnt1|cnt2|       percentage1|       percentage2|        percentage|flag|
+------+----+----+------------------+------------------+------------------+----+
|201804|  12|   9|10.714285714285714| 7.563025210084033|  3.15126050420168|   T|
|201807|  30|  22|26.785714285714285|18.487394957983195|  8.29831932773109|   T|
|201806|  20|  20|17.857142857142858| 16.80672268907563|1.0504201680672267|   T|
|201805|  35|  20|             31.25| 16.80672268907563|14.443277310924369|   F|
+------+----+----+------------------+------------------+------------------+----+
I hope it helps :)
I need to group by Id and Times and show the max date.
Id Key Times date
20 40 1 20190323
20 41 1 20191201
31 33 3 20191209
My output should be:
Id Key Times date
20 41 1 20191201
31 33 3 20191209
You can simply apply the groupBy function to group by Id and then join with the original dataset to get the Key column into your resulting DataFrame. Try the following code:
//your original dataframe
val df = Seq((20,40,1,20190323),(20,41,1,20191201),(31,33,3,20191209))
.toDF("Id","Key","Times","date")
df.show()
//output
//+---+---+-----+--------+
//| Id|Key|Times| date|
//+---+---+-----+--------+
//| 20| 40| 1|20190323|
//| 20| 41| 1|20191201|
//| 31| 33| 3|20191209|
//+---+---+-----+--------+
//group by Id column
val maxDate = df.groupBy("Id").agg(max("date").as("maxdate"))
//join with original DF to get rest of the column
maxDate.join(df, Seq("Id"))
.where($"date" === $"maxdate")
.select("Id","Key","Times","date").show()
//output
//+---+---+-----+--------+
//| Id|Key|Times| date|
//+---+---+-----+--------+
//| 31| 33| 3|20191209|
//| 20| 41| 1|20191201|
//+---+---+-----+--------+
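For reference, here is a sketch of the same result using a window function instead of the aggregate-and-join, assuming the same df as above:
//compute the max date per Id with a window, then keep only the matching rows
import org.apache.spark.sql.expressions.Window

val byId = Window.partitionBy("Id")
df.withColumn("maxdate", max("date").over(byId))
  .where($"date" === $"maxdate")
  .drop("maxdate")
  .show()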
E.g., I would like to add up the quantity sold by date.
Date Quantity
11/4/2017 20
11/4/2017 23
11/4/2017 12
11/5/2017 18
11/5/2017 12
Output with the new Column:
Date Quantity New_Column
11/4/2017 20 55
11/4/2017 23 55
11/4/2017 12 55
11/5/2017 18 30
11/5/2017 12 30
Simply use sum as a window function by specifying a WindowSpec:
import org.apache.spark.sql.expressions.Window
df.withColumn("New_Column", sum("Quantity").over(Window.partitionBy("Date"))).show
+---------+--------+----------+
| Date|Quantity|New_Column|
+---------+--------+----------+
|11/5/2017| 18| 30|
|11/5/2017| 12| 30|
|11/4/2017| 20| 55|
|11/4/2017| 23| 55|
|11/4/2017| 12| 55|
+---------+--------+----------+
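If you only needed one row per date rather than keeping every original row, a plain aggregation would be enough; the window version above is what preserves the per-row detail. A minimal sketch of the aggregate form, assuming the same df:
//one row per Date with the summed quantity (drops the per-row detail)
df.groupBy("Date").agg(sum("Quantity").as("New_Column")).show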