Creating a new column by looping through a Spark SQL DataFrame - Scala

I have a Spark SQL DataFrame that contains a unique code, a monthdate and a turnover amount. For each monthdate I want to compute turnover_per_year as the sum of the turnover over the 12 preceding months plus the current month. For example, if the monthdate is January 2022, the sum runs from January 2021 through January 2022.
So, with data for 2021 and 2022, turnover_per_year on January 1st 2022 = turnover Jan 2021 + turnover Feb 2021 + turnover Mar 2021 + ... + turnover Dec 2021 + turnover Jan 2022. For January 2021, however, turnover_per_year should be null because there is no data for 2020.
Here is a sample of the DataFrame:
+------+------------+----------+
| code | monthdate  | turnover |
+------+------------+----------+
| AA1  | 2021-01-01 |       10 |
| AA1  | 2021-02-01 |       20 |
| AA1  | 2021-03-01 |       30 |
| AA1  | 2021-04-01 |       40 |
| AA1  | 2021-05-01 |       50 |
| AA1  | 2021-06-01 |       60 |
| AA1  | 2021-07-01 |       70 |
| AA1  | 2021-08-01 |       80 |
| AA1  | 2021-09-01 |       90 |
| AA1  | 2021-10-01 |      100 |
| AA1  | 2021-11-01 |      101 |
| AA1  | 2021-12-01 |      102 |
| AA1  | 2022-01-01 |      103 |
| AA1  | 2022-02-01 |      104 |
| AA1  | 2022-03-01 |      105 |
| AA1  | 2022-04-01 |      106 |
| AA1  | 2022-05-01 |      107 |
| AA1  | 2022-06-01 |      108 |
| AA1  | 2022-07-01 |      109 |
| AA1  | 2022-08-01 |      110 |
| AA1  | 2022-09-01 |      111 |
| AA1  | 2022-10-01 |      112 |
| AA1  | 2022-11-01 |      113 |
| AA1  | 2022-12-01 |      114 |
+------+------------+----------+
I'm very new to Spark and Scala, and it is confusing for me to solve this the Spark/Scala way. I have worked out the logic but have difficulty translating it into Spark Scala; I'm running in cluster mode. Here is my logic (pseudocode):
val listkey = df.select("code").distinct.map(r => r(0)).collect()
listkey.foreach(key =>
  df.select("*").filter(s"code == '$key'").orderBy("monthdate").foreach(row =>
    var monthdate = row.monthdate
    var turnover = row.turnover
    var sum = 0
    sum = sum + turnover
    var n = 1
    var i = 1
    while (n < 12) {
      var monthdate_temp = monthdate - i   // the i-th previous month
      var turnover_temp =
        df.select("turnover").filter(s"monthdate == '$monthdate_temp' and code == '$key'").collect()
      sum = sum + turnover_temp
      n = n + 1
      i = i + 1
    }
    row = row.withColumn("turnover_per_year", sum)
  )
)
Any help will be appreciated, thanks in advance

Each row in the original DataFrame can be expanded to 13 rows (the current month plus the 12 preceding months) with the explode function; the result is then joined back to the original DataFrame and grouped:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("AA1", "2021-01-01", 25),
  ("AA1", "2022-01-01", 103)
)
  .toDF("code", "monthdate", "turnover")
  .withColumn("monthdate", to_date($"monthdate", "yyyy-MM-dd"))

// 0 to 12: the current month plus the 12 months before it
val oneYearBackMonths = (0 to 12).map(n => lit(-n))

val explodedWithBackMonths = df
  .withColumn("shift", explode(array(oneYearBackMonths: _*)))
  .withColumn("rangeMonth", expr("add_months(monthdate, shift)"))

val joinCondition = $"exploded.code" === $"original.code" &&
  $"exploded.rangeMonth" === $"original.monthdate"

explodedWithBackMonths.alias("exploded")
  .join(df.alias("original"), joinCondition)
  .groupBy($"exploded.code", $"exploded.monthdate")
  .agg(sum($"original.turnover").alias("oneYearTurnover"))
Result:
+----+----------+---------------+
|code|monthdate |oneYearTurnover|
+----+----------+---------------+
|AA1 |2021-01-01|25 |
|AA1 |2022-01-01|128 |
+----+----------+---------------+
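A hedged follow-up to the answer above (my own suggestion, not part of the original answer): the question also wants null when a full year of history is missing (e.g. January 2021). Reusing df, explodedWithBackMonths and joinCondition from the snippet above, one way is to also count how many of the 13 expected months were matched and keep the sum only when the window is complete:
explodedWithBackMonths.alias("exploded")
  .join(df.alias("original"), joinCondition)
  .groupBy($"exploded.code", $"exploded.monthdate")
  .agg(
    sum($"original.turnover").alias("oneYearTurnover"),
    count($"original.turnover").alias("monthsPresent")   // months that actually had data
  )
  // when() without otherwise() yields null, which is what the question asks for
  .withColumn("oneYearTurnover", when($"monthsPresent" === 13, $"oneYearTurnover"))
  .drop("monthsPresent")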

You can use Spark's window functions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._

val raw = Seq(
  ("AA1", "2019-01-01", 25),
  ("AA1", "2021-01-01", 25),
  ("AA1", "2021-08-01", 80),
  ("AA1", "2021-09-01", 90),
  ("AA1", "2022-01-01", 103),
  ("AA2", "2022-01-01", 10)
).toDF("code", "monthdate", "turnover")

// note: months are MM, not mm (mm would be minutes)
val df = raw.withColumn("monthdate", to_timestamp($"monthdate", "yyyy-MM-dd"))

// range frame covering the previous 365 days (in seconds), per code
val pw = Window.partitionBy($"code")
  .orderBy($"monthdate".cast("long"))
  .rangeBetween(-(86400L * 365), 0)

df.withColumn("sum", sum($"turnover").over(pw)).show()
+----+-------------------+--------+---+
|code|          monthdate|turnover|sum|
+----+-------------------+--------+---+
| AA1|2019-01-01 00:00:00|      25| 25|
| AA1|2021-01-01 00:00:00|      25| 25|
| AA1|2021-08-01 00:00:00|      80|105|
| AA1|2021-09-01 00:00:00|      90|195|
| AA1|2022-01-01 00:00:00|     103|298|
| AA2|2022-01-01 00:00:00|      10| 10|
+----+-------------------+--------+---+
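A hedged variant of the window above (my own sketch, not from the original answer), reusing df from that snippet: ordering the window by a calendar-month index instead of epoch seconds makes the frame cover exactly the current month plus the 12 preceding months and avoids the fixed 365-day approximation:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// one integer per calendar month, e.g. 2021-01 -> 2021 * 12 + 1
val withMonthIndex = df.withColumn(
  "monthIndex",
  (year($"monthdate") * 12 + month($"monthdate")).cast("long")
)

val monthWindow = Window
  .partitionBy($"code")
  .orderBy($"monthIndex")
  .rangeBetween(-12, 0)   // current month plus the 12 months before it

withMonthIndex
  .withColumn("turnover_per_year", sum($"turnover").over(monthWindow))
  .show()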

I created two window functions for testing; can you please check this and comment on whether it is fine?
val w = Window.partitionBy($"code")
  .orderBy($"rownum")
  .rowsBetween(-11, Window.currentRow)
val w1 = Window.partitionBy($"code")
  .orderBy($"monthdate")
val newDf = initDf.withColumn("rownum", row_number().over(w1))
  .withColumn("csum", sum("turnover").over(w))
We may need to first group by month and year and take the sum of the turnover for that month, then sort by month within each code, as sketched below.
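A minimal sketch of that pre-aggregation step (assuming the input DataFrame is called initDf with columns code, monthdate and turnover, as in the question):
import org.apache.spark.sql.functions._

val monthlyDf = initDf
  .withColumn("month", trunc($"monthdate", "month"))   // normalise every date to the first of its month
  .groupBy($"code", $"month")
  .agg(sum($"turnover").alias("turnover"))
The row_number/rowsBetween(-11, Window.currentRow) windows above can then be applied to monthlyDf (ordering by month), so each row really represents one month per code.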

Related

How to do yearly comparison in spark scala

I have a DataFrame with columns Month, Fruit and Qty, as you can see in the table below:
| Month | Fruit | Qty |
| -------- | ------ | ------ |
| 2021-01 | orange | 5223 |
| 2021-02 | orange | 23 |
| ...... | ..... | ..... |
| 2022-01 | orange | 2342 |
| 2022-02 | orange | 37667 |
I need to take the sum of Qty grouped by Fruit. My output DataFrame should be the table below:
| Year | Fruit | sum_of_qty_This_year | sum_of_qty_previous_year |
| ---- | -------- | --------------------- | -------------------------- |
| 2022 | orange | 29384 | 34534 |
| 2021 | orange | 34534 | 93584 |
But there is a catch here; consider the table below.
| current year  | jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec |
| ------------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| previous year | jan | feb |     | apr | may | jun | jul | aug |     | oct | nov | dec |
As you can see, the data for mar and sep is missing in the previous year. So when we calculate the sum for the current year, Qty should exclude the missing months, and this should be done for each year.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}
import spark.implicits._

val df1 = Seq(
  ("2021-01", "orange", 5223),
  ("2021-02", "orange", 23),
  ("2022-01", "orange", 2342),
  ("2022-02", "orange", 37667),
  ("2022-03", "orange", 50000)
).toDF("Month", "Fruit", "Qty")

val currentYear = 2022
val priorYear = 2021

val currentYearDF = df1
  .filter(col("Month").substr(1, 4) === currentYear)

val priorYearDF = df1
  .filter(col("Month").substr(1, 4) === priorYear)
  .withColumnRenamed("Month", "MonthP")
  .withColumnRenamed("Fruit", "FruitP")
  .withColumnRenamed("Qty", "QtyP")

// inner join on fruit and calendar month, so months missing from either year drop out of both sums
val resDF = priorYearDF
  .join(
    currentYearDF,
    priorYearDF.col("FruitP") === currentYearDF.col("Fruit") &&
      priorYearDF.col("MonthP").substr(6, 2) === currentYearDF.col("Month").substr(6, 2)
  )
  .select(
    currentYearDF.col("Fruit").as("Fruit"),
    currentYearDF.col("Qty").as("CurrentYearQty"),
    priorYearDF.col("QtyP").as("PriorYearQty")
  )
  .groupBy("Fruit")
  .agg(
    sum("CurrentYearQty").as("sum_of_qty_This_year"),
    sum("PriorYearQty").as("sum_of_qty_previous_year")
  )

resDF.show(false)
// +------+--------------------+------------------------+
// |Fruit |sum_of_qty_This_year|sum_of_qty_previous_year|
// +------+--------------------+------------------------+
// |orange|40009 |5246 |
// +------+--------------------+------------------------+
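A hedged generalisation of the same idea to every year at once (my own sketch under the question's column names, not part of the answer above): self-join the monthly data on the same fruit and calendar month of the previous year, so months missing from either year drop out of both sums:
import org.apache.spark.sql.functions._

// derive a numeric year and a month-of-year string from the "yyyy-MM" Month column
val byMonth = df1
  .withColumn("Year", col("Month").substr(1, 4).cast("int"))
  .withColumn("Mon", col("Month").substr(6, 2))

val resAllYears = byMonth.alias("cur")
  .join(
    byMonth.alias("prev"),
    col("cur.Fruit") === col("prev.Fruit") &&
      col("cur.Mon") === col("prev.Mon") &&
      col("cur.Year") === col("prev.Year") + 1
  )
  .groupBy(col("cur.Year"), col("cur.Fruit"))
  .agg(
    sum(col("cur.Qty")).as("sum_of_qty_This_year"),
    sum(col("prev.Qty")).as("sum_of_qty_previous_year")
  )

resAllYears.show(false)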

PySpark : Merge dataframes where one value(from 1st dataframe) is between two others(from 2nd dataframe)

I need to merge two DataFrames on an identifier, with the condition that a date in the first DataFrame lies between two dates in the second DataFrame, and then group by and compute the sum of the other column.
DataFrame A has a date ("date"), a number ("number") and an ID ("id"):
| id | date | number |
| 101 | 2018-12-01 | 250 |
| 101 | 2018-12-02 | 150 |
| 102 | 2018-11-25 | 1000 |
| 102 | 2018-10-26 | 2000 |
| 102 | 2018-09-25 | 5000 |
| 103 | 2018-10-26 | 200 |
| 103 | 2018-10-27 | 2000 |
DataFrame B has an ID ("id"), a fromdate ("fromdate") and a todate ("todate"):
| id | fromdate | todate |
| 101 | 2018-10-01 | 2018-11-01 |
| 101 | 2018-11-02 | 2018-12-30 |
| 102 | 2018-09-01 | 2018-09-30 |
| 102 | 2018-10-01 | 2018-12-31 |
| 103 | 2018-10-01 | 2018-10-30 |
| 104 | 2018-10-01 | 2018-10-30 |
Now I need to merge these two DataFrames on id and date and then sum all the numbers accordingly.
For example:
Consider the fourth row in DataFrame B: for id 102, between those dates we have two corresponding rows (rows 3 and 4) from DataFrame A. Merge them by calculating the sum.
So the resulting row would be
| id | fromdate | todate | sum |
| 102 | 2018-10-01 | 2018-12-31 | 3000 |
End result should be:
| id | fromdate | todate | sum |
| 101 | 2018-10-01 | 2018-11-01 | 0 |
| 101 | 2018-11-02 | 2018-12-30 | 400 |
| 102 | 2018-09-01 | 2018-09-30 | 5000 |
| 102 | 2018-10-01 | 2018-12-31 | 3000 |
| 103 | 2018-10-01 | 2018-10-30 | 2200 |
| 104 | 2018-10-01 | 2018-10-30 | 0 |
Here is a detailed approach you can follow:
from pyspark.sql.types import *

################
## Define Schema
################
schema1 = StructType([
    StructField('id', IntegerType(), True),
    StructField('date', StringType(), True),
    StructField('number', IntegerType(), True)
])

schema2 = StructType([
    StructField('id', IntegerType(), True),
    StructField('fromdate', StringType(), True),
    StructField('todate', StringType(), True)
])

################
## Prepare Data
################
data1 = [
    (101, '2018-12-01', 250),
    (101, '2018-12-02', 150),
    (102, '2018-11-25', 1000),
    (102, '2018-10-26', 2000),
    (102, '2018-09-25', 5000),
    (103, '2018-10-26', 200),
    (103, '2018-10-27', 2000)
]

data2 = [
    (101, '2018-10-01', '2018-11-01'),
    (101, '2018-11-02', '2018-12-30'),
    (102, '2018-09-01', '2018-09-30'),
    (102, '2018-10-01', '2018-12-31'),
    (103, '2018-10-01', '2018-10-30'),
    (104, '2018-10-01', '2018-10-30')
]

################
## Create dataframes and type cast to date
################
df1 = spark.createDataFrame(data1, schema1)
df2 = spark.createDataFrame(data2, schema2)
df1 = df1.select(df1.id, df1.date.cast("date"), df1.number)
df2 = df2.select(df2.id, df2.fromdate.cast("date"), df2.todate.cast("date"))
Define join condition and join dataframes
################
## Define Joining Condition
################
cond = [df1.id == df2.id, df1.date.between(df2.fromdate, df2.todate)]

################
## Join dataframes using joining condition "cond" and aggregation
################
from pyspark.sql.functions import coalesce

df2.\
    join(df1, cond, 'left').\
    select(df2.id, df1.number, df2.fromdate, df2.todate).\
    groupBy('id', 'fromdate', 'todate').\
    sum('number').fillna(0).\
    show()
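Since the rest of this thread is in Scala, here is a hedged Spark Scala sketch of the same approach (left join on the between condition, then group and sum, filling unmatched ids with 0); data and column names are copied from the question:
import org.apache.spark.sql.functions._
import spark.implicits._

val dfA = Seq(
  (101, "2018-12-01", 250), (101, "2018-12-02", 150),
  (102, "2018-11-25", 1000), (102, "2018-10-26", 2000), (102, "2018-09-25", 5000),
  (103, "2018-10-26", 200), (103, "2018-10-27", 2000)
).toDF("id", "date", "number")
  .withColumn("date", to_date($"date"))

val dfB = Seq(
  (101, "2018-10-01", "2018-11-01"), (101, "2018-11-02", "2018-12-30"),
  (102, "2018-09-01", "2018-09-30"), (102, "2018-10-01", "2018-12-31"),
  (103, "2018-10-01", "2018-10-30"), (104, "2018-10-01", "2018-10-30")
).toDF("id", "fromdate", "todate")
  .withColumn("fromdate", to_date($"fromdate"))
  .withColumn("todate", to_date($"todate"))

// keep every row of dfB; rows of dfA only count when their date falls inside the range
val cond = dfB("id") === dfA("id") && dfA("date").between(dfB("fromdate"), dfB("todate"))

dfB.join(dfA, cond, "left")
  .groupBy(dfB("id"), $"fromdate", $"todate")
  .agg(coalesce(sum($"number"), lit(0)).as("sum"))
  .show()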

In Spark scala, how to check between adjacent rows in a dataframe

How can I check the dates from the adjacent rows (preceding and next) in a DataFrame? This should happen at a key level.
I have the following data after sorting on key and dates:
source_Df.show()
+-----+--------+------------+------------+
| key | code | begin_dt | end_dt |
+-----+--------+------------+------------+
| 10 | ABC | 2018-01-01 | 2018-01-08 |
| 10 | BAC | 2018-01-03 | 2018-01-15 |
| 10 | CAS | 2018-01-03 | 2018-01-21 |
| 20 | AAA | 2017-11-12 | 2018-01-03 |
| 20 | DAS | 2018-01-01 | 2018-01-12 |
| 20 | EDS | 2018-02-01 | 2018-02-16 |
+-----+--------+------------+------------+
When the dates of such rows overlap (i.e. the current row's begin_dt falls between the begin and end dates of the previous row), I need to set the lowest begin date and the highest end date on all of those rows.
Here is the output I need:
final_Df.show()
+-----+--------+------------+------------+
| key | code | begin_dt | end_dt |
+-----+--------+------------+------------+
| 10 | ABC | 2018-01-01 | 2018-01-21 |
| 10 | BAC | 2018-01-01 | 2018-01-21 |
| 10 | CAS | 2018-01-01 | 2018-01-21 |
| 20 | AAA | 2017-11-12 | 2018-01-12 |
| 20 | DAS | 2017-11-12 | 2018-01-12 |
| 20 | EDS | 2018-02-01 | 2018-02-16 |
+-----+--------+------------+------------+
Appreciate any ideas to achieve this. Thanks in advance!
Here's one approach:
Create new column group_id with null value if begin_dt is within date range from the previous row; otherwise a unique integer
Backfill nulls in group_id with the last non-null value
Compute min(begin_dt) and max(end_dt) within each (key, group_id) partition
Example below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = Seq(
  (10, "ABC", "2018-01-01", "2018-01-08"),
  (10, "BAC", "2018-01-03", "2018-01-15"),
  (10, "CAS", "2018-01-03", "2018-01-21"),
  (20, "AAA", "2017-11-12", "2018-01-03"),
  (20, "DAS", "2018-01-01", "2018-01-12"),
  (20, "EDS", "2018-02-01", "2018-02-16")
).toDF("key", "code", "begin_dt", "end_dt")

val win1 = Window.partitionBy($"key").orderBy($"begin_dt", $"end_dt")
val win2 = Window.partitionBy($"key", $"group_id")

df.
  withColumn("group_id", when(
    $"begin_dt".between(lag($"begin_dt", 1).over(win1), lag($"end_dt", 1).over(win1)), null
  ).otherwise(monotonically_increasing_id)).
  withColumn("group_id", last($"group_id", ignoreNulls = true).
    over(win1.rowsBetween(Window.unboundedPreceding, 0))).
  withColumn("begin_dt2", min($"begin_dt").over(win2)).
  withColumn("end_dt2", max($"end_dt").over(win2)).
  orderBy("key", "begin_dt", "end_dt").
  show
// +---+----+----------+----------+-------------+----------+----------+
// |key|code| begin_dt| end_dt| group_id| begin_dt2| end_dt2|
// +---+----+----------+----------+-------------+----------+----------+
// | 10| ABC|2018-01-01|2018-01-08|1047972020224|2018-01-01|2018-01-21|
// | 10| BAC|2018-01-03|2018-01-15|1047972020224|2018-01-01|2018-01-21|
// | 10| CAS|2018-01-03|2018-01-21|1047972020224|2018-01-01|2018-01-21|
// | 20| AAA|2017-11-12|2018-01-03| 455266533376|2017-11-12|2018-01-12|
// | 20| DAS|2018-01-01|2018-01-12| 455266533376|2017-11-12|2018-01-12|
// | 20| EDS|2018-02-01|2018-02-16| 455266533377|2018-02-01|2018-02-16|
// +---+----+----------+----------+-------------+----------+----------+

DB2 Query multiple select and sum by date

I have 3 tables: ITEMS, ODETAILS, OHIST.
ITEMS - a list of products, ID is the key field
ODETAILS - line items of every order, no key field
OHIST - a view showing last year's order totals by month
ITEMS
+----+----------+
| ID | NAME     |
+----+----------+
| 10 | Widget10 |
| 11 | Widget11 |
| 12 | Widget12 |
| 13 | Widget13 |
+----+----------+

ODETAILS
+-----+---------+---------+----------+
| OID | ODUE    | ITEM_ID | ITEM_QTY |
+-----+---------+---------+----------+
| A33 | 1180503 |      10 |      100 |
| A33 | 1180504 |      11 |      215 |
| A34 | 1180505 |      10 |      500 |
| A34 | 1180504 |      11 |      320 |
| A34 | 1180504 |      12 |      450 |
| A34 | 1180505 |      13 |      125 |
+-----+---------+---------+----------+

OHIST
+---------+-------+
| ITEM_ID | M5QTY |
+---------+-------+
|      10 |  1000 |
|      11 |  1500 |
|      12 |  2251 |
|      13 |  4334 |
+---------+-------+
Assuming today is May 2, 2018 (1180502).
I want my results to show ID, NAME, M5QTY, and SUM(ITEM_QTY) grouped by day over the next 3 days (D1, D2, D3).
Desired Result
+----+----------+-------+-----+-----+-----+
| ID | NAME     | M5QTY | D1  | D2  | D3  |
+----+----------+-------+-----+-----+-----+
| 10 | Widget10 |  1000 | 100 |     | 500 |
| 11 | Widget11 |  1500 |     | 535 |     |
| 12 | Widget12 |  2251 |     | 450 |     |
| 13 | Widget13 |  4334 |     |     | 125 |
+----+----------+-------+-----+-----+-----+
This is how I convert ODUE to a date
DATE(concat(concat(concat(substr(char((ODETAILS.ODUE-1000000)+20000000),1,4),'-'), concat(substr(char((ODETAILS.ODUE-1000000)+20000000),5,2), '-')), substr(char((ODETAILS.ODUE-1000000)+20000000),7,2)))
Try this (you can add the joins you need)
SELECT ITEM_ID
, SUM(CASE WHEN ODUE = INT(CURRENT DATE) - 19000000 + 1 THEN ITEM_QTY ELSE 0 END) AS D1
, SUM(CASE WHEN ODUE = INT(CURRENT DATE) - 19000000 + 2 THEN ITEM_QTY ELSE 0 END) AS D2
, SUM(CASE WHEN ODUE = INT(CURRENT DATE) - 19000000 + 3 THEN ITEM_QTY ELSE 0 END) AS D3
FROM
ODETAILS
GROUP BY
ITEM_ID

How to do OUTER JOIN in scala

I have two data frames: df1 and df2
df1
|--- id---|---value---|
| 1 | 23 |
| 2 | 23 |
| 3 | 23 |
| 2 | 25 |
| 5 | 25 |
df2
|-idValue-|---count---|
| 1 | 33 |
| 2 | 23 |
| 3 | 34 |
| 13 | 34 |
| 23 | 34 |
How do I get this ?
|--- id--------|---value---|---count---|
| 1 | 23 | 33 |
| 2 | 23 | 23 |
| 3 | 23 | 34 |
| 2 | 25 | 23 |
| 5 | 25 | null |
I am doing:
val groupedData = df1.join(df2, $"id" === $"idValue", "outer")
But I don't see the last column in groupedData. Is this the correct way of doing it, or am I doing something wrong?
From your expected output, you need a LEFT OUTER JOIN. Also note that df1 has no count column, so select df1("value") rather than df1("count"):
val groupedData = df1.join(df2, $"id" === $"idValue", "left_outer")
  .select(df1("id"), df1("value"), df2("count"))
groupedData.take(10).foreach(println)
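For completeness, a small self-contained sketch using the sample data from the question (assuming an active SparkSession named spark):
import spark.implicits._

val df1 = Seq((1, 23), (2, 23), (3, 23), (2, 25), (5, 25)).toDF("id", "value")
val df2 = Seq((1, 33), (2, 23), (3, 34), (13, 34), (23, 34)).toDF("idValue", "count")

df1.join(df2, $"id" === $"idValue", "left_outer")
  .select($"id", $"value", $"count")
  .show()
// id 5 has no match in df2, so its count comes out as null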