I created a DataFrame in Spark by grouping on column1 and date and summing the amount.
val table = df1.groupBy($"column1",$"date").sum("amount")
Column1 |Date |Amount
A |1-jul |1000
A |1-june |2000
A |1-May |2000
A |1-dec |3000
A |1-Nov |2000
B |1-jul |100
B |1-june |300
B |1-May |400
B |1-dec |300
Now I want to add a new column with the difference between the amounts on any two dates from the table.
You can use a Window function if the calculation is fixed, such as the difference from the previous month or from two months earlier. For that, use the lag and lead functions over a Window.
But first you need to convert the date column as below so that it can be ordered.
+-------+------+--------------+------+
|Column1|Date |Date_Converted|Amount|
+-------+------+--------------+------+
|A |1-jul |2017-07-01 |1000 |
|A |1-june|2017-06-01 |2000 |
|A |1-May |2017-05-01 |2000 |
|A |1-dec |2017-12-01 |3000 |
|A |1-Nov |2017-11-01 |2000 |
|B |1-jul |2017-07-01 |100 |
|B |1-june|2017-06-01 |300 |
|B |1-May |2017-05-01 |400 |
|B |1-dec |2017-12-01 |300 |
+-------+------+--------------+------+
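The answer does not show how Date_Converted was produced. As a rough plain-Python sketch of the idea (not Spark code), the labels can be mapped to real dates by their first three letters; the helper name convert_label and the year 2017 are assumptions taken from the table above.

```python
from datetime import date

# Month lookup keyed by the first three letters, lowercased.
MONTHS = {name: i for i, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

def convert_label(label, year=2017):
    """Turn a label such as '1-june' into a date; the year is an assumption."""
    day, month = label.split("-")
    return date(year, MONTHS[month.strip().lower()[:3]], int(day))

labels = ["1-jul", "1-june", "1-May", "1-dec", "1-Nov"]
# Sorting by the converted date gives calendar order, unlike sorting the raw strings.
ordered = sorted(labels, key=convert_label)
print(ordered)
```

Once the labels sort in calendar order, a window ordered by the converted column sees the months in the right sequence.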
You can find the difference between previous month and current month by doing
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Column1").orderBy("Date_Converted")
import org.apache.spark.sql.functions._
df.withColumn("diff_Amt_With_Prev_Month", $"Amount" - when((lag("Amount", 1).over(windowSpec)).isNull, 0).otherwise(lag("Amount", 1).over(windowSpec)))
.show(false)
You should have
+-------+------+--------------+------+------------------------+
|Column1|Date |Date_Converted|Amount|diff_Amt_With_Prev_Month|
+-------+------+--------------+------+------------------------+
|B |1-May |2017-05-01 |400 |400.0 |
|B |1-june|2017-06-01 |300 |-100.0 |
|B |1-jul |2017-07-01 |100 |-200.0 |
|B |1-dec |2017-12-01 |300 |200.0 |
|A |1-May |2017-05-01 |2000 |2000.0 |
|A |1-june|2017-06-01 |2000 |0.0 |
|A |1-jul |2017-07-01 |1000 |-1000.0 |
|A |1-Nov |2017-11-01 |2000 |1000.0 |
|A |1-dec |2017-12-01 |3000 |1000.0 |
+-------+------+--------------+------+------------------------+
You can increase the lagging position for previous two months as
df.withColumn("diff_Amt_With_Prev_two_Month", $"Amount" - when((lag("Amount", 2).over(windowSpec)).isNull, 0).otherwise(lag("Amount", 2).over(windowSpec)))
.show(false)
which will give you
+-------+------+--------------+------+----------------------------+
|Column1|Date |Date_Converted|Amount|diff_Amt_With_Prev_two_Month|
+-------+------+--------------+------+----------------------------+
|B |1-May |2017-05-01 |400 |400.0 |
|B |1-june|2017-06-01 |300 |300.0 |
|B |1-jul |2017-07-01 |100 |-300.0 |
|B |1-dec |2017-12-01 |300 |0.0 |
|A |1-May |2017-05-01 |2000 |2000.0 |
|A |1-june|2017-06-01 |2000 |2000.0 |
|A |1-jul |2017-07-01 |1000 |-1000.0 |
|A |1-Nov |2017-11-01 |2000 |0.0 |
|A |1-dec |2017-12-01 |3000 |2000.0 |
+-------+------+--------------+------+----------------------------+
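To make the lag arithmetic easier to follow, here is a small plain-Python sketch (not Spark code) of what the two expressions above compute within one partition. The function name diff_with_lag is made up for illustration, and the input list is assumed to be already ordered by date.

```python
def diff_with_lag(amounts, offset):
    """amounts must already be ordered by date within one partition."""
    result = []
    for i, amount in enumerate(amounts):
        # lag(...) is null for the first `offset` rows; the answer replaces null with 0
        previous = amounts[i - offset] if i >= offset else 0
        result.append(amount - previous)
    return result

group_b = [400, 300, 100, 300]  # May, June, July, December for Column1 = B
print(diff_with_lag(group_b, 1))  # [400, -100, -200, 200]
print(diff_with_lag(group_b, 2))  # [400, 300, -300, 0]
```

The values match the B rows in both tables above, apart from Spark showing them as doubles.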
I hope the answer is helpful
Assuming those two dates belong to each group of your table
My imports:
import org.apache.spark.sql.functions.{concat_ws,collect_list,lit,udf}
Prepare the DataFrame
scala> val seqRow = Seq(
| ("A","1- jul",1000),
| ("A","1-june",2000),
| ("A","1-May",2000),
| ("A","1-dec",3000),
| ("B","1-jul",100),
| ("B","1-june",300),
| ("B","1-May",400),
| ("B","1-dec",300))
seqRow: Seq[(String, String, Int)] = List((A,1- jul,1000), (A,1-june,2000), (A,1-May,2000), (A,1-dec,3000), (B,1-jul,100), (B,1-june,300), (B,1-May,400), (B,1-dec,300))
scala> val input_df = sc.parallelize(seqRow).toDF("column1","date","amount")
input_df: org.apache.spark.sql.DataFrame = [column1: string, date: string ... 1 more field]
Now write a UDF for your case,
scala> def calc_diff = udf((list : Seq[String],startMonth : String,endMonth : String) => {
| //get the month and their values
| val monthMap = list.map{str =>
| val splitText = str.split("\\$")
| val month = splitText(0).split("-")(1).trim
|
| (month.toLowerCase,splitText(1).toInt)
| }.toMap
|
| val stMnth = monthMap(startMonth)
| val endMnth = monthMap(endMonth)
| endMnth - stMnth
|
| })
calc_diff: org.apache.spark.sql.expressions.UserDefinedFunction
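Now prepare the output. Here group_df is assumed to be input_df grouped by column1, with sum_amount holding the summed amount and collect_val holding the collected list of "date$amount" strings that the UDF splits; its definition is not shown.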
scala> val (month1 : String,month2 : String) = ("jul","dec")
month1: String = jul
month2: String = dec
scala> val req_df = group_df.withColumn("diff",calc_diff('collect_val,lit(month1.toLowerCase),lit(month2.toLowerCase)))
req_df: org.apache.spark.sql.DataFrame = [column1: string, sum_amount: bigint ... 2 more fields]
scala> val req_df = group_df.withColumn("diff",calc_diff('collect_val,lit(month1.toLowerCase),lit(month2.toLowerCase))).drop('collect_val)
req_df: org.apache.spark.sql.DataFrame = [column1: string, sum_amount: bigint ... 1 more field]
scala> req_df.orderBy('column1).show
+-------+----------+----+
|column1|sum_amount|diff|
+-------+----------+----+
| A| 8000|2000|
| B| 1100| 200|
+-------+----------+----+
Hope, this is what you want.
Alternatively, filter to the two dates, pivot them into columns, and subtract:
(table.filter($"Date".isin("1-jul", "1-dec"))
.groupBy("Column1")
.pivot("Date")
.agg(first($"Amount"))
.withColumn("diff", $"1-dec" - $"1-jul")
).show
+-------+-----+-----+----+
|Column1|1-dec|1-jul|diff|
+-------+-----+-----+----+
| B| 300| 100| 200|
| A| 3000| 1000|2000|
+-------+-----+-----+----+
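Both the UDF and the pivot answers boil down to looking up two dates per group and subtracting. A plain-Python sketch of that lookup (not Spark code) is below; diff_between is a hypothetical helper, and skipping groups that lack one of the dates is my own choice, not something the answers specify.

```python
rows = [
    ("A", "1-jul", 1000), ("A", "1-june", 2000), ("A", "1-May", 2000),
    ("A", "1-dec", 3000), ("A", "1-Nov", 2000),
    ("B", "1-jul", 100), ("B", "1-june", 300), ("B", "1-May", 400),
    ("B", "1-dec", 300),
]

def diff_between(rows, start, end):
    """Return {group: amount(end) - amount(start)} for groups that have both dates."""
    by_group = {}
    for group, day, amount in rows:
        by_group.setdefault(group, {})[day] = amount
    # Groups missing either date are skipped rather than raising an error.
    return {g: d[end] - d[start]
            for g, d in by_group.items() if start in d and end in d}

print(diff_between(rows, "1-jul", "1-dec"))  # {'A': 2000, 'B': 200}
```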
I am trying to create a new column by comparing two others: if they are equal I have to put "Yes", else "No", as you can see here:
+----+-------+-----------+----------+
|Game| statB | statPrev | Change |
+----+-------+-----------+----------+
| CA| 2 | 2 | No |
| BL| 5 | 2 | Yes |
| CD| null | null | No |
| NT| 4 | 5 | Yes |
| FT| 6 | null | Yes |
+----+-------+-----------+----------+
What I am trying is:
var df1 = df.withColumn("Change",
when($"statB" =!= $"statPrev"
|| $"statPrev".isNull && $"statB".isNotNull
|| $"statPrev".isNotNull && $"statB".isNotNull, "Yes").otherwise("No"))
But, for example, when statB and statPrev are both null, I get a "Yes". What am I doing wrong?
To compare equality with nulls, you can use eqNullSafe for a simpler syntax:
val df2 = df.withColumn(
"Change",
when($"statB".eqNullSafe($"statPrev"), "Yes").otherwise("No")
)
df2.show
+----+-----+--------+------+
|Game|statB|statPrev|Change|
+----+-----+--------+------+
| CA| 2| 2| Yes|
| BL| 5| 2| No|
| CD| null| null| Yes|
| NT| 4| 5| No|
| FT| 6| null| No|
+----+-----+--------+------+
According to your question ("if they are equal I have to put "Yes", else "No""), it should be:
var df1 = df.withColumn("Change1",
when($"statB" === $"statPrev" || ($"statB".isNull && $"statPrev".isNull),
"Yes").otherwise("No"))
df1.show(false)
Or you could use null safe equal operator as
df.withColumn("Change1",
when(($"statB" <=> $"statPrev" ), "Yes").otherwise("No"))
.show(false)
Result:
+----+-----+--------+------+
|Game|statB|statPrev|Change|
+----+-----+--------+------+
|CA |2 |2 |Yes |
|BL |5 |2 |No |
|CD |null |null |Yes |
|NT |4 |5 |No |
|FT |6 |null |No |
+----+-----+--------+------+
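If the difference between === and <=> is unclear, this plain-Python sketch (not Spark code) mimics the two comparisons, with None standing in for SQL null. The function names eq, eq_null_safe, and label are invented for the illustration.

```python
def eq(a, b):
    """Mimics ===: the result is unknown (None) if either side is null."""
    if a is None or b is None:
        return None
    return a == b

def eq_null_safe(a, b):
    """Mimics <=> / eqNullSafe: two nulls compare equal, a single null does not."""
    if a is None or b is None:
        return a is None and b is None
    return a == b

def label(condition):
    # when(cond, "Yes").otherwise("No") takes the otherwise branch unless cond is true
    return "Yes" if condition is True else "No"

rows = [("CA", 2, 2), ("BL", 5, 2), ("CD", None, None), ("NT", 4, 5), ("FT", 6, None)]
plain = [label(eq(b, p)) for _, b, p in rows]
null_safe = [label(eq_null_safe(b, p)) for _, b, p in rows]
print(plain)      # ['Yes', 'No', 'No', 'No', 'No']
print(null_safe)  # ['Yes', 'No', 'Yes', 'No', 'No']
```

Only the CD row differs: the plain comparison yields null there, so it falls through to "No", while the null-safe one yields true.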
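If stateB and statePrev are equal (the output below is consistent only with missing values stored as the literal string "null"; with real nulls, === evaluates to null and the row gets "NO"):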
df.withColumn("Change", when($"stateB" === $"statePrev", lit("YES")).otherwise("NO")).show(false);
output
+----+------+---------+------+
|Game|stateB|statePrev|Change|
+----+------+---------+------+
|CA |2 |2 |YES |
|BL |5 |2 |NO |
|CD |null |null |YES |
|NT |4 |5 |NO |
|FT |6 |null |NO |
+----+------+---------+------+
If you want "NO" when stateB and statePrev are both "null":
df.withColumn("Change",
when(($"stateB" === $"statePrev") && ($"stateB".notEqual( "null")
&& $"statePrev".notEqual( "null")),
lit("YES")).otherwise("NO")).show(false)
output
+----+------+---------+------+
|Game|stateB|statePrev|Change|
+----+------+---------+------+
|CA |2 |2 |YES |
|BL |5 |2 |NO |
|CD |null |null |NO |
|NT |4 |5 |NO |
|FT |6 |null |NO |
+----+------+---------+------+
I have a PySpark DataFrame :
From id To id Price Date
a b 20 30/05/2019
b c 5 30/05/2019
c a 20 30/05/2019
a d 10 02/06/2019
d c 5 02/06/2019
id Name
a Claudia
b Manuella
c remy
d Paul
The output that I want is:
Date Name current balance
30/05/2019 Claudia 0
30/05/2019 Manuella 15
30/05/2019 Remy -15
30/05/2019 Paul 0
02/06/2019 Claudia -10
02/06/2019 Manuella 15
02/06/2019 Remy -10
02/06/2019 Paul 5
I want to get the current balance in each day for all users.
My idea is to group by user and compute the sum of Price where the user is in the To column minus the sum where the user is in the From column. But how do I do this per day, given that the balance is cumulative rather than per day?
Thank You
It took a bit of effort to get the requirements right. Here's my version of the solution.
from pyspark.sql import Row
from pyspark.sql.types import *
from pyspark import SparkContext, SQLContext
import pyspark.sql.functions as F
from pyspark.sql import Window
sc = SparkContext('local')
sqlContext = SQLContext(sc)
data1 = [
("a","b",20,"30/05/2019"),
("b","c",5 ,"30/05/2019"),
("c","a",20,"30/05/2019"),
("a","d",10,"02/06/2019"),
("d","c",5 ,"02/06/2019"),
]
df1Columns = ["From_Id", "To_Id", "Price", "Date"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)
df1 = df1.withColumn("Date",F.to_date(F.to_timestamp("Date", 'dd/MM/yyyy')).alias('Date'))
print("Actual initial data")
df1.show(truncate=False)
data2 = [
("a","Claudia"),
("b","Manuella"),
("c","Remy"),
("d","Paul"),
]
df2Columns = ["id","Name"]
df2 = sqlContext.createDataFrame(data=data2, schema = df2Columns)
print("Actual initial data")
df2.show(truncate=False)
alldays_df = df1.select("Date").distinct().repartition(20)
allusers_df = df2.select("id").distinct().repartition(10)
crossjoin_df = alldays_df.crossJoin(allusers_df)
crossjoin_df = crossjoin_df.withColumn("initial", F.lit(0))
crossjoin_df = crossjoin_df.withColumnRenamed("id", "common_id").cache()
crossjoin_df.show(n=40, truncate=False)
from_sum_df = df1.groupby("Date", "From_Id").agg(F.sum("Price").alias("from_sum"))
from_sum_df = from_sum_df.withColumnRenamed("From_Id", "common_id")
from_sum_df.show(truncate=False)
from_sum_df = crossjoin_df.alias('cross').join(
from_sum_df.alias('from'), ['Date', 'common_id'], how='outer'
).select('Date', 'common_id',
F.coalesce('from.from_sum', 'cross.initial').alias('from_amount') ).cache()
from_sum_df.show(truncate=False)
to_sum_df = df1.groupby("Date", "To_Id").agg(F.sum("Price").alias("to_sum"))
to_sum_df = to_sum_df.withColumnRenamed("To_Id", "common_id")
to_sum_df.show(truncate=False)
to_sum_df = crossjoin_df.alias('cross').join(
to_sum_df.alias('to'), ['Date', 'common_id'], how='outer'
).select('Date', 'common_id',
F.coalesce('to.to_sum', 'cross.initial').alias('to_amount') ).cache()
to_sum_df.show(truncate=False)
joined_df = to_sum_df.join(from_sum_df, ["Date", "common_id"], how='inner')
joined_df.show(truncate=False)
balance_df = joined_df.withColumn("balance", F.col("to_amount") - F.col("from_amount"))
balance_df.show(truncate=False)
final_df = balance_df.join(df2, F.col("id") == F.col("common_id"))
final_df.show(truncate=False)
final_cum_sum = final_df.withColumn('cumsum_balance', F.sum('balance').over(Window.partitionBy('common_id').orderBy('Date').rowsBetween(Window.unboundedPreceding, Window.currentRow)))
final_cum_sum.show()
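The outputs of each step follow, in order: the two inputs, the cross join of all dates and users, the per-day sums sent and received (before and after filling gaps with 0), the joined amounts, the daily balance, the join with names, and the cumulative balance.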
Actual initial data
+-------+-----+-----+----------+
|From_Id|To_Id|Price|Date |
+-------+-----+-----+----------+
|a |b |20 |2019-05-30|
|b |c |5 |2019-05-30|
|c |a |20 |2019-05-30|
|a |d |10 |2019-06-02|
|d |c |5 |2019-06-02|
+-------+-----+-----+----------+
Actual initial data
+---+--------+
|id |Name |
+---+--------+
|a |Claudia |
|b |Manuella|
|c |Remy |
|d |Paul |
+---+--------+
+----------+---------+-------+
|Date |common_id|initial|
+----------+---------+-------+
|2019-05-30|a |0 |
|2019-05-30|d |0 |
|2019-05-30|b |0 |
|2019-05-30|c |0 |
|2019-06-02|a |0 |
|2019-06-02|d |0 |
|2019-06-02|b |0 |
|2019-06-02|c |0 |
+----------+---------+-------+
+----------+---------+--------+
|Date |common_id|from_sum|
+----------+---------+--------+
|2019-06-02|a |10 |
|2019-05-30|a |20 |
|2019-06-02|d |5 |
|2019-05-30|c |20 |
|2019-05-30|b |5 |
+----------+---------+--------+
+----------+---------+-----------+
|Date |common_id|from_amount|
+----------+---------+-----------+
|2019-06-02|a |10 |
|2019-06-02|c |0 |
|2019-05-30|a |20 |
|2019-05-30|d |0 |
|2019-06-02|b |0 |
|2019-06-02|d |5 |
|2019-05-30|c |20 |
|2019-05-30|b |5 |
+----------+---------+-----------+
+----------+---------+------+
|Date |common_id|to_sum|
+----------+---------+------+
|2019-06-02|c |5 |
|2019-05-30|a |20 |
|2019-06-02|d |10 |
|2019-05-30|c |5 |
|2019-05-30|b |20 |
+----------+---------+------+
+----------+---------+---------+
|Date |common_id|to_amount|
+----------+---------+---------+
|2019-06-02|a |0 |
|2019-06-02|c |5 |
|2019-05-30|a |20 |
|2019-05-30|d |0 |
|2019-06-02|b |0 |
|2019-06-02|d |10 |
|2019-05-30|c |5 |
|2019-05-30|b |20 |
+----------+---------+---------+
+----------+---------+---------+-----------+
|Date |common_id|to_amount|from_amount|
+----------+---------+---------+-----------+
|2019-06-02|a |0 |10 |
|2019-06-02|c |5 |0 |
|2019-05-30|a |20 |20 |
|2019-05-30|d |0 |0 |
|2019-06-02|b |0 |0 |
|2019-06-02|d |10 |5 |
|2019-05-30|c |5 |20 |
|2019-05-30|b |20 |5 |
+----------+---------+---------+-----------+
+----------+---------+---------+-----------+-------+
|Date |common_id|to_amount|from_amount|balance|
+----------+---------+---------+-----------+-------+
|2019-06-02|a |0 |10 |-10 |
|2019-06-02|c |5 |0 |5 |
|2019-05-30|a |20 |20 |0 |
|2019-05-30|d |0 |0 |0 |
|2019-06-02|b |0 |0 |0 |
|2019-06-02|d |10 |5 |5 |
|2019-05-30|c |5 |20 |-15 |
|2019-05-30|b |20 |5 |15 |
+----------+---------+---------+-----------+-------+
+----------+---------+---------+-----------+-------+---+--------+
|Date |common_id|to_amount|from_amount|balance|id |Name |
+----------+---------+---------+-----------+-------+---+--------+
|2019-05-30|a |20 |20 |0 |a |Claudia |
|2019-06-02|a |0 |10 |-10 |a |Claudia |
|2019-05-30|b |20 |5 |15 |b |Manuella|
|2019-06-02|b |0 |0 |0 |b |Manuella|
|2019-05-30|c |5 |20 |-15 |c |Remy |
|2019-06-02|c |5 |0 |5 |c |Remy |
|2019-06-02|d |10 |5 |5 |d |Paul |
|2019-05-30|d |0 |0 |0 |d |Paul |
+----------+---------+---------+-----------+-------+---+--------+
+----------+---------+---------+-----------+-------+---+--------+--------------+
| Date|common_id|to_amount|from_amount|balance| id| Name|cumsum_balance|
+----------+---------+---------+-----------+-------+---+--------+--------------+
|2019-05-30| d| 0| 0| 0| d| Paul| 0|
|2019-06-02| d| 10| 5| 5| d| Paul| 5|
|2019-05-30| c| 5| 20| -15| c| Remy| -15|
|2019-06-02| c| 5| 0| 5| c| Remy| -10|
|2019-05-30| b| 20| 5| 15| b|Manuella| 15|
|2019-06-02| b| 0| 0| 0| b|Manuella| 15|
|2019-05-30| a| 20| 20| 0| a| Claudia| 0|
|2019-06-02| a| 0| 10| -10| a| Claudia| -10|
+----------+---------+---------+-----------+-------+---+--------+--------------+
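For anyone who wants to check the numbers without Spark, here is a plain-Python sketch of the same logic: net amount per user per day, then a running total over every date for every user. The function name running_balances and the data layout are my own; this is an illustration of the calculation, not a replacement for the DataFrame code.

```python
from collections import defaultdict
from datetime import datetime

transfers = [
    ("a", "b", 20, "30/05/2019"),
    ("b", "c", 5, "30/05/2019"),
    ("c", "a", 20, "30/05/2019"),
    ("a", "d", 10, "02/06/2019"),
    ("d", "c", 5, "02/06/2019"),
]
names = {"a": "Claudia", "b": "Manuella", "c": "Remy", "d": "Paul"}

def running_balances(transfers, names):
    """Return (date, name, balance) for every user on every date, cumulatively."""
    net = defaultdict(int)  # (day, user id) -> received minus sent on that day
    for sender, receiver, price, day_text in transfers:
        day = datetime.strptime(day_text, "%d/%m/%Y").date()
        net[(day, sender)] -= price
        net[(day, receiver)] += price
    days = sorted({day for day, _ in net})
    balance = dict.fromkeys(names, 0)
    result = []
    for day in days:
        for user in names:
            # Users with no transfer that day keep their previous balance.
            balance[user] += net.get((day, user), 0)
            result.append((day.strftime("%d/%m/%Y"), names[user], balance[user]))
    return result

for row in running_balances(transfers, names):
    print(row)
```

The loop over every user on every date plays the role of the cross join: it guarantees a row even for users who made no transfer that day.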
I want to merge 2 columns or 2 dataframes like
df1
+--+
|id|
+--+
|1 |
|2 |
|3 |
+--+
df2 --> this one can be a list as well
+--+
|m |
+--+
|A |
|B |
|C |
+--+
I want the following resulting table:
+--+--+
|id|m |
+--+--+
|1 |A |
|1 |B |
|1 |C |
|2 |A |
|2 |B |
|2 |C |
|3 |A |
|3 |B |
|3 |C |
+--+--+
def crossJoin(right: org.apache.spark.sql.Dataset[_]): org.apache.spark.sql.DataFrame
You can get this result with the crossJoin function. See the code below.
scala> dfa.show
+---+
| id|
+---+
| 1|
| 2|
| 3|
+---+
scala> dfb.show
+---+
| m|
+---+
| A|
| B|
| C|
+---+
scala> dfa.crossJoin(dfb).orderBy($"id".asc).show(false)
+---+---+
|id |m |
+---+---+
|1 |B |
|1 |A |
|1 |C |
|2 |A |
|2 |B |
|2 |C |
|3 |C |
|3 |B |
|3 |A |
+---+---+
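Since the question mentions that the second input may be a plain list, it may help to see that a cross join is just the Cartesian product. This is a plain-Python sketch using itertools.product (not Spark code); the variable names are made up.

```python
from itertools import product

ids = [1, 2, 3]
letters = ["A", "B", "C"]  # could equally come from a second table

# Every id is paired with every letter, giving len(ids) * len(letters) rows.
pairs = list(product(ids, letters))
for row in pairs:
    print(row)
```

Unlike the Spark output above, the order here is deterministic: the rows follow the order of the two input lists.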
I have a dataframe like this:
+-----+---------+---------+
|Categ| Amt| price|
+-----+---------+---------+
| A| 100| 1|
| A| 180| 2|
| A| 250| 3|
| B| 90| 2|
| B| 170| 3|
| B| 280| 3|
+-----+---------+---------+
I want to group by "Categ" and calculate the mean price in overlapping ranges.
Let's say those ranges are [0-200] and [150-300].
So the output that I'd like to get is like this:
+-----+---------+-----------+
|Categ|rang(Amt)|mean(price)|
+-----+---------+-----------+
| A| [0-200]| 1.5|
| A|[150-300]| 2.5|
| B| [0-200]| 2.5|
| B|[150-300]| 3|
+-----+---------+-----------+
Check out this.
scala> val df = Seq(("A",100,1),("A",180,2),("A",250,3),("B",90,2),("B",170,3),("B",280,3)).toDF("categ","amt","price")
df: org.apache.spark.sql.DataFrame = [categ: string, amt: int ... 1 more field]
scala> df.show(false)
+-----+---+-----+
|categ|amt|price|
+-----+---+-----+
|A |100|1 |
|A |180|2 |
|A |250|3 |
|B |90 |2 |
|B |170|3 |
|B |280|3 |
+-----+---+-----+
scala> val df2 = df.withColumn("newc",array(when('amt>=0 and 'amt <=200, map(lit("[0-200]"),'price)),when('amt>150 and 'amt<=300, map(lit("[150-300]"),'price))))
df2: org.apache.spark.sql.DataFrame = [categ: string, amt: int ... 2 more fields]
scala> val df3 = df2.select(col("*"), explode('newc).as("rangekv")).select(col("*"),explode('rangekv).as(Seq("range","price2")))
df3: org.apache.spark.sql.DataFrame = [categ: string, amt: int ... 5 more fields]
scala> df3.show(false)
+-----+---+-----+----------------------------------+----------------+---------+------+
|categ|amt|price|newc |rangekv |range |price2|
+-----+---+-----+----------------------------------+----------------+---------+------+
|A |100|1 |[[[0-200] -> 1],] |[[0-200] -> 1] |[0-200] |1 |
|A |180|2 |[[[0-200] -> 2], [[150-300] -> 2]]|[[0-200] -> 2] |[0-200] |2 |
|A |180|2 |[[[0-200] -> 2], [[150-300] -> 2]]|[[150-300] -> 2]|[150-300]|2 |
|A |250|3 |[, [[150-300] -> 3]] |[[150-300] -> 3]|[150-300]|3 |
|B |90 |2 |[[[0-200] -> 2],] |[[0-200] -> 2] |[0-200] |2 |
|B |170|3 |[[[0-200] -> 3], [[150-300] -> 3]]|[[0-200] -> 3] |[0-200] |3 |
|B |170|3 |[[[0-200] -> 3], [[150-300] -> 3]]|[[150-300] -> 3]|[150-300]|3 |
|B |280|3 |[, [[150-300] -> 3]] |[[150-300] -> 3]|[150-300]|3 |
+-----+---+-----+----------------------------------+----------------+---------+------+
scala> df3.groupBy('categ,'range).agg(avg('price)).orderBy('categ).show(false)
+-----+---------+----------+
|categ|range |avg(price)|
+-----+---------+----------+
|A |[0-200] |1.5 |
|A |[150-300]|2.5 |
|B |[0-200] |2.5 |
|B |[150-300]|3.0 |
+-----+---------+----------+
scala>
You can also create an Array of range strings and explode them. But in this case, you will get NULL after exploding, so you need to filter them.
scala> val df2 = df.withColumn("newc",array(when('amt>=0 and 'amt <=200, lit("[0-200]")),when('amt>150 and 'amt<=300,lit("[150-300]") )))
df2: org.apache.spark.sql.DataFrame = [categ: string, amt: int ... 2 more fields]
scala> val df3 = df2.select(col("*"), explode('newc).as("range"))
df3: org.apache.spark.sql.DataFrame = [categ: string, amt: int ... 3 more fields]
scala> df3.groupBy('categ,'range).agg(avg('price)).orderBy('categ).show(false)
+-----+---------+----------+
|categ|range |avg(price)|
+-----+---------+----------+
|A |[150-300]|2.5 |
|A |[0-200] |1.5 |
|A |null |2.0 |
|B |[0-200] |2.5 |
|B |null |2.5 |
|B |[150-300]|3.0 |
+-----+---------+----------+
scala> df3.groupBy('categ,'range).agg(avg('price)).filter(" range is not null ").orderBy('categ).show(false)
+-----+---------+----------+
|categ|range |avg(price)|
+-----+---------+----------+
|A |[150-300]|2.5 |
|A |[0-200] |1.5 |
|B |[0-200] |2.5 |
|B |[150-300]|3.0 |
+-----+---------+----------+
scala>
You can filter your values before grouping, add range name column and then union the results.
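from pyspark.sql import functions as F

# Parentheses let the method chain span several lines
agg_range_0_200 = (
    df.filter('Amt > 0 and Amt < 200')
    .groupBy('Categ').agg(F.mean('price'))
    .withColumn('rang(Amt)', F.lit('[0-200]'))  # withColumn needs a Column, not a string
)
agg_range_150_300 = (
    df.filter('Amt > 150 and Amt < 300')
    .groupBy('Categ').agg(F.mean('price'))
    .withColumn('rang(Amt)', F.lit('[150-300]'))
)
agg_range = agg_range_0_200.union(agg_range_150_300)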
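All three answers rely on the same idea: a row whose Amt falls in both ranges must be counted once per range. A plain-Python sketch of that (not Spark code) is below; the names RANGES and mean_price_by_range are invented here, and treating both bounds as inclusive is an assumption, since the answers differ on that point.

```python
from collections import defaultdict

# Both bounds are treated as inclusive here (an assumption).
RANGES = {"[0-200]": (0, 200), "[150-300]": (150, 300)}

rows = [("A", 100, 1), ("A", 180, 2), ("A", 250, 3),
        ("B", 90, 2), ("B", 170, 3), ("B", 280, 3)]

def mean_price_by_range(rows, ranges):
    """Average price per (category, range); a row can land in several ranges."""
    prices = defaultdict(list)
    for categ, amt, price in rows:
        for label, (low, high) in ranges.items():
            if low <= amt <= high:
                prices[(categ, label)].append(price)
    return {key: sum(values) / len(values) for key, values in sorted(prices.items())}

for key, value in mean_price_by_range(rows, RANGES).items():
    print(key, value)
```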