I have a dataframe like this:
+-----+---------+---------+
|Categ| Amt| price|
+-----+---------+---------+
| A| 100| 1|
| A| 180| 2|
| A| 250| 3|
| B| 90| 2|
| B| 170| 3|
| B| 280| 3|
+-----+---------+---------+
I want to group by "Categ" and calculate the mean price within overlapping ranges.
Let's say those ranges are [0-200] and [150-300].
So the output that I'd like to get is like this:
+-----+---------+-----------+
|Categ|rang(Amt)|mean(price)|
+-----+---------+-----------+
|    A|  [0-200]|        1.5|
|    A|[150-300]|        2.5|
|    B|  [0-200]|        2.5|
|    B|[150-300]|          3|
+-----+---------+-----------+
Check this out:
scala> val df = Seq(("A",100,1),("A",180,2),("A",250,3),("B",90,2),("B",170,3),("B",280,3)).toDF("categ","amt","price")
df: org.apache.spark.sql.DataFrame = [categ: string, amt: int ... 1 more field]
scala> df.show(false)
+-----+---+-----+
|categ|amt|price|
+-----+---+-----+
|A |100|1 |
|A |180|2 |
|A |250|3 |
|B |90 |2 |
|B |170|3 |
|B |280|3 |
+-----+---+-----+
scala> val df2 = df.withColumn("newc",array(when('amt>=0 and 'amt <=200, map(lit("[0-200]"),'price)),when('amt>150 and 'amt<=300, map(lit("[150-300]"),'price))))
df2: org.apache.spark.sql.DataFrame = [categ: string, amt: int ... 2 more fields]
scala> val df3 = df2.select(col("*"), explode('newc).as("rangekv")).select(col("*"),explode('rangekv).as(Seq("range","price2")))
df3: org.apache.spark.sql.DataFrame = [categ: string, amt: int ... 5 more fields]
scala> df3.show(false)
+-----+---+-----+----------------------------------+----------------+---------+------+
|categ|amt|price|newc |rangekv |range |price2|
+-----+---+-----+----------------------------------+----------------+---------+------+
|A |100|1 |[[[0-200] -> 1],] |[[0-200] -> 1] |[0-200] |1 |
|A |180|2 |[[[0-200] -> 2], [[150-300] -> 2]]|[[0-200] -> 2] |[0-200] |2 |
|A |180|2 |[[[0-200] -> 2], [[150-300] -> 2]]|[[150-300] -> 2]|[150-300]|2 |
|A |250|3 |[, [[150-300] -> 3]] |[[150-300] -> 3]|[150-300]|3 |
|B |90 |2 |[[[0-200] -> 2],] |[[0-200] -> 2] |[0-200] |2 |
|B |170|3 |[[[0-200] -> 3], [[150-300] -> 3]]|[[0-200] -> 3] |[0-200] |3 |
|B |170|3 |[[[0-200] -> 3], [[150-300] -> 3]]|[[150-300] -> 3]|[150-300]|3 |
|B |280|3 |[, [[150-300] -> 3]] |[[150-300] -> 3]|[150-300]|3 |
+-----+---+-----+----------------------------------+----------------+---------+------+
scala> df3.groupBy('categ,'range).agg(avg('price)).orderBy('categ).show(false)
+-----+---------+----------+
|categ|range |avg(price)|
+-----+---------+----------+
|A |[0-200] |1.5 |
|A |[150-300]|2.5 |
|B |[0-200] |2.5 |
|B |[150-300]|3.0 |
+-----+---------+----------+
scala>
You can also create an array of range strings and explode it. But in this case you will get NULL rows after exploding, so you need to filter them out.
scala> val df2 = df.withColumn("newc",array(when('amt>=0 and 'amt <=200, lit("[0-200]")),when('amt>150 and 'amt<=300,lit("[150-300]") )))
df2: org.apache.spark.sql.DataFrame = [categ: string, amt: int ... 2 more fields]
scala> val df3 = df2.select(col("*"), explode('newc).as("range"))
df3: org.apache.spark.sql.DataFrame = [categ: string, amt: int ... 3 more fields]
scala> df3.groupBy('categ,'range).agg(avg('price)).orderBy('categ).show(false)
+-----+---------+----------+
|categ|range |avg(price)|
+-----+---------+----------+
|A |[150-300]|2.5 |
|A |[0-200] |1.5 |
|A |null |2.0 |
|B |[0-200] |2.5 |
|B |null |2.5 |
|B |[150-300]|3.0 |
+-----+---------+----------+
scala> df3.groupBy('categ,'range).agg(avg('price)).filter(" range is not null ").orderBy('categ).show(false)
+-----+---------+----------+
|categ|range |avg(price)|
+-----+---------+----------+
|A |[150-300]|2.5 |
|A |[0-200] |1.5 |
|B |[0-200] |2.5 |
|B |[150-300]|3.0 |
+-----+---------+----------+
scala>
You can filter your values before grouping, add a range-name column, and then union the results (PySpark):
from pyspark.sql.functions import lit, mean

agg_range_0_200 = (df
    .filter("Amt >= 0 and Amt <= 200")
    .groupBy("Categ").agg(mean("price"))
    .withColumn("rang(Amt)", lit("[0-200]")))

agg_range_150_300 = (df
    .filter("Amt >= 150 and Amt <= 300")
    .groupBy("Categ").agg(mean("price"))
    .withColumn("rang(Amt)", lit("[150-300]")))

agg_range = agg_range_0_200.union(agg_range_150_300)
Related
Is it possible to explode multiple columns into one new column in Spark? I have a dataframe which looks like this:
userId varA varB
1 [0,2,5] [1,2,9]
desired output:
userId bothVars
1 0
1 2
1 5
1 1
1 2
1 9
What I have tried so far:
val explodedDf = df.withColumn("bothVars", explode($"varA")).drop("varA")
.withColumn("bothVars", explode($"varB")).drop("varB")
which doesn't work. Any suggestions are much appreciated.
You could wrap the two arrays into one and flatten the nested array before exploding it, as shown below:
val df = Seq(
(1, Seq(0, 2, 5), Seq(1, 2, 9)),
(2, Seq(1, 3, 4), Seq(2, 3, 8))
).toDF("userId", "varA", "varB")
df.
select($"userId", explode(flatten(array($"varA", $"varB"))).as("bothVars")).
show
// +------+--------+
// |userId|bothVars|
// +------+--------+
// | 1| 0|
// | 1| 2|
// | 1| 5|
// | 1| 1|
// | 1| 2|
// | 1| 9|
// | 2| 1|
// | 2| 3|
// | 2| 4|
// | 2| 2|
// | 2| 3|
// | 2| 8|
// +------+--------+
Note that flatten is available on Spark 2.4+.
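If you are on an older Spark version without flatten, a small UDF can concatenate the two arrays instead (a sketch, not part of the original answer):
import org.apache.spark.sql.functions.{explode, udf}

// Hypothetical pre-2.4 fallback: concatenate the two arrays with a UDF,
// then explode the combined array exactly as above.
val concatArrays = udf((a: Seq[Int], b: Seq[Int]) => a ++ b)

df.select($"userId", explode(concatArrays($"varA", $"varB")).as("bothVars")).show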
Use array_union and then the explode function.
scala> df.show(false)
+------+---------+---------+
|userId|varA |varB |
+------+---------+---------+
|1 |[0, 2, 5]|[1, 2, 9]|
|2 |[1, 3, 4]|[2, 3, 8]|
+------+---------+---------+
scala> df
.select($"userId",explode(array_union($"varA",$"varB")).as("bothVars"))
.show(false)
+------+--------+
|userId|bothVars|
+------+--------+
|1 |0 |
|1 |2 |
|1 |5 |
|1 |1 |
|1 |9 |
|2 |1 |
|2 |3 |
|2 |4 |
|2 |2 |
|2 |8 |
+------+--------+
array_union is available in Spark 2.4+. Note that it removes duplicate values, so unlike the flatten approach it keeps only one 2 for userId 1 in the output above.
I want to transform this dataframe:
+----+----+---+---+
|col1|col2|RC1|RC2|
+----+----+---+---+
|A   |B   |  1|  0|
|C   |D   |  1|  1|
+----+----+---+---+
to this!
+----+----+------+
|col1|col2|newCol|
+----+----+------+
|A   |B   |RC1   |
|C   |D   |RC1   |
|C   |D   |RC2   |
+----+----+------+
tidyr's gather function handles this well in R; isn't there a possible solution in Spark Scala?
Use the stack function to unpivot the columns, as shown below; the first argument to stack is the number of rows each input row is unpivoted into.
val df=Seq(("A", "B", 1, 0), ("C", "D", 1, 1)).toDF("col1", "col2", "RC1", "RC2")
+----+----+---+---+
|col1|col2|RC1|RC2|
+----+----+---+---+
| A| B| 1| 0|
| C| D| 1| 1|
+----+----+---+---+
df.select($"col1", $"col2", expr("stack(2,'RC1', RC1, 'RC2', RC2) as (newCol,RC_VAL)")).where($"RC_VAL" =!= 0).drop("RC_VAL").show()
+----+----+------+
|col1|col2|newCol|
+----+----+------+
| A| B| RC1|
| C| D| RC1|
| C| D| RC2|
+----+----+------+
Check the code below.
scala> df.show(false)
+----+----+---+---+
|col1|col2|rc1|rc2|
+----+----+---+---+
|A |B |1 |0 |
|C |D |1 |1 |
+----+----+---+---+
Build expression.
scala> val colExpr =
when($"rc1" === 1 && $"rc2" === 1,array(lit("RC1"),lit("RC2")))
.when($"rc1" === 1 && $"rc2" === 0, array(lit("RC1")))
.when($"rc1" === 0 && $"rc2" === 1, array(lit("RC2")))
Apply expression.
scala>
spark.time {
df
.select($"col1",$"col2",explode(colExpr).as("newcol"))
.show(false)
}
+----+----+------+
|col1|col2|newcol|
+----+----+------+
|A |B |RC1 |
|C |D |RC1 |
|C |D |RC2 |
+----+----+------+
Time taken: 914 ms
With a dataframe such as:
+-----+-----+-----+-----+-----+-----+
|old_a|new_a| a|old_b|new_b| b|
+-----+-----+-----+-----+-----+-----+
| 6| 7| true| 6| 6|false|
| 1| 1|false| 12| 8| true|
| 1| 2| true| 2| 8| true|
| 1| null| true| 2| 8| true|
+-----+-----+-----+-----+-----+-----+
Note: 'a' is true when 'new_a' differs from 'old_a', and likewise for 'b'.
I'd like to add a JSON column with some values from the other columns, following this rule:
"if 'a' is true, the value of the 'new_a' column must be added to the new JSON, and the same for 'b'",
which would produce the following dataframe:
+-----+-----+--------+-----+-----+--------+------------------------+
|old_a|new_a|a |old_b|new_b| b| json |
+-----+-----+--------+-----+-----+--------+------------------------+
| 6| 7| true| 6| 6| false| { "a" : 7 } |
| 1| 1| false| 12| 8| true| { "b" : 8 } |
| 1| 2| true| 2| 8| true| { "a" : 2, "b" : 8} |
| 1| null| true| 2| 8| true| { "a" : null, "b" : 8} |
+-----+-----+--------+-----+-----+--------+------------------------+
Is there a way to achieve that without UDFs?
If not, what would be the best way to write the UDF so it won't be too costly?
Thanks
Use the to_json and struct functions.
By default, to_json drops columns whose value is null; for this reason I have converted the new_a column's datatype to string.
With new_a as integer:
scala> df.show(false)
+-----+-----+-----+-----+-----+-----+
|old_a|new_a|a |old_b|new_b|b |
+-----+-----+-----+-----+-----+-----+
|6 |7 |true |6 |6 |false|
|1 |1 |false|12 |8 |true |
|1 |2 |true |2 |8 |true |
|1 |null |true |2 |8 |true |
+-----+-----+-----+-----+-----+-----+
scala> df.printSchema
root
|-- old_a: integer (nullable = false)
|-- new_a: integer (nullable = true)
|-- a: boolean (nullable = false)
|-- old_b: integer (nullable = false)
|-- new_b: integer (nullable = false)
|-- b: boolean (nullable = false)
scala> df.withColumn("json",when($"a" && $"b",to_json(struct($"new_a",$"new_b"))).when($"a",to_json(struct($"new_a"))).otherwise(to_json(struct($"new_b")))).show(false)
+-----+-----+-----+-----+-----+-----+---------------------+
|old_a|new_a|a |old_b|new_b|b |json |
+-----+-----+-----+-----+-----+-----+---------------------+
|6 |7 |true |6 |6 |false|{"new_a":7} |
|1 |1 |false|12 |8 |true |{"new_b":8} |
|1 |2 |true |2 |8 |true |{"new_a":2,"new_b":8}|
|1 |null |true |2 |8 |true |{"new_b":8} |
+-----+-----+-----+-----+-----+-----+---------------------+
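The cast itself is not shown in the answer; to reproduce the literal "null" string in the output further below, the SQL NULL was presumably replaced as well, for example (a sketch, not from the original):
// Hypothetical conversion: cast new_a to string and turn SQL NULL into the
// literal string "null" so that to_json keeps the key.
val dfStr = df.withColumn("new_a", coalesce($"new_a".cast("string"), lit("null")))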
With new_a as string:
scala> df.show(false)
+-----+-----+-----+-----+-----+-----+
|old_a|new_a|a |old_b|new_b|b |
+-----+-----+-----+-----+-----+-----+
|6 |7 |true |6 |6 |false|
|1 |1 |false|12 |8 |true |
|1 |2 |true |2 |8 |true |
|1 |null |true |2 |8 |true |
+-----+-----+-----+-----+-----+-----+
scala> df.printSchema
root
|-- old_a: integer (nullable = false)
|-- new_a: string (nullable = true)
|-- a: boolean (nullable = false)
|-- old_b: integer (nullable = false)
|-- new_b: integer (nullable = false)
|-- b: boolean (nullable = false)
scala> df.withColumn("json",when($"a" && $"b",to_json(struct($"new_a",$"new_b"))).when($"a",to_json(struct($"new_a"))).otherwise(to_json(struct($"new_b")))).show(false)
+-----+-----+-----+-----+-----+-----+--------------------------+
|old_a|new_a|a |old_b|new_b|b |json |
+-----+-----+-----+-----+-----+-----+--------------------------+
|6 |7 |true |6 |6 |false|{"new_a":"7"} |
|1 |1 |false|12 |8 |true |{"new_b":8} |
|1 |2 |true |2 |8 |true |{"new_a":"2","new_b":8} |
|1 |null |true |2 |8 |true |{"new_a":"null","new_b":8}|
+-----+-----+-----+-----+-----+-----+--------------------------+
A solution to generalize Srinivas's answer, for when we don't know the number of old/new column pairs.
(Note: something I didn't mention is that columns 'a' and 'b' were there to tell whether the value changed between old_a and new_a, respectively old_b and new_b.)
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, when, to_json, struct}

val df = Seq(
(null, "a", "b", "b"),
("a", null, "b", "b"),
("a", "a2", "b", "b"),
("a", "a2", "b", "b2"),
(null, null, "b", "b2")
).toDF("old_a", "new_a","old_b", "new_b")
// replace null by empty string to not mess with the voluntary null value we set later
val df2 = df.na.fill("",df.columns)
df2.show()
val colNames = df2.columns.map(name => name.stripPrefix("old_").stripPrefix("new_")).distinct
val res = colNames.foldLeft(df2){(tempDF, colName) =>
tempDF.withColumn(colName,
when(col(s"old_$colName").equalTo(col(s"new_$colName")), null)
.otherwise(col(s"new_$colName"))
)
}
val cols: Array[Column] = colNames.map(col(_))
val resWithJson = res.withColumn("json", to_json(struct(cols:_*)))
Output:
+-----+-----+-----+-----+
|old_a|new_a|old_b|new_b|
+-----+-----+-----+-----+
| | a| b| b|
| a| | b| b|
| a| a2| b| b|
| a| a2| b| b2|
| | | b| b2|
+-----+-----+-----+-----+
+-----+-----+-----+-----+----+----+-------------------+
|old_a|new_a|old_b|new_b|a |b |json |
+-----+-----+-----+-----+----+----+-------------------+
| |a |b |b |a |null|{"a":"a"} |
|a | |b |b | |null|{"a":""} |
|a |a2 |b |b |a2 |null|{"a":"a2"} |
|a |a2 |b |b2 |a2 |b2 |{"a":"a2","b":"b2"}|
| | |b |b2 |null|b2 |{"b":"b2"} |
+-----+-----+-----+-----+----+----+-------------------+
I created a dataframe in Spark by grouping by column1 and date and summing the amount.
val table = df1.groupBy($"column1",$"date").sum("amount")
Column1 |Date |Amount
A |1-jul |1000
A |1-june |2000
A |1-May |2000
A |1-dec |3000
A |1-Nov |2000
B |1-jul |100
B |1-june |300
B |1-May |400
B |1-dec |300
Now, I want to add a new column with the difference between the amounts of any two dates from the table.
You can use a Window function if the calculation is fixed, e.g. the difference from the previous month, or from two months before, etc. For that you can use the lag and lead functions with a Window.
But for that you need to convert the date column as below so that it can be ordered.
+-------+------+--------------+------+
|Column1|Date |Date_Converted|Amount|
+-------+------+--------------+------+
|A |1-jul |2017-07-01 |1000 |
|A |1-june|2017-06-01 |2000 |
|A |1-May |2017-05-01 |2000 |
|A |1-dec |2017-12-01 |3000 |
|A |1-Nov |2017-11-01 |2000 |
|B |1-jul |2017-07-01 |100 |
|B |1-june|2017-06-01 |300 |
|B |1-May |2017-05-01 |400 |
|B |1-dec |2017-12-01 |300 |
+-------+------+--------------+------+
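The answer does not show how Date_Converted was derived from strings like "1-jul"; a minimal sketch (assuming every row falls in 2017 and mapping the month names by hand, since spellings like "june" don't match a single date pattern) could look like this:
import org.apache.spark.sql.functions._

// Hypothetical conversion of "1-jul"-style strings into a sortable date column.
val monthNum = Map(
  "jan" -> "01", "feb" -> "02", "mar" -> "03", "apr" -> "04",
  "may" -> "05", "jun" -> "06", "june" -> "06", "jul" -> "07",
  "aug" -> "08", "sep" -> "09", "oct" -> "10", "nov" -> "11", "dec" -> "12")
val toMonth = udf((d: String) => monthNum(d.split("-")(1).trim.toLowerCase))

val dfConverted = table.withColumn("Date_Converted",
  to_date(concat(lit("2017-"), toMonth($"Date"), lit("-01"))))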
You can find the difference between the previous month and the current month by doing:
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Column1").orderBy("Date_Converted")
import org.apache.spark.sql.functions._
df.withColumn("diff_Amt_With_Prev_Month", $"Amount" - when((lag("Amount", 1).over(windowSpec)).isNull, 0).otherwise(lag("Amount", 1).over(windowSpec)))
.show(false)
You should have
+-------+------+--------------+------+------------------------+
|Column1|Date |Date_Converted|Amount|diff_Amt_With_Prev_Month|
+-------+------+--------------+------+------------------------+
|B |1-May |2017-05-01 |400 |400.0 |
|B |1-june|2017-06-01 |300 |-100.0 |
|B |1-jul |2017-07-01 |100 |-200.0 |
|B |1-dec |2017-12-01 |300 |200.0 |
|A |1-May |2017-05-01 |2000 |2000.0 |
|A |1-june|2017-06-01 |2000 |0.0 |
|A |1-jul |2017-07-01 |1000 |-1000.0 |
|A |1-Nov |2017-11-01 |2000 |1000.0 |
|A |1-dec |2017-12-01 |3000 |1000.0 |
+-------+------+--------------+------+------------------------+
You can increase the lag offset to compare with the amount from two months earlier:
df.withColumn("diff_Amt_With_Prev_two_Month", $"Amount" - when((lag("Amount", 2).over(windowSpec)).isNull, 0).otherwise(lag("Amount", 2).over(windowSpec)))
.show(false)
which will give you
+-------+------+--------------+------+----------------------------+
|Column1|Date |Date_Converted|Amount|diff_Amt_With_Prev_two_Month|
+-------+------+--------------+------+----------------------------+
|B |1-May |2017-05-01 |400 |400.0 |
|B |1-june|2017-06-01 |300 |300.0 |
|B |1-jul |2017-07-01 |100 |-300.0 |
|B |1-dec |2017-12-01 |300 |0.0 |
|A |1-May |2017-05-01 |2000 |2000.0 |
|A |1-june|2017-06-01 |2000 |2000.0 |
|A |1-jul |2017-07-01 |1000 |-1000.0 |
|A |1-Nov |2017-11-01 |2000 |0.0 |
|A |1-dec |2017-12-01 |3000 |2000.0 |
+-------+------+--------------+------+----------------------------+
I hope the answer is helpful
Assuming those two dates belong to each group of your table.
My imports:
import org.apache.spark.sql.functions.{concat_ws,collect_list,lit}
Prepare the dataframe:
scala> val seqRow = Seq(
| ("A","1- jul",1000),
| ("A","1-june",2000),
| ("A","1-May",2000),
| ("A","1-dec",3000),
| ("B","1-jul",100),
| ("B","1-june",300),
| ("B","1-May",400),
| ("B","1-dec",300))
seqRow: Seq[(String, String, Int)] = List((A,1- jul,1000), (A,1-june,2000), (A,1-May,2000), (A,1-dec,3000), (B,1-jul,100), (B,1-june,300), (B,1-May,400), (B,1-dec,300))
scala> val input_df = sc.parallelize(seqRow).toDF("column1","date","amount")
input_df: org.apache.spark.sql.DataFrame = [column1: string, date: string ... 1 more field]
Now write a UDF for your case,
scala> def calc_diff = udf((list : Seq[String],startMonth : String,endMonth : String) => {
| //get the month and their values
| val monthMap = list.map{str =>
| val splitText = str.split("\\$")
| val month = splitText(0).split("-")(1).trim
|
| (month.toLowerCase,splitText(1).toInt)
| }.toMap
|
| val stMnth = monthMap(startMonth)
| val endMnth = monthMap(endMonth)
| endMnth - stMnth
|
| })
calc_diff: org.apache.spark.sql.expressions.UserDefinedFunction
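The group_df used below is never shown in the answer. Judging from the UDF (each element of collect_val is a "<date>$<amount>" string) and the schema printed later (column1, sum_amount, collect_val), it was presumably built roughly like this (a reconstruction, not from the original):
// Hypothetical reconstruction of group_df: per column1, sum the amounts and
// collect "date$amount" strings (also needs import org.apache.spark.sql.functions.sum).
val group_df = input_df
  .groupBy('column1)
  .agg(sum('amount).as("sum_amount"),
       collect_list(concat_ws("$", 'date, 'amount)).as("collect_val"))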
Now, prepare the output:
scala> val (month1 : String,month2 : String) = ("jul","dec")
month1: String = jul
month2: String = dec
scala> val req_df = group_df.withColumn("diff",calc_diff('collect_val,lit(month1.toLowerCase),lit(month2.toLowerCase)))
req_df: org.apache.spark.sql.DataFrame = [column1: string, sum_amount: bigint ... 2 more fields]
scala> val req_df = group_df.withColumn("diff",calc_diff('collect_val,lit(month1.toLowerCase),lit(month2.toLowerCase))).drop('collect_val)
req_df: org.apache.spark.sql.DataFrame = [column1: string, sum_amount: bigint ... 1 more field]
scala> req_df.orderBy('column1).show
+-------+----------+----+
|column1|sum_amount|diff|
+-------+----------+----+
| A| 8000|2000|
| B| 1100| 200|
+-------+----------+----+
Hope this is what you want.
(table.filter($"Date".isin("1-jul", "1-dec"))
.groupBy("Column1")
.pivot("Date")
.agg(first($"Amount"))
.withColumn("diff", $"1-dec" - $"1-jul")
).show
+-------+-----+-----+----+
|Column1|1-dec|1-jul|diff|
+-------+-----+-----+----+
| B| 300| 100| 200|
| A| 3000| 1000|2000|
+-------+-----+-----+----+