Scala: find group by id max date

I need to group by Id and Times and show the row with the max date:
Id Key Times date
20 40 1 20190323
20 41 1 20191201
31 33 3 20191209
My output should be:
Id Key Times date
20 41 1 20191201
31 33 3 20191209

You can apply the groupBy function to group by Id and then join with the original dataset to get the Key column in your resulting dataframe. Try the following code:
import org.apache.spark.sql.functions._
import spark.implicits._
//your original dataframe
val df = Seq((20,40,1,20190323),(20,41,1,20191201),(31,33,3,20191209))
  .toDF("Id","Key","Times","date")
df.show()
//output
//+---+---+-----+--------+
//| Id|Key|Times|    date|
//+---+---+-----+--------+
//| 20| 40|    1|20190323|
//| 20| 41|    1|20191201|
//| 31| 33|    3|20191209|
//+---+---+-----+--------+
//group by Id column
val maxDate = df.groupBy("Id").agg(max("date").as("maxdate"))
//join with the original DF to get the rest of the columns
maxDate.join(df, Seq("Id"))
.where($"date" === $"maxdate")
.select("Id","Key","Times","date").show()
//output
//+---+---+-----+--------+
//| Id|Key|Times|    date|
//+---+---+-----+--------+
//| 31| 33|    3|20191209|
//| 20| 41|    1|20191201|
//+---+---+-----+--------+
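Alternatively (a sketch, not from the original answer), you can avoid the join by ranking the rows within each Id with a window function and keeping only the top-ranked row:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// rank the rows within each Id by date, latest first
val latestFirst = Window.partitionBy("Id").orderBy(col("date").desc)

df.withColumn("rn", row_number().over(latestFirst))
  .filter(col("rn") === 1)   // keep only the row with the max date per Id
  .drop("rn")
  .show()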

Related

Spark Scala create a new column which contains addition of previous balance amount for each cid

Initial DF:
cid transAmt transDate
1 10 2-Aug
1 20 3-Aug
1 30 3-Aug
2 40 2-Aug
2 50 3-Aug
3 60 4-Aug
Output DF:
cid transAmt transDate sumAmt
1 10 2-Aug 10
1 20 3-Aug 30
1 30 3-Aug 60
2 40 2-Aug 40
2 50 3-Aug 90
3 60 4-Aug 60
I need a new column sumAmt which holds the cumulative sum of transAmt for each cid.
Use the sum window function to get the cumulative sum.
Example:
df.show()
//+---+------+----------+
//|cid|Amount|transnDate|
//+---+------+----------+
//|  1|    10|     2-Aug|
//|  1|    20|     3-Aug|
//|  2|    40|     2-Aug|
//|  2|    50|     3-Aug|
//|  3|    60|     4-Aug|
//+---+------+----------+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions._
val w = Window.partitionBy("cid").orderBy("Amount","transnDate")
df.withColumn("sumAmt",sum(col("Amount")).over(w)).show()
//+---+------+----------+------+
//|cid|Amount|transnDate|sumAmt|
//+---+------+----------+------+
//|  1|    10|     2-Aug|    10|
//|  1|    20|     3-Aug|    30|
//|  3|    60|     4-Aug|    60|
//|  2|    40|     2-Aug|    40|
//|  2|    50|     3-Aug|    90|
//+---+------+----------+------+
Just use a simple window with a rows-between frame:
Window.unboundedPreceding means the frame has no lower bound.
Window.currentRow means the frame ends at the current row.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val cidCategory = Window.partitionBy("cid")
.orderBy("transDate")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
val result = df.withColumn("sumAmt", sum($"transAmt").over(cidCategory))
This produces the expected sumAmt column shown in the question.
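For reference, a minimal end-to-end sketch with the question's sample data (assuming the column names transAmt and transDate and a SparkSession named spark):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val sample = Seq(
  (1, 10, "2-Aug"), (1, 20, "3-Aug"), (1, 30, "3-Aug"),
  (2, 40, "2-Aug"), (2, 50, "3-Aug"),
  (3, 60, "4-Aug")
).toDF("cid", "transAmt", "transDate")

val cumWindow = Window.partitionBy("cid")
  .orderBy("transDate")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// note: the two 3-Aug rows for cid 1 are tied in the ordering, so their
// intermediate running sums are not guaranteed to come out as 30 then 60
sample.withColumn("sumAmt", sum(col("transAmt")).over(cumWindow)).show()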

How to find the symmetrical duplicate rows (2 columns) using a Spark dataframe in Scala?

I have the below dataframe which has two columns.
Input dataframe:
col1,col2
1,2
2,3
7,0
2,1
In the above dataframe, the first row and the fourth row are symmetrical and should be considered only once. We can use either the first or the fourth row in the output.
Possible output dataframes.
possibility 1:
col1,col2
2,3
7,0
2,1
possibility 2:
col1,col2
1,2
2,3
7,0
You can call dropDuplicates on a sorted array column:
import org.apache.spark.sql.functions._

val df2 = df.withColumn(
  "arr",
  sort_array(array(col("col1"), col("col2")))
).dropDuplicates("arr").drop("arr")

df2.show
+----+----+
|col1|col2|
+----+----+
|   2|   3|
|   1|   2|
|   7|   0|
+----+----+
You can use row_number over a window partitioned by the least and greatest values of col1 and col2:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val w = Window
  .partitionBy(least($"col1", $"col2"), greatest($"col1", $"col2"))
  .orderBy(lit(null))

val df1 = df
  .withColumn("rn", row_number().over(w))
  .filter("rn = 1")
  .drop("rn")

df1.show
//+----+----+
//|col1|col2|
//+----+----+
//|   2|   3|
//|   1|   2|
//|   7|   0|
//+----+----+
You can also partition by a sorted array column (array_sort requires Spark 2.4+):
val w = Window
  .partitionBy(array_sort(array($"col1", $"col2")))
  .orderBy(lit(null))
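Another option (a sketch combining the dropDuplicates and least/greatest ideas above, not from the original answers) is to deduplicate directly on the normalized pair, without a window:
import org.apache.spark.sql.functions._

// normalize each pair so the smaller value comes first, then deduplicate on it
val df3 = df
  .withColumn("lo", least(col("col1"), col("col2")))
  .withColumn("hi", greatest(col("col1"), col("col2")))
  .dropDuplicates("lo", "hi")
  .drop("lo", "hi")

df3.show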

How to delete the rows dynamically based on ID and Status = "Removed" in Scala

Here I have a sample dataset. How can I delete the IDs dynamically (no hardcoded values) based on the Status column = "Removed"?
Sample Dataset:
ID Status Date Amount
1 New 01/05/20 20
1 Assigned 02/05/20 30
1 In-Progress 02/05/20 50
2 New 02/05/20 30
2 Removed 03/05/20 20
3 In-Progress 09/05/20 50
3 Removed 09/05/20 20
4 New 10/05/20 20
4 Assigned 10/05/20 30
Expected result:
ID Status Date Amount
1 New 01/05/20 20
1 Assigned 02/05/20 30
1 In-Progress 02/05/20 50
4 New 10/05/20 20
4 Assigned 10/05/20 30
Thanks in Advance.
You can use filter/where, not like, or not rlike to filter out records from the dataframe that have status = removed.
import org.apache.spark.sql.functions._
//assuming df is the dataframe
//using filter or where clause; trim removes white space, lower converts to lower case
val df1 = df.filter(lower(trim(col("status"))) =!= "removed")
//or by filtering on the literal "Removed"; this won't match if you have mixed case
val df1 = df.filter(col("status") =!= "Removed")
//using not like
val df1 = df.filter(!lower(col("status")).like("removed"))
//using not rlike
val df1 = df.filter(!col("status").rlike(".*(?i)removed.*"))
Now the df1 dataframe will have only the non-removed rows. Note that this keeps the other rows of IDs that also have a Removed status; see the update below for dropping those IDs entirely.
UPDATE:
From Spark 2.4:
We can use a join or a window clause for this case.
val df=Seq((1,"New","01/05/20","20"),(1,"Assigned","02/05/20","30"),(1,"In-Progress","02/05/20","50"),(2,"New","02/05/20","30"),(2,"Removed","03/05/20","20"),(3,"In-Progress","09/05/20","50"),(3,"Removed","09/05/20","20"),(4,"New","10/05/20","20"),(4,"Assigned","10/05/20","30")).toDF("ID","Status","Date","Amount")
import org.apache.spark.sql.expressions._
val df1=df.
groupBy("id").
agg(collect_list(lower(col("Status"))).alias("status_arr"))
//using array_contains function
df.alias("t1").join(df1.alias("t2"),Seq("id"),"inner").
filter(!array_contains(col("status_arr"),"removed")).
drop("status_arr").show()
//without join using window clause
val w=Window.partitionBy("id").orderBy("Status").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("status_arr",collect_list(lower(col("status"))).over(w)).
filter(!array_contains(col("status_arr"),"removed")).
drop("status_arr").
show()
//+---+-----------+--------+------+
//| ID|     Status|    Date|Amount|
//+---+-----------+--------+------+
//|  1|        New|01/05/20|    20|
//|  1|   Assigned|02/05/20|    30|
//|  1|In-Progress|02/05/20|    50|
//|  4|        New|10/05/20|    20|
//|  4|   Assigned|10/05/20|    30|
//+---+-----------+--------+------+
For Spark < 2.4:
val df1=df.groupBy("id").agg(concat_ws("",collect_list(lower(col("Status")))).alias("status_arr"))
df.alias("t1").join(df1.alias("t2"),Seq("id"),"inner").
filter(!col("status_arr").contains("removed")).
drop(col("status_arr")).
show()
//Using window functions
df.withColumn("status_arr",concat_ws("",collect_list(lower(col("status"))).over(w))).
filter(!col("status_arr").contains("removed")).
drop(col("status_arr")).
show(false)
//+---+-----------+--------+------+
//| ID|     Status|    Date|Amount|
//+---+-----------+--------+------+
//|  1|        New|01/05/20|    20|
//|  1|   Assigned|02/05/20|    30|
//|  1|In-Progress|02/05/20|    50|
//|  4|        New|10/05/20|    20|
//|  4|   Assigned|10/05/20|    30|
//+---+-----------+--------+------+
Assuming res0 is your dataset, you could do:
import spark.implicits._
val x = res0.where($"Status" =!= "Removed")
x.show()
This will remove the rows whose Status is "Removed", but it won't give exactly what you want based on the expected output above, since the remaining rows for IDs 2 and 3 would still be kept.
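A further option (my sketch, not from the answers above) is to collect the IDs that have a Removed status and drop them with a left_anti join:
import org.apache.spark.sql.functions._

// IDs that have at least one "Removed" status
val removedIds = df
  .filter(lower(trim(col("Status"))) === "removed")
  .select("ID")
  .distinct()

// keep only the rows whose ID never appears with a Removed status
df.join(removedIds, Seq("ID"), "left_anti").show()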

Modify DataFrame values against a particular value in Spark Scala

See my code:
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder().appName("ReadFiles").master("local[*]").getOrCreate()
import spark.implicits._

val data: DataFrame = spark.read.option("header", "true")
  .option("inferschema", "true")
  .csv("Resources/atom.csv")
data.show()
Data looks like:
ID Name City newCol1 newCol2
1 Ali lhr null null
2 Ahad swl 1 10
3 Sana khi null null
4 ABC xyz null null
New list of values:
val nums: List[Int] = List(10,20)
I want to add these values where ID=4, so that the DataFrame looks like:
ID Name City newCol1 newCol2
1 Ali lhr null null
2 Ahad swl 1 10
3 Sana khi null null
4 ABC xyz 10 20
I wonder if it is possible. Any help will be appreciated. Thanks
It's possible. Use when/otherwise expressions for this case.
import org.apache.spark.sql.functions._
df.withColumn("newCol1", when(col("id") === 4, lit(nums(0))).otherwise(col("newCol1")))
.withColumn("newCol2", when(col("id") === 4, lit(nums(1))).otherwise(col("newCol2")))
.show()
//+---+----+----+-------+-------+
//| ID|Name|City|newCol1|newCol2|
//+---+----+----+-------+-------+
//|  1| Ali| lhr|   null|   null|
//|  2|Ahad| swl|      1|     10|
//|  3|Sana| khi|   null|   null|
//|  4| ABC| xyz|     10|     20|
//+---+----+----+-------+-------+
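If more than two columns need patching, one way (a sketch of my own generalization, reusing nums and the df from the answer above) is to zip the target columns with the new values and fold the when/otherwise over them:
import org.apache.spark.sql.functions._

val nums: List[Int] = List(10, 20)
val targetCols = Seq("newCol1", "newCol2")

// apply the same when/otherwise pattern once per (column, value) pair
val patched = targetCols.zip(nums).foldLeft(df) { case (tmp, (c, v)) =>
  tmp.withColumn(c, when(col("id") === 4, lit(v)).otherwise(col(c)))
}
patched.show()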

How to compute cumulative sum on multiple float columns?

I have 100 float columns in a Dataframe which are ordered by date.
ID Date C1 C2 ....... C100
1 02/06/2019 32.09 45.06 99
1 02/04/2019 32.09 45.06 99
2 02/03/2019 32.09 45.06 99
2 05/07/2019 32.09 45.06 99
I need the cumulative sum of C1 to C100, partitioned by id and ordered by date.
Target dataframe should look like this:
ID Date C1 C2 ....... C100
1 02/04/2019 32.09 45.06 99
1 02/06/2019 64.18 90.12 198
2 02/03/2019 32.09 45.06 99
2 05/07/2019 64.18 90.12 198
I want to achieve this without manually looping over C1 to C100.
Initial code for one column:
var DF1 = DF.withColumn("CumSum_c1", sum("C1").over(
  Window.partitionBy("ID")
    .orderBy(col("date").asc)))
I found a similar question here, but it does this manually for two columns: Cumulative sum in Spark
It's a classic use case for foldLeft. Let's generate some data first:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.range(1000)
  .withColumn("c1", 'id + 3)
  .withColumn("c2", 'id % 2 + 1)
  .withColumn("date", monotonically_increasing_id)
  .withColumn("id", 'id % 10 + 1)

// We will select the columns we want to compute the cumulative sum of.
val columns = df.drop("id", "date").columns
val w = Window.partitionBy(col("id")).orderBy(col("date").asc)
val results = columns.foldLeft(df)((tmp, column) => tmp.withColumn(s"cum_sum_$column", sum(column).over(w)))
results.orderBy("id", "date").show
// +---+---+---+-----------+----------+----------+
// | id| c1| c2|       date|cum_sum_c1|cum_sum_c2|
// +---+---+---+-----------+----------+----------+
// |  1|  3|  1|          0|         3|         1|
// |  1| 13|  1|         10|        16|         2|
// |  1| 23|  1|         20|        39|         3|
// |  1| 33|  1|         30|        72|         4|
// |  1| 43|  1|         40|       115|         5|
// |  1| 53|  1| 8589934592|       168|         6|
// |  1| 63|  1| 8589934602|       231|         7|
Here is another way using a simple select expression:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val w = Window.partitionBy($"id").orderBy($"date".asc).rowsBetween(Window.unboundedPreceding, Window.currentRow)
// get the columns you want to sum
val columnsToSum = df.drop("ID", "Date").columns
// map over those columns and create new sum columns
val selectExpr = Seq(col("ID"), col("Date")) ++ columnsToSum.map(c => sum(col(c)).over(w).alias(c)).toSeq
df.select(selectExpr:_*).show()
Gives:
+---+----------+-----+-----+----+
| ID|      Date|   C1|   C2|C100|
+---+----------+-----+-----+----+
|  1|02/04/2019|32.09|45.06|  99|
|  1|02/06/2019|64.18|90.12| 198|
|  2|02/03/2019|32.09|45.06|  99|
|  2|05/07/2019|64.18|90.12| 198|
+---+----------+-----+-----+----+
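The same cumulative sums can also be expressed with Spark SQL window syntax; a sketch (assuming df holds the ID, Date, C1..C100 columns from the question and Date sorts correctly as-is):
df.createOrReplaceTempView("t")

// build one window-sum expression per value column
val sumExprs = df.columns
  .filterNot(c => Seq("ID", "Date").contains(c))
  .map(c => s"sum($c) over (partition by ID order by Date rows between unbounded preceding and current row) as $c")

spark.sql(s"select ID, Date, ${sumExprs.mkString(", ")} from t").show()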