How to delete rows dynamically based on ID and Status = "Removed" in Scala - scala

Here I have a sample dataset. How can I delete IDs dynamically (no hardcoded values) based on the Status column = "Removed"?
Sample Dataset:
ID Status Date Amount
1 New 01/05/20 20
1 Assigned 02/05/20 30
1 In-Progress 02/05/20 50
2 New 02/05/20 30
2 Removed 03/05/20 20
3 In-Progress 09/05/20 50
3 Removed 09/05/20 20
4 New 10/05/20 20
4 Assigned 10/05/20 30
Expected Result:
ID Status Date Amount
1 New 01/05/20 20
1 Assigned 02/05/20 30
1 In-Progress 02/05/20 50
4 New 10/05/20 20
4 Assigned 10/05/20 30
Thanks in Advance.

You can use filter (or where), not like, or not rlike to filter out records from the dataframe that have status = removed.
import org.apache.spark.sql.functions._
//assuming df is the dataframe
//using filter or where clause; trim removes white space, lower converts to lower case
val df1=df.filter(lower(trim(col("status"))) =!= "removed")
//or by filtering on the exact value "Removed" (won't match if you have mixed case)
val df1=df.filter(col("status") =!= "Removed")
//using not like
val df1=df.filter(!lower(col("status")).like("removed"))
//using not rlike (case-insensitive regex)
val df1=df.filter(!col("status").rlike(".*(?i)removed.*"))
Now df1 dataframe will have the required records in it.
UPDATE:
From Spark 2.4:
We can use a join or a window clause for this case.
import spark.implicits._
import org.apache.spark.sql.expressions._
val df=Seq((1,"New","01/05/20","20"),(1,"Assigned","02/05/20","30"),(1,"In-Progress","02/05/20","50"),
  (2,"New","02/05/20","30"),(2,"Removed","03/05/20","20"),(3,"In-Progress","09/05/20","50"),
  (3,"Removed","09/05/20","20"),(4,"New","10/05/20","20"),(4,"Assigned","10/05/20","30")).
  toDF("ID","Status","Date","Amount")
val df1=df.
groupBy("id").
agg(collect_list(lower(col("Status"))).alias("status_arr"))
//using array_contains function
df.alias("t1").join(df1.alias("t2"),Seq("id"),"inner").
filter(!array_contains(col("status_arr"),"removed")).
drop("status_arr").show()
//without join using window clause
val w=Window.partitionBy("id").orderBy("Status").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("status_arr",collect_list(lower(col("status"))).over(w)).
filter(!array_contains(col("status_arr"),"removed")).
drop("status_arr").
show()
//+---+-----------+--------+------+
//| ID| Status| Date|Amount|
//+---+-----------+--------+------+
//| 1| New|01/05/20| 20|
//| 1| Assigned|02/05/20| 30|
//| 1|In-Progress|02/05/20| 50|
//| 4| New|10/05/20| 20|
//| 4| Assigned|10/05/20| 30|
//+---+-----------+--------+------+
For Spark < 2.4:
val df1=df.groupBy("id").agg(concat_ws("",collect_list(lower(col("Status")))).alias("status_arr"))
df.alias("t1").join(df1.alias("t2"),Seq("id"),"inner").
filter(!col("status_arr").contains("removed")).
drop(col("status_arr")).
show()
//Using window functions
df.withColumn("status_arr",concat_ws("",collect_list(lower(col("status"))).over(w))).
filter(!col("status_arr").contains("removed")).
drop(col("status_arr")).
show(false)
//+---+-----------+--------+------+
//| ID| Status| Date|Amount|
//+---+-----------+--------+------+
//| 1| New|01/05/20| 20|
//| 1| Assigned|02/05/20| 30|
//| 1|In-Progress|02/05/20| 50|
//| 4| New|10/05/20| 20|
//| 4| Assigned|10/05/20| 30|
//+---+-----------+--------+------+

Assuming res0 is your dataset, you could do:
import spark.implicits._
val x = res0.where($"Status" =!= "Removed")
x.show()
This will remove only the rows whose Status is Removed, but it won't give the full result you want based on what you have posted above; one way to drop all rows for the affected IDs is sketched below.
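To match the expected result (dropping every row for an ID that has any Removed status), one option is an anti join against the IDs that contain a Removed row. A minimal sketch, assuming df holds the full dataset with columns ID and Status:
import org.apache.spark.sql.functions._
//IDs that have at least one Removed row
val removedIds = df.filter(lower(trim(col("Status"))) === "removed").select("ID").distinct()
//keep only the IDs that never appear in removedIds
val cleaned = df.join(removedIds, Seq("ID"), "left_anti")
cleaned.show()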

Related

Spark Scala: create a new column which contains the addition of the previous balance amount for each cid

Initial DF:
cid transAmt transDate
1 10 2-Aug
1 20 3-Aug
1 30 3-Aug
2 40 2-Aug
2 50 3-Aug
3 60 4-Aug
Output DF:
cid transAmt transDate sumAmt
1 10 2-Aug 10
1 20 3-Aug 30
1 30 3-Aug 60
2 40 2-Aug 40
2 50 3-Aug 90
3 60 4-Aug 60
I need a new column sumAmt which holds the running total for each cid.
Use the window sum function to get the cumulative sum.
Example:
df.show()
//+---+------+----------+
//|cid|Amount|transnDate|
//+---+------+----------+
//| 1| 10| 2-Aug|
//| 1| 20| 3-Aug|
//| 2| 40| 2-Aug|
//| 2| 50| 3-Aug|
//| 3| 60| 4-Aug|
//+---+------+----------+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions._
val w= Window.partitionBy("cid").orderBy("Amount","transnDate")
df.withColumn("sumAmt",sum(col("Amount")).over(w)).show()
//+---+------+----------+------+
//|cid|Amount|transnDate|sumAmt|
//+---+------+----------+------+
//| 1| 10| 2-Aug| 10|
//| 1| 20| 3-Aug| 30|
//| 3| 60| 4-Aug| 60|
//| 2| 40| 2-Aug| 40|
//| 2| 50| 3-Aug| 90|
//+---+------+----------+------+
Just use a simple window with a rowsBetween frame:
Window.unboundedPreceding meaning no lower limit (the start of the partition)
Window.currentRow meaning the frame ends at the current row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val cidCategory = Window.partitionBy("cid")
.orderBy("transDate")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
val result = df.withColumn("sumAmt", sum($"transAmt").over(cidCategory))
result.show() then displays the running sumAmt for each cid.

How to find the symmetrical duplicate rows (2 columns) using a Spark dataframe in Scala?

I have the below dataframe which has two columns.
Input dataframe:
col1,col2
1,2
2,3
7,0
2,1
In the above dataframe the first row and the fourth row are symmetrical and should be considered only once. We can keep either the first or the fourth row in the output.
Possible output dataframes:
possibility 1:
col1,col2
2,3
7,0
2,1
possibility 2:
col1,col2
1,2
2,3
7,0
You can call dropDuplicates on a sorted array column:
import org.apache.spark.sql.functions._
val df2 = df.withColumn(
  "arr",
  sort_array(array(col("col1"), col("col2")))
).dropDuplicates("arr").drop("arr")
df2.show
+----+----+
|col1|col2|
+----+----+
| 2| 3|
| 1| 2|
| 7| 0|
+----+----+
You can use row_number over a window partitioned by the least and greatest values of col1 and col2:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val w = Window
.partitionBy(least($"col1", $"col2"), greatest($"col1", $"col2"))
.orderBy(lit(null))
val df1 = df
.withColumn("rn", row_number().over(w))
.filter("rn = 1")
.drop("rn")
df1.show
//+----+----+
//|col1|col2|
//+----+----+
//| 2| 3|
//| 1| 2|
//| 7| 0|
//+----+----+
You can also partition by a sorted array column (array_sort is available from Spark 2.4):
val w = Window
.partitionBy(array_sort(array($"col1", $"col2")))
.orderBy(lit(null))
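For completeness, the deduplicated result then follows from the same row_number/filter pattern shown above; a short sketch:
val df2 = df
.withColumn("rn", row_number().over(w))
.filter("rn = 1")
.drop("rn")
df2.show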

Modify DataFrame values against a particular value in Spark Scala

See my code:
val spark: SparkSession = SparkSession.builder().appName("ReadFiles").master("local[*]").getOrCreate()
import spark.implicits._
val data: DataFrame = spark.read.option("header", "true")
.option("inferSchema", "true")
.csv("Resources/atom.csv")
data.show()
Data looks like:
ID Name City newCol1 newCol2
1 Ali lhr null null
2 Ahad swl 1 10
3 Sana khi null null
4 ABC xyz null null
New list of values:
val nums: List[Int] = List(10,20)
I want to add these values where ID = 4, so that the DataFrame looks like:
ID Name City newCol1 newCol2
1 Ali lhr null null
2 Ahad swl 1 10
3 Sana khi null null
4 ABC xyz 10 20
I wonder if it is possible. Any help will be appreciated. Thanks
It's possible. Use when/otherwise expressions for this case.
import org.apache.spark.sql.functions._
//df is the dataframe from the question (data)
df.withColumn("newCol1", when(col("id") === 4, lit(nums(0))).otherwise(col("newCol1")))
.withColumn("newCol2", when(col("id") === 4, lit(nums(1))).otherwise(col("newCol2")))
.show()
//+---+----+----+-------+-------+
//| ID|Name|City|newCol1|newCol2|
//+---+----+----+-------+-------+
//| 1| Ali| lhr| null| null|
//| 2|Ahad| swl| 1| 10|
//| 3|Sana| khi| null| null|
//| 4| ABC| xyz| 10| 20|
//+---+----+----+-------+-------+
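If the list of values and the set of target columns grow, the same idea can be generalized with a foldLeft. A minimal sketch, assuming the values in nums line up with newCol1, newCol2, ... by position (targetCols below is a hypothetical helper list):
//assumption: nums(i) is written into targetCols(i) for the row with id 4
val targetCols = Seq("newCol1", "newCol2")
val updated = targetCols.zip(nums).foldLeft(df) { case (acc, (c, v)) =>
  acc.withColumn(c, when(col("id") === 4, lit(v)).otherwise(col(c)))
}
updated.show()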

scala find group by id max date

I need to group by Id and Times and show the max date.
Id Key Times date
20 40 1 20190323
20 41 1 20191201
31 33 3 20191209
My output should be:
Id Key Times date
20 41 1 20191201
31 33 3 20191209
You can simply apply the groupBy function to group by Id and then join with the original dataset to get the Key column in your resulting dataframe. Try the following code:
import org.apache.spark.sql.functions._
import spark.implicits._
//your original dataframe
val df = Seq((20,40,1,20190323),(20,41,1,20191201),(31,33,3,20191209))
.toDF("Id","Key","Times","date")
df.show()
//output
//+---+---+-----+--------+
//| Id|Key|Times| date|
//+---+---+-----+--------+
//| 20| 40| 1|20190323|
//| 20| 41| 1|20191201|
//| 31| 33| 3|20191209|
//+---+---+-----+--------+
//group by Id column
val maxDate = df.groupBy("Id").agg(max("date").as("maxdate"))
//join with original DF to get rest of the column
maxDate.join(df, Seq("Id"))
.where($"date" === $"maxdate")
.select("Id","Key","Times","date").show()
//output
//+---+---+-----+--------+
//| Id|Key|Times| date|
//+---+---+-----+--------+
//| 31| 33| 3|20191209|
//| 20| 41| 1|20191201|
//+---+---+-----+--------+

Get Unique records in Spark [duplicate]

This question already has answers here: How to select the first row of each group? (9 answers). Closed 5 years ago.
I have a dataframe df as mentioned below:
customers product val_id rule_name rule_id priority
1 A 1 ABC 123 1
3 Z r ERF 789 2
2 B X ABC 123 2
2 B X DEF 456 3
1 A 1 DEF 456 2
I want to create a new dataframe df2 which has only unique customer ids. Since the rule_name and rule_id columns differ for the same customer, I want to pick the record with the highest priority for each customer, so my final outcome should be:
customers product val_id rule_name rule_id priority
1 A 1 ABC 123 1
3 Z r ERF 789 2
2 B X ABC 123 2
Can anyone please help me achieve this using Spark Scala? Any help will be appreciated.
You basically want to select rows with extreme values in a column. This is a really common issue, so there's even a whole tag greatest-n-per-group. Also see this question SQL Select only rows with Max Value on a Column which has a nice answer.
Here's an example for your specific case.
Note that this could select multiple rows for a customer, if there are multiple rows for that customer with the same (minimum) priority value; a tie-breaking sketch follows the code below.
This example is in PySpark, but it should be straightforward to translate to Scala.
from pyspark.sql import functions as F
# find best priority for each customer. this DF has only two columns.
cusPriDF = df.groupBy("customers").agg( F.min(df["priority"]).alias("priority") )
# now join back to choose only those rows and get all columns back
bestRowsDF = df.join(cusPriDF, on=["customers","priority"], how="inner")
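If exactly one row per customer is needed even when priorities tie, row_number over a window is one option. A short Scala sketch (picking an arbitrary row among ties), assuming df is the dataframe from the question:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
//one row per customer: smallest priority value first, arbitrary among equal priorities
val w = Window.partitionBy("customers").orderBy("priority")
val bestRows = df.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
bestRows.show()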
To create df2 you have to first order df by priority and then find unique customers by id. Like this:
val columns = df.schema.map(_.name).filterNot(_ == "customers").map(c => first(c).as(c))
val df2 = df.orderBy("priority").groupBy("customers").agg(columns.head, columns.tail:_*)
df2.show
It would give you the expected output:
+----------+--------+-------+----------+--------+---------+
| customers| product| val_id| rule_name| rule_id| priority|
+----------+--------+-------+----------+--------+---------+
| 1| A| 1| ABC| 123| 1|
| 3| Z| r| ERF| 789| 2|
| 2| B| X| ABC| 123| 2|
+----------+--------+-------+----------+--------+---------+
Corey beat me to it, but here's the Scala version:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
(1,"A","1","ABC",123,1),
(3,"Z","r","ERF",789,2),
(2,"B","X","ABC",123,2),
(2,"B","X","DEF",456,3),
(1,"A","1","DEF",456,2)).toDF("customers","product","val_id","rule_name","rule_id","priority")
val priorities = df.groupBy("customers").agg( min(df.col("priority")).alias("priority"))
val top_rows = df.join(priorities, Seq("customers","priority"), "inner")
top_rows.show()
+---------+--------+-------+------+---------+-------+
|customers|priority|product|val_id|rule_name|rule_id|
+---------+--------+-------+------+---------+-------+
| 1| 1| A| 1| ABC| 123|
| 3| 2| Z| r| ERF| 789|
| 2| 2| B| X| ABC| 123|
+---------+--------+-------+------+---------+-------+
You will have to use a min aggregation on the priority column, grouping the dataframe by customers, then inner join the original dataframe with the aggregated dataframe and select the required columns.
val aggregatedDF = dataframe.groupBy("customers").agg(min("priority").as("priority_1"))
.withColumnRenamed("customers", "customers_1")
val finalDF = dataframe.join(aggregatedDF, dataframe("customers") === aggregatedDF("customers_1") && dataframe("priority") === aggregatedDF("priority_1"))
finalDF.select("customers", "product", "val_id", "rule_name", "rule_id", "priority").show
You should have the desired result.