Modify DataFrame values against a particular value in Spark Scala - scala

See my code:
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder().appName("ReadFiles").master("local[*]").getOrCreate()
import spark.implicits._

// read the CSV with a header row and schema inference
val data: DataFrame = spark.read.option("header", "true")
  .option("inferSchema", "true")
  .csv("Resources/atom.csv")
data.show()
Data looks like:
ID  Name  City  newCol1  newCol2
1   Ali   lhr   null     null
2   Ahad  swl   1        10
3   Sana  khi   null     null
4   ABC   xyz   null     null
New list of values:
val nums: List[Int] = List(10,20)
I want to add these values where ID = 4, so that the DataFrame looks like:
ID  Name  City  newCol1  newCol2
1   Ali   lhr   null     null
2   Ahad  swl   1        10
3   Sana  khi   null     null
4   ABC   xyz   10       20
I wonder if it is possible. Any help will be appreciated. Thanks

It's possible. Use when/otherwise expressions for this case.
import org.apache.spark.sql.functions._
df.withColumn("newCol1", when(col("id") === 4, lit(nums(0))).otherwise(col("newCol1")))
.withColumn("newCol2", when(col("id") === 4, lit(nums(1))).otherwise(col("newCol2")))
.show()
//+---+----+----+-------+-------+
//| ID|Name|City|newCol1|newCol2|
//+---+----+----+-------+-------+
//| 1| Ali| lhr| null| null|
//| 2|Ahad| swl| 1| 10|
//| 3|Sana| khi| null| null|
//| 4| ABC| xyz| 10| 20|
//+---+----+----+-------+-------+
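If more columns ever need to be filled the same way, the same when/otherwise can be folded over the column names. A minimal sketch, where targetCols and its positional pairing with nums are assumptions for illustration, not part of the question:
// hypothetical list of target columns, paired positionally with the values in nums
val targetCols = List("newCol1", "newCol2")
val updated = targetCols.zip(nums).foldLeft(df) { case (acc, (c, v)) =>
  acc.withColumn(c, when(col("id") === 4, lit(v)).otherwise(col(c)))
}
updated.show()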

Related

Spark Scala create a new column which contains addition of previous balance amount for each cid

Initial DF:
cid  transAmt  transDate
1    10        2-Aug
1    20        3-Aug
1    30        3-Aug
2    40        2-Aug
2    50        3-Aug
3    60        4-Aug
Output DF:
cid  transAmt  transDate  sumAmt
1    10        2-Aug      **10**
1    20        3-Aug      **30**
1    30        3-Aug      **60**
2    40        2-Aug      **40**
2    50        3-Aug      **90**
3    60        4-Aug      **60**
I need a new column sumAmt which holds the running total of transAmt for each cid.
Use the window sum function to get the cumulative sum.
Example:
df.show()
//+---+------+----------+
//|cid|Amount|transnDate|
//+---+------+----------+
//| 1| 10| 2-Aug|
//| 1| 20| 3-Aug|
//| 2| 40| 2-Aug|
//| 2| 50| 3-Aug|
//| 3| 60| 4-Aug|
//+---+------+----------+
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions._
val w = Window.partitionBy("cid").orderBy("Amount","transnDate")
df.withColumn("sumAmt", sum(col("Amount")).over(w)).show()
//+---+------+----------+------+
//|cid|Amount|transnDate|sumAmt|
//+---+------+----------+------+
//| 1| 10| 2-Aug| 10|
//| 1| 20| 3-Aug| 30|
//| 3| 60| 4-Aug| 60|
//| 2| 40| 2-Aug| 40|
//| 2| 50| 3-Aug| 90|
//+---+------+----------+------+
Just use a simple window specifying the frame with rowsBetween:
Window.unboundedPreceding means no lower limit
Window.currentRow means the current row (pretty obvious)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"col" syntax (assumes a SparkSession named spark)
val cidCategory = Window.partitionBy("cid")
.orderBy("transDate")
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
val result = df.withColumn("sumAmt", sum($"transAmt").over(cidCategory))
OUTPUT
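A sketch of what result.show() should print, based on the expected output in the question (with the two rows tied on transDate for cid 1, their order and therefore their intermediate sums are not guaranteed):
result.show()
//+---+--------+---------+------+
//|cid|transAmt|transDate|sumAmt|
//+---+--------+---------+------+
//|  1|      10|    2-Aug|    10|
//|  1|      20|    3-Aug|    30|
//|  1|      30|    3-Aug|    60|
//|  2|      40|    2-Aug|    40|
//|  2|      50|    3-Aug|    90|
//|  3|      60|    4-Aug|    60|
//+---+--------+---------+------+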

How to delete rows dynamically based on ID and Status = "Removed" in Scala

I have the sample dataset below. How can I delete the IDs dynamically (no hardcoded values) based on the Status column being "Removed"?
Sample Dataset:
ID  Status       Date      Amount
1   New          01/05/20  20
1   Assigned     02/05/20  30
1   In-Progress  02/05/20  50
2   New          02/05/20  30
2   Removed      03/05/20  20
3   In-Progress  09/05/20  50
3   Removed      09/05/20  20
4   New          10/05/20  20
4   Assigned     10/05/20  30
Expected Result:
ID  Status       Date      Amount
1   New          01/05/20  20
1   Assigned     02/05/20  30
1   In-Progress  02/05/20  50
4   New          10/05/20  20
4   Assigned     10/05/20  30
Thanks in Advance.
You can use filter/where with not like or not rlike to filter out records from the dataframe that have status = removed.
import org.apache.spark.sql.functions._
//assuming df is the dataframe
//using filter or where; trim removes whitespace, lower converts to lower case
val df1=df.filter(lower(trim(col("status"))) =!= "removed")
//or by filtering on the literal "Removed" (won't match if you have mixed case)
val df1=df.filter(col("status") =!= "Removed")
//using not like
val df1=df.filter(!lower(col("status")).like("removed"))
//using not rlike
val df1=df.filter(!col("status").rlike(".*(?i)removed.*"))
Now the df1 dataframe has the Removed rows filtered out (note that the other rows for those IDs remain; see the update below for dropping those IDs entirely).
UPDATE:
From Spark 2.4:
We can use a join or a window clause for this case.
val df=Seq((1,"New","01/05/20","20"),(1,"Assigned","02/05/20","30"),(1,"In-Progress","02/05/20","50"),(2,"New","02/05/20","30"),(2,"Removed","03/05/20","20"),(3,"In-Progress","09/05/20","50"),(3,"Removed","09/05/20","20"),(4,"New","10/05/20","20"),(4,"Assigned","10/05/20","30")).toDF("ID","Status","Date","Amount")
import org.apache.spark.sql.expressions._
val df1=df.
groupBy("id").
agg(collect_list(lower(col("Status"))).alias("status_arr"))
//using array_contains function
df.alias("t1").join(df1.alias("t2"),Seq("id"),"inner").
filter(!array_contains(col("status_arr"),"removed")).
drop("status_arr").show()
//without join using window clause
val w=Window.partitionBy("id").orderBy("Status").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("status_arr",collect_list(lower(col("status"))).over(w)).
filter(!array_contains(col("status_arr"),"removed")).
drop("status_arr").
show()
//+---+-----------+--------+------+
//| ID| Status| Date|Amount|
//+---+-----------+--------+------+
//| 1| New|01/05/20| 20|
//| 1| Assigned|02/05/20| 30|
//| 1|In-Progress|02/05/20| 50|
//| 4| New|10/05/20| 20|
//| 4| Assigned|10/05/20| 30|
//+---+-----------+--------+------+
For Spark < 2.4:
val df1=df.groupBy("id").agg(concat_ws("",collect_list(lower(col("Status")))).alias("status_arr"))
df.alias("t1").join(df1.alias("t2"),Seq("id"),"inner").
filter(!col("status_arr").contains("removed")).
drop(col("status_arr")).
show()
//Using window functions
df.withColumn("status_arr",concat_ws("",collect_list(lower(col("status"))).over(w))).
filter(!col("status_arr").contains("removed")).
drop(col("status_arr")).
show(false)
//+---+-----------+--------+------+
//| ID| Status| Date|Amount|
//+---+-----------+--------+------+
//| 1| New|01/05/20| 20|
//| 1| Assigned|02/05/20| 30|
//| 1|In-Progress|02/05/20| 50|
//| 4| New|10/05/20| 20|
//| 4| Assigned|10/05/20| 30|
//+---+-----------+--------+------+
Assuming res0 is your dataset, you could do:
import spark.implicits._
val x = res0.where($"Status" =!= "Removed")
x.show()
This will remove the rows with Status "Removed", but it won't give what you want to achieve based on what you have posted above, since the other rows for those IDs remain.
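If the goal is also to drop the remaining rows for those IDs (as in the expected result), one further option, sketched here as a suggestion rather than taken from the answers above, is a left_anti join against the IDs that ever had a Removed status:
import org.apache.spark.sql.functions._
// IDs that have at least one row with Status = "Removed"
val removedIds = df.filter(lower(trim(col("Status"))) === "removed").select("ID").distinct()
// keep only rows whose ID never had a Removed status
val kept = df.join(removedIds, Seq("ID"), "left_anti")
kept.show()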

Add a New column in pyspark Dataframe (alternative of .apply in pandas DF)

I have a pyspark.sql.DataFrame df:
id  col1
1   abc
2   bcd
3   lal
4   bac
I want to add one more column flag to the df such that if id is an odd number, flag should be 'odd', and if even, 'even'.
The final output should be:
id  col1  flag
1   abc   odd
2   bcd   even
3   lal   odd
4   bac   even
I tried:
def myfunc(num):
    if num % 2 == 0:
        flag = 'EVEN'
    else:
        flag = 'ODD'
    return flag

df['new_col'] = df['id'].map(lambda x: myfunc(x))
df['new_col'] = df['id'].apply(lambda x: myfunc(x))
It gave me the error: TypeError: 'Column' object is not callable
How do I use .apply (as I do in a pandas dataframe) in pyspark?
PySpark doesn't provide .apply; the alternative is to use the withColumn function to perform this operation.
from pyspark.sql import functions as F

df = sqlContext.createDataFrame([
    [1, "abc"],
    [2, "bcd"],
    [3, "lal"],
    [4, "bac"]
], ["id", "col1"])
df.show()
+---+----+
| id|col1|
+---+----+
| 1| abc|
| 2| bcd|
| 3| lal|
| 4| bac|
+---+----+
df.withColumn(
    "flag",
    F.when(F.col("id") % 2 == 0, F.lit("Even")).otherwise(F.lit("odd"))
).show()
+---+----+----+
| id|col1|flag|
+---+----+----+
| 1| abc| odd|
| 2| bcd|Even|
| 3| lal| odd|
| 4| bac|Even|
+---+----+----+

scala find group by id max date

I need to group by Id and Times and show the max date.
Id  Key  Times  date
20  40   1      20190323
20  41   1      20191201
31  33   3      20191209
My output should be:
Id  Key  Times  date
20  41   1      20191201
31  33   3      20191209
You can simply apply the groupBy function to group by Id and then join with the original dataset to get the Key column into your resulting dataframe. Try the following code:
//your original dataframe
import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF and the $"col" syntax (assumes a SparkSession named spark)

val df = Seq((20,40,1,20190323),(20,41,1,20191201),(31,33,3,20191209))
  .toDF("Id","Key","Times","date")
df.show()
//output
//+---+---+-----+--------+
//| Id|Key|Times| date|
//+---+---+-----+--------+
//| 20| 40| 1|20190323|
//| 20| 41| 1|20191201|
//| 31| 33| 3|20191209|
//+---+---+-----+--------+
//group by Id column
val maxDate = df.groupBy("Id").agg(max("date").as("maxdate"))
//join with original DF to get rest of the column
maxDate.join(df, Seq("Id"))
.where($"date" === $"maxdate")
.select("Id","Key","Times","date").show()
//output
//+---+---+-----+--------+
//| Id|Key|Times| date|
//+---+---+-----+--------+
//| 31| 33| 3|20191209|
//| 20| 41| 1|20191201|
//+---+---+-----+--------+
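An alternative that avoids the self-join is a window function: rank the rows of each Id by date descending and keep the first one. A sketch using the same df (row_number is my suggestion here, not part of the original answer):
import org.apache.spark.sql.expressions.Window
// latest date first within each Id
val w = Window.partitionBy("Id").orderBy($"date".desc)
df.withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")
  .show()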

Spark: Add column to dataframe conditionally

I am trying to take my input data:
A   B     C
--------------
4   blah  2
2         3
56  foo   3
And add a column to the end based on whether B is empty or not:
A   B     C  D
--------------------
4   blah  2  1
2         3  0
56  foo   3  1
I can do this easily by registering the input dataframe as a temp table, then typing up a SQL query.
But I'd really like to know how to do this with just Scala methods and not having to type out a SQL query within Scala.
I've tried .withColumn, but I can't get that to do what I want.
Try withColumn with the function when as follows:
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // for `toDF` and $""
import org.apache.spark.sql.functions._ // for `when`
val df = sc.parallelize(Seq((4, "blah", 2), (2, "", 3), (56, "foo", 3), (100, null, 5)))
.toDF("A", "B", "C")
val newDf = df.withColumn("D", when($"B".isNull or $"B" === "", 0).otherwise(1))
newDf.show() shows
+---+----+---+---+
| A| B| C| D|
+---+----+---+---+
| 4|blah| 2| 1|
| 2| | 3| 0|
| 56| foo| 3| 1|
|100|null| 5| 0|
+---+----+---+---+
I added the (100, null, 5) row for testing the isNull case.
I tried this code with Spark 1.6.0, but as noted in the documentation of when, it works on versions after 1.4.0.
My bad, I had missed one part of the question.
The best, cleanest way is to use a UDF.
Explanation within the code.
// create some example data...BY DataFrame
// note, third record has an empty string
case class Stuff(a:String, b:Int)
val d = sc.parallelize(Seq(("a",1), ("b",2), ("",3), ("d",4)).map { x => Stuff(x._1, x._2) }).toDF
// now the good stuff.
import org.apache.spark.sql.functions.udf
// function that returns 0 if the string is empty
val func = udf( (s:String) => if(s.isEmpty) 0 else 1 )
// create new dataframe with added column named "notempty"
val r = d.select( $"a", $"b", func($"a").as("notempty") )
scala> r.show
+---+---+--------+
|  a|  b|notempty|
+---+---+--------+
|  a|  1|       1|
|  b|  2|       1|
|   |  3|       0|
|  d|  4|       1|
+---+---+--------+
How about something like this?
val newDF = df.filter($"B" === "").take(1) match {
  case Array() => df
  case _ => df.withColumn("D", $"B" === "")
}
Using take(1) should have a minimal performance hit.
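If you want D as 0/1 (as in the desired output) rather than a boolean, one variation is to cast the condition to an integer. A sketch reusing the df and $ syntax from the answers above:
// 1 when B is non-null and non-empty, 0 otherwise (the boolean condition is cast to int)
val withD = df.withColumn("D", ($"B".isNotNull && !($"B" === "")).cast("int"))
withD.show()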