Spark: Row filter based on Column value - scala

I have millions of rows in a dataframe like this:
val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE")).toDF("id", "status")
scala> df.show(false)
+---+--------+
|id |status |
+---+--------+
|id1|ACTIVE |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE |
|id3|INACTIVE|
|id3|INACTIVE|
+---+--------+
Now I want to divide this data into three separate DataFrames, like this:
Only ACTIVE ids (like id2), say activeDF
Only INACTIVE ids (like id3), say inactiveDF
Having both ACTIVE and INACTIVE as status, say bothDF
How can I calculate activeDF and inactiveDF?
I know that bothDF can be calculated like
df.select("id").distinct.except(activeDF).except(inactiveDF)
but this will involve shuffling (as the 'distinct' operation requires it). Is there a better way to calculate bothDF?
Versions:
Spark : 2.2.1
Scala : 2.11

The most elegant solution is to pivot on status:
val counts = df
  .groupBy("id")
  .pivot("status", Seq("ACTIVE", "INACTIVE"))
  .count
or an equivalent direct agg:
val counts = df
  .groupBy("id")
  .agg(
    count(when($"status" === "ACTIVE", true)) as "ACTIVE",
    count(when($"status" === "INACTIVE", true)) as "INACTIVE"
  )
followed by a simple CASE ... WHEN:
val result = counts.withColumn(
  "status",
  when($"ACTIVE" === 0, "INACTIVE")
    .when($"INACTIVE" === 0, "ACTIVE")
    .otherwise("BOTH")
)
result.show
+---+------+--------+--------+
| id|ACTIVE|INACTIVE| status|
+---+------+--------+--------+
|id3| 0| 2|INACTIVE|
|id1| 1| 2| BOTH|
|id2| 1| 0| ACTIVE|
+---+------+--------+--------+
Later you can separate the result with filters or dump it to disk with a source that supports partitionBy (see: How to split a dataframe into dataframes with same column values?).
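For example, a minimal sketch of the filter-based split (using the activeDF/inactiveDF/bothDF names from the question):
val activeDF   = result.filter($"status" === "ACTIVE").select("id")
val inactiveDF = result.filter($"status" === "INACTIVE").select("id")
val bothDF     = result.filter($"status" === "BOTH").select("id")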

Just another way: groupBy, collect the statuses as a set, and then if the size of the set is 1 the id is either ACTIVE-only or INACTIVE-only, else it has both.
scala> val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE"), ("id4", "ACTIVE"), ("id5", "ACTIVE"), ("id6", "INACTIVE"), ("id7", "ACTIVE"), ("id7", "INACTIVE")).toDF("id", "status")
df: org.apache.spark.sql.DataFrame = [id: string, status: string]
scala> df.show(false)
+---+--------+
|id |status |
+---+--------+
|id1|ACTIVE |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE |
|id3|INACTIVE|
|id3|INACTIVE|
|id4|ACTIVE |
|id5|ACTIVE |
|id6|INACTIVE|
|id7|ACTIVE |
|id7|INACTIVE|
+---+--------+
scala> val allstatusDF = df.groupBy("id").agg(collect_set("status") as "allstatus")
allstatusDF: org.apache.spark.sql.DataFrame = [id: string, allstatus: array<string>]
scala> allstatusDF.show(false)
+---+------------------+
|id |allstatus |
+---+------------------+
|id7|[ACTIVE, INACTIVE]|
|id3|[INACTIVE] |
|id5|[ACTIVE] |
|id6|[INACTIVE] |
|id1|[ACTIVE, INACTIVE]|
|id2|[ACTIVE] |
|id4|[ACTIVE] |
+---+------------------+
scala> allstatusDF.withColumn("status", when(size($"allstatus") === 1, $"allstatus".getItem(0)).otherwise("BOTH")).show(false)
+---+------------------+--------+
|id |allstatus |status |
+---+------------------+--------+
|id7|[ACTIVE, INACTIVE]|BOTH |
|id3|[INACTIVE] |INACTIVE|
|id5|[ACTIVE] |ACTIVE |
|id6|[INACTIVE] |INACTIVE|
|id1|[ACTIVE, INACTIVE]|BOTH |
|id2|[ACTIVE] |ACTIVE |
|id4|[ACTIVE] |ACTIVE |
+---+------------------+--------+
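From here, one possible sketch of the disk-based split (the output path below is hypothetical) writes the result with a source that supports partitionBy:
val statusDF = allstatusDF.withColumn("status",
  when(size($"allstatus") === 1, $"allstatus".getItem(0)).otherwise("BOTH"))
statusDF.select("id", "status")
  .write
  .partitionBy("status")            // one sub-directory per status value
  .parquet("/tmp/ids_by_status")    // hypothetical path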

Related

Scala - Return the largest string within each group

DataSet:
+---+--------+
|age| name|
+---+--------+
| 33| Will|
| 26|Jean-Luc|
| 55| Hugh|
| 40| Deanna|
| 68| Quark|
| 59| Weyoun|
| 37| Gowron|
| 54| Will|
| 38| Jadzia|
| 27| Hugh|
+---+--------+
Here is my attempt, but it just returns the length of the longest string rather than the string itself:
AgeName.groupBy("age")
.agg(max(length(AgeName("name")))).show()
The usual row_number trick should work if you specify the Window correctly. Using @LeoC's example,
val df = Seq(
(35, "John"),
(22, "Jennifer"),
(22, "Alexander"),
(35, "Michelle"),
(22, "Celia")
).toDF("age", "name")
val df2 = df.withColumn(
"rownum",
expr("row_number() over (partition by age order by length(name) desc)")
).filter("rownum = 1").drop("rownum")
df2.show
+---+---------+
|age| name|
+---+---------+
| 22|Alexander|
| 35| Michelle|
+---+---------+
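For reference, a sketch of the same window expressed with the DataFrame API instead of a SQL expression string:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{length, row_number}

val w = Window.partitionBy("age").orderBy(length($"name").desc)
val df3 = df.withColumn("rownum", row_number().over(w))
  .filter($"rownum" === 1)
  .drop("rownum")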
Here's one approach using the Spark higher-order function aggregate, as shown below:
val df = Seq(
(35, "John"),
(22, "Jennifer"),
(22, "Alexander"),
(35, "Michelle"),
(22, "Celia")
).toDF("age", "name")
df.
groupBy("age").agg(collect_list("name").as("names")).
withColumn(
"longest_name",
expr("aggregate(names, '', (acc, x) -> case when length(acc) < length(x) then x else acc end)")
).
show(false)
// +---+----------------------------+------------+
// |age|names |longest_name|
// +---+----------------------------+------------+
// |22 |[Jennifer, Alexander, Celia]|Alexander |
// |35 |[John, Michelle] |Michelle |
// +---+----------------------------+------------+
Note that higher-order functions are available only on Spark 2.4+.
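On older versions, a possible alternative (a sketch, not from the original answer) is the max-of-struct trick: ordering a struct by the name length lets a plain aggregation return the longest name.
import org.apache.spark.sql.functions.{length, max, struct}

df.groupBy("age")
  .agg(max(struct(length($"name").as("len"), $"name".as("name"))).as("m"))
  .select($"age", $"m.name".as("longest_name"))
  .show(false)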
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, length, max}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object BasicDatasetTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("BasicDatasetTest")
      .getOrCreate()
    val pairs = List(
      (33, "Will"), (26, "Jean-Luc"), (55, "Hugh"), (26, "Deanna"), (26, "Quark"),
      (55, "Weyoun"), (33, "Gowron"), (55, "Will"), (26, "Jadzia"), (27, "Hugh"))
    val schema = new StructType(Array(
      StructField("age", IntegerType, false),
      StructField("name", StringType, false))
    )
    val dataRDD = spark.sparkContext.parallelize(pairs).map(record => Row(record._1, record._2))
    val dataset = spark.createDataFrame(dataRDD, schema)
    // Longest name length per (age, name) pair
    val ageNameGroup = dataset.groupBy("age", "name")
      .agg(max(length(col("name"))))
      .withColumnRenamed("max(length(name))", "length")
    ageNameGroup.printSchema()
    // Longest name length per age
    val ageGroup = dataset.groupBy("age")
      .agg(max(length(col("name"))))
      .withColumnRenamed("max(length(name))", "length")
    ageGroup.printSchema()
    ageGroup.createOrReplaceTempView("age_group")
    ageNameGroup.createOrReplaceTempView("age_name_group")
    // Join back to recover the name whose length equals the per-age maximum
    spark.sql("select ag.age, ang.name from age_group as ag, age_name_group as ang " +
      "where ag.age = ang.age and ag.length = ang.length")
      .show()
  }
}

Find the facility with the longest interval without accidents using Apache Spark SQL

I have the following dataset:
|facility|date |accidents|
| foo |2019-01-01|1 |
| foo |2019-01-02|null |
| foo |2019-01-03|null |
| foo |2019-01-04|2 |
| bar |2019-01-01|1 |
| bar |2019-01-02|null |
| bar |2019-01-03|3 |
And the goal is to find a facility with the longest continuous period of time without accidents:
|facility|startDate |interval|
|foo |2019-01-02|2 |
Is it possible to do this using Spark SQL? Thanks
P.S. Code sample:
import java.sql.Date
import org.apache.spark.sql.SparkSession

case class FacilityRecord(name: String, date: Date, accidents: Option[Int])
case class IntervalWithoutAccidents(name: String, startDate: Date, interval: Int)

implicit val spark: SparkSession = SparkSession.builder
  .appName("Test")
  .master("local")
  .getOrCreate()
import spark.implicits._
val facilityRecords = Seq(
FacilityRecord("foo", Date.valueOf("2019-01-01"), Some(1)),
FacilityRecord("foo", Date.valueOf("2019-01-02"), None),
FacilityRecord("foo", Date.valueOf("2019-01-03"), None),
FacilityRecord("foo", Date.valueOf("2019-01-04"), Some(2)),
FacilityRecord("bar", Date.valueOf("2019-01-01"), Some(1)),
FacilityRecord("bar", Date.valueOf("2019-01-02"), None),
FacilityRecord("bar", Date.valueOf("2019-01-03"), Some(3))
)
val facilityRecordsDataset = spark.createDataset(facilityRecords)
facilityRecordsDataset.show()
val intervalWithoutAccidents: IntervalWithoutAccidents = ??? // TODO: find the interval
val expectedInterval = IntervalWithoutAccidents("foo", startDate = Date.valueOf("2019-01-02"), interval = 2)
assert(expectedInterval == intervalWithoutAccidents)
println(intervalWithoutAccidents)
Here's a 2-step approach:
Step #1: Create a column accident_date and, for each facility, compute an interval value on every row as the difference between the current date and the next accident date, using the Window function first.
Step #2: Compute the max interval per facility using the Window function max and filter for the rows that have that max interval value.
Example code below:
import java.sql.Date
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = Seq(
("foo", Date.valueOf("2019-01-01"), Some(1)),
("foo", Date.valueOf("2019-01-02"), None),
("foo", Date.valueOf("2019-01-03"), None),
("foo", Date.valueOf("2019-01-04"), Some(2)),
("bar", Date.valueOf("2019-01-01"), Some(1)),
("bar", Date.valueOf("2019-01-02"), None),
("bar", Date.valueOf("2019-01-03"), Some(3))
).toDF("facility", "date", "accidents")
val win = Window.partitionBy($"facility").orderBy($"date").
rowsBetween(0, Window.unboundedFollowing)
Step #1: Compute interval
val df2 = df.
withColumn("accident_date", when($"accidents".isNotNull, $"date")).
withColumn("interval",
datediff(first($"accident_date", ignoreNulls=true).over(win), $"date")
)
df2.show
// +--------+----------+---------+-------------+--------+
// |facility| date|accidents|accident_date|interval|
// +--------+----------+---------+-------------+--------+
// | bar|2019-01-01| 1| 2019-01-01| 0|
// | bar|2019-01-02| null| null| 1|
// | bar|2019-01-03| 3| 2019-01-03| 0|
// | foo|2019-01-01| 1| 2019-01-01| 0|
// | foo|2019-01-02| null| null| 2|
// | foo|2019-01-03| null| null| 1|
// | foo|2019-01-04| 2| 2019-01-04| 0|
// +--------+----------+---------+-------------+--------+
Step #2: Compute max interval
df2.select($"facility", $"date".as("start_date"),
max($"interval").over(Window.partitionBy($"facility")).as("max_interval")
).
where($"interval" === $"max_interval").
show
// +--------+----------+------------+
// |facility|start_date|max_interval|
// +--------+----------+------------+
// | bar|2019-01-02| 1|
// | foo|2019-01-02| 2|
// +--------+----------+------------+
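If only the single longest interval is needed (to match the asker's expected IntervalWithoutAccidents), a possible final step (a sketch, not part of the original answer) is to sort by the interval and take the top row:
val winner = df2
  .orderBy($"interval".desc)
  .select($"facility", $"date".as("startDate"), $"interval")
  .as[(String, java.sql.Date, Int)]
  .head
val intervalWithoutAccidents =
  IntervalWithoutAccidents(winner._1, winner._2, winner._3)
// IntervalWithoutAccidents(foo,2019-01-02,2)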
Well, we can derive this in a single step with the SQL analytic functions lag() over (window) and first() over (window).
Here is the code (I added one extra test case where 'foo' had another accident on "2019-01-05" and also no accident on "2019-01-06"):
val accidentDf = Seq(
("foo", Date.valueOf("2019-01-01"), Some(1)),
("foo", Date.valueOf("2019-01-02"), None),
("foo", Date.valueOf("2019-01-03"), None),
("foo", Date.valueOf("2019-01-04"), Some(2)),
("bar", Date.valueOf("2019-01-01"), Some(1)),
("bar", Date.valueOf("2019-01-02"), None),
("bar", Date.valueOf("2019-01-03"), Some(3)),
("foo", Date.valueOf("2019-01-05"), Some(3)),
("foo", Date.valueOf("2019-01-06"), None)
).toDF("facility", "date", "accidents")
accidentDf.createOrReplaceTempView("accident_table")
Now, for each row in a facility partition, we determine when the last accident was reported: if the row has no accident we fall back to the first date in the partition (the first value), otherwise we use the row's own date as the accident date; the lag of this value over the window becomes last_accident_report_date.
Then, for each row, we compute the datediff between its date and that last_accident_report_date.
Then we select where the datediff is highest.
Here is the query
val sparkSql="""select facility,date,accidents ,
lag(CASE
WHEN accidents is NULL
then first(date) over(partition by facility order by date)
else date END ,1)
over(partition by facility order by date) as last_accident_report_date ,
datediff(date,
lag(CASE WHEN accidents is NULL then first(date)
over(partition by facility order by date) else date END ,1)
over(partition by facility order by date))
as no_accident_days_rank from accident_table order by no_accident_days_rank desc, facility"""
RESULT
scala> spark.sql(sparkSql).show(20,false)
+--------+----------+---------+-------------------------+---------------------+
|facility|date |accidents|last_accident_report_date|no_accident_days_rank|
+--------+----------+---------+-------------------------+---------------------+
|foo |2019-01-04|2 |2019-01-01 |3 |
|bar |2019-01-03|3 |2019-01-01 |2 |
|foo |2019-01-03|null |2019-01-01 |2 |
|bar |2019-01-02|null |2019-01-01 |1 |
|foo |2019-01-02|null |2019-01-01 |1 |
|foo |2019-01-06|null |2019-01-05 |1 |
|foo |2019-01-05|3 |2019-01-04 |1 |
|bar |2019-01-01|1 |null |null |
|foo |2019-01-01|1 |null |null |
+--------+----------+---------+-------------------------+---------------------+
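If only the facility with the longest gap is needed, one possible follow-up (a sketch, not part of the original answer) keeps just the top-ranked row:
val ranked = spark.sql(sparkSql)
ranked.where($"no_accident_days_rank".isNotNull)
  .orderBy($"no_accident_days_rank".desc)
  .limit(1)
  .show(false)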

How to split Comma-separated multiple columns into multiple rows?

I have a dataframe with N fields, as mentioned below. The number of columns and the length of the values will vary.
Input Table:
+--------------+-----------+-----------------------+
|Date |Amount |Status |
+--------------+-----------+-----------------------+
|2019,2018,2017|100,200,300|IN,PRE,POST |
|2018 |73 |IN |
|2018,2017 |56,89 |IN,PRE |
+--------------+-----------+-----------------------+
I have to convert it into the below format with one sequence column.
Expected Output Table:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019|100   |IN    |1       |
|2018|200   |PRE   |2       |
|2017|300   |POST  |3       |
|2018|73    |IN    |1       |
|2018|56    |IN    |1       |
|2017|89    |PRE   |2       |
+----+------+------+--------+
I have tried using explode, but explode takes only one array at a time.
var df = dataRefined.withColumn("TOT_OVRDUE_TYPE", explode(split($"TOT_OVRDUE_TYPE", "\\"))).toDF
var df1 = df.withColumn("TOT_OD_TYPE_AMT", explode(split($"TOT_OD_TYPE_AMT", "\\"))).show
Does someone know how I can do it? Thank you for your help.
Here is another approach using posexplode for each column and joining all produced dataframes into one:
import org.apache.spark.sql.functions.{posexplode, monotonically_increasing_id, col}
val df = Seq(
(Seq("2019", "2018", "2017"), Seq("100", "200", "300"), Seq("IN", "PRE", "POST")),
(Seq("2018"), Seq("73"), Seq("IN")),
(Seq("2018", "2017"), Seq("56", "89"), Seq("IN", "PRE")))
.toDF("Date","Amount", "Status")
.withColumn("idx", monotonically_increasing_id)
df.columns.filter(_ != "idx").map{
c => df.select($"idx", posexplode(col(c))).withColumnRenamed("col", c)
}
.reduce((ds1, ds2) => ds1.join(ds2, Seq("idx", "pos")))
.select($"Date", $"Amount", $"Status", $"pos".plus(1).as("Sequence"))
.show
Output:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
You can achieve this by using the built-in DataFrame functions arrays_zip, split and posexplode (note that arrays_zip requires Spark 2.4+).
Explanation:
scala>val df=Seq((("2019,2018,2017"),("100,200,300"),("IN,PRE,POST")),(("2018"),("73"),("IN")),(("2018,2017"),("56,89"),("IN,PRE"))).toDF("date","amount","status")
scala> :paste
df.selectExpr("""posexplode(
    arrays_zip(
      split(date,","),
      split(amount,","),
      split(status,",")))
    as (p, colum)""")               // split each string on ',', zip the arrays, and posexplode to get position and value
  .selectExpr(
    "colum.`0` as Date",            // element 0 of the zipped struct is the date
    "colum.`1` as Amount",
    "colum.`2` as Status",
    "p + 1 as Sequence")            // add 1 to the position value
  .show()
Result:
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
Yes, I personally also find explode a bit annoying and in your case I would probably go with a flatMap instead:
import spark.implicits._
import org.apache.spark.sql.Row
val df = spark.sparkContext.parallelize(Seq(
  (Seq(2019, 2018, 2017), Seq(100, 200, 300), Seq("IN", "PRE", "POST")),
  (Seq(2018), Seq(73), Seq("IN")),
  (Seq(2018, 2017), Seq(56, 89), Seq("IN", "PRE"))
)).toDF()
val transformedDF = df
.flatMap{case Row(dates: Seq[Int], amounts: Seq[Int], statuses: Seq[String]) =>
dates.indices.map(index => (dates(index), amounts(index), statuses(index), index+1))}
.toDF("Date", "Amount", "Status", "Sequence")
Output:
transformedDF.show
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019| 100| IN| 1|
|2018| 200| PRE| 2|
|2017| 300| POST| 3|
|2018| 73| IN| 1|
|2018| 56| IN| 1|
|2017| 89| PRE| 2|
+----+------+------+--------+
Assuming the number of data elements in each column is the same for each row:
First, I recreated your DataFrame
import org.apache.spark.sql._
import scala.collection.mutable.ListBuffer
import spark.implicits._   // needed for .toDF on the flatMapped RDD (assumes a SparkSession named spark)
val df = Seq(("2019,2018,2017", "100,200,300", "IN,PRE,POST"), ("2018", "73", "IN"),
("2018,2017", "56,89", "IN,PRE")).toDF("Date", "Amount", "Status")
Next, I split the rows and added a sequence value, then converted back to a DF:
val exploded = df.rdd.flatMap(row => {
val buffer = new ListBuffer[(String, String, String, Int)]
val dateSplit = row(0).toString.split("\\,", -1)
val amountSplit = row(1).toString.split("\\,", -1)
val statusSplit = row(2).toString.split("\\,", -1)
val seqSize = dateSplit.size
for(i <- 0 to seqSize-1)
buffer += Tuple4(dateSplit(i), amountSplit(i), statusSplit(i), i+1)
buffer.toList
}).toDF((df.columns:+"Sequence"): _*)
I'm sure there are other ways to do it without first converting the DF to an RDD, but this will still result in a DF with the correct answer.
Let me know if you have any questions.
I took advantage of transpose to zip all the sequences by position and then did a posexplode. The selects on the DataFrames are built dynamically to satisfy the condition from the question that the number of columns and the length of the values will vary.
import org.apache.spark.sql.functions._
val df = Seq(
("2019,2018,2017", "100,200,300", "IN,PRE,POST"),
("2018", "73", "IN"),
("2018,2017", "56,89", "IN,PRE")
).toDF("Date", "Amount", "Status")
df: org.apache.spark.sql.DataFrame = [Date: string, Amount: string ... 1 more field]
scala> df.show(false)
+--------------+-----------+-----------+
|Date |Amount |Status |
+--------------+-----------+-----------+
|2019,2018,2017|100,200,300|IN,PRE,POST|
|2018 |73 |IN |
|2018,2017 |56,89 |IN,PRE |
+--------------+-----------+-----------+
scala> def transposeSeqOfSeq[S](x:Seq[Seq[S]]): Seq[Seq[S]] = { x.transpose }
transposeSeqOfSeq: [S](x: Seq[Seq[S]])Seq[Seq[S]]
scala> val myUdf = udf { transposeSeqOfSeq[String] _}
myUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true))))
scala> val df2 = df.select(df.columns.map(c => split(col(c), ",") as c): _*)
df2: org.apache.spark.sql.DataFrame = [Date: array<string>, Amount: array<string> ... 1 more field]
scala> df2.show(false)
+------------------+---------------+---------------+
|Date |Amount |Status |
+------------------+---------------+---------------+
|[2019, 2018, 2017]|[100, 200, 300]|[IN, PRE, POST]|
|[2018] |[73] |[IN] |
|[2018, 2017] |[56, 89] |[IN, PRE] |
+------------------+---------------+---------------+
scala> val df3 = df2.withColumn("allcols", array(df.columns.map(c => col(c)): _*))
df3: org.apache.spark.sql.DataFrame = [Date: array<string>, Amount: array<string> ... 2 more fields]
scala> df3.show(false)
+------------------+---------------+---------------+------------------------------------------------------+
|Date |Amount |Status |allcols |
+------------------+---------------+---------------+------------------------------------------------------+
|[2019, 2018, 2017]|[100, 200, 300]|[IN, PRE, POST]|[[2019, 2018, 2017], [100, 200, 300], [IN, PRE, POST]]|
|[2018] |[73] |[IN] |[[2018], [73], [IN]] |
|[2018, 2017] |[56, 89] |[IN, PRE] |[[2018, 2017], [56, 89], [IN, PRE]] |
+------------------+---------------+---------------+------------------------------------------------------+
scala> val df4 = df3.withColumn("ab", myUdf($"allcols")).select($"ab", posexplode($"ab"))
df4: org.apache.spark.sql.DataFrame = [ab: array<array<string>>, pos: int ... 1 more field]
scala> df4.show(false)
+------------------------------------------------------+---+-----------------+
|ab |pos|col |
+------------------------------------------------------+---+-----------------+
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|0 |[2019, 100, IN] |
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|1 |[2018, 200, PRE] |
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|2 |[2017, 300, POST]|
|[[2018, 73, IN]] |0 |[2018, 73, IN] |
|[[2018, 56, IN], [2017, 89, PRE]] |0 |[2018, 56, IN] |
|[[2018, 56, IN], [2017, 89, PRE]] |1 |[2017, 89, PRE] |
+------------------------------------------------------+---+-----------------+
scala> val selCols = (0 until df.columns.length).map(i => $"col".getItem(i).as(df.columns(i))) :+ ($"pos"+1).as("Sequence")
selCols: scala.collection.immutable.IndexedSeq[org.apache.spark.sql.Column] = Vector(col[0] AS `Date`, col[1] AS `Amount`, col[2] AS `Status`, (pos + 1) AS `Sequence`)
scala> df4.select(selCols:_*).show(false)
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019|100 |IN |1 |
|2018|200 |PRE |2 |
|2017|300 |POST |3 |
|2018|73 |IN |1 |
|2018|56 |IN |1 |
|2017|89 |PRE |2 |
+----+------+------+--------+
This is why I love the Spark core APIs: with just map and flatMap you can handle many problems. Pass your df and an instance of SQLContext to the method below and it will give the desired result -
def reShapeDf(df: DataFrame, sqlContext: SQLContext): DataFrame = {
  val rdd = df.rdd.map(m => (m.getAs[String](0), m.getAs[String](1), m.getAs[String](2)))
  val rdd1 = rdd.flatMap(a => a._1.split(",").zip(a._2.split(",")).zip(a._3.split(",")))
  val rdd2 = rdd1.map { case ((a, b), c) => (a, b, c) }
  sqlContext.createDataFrame(rdd2.map(m => Row.fromTuple(m)), df.schema)
}
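A possible usage sketch (assuming a spark-shell session, so spark and its implicits are in scope):
val input = Seq(
  ("2019,2018,2017", "100,200,300", "IN,PRE,POST"),
  ("2018", "73", "IN"),
  ("2018,2017", "56,89", "IN,PRE")
).toDF("Date", "Amount", "Status")
reShapeDf(input, spark.sqlContext).show()   // note: this variant does not produce the Sequence column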

Is it possible to ignore null values when using LEAD window function in Spark?

My dataframe looks like this:
id value date
1 100 2017
1 null 2016
1 20 2015
1 100 2014
I would like to get the most recent previous value, but ignoring nulls:
id value date recent value
1 100 2017 20
1 null 2016 20
1 20 2015 100
1 100 2014 null
Is there any way to ignore null values while using the lead window function?
Is it possible to ignore null values when using lead window function in Spark
It is not.
I would like to get most recent value but ignoring null
Just use last (or first) with ignoreNulls:
def last(columnName: String, ignoreNulls: Boolean): Column
Aggregate function: returns the last value of the column in a group.
The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val df = Seq(
(1, Some(100), 2017), (1, None, 2016), (1, Some(20), 2015),
(1, Some(100), 2014)
).toDF("id", "value", "date")
df.withColumn(
"last_value",
last("value", true).over(Window.partitionBy("id").orderBy("date"))
).show
+---+-----+----+----------+
| id|value|date|last_value|
+---+-----+----+----------+
| 1| 100|2014| 100|
| 1| 20|2015| 20|
| 1| null|2016| 20|
| 1| 100|2017| 100|
+---+-----+----+----------+
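For the exact "previous non-null value" semantics shown in the question (the 2014 row should get null), one possible pre-3.2 variant (a sketch) combines first with ignoreNulls and a frame that starts one row after the current row:
val w = Window.partitionBy("id").orderBy($"date".desc)
  .rowsBetween(1, Window.unboundedFollowing)
df.withColumn("recent_value", first("value", true).over(w))
  .orderBy($"date".desc)
  .show()
// expected:
// +---+-----+----+------------+
// | id|value|date|recent_value|
// +---+-----+----+------------+
// |  1|  100|2017|          20|
// |  1| null|2016|          20|
// |  1|   20|2015|         100|
// |  1|  100|2014|        null|
// +---+-----+----+------------+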
Spark 3.2+ provides ignoreNulls inside lead and lag in Scala.
lead(e: Column, offset: Int, defaultValue: Any, ignoreNulls: Boolean): Column
lag(e: Column, offset: Int, defaultValue: Any, ignoreNulls: Boolean): Column
Test input:
import org.apache.spark.sql.expressions.Window
val df = Seq[(Integer, Integer, Integer)](
(1, 100, 2017),
(1, null, 2016),
(1, 20, 2015),
(1, 100, 2014)
).toDF("id", "value", "date")
lead:
val w = Window.partitionBy("id").orderBy(desc("date"))
val df2 = df.withColumn("lead_val", lead($"value", 1, null, true).over(w))
df2.show()
// +---+-----+----+--------+
// | id|value|date|lead_val|
// +---+-----+----+--------+
// | 1| 100|2017| 20|
// | 1| null|2016| 20|
// | 1| 20|2015| 100|
// | 1| 100|2014| null|
// +---+-----+----+--------+
lag:
val w = Window.partitionBy("id").orderBy("date")
val df2 = df.withColumn("lead_val", lag($"value", 1, null, true).over(w))
df2.show()
// +---+-----+----+--------+
// | id|value|date|lead_val|
// +---+-----+----+--------+
// | 1| 100|2014| null|
// | 1| 20|2015| 100|
// | 1| null|2016| 20|
// | 1| 100|2017| 20|
// +---+-----+----+--------+
You could do it in two steps:
Create a table with the non-null values
Join it with the original table
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val df = Seq(
(1, Some(100), 2017),
(1, None, 2016),
(1, Some(20), 2015),
(1, Some(100), 2014)
).toDF("id", "value", "date")
// Step 1
val filledDf = df
.where($"value".isNotNull)
.withColumnRenamed("value", "recent_value")
// Step 2
val window: WindowSpec = Window.partitionBy("l.id", "l.date").orderBy($"r.date".desc)
val finalDf = df.as("l")
.join(filledDf.as("r"), $"l.id" === $"r.id" && $"l.date" > $"r.date", "left")
.withColumn("rn", row_number().over(window))
.where($"rn" === 1)
.select("l.id", "l.date", "value", "recent_value")
finalDf.orderBy($"date".desc).show
+---+----+-----+------------+
| id|date|value|recent_value|
+---+----+-----+------------+
| 1|2017| 100| 20|
| 1|2016| null| 20|
| 1|2015| 20| 100|
| 1|2014| 100| null|
+---+----+-----+------------+

perform join on multiple DataFrame in spark

I have 3 dataframes generated from 3 different processes.
Every dataframe has columns with the same names.
My dataframes look like this:
id val1 val2 val3 val4
1 null null null null
2 A2 A21 A31 A41
id val1 val2 val3 val4
1 B1 B21 B31 B41
2 null null null null
id val1 val2 val3 val4
1 C1 C2 C3 C4
2 C11 C12 C13 C14
Out of these 3 dataframes, I want to create two dataframes (final and consolidated).
For final, the order of preference is:
dataframe 1 > dataframe 2 > dataframe 3
If a result is present in dataframe 1 (val1 != null), I will store that row in the final dataframe.
My final result should be :
id finalVal1 finalVal2 finalVal3 finalVal4
1 B1 B21 B31 B41
2 A2 A21 A31 A41
The consolidated dataframe will store results from all 3.
How can I do that efficiently?
If I understood you correctly, for each row you want to find out the first non-null values, first by looking into the first table, then the second table, then the third table.
You simply need to join these three tables based on the id and then use the coalesce function to get the first non-null element
import org.apache.spark.sql.functions._
val df1 = sc.parallelize(Seq(
(1,null,null,null,null),
(2,"A2","A21","A31", "A41"))
).toDF("id", "val1", "val2", "val3", "val4")
val df2 = sc.parallelize(Seq(
(1,"B1","B21","B31", "B41"),
(2,null,null,null,null))
).toDF("id", "val1", "val2", "val3", "val4")
val df3 = sc.parallelize(Seq(
(1,"C1","C2","C3","C4"),
(2,"C11","C12","C13", "C14"))
).toDF("id", "val1", "val2", "val3", "val4")
val consolidated = df1.join(df2, "id").join(df3, "id").select(
df1("id"),
coalesce(df1("val1"), df2("val1"), df3("val1")).as("finalVal1"),
coalesce(df1("val2"), df2("val2"), df3("val2")).as("finalVal2"),
coalesce(df1("val3"), df2("val3"), df3("val3")).as("finalVal3"),
coalesce(df1("val4"), df2("val4"), df3("val4")).as("finalVal4")
)
Which gives you the expected output
+---+---------+---------+---------+---------+
| id|finalVal1|finalVal2|finalVal3|finalVal4|
+---+---------+---------+---------+---------+
|  1|       B1|      B21|      B31|      B41|
|  2|       A2|      A21|      A31|      A41|
+---+---------+---------+---------+---------+
Edit: New solution with partially null lines. It avoids joins, but uses a window function and a distinct...
case class a(id:Int,val1:String,val2:String,val3:String,val4:String)
val df1 = sc.parallelize(List(
a(1,null,null,null,null),
a(2,"A2","A21","A31","A41"),
a(3,null,null,null,null))).toDF()
val df2 = sc.parallelize(List(
a(1,"B1",null,"B31","B41"),
a(2,null,null,null,null),
a(3,null,null,null,null))).toDF()
val df3 = sc.parallelize(List(
a(1,"C1","C2","C3","C4"),
a(2,"C11","C12","C13","C14"),
a(3,"C11","C12","C13","C14"))).toDF()
val anyNotNull = df1.columns.tail.map(c => col(c).isNotNull).reduce(_ || _)
val consolidated = {
df1
.filter(anyNotNull)
.withColumn("foo",lit(1))
.unionAll(df2.filter(anyNotNull).withColumn("foo",lit(2)))
.unionAll(df3.filter(anyNotNull).withColumn("foo",lit(3)))
}
val w = Window.partitionBy('id).orderBy('foo)
val coalesced = col("id") +: df1.columns.tail.map(c => first(col(c),true).over(w).as(c))

scala> consolidated.select(coalesced:_*).show()
+---+----+----+----+----+
| id|val1|val2|val3|val4|
+---+----+----+----+----+
|  1|  B1|null| B31| B41|
|  1|  B1|  C2| B31| B41|
|  3| C11| C12| C13| C14|
|  2|  A2| A21| A31| A41|
|  2|  A2| A21| A31| A41|
+---+----+----+----+----+
val finalDF = consolidated.select(coalesced:_*).na.drop.distinct
scala> finalDF.show()
+---+----+----+----+----+
| id|val1|val2|val3|val4|
+---+----+----+----+----+
| 1| B1| C2| B31| B41|
| 3| C11| C12| C13| C14|
| 2| A2| A21| A31| A41|
+---+----+----+----+----+
Old solution:
If you only have rows that are either entirely null or contain no nulls at all, you can do this (edit: the advantage over the other solution is that you avoid the distinct):
data:
case class a(id:Int,val1:String,val2:String,val3:String,val4:String)
val df1 = sc.parallelize(List(
a(1,null,null,null,null),
a(2,"A2","A21","A31","A41"),
a(3,null,null,null,null))).toDF()
val df2 = sc.parallelize(List(
a(1,"B1","B21","B31","B41"),
a(2,null,null,null,null),
a(3,null,null,null,null))).toDF()
val df3 = sc.parallelize(List(
a(1,"C1","C2","C3","C4"),
a(2,"C11","C12","C13","C14"),
a(3,"C11","C12","C13","C14"))).toDF()
consolidated:
val consolidated = {
df1.na.drop.withColumn("foo",lit(1))
.unionAll(df2.na.drop.withColumn("foo",lit(2)))
.unionAll(df3.na.drop.withColumn("foo",lit(3)))
}
scala> consolidated.show()
+---+----+----+----+----+---+
| id|val1|val2|val3|val4|foo|
+---+----+----+----+----+---+
| 2| A2| A21| A31| A41| 1|
| 1| B1| B21| B31| B41| 2|
| 1| C1| C2| C3| C4| 3|
| 2| C11| C12| C13| C14| 3|
| 3| C11| C12| C13| C14| 3|
+---+----+----+----+----+---+
Final
val w = Window.partitionBy('id).orderBy('foo)
val finalDF = consolidated
.withColumn("foo2",rank().over(w))
.filter('foo2===1)
.drop("foo").drop("foo2")
scala> finalDF.show()
+---+----+----+----+----+
| id|val1|val2|val3|val4|
+---+----+----+----+----+
| 1| B1| B21| B31| B41|
| 3| C11| C12| C13| C14|
| 2| A2| A21| A31| A41|
+---+----+----+----+----+
Below is an example of joining six tables/dataframes (not using SQL).
retail_db is a well-known sample database; anyone can get it from Google.
Problem: get all customers from TX who bought fitness items.
val df_customers = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/retail_db?useSSL=false").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "customers").option("user", "root").option("password", "root").load()
val df_products = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/retail_db?useSSL=false").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "products").option("user", "root").option("password", "root").load()
val df_orders = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/retail_db?useSSL=false").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "orders"). option("user", "root").option("password", "root").load()
val df_order_items = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/retail_db?useSSL=false").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "order_items").option("user", "root").option("password", "root").load()
val df_categories = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/retail_db?useSSL=false").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "categories").option("user", "root").option("password", "root").load()
val df_departments = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/retail_db?useSSL=false").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "departments").option("user", "root").option("password", "root").load()
val df_order_items_all = spark.read.format("jdbc").option("url", "jdbc:mysql://localhost/retail_db?useSSL=false").option("driver", "com.mysql.jdbc.Driver").option("dbtable", "order_all").option("user", "root").option("password", "root").load()
val jeCustOrd=df_customers.col("customer_id")===df_orders.col("order_customer_id")
val jeOrdItem=df_orders.col("order_id")===df_order_items.col("order_item_order_id")
val jeProdOrdItem=df_products.col("product_id")===df_order_items.col("order_item_product_id")
val jeProdCat=df_products.col("product_category_id")===df_categories.col("category_id")
val jeCatDept=df_categories.col("category_department_id")===df_departments.col("department_id")
df_customers.where("customer_state = 'TX'").join(df_orders,jeCustOrd).join(df_order_items,jeOrdItem).join(df_products,jeProdOrdItem).join(df_categories,jeProdCat).join(df_departments,jeCatDept).filter("department_name='Fitness'")
.select("customer_id","customer_fname","customer_lname", "customer_street","customer_city","customer_state","customer_zipcode","order_id","category_name","department_name").show(5)
If they come from three different database tables, I would use pushdown filters to filter them on the server and the DataFrame join function to join them together.
If they are not from database tables, you can use the filter and map higher-order functions to achieve the same result in parallel.
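A minimal sketch of the pushdown idea, assuming the same JDBC-backed DataFrames as above (column names as in retail_db): a simple filter on a JDBC DataFrame is typically pushed down to the database before the join runs.
import org.apache.spark.sql.functions.col

val txCustomers = df_customers.filter(col("customer_state") === "TX")   // filter typically pushed to MySQL as a WHERE clause
txCustomers
  .join(df_orders, txCustomers("customer_id") === df_orders("order_customer_id"))
  .select("customer_id", "customer_fname", "order_id")
  .show(5)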