Find the facility with the longest interval without accidents using Apache Spark SQL - scala

I have the following dataset:
|facility|date |accidents|
| foo |2019-01-01|1 |
| foo |2019-01-02|null |
| foo |2019-01-03|null |
| foo |2019-01-04|2 |
| bar |2019-01-01|1 |
| bar |2019-01-02|null |
| bar |2019-01-03|3 |
And the goal is to find a facility with the longest continuous period of time without accidents:
|facility|startDate |interval|
|foo |2019-01-02|2 |
Is it possible to do this using Spark SQL? Thanks
P.S. Code sample:
import java.sql.Date
import org.apache.spark.sql.SparkSession

case class FacilityRecord(name: String, date: java.sql.Date, accidents: Option[Int])
case class IntervalWithoutAccidents(name: String, startDate: java.sql.Date, interval: Int)

implicit val spark: SparkSession = SparkSession.builder
  .appName("Test")
  .master("local")
  .getOrCreate()

import spark.implicits._

val facilityRecords = Seq(
  FacilityRecord("foo", Date.valueOf("2019-01-01"), Some(1)),
  FacilityRecord("foo", Date.valueOf("2019-01-02"), None),
  FacilityRecord("foo", Date.valueOf("2019-01-03"), None),
  FacilityRecord("foo", Date.valueOf("2019-01-04"), Some(2)),
  FacilityRecord("bar", Date.valueOf("2019-01-01"), Some(1)),
  FacilityRecord("bar", Date.valueOf("2019-01-02"), None),
  FacilityRecord("bar", Date.valueOf("2019-01-03"), Some(3))
)

val facilityRecordsDataset = spark.createDataset(facilityRecords)
facilityRecordsDataset.show()

val intervalWithoutAccidents: IntervalWithoutAccidents = ??? // TODO: find the interval

val expectedInterval = IntervalWithoutAccidents("foo", startDate = Date.valueOf("2019-01-02"), interval = 2)
assert(expectedInterval == intervalWithoutAccidents)
println(intervalWithoutAccidents)

Here's a 2-step approach:
Step #1: Create a column accident_date and, for each facility, compute an interval value for every row as the number of days from the current date to the next accident date, using Window function first.
Step #2: Compute the max interval per facility using Window function max and filter for the rows that have that max interval value.
Example code below:
import java.sql.Date
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = Seq(
("foo", Date.valueOf("2019-01-01"), Some(1)),
("foo", Date.valueOf("2019-01-02"), None),
("foo", Date.valueOf("2019-01-03"), None),
("foo", Date.valueOf("2019-01-04"), Some(2)),
("bar", Date.valueOf("2019-01-01"), Some(1)),
("bar", Date.valueOf("2019-01-02"), None),
("bar", Date.valueOf("2019-01-03"), Some(3))
).toDF("facility", "date", "accidents")
val win = Window.partitionBy($"facility").orderBy($"date").
rowsBetween(0, Window.unboundedFollowing)
Step #1: Compute interval
val df2 = df.
withColumn("accident_date", when($"accidents".isNotNull, $"date")).
withColumn("interval",
datediff(first($"accident_date", ignoreNulls=true).over(win), $"date")
)
df2.show
// +--------+----------+---------+-------------+--------+
// |facility| date|accidents|accident_date|interval|
// +--------+----------+---------+-------------+--------+
// | bar|2019-01-01| 1| 2019-01-01| 0|
// | bar|2019-01-02| null| null| 1|
// | bar|2019-01-03| 3| 2019-01-03| 0|
// | foo|2019-01-01| 1| 2019-01-01| 0|
// | foo|2019-01-02| null| null| 2|
// | foo|2019-01-03| null| null| 1|
// | foo|2019-01-04| 2| 2019-01-04| 0|
// +--------+----------+---------+-------------+--------+
Step #2: Compute max interval
df2.select($"facility", $"date".as("start_date"),
max($"interval").over(Window.partitionBy($"facility")).as("max_interval")
).
where($"interval" === $"max_interval").
show
// +--------+----------+------------+
// |facility|start_date|max_interval|
// +--------+----------+------------+
// | bar|2019-01-02| 1|
// | foo|2019-01-02| 2|
// +--------+----------+------------+
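The single answer the question asks for (an IntervalWithoutAccidents value) can then be read off the per-facility result. Here is a minimal sketch of that last step (not part of the original answer): it reuses df2 and the asker's IntervalWithoutAccidents case class, and breaking ties by the earliest start date is my own assumption.
val winFacility = Window.partitionBy($"facility")
// keep only the rows carrying each facility's max interval, then take the overall largest
val topRow = df2.
  withColumn("max_interval", max($"interval").over(winFacility)).
  where($"interval" === $"max_interval").
  orderBy($"max_interval".desc, $"date").
  head
// map the winning row into the asker's case class
val intervalWithoutAccidents = IntervalWithoutAccidents(
  topRow.getAs[String]("facility"),
  topRow.getAs[java.sql.Date]("date"),
  topRow.getAs[Int]("max_interval")
)
// == IntervalWithoutAccidents("foo", Date.valueOf("2019-01-02"), 2)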

Alternatively, we can derive this in a single step with the SQL analytic functions lag() over (window) and first() over (window).
Here is the code (with one extra test case where 'foo' has another accident on "2019-01-05" and a no-accident day on "2019-01-06"):
val accidentDf = Seq(
("foo", Date.valueOf("2019-01-01"), Some(1)),
("foo", Date.valueOf("2019-01-02"), None),
("foo", Date.valueOf("2019-01-03"), None),
("foo", Date.valueOf("2019-01-04"), Some(2)),
("bar", Date.valueOf("2019-01-01"), Some(1)),
("bar", Date.valueOf("2019-01-02"), None),
("bar", Date.valueOf("2019-01-03"), Some(3)),
("foo", Date.valueOf("2019-01-05"), Some(3)),
("foo", Date.valueOf("2019-01-06"), None)
).toDF("facility", "date", "accidents")
accidentDf.createOrReplaceTempView("accident_table")
Now, for each row in a facility partition, we work out when the last accident was reported: if the row has no accident we fall back to the partition's first date (via first(date) over the window), otherwise we take the row's own date as the accident date; the lag of that expression gives last_accident_report_date.
Then, for each row, we compute the datediff between its date and the previous row's last_accident_report_date.
Finally we order so that the largest datediff comes first.
Here is the query:
val sparkSql="""select facility,date,accidents ,
lag(CASE
WHEN accidents is NULL
then first(date) over(partition by facility order by date)
else date END ,1)
over(partition by facility order by date) as last_accident_report_date ,
datediff(date,
lag(CASE WHEN accidents is NULL then first(date)
over(partition by facility order by date) else date END ,1)
over(partition by facility order by date))
as no_accident_days_rank from accident_table order by no_accident_days_rank desc, facility"""
Result
scala> spark.sql(sparkSql).show(20,false)
+--------+----------+---------+-------------------------+---------------------+
|facility|date |accidents|last_accident_report_date|no_accident_days_rank|
+--------+----------+---------+-------------------------+---------------------+
|foo |2019-01-04|2 |2019-01-01 |3 |
|bar |2019-01-03|3 |2019-01-01 |2 |
|foo |2019-01-03|null |2019-01-01 |2 |
|bar |2019-01-02|null |2019-01-01 |1 |
|foo |2019-01-02|null |2019-01-01 |1 |
|foo |2019-01-06|null |2019-01-05 |1 |
|foo |2019-01-05|3 |2019-01-04 |1 |
|bar |2019-01-01|1 |null |null |
|foo |2019-01-01|1 |null |null |
+--------+----------+---------+-------------------------+---------------------+
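Since the query already orders by no_accident_days_rank descending, the facility with the longest run is simply the first row; for example (a small addition of mine, not in the original answer):
scala> spark.sql(sparkSql).limit(1).show(false)
+--------+----------+---------+-------------------------+---------------------+
|facility|date      |accidents|last_accident_report_date|no_accident_days_rank|
+--------+----------+---------+-------------------------+---------------------+
|foo     |2019-01-04|2        |2019-01-01               |3                    |
+--------+----------+---------+-------------------------+---------------------+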

Related

Scala - Return the largest string within each group

DataSet:
+---+--------+
|age| name|
+---+--------+
| 33| Will|
| 26|Jean-Luc|
| 55| Hugh|
| 40| Deanna|
| 68| Quark|
| 59| Weyoun|
| 37| Gowron|
| 54| Will|
| 38| Jadzia|
| 27| Hugh|
+---+--------+
Here is my attempt but it just returns the size of the largest string rather than the largest string:
AgeName.groupBy("age")
.agg(max(length(AgeName("name")))).show()
The usual row_number trick should work if you specify the Window correctly. Using @LeoC's example:
val df = Seq(
(35, "John"),
(22, "Jennifer"),
(22, "Alexander"),
(35, "Michelle"),
(22, "Celia")
).toDF("age", "name")
val df2 = df.withColumn(
"rownum",
expr("row_number() over (partition by age order by length(name) desc)")
).filter("rownum = 1").drop("rownum")
df2.show
+---+---------+
|age| name|
+---+---------+
| 22|Alexander|
| 35| Michelle|
+---+---------+
Here's one approach using the Spark higher-order function aggregate, as shown below:
val df = Seq(
(35, "John"),
(22, "Jennifer"),
(22, "Alexander"),
(35, "Michelle"),
(22, "Celia")
).toDF("age", "name")
df.
groupBy("age").agg(collect_list("name").as("names")).
withColumn(
"longest_name",
expr("aggregate(names, '', (acc, x) -> case when length(acc) < length(x) then x else acc end)")
).
show(false)
// +---+----------------------------+------------+
// |age|names |longest_name|
// +---+----------------------------+------------+
// |22 |[Jennifer, Alexander, Celia]|Alexander |
// |35 |[John, Michelle] |Michelle |
// +---+----------------------------+------------+
Note that higher-order functions are available only on Spark 2.4+.
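For Spark versions before 2.4 (where aggregate is not available), a common alternative, not shown in the answers above, is to take the max of a struct keyed by the name length; a minimal sketch, reusing the same df:
df.
  groupBy("age").
  agg(max(struct(length($"name").as("len"), $"name".as("name"))).as("longest")).
  select($"age", $"longest.name".as("longest_name")).
  show(false)
// +---+------------+
// |age|longest_name|
// +---+------------+
// |22 |Alexander   |
// |35 |Michelle    |
// +---+------------+
// (row order may vary)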
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object BasicDatasetTest {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("BasicDatasetTest")
      .getOrCreate()

    val pairs = List(
      (33, "Will"),
      (26, "Jean-Luc"),
      (55, "Hugh"),
      (26, "Deanna"),
      (26, "Quark"),
      (55, "Weyoun"),
      (33, "Gowron"),
      (55, "Will"),
      (26, "Jadzia"),
      (27, "Hugh"))

    val schema = new StructType(Array(
      StructField("age", IntegerType, false),
      StructField("name", StringType, false))
    )

    val dataRDD = spark.sparkContext.parallelize(pairs).map(record => Row(record._1, record._2))
    val dataset = spark.createDataFrame(dataRDD, schema)

    val ageNameGroup = dataset.groupBy("age", "name")
      .agg(max(length(col("name"))))
      .withColumnRenamed("max(length(name))", "length")
    ageNameGroup.printSchema()

    val ageGroup = dataset.groupBy("age")
      .agg(max(length(col("name"))))
      .withColumnRenamed("max(length(name))", "length")
    ageGroup.printSchema()

    ageGroup.createOrReplaceTempView("age_group")
    ageNameGroup.createOrReplaceTempView("age_name_group")

    spark.sql("select ag.age, ang.name from age_group as ag, age_name_group as ang " +
      "where ag.age = ang.age and ag.length = ang.length")
      .show()
  }
}

Spark: Row filter based on Column value

I have millions of rows in a DataFrame like this:
val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE")).toDF("id", "status")
scala> df.show(false)
+---+--------+
|id |status |
+---+--------+
|id1|ACTIVE |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE |
|id3|INACTIVE|
|id3|INACTIVE|
+---+--------+
Now I want to divide this data into three separate DataFrames, like this:
Only ACTIVE ids (like id2), say activeDF
Only INACTIVE ids (like id3), say inactiveDF
Having both ACTIVE and INACTIVE as status, say bothDF
How can I calculate activeDF and inactiveDF?
I know that bothDF can be calculated like
df.select("id").distinct.except(activeDF).except(inactiveDF)
but this will involve shuffling (as the 'distinct' operation requires one). Is there a better way to calculate bothDF?
Versions:
Spark : 2.2.1
Scala : 2.11
The most elegant solution is to pivot on status
val counts = df
.groupBy("id")
.pivot("status", Seq("ACTIVE", "INACTIVE"))
.count
or equivalent direct agg
val counts = df
.groupBy("id")
.agg(
count(when($"status" === "ACTIVE", true)) as "ACTIVE",
count(when($"status" === "INACTIVE", true)) as "INACTIVE"
)
followed by a simple CASE ... WHEN:
val result = counts.withColumn(
"status",
when($"ACTIVE" === 0, "INACTIVE")
.when($"inactive" === 0, "ACTIVE")
.otherwise("BOTH")
)
result.show
+---+------+--------+--------+
| id|ACTIVE|INACTIVE| status|
+---+------+--------+--------+
|id3| 0| 2|INACTIVE|
|id1| 1| 2| BOTH|
|id2| 1| 0| ACTIVE|
+---+------+--------+--------+
Later you can separate the result with filters, as sketched below, or dump it to disk with a source that supports partitionBy (see: How to split a dataframe into dataframes with same column values?).
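For example (my addition, using the result computed above), the three DataFrames the question asks for can be obtained with plain filters:
val activeDF   = result.where($"status" === "ACTIVE").select("id")
val inactiveDF = result.where($"status" === "INACTIVE").select("id")
val bothDF     = result.where($"status" === "BOTH").select("id")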
Just another way: groupBy, collect the statuses as a set, and then if the size of the set is 1 the id is ACTIVE-only or INACTIVE-only, else BOTH.
scala> val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE"), ("id4", "ACTIVE"), ("id5", "ACTIVE"), ("id6", "INACTIVE"), ("id7", "ACTIVE"), ("id7", "INACTIVE")).toDF("id", "status")
df: org.apache.spark.sql.DataFrame = [id: string, status: string]
scala> df.show(false)
+---+--------+
|id |status |
+---+--------+
|id1|ACTIVE |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE |
|id3|INACTIVE|
|id3|INACTIVE|
|id4|ACTIVE |
|id5|ACTIVE |
|id6|INACTIVE|
|id7|ACTIVE |
|id7|INACTIVE|
+---+--------+
scala> val allstatusDF = df.groupBy("id").agg(collect_set("status") as "allstatus")
allstatusDF: org.apache.spark.sql.DataFrame = [id: string, allstatus: array<string>]
scala> allstatusDF.show(false)
+---+------------------+
|id |allstatus |
+---+------------------+
|id7|[ACTIVE, INACTIVE]|
|id3|[INACTIVE] |
|id5|[ACTIVE] |
|id6|[INACTIVE] |
|id1|[ACTIVE, INACTIVE]|
|id2|[ACTIVE] |
|id4|[ACTIVE] |
+---+------------------+
scala> allstatusDF.withColumn("status", when(size($"allstatus") === 1, $"allstatus".getItem(0)).otherwise("BOTH")).show(false)
+---+------------------+--------+
|id |allstatus |status |
+---+------------------+--------+
|id7|[ACTIVE, INACTIVE]|BOTH |
|id3|[INACTIVE] |INACTIVE|
|id5|[ACTIVE] |ACTIVE |
|id6|[INACTIVE] |INACTIVE|
|id1|[ACTIVE, INACTIVE]|BOTH |
|id2|[ACTIVE] |ACTIVE |
|id4|[ACTIVE] |ACTIVE |
+---+------------------+--------+

Add new record before another in Spark

I have a Dataframe:
| ID | TIMESTAMP | VALUE |
| 1  | 15:00:01  | 3     |
| 1  | 17:04:02  | 2     |
When the value is 2, I want to use Spark/Scala to add a new record before it with the same timestamp minus 1 second.
The output would be:
| ID | TIMESTAMP | VALUE |
| 1  | 15:00:01  | 3     |
| 1  | 17:04:01  | 2     |
| 1  | 17:04:02  | 2     |
Thanks
You need a .flatMap()
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
val data = (spark.createDataset(Seq(
(1, "15:00:01", 3),
(1, "17:04:02", 2)
)).toDF("ID", "TIMESTAMP_STR", "VALUE")
.withColumn("TIMESTAMP", $"TIMESTAMP_STR".cast("timestamp").as("TIMESTAMP"))
.drop("TIMESTAMP_STR")
.select("ID", "TIMESTAMP", "VALUE")
)
data.as[(Long, java.sql.Timestamp, Long)].flatMap(r => {
  if (r._3 == 2) {
    Seq(
      (r._1, new java.sql.Timestamp(r._2.getTime() - 1000L), r._3), // new record, 1 second earlier
      (r._1, r._2, r._3)                                            // original record
    )
  } else {
    Seq((r._1, r._2, r._3))
  }
}).toDF("ID", "TIMESTAMP", "VALUE").show()
Which results in:
+---+-------------------+-----+
| ID| TIMESTAMP|VALUE|
+---+-------------------+-----+
| 1|2019-03-04 15:00:01| 3|
| 1|2019-03-04 17:04:01| 2|
| 1|2019-03-04 17:04:02| 2|
+---+-------------------+-----+
You can introduce a new array column: when value = 2 then Array(-1, 0), otherwise Array(0). Then explode that column and add its values to the timestamp as seconds. The below should work for you:
scala> val df = Seq((1,"15:00:01",3),(1,"17:04:02",2)).toDF("id","timestamp","value")
df: org.apache.spark.sql.DataFrame = [id: int, timestamp: string ... 1 more field]
scala> val df2 = df.withColumn("timestamp",'timestamp.cast("timestamp"))
df2: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 1 more field]
scala> df2.show(false)
+---+-------------------+-----+
|id |timestamp |value|
+---+-------------------+-----+
|1 |2019-03-04 15:00:01|3 |
|1 |2019-03-04 17:04:02|2 |
+---+-------------------+-----+
scala> val df3 = df2.withColumn("newc", when($"value"===lit(2),lit(Array(-1,0))).otherwise(lit(Array(0))))
df3: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 2 more fields]
scala> df3.show(false)
+---+-------------------+-----+-------+
|id |timestamp |value|newc |
+---+-------------------+-----+-------+
|1 |2019-03-04 15:00:01|3 |[0] |
|1 |2019-03-04 17:04:02|2 |[-1, 0]|
+---+-------------------+-----+-------+
scala> val df4 = df3.withColumn("c_explode",explode('newc)).withColumn("timestamp2",to_timestamp(unix_timestamp('timestamp)+'c_explode))
df4: org.apache.spark.sql.DataFrame = [id: int, timestamp: timestamp ... 4 more fields]
scala> df4.select($"id",$"timestamp2",$"value").show(false)
+---+-------------------+-----+
|id |timestamp2 |value|
+---+-------------------+-----+
|1 |2019-03-04 15:00:01|3 |
|1 |2019-03-04 17:04:01|2 |
|1 |2019-03-04 17:04:02|2 |
+---+-------------------+-----+
scala>
If you want the time part alone, you can do it like this:
scala> df4.withColumn("timestamp",from_unixtime(unix_timestamp('timestamp2),"HH:mm:ss")).select($"id",$"timestamp",$"value").show(false)
+---+---------+-----+
|id |timestamp|value|
+---+---------+-----+
|1 |15:00:01 |3 |
|1 |17:04:01 |2 |
|1 |17:04:02 |2 |
+---+---------+-----+

Is it possible to ignore null values when using LEAD window function in Spark?

My dataframe looks like this:
|id |value|date|
|1  |100  |2017|
|1  |null |2016|
|1  |20   |2015|
|1  |100  |2014|
I would like to get the most recent previous value, ignoring nulls:
|id |value|date|recent value|
|1  |100  |2017|20          |
|1  |null |2016|20          |
|1  |20   |2015|100         |
|1  |100  |2014|null        |
Is there any way to ignore null values while using lead window function?
Is it possible to ignore null values when using lead window function in Spark
It is not (at least not before Spark 3.2; see the answer further below for the ignoreNulls parameter added there).
I would like to get most recent value but ignoring null
Just use last (or first) with ignoreNulls:
def last(columnName: String, ignoreNulls: Boolean): Column
Aggregate function: returns the last value of the column in a group.
The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val df = Seq(
(1, Some(100), 2017), (1, None, 2016), (1, Some(20), 2015),
(1, Some(100), 2014)
).toDF("id", "value", "date")
df.withColumn(
"last_value",
last("value", true).over(Window.partitionBy("id").orderBy("date"))
).show
+---+-----+----+----------+
| id|value|date|last_value|
+---+-----+----+----------+
| 1| 100|2014| 100|
| 1| 20|2015| 20|
| 1| null|2016| 20|
| 1| 100|2017| 100|
+---+-----+----+----------+
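If you want the most recent previous non-null value, excluding the current row (as in the expected output of the question), the same idea works with the frame restricted to end one row before the current row. A sketch of that variant (my addition), using the same df:
df.withColumn(
  "recent_value",
  last("value", true).over(
    Window.partitionBy("id").orderBy("date").rowsBetween(Window.unboundedPreceding, -1))
).show
+---+-----+----+------------+
| id|value|date|recent_value|
+---+-----+----+------------+
|  1|  100|2014|        null|
|  1|   20|2015|         100|
|  1| null|2016|          20|
|  1|  100|2017|          20|
+---+-----+----+------------+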
Spark 3.2+ provides ignoreNulls inside lead and lag in Scala.
lead(e: Column, offset: Int, defaultValue: Any, ignoreNulls: Boolean): Column
lag(e: Column, offset: Int, defaultValue: Any, ignoreNulls: Boolean): Column
Test input:
import org.apache.spark.sql.expressions.Window
val df = Seq[(Integer, Integer, Integer)](
(1, 100, 2017),
(1, null, 2016),
(1, 20, 2015),
(1, 100, 2014)
).toDF("id", "value", "date")
lead:
val w = Window.partitionBy("id").orderBy(desc("date"))
val df2 = df.withColumn("lead_val", lead($"value", 1, null, true).over(w))
df2.show()
// +---+-----+----+--------+
// | id|value|date|lead_val|
// +---+-----+----+--------+
// | 1| 100|2017| 20|
// | 1| null|2016| 20|
// | 1| 20|2015| 100|
// | 1| 100|2014| null|
// +---+-----+----+--------+
lag:
val w = Window.partitionBy("id").orderBy("date")
val df2 = df.withColumn("lead_val", lag($"value", 1, null, true).over(w))
df2.show()
// +---+-----+----+--------+
// | id|value|date|lead_val|
// +---+-----+----+--------+
// | 1| 100|2014| null|
// | 1| 20|2015| 100|
// | 1| null|2016| 20|
// | 1| 100|2017| 20|
// +---+-----+----+--------+
You could do it in two steps:
Create a table with the non-null values
Join it back onto the original table
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val df = Seq(
(1, Some(100), 2017),
(1, None, 2016),
(1, Some(20), 2015),
(1, Some(100), 2014)
).toDF("id", "value", "date")
// Step 1
val filledDf = df
.where($"value".isNotNull)
.withColumnRenamed("value", "recent_value")
// Step 2
val window: WindowSpec = Window.partitionBy("l.id", "l.date").orderBy($"r.date".desc)
val finalDf = df.as("l")
.join(filledDf.as("r"), $"l.id" === $"r.id" && $"l.date" > $"r.date", "left")
.withColumn("rn", row_number().over(window))
.where($"rn" === 1)
.select("l.id", "l.date", "value", "recent_value")
finalDf.orderBy($"date".desc).show
+---+----+-----+------------+
| id|date|value|recent_value|
+---+----+-----+------------+
| 1|2017| 100| 20|
| 1|2016| null| 20|
| 1|2015| 20| 100|
| 1|2014| 100| null|
+---+----+-----+------------+

spark flatten records using a key column

I am trying to implement logic to flatten records using the Spark/Scala API. I am trying to use the map function.
Could you please help me with the easiest approach to solve this problem?
Assume that for a given key there are exactly 3 process codes.
Input dataframe-->
Keycol|processcode
John |1
Mary |8
John |2
John |4
Mary |1
Mary |7
==============================
Output dataframe-->
Keycol|processcode1|processcode2|processcode3
john |1 |2 |4
Mary |8 |1 |7
Assuming the same number of rows per Keycol, one approach would be to aggregate processcode into an array for each Keycol and expand it out into individual columns:
val df = Seq(
("John", 1),
("Mary", 8),
("John", 2),
("John", 4),
("Mary", 1),
("Mary", 7)
).toDF("Keycol", "processcode")
val df2 = df.groupBy("Keycol").agg(collect_list("processcode").as("processcode"))
val numCols = df2.select( size(col("processcode")) ).as[Int].first
val cols = (0 to numCols - 1).map( i => col("processcode")(i) )
df2.select(col("Keycol") +: cols: _*).show
+------+--------------+--------------+--------------+
|Keycol|processcode[0]|processcode[1]|processcode[2]|
+------+--------------+--------------+--------------+
| Mary| 8| 1| 7|
| John| 1| 2| 4|
+------+--------------+--------------+--------------+
A couple of alternative approaches.
SQL
df.createOrReplaceTempView("tbl")
val q = """
select keycol,
c[0] processcode1,
c[1] processcode2,
c[2] processcode3
from (select keycol, collect_list(processcode) c
from tbl
group by keycol) t0
"""
sql(q).show
Result
scala> sql(q).show
+------+------------+------------+------------+
|keycol|processcode1|processcode2|processcode3|
+------+------------+------------+------------+
| Mary| 1| 7| 8|
| John| 4| 1| 2|
+------+------------+------------+------------+
PairRDDFunctions (groupByKey) + mapPartitions
import org.apache.spark.sql.Row
val my_rdd = df.map{ case Row(a1: String, a2: Int) => (a1, a2)
}.rdd.groupByKey().map(t => (t._1, t._2.toList))
def f(iter: Iterator[(String, List[Int])]) : Iterator[Row] = {
var res = List[Row]();
while (iter.hasNext) {
val (keycol: String, c: List[Int]) = iter.next
res = res ::: List(Row(keycol, c(0), c(1), c(2)))
}
res.iterator
}
import org.apache.spark.sql.types.{StringType, IntegerType, StructField, StructType}
val schema = new StructType().add(
StructField("Keycol", StringType, true)).add(
StructField("processcode1", IntegerType, true)).add(
StructField("processcode2", IntegerType, true)).add(
StructField("processcode3", IntegerType, true))
spark.createDataFrame(my_rdd.mapPartitions(f, true), schema).show
Result
scala> spark.createDataFrame(my_rdd.mapPartitions(f, true), schema).show
+------+------------+------------+------------+
|Keycol|processcode1|processcode2|processcode3|
+------+------------+------------+------------+
| Mary| 1| 7| 8|
| John| 4| 1| 2|
+------+------------+------------+------------+
Please keep in mind that in all cases the order of the process-code values across the columns is undetermined unless explicitly specified; one way to pin it down is sketched below.
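For instance (my addition, reusing df from the first answer), the order can be made deterministic by sorting the collected codes by value before expanding them:
val df2Sorted = df.groupBy("Keycol").
  agg(sort_array(collect_list("processcode")).as("processcode"))
val sortedCols = (0 to 2).map(i => col("processcode")(i).as(s"processcode${i + 1}"))
df2Sorted.select(col("Keycol") +: sortedCols: _*).show
+------+------------+------------+------------+
|Keycol|processcode1|processcode2|processcode3|
+------+------------+------------+------------+
|  Mary|           1|           7|           8|
|  John|           1|           2|           4|
+------+------------+------------+------------+
(row order may vary)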