Given a list of date ranges, some of which overlap:
val df = Seq(
("Mike","2018-09-01","2018-09-10"), // range 1
("Mike","2018-09-05","2018-09-05"), // range 1
("Mike","2018-09-12","2018-09-12"), // range 1
("Mike","2018-09-11","2018-09-11"), // range 1
("Mike","2018-09-25","2018-09-29"), // range 4
("Mike","2018-09-21","2018-09-23"), // range 4
("Mike","2018-09-24","2018-09-24"), // range 4
("Mike","2018-09-14","2018-09-16"), // range 2
("Mike","2018-09-15","2018-09-17"), // range 2
("Mike","2018-09-05","2018-09-05"), // range 1
("Mike","2018-09-19","2018-09-19"), // range 3
("Mike","2018-09-19","2018-09-19"), // range 3
("Mike","2018-08-19","2018-08-20"), // range 5
("Mike","2018-10-01","2018-10-20"), // range 6
("Mike","2018-10-10","2018-10-30") // range 6
).toDF("name", "start", "end")
I'd like to reduce the data down to the minimum set of date ranges that completely encapsulate the above dates with no extra dates added:
+----+----------+----------+
|name|start |end |
+----+----------+----------+
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-09-21|2018-09-29|
|Mike|2018-08-19|2018-08-20|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+
EDIT: Added three new entries to the test data to account for new edge cases.
I cannot rely on the dates being in any particular order.
My best attempt at this so far:
Explode each date range into its set of individual days
Union the sets together into one big set of all the days
Sort the set into a list so the days are in order
Aggregate the individual days back into a list of lists of days.
Take the first and last day of each list as the new ranges.
The code, such as it is:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import scala.collection.immutable.NumericRange
import java.time.LocalDate
case class MyRange(start:String, end:String)
val combineRanges = udf((ranges: Seq[Row]) => {
ranges.map(r => LocalDate.parse(r(0).toString).toEpochDay to LocalDate.parse(r(1).toString).toEpochDay)
.map(_.toIndexedSeq).reduce(_ ++ _).distinct.toList.sorted
.aggregate(List.empty[Vector[Long]])((ranges:List[Vector[Long]], d:Long) => {
ranges.lastOption.find(_.last + 1 == d) match {
case Some(r:Vector[Long]) => ranges.dropRight(1) :+ (r :+ d)
case None => ranges :+ Vector(d)
}
}, _ ++ _).map(v => MyRange(LocalDate.ofEpochDay(v.head).toString, LocalDate.ofEpochDay(v.last).toString))
})
df.groupBy("name")
.agg(combineRanges(collect_list(struct($"start", $"end"))) as "ranges")
.withColumn("ranges", explode($"ranges"))
.select($"name", $"ranges.start", $"ranges.end")
.show(false)
It seems to work, but it is very ugly and probably wasteful of time and memory.
I was kind of hoping to use the scala Range class to only notionally explode the date ranges into their individual days, but I've got a feeling the sort operation forces scala's hand and makes it actually create a list of all the dates in memory.
Does anyone have a better way of doing this?
I think the easiest (and most readable) way is to explode the ranges into individual days and then aggregate back into intervals. As the number of days cannot grow too large, I think exploding is not a bottleneck here. I show a "pure-Scala" solution which is then used inside an UDF which gets all the intervals from a collect_list aggregation :
import java.time.LocalDate
import java.time.temporal.ChronoUnit
def enumerateDays(start: LocalDate, end: LocalDate) = {
Iterator.iterate(start)(d => d.plusDays(1L))
.takeWhile(d => !d.isAfter(end))
.toList
}
implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)
val combineRanges = udf((data: Seq[Row]) => {
val dateEnumerated =
data
.toSet[Row] // use Set to save memory if many spans overlap
// "explode" date spans into individual days
.flatMap { case Row(start: String, end: String) => enumerateDays(LocalDate.parse(start), LocalDate.parse(end)) }
.toVector
.sorted
// combine subsequent dates into Vectors
dateEnumerated.tail
// combine subsequent dates into Vectors
.foldLeft(Vector(Vector(dateEnumerated.head)))((agg, curr) => {
if (ChronoUnit.DAYS.between(agg.last.last, curr) == 1L) {
agg.init :+ (agg.last :+ curr)
} else {
agg :+ Vector(curr)
}
})
// now get min/max of dates per span
.map(r => (r.min.toString, r.max.toString))
})
df.groupBy("name")
.agg(combineRanges(collect_list(struct($"start", $"end"))) as "ranges")
.withColumn("ranges", explode($"ranges"))
.select($"name", $"ranges._1".as("start"), $"ranges._2".as("end"))
.show(false)
gives
+----+----------+----------+
|name|start |end |
+----+----------+----------+
|Mike|2018-08-19|2018-08-20|
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-09-21|2018-09-29|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+
I think it's also doable with more logic DataFrame API. I would still explode using UDFs, but then use Window-Functions and groupBy to build the new block based on number of days between 2 dates. But I think the above solution is also ok
Here is an alternative with DFs and SPARK SQL both non-procedural and procedural by definition. You need to read well and persist.
// Aspects such as caching and re-partitioning for performance not considered. On the other hand it all happens under the bonnet wth DF's - so they say.
// Functional only.
import org.apache.spark.sql.functions._
import spark.implicits._
import java.time._
import org.apache.spark.sql.functions.{lead, lag}
import org.apache.spark.sql.expressions.Window
def toEpochDay(s: String) = LocalDate.parse(s).toEpochDay
val toEpochDayUdf = udf(toEpochDay(_: String))
val df = Seq(
("Betty","2018-09-05","2018-09-05"), ("Betty","2018-09-05","2018-09-05"),
("Betty","2018-09-05","2018-09-08"), ("Betty","2018-09-07","2018-09-10"),
("Betty","2018-09-07","2018-09-08"), ("Betty","2018-09-06","2018-09-07"),
("Betty","2018-09-10","2018-09-15"), ("Betty","2017-09-10","2017-09-15"),
("XXX","2017-09-04","2017-09-10"), ("XXX","2017-09-10","2017-09-15"),
("YYY","2017-09-04","2017-09-10"), ("YYY","2017-09-11","2017-09-15"),
("Bob","2018-09-01","2018-09-02"), ("Bob","2018-09-04","2018-09-05"),
("Bob","2018-09-06","2018-09-07"), ("Bob","2019-09-04","2019-09-05"),
("Bob","2019-09-06","2019-09-07"), ("Bob","2018-09-08","2018-09-22")
).toDF("name", "start", "end")
// Remove any duplicates - pointless to n-process these!
val df2 = df.withColumn("s", toEpochDayUdf($"start")).withColumn("e", toEpochDayUdf($"end")).distinct
df2.show(false) // The original input
df2.createOrReplaceTempView("ranges")
// Find those records encompassed by a broader time frame and hence not required for processing.
val q = spark.sql(""" SELECT *
FROM ranges r1
WHERE EXISTS (SELECT r2.name
FROM ranges r2
WHERE r2.name = r1.name
AND ((r1.s >= r2.s AND r1.e < r2.e) OR
(r1.e <= r2.e AND r1.s > 2.s))
)
""")
//q.show(false)
val df3 = df2.except(q) // Overlapping or on their own / single range records left.
//df3.show(false)
df3.createOrReplaceTempView("ranges2")
// Find those ranges that have a gap between them and the next adjacent records, before or after, i.e. records that exist on their own and are in fact per de facto the first part of the answer.
val q2 = spark.sql(""" SELECT *
FROM ranges2 r1
WHERE NOT EXISTS (SELECT r2.name
FROM ranges2 r2
WHERE r2.name = r1.name
AND (r2.e >= r1.s - 1 AND r2.s <= r1.s - 1 ) OR
(r2.s <= r1.e + 1 AND r2.e >= r1.e + 1 ))
)
""")
// Store the first set of records that exist on their own with some form of gap, first part of result overall result set.
val result1 = q2.select("name", "start", "end")
result1.show(false)
// Get the records / ranges that have overlaps to process - the second remaining set of such records to process.
val df4 = df3.except(q2)
//df4.show(false)
//Avoid Serialization errors with lag!
#transient val w = org.apache.spark.sql.expressions.Window.partitionBy("name").orderBy("e")
#transient val lag_y = lag("e", 1, -99999999).over(w)
//df.select(lag_y).map(f _).first
val df5 = df4.withColumn("new_col", lag_y)
//df5.show(false)
// Massage data to get results via easier queries, e.g. avoid issues with correlated sub-queries.
val myExpression = "s - new_col"
val df6 = df5.withColumn("result", when($"new_col" === 0, 0).otherwise(expr(myExpression)))
//df6.show(false)
df6.createOrReplaceTempView("ranges3")
val q3 = spark.sql(""" SELECT *, dense_rank() over (PARTITION BY name ORDER BY start ASC) as RANK
FROM ranges3
WHERE new_col = -99999999 OR result > 1
""")
q3.createOrReplaceTempView("rangesSTARTS")
val q4 = spark.sql(""" SELECT *
FROM ranges3
WHERE result <= 1 AND new_col <> -99999999
""")
q4.createOrReplaceTempView("rangesFOLLOWERS")
val q5 = spark.sql(""" SELECT r1.*, r2.start as next_start
FROM rangesSTARTS r1 LEFT JOIN rangesSTARTS r2
ON r2.name = r1.name
AND r2.RANK = r1.RANK + 1
""")
//q5.show(false)
val q6 = q5.withColumn("end_period", when($"next_start".isNull, "2525-01-01").otherwise($"next_start"))
//q6.show(false)
q6.createOrReplaceTempView("rangesSTARTS2")
// Second and final set of results - the head and tail of such set of range records.
val result2 = spark.sql(""" SELECT r1.name, r1.start, MAX(r2.end) as end
FROM rangesFOLLOWERS r2, rangesSTARTS2 r1
WHERE r2.name = r1.name
AND r2.end >= r1.start
AND r2.end < r1.end_period
GROUP BY r1.name, r1.start """)
result2.show(false)
val finalresult = result1.union(result2)
finalresult.show
returns:
+-----+----------+----------+
| name| start| end|
+-----+----------+----------+
| Bob|2018-09-01|2018-09-02|
|Betty|2017-09-10|2017-09-15|
| YYY|2017-09-04|2017-09-15|
| Bob|2018-09-04|2018-09-22|
| Bob|2019-09-04|2019-09-07|
| XXX|2017-09-04|2017-09-15|
|Betty|2018-09-05|2018-09-15|
+-----+----------+----------+
An interesting contrast - what is better for performance and style? My last such effort for a while. Interested in your comments. You know the programming aspects better than I, so this question provides some good comparison and some good education. the other solutions do explode, not how I saw it.
I think my second approach is better, but still far from perfect. It at least avoids iterating through every day in every date range, although it now processes every range multiple times. I think I'm mostly going to be processing a few large ranges instead of a bunch of little ranges, so maybe that's ok.
Given:
val ranges = Seq(
("Mike","2018-09-01","2018-09-10"),
("Mike","2018-09-05","2018-09-05"),
("Mike","2018-09-12","2018-09-12"),
("Mike","2018-09-11","2018-09-11"),
("Mike","2018-09-25","2018-09-30"),
("Mike","2018-09-21","2018-09-23"),
("Mike","2018-09-24","2018-09-24"),
("Mike","2018-09-14","2018-09-16"),
("Mike","2018-09-15","2018-09-17"),
("Mike","2018-09-05","2018-09-05"),
("Mike","2018-09-19","2018-09-19"),
("Mike","2018-09-19","2018-09-19"),
("Mike","2018-08-19","2018-08-20"),
("Mike","2018-10-01","2018-10-20"),
("Mike","2018-10-10","2018-10-30")
)
val df = ranges.toDF("name", "start", "end")
I want:
+----+----------+----------+
|name|start |end |
+----+----------+----------+
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-21|2018-09-30|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-08-19|2018-08-20|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+
(They are not in order this time. I'm ok with that since it was never a requirement. It just happened to be an artifact of my previous approach)
// very specific helper functions to convert a date string to and from a range
implicit class MyString(s:String) {
def toFilteredInt: Int = s.filter(_.isDigit).toInt
def to(s2:String): Range = s.toFilteredInt to s2.toFilteredInt
// this only works for YYYYMMDD strings. very dangerous.
def toDateStr = s"${s.slice(0,4)}-${s.slice(4,6)}-${s.slice(6,8)}"
}
// helper functions to combine two ranges
implicit class MyRange(r:Range) {
def expand(i: Int): Range = r.head - i * r.step to r.last + i * r.step
def containsPart(r2:Range): Boolean = r.contains(r2.head) || r.contains(r2.last)
def intersects(r2:Range): Boolean = r.containsPart(r2) || r2.containsPart(r)
def combine(r2:Range): Option[Range] = {
if (r.step == r2.step && r.intersects(r2 expand 1)) {
if (r.step > 0) Some(Math.min(r.head, r2.head) to Math.max(r.last, r2.last))
else Some(Math.max(r.head, r2.head) to Math.min(r.last, r2.last))
}
else None
}
def toDateStrTuple: (String,String) = (r.start.toString.toDateStr, r.end.toString.toDateStr)
}
// combines a range to one of the ranges in a sequence if possible;
// adds it to the sequence if it can't be combined.
def addToRanges(rs:Seq[Range], r:Range): Seq[Range] = {
if (rs.isEmpty) Seq(r)
else r.combine(rs.last) match {
case Some(r:Range) => rs.init :+ r
case None => addToRanges(rs.init, r) :+ rs.last
}
}
// tries to combine every range in the sequence with every other range
// does not handle the case where combining two ranges together allows
// them to be combined with a previous range in the sequence.
// however, if we call this and nothing has been combined, we know
// we are done
def collapseOnce(rs:Seq[Range]):Seq[Range] = {
if (rs.size <= 1) rs
else addToRanges(collapseOnce(rs.init), rs.last)
}
// keep collapsing the sequence of ranges until they can't collapse
// any further
def collapseAll(rs:Seq[Range]):Seq[Range] = {
val collapsed = collapseOnce(rs)
if (rs.size == collapsed.size) rs
else collapseAll(collapsed)
}
// now our udf is much simpler
val combineRanges = udf((rows: Seq[Row]) => {
val ranges = rows.map(r => r(0).toString to r(1).toString)
collapseAll(ranges).map(_.toDateStrTuple)
})
df.groupBy("name").agg(combineRanges(collect_list(struct($"start", $"end"))) as "ranges"
).withColumn("ranges", explode($"ranges")
).select($"name", $"ranges._1" as "start", $"ranges._2" as "end").show(false)
Room for improvement:
I'm pretty sure I'd get better performance most of the time if I bailed out of collapseOnce as soon as I found a range to combine. The typical use case will be adding a single day range to the last range in the sequence.
collapseOnce and addToRanges are not yet tail recursive.
Some of the date to string and string to date methods in my implicit classes should probably be stand alone methods. They are super specific to my problem and don't deserve to be included in general String and Range classes.
What about
exploding date range into days by a udf
using analytical functions to do the magic
Code:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
def enumerateDays(start: LocalDate, end: LocalDate) = {
Iterator.iterate(start)(d => d.plusDays(1L))
.takeWhile(d => !d.isAfter(end))
.toSeq
}
val udf_enumerateDays = udf( (start:String, end:String) => enumerateDays(LocalDate.parse(start), LocalDate.parse(end)).map(_.toString))
df.select($"name", explode(udf_enumerateDays($"start",$"end")).as("day"))
.distinct
.withColumn("day_prev", lag($"day",1).over(Window.partitionBy($"name").orderBy($"day")))
.withColumn("is_consecutive", coalesce(datediff($"day",$"day_prev"),lit(0))<=1)
.withColumn("group_nb", sum(when($"is_consecutive",lit(0)).otherwise(lit(1))).over(Window.partitionBy($"name").orderBy($"day")))
.groupBy($"name",$"group_nb").agg(min($"day").as("start"), max($"day").as("end"))
.drop($"group_nb")
.orderBy($"name",$"start")
.show
Result:
+----+----------+----------+
|name| start| end|
+----+----------+----------+
|Mike|2018-08-19|2018-08-20|
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-09-21|2018-09-29|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+
Here is one more solution using DF without the UDFs
val df = Seq(
("Mike","2018-09-01","2018-09-10"), // range 1
("Mike","2018-09-05","2018-09-05"), // range 1
("Mike","2018-09-12","2018-09-12"), // range 1
("Mike","2018-09-11","2018-09-11"), // range 1
("Mike","2018-09-25","2018-09-30"), // range 4
("Mike","2018-09-21","2018-09-23"), // range 4
("Mike","2018-09-24","2018-09-24"), // range 4
("Mike","2018-09-14","2018-09-16"), // range 2
("Mike","2018-09-15","2018-09-17"), // range 2
("Mike","2018-09-05","2018-09-05"), // range 1
("Mike","2018-09-19","2018-09-19"), // range 3
("Mike","2018-09-19","2018-09-19") // range 3
).toDF("name", "start", "end").withColumn("start",'start.cast("date")).withColumn("end",'end.cast("date"))
df.printSchema()
val df2 = df.as("t1").join(df.as("t2"), $"t1.start" =!= $"t2.start" and $"t1.end" =!= $"t2.end")
.withColumn("date_diff_start",datediff($"t1.start",$"t2.start"))
.withColumn("date_diff_end",datediff($"t1.end",$"t2.end"))
.withColumn("n1_start",when('date_diff_start===1,$"t2.start"))
.withColumn("n1_end",when('date_diff_end === -1,$"t2.end"))
.filter( 'n1_start.isNotNull or 'n1_end.isNotNull)
.withColumn( "new_start", when('n1_start.isNotNull, $"n1_start").otherwise($"t1.start"))
.withColumn( "new_end", when('n1_end.isNotNull, $"n1_end").otherwise($"t1.end"))
.select("t1.name","new_start","new_end")
.distinct
val df3= df2.alias("t3").join(df2.alias("t4"),$"t3.name" === $"t4.name")
.withColumn("x1",when($"t3.new_end"=== $"t4.new_start",1)
.when($"t3.new_start" === $"t4.new_end",1)
.otherwise(0))
.groupBy("t3.name","t3.new_start","t3.new_end")
.agg( min( when('x1===1,$"t4.new_start" ).otherwise($"t3.new_start") ).as("ns"), max(when('x1===1,$"t4.new_end").otherwise($"t3.new_end")).as("ne"))
.select("t3.name","ns","ne")
.distinct
df3.show(false)
val num_combinations = df3.count
val df4 = df.filter('start==='end).distinct.select("name","start").alias("dup")
.join(df3.alias("d4"), $"d4.name"===$"dup.name" , "leftOuter")
.withColumn("cond", ! $"dup.start".between($"ns" , $"ne"))
.filter('cond)
.groupBy("d4.name","start" ).agg(count($"start").as("count"),collect_set('start).as("dup_s1"))
.filter('count===num_combinations)
.withColumn("start",explode('dup_s1))
.withColumn("end",'start)
.select("name","start","end")
df3.union(df4).show(false)
Results:
+----+----------+----------+
|name|ns |ne |
+----+----------+----------+
|Mike|2018-09-21|2018-09-30|
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
+----+----------+----------+
I have 2 dataframes in spark as mentioned below.
val test = hivecontext.sql("select max(test_dt) as test_dt from abc");
test: org.apache.spark.sql.DataFrame = [test_dt: string]
val test1 = hivecontext.table("testing");
where test1 has columns like id,name,age,audit_dt
I want to compare these 2 dataframes and filter rows from test1 where audit_dt > test_dt. Somehow I am not able to do that. I am able to compare audit_dt with literal date using lit function but i am not able to compare it with another dataframe column.
I am able to compare literal date using lit function as mentioned below
val output = test1.filter(to_date(test1("audit_date")).gt(lit("2017-03-23")))
Max Date in test dataframe is -> 2017-04-26
Data in test1 Dataframe ->
Id,Name,Age,Audit_Dt
1,Rahul,23,2017-04-26
2,Ankit,25,2017-04-26
3,Pradeep,28,2017-04-27
I just need the data for Id=3 since that only row qualifies the greater than criteria of max date.
I have already tried below mentioned option but it is not working.
val test = hivecontext.sql("select max(test_dt) as test_dt from abc")
val MAX_AUDIT_DT = test.first().toString()
val output = test.filter(to_date(test("audit_date")).gt((lit(MAX_AUDIT_DT))))
Can anyone suggest as way to compare it with column of dataframe test?
Thanks
You can use non-equi joins, if both columns "test_dt" and "audit_date" are of class date.
/// cast to correct type
import org.apache.spark.sql.functions.to_date
val new_test = test.withColumn("test_dt",to_date($"test_dt"))
val new_test1 = test1.withColumn("Audit_Dt", to_date($"Audit_Dt"))
/// join
new_test1.join(new_test, $"Audit_Dt" > $"test_dt")
.drop("test_dt").show()
+---+-------+---+----------+
| Id| Name|Age| Audit_Dt|
+---+-------+---+----------+
| 3|Pradeep| 28|2017-04-27|
+---+-------+---+----------+
Data
val test1 = sc.parallelize(Seq((1,"Rahul",23,"2017-04-26"),(2,"Ankit",25,"2017-04-26"),
(3,"Pradeep",28,"2017-04-27"))).toDF("Id","Name", "Age", "Audit_Dt")
val test = sc.parallelize(Seq(("2017-04-26"))).toDF("test_dt")
Try with this:
test1.filter(to_date(test1("audit_date")).gt(to_date(test("test_dt"))))
Store the value in a variable and use in filter.
val dtValue = test.select("test_dt")
OR
val dtValue = test.first().getString(0)
Now apply filter
val output = test1.filter(to_date(test1("audit_date")).gt(lit(dtValue)))