I have a large dataset and I would like to find the rows with the n highest values.
id, count
id1, 10
id2, 15
id3, 5
...
The only method I can think of is using row_number without a partition, like
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

val window = Window.orderBy(desc("count"))
df.withColumn("row_number", row_number over window).filter(col("row_number") <= n)
but this is in no way performant when the data contains millions or billions of rows, because it pushes all the data into one partition and I get an OOM.
Has anyone managed to come up with a performant solution?
I see two methods to improve your algorithm's performance. The first is to use sort and limit to retrieve the top n rows. The second is to develop your own custom Aggregator.
Sort and Limit method
You sort your dataframe and then you take the first n rows:
val n: Int = ???
import org.apache.spark.sql.functions.desc
df.orderBy(desc("count")).limit(n)
Spark optimizes this sequence of transformations: it first sorts each partition, takes the first n rows of each partition, gathers them on a single final partition, then re-sorts and takes the final first n rows. You can check this by calling explain() on the query.
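For example, a minimal check (reusing df and n from above):
df.orderBy(desc("count")).limit(n).explain()
This prints the following execution plan: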
== Physical Plan ==
TakeOrderedAndProject(limit=3, orderBy=[count#8 DESC NULLS LAST], output=[id#7,count#8])
+- LocalTableScan [id#7, count#8]
You can also see how the TakeOrderedAndProject step is executed by looking at limit.scala in Spark's source code (case class TakeOrderedAndProjectExec, method doExecute).
Custom Aggregator method
For the custom Aggregator method, you create an Aggregator that populates and updates an ordered array of the top n rows.
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.Encoder
import scala.collection.mutable.ArrayBuffer
case class Record(id: String, count: Int)

case class TopRecords(limit: Int) extends Aggregator[Record, ArrayBuffer[Record], Seq[Record]] {

  def zero: ArrayBuffer[Record] = ArrayBuffer.empty[Record]

  def reduce(topRecords: ArrayBuffer[Record], currentRecord: Record): ArrayBuffer[Record] = {
    // position of the last record whose count is strictly greater than the current one
    val insertIndex = topRecords.lastIndexWhere(p => p.count > currentRecord.count)
    if (topRecords.length < limit) {
      topRecords.insert(insertIndex + 1, currentRecord)
    } else if (insertIndex < limit - 1) {
      // buffer is full: insert the record and drop the smallest one
      topRecords.insert(insertIndex + 1, currentRecord)
      topRecords.remove(topRecords.length - 1)
    }
    topRecords
  }

  def merge(topRecords1: ArrayBuffer[Record], topRecords2: ArrayBuffer[Record]): ArrayBuffer[Record] = {
    // merge two sorted buffers, keeping at most `limit` records
    val merged = ArrayBuffer.empty[Record]
    while (merged.length < limit && (topRecords1.nonEmpty || topRecords2.nonEmpty)) {
      if (topRecords1.isEmpty) {
        merged.append(topRecords2.remove(0))
      } else if (topRecords2.isEmpty) {
        merged.append(topRecords1.remove(0))
      } else if (topRecords2.head.count < topRecords1.head.count) {
        merged.append(topRecords1.remove(0))
      } else {
        merged.append(topRecords2.remove(0))
      }
    }
    merged
  }

  def finish(reduction: ArrayBuffer[Record]): Seq[Record] = reduction

  def bufferEncoder: Encoder[ArrayBuffer[Record]] = ExpressionEncoder[ArrayBuffer[Record]]

  def outputEncoder: Encoder[Seq[Record]] = ExpressionEncoder[Seq[Record]]
}
You then apply this aggregator to your dataframe and flatten the aggregation result:
val n: Int = ???
import sparkSession.implicits._
df.as[Record].select(TopRecords(n).toColumn).flatMap(record => record)
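For instance, a minimal end-to-end sketch (assuming a SparkSession named sparkSession, as above; sample is a hypothetical toy dataframe):
import sparkSession.implicits._
val sample = Seq(("id1", 10), ("id2", 15), ("id3", 5)).toDF("id", "count")
sample.as[Record].select(TopRecords(2).toColumn).flatMap(record => record).show()
// should print id2 (count 15) and id1 (count 10), the top 2 records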
Method comparison
To compare those two methods, let's say we want to take the top n rows of a dataframe distributed over p partitions, each partition having around k records, so the dataframe has p·k rows in total. This gives the following complexities (subject to errors):
| method | total number of operations | memory consumption (on executor) | memory consumption (on final executor) |
|---|---|---|---|
| Current code | O(p·k·log(p·k)) | -- | O(p·k) |
| Sort and Limit | O(p·k·log(k) + p·n·log(p·n)) | O(k) | O(p·n) |
| Custom Aggregator | O(p·k) | O(k) + O(n) | O(p·n) |
So regarding the number of operations, the Custom Aggregator is the most performant. However, this method is by far the most complex and implies lots of serialization/deserialization, so it may be less performant than Sort and Limit in certain cases.
Conclusion
You have two methods to efficiently take the top n rows: Sort and Limit and Custom Aggregator. To select which one to use, you should benchmark both methods on your real dataframe, as sketched below. If Sort and Limit turns out to be only a bit slower than Custom Aggregator, I would select Sort and Limit, as its code is a lot easier to maintain.
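As a starting point, a naive timing harness could look like the following sketch (it assumes df matches the Record schema, reuses TopRecords and the sparkSession from above; a serious benchmark would add warm-up runs and multiple iterations):
import org.apache.spark.sql.functions.desc
import sparkSession.implicits._
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  result
}
val n = 10
// count() forces full evaluation, so the whole job is measured
time("Sort and Limit") { df.orderBy(desc("count")).limit(n).count() }
time("Custom Aggregator") { df.as[Record].select(TopRecords(n).toColumn).flatMap(record => record).count() }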
1. Convert to an RDD.
2. mapPartitions, sorting the data within each partition, and take N.
3. Convert back to a DF.
4. Then sort, rank, and take the top N. It is unlikely you will get an OOM.
Here is an actual example, slightly updated: a roll-your-own approach, for posterity.
import org.apache.spark.sql.functions._
import spark.sqlContext.implicits._
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}
// 1. Create data
val data = Seq(("James ","","Smith","36636","M",33000),
("Michael ","Rose","","40288","M",14000),
("Robert ","","Williams","42114","M",40),
("Robert ","","Williams","42114","M",540),
("Robert ","","Zeedong","42114","M",40000000),
("Maria ","Anne","Jones","39192","F",300),
("Maria ","Anne","Vangelis","39192","F",1300),
("Jen","Mary","Brown","","F",-1))
val columns = Seq("firstname","middlename","lastname","dob","gender","val")
val df = data.toDF(columns:_*)
//df.show()
//2. Repartition to any number of partitions, and sort within each partition. Like a combiner function in Hadoop.
val df2 = df.repartition(1000,col("lastname")).sortWithinPartitions(desc("val"))
//df2.rdd.glom().collect()
//3. Take the top N per partition, so at most num partitions x N rows remain (here N = 2). The take(n) is the top n per partition. No OOM.
val rdd2 = df2.rdd.mapPartitions(_.take(2))
//4. Ghastly Row to DF work-arounds.
val schema = new StructType()
.add(StructField("f", StringType, true))
.add(StructField("m", StringType, true))
.add(StructField("l", StringType, true))
.add(StructField("d", StringType, true))
.add(StructField("g", StringType, true))
.add(StructField("v", IntegerType, true))
val df3 = spark.createDataFrame(rdd2, schema)
//5. Sort and take top(n) = 2 and Bob's your uncle. The Reduce after Combine.
df3.sort(col("v").desc).limit(2).show()
Returns for top 2 desc:
+-------+---+-------+-----+---+--------+
| f| m| l| d| g| v|
+-------+---+-------+-----+---+--------+
|Robert | |Zeedong|42114| M|40000000|
| James | | Smith|36636| M| 33000|
+-------+---+-------+-----+---+--------+
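For reuse, the same combine-then-reduce pattern can be wrapped in a small helper. A sketch only: the name topN and its parameters are illustrative, and it assumes the SparkSession named spark used above.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, desc}
// Top-n rows by sortCol: top-n per partition first (the "combine"),
// then a global sort + limit (the "reduce").
def topN(src: DataFrame, sortCol: String, n: Int, numParts: Int): DataFrame = {
  val perPartition = src.repartition(numParts).sortWithinPartitions(desc(sortCol))
  spark.createDataFrame(perPartition.rdd.mapPartitions(_.take(n)), src.schema)
    .sort(col(sortCol).desc)
    .limit(n)
}
// e.g. topN(df, "val", 2, 1000).show()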
Hi how's it going? I'm a Python developer trying to learn Spark Scala. My task is to create date range bins, and count the frequency of occurrences in each bin (histogram).
My input dataframe looks something like this:
My bin edges are this (in Python):
bins = ["01-01-1990 - 12-31-1999","01-01-2000 - 12-31-2009"]
and the output dataframe I'm looking for is (counts of how many values in original dataframe per bin):
Is there anyone who can guide me on how to do this in Spark Scala? I'm a bit lost. Thank you.
Are you looking for a result like the following?
+------------------------+------------------------+
|01-01-1990 -- 12-31-1999|01-01-2000 -- 12-31-2009|
+------------------------+------------------------+
| 3| null|
| null| 2|
+------------------------+------------------------+
It can be achieved with a little bit of Spark SQL and the pivot function, as shown below.
Check out the left join condition.
val binRangeData = Seq(("01-01-1990","12-31-1999"),
("01-01-2000","12-31-2009"))
val binRangeDf = binRangeData.toDF("start_date","end_date")
// binRangeDf.show
val inputDf = Seq((0,"10-12-1992"),
(1,"11-11-1994"),
(2,"07-15-1999"),
(3,"01-20-2001"),
(4,"02-01-2005")).toDF("id","input_date")
// inputDf.show
binRangeDf.createOrReplaceTempView("bin_range")
inputDf.createOrReplaceTempView("input_table")
val countSql = """
SELECT concat(date_format(c.st_dt,'MM-dd-yyyy'),' -- ',date_format(c.end_dt,'MM-dd-yyyy')) as header, c.bin_count
FROM (
(SELECT
b.st_dt, b.end_dt, count(1) as bin_count
FROM
(select to_date(input_date,'MM-dd-yyyy') as date_input , * from input_table) a
left join
(select to_date(start_date,'MM-dd-yyyy') as st_dt, to_date(end_date,'MM-dd-yyyy') as end_dt from bin_range ) b
on
a.date_input >= b.st_dt and a.date_input <= b.end_dt
group by 1,2) ) c"""
val countDf = spark.sql(countSql)
countDf.groupBy("bin_count").pivot("header").sum("bin_count").drop("bin_count").show
Note that since you have 2 bin ranges, 2 rows will be generated (see the sketch below if you want them collapsed into one).
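If you would rather have a single row, one possible follow-up sketch (reusing countDf from above) aggregates each pivoted column with max, which ignores the nulls:
import org.apache.spark.sql.functions.max
val pivoted = countDf.groupBy("bin_count").pivot("header").sum("bin_count").drop("bin_count")
val aggs = pivoted.columns.map(c => max(pivoted(c)).as(c))
pivoted.agg(aggs.head, aggs.tail: _*).show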
We can achieve this by looking at the date column and determining within which range each record falls.
// First we set up the problem
// Create a format that looks like yours
val dateFormat = java.time.format.DateTimeFormatter.ofPattern("MM-dd-yyyy")
// Get the current local date
val now = java.time.LocalDate.now
// Create a range of 1-10000 and map each to minusDays
// so we can have a range of dates going 10000 days back
val dates = (1 to 10000).map(now.minusDays(_).format(dateFormat))
// Create a DataFrame we can work with.
val df = dates.toDF("date")
So far so good. We have date entries to work with, and they are like your format (MM-dd-yyyy).
Next up, we need a function which returns 1 if the date falls within range, and 0 if not. We create a UserDefinedFunction (UDF) from this function so we can apply it to all rows simultaneously across Spark executors.
// We will process each range one at a time, so we'll take it as a string
// and split it accordingly. Then we perform our tests. Using Dates is
// necessary to cater to your format.
import java.text.SimpleDateFormat
def isWithinRange(date: String, binRange: String): Int = {
val format = new SimpleDateFormat("MM-dd-yyyy")
val startDate = format.parse(binRange.split(" - ").head)
val endDate = format.parse(binRange.split(" - ").last)
val testDate = format.parse(date)
if (!(testDate.before(startDate) || testDate.after(endDate))) 1
else 0
}
// We create a udf which uses an anonymous function taking two args and
// simply pass the values to our prepared function
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
def isWithinRangeUdf: UserDefinedFunction =
udf((date: String, binRange: String) => isWithinRange(date, binRange))
Now that we have our UDF set up, we create new columns in our DataFrame, one per bin, and then sum the values per bin (hence why we made our function evaluate to an Int).
// We define our bins List
val bins = List("01-01-1990 - 12-31-1999",
"01-01-2000 - 12-31-2009",
"01-01-2010 - 12-31-2020")
// We fold through the bins list, creating a column from each bin at a time,
// enriching the DataFrame with more columns as we go
import org.apache.spark.sql.functions.{col, lit}
val withBinsDf = bins.foldLeft(df){(changingDf, bin) =>
changingDf.withColumn(bin, isWithinRangeUdf(col("date"), lit(bin)))
}
withBinsDf.show(1)
//+----------+-----------------------+-----------------------+-----------------------+
//| date|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+----------+-----------------------+-----------------------+-----------------------+
//|09-01-2020| 0| 0| 1|
//+----------+-----------------------+-----------------------+-----------------------+
//only showing top 1 row
Finally, we select our bin columns and sum each of them in a single global aggregation.
import org.apache.spark.sql.functions.sum
val binsDf = withBinsDf.select(bins.head, bins.tail:_*)
val sums = bins.map(b => sum(b).as(b)) // keep col names as-is
val summedBinsDf = binsDf.groupBy().agg(sums.head, sums.tail:_*)
summedBinsDf.show
//+-----------------------+-----------------------+-----------------------+
//|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+-----------------------+-----------------------+-----------------------+
//| 2450| 3653| 3897|
//+-----------------------+-----------------------+-----------------------+
2450 + 3653 + 3897 = 10000, so it seems our work was correct.
Perhaps I overdid it and there is a simpler solution; please let me know if you know a better way (especially for handling MM-dd-yyyy dates).
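For what it's worth, one simpler-looking route is a sketch like the following (assuming Spark 2.2+ for the to_date overload that takes a format, and reusing df and bins from above): parse the strings once with the built-in to_date, then count per bin with a conditional sum, no UDF needed.
import org.apache.spark.sql.functions.{col, lit, sum, to_date, when}
val parsed = df.withColumn("d", to_date(col("date"), "MM-dd-yyyy"))
val binSums = bins.map { bin =>
  val Array(start, end) = bin.split(" - ")
  // between is inclusive on both ends, matching the bin labels
  sum(when(col("d").between(to_date(lit(start), "MM-dd-yyyy"),
                            to_date(lit(end), "MM-dd-yyyy")), 1).otherwise(0)).as(bin)
}
parsed.agg(binSums.head, binSums.tail: _*).show(false)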
Given a list of date ranges, some of which overlap:
val df = Seq(
("Mike","2018-09-01","2018-09-10"), // range 1
("Mike","2018-09-05","2018-09-05"), // range 1
("Mike","2018-09-12","2018-09-12"), // range 1
("Mike","2018-09-11","2018-09-11"), // range 1
("Mike","2018-09-25","2018-09-29"), // range 4
("Mike","2018-09-21","2018-09-23"), // range 4
("Mike","2018-09-24","2018-09-24"), // range 4
("Mike","2018-09-14","2018-09-16"), // range 2
("Mike","2018-09-15","2018-09-17"), // range 2
("Mike","2018-09-05","2018-09-05"), // range 1
("Mike","2018-09-19","2018-09-19"), // range 3
("Mike","2018-09-19","2018-09-19"), // range 3
("Mike","2018-08-19","2018-08-20"), // range 5
("Mike","2018-10-01","2018-10-20"), // range 6
("Mike","2018-10-10","2018-10-30") // range 6
).toDF("name", "start", "end")
I'd like to reduce the data down to the minimum set of date ranges that completely encapsulate the above dates with no extra dates added:
+----+----------+----------+
|name|start |end |
+----+----------+----------+
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-09-21|2018-09-29|
|Mike|2018-08-19|2018-08-20|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+
EDIT: Added three new entries to the test data to account for new edge cases.
I cannot rely on the dates being in any particular order.
My best attempt at this so far:
1. Explode each date range into its set of individual days.
2. Union the sets together into one big set of all the days.
3. Sort the set into a list so the days are in order.
4. Aggregate the individual days back into a list of lists of days.
5. Take the first and last day of each list as the new ranges.
The code, such as it is:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import scala.collection.immutable.NumericRange
import java.time.LocalDate
case class MyRange(start:String, end:String)
val combineRanges = udf((ranges: Seq[Row]) => {
ranges.map(r => LocalDate.parse(r(0).toString).toEpochDay to LocalDate.parse(r(1).toString).toEpochDay)
.map(_.toIndexedSeq).reduce(_ ++ _).distinct.toList.sorted
.aggregate(List.empty[Vector[Long]])((ranges:List[Vector[Long]], d:Long) => {
ranges.lastOption.find(_.last + 1 == d) match {
case Some(r:Vector[Long]) => ranges.dropRight(1) :+ (r :+ d)
case None => ranges :+ Vector(d)
}
}, _ ++ _).map(v => MyRange(LocalDate.ofEpochDay(v.head).toString, LocalDate.ofEpochDay(v.last).toString))
})
df.groupBy("name")
.agg(combineRanges(collect_list(struct($"start", $"end"))) as "ranges")
.withColumn("ranges", explode($"ranges"))
.select($"name", $"ranges.start", $"ranges.end")
.show(false)
It seems to work, but it is very ugly and probably wasteful of time and memory.
I was kind of hoping to use the Scala Range class to only notionally explode the date ranges into their individual days, but I've got a feeling the sort operation forces Scala's hand and makes it actually create a list of all the dates in memory.
Does anyone have a better way of doing this?
I think the easiest (and most readable) way is to explode the ranges into individual days and then aggregate back into intervals. As the number of days cannot grow too large, I think exploding is not a bottleneck here. I show a "pure Scala" solution below, which is then used inside a UDF that gets all the intervals from a collect_list aggregation:
import java.time.LocalDate
import java.time.temporal.ChronoUnit
def enumerateDays(start: LocalDate, end: LocalDate) = {
Iterator.iterate(start)(d => d.plusDays(1L))
.takeWhile(d => !d.isAfter(end))
.toList
}
implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)
val combineRanges = udf((data: Seq[Row]) => {
val dateEnumerated =
data
.toSet[Row] // use Set to save memory if many spans overlap
// "explode" date spans into individual days
.flatMap { case Row(start: String, end: String) => enumerateDays(LocalDate.parse(start), LocalDate.parse(end)) }
.toVector
.sorted
// combine subsequent dates into Vectors
dateEnumerated.tail
.foldLeft(Vector(Vector(dateEnumerated.head)))((agg, curr) => {
if (ChronoUnit.DAYS.between(agg.last.last, curr) == 1L) {
agg.init :+ (agg.last :+ curr)
} else {
agg :+ Vector(curr)
}
})
// now get min/max of dates per span
.map(r => (r.min.toString, r.max.toString))
})
df.groupBy("name")
.agg(combineRanges(collect_list(struct($"start", $"end"))) as "ranges")
.withColumn("ranges", explode($"ranges"))
.select($"name", $"ranges._1".as("start"), $"ranges._2".as("end"))
.show(false)
gives
+----+----------+----------+
|name|start |end |
+----+----------+----------+
|Mike|2018-08-19|2018-08-20|
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-09-21|2018-09-29|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+
I think it's also doable with more DataFrame API logic: I would still explode using a UDF, but then use window functions and groupBy to build the new blocks based on the number of days between two dates. But I think the above solution is also OK.
Here is an alternative with DFs and Spark SQL, both non-procedural and procedural by definition. You need to read it carefully and persist.
// Aspects such as caching and re-partitioning for performance are not considered. On the other hand, it all happens under the bonnet with DFs - so they say.
// Functional only.
import org.apache.spark.sql.functions._
import spark.implicits._
import java.time._
import org.apache.spark.sql.functions.{lead, lag}
import org.apache.spark.sql.expressions.Window
def toEpochDay(s: String) = LocalDate.parse(s).toEpochDay
val toEpochDayUdf = udf(toEpochDay(_: String))
val df = Seq(
("Betty","2018-09-05","2018-09-05"), ("Betty","2018-09-05","2018-09-05"),
("Betty","2018-09-05","2018-09-08"), ("Betty","2018-09-07","2018-09-10"),
("Betty","2018-09-07","2018-09-08"), ("Betty","2018-09-06","2018-09-07"),
("Betty","2018-09-10","2018-09-15"), ("Betty","2017-09-10","2017-09-15"),
("XXX","2017-09-04","2017-09-10"), ("XXX","2017-09-10","2017-09-15"),
("YYY","2017-09-04","2017-09-10"), ("YYY","2017-09-11","2017-09-15"),
("Bob","2018-09-01","2018-09-02"), ("Bob","2018-09-04","2018-09-05"),
("Bob","2018-09-06","2018-09-07"), ("Bob","2019-09-04","2019-09-05"),
("Bob","2019-09-06","2019-09-07"), ("Bob","2018-09-08","2018-09-22")
).toDF("name", "start", "end")
// Remove any duplicates - pointless to re-process these!
val df2 = df.withColumn("s", toEpochDayUdf($"start")).withColumn("e", toEpochDayUdf($"end")).distinct
df2.show(false) // The original input
df2.createOrReplaceTempView("ranges")
// Find those records encompassed by a broader time frame and hence not required for processing.
val q = spark.sql(""" SELECT *
FROM ranges r1
WHERE EXISTS (SELECT r2.name
FROM ranges r2
WHERE r2.name = r1.name
AND ((r1.s >= r2.s AND r1.e < r2.e) OR
(r1.e <= r2.e AND r1.s > r2.s))
)
""")
//q.show(false)
val df3 = df2.except(q) // Overlapping or on their own / single range records left.
//df3.show(false)
df3.createOrReplaceTempView("ranges2")
// Find those ranges that have a gap between them and the next adjacent records, before or after, i.e. records that exist on their own and are de facto the first part of the answer.
val q2 = spark.sql(""" SELECT *
FROM ranges2 r1
WHERE NOT EXISTS (SELECT r2.name
FROM ranges2 r2
WHERE r2.name = r1.name
AND ((r2.e >= r1.s - 1 AND r2.s <= r1.s - 1 ) OR
(r2.s <= r1.e + 1 AND r2.e >= r1.e + 1 ))
)
""")
// Store the first set of records that exist on their own with some form of gap - the first part of the overall result set.
val result1 = q2.select("name", "start", "end")
result1.show(false)
// Get the records / ranges that have overlaps to process - the second remaining set of such records to process.
val df4 = df3.except(q2)
//df4.show(false)
//Avoid Serialization errors with lag!
@transient val w = org.apache.spark.sql.expressions.Window.partitionBy("name").orderBy("e")
@transient val lag_y = lag("e", 1, -99999999).over(w)
//df.select(lag_y).map(f _).first
val df5 = df4.withColumn("new_col", lag_y)
//df5.show(false)
// Massage data to get results via easier queries, e.g. avoid issues with correlated sub-queries.
val myExpression = "s - new_col"
val df6 = df5.withColumn("result", when($"new_col" === 0, 0).otherwise(expr(myExpression)))
//df6.show(false)
df6.createOrReplaceTempView("ranges3")
val q3 = spark.sql(""" SELECT *, dense_rank() over (PARTITION BY name ORDER BY start ASC) as RANK
FROM ranges3
WHERE new_col = -99999999 OR result > 1
""")
q3.createOrReplaceTempView("rangesSTARTS")
val q4 = spark.sql(""" SELECT *
FROM ranges3
WHERE result <= 1 AND new_col <> -99999999
""")
q4.createOrReplaceTempView("rangesFOLLOWERS")
val q5 = spark.sql(""" SELECT r1.*, r2.start as next_start
FROM rangesSTARTS r1 LEFT JOIN rangesSTARTS r2
ON r2.name = r1.name
AND r2.RANK = r1.RANK + 1
""")
//q5.show(false)
val q6 = q5.withColumn("end_period", when($"next_start".isNull, "2525-01-01").otherwise($"next_start"))
//q6.show(false)
q6.createOrReplaceTempView("rangesSTARTS2")
// Second and final set of results - the head and tail of such set of range records.
val result2 = spark.sql(""" SELECT r1.name, r1.start, MAX(r2.end) as end
FROM rangesFOLLOWERS r2, rangesSTARTS2 r1
WHERE r2.name = r1.name
AND r2.end >= r1.start
AND r2.end < r1.end_period
GROUP BY r1.name, r1.start """)
result2.show(false)
val finalresult = result1.union(result2)
finalresult.show
returns:
+-----+----------+----------+
| name| start| end|
+-----+----------+----------+
| Bob|2018-09-01|2018-09-02|
|Betty|2017-09-10|2017-09-15|
| YYY|2017-09-04|2017-09-15|
| Bob|2018-09-04|2018-09-22|
| Bob|2019-09-04|2019-09-07|
| XXX|2017-09-04|2017-09-15|
|Betty|2018-09-05|2018-09-15|
+-----+----------+----------+
An interesting contrast: what is better for performance and style? My last such effort for a while. Interested in your comments. You know the programming aspects better than I do, so this question provides some good comparison and some good education. The other solutions use explode, which is not how I saw it.
I think my second approach is better, but still far from perfect. It at least avoids iterating through every day in every date range, although it now processes every range multiple times. I think I'm mostly going to be processing a few large ranges instead of a bunch of little ranges, so maybe that's ok.
Given:
val ranges = Seq(
("Mike","2018-09-01","2018-09-10"),
("Mike","2018-09-05","2018-09-05"),
("Mike","2018-09-12","2018-09-12"),
("Mike","2018-09-11","2018-09-11"),
("Mike","2018-09-25","2018-09-30"),
("Mike","2018-09-21","2018-09-23"),
("Mike","2018-09-24","2018-09-24"),
("Mike","2018-09-14","2018-09-16"),
("Mike","2018-09-15","2018-09-17"),
("Mike","2018-09-05","2018-09-05"),
("Mike","2018-09-19","2018-09-19"),
("Mike","2018-09-19","2018-09-19"),
("Mike","2018-08-19","2018-08-20"),
("Mike","2018-10-01","2018-10-20"),
("Mike","2018-10-10","2018-10-30")
)
val df = ranges.toDF("name", "start", "end")
I want:
+----+----------+----------+
|name|start |end |
+----+----------+----------+
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-21|2018-09-30|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-08-19|2018-08-20|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+
(They are not in order this time. I'm ok with that since it was never a requirement. It just happened to be an artifact of my previous approach)
// very specific helper functions to convert a date string to and from a range
implicit class MyString(s:String) {
def toFilteredInt: Int = s.filter(_.isDigit).toInt
def to(s2:String): Range = s.toFilteredInt to s2.toFilteredInt
// this only works for YYYYMMDD strings. very dangerous.
def toDateStr = s"${s.slice(0,4)}-${s.slice(4,6)}-${s.slice(6,8)}"
}
// helper functions to combine two ranges
implicit class MyRange(r:Range) {
def expand(i: Int): Range = r.head - i * r.step to r.last + i * r.step
def containsPart(r2:Range): Boolean = r.contains(r2.head) || r.contains(r2.last)
def intersects(r2:Range): Boolean = r.containsPart(r2) || r2.containsPart(r)
def combine(r2:Range): Option[Range] = {
if (r.step == r2.step && r.intersects(r2 expand 1)) {
if (r.step > 0) Some(Math.min(r.head, r2.head) to Math.max(r.last, r2.last))
else Some(Math.max(r.head, r2.head) to Math.min(r.last, r2.last))
}
else None
}
def toDateStrTuple: (String,String) = (r.start.toString.toDateStr, r.end.toString.toDateStr)
}
// combines a range to one of the ranges in a sequence if possible;
// adds it to the sequence if it can't be combined.
def addToRanges(rs:Seq[Range], r:Range): Seq[Range] = {
if (rs.isEmpty) Seq(r)
else r.combine(rs.last) match {
case Some(r:Range) => rs.init :+ r
case None => addToRanges(rs.init, r) :+ rs.last
}
}
// tries to combine every range in the sequence with every other range
// does not handle the case where combining two ranges together allows
// them to be combined with a previous range in the sequence.
// however, if we call this and nothing has been combined, we know
// we are done
def collapseOnce(rs:Seq[Range]):Seq[Range] = {
if (rs.size <= 1) rs
else addToRanges(collapseOnce(rs.init), rs.last)
}
// keep collapsing the sequence of ranges until they can't collapse
// any further
def collapseAll(rs:Seq[Range]):Seq[Range] = {
val collapsed = collapseOnce(rs)
if (rs.size == collapsed.size) rs
else collapseAll(collapsed)
}
// now our udf is much simpler
val combineRanges = udf((rows: Seq[Row]) => {
val ranges = rows.map(r => r(0).toString to r(1).toString)
collapseAll(ranges).map(_.toDateStrTuple)
})
df.groupBy("name").agg(combineRanges(collect_list(struct($"start", $"end"))) as "ranges"
).withColumn("ranges", explode($"ranges")
).select($"name", $"ranges._1" as "start", $"ranges._2" as "end").show(false)
Room for improvement:
I'm pretty sure I'd get better performance most of the time if I bailed out of collapseOnce as soon as I found a range to combine. The typical use case will be adding a single-day range to the last range in the sequence.
collapseOnce and addToRanges are not yet tail recursive (see the sketch after this list).
Some of the date-to-string and string-to-date methods in my implicit classes should probably be standalone methods. They are super specific to my problem and don't deserve to be included in general String and Range classes.
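As a sketch of the tail-recursion point: the name addToRangesTR is illustrative; it reuses MyRange.combine from above and should behave identically to addToRanges.
import scala.annotation.tailrec
// Walk the sequence from the right, accumulating the ranges we have
// already skipped, so the recursive call is in tail position.
@tailrec
def addToRangesTR(rs: Seq[Range], r: Range, skipped: List[Range] = Nil): Seq[Range] = {
  if (rs.isEmpty) r +: skipped
  else r.combine(rs.last) match {
    case Some(combined) => (rs.init :+ combined) ++ skipped
    case None => addToRangesTR(rs.init, r, rs.last :: skipped)
  }
}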
What about:
- exploding the date ranges into days with a UDF
- using analytic functions to do the magic
Code:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
def enumerateDays(start: LocalDate, end: LocalDate) = {
Iterator.iterate(start)(d => d.plusDays(1L))
.takeWhile(d => !d.isAfter(end))
.toSeq
}
val udf_enumerateDays = udf( (start:String, end:String) => enumerateDays(LocalDate.parse(start), LocalDate.parse(end)).map(_.toString))
df.select($"name", explode(udf_enumerateDays($"start",$"end")).as("day"))
.distinct
.withColumn("day_prev", lag($"day",1).over(Window.partitionBy($"name").orderBy($"day")))
.withColumn("is_consecutive", coalesce(datediff($"day",$"day_prev"),lit(0))<=1)
.withColumn("group_nb", sum(when($"is_consecutive",lit(0)).otherwise(lit(1))).over(Window.partitionBy($"name").orderBy($"day")))
.groupBy($"name",$"group_nb").agg(min($"day").as("start"), max($"day").as("end"))
.drop($"group_nb")
.orderBy($"name",$"start")
.show
Result:
+----+----------+----------+
|name| start| end|
+----+----------+----------+
|Mike|2018-08-19|2018-08-20|
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
|Mike|2018-09-21|2018-09-29|
|Mike|2018-10-01|2018-10-30|
+----+----------+----------+
Here is one more solution, using DFs and no UDFs:
val df = Seq(
("Mike","2018-09-01","2018-09-10"), // range 1
("Mike","2018-09-05","2018-09-05"), // range 1
("Mike","2018-09-12","2018-09-12"), // range 1
("Mike","2018-09-11","2018-09-11"), // range 1
("Mike","2018-09-25","2018-09-30"), // range 4
("Mike","2018-09-21","2018-09-23"), // range 4
("Mike","2018-09-24","2018-09-24"), // range 4
("Mike","2018-09-14","2018-09-16"), // range 2
("Mike","2018-09-15","2018-09-17"), // range 2
("Mike","2018-09-05","2018-09-05"), // range 1
("Mike","2018-09-19","2018-09-19"), // range 3
("Mike","2018-09-19","2018-09-19") // range 3
).toDF("name", "start", "end").withColumn("start",'start.cast("date")).withColumn("end",'end.cast("date"))
df.printSchema()
val df2 = df.as("t1").join(df.as("t2"), $"t1.start" =!= $"t2.start" and $"t1.end" =!= $"t2.end")
.withColumn("date_diff_start",datediff($"t1.start",$"t2.start"))
.withColumn("date_diff_end",datediff($"t1.end",$"t2.end"))
.withColumn("n1_start",when('date_diff_start===1,$"t2.start"))
.withColumn("n1_end",when('date_diff_end === -1,$"t2.end"))
.filter( 'n1_start.isNotNull or 'n1_end.isNotNull)
.withColumn( "new_start", when('n1_start.isNotNull, $"n1_start").otherwise($"t1.start"))
.withColumn( "new_end", when('n1_end.isNotNull, $"n1_end").otherwise($"t1.end"))
.select("t1.name","new_start","new_end")
.distinct
val df3= df2.alias("t3").join(df2.alias("t4"),$"t3.name" === $"t4.name")
.withColumn("x1",when($"t3.new_end"=== $"t4.new_start",1)
.when($"t3.new_start" === $"t4.new_end",1)
.otherwise(0))
.groupBy("t3.name","t3.new_start","t3.new_end")
.agg( min( when('x1===1,$"t4.new_start" ).otherwise($"t3.new_start") ).as("ns"), max(when('x1===1,$"t4.new_end").otherwise($"t3.new_end")).as("ne"))
.select("t3.name","ns","ne")
.distinct
df3.show(false)
val num_combinations = df3.count
val df4 = df.filter('start==='end).distinct.select("name","start").alias("dup")
.join(df3.alias("d4"), $"d4.name"===$"dup.name" , "leftOuter")
.withColumn("cond", ! $"dup.start".between($"ns" , $"ne"))
.filter('cond)
.groupBy("d4.name","start" ).agg(count($"start").as("count"),collect_set('start).as("dup_s1"))
.filter('count===num_combinations)
.withColumn("start",explode('dup_s1))
.withColumn("end",'start)
.select("name","start","end")
df3.union(df4).show(false)
Results:
+----+----------+----------+
|name|ns |ne |
+----+----------+----------+
|Mike|2018-09-21|2018-09-30|
|Mike|2018-09-01|2018-09-12|
|Mike|2018-09-14|2018-09-17|
|Mike|2018-09-19|2018-09-19|
+----+----------+----------+