Collapse decimal ranges in Spark DataFrame column - scala

I'm new to Spark and Scala.
I have a DataFrame with IP ranges in one column and some notes in another.
Some of the ranges (rows) need to be collapsed into one range (and the redundant rows dropped). Basically, go row by row in pairs and check whether each pair can be collapsed.
Here's a function that does it with lists:
case class Range(start: BigInt, end: BigInt)

import annotation.tailrec

@tailrec
final def collapse(rs: List[Range], out: List[Range] = Nil): List[Range] = rs match {
  case prev :: cur :: rest =>
    if (prev.end >= cur.start) collapse(Range(prev.start, prev.end max cur.end) :: rest, out)
    else if (cur.start - prev.end == 1) collapse(Range(prev.start, cur.end) :: rest, out)
    else collapse(cur :: rest, prev :: out)
  case _ => (rs ::: out).reverse
}
def mergeRanges(rs: List[Range]): List[Range] = collapse(rs.sortBy(_.start))
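For example, on the two sample rows shown below it produces a single merged range (the two ranges are adjacent, so the cur.start - prev.end == 1 branch fires):

mergeRanges(List(Range(BigInt(1688733440), BigInt(1688733441)),
                 Range(BigInt(1688733442), BigInt(1688733443))))
// => List(Range(1688733440, 1688733443))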
I need to apply it to the DataFrame and I'm stuck there :(
Input DF:

Ranges                     Notes
[1688733440,1688733441]    "100.168.7.0/31"
[1688733442,1688733443]    "100.168.7.2/31"

Output DF:

Ranges                     Notes
[1688733440,1688733443]    "100.168.7.0/30"

Related

scala - reset to 1 when the duplicated value changes in a list

I'm trying to generate sequence numbers for duplicated elements. The number should reset to 1 when the value changes:
val dt = List("date", "date", "decimal", "decimal", "decimal", "string", "string")
var t = 0
dt.sorted.map( x => {t=t+1; (x,t)} )
This gives result as
List((date,1), (date,2), (decimal,3), (decimal,4), (decimal,5), (string,6), (string,7))
But what I expect is to get it as
List((date,1), (date,2), (decimal,1), (decimal,2), (decimal,3), (string,1), (string,2))
How do I reset t to 1 when the value changes in my list?
Are there better methods to get the above output?
The best method to use for this is scanLeft, which is like foldLeft but emits a value at each step. The code looks like this:
val ds = dt.sorted
ds.tail.scanLeft((ds.head, 1)) {
  case ((prev, n), cur) if prev == cur => (cur, n + 1)
  case (_, cur) => (cur, 1)
}
At each step it increments the count if the value is the same as the previous, otherwise it resets it to 1.
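For the sorted sample list this produces the expected output:

List((date,1), (date,2), (decimal,1), (decimal,2), (decimal,3), (string,1), (string,2))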
This will work if the list has a single element: although tail will be Nil, the first element in the result of scanLeft is always the first parameter to the method, in this case (ds.head, 1).
This will not work if the list is empty, as ds.head will throw an exception. This can be fixed by using a match first:
ds match {
  case head :: tail =>
    tail.scanLeft((head, 1)) {
      case ((prev, n), cur) if prev == cur => (cur, n + 1)
      case (_, cur) => (cur, 1)
    }
  case _ => Nil
}
To reset the counter you need to look back at the previous element, which .map() can't do.
dt.foldLeft(List.empty[(String, Int)]) { case (lst, str) =>
  lst.headOption.fold((str, 1) :: Nil) {
    case (`str`, cnt) => (str, cnt + 1) :: lst
    case _ => (str, 1) :: lst
  }
}.reverse
//res0: List[(String, Int)] = List((date,1), (date,2), (decimal,1), (decimal,2), (decimal,3), (string,1), (string,2))
Explanation:

foldLeft - consider the dt elements, one at a time, left to right
List.empty[(String,Int)] - we'll build a List of tuples, starting with an empty list
case (lst, str) - the list we're building and the current String element from dt
lst.headOption - get the head of the list, if it exists
fold((str,1)::Nil) - if lst is empty, return a new list with a single element
case (`str`, cnt) - the head's string element is the same as the current dt element
(str, cnt+1) :: lst - so add a new element, with an incremented count, to the list
case _ - the head's string element is different from the current dt element
(str, 1) :: lst - so add a new element, with count 1, to the list
.reverse - we've built the results in reverse order, so reverse it
Hope this helps.
scala> val dt = List("date", "date", "decimal", "decimal", "decimal", "string", "string")
dt: List[String] = List(date, date, decimal, decimal, decimal, string, string)
scala> val dtset = dt.toSet
dtset: scala.collection.immutable.Set[String] = Set(date, decimal, string)
scala> dtset.map( x => dt.filter( y => y == x))
res41: scala.collection.immutable.Set[List[String]] = Set(List(date, date), List(decimal, decimal, decimal), List(string, string))
scala> dtset.map( x => dt.filter( y => y == x)).flatMap(a => a.zipWithIndex)
res42: scala.collection.immutable.Set[(String, Int)] = Set((string,0), (decimal,1), (decimal,0), (string,1), (date,0), (date,1), (decimal,2))
scala> dtset.map( x => dt.filter( y => y == x)).flatMap(a => a.zipWithIndex).toList
res43: List[(String, Int)] = List((string,0), (decimal,1), (decimal,0), (string,1), (date,0), (date,1), (decimal,2)) // sort this list to your needs
By adding one more mutable variable to track the previous value, the version below works.
val dt = List("date", "date", "decimal", "decimal", "decimal", "string","string")
var t = 0
var s = ""
val dt_seq = dt.sorted.map( x => { t = if (s != x) 1 else t + 1; s = x; (x, t) } )
Results:
dt_seq: List[(String, Int)] = List((date,1), (date,2), (decimal,1), (decimal,2), (decimal,3), (string,1), (string,2))
Another way is to use groupBy(identity) and derive the indices from the map values:
val dt = List("date", "date", "decimal", "decimal", "decimal", "string","string")
val dtg = dt.groupBy(identity).map( x => (x._2 zip x._2.indices.map(_+1)) ).flatten.toList
which results in
dtg: List[(String, Int)] = List((decimal,1), (decimal,2), (decimal,3), (date,1), (date,2), (string,1), (string,2))
Thanks to @Leo: instead of indices, you can use Stream from 1 with zip, which gives the same results.
val dtg = dt.groupBy(identity).map( x => (x._2 zip (Stream from 1)) ).flatten.toList
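Note that groupBy returns an unordered Map, so the order of dtg can differ between runs. Scala's sortBy is stable, so sorting by the string keeps each group's counts in ascending order, for instance:

val sortedDtg = dtg.sortBy(_._1)
// List((date,1), (date,2), (decimal,1), (decimal,2), (decimal,3), (string,1), (string,2))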

Scala partition sorted list elements based on distance

I am new to Scala and functional programming. I have a task where I want to partition a Scala list into a list of sub-lists such that the distance between consecutive elements in any sub-list is less than 2. I found some code online that does this, but I don't understand how it works internally. Can someone give a detailed explanation?
def partition(input: List[Int], prev: Int,
              splits: List[List[Int]]): List[List[Int]] = {
  input match {
    case Nil => splits
    case h :: t if h - prev < 2 => partition(t, h, (h :: splits.head) :: splits.tail)
    case h :: t => partition(t, h, List(h) :: splits)
  }
}
val input = List(1,2,3,5,6,7,10)
partition(input,input.head,List(List.empty[Int]))
The result is as follows:
List[List[Int]] = List(List(10), List(7, 6, 5), List(3, 2, 1))
which is the desired outcome.
This code assumes the original list is sorted from smallest to largest.
It works recursively: in each call, input is what is still left of the list, prev holds the previously consumed head of the list (initially input.head), and splits holds the splits built so far.
In each call we pattern-match on input (what's left of the list):
if it is empty (Nil), we have finished the split and we return splits;
the other two cases break the input into a head and a tail (h and t respectively);
the second case uses a guard condition (the if) to check whether the head of the input belongs in the latest split, and if it does, prepends it to that split;
the last case starts a new split.
def partition(input: List[Int]          // a sorted List of Ints
             ,prev: Int                 // Int previously added to the accumulator
             ,splits: List[List[Int]]   // accumulator of Ints for eventual output
             ): List[List[Int]] = {     // the output (same type as accumulator)
  input match {                         // what does input look like?
    case Nil => splits                  // input is empty, return the accumulator
    // input has a head and tail, head is close to the previous Int
    case h :: t if h - prev < 2 =>
      // start again with new input (current tail), new previous (current head),
      // and the current head inserted into the accumulator
      partition(t, h, (h :: splits.head) :: splits.tail)
    // input has a head and tail, head is not close to the previous Int
    case h :: t =>
      // start again with new input (current tail), new previous (current head),
      // and the current head starting a new sub-list in the accumulator
      partition(t, h, List(h) :: splits)
  }
}
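If you want the sub-lists and their elements in ascending order, a small wrapper (a sketch, not part of the original code) can sort the input first and reverse the accumulated lists:

def partitionAscending(xs: List[Int]): List[List[Int]] =
  xs.sorted match {
    case Nil => Nil
    case sorted @ (h :: _) =>
      // run the original partition, then undo the prepend-reversal
      partition(sorted, h, List(List.empty[Int])).map(_.reverse).reverse
  }

partitionAscending(List(10, 1, 2, 3, 5, 6, 7))
// => List(List(1, 2, 3), List(5, 6, 7), List(10))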

RDD/Scala Get one column from RDD

I have an RDD[Log] with various fields (username, content, date, bytes) and I want to find different things for each field/column.
For example, I want to get the min/max and average bytes found in the RDD. When I do:
val q1 = cleanRdd.filter(x => x.bytes != 0)
I get the full lines of the RDD with bytes != 0. But how can I actually sum them, calculate the average, find the min/max etc.? How can I take only one column from my RDD and apply transformations on it?
EDIT: Prasad told me about changing the type to a DataFrame; he gave no instructions on how to do so, though, and I can't find a solid answer on the site. Any help would be great.
EDIT: the Log class:
case class Log (username: String, date: String, status: Int, content: Int)
Running cleanRdd.take(5).foreach(println) gives something like this:
Log(199.72.81.55 ,01/Jul/1995:00:00:01 -0400,200,6245)
Log(unicomp6.unicomp.net ,01/Jul/1995:00:00:06 -0400,200,3985)
Log(199.120.110.21 ,01/Jul/1995:00:00:09 -0400,200,4085)
Log(burger.letters.com ,01/Jul/1995:00:00:11 -0400,304,0)
Log(199.120.110.21 ,01/Jul/1995:00:00:11 -0400,200,4179)
Well... you have a lot of questions.
So... you have the following abstraction of a Log
case class Log(username: String, date: String, status: Int, content: Int, bytes: Int)
Que - How can I take only one column from my RDD?
Ans - RDDs have a map function. For an RDD[A], map takes a transform function of type A => B and turns it into an RDD[B].
val logRdd: RDD[Log] = ...
val byteRdd = logRdd
  .filter(l => l.bytes != 0)
  .map(l => l.bytes)
Que - How can I actually sum them?
Ans - You can do it by using reduce / fold / aggregate.
val sum = byteRdd.reduce((acc, b) => acc + b)
val sum = byteRdd.fold(0)((acc, b) => acc + b)
val sum = byteRdd.aggregate(0)(
  (acc, b) => acc + b,
  (acc1, acc2) => acc1 + acc2
)
Note: an important thing to notice here is that the sum of Ints can grow bigger than what an Int can handle, so in most real-life cases we should use at least a Long as the accumulator. Since reduce and fold require the accumulator to have the same type as the RDD's elements, that removes them as options, and we are left with aggregate only.
val sum = byteRdd.aggregate(0L)(
  (acc, b) => acc + b,
  (acc1, acc2) => acc1 + acc2
)
Now, if you have to calculate multiple things like min, max and avg, then I suggest calculating them in a single aggregate instead of several separate passes, like this:
// (count, sum, min, max); the sum is a Long, per the note above
val accInit = (0, 0L, Int.MaxValue, Int.MinValue)
val (count, sum, min, max) = byteRdd.aggregate(accInit)(
  { case ((count, sum, min, max), b) =>
      (count + 1, sum + b, Math.min(min, b), Math.max(max, b)) },
  { case ((count1, sum1, min1, max1), (count2, sum2, min2, max2)) =>
      (count1 + count2, sum1 + sum2, Math.min(min1, min2), Math.max(max1, max2)) }
)
val avg = sum.toDouble / count
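If you don't want to hand-roll the tuple, Spark also ships a stats() method on RDD[Double] (via an implicit conversion to DoubleRDDFunctions) that computes all of this in one pass; a short sketch, reusing byteRdd from above:

val st = byteRdd.map(_.toDouble).stats()  // returns an org.apache.spark.util.StatCounter
// st.count, st.sum, st.min, st.max, st.mean (and st.stdev) are all available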
Have a look at the DataFrame API. You need to convert your RDD to a DataFrame, and then you can use the min, max and avg functions like below:
val rdd = cleanRdd.filter(x => x.bytes != 0)
import sparkSession.implicits._ // enables .toDF on an RDD of case classes
val df = rdd.toDF()
Assuming you want to operate on the bytes column:
import org.apache.spark.sql.functions._
df.select(avg("bytes")).show
df.select(min("bytes")).show
df.select(max("bytes")).show
Update: tried the following in spark-shell:
case class Log (username: String, date: String, status: Int, content: Int)
val inputRDD = sc.parallelize(Seq(
  Log("199.72.81.55", "01/Jul/1995:00:00:01 -0400", 200, 6245),
  Log("unicomp6.unicomp.net", "01/Jul/1995:00:00:06 -0400", 200, 3985),
  Log("199.120.110.21", "01/Jul/1995:00:00:09 -0400", 200, 4085),
  Log("burger.letters.com", "01/Jul/1995:00:00:11 -0400", 304, 0),
  Log("199.120.110.21", "01/Jul/1995:00:00:11 -0400", 200, 4179)))
val rdd = inputRDD.filter(x => x.content != 0)
val df = rdd.toDF("username", "date", "status", "content")
df.printSchema
import org.apache.spark.sql.functions._
df.select(avg("content")).show
df.select(min("content")).show
df.select(max("content")).show

Filling date gaps in Anorm results

I'm new to Scala, Play and Anorm, so I'm wondering how I can do this.
I have a query to my database which returns a date, in DD/MM HH:00 format, and a Long, which is a total.
I want to display a total per hour graph, so I create a byhour parser:
val byhour = {
  get[Option[String]]("date") ~ get[Long]("total") map {
    case date ~ total => (date, total)
  }
}
And this, of course, only returns the dates for which I have data. I want to fill the date gaps with the date and a total of 0, but I'm not sure how to do it.
Thanks in advance!
edit: I know it's possible to do this in MySQL, but I'd prefer to do this in Scala itself to keep the queries clean.
I don't think this is related to Anorm directly; Anorm gives you the parsed results, and you can fill the gaps among them afterward.
First option: get the unordered result as a List[(String, Long)] using .as(byhour.*), sort it by date, and then fill in a zero total for each missing date.
SQL"...".as(byhour.*).sortBy(_._1).
foldLeft(List.empty[(String, Long)]) {
case (p :: l, (d, t)) =>
(d, t) :: prefill(p, d, l)
case (l, (d, t)) =>
(d, t) :: l // assert l == Nil
}.reverse
/**
 * @param p Previous/last tuple
 * @param d Current/new date
 * @param l List except `p`
 * @return List based on `l`, with `p` prepended, possibly preceded by some filler tuples.
 */
def prefill(p: (String, Long), d: String, l: List[(String, Long)]): List[(String, Long)] = ???
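For illustration only (this sketch is not from the original answer): if the date keys were plain Int hours rather than formatted strings, prefill could look like the following; adapting it to the DD/MM HH:00 strings needs a parser and formatter.

def prefill(p: (Int, Long), d: Int, l: List[(Int, Long)]): List[(Int, Long)] = {
  // hours strictly between the previous entry p and the new date d,
  // newest first, each with a total of 0
  val fillers = ((p._1 + 1) until d).reverse.map(h => (h, 0L)).toList
  fillers ::: p :: l
}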
Otherwise, if your query returns results ordered by date, you can use the Anorm streaming API and fill each gap as soon as it's discovered.
// Anorm 2.3
import anorm.Success

SQL"... ORDER BY date ASC".apply().
  foldLeft(List.empty[(String, Long)]) {
    case (l, row) =>
      byhour(row) match {
        case Success((d, t)) =>
          l match {
            case p :: ts => (d, t) :: prefill(p, d, ts)
            case _       => (d, t) :: l
          }
        case _ => ??? // parse error
      }
  }.reverse

How to functionally merge overlapping number-ranges from a List

I have a number of range-objects which I need to merge so that all overlapping ranges disappear:
case class Range(from:Int, to:Int)
val rangelist = List(Range(3, 40), Range(1, 45), Range(2, 50), etc)
Here are the ranges:
3 40
1 45
2 50
70 75
75 90
80 85
100 200
Once finished we would get:
1 50
70 90
100 200
Imperative algorithm:
Pop() the first range-obj and iterate through the rest of the list, comparing it with each of the other ranges.
If there is an overlapping item, merge them together (this yields a new Range instance) and delete the two merge-candidates from the source list.
At the end of the list, add the Range object (which may have changed numerous times through merging) to the final result list.
Repeat this with the next of the remaining items.
Once the source list is empty, we're done.
To do this imperatively one must create a lot of temporary variables, indexed loops, etc.
So I'm wondering if there is a more functional approach?
At first sight the source collection must be able to act like a stack, providing pop() plus the ability to delete items by index while iterating over it, but then that would not be very functional anymore.
Try tail recursion. (The annotation is needed only to warn you if tail-recursion optimization doesn't happen; the compiler will do it if it can, whether you annotate or not.)
import annotation.{tailrec => tco}

@tco final def collapse(rs: List[Range], sep: List[Range] = Nil): List[Range] = rs match {
  case x :: y :: rest =>
    if (y.from > x.to) collapse(y :: rest, x :: sep)
    else collapse(Range(x.from, x.to max y.to) :: rest, sep)
  case _ =>
    (rs ::: sep).reverse
}
def merge(rs: List[Range]): List[Range] = collapse(rs.sortBy(_.from))
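Applied to the ranges from the question, this gives the expected result:

merge(List(Range(3, 40), Range(1, 45), Range(2, 50), Range(70, 75),
           Range(75, 90), Range(80, 85), Range(100, 200)))
// => List(Range(1,50), Range(70,90), Range(100,200))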
I love these sorts of puzzles:
case class Range(from: Int, to: Int) {
  assert(from <= to)
  /** Returns true if the given Range is completely contained in this range */
  def contains(rhs: Range) = from <= rhs.from && rhs.to <= to
  /** Returns true if the given value is contained in this range */
  def contains(v: Int) = from <= v && v <= to
}
def collapse(rangelist: List[Range]) =
  // sorting the list puts overlapping ranges adjacent to one another in the list
  // foldLeft runs a function on successive elements. it's a great way to process
  // a list when the results are not a 1:1 mapping.
  rangelist.sortBy(_.from).foldLeft(List.empty[Range]) { (acc, r) =>
    acc match {
      case head :: tail if head.contains(r) =>
        // r completely contained; drop it
        head :: tail
      case head :: tail if head.contains(r.from) =>
        // partial overlap; expand head to include both head and r
        Range(head.from, r.to) :: tail
      case _ =>
        // no overlap; prepend r to list
        r :: acc
    }
  }
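Because results are prepended, the folded list comes out with the newest range first; reverse it if you want ascending order:

collapse(List(Range(3, 40), Range(1, 45), Range(2, 50), Range(70, 75),
              Range(75, 90), Range(80, 85), Range(100, 200))).reverse
// => List(Range(1,50), Range(70,90), Range(100,200))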
Here's my solution:
def merge(ranges: List[Range]) = ranges
  .sortWith { (a, b) => a.from < b.from || (a.from == b.from && a.to < b.to) }
  .foldLeft(List[Range]()) { (buildList, range) =>
    buildList match {
      case Nil => List(range)
      case head :: tail =>
        if (head.to >= range.from) Range(head.from, head.to.max(range.to)) :: tail
        else range :: buildList
    }
  }
  .reverse
merge(List(Range(1, 3), Range(4, 5), Range(10, 11), Range(1, 6), Range(2, 8)))
//List[Range] = List(Range(1,8), Range(10,11))
I ran into this need for Advent of Code 2022, Day 15, where I needed to merge a list of inclusive ranges (scala.Range). I had to slightly modify the tail-recursive solution above for inclusiveness:
import annotation.{tailrec => tco}

@tco final def collapse(rs: List[Range], sep: List[Range] = Nil): List[Range] = rs match {
  case x :: y :: rest =>
    if (y.start - 1 > x.end) collapse(y :: rest, x :: sep)
    else collapse(Range.inclusive(x.start, x.end max y.end) :: rest, sep)
  case _ =>
    (rs ::: sep).reverse
}
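For instance, inclusive ranges that merely touch end-to-end are now merged:

collapse(List(Range.inclusive(1, 2), Range.inclusive(5, 6), Range.inclusive(3, 5)).sortBy(_.start))
// => List(Range.inclusive(1, 6))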