Grouping list items by comparing them with their neighbors - scala

What is the most elegant way of grouping a list of values into groups based on their neighbor values?
The wider context: I have a list of lines that need to be grouped into paragraphs. I want to be able to say that if the vertical difference between two lines is lower than a threshold, they belong to the same paragraph.
I ended up solving this problem differently, but I'm wondering about the correct solution here.
case class Box(y: Int)
val list = List(Box(y=1), Box(y=2), Box(y=5))
def group(list: List[Box], threshold: Int): List[List[Box]] = ???
val grouped = group(list, 2)
> List(List(Box(y=1), Box(y=2)), List(Box(y=5)))
I have looked at groupBy(), but that can only work with one element at a time. I have also tried an approach that involved pre-computing differences using sliding(), but then it becomes awkward to retrieve the elements from the original collection.

It's a one-liner. Generalising the types is left as an exercise for the reader.
Ints and absolute difference are used here rather than lines and spacing, to avoid clutter.
val zs = List(1, 2, 4, 8, 9, 10, 15, 16)
def closeEnough(a: Int, b: Int) = Math.abs(b - a) <= 2

zs.drop(1)
  .foldLeft(List(List(zs.head))) { (acc, e) =>
    if (closeEnough(e, acc.head.head)) (e :: acc.head) :: acc.tail
    else List(e) :: acc
  }
  .map(_.reverse)
  .reverse
// List(List(1, 2, 4), List(8, 9, 10), List(15, 16))
Or a two-liner for a slight efficiency gain, reversing up front so the two reversals at the end are not needed:
val ys = zs.reverse
ys.drop(1).foldLeft(List(List(ys.head))) { (acc, e) =>
  if (closeEnough(e, acc.head.head)) (e :: acc.head) :: acc.tail
  else List(e) :: acc
}
// List(List(1, 2, 4), List(8, 9, 10), List(15, 16))
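For reference, here is the same fold adapted to the Box/threshold signature from the question. This is a sketch of the adaptation rather than part of the original answer:

case class Box(y: Int)

// Sketch: group boxes whose successive y values differ by at most `threshold`.
def group(list: List[Box], threshold: Int): List[List[Box]] = list match {
  case Nil => Nil
  case head :: tail =>
    tail.foldLeft(List(List(head))) { (acc, b) =>
      if (Math.abs(b.y - acc.head.head.y) <= threshold) (b :: acc.head) :: acc.tail
      else List(b) :: acc
    }.map(_.reverse).reverse
}

group(List(Box(1), Box(2), Box(5)), 2)
// List(List(Box(1), Box(2)), List(Box(5)))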

Related

Functional Programming way to calculate something like a rolling sum

Let's say I have a list of numerics:
val list = List(4,12,3,6,9)
For every element in the list, I need to find the rolling sum, i.e. the final output should be:
List(4, 16, 19, 25, 34)
Is there any transformation that allows us to take as input two elements of the list (the current and the previous) and compute based on both?
Something like map(initial)((curr,prev) => curr+prev)
I want to achieve this without maintaining any shared global state.
EDIT: I would like to be able to do the same kinds of computation on RDDs.
You may use scanLeft
list.scanLeft(0)(_ + _).tail
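For the list from the question this produces the expected rolling sums; a quick REPL check:
scala> List(4, 12, 3, 6, 9).scanLeft(0)(_ + _).tail
res0: List[Int] = List(4, 16, 19, 25, 34)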
The cumSum method below should work for any RDD[N], where N has an implicit Numeric[N] available, e.g. Int, Long, BigInt, Double, etc.
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
def cumSum[N : Numeric : ClassTag](rdd: RDD[N]): RDD[N] = {
  val num = implicitly[Numeric[N]]
  val nPartitions = rdd.partitions.length

  // Total of each partition (the last one is not needed), scanned into
  // running offsets: partitionCumSums(i) = sum of partitions 0 until i.
  val partitionCumSums = rdd.mapPartitionsWithIndex((index, iter) =>
      if (index == nPartitions - 1) Iterator.empty
      else Iterator.single(iter.foldLeft(num.zero)(num.plus))
    ).collect
    .scanLeft(num.zero)(num.plus)

  // Prefix-scan each partition, seeded with that partition's offset.
  rdd.mapPartitionsWithIndex((index, iter) =>
    if (iter.isEmpty) iter
    else {
      val start = num.plus(partitionCumSums(index), iter.next)
      iter.scanLeft(start)(num.plus)
    }
  )
}
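A quick usage sketch, assuming a SparkContext named sc is in scope (the partition count below is arbitrary):

val rdd = sc.parallelize(Seq(4, 12, 3, 6, 9), numSlices = 3) // `sc` assumed available
cumSum(rdd).collect()
// Array(4, 16, 19, 25, 34)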
It should be fairly straightforward to generalize this method to any associative binary operator with a "zero" (i.e. any monoid.) It is the associativity that is key for the parallelization. Without this associativity you're generally going to be stuck with running through the entries of the RDD in a serial fashion.
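As a sketch of that generalization, a hypothetical scanRDD could take the zero and the associative operation as parameters, with the same structure as cumSum above:

// Hypothetical generalization: prefix-scan an RDD over any monoid (zero, associative op).
def scanRDD[N: ClassTag](rdd: RDD[N])(zero: N, op: (N, N) => N): RDD[N] = {
  val nPartitions = rdd.partitions.length
  val offsets = rdd.mapPartitionsWithIndex((index, iter) =>
      if (index == nPartitions - 1) Iterator.empty
      else Iterator.single(iter.foldLeft(zero)(op))
    ).collect
    .scanLeft(zero)(op)
  rdd.mapPartitionsWithIndex((index, iter) =>
    if (iter.isEmpty) iter
    else {
      val start = op(offsets(index), iter.next)
      iter.scanLeft(start)(op)
    }
  )
}

// e.g. scanRDD(rdd)(0, _ + _) recovers the running sum for an RDD[Int].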
I don't know what functionalities are supported by Spark RDDs, so I am not sure whether this satisfies your conditions, because I don't know if zipWithIndex is supported (if the answer is not helpful, please let me know in a comment and I will delete my answer):
list.zipWithIndex.map { x => list.take(x._2 + 1).sum }
This code works for me: for each list element it takes the element's index and sums the first index + 1 elements of the list (notice the +1, since zipWithIndex starts at 0).
When printing it, I get the following:
List(4, 16, 19, 25, 34)

Split sorted Scala Sequence/Array according to gaps between elements [duplicate]

This is the same question and answer as "Grouping list items by comparing them with their neighbors" above.

How to traverse array from both left to right and from right to left?

Suppose I have an imperative algorithm that keeps two indices, left and right, and moves them towards each other, left from left to right and right from right to left:
var left = 0
var right = array.length - 1
while (left < right) { .... } // move left and right inside the loop
Now I would like to write this algorithm without mutable indices.
How can I do that ? Do you have any examples of such algorithms ? I would prefer a non-recursive approach.
You can zip your list with its reverse to get pairs of elements, then go from left to right through that list of pairs and keep taking as long as your condition is satisfied:
val list = List(1, 2, 3, 4, 5)
val zipped = list zip list.reverse
val filtered = zipped takeWhile { case (a, b) => (a < b) }
The value of filtered is List((1, 5), (2, 4)).
Now you can do whatever you need with those elements:
val result = filtered map {
  case (a, b) =>
    // do something with each left-right pair, e.g. sum them
    a + b
}
println(result) // List(6, 6)
If you need some kind of context-dependent operation (that is, each iteration depends on the result of the previous one) then you have to use a more powerful abstraction (a monad), but let's not go there if this is enough for you. Even better would be to simply use recursion, as pointed out by others, but you said that's not an option.
EDIT:
Version without the extra pass for reversing, using indexed access for elem(length - 1 - index) instead (constant-time on an IndexedSeq such as Vector or Array; linear on a List):
val list = List(1, 2, 3, 4, 5)
val zipped = list.view.zipWithIndex

val filtered = zipped takeWhile { case (a, index) => a < list(list.length - 1 - index) }
println(filtered.toList) // List((1, 0), (2, 1))

val result = filtered map { case (elem, index) =>
  // do something with each left-right pair, e.g. sum them
  val (a, b) = (elem, list(list.length - 1 - index))
  a + b
}
println(result.toList) // List(6, 6)
Use reverseIterator:
scala> val arr = Array(1,2,3,4,5)
arr: Array[Int] = Array(1, 2, 3, 4, 5)
scala> arr.iterator.zip(arr.reverseIterator).foreach(println)
(1,5)
(2,4)
(3,3)
(4,2)
(5,1)
reverseIterator is efficient on IndexedSeq collections, to which Array is implicitly convertible.
It really depends on what needs to be done at each iteration, but here's something to think about.
array.foldRight(0) { case (elem, index) =>
  if (index < array.length / 2) {
    /* array(index) and elem are opposite elements in the array */
    /* do whatever (note: requires side effects) */
    index + 1
  } else index // do nothing
} // ignore result
Upside: Traverse the array only once and no mutable variables.
Downside: Requires side effects (but that was implied in your example). Also, it'd be better if it traversed only half the array, but that would require early breakout and Scala doesn't offer an easy/elegant solution for that.
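One way to visit only the first half without mutable indices or an early break is to iterate over an index range and look up the mirror element directly; a small sketch of mine, not from the answers above:

val array = Array(1, 2, 3, 4, 5)

// Pair each index in the first half with its mirror element from the right.
val pairs = (0 until array.length / 2).map(i => (array(i), array(array.length - 1 - i)))
// Vector((1,5), (2,4))

pairs.map { case (a, b) => a + b } // Vector(6, 6)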
myarray = [1, 2, 3, 4, 5, 6]
rmyarray = myarray[::-1]
Final_Result = []
for i in range(len(myarray) // 2):
    Final_Result.append(myarray[i])
    Final_Result.append(rmyarray[i])
print(Final_Result)
# This is the simple approach I think 😉.

Scala Seq.sliding() violating the docs rationale?

When writing tests for some part of my system I found some weird behavior, which upon closer inspection boils down to the following:
scala> List(0, 1, 2, 3).sliding(2).toList
res36: List[List[Int]] = List(List(0, 1), List(1, 2), List(2, 3))
scala> List(0, 1, 2).sliding(2).toList
res37: List[List[Int]] = List(List(0, 1), List(1, 2))
scala> List(0, 1).sliding(2).toList
res38: List[List[Int]] = List(List(0, 1))
scala> List(0).sliding(2).toList //I mean the result of this line
res39: List[List[Int]] = List(List(0))
To me it seems like List.sliding(), and the sliding() implementations for a number of other types, are violating the guarantees given in the docs:
def sliding(size: Int): Iterator[List[A]]
Groups elements in fixed size blocks by passing a "sliding window"
over them (as opposed to partitioning them, as is done in grouped.)
size: the number of elements per group
returns: An iterator producing lists of size size, except the last and the only element will be truncated if there are fewer
elements than size.
From what I understand there is a guarantee that all the lists that can be iterated over using the iterator returned by sliding(2) will be of length 2. I find it hard to believe that this is a bug that got all the way to the current version of Scala, so perhaps there's an explanation for this, or I'm misunderstanding the docs?
I'm using "Scala version 2.10.3 (OpenJDK 64-Bit Server VM, Java 1.7.0_25)."
No, there is no such guarantee, and you pretty much emphasized the doc line that explicitly says so. Here it is again, with a different emphasis:
returns: An iterator producing lists of size size, except the last and
the only element will be truncated if there are fewer elements than
size.
So if you have a list of length n and call .sliding(m), where m > n, the last (and only) element of the result will have length n.
In the case of:
List(0).sliding(2)
there is only one element (n = 1) and you call sliding(2), i.e. m = 2. Since 2 > 1, the last (and only) element of the result is truncated to length 1.
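If you only ever want full-size windows, one option (a sketch of my own, not from the answer) is to drop the short trailing window explicitly:

List(0).sliding(2).filter(_.size == 2).toList       // List()
List(0, 1, 2).sliding(2).filter(_.size == 2).toList // List(List(0, 1), List(1, 2))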

Seq with maximal elements

I have a Seq and a function Int => Int. What I need to achieve is to take from the original Seq only those elements whose mapped value equals the maximum of the resulting sequence (the one I'll have after applying the given function):
def mapper: Int => Int = x => x * x
val s = Seq(-2, -2, 2, 2)

val themax = s.map(mapper).max
s.filter(mapper(_) == themax)
But this seems wasteful, since it has to map twice (once for the filter and once for the maximum).
Is there a better way to do this? (hopefully without an explicit loop)
EDIT
The code has since been edited; in the original, the filter line was s.filter( mapper(_) == s.map(mapper).max ). As om-nom-nom has pointed out, this evaluates s.map(mapper).max on each filter iteration, leading to quadratic complexity.
Here is a solution that does the mapping only once, using the foldLeft function:
The principle is to go through the seq and, for each element, look at its mapped value: if it is greater than every mapped value seen so far, begin a new sequence with it; if it is equal to the current maximum, append it to the sequence of maximums collected so far; and if it is less, keep the previously computed Seq of maximums unchanged.
def getMaxElems1(s: Seq[Int])(mapper: Int => Int): Seq[Int] =
  s.foldLeft(Seq[(Int, Int)]()) { (res, elem) =>
    val e2 = mapper(elem)
    if (res.isEmpty || e2 > res.head._2) Seq((elem, e2))
    else if (e2 == res.head._2) res ++ Seq((elem, e2))
    else res
  }.map(_._1) // keep only the original elements
// test with your list
scala> getMaxElems1(s)(mapper)
res14: Seq[Int] = List(-2, -2, 2, 2)
//test with a list containing also non maximal elements
scala> getMaxElems1(Seq(-1, 2,0, -2, 1,-2))(mapper)
res15: Seq[Int] = List(2, -2, -2)
Remark: About complexity
The algorithm I present above has a complexity of O(N) for a list with N elements. However, the zip-based variant of your approach is also O(N):
the operation of mapping all elements is of complexity O(N)
the operation of computing the max is of complexity O(N)
the operation of zipping is of complexity O(N)
the operation of filtering the list according to the max is also of complexity O(N)
the final extraction of the original elements is of complexity O(M), with M the number of maximal elements
So the algorithm you presented in your question has the same complexity as my answer's, and moreover your solution is clearer than mine. So, even if foldLeft is more powerful, for this operation I would recommend your idea, but zipping the original list and computing the map only once (especially if your mapping function is more complicated than a simple square). Here is that solution, worked out with the help of *scala_newbie* in the question/chat/comments.
def getMaxElems2(s: Seq[Int])(mapper: Int => Int): Seq[Int] = {
  val mappedS = s.map(mapper) // map done only once
  val m = mappedS.max         // find the max
  s.zip(mappedS).filter(_._2 == m).unzip._1
}
// test with your list
scala> getMaxElems2(s)(mapper)
res16: Seq[Int] = List(-2, -2, 2, 2)
//test with a list containing also non maximal elements
scala> getMaxElems2(Seq(-1, 2,0, -2, 1,-2))(mapper)
res17: Seq[Int] = List(2, -2, -2)
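Another single-pass-over-the-mapping option, not from the answer above but a sketch of my own: group by the mapped value and take the group with the largest key.

def getMaxElems3(s: Seq[Int])(mapper: Int => Int): Seq[Int] =
  if (s.isEmpty) Seq.empty
  else s.groupBy(mapper).maxBy(_._1)._2 // groups preserve the original element order

getMaxElems3(Seq(-1, 2, 0, -2, 1, -2))(mapper)
// List(2, -2, -2)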