I have an Iterator[Record] which is ordered on record.id this way:
record.id=1
record.id=1
...
record.id=1
record.id=2
record.id=2
...
record.id=2
Records of a specific ID could occur a large number of times, so I want to write a function that takes this iterator as input, and returns an Iterator[Iterator[Record]] output in a lazy manner.
I was able to come up with the following, but it fails with a StackOverflowError after 500K records or so:
def groupByIter[T, B](iterO: Iterator[T])(func: T => B): Iterator[Iterator[T]] = new Iterator[Iterator[T]] {
  var iter = iterO

  def hasNext = iter.hasNext

  def next() = {
    val first = iter.next()
    val firstValue = func(first)
    val (i1, i2) = iter.span(el => func(el) == firstValue)
    iter = i2
    Iterator(first) ++ i1
  }
}
What am I doing wrong?
The trouble is that each Iterator.span call stacks another closure around the trailing iterator, and without any trampolining it's very easy to overflow.
Actually, I don't think there is an implementation that avoids memoizing the elements of the prefix iterator, since the trailing iterator could be accessed before the prefix is drained.
Even the .span implementation uses a Queue to memoize elements in its Leading definition.
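For illustration, here is a minimal sketch (my code, not from the original post) of how those wrappers pile up: after k span calls the trailing iterator delegates through k layers, so a single hasNext has to recurse that deep.

var rest: Iterator[Int] = Iterator.fill(1000000)(0)
for (_ <- 1 to 100000)
  rest = rest.span(_ => false)._2 // each ._2 wraps the previous `rest` in another iterator
rest.hasNext // may throw StackOverflowError once the delegation chain is deep enough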
So the easiest implementation I can imagine is the following, via Stream:
implicit class StreamChopOps[T](xs: Stream[T]) {
  def chopBy[U](f: T => U): Stream[Stream[T]] = xs match {
    case x #:: _ =>
      def eq(e: T) = f(e) == f(x)
      xs.takeWhile(eq) #:: xs.dropWhile(eq).chopBy(f)
    case _ => Stream.empty
  }
}
It may not be the most performant approach, as it memoizes a lot, but with proper iteration the GC should handle the problem of excess intermediate streams.
You could use it as myIterator.toStream.chopBy(f)
A simple check validates that the following code runs without a StackOverflowError:
Iterator.fill(10000000)(Iterator(1, 1, 2)).flatten // 1,1,2,1,1,2,...
  .toStream.chopBy(identity)                       // (1,1),(2),(1,1),(2),...
  .map(xs => xs.sum * xs.size).sum                 // 60000000
Inspired by the chopBy implemented by @Odomontois, here is a chopBy I implemented for Iterator. Of course, each chunk must fit in the allocated memory. It doesn't look very elegant, but it seems to work :)
import scala.annotation.tailrec

implicit class IteratorChopOps[A](toChopIter: Iterator[A]) {
  def chopBy[U](f: A => U) = new Iterator[Traversable[A]] {
    var next_el: Option[A] = None // look-ahead: first element of the next chunk

    @tailrec
    private def accum(acc: List[A]): List[A] = {
      next_el = None
      val new_acc = hasNext match {
        case true =>
          val next = toChopIter.next()
          acc match {
            case Nil =>
              acc :+ next
            case _ MatchTail t if f(t) == f(next) =>
              acc :+ next
            case _ => // next belongs to the following chunk: stash it
              next_el = Some(next)
              acc
          }
        case false =>
          next_el = None
          return acc
      }
      next_el match {
        case Some(_) => new_acc
        case None => accum(new_acc)
      }
    }

    def hasNext = {
      toChopIter.hasNext || next_el.isDefined
    }

    def next: Traversable[A] = accum(next_el.toList)
  }
}
And here is an extractor for matching tail:
object MatchTail {
  def unapply[A](l: Traversable[A]) = Some((l.init, l.last))
}
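A quick usage sketch (my example, assuming the implicit class and the MatchTail extractor above are in scope):

Iterator(1, 1, 2, 2, 2, 3).chopBy(identity)
  .map(_.toList).toList // List(List(1, 1), List(2, 2, 2), List(3))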
Related
I am using a library that provides a Traversable[T] that pages through database results. I'd like to avoid loading the whole thing into memory, so I am trying to convert it to a Stream[T].
From what I can tell, the built-in "asStream" method loads the whole Traversable into a Buffer, which defeats my purpose. My attempt (below) hits a StackOverflowError on large results, and I can't tell why. Can someone help me understand what is going on? Thanks!
def asStream[T](traversable: => Traversable[T]): Stream[T] = {
  if (traversable.isEmpty) Empty
  else {
    lazy val head = traversable.head
    lazy val tail = asStream(traversable.tail)
    head #:: tail
  }
}
Here's a complete example that reproduces this, based on a suggestion by @SCouto:
import scala.collection.immutable.Stream.Empty

object StreamTest {
  def main(args: Array[String]) = {
    val bigVector = Vector.fill(90000)(1)
    val optionStream = asStream(bigVector).map(v => Some(v))
    val zipped = optionStream.zipAll(optionStream.tail, None, None)
  }

  def asStream[T](traversable: => Traversable[T]): Stream[T] = {
    @annotation.tailrec
    def loop(processed: => Stream[T], pending: => Traversable[T]): Stream[T] = {
      if (pending.isEmpty) processed
      else {
        lazy val head = pending.head
        lazy val tail = pending.tail
        loop(processed :+ head, tail)
      }
    }
    loop(Empty, traversable)
  }
}
Edit: After some interesting ideas from @SCouto, I learned this could also be done with trampolines to keep the result as a Stream[T] that is in the original order:
object StreamTest {
  def main(args: Array[String]) = {
    val bigVector = Range(1, 90000).toVector
    val optionStream = asStream(bigVector).map(v => Some(v))
    val zipped = optionStream.zipAll(optionStream.tail, None, None)
    zipped.take(10).foreach(println)
  }

  def asStream[T](traversable: => Traversable[T]): Stream[T] = {
    sealed trait Traversal[+R]
    case class More[+R](result: R, next: () => Traversal[R]) extends Traversal[R]
    case object Done extends Traversal[Nothing]

    def next(currentTraversable: Traversable[T]): Traversal[T] = {
      if (currentTraversable.isEmpty) Done
      else More(currentTraversable.head, () => next(currentTraversable.tail))
    }

    def trampoline[R](body: => Traversal[R]): Stream[R] = {
      def loop(thunk: () => Traversal[R]): Stream[R] = {
        thunk.apply match {
          case More(result, next) => Stream.cons(result, loop(next))
          case Done => Stream.empty
        }
      }
      loop(() => body)
    }

    trampoline(next(traversable))
  }
}
Try this:
def asStream[T](traversable: => Traversable[T]): Stream[T] = {
  @annotation.tailrec
  def loop(processed: Stream[T], pending: Traversable[T]): Stream[T] = {
    if (pending.isEmpty) processed
    else {
      lazy val head = pending.head
      lazy val tail = pending.tail
      loop(head #:: processed, tail)
    }
  }
  loop(Empty, traversable)
}
The main point is to ensure that your recursive call is the last action of your recursive function.
To ensure this, you can use both a nested method (called loop in the example) and the @tailrec annotation, which verifies at compile time that your method is tail-recursive.
You can find info about tail recursion here and in this awesome answer here.
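For a concrete illustration (my example, not from the original answer), @tailrec makes the compiler reject any recursive call that is not the last action:

import scala.annotation.tailrec

@tailrec
def length[A](xs: List[A], acc: Int = 0): Int = xs match {
  case Nil => acc
  case _ :: tail => length(tail, acc + 1) // call in tail position: compiles
}

// This variant would NOT compile with @tailrec, because of the pending `1 +`:
// @tailrec def badLength[A](xs: List[A]): Int = xs match {
//   case Nil => 0
//   case _ :: tail => 1 + badLength(tail)
// }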
EDIT
The problem was that we were adding the element at the end of the Stream. If you add it as the head of the Stream, as in your original example, it will work fine (note that the accumulated Stream then comes out in reverse order). I updated my code. Please test it and let us know the result.
My tests:
scala> val optionStream = asStream(Vector.fill(90000)(1)).map(v => Some(v))
optionStream: scala.collection.immutable.Stream[Some[Int]] = Stream(Some(1), ?)
scala> val zipped = optionStream.zipAll(optionStream.tail, None, None)
zipped: scala.collection.immutable.Stream[(Option[Int], Option[Int])] = Stream((Some(1),Some(1)), ?)
EDIT2:
According to your comments, and considering the fpinscala example you mentioned, I think this may help you. The point is to create a case class structure with lazy evaluation, where the head is a single element and the tail is a Traversable.
sealed trait myStream[+T] {
  def head: Option[T] = this match {
    case MyEmpty => None
    case MyCons(h, _) => Some(h())
  }

  def tail: myStream[T] = this match {
    case MyEmpty => MyEmpty
    case MyCons(_, t) => myStream.cons(t().head, t().tail)
  }
}

case object MyEmpty extends myStream[Nothing]
case class MyCons[+T](h: () => T, t: () => Traversable[T]) extends myStream[T]

object myStream {
  def cons[T](hd: => T, tl: => Traversable[T]): myStream[T] = {
    lazy val head = hd
    lazy val tail = tl
    MyCons(() => head, () => tail)
  }

  def empty[T]: myStream[T] = MyEmpty

  def apply[T](as: T*): myStream[T] = {
    if (as.isEmpty) empty
    else cons(as.head, as.tail)
  }
}
Some Quick tests:
val bigVector = Vector.fill(90000)(1)
myStream.cons(bigVector.head, bigVector.tail)
res2: myStream[Int] = MyCons(<function0>,<function0>)
Retrieving head:
res2.head
res3: Option[Int] = Some(1)
And the tail:
res2.tail
res4: myStream[Int] = MyCons(<function0>,<function0>)
EDIT3
The trampoline solution by the OP:
def asStream[T](traversable: => Traversable[T]): Stream[T] = {
  sealed trait Traversal[+R]
  case class More[+R](result: R, next: () => Traversal[R]) extends Traversal[R]
  case object Done extends Traversal[Nothing]

  def next(currentTraversable: Traversable[T]): Traversal[T] = {
    if (currentTraversable.isEmpty) Done
    else More(currentTraversable.head, () => next(currentTraversable.tail))
  }

  def trampoline[R](body: => Traversal[R]): Stream[R] = {
    def loop(thunk: () => Traversal[R]): Stream[R] = {
      thunk.apply match {
        case More(result, next) => Stream.cons(result, loop(next))
        case Done => Stream.empty
      }
    }
    loop(() => body)
  }

  trampoline(next(traversable))
}
Stream doesn't keep the data in memory because you declare how to generate each item. Your database data is most likely not procedurally generated, so what you need is to fetch the data the first time you ask for it (something like def getData(index: Int): Future[Data]).
The biggest problem arises because, fetching data from a database, you are probably working with Futures. So even if you manage to achieve this, you end up with a Future[Stream[Data]], which is not that nice to use, or, much worse, you have to block on it.
Wouldn't it be more worthwhile simply to paginate your database query?
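For what it's worth, a minimal sketch of that pagination idea (my code; getPage is a hypothetical accessor that fetches one page of results and returns an empty Seq past the end):

def paged[T](getPage: Int => Seq[T], i: Int = 0): Stream[T] = {
  val page = getPage(i) // fetched only when this part of the stream is forced
  if (page.isEmpty) Stream.empty
  else page.toStream #::: paged(getPage, i + 1) // #::: concatenates lazily
}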
How do I rewrite the following loop (pattern) into Scala, either using built-in higher order functions or tail recursion?
This is an example of an iteration pattern where you do a computation (a comparison, for example) on two list elements, but only if the second one comes after the first in the original input. Note that a +1 step is used here, but in general it could be +n.
public List<U> mapNext(List<T> list) {
    List<U> results = new ArrayList<>();
    for (int i = 0; i < list.size() - 1; i++) {
        for (int j = i + 1; j < list.size(); j++) {
            results.add(doSomething(list.get(i), list.get(j)));
        }
    }
    return results;
}
So far, I've come up with this in Scala:
def mapNext[T, U](list: List[T])(f: (T, T) => U): List[U] = {
  @scala.annotation.tailrec
  def loop(ix: List[T], jx: List[T], res: List[U]): List[U] = (ix, jx) match {
    case (_ :: _ :: is, Nil) => loop(ix, ix.tail, res)
    case (i :: _ :: is, j :: Nil) => loop(ix.tail, Nil, f(i, j) :: res)
    case (i :: _ :: is, j :: js) => loop(ix, js, f(i, j) :: res)
    case _ => res
  }
  loop(list, Nil, Nil).reverse
}
Edit:
To all contributors: I only wish I could accept every answer as the solution :)
Here's my stab. I think it's pretty readable. The intuition is: for each head of the list, apply the function to the head and every other member of the tail. Then recurse on the tail of the list.
def mapNext[U, T](list: List[U], fun: (U, U) => T): List[T] = list match {
  case Nil => Nil
  case (first :: Nil) => Nil
  case (first :: rest) => rest.map(fun(first, _: U)) ++ mapNext(rest, fun)
}
Here's a sample run
scala> mapNext(List(1, 2, 3, 4), (x: Int, y: Int) => x + y)
res6: List[Int] = List(3, 4, 5, 5, 6, 7)
This one isn't explicitly tail recursive, but an accumulator could easily be added to make it so (see the sketch below).
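For completeness, a sketch of that accumulator variant (my code, not part of the original answer):

import scala.annotation.tailrec

def mapNextAcc[U, T](list: List[U], fun: (U, U) => T): List[T] = {
  @tailrec
  def loop(rem: List[U], acc: List[T]): List[T] = rem match {
    case first :: rest => loop(rest, acc ++ rest.map(fun(first, _)))
    case _ => acc
  }
  loop(list, Nil)
}

// mapNextAcc(List(1, 2, 3, 4), (x: Int, y: Int) => x + y) == List(3, 4, 5, 5, 6, 7)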
Recursion is certainly an option, but the standard library offers some alternatives that will achieve the same iteration pattern.
Here's a very simple setup for demonstration purposes.
val lst = List("a","b","c","d")
def doSomething(a:String, b:String) = a+b
And here's one way to get at what we're after.
val resA = lst.tails.toList.init.flatMap(tl => tl.tail.map(doSomething(tl.head, _)))
// resA: List[String] = List(ab, ac, ad, bc, bd, cd)
This works but the fact that there's a map() within a flatMap() suggests that a for comprehension might be used to pretty it up.
val resB = for {
  tl <- lst.tails
  if tl.nonEmpty
  h = tl.head
  x <- tl.tail
} yield doSomething(h, x)
// resB: Iterator[String] = non-empty iterator

resB.toList // List(ab, ac, ad, bc, bd, cd)
In both cases the toList conversion is used to get us back to the original collection type, which might not actually be necessary depending on what further processing of the collection is required.
Comeback Attempt:
After deleting my first attempt at an answer, I put some more thought into it and came up with another, at least shorter, solution.
import scala.annotation.tailrec

def mapNext[T, U](list: List[T])(f: (T, T) => U): List[U] = {
  @tailrec
  def loop(in: List[T], out: List[U]): List[U] = in match {
    case Nil => out
    case head :: tail => loop(tail, out ::: tail.map { f(head, _) })
  }
  loop(list, Nil)
}
I would also like to recommend the enrich-my-library pattern for adding the mapNext function to the List API (or, with some adjustments, to any other collection).
import scala.annotation.tailrec

object collection {
  object Implicits {
    implicit class RichList[A](private val underlying: List[A]) extends AnyVal {
      def mapNext[U](f: (A, A) => U): List[U] = {
        @tailrec
        def loop(in: List[A], out: List[U]): List[U] = in match {
          case Nil => out
          case head :: tail => loop(tail, out ::: tail.map { f(head, _) })
        }
        loop(underlying, Nil)
      }
    }
  }
}
Then you can use the function like:
list.mapNext(doSomething)
Again, there is a downside, as concatenating lists is relatively expensive.
However, variable assignments inside for comprehensions can be quite inefficient, too (as the Dotty improvement task "Scala Wart: Convoluted de-sugaring of for-comprehensions" suggests).
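For what it's worth, here is a sketch (mine, not from the answer) that sidesteps the concatenation cost by prepending and reversing once at the end:

import scala.annotation.tailrec

def mapNextFast[A, U](list: List[A])(f: (A, A) => U): List[U] = {
  @tailrec
  def loop(in: List[A], out: List[U]): List[U] = in match {
    case Nil => out.reverse // one O(n) reversal instead of repeated O(n) appends
    case head :: tail => loop(tail, tail.foldLeft(out)((acc, b) => f(head, b) :: acc))
  }
  loop(list, Nil)
}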
UPDATE
Now that I'm into this, I simply cannot let go :(
Concerning 'Note that the +1 step is used here, but in general, it could be +n.'
I extended my proposal with some parameters to cover more situations:
import scala.annotation.tailrec

object collection {
  object Implicits {
    implicit class RichList[A](private val underlying: List[A]) extends AnyVal {
      def mapNext[U](f: (A, A) => U): List[U] = {
        @tailrec
        def loop(in: List[A], out: List[U]): List[U] = in match {
          case Nil => out
          case head :: tail => loop(tail, out ::: tail.map { f(head, _) })
        }
        loop(underlying, Nil)
      }

      def mapEvery[U](step: Int)(f: A => U): List[U] = {
        @tailrec
        def loop(in: List[A], out: List[U]): List[U] = in match {
          case Nil => out.reverse
          case head :: tail => loop(tail.drop(step), f(head) :: out)
        }
        loop(underlying, Nil)
      }

      def mapDrop[U](drop1: Int, drop2: Int, step: Int)(f: (A, A) => U): List[U] = {
        @tailrec
        def loop(in: List[A], out: List[U]): List[U] = in match {
          case Nil => out
          case head :: tail =>
            loop(tail.drop(drop1), out ::: tail.drop(drop2).mapEvery(step) { f(head, _) })
        }
        loop(underlying, Nil)
      }
    }
  }
}
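Some quick checks of the extended API (my examples, assuming the object above is in scope):

import collection.Implicits._

List(1, 2, 3, 4).mapNext(_ + _)          // List(3, 4, 5, 5, 6, 7)
List(1, 2, 3, 4, 5).mapEvery(1)(_ * 2)   // List(2, 6, 10): heads at positions 0, 2, 4
List(1, 2, 3, 4).mapDrop(0, 0, 0)(_ + _) // behaves like mapNext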
list                   // [a, b, c, d, ...]
  .indices             // [0, 1, 2, 3, ...]
  .flatMap { i =>
    val elem = list(i) // Don't redo the list access on every iteration of the map below.
    list.drop(i + 1)   // Take only the inputs that come after the one we're working on.
      .map(doSomething(elem, _))
  }
// Or with a monad comprehension
for {
  index <- list.indices
  thisElem = list(index)
  thatElem <- list.drop(index + 1)
} yield doSomething(thisElem, thatElem)
You start, not with the list, but with its indices. Then you use flatMap, because each index maps to a list of elements. Use drop to take only the elements after the one we're working on, and map that list to actually run the computation. Note that this has terrible time complexity, because most operations here (indices/length, flatMap, map) are O(n) in the list size, and drop and apply are O(n) in their argument.
You can get better performance if you a) stop using a linked list (List is good for LIFO, sequential access, but Vector is better in the general case), and b) make this a tiny bit uglier:
val len = vector.length
(0 until len)
  .flatMap { thisIdx =>
    val thisElem = vector(thisIdx)
    ((thisIdx + 1) until len)
      .map { thatIdx =>
        doSomething(thisElem, vector(thatIdx))
      }
  }
// Or
val len = vector.length
for {
  thisIdx <- 0 until len
  thisElem = vector(thisIdx)
  thatIdx <- (thisIdx + 1) until len
  thatElem = vector(thatIdx)
} yield doSomething(thisElem, thatElem)
If you really need to, you can generalize either version of this code to all IndexedSeqs, by using some implicit CanBuildFrom parameters, but I won't cover that.
I have this code, and I want the stream to stop iterating while also giving me the accumulated result. Basically, the iteration stops based on the errorLimit number:
sealed trait Ele
case class FailureEle() extends Ele
case class SuccessEle() extends Ele

type EitherResult = Either[IndexedSeq[Ele], Seq[FailureEle]]

def parse(process: Process[Task, Ele], errorLimit: Int): EitherResult = {
  val errorAccumulator = new ListBuffer[FailureEle]
  val taskProcess = process.map { t =>
    t match {
      case x: FailureEle => errorAccumulator += x
      case _ =>
    }
    t
  }.takeWhile(_ => !(errorAccumulator.size == errorLimit))

  val voSeq = taskProcess.runLog.run
  if (errorAccumulator.isEmpty) {
    Left(voSeq)
  } else {
    Right(errorAccumulator)
  }
}
val result = Seq(FailureEle(), SuccessEle(), FailureEle(), SuccessEle(), SuccessEle(), FailureEle(), SuccessEle())
val adaptor = new SeqAdaptor[Ele](result)
val process: Process[Task, Ele] =
  Process.repeatEval(Task { adaptor.next() }).takeWhile(t => !t.shouldStop).map(_.get)

parse(process, 1).isRight // no SuccessEle will be iterated
parse(process, 2).isRight // only one SuccessEle will be iterated
parse(process, 3).isRight // the last SuccessEle will not be iterated
It is working, but there are several issues I would like to address by refactoring the parse method to be more functional:
ListBuffer is an imperative approach
the takeWhile condition has no logic to check the current element; it still relies on the ListBuffer's contents
So I wonder: is there a tail-recursive way to replace the imperative ListBuffer approach?
scan may not be the best fit, but it works:
sealed trait Ele
case class FailureEle(e: Throwable) extends Ele
case class SuccessEle(r: String) extends Ele

def parse(p: Process[Task, Ele], error: Int): Process[Task, (Seq[SuccessEle], Seq[FailureEle])] = {
  p.scan(Seq[SuccessEle]() -> Seq[FailureEle]()) { (r, e) =>
    val (s, f) = r
    e match {
      case fail: FailureEle => s -> (f :+ fail)
      case succ: SuccessEle => (s :+ succ) -> f
    }
  }.dropWhile { case (succ, fail) => fail.size < error }.take(1)
}
def test() {
  def randomFail = {
    val nInt = scala.util.Random.nextInt()
    println("getting " + nInt)
    if (nInt % 5 == 0) FailureEle(new Exception("fooo"))
    else SuccessEle(nInt.toString)
  }

  val infinite = Process.repeatEval(Task.delay(randomFail))
  val r = parse(infinite, 3).runLast.run
  println(r)
}
Per the title, there are a couple of reasonable and idiomatic ways that I know of to return the first successful computation, though I'm most interested here in how to handle the case where we want to know the specific failure of the last attempt when all attempts fail. As a first attempt, we can use collectFirst and do something like the following:
import scala.util.{Failure, Success, Try}

def main(args: Array[String]) {
  val xs = (1 to 5)
  def check(i: Int): Try[Int] = {
    println(s"checking: $i")
    Try(if (i < 3) throw new RuntimeException(s"small: $i") else i)
  }
  val z = xs.iterator.map(check).collectFirst { case s @ Success(_) => s }
  println(s"final val: $z")
}
This seems like a reasonable solution if we don't care about the failures (actually, since we're always returning a success, we never return a Failure, only a None in the case where there is no successful computation).
On the other hand, to handle the case when all attempts fail, we can capture the last failure by using the following:
def main2(args: Array[String]) {
  val xs = (1 to 5)
  def check(i: Int): Try[Int] = {
    println(s"checking: $i")
    Try(if (i < 3) throw new RuntimeException(s"small: $i") else i)
  }
  val empty: Try[Int] = Failure(new RuntimeException("empty"))
  val z = xs.foldLeft(empty)((e, i) => e.recoverWith { case _ => check(i) })
  println(s"final val: $z")
}
The disadvantages here are that you create a "fake" Throwable representing empty, and that we iterate over the whole list even if we succeed early on, with the later iterations being essentially no-ops.
Is there a better way to implement main2 that is idiomatic and doesn't suffer from the aforementioned disadvantages?
You could do something like this:
import scala.annotation.tailrec
import scala.util.{Failure, Try}

@tailrec
def collectFirstOrFailure[T](l: List[T], f: T => Try[T]): Try[T] = {
  l match {
    case h :: Nil => f(h)
    case h :: t =>
      // f(h) orElse collectFirstOrFailure(t, f) // wish I could do this, but it's not tailrec-approved!
      val res = f(h)
      if (res.isFailure) collectFirstOrFailure(t, f)
      else res
    case Nil => Failure(new RuntimeException("empty"))
  }
}
val y = collectFirstOrFailure(xs.toList, check)
println(s"final val: $y")
This isn't very pretty, and we do still have to handle the empty-list case, but we're not creating a new Failure(new RuntimeException("empty")) on every run (unless the list is empty), and we stop short if there's a success. I feel like Scalaz has some better way to do this, but I can't figure it out right now. The requirement to return the last failure is what makes this a bit complex.
UPDATE
There's always iterator...
def collectFirstOrFailureI[T](i: Iterator[T], f: T => Try[T]): Try[T] = {
  while (i.hasNext) {
    val res = f(i.next())
    if (res.isSuccess || !i.hasNext) {
      return res
    }
  }
  Failure(new RuntimeException("empty"))
}

val x = collectFirstOrFailureI(xs.iterator, check)
println(s"final val: $x")
There's a previous answer:
https://stackoverflow.com/a/20665337/1296806
with the caveat that your question asks for the last failure, if all have failed.
I guess that's why this isn't a duplicate?
That's trivial to add to the code from that answer:
def bad(f: Failure[_]) = if (count.decrementAndGet == 0) { p tryComplete new Failure(new RuntimeException("All bad", f.exception)) }
or more simply
p tryComplete f
I see that the Scala standard library is missing a method to get the ranges of objects in a collection that satisfy a predicate:
def <???>(p: A => Boolean): List[List[A]] = {
  val buf = collection.mutable.ListBuffer[List[A]]()
  var elems = this.dropWhile(e => !p(e))
  while (elems.nonEmpty) {
    buf += elems.takeWhile(p)
    elems = elems.dropWhile(p).dropWhile(e => !p(e))
  }
  buf.toList
}
What would be a good name for such a method? And is my implementation good enough?
I'd go for chunkWith or chunkBy
As for your implementation, I think this cries out for recursion! See if you can fill out this:
import scala.annotation.tailrec

@tailrec
def chunkBy[A](l: List[A], acc: List[List[A]] = Nil)(p: A => Boolean): List[List[A]] = l match {
  case Nil => acc
  case l =>
    val next = l dropWhile !p
    val (chunk, rest) = next span p
    chunkBy(rest, chunk :: acc)(p)
}
Why recursion? It's much easier to understand the algorithm and more likely to be bug-free (given the absence of vars).
The syntax !p for the negation of a predicate is achieved via an implicit conversion:
implicit def PredicateW[A](p: A => Boolean) = new {
  def unary_! : A => Boolean = a => !p(a)
}
I generally keep this around as it's astoundingly useful
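A quick hedged check (my example; the structural type above needs scala.language.reflectiveCalls):

import scala.language.reflectiveCalls

val even: Int => Boolean = _ % 2 == 0
List(1, 2, 3, 4).filter(!even) // List(1, 3)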
How about:
def chunkBy[K](f: A => K): Map[K, List[List[A]]] = ...
Similar to groupBy but keeps contiguous chunks as chunks.
Using this, you can do xs.chunkBy(p)(true) to get what you want.
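The body is left elided above; here is one hedged sketch of how that signature could behave (my code, written as a standalone function rather than a collection method):

import scala.annotation.tailrec

def chunkBy[A, K](xs: List[A])(f: A => K): Map[K, List[List[A]]] = {
  @tailrec
  def loop(rest: List[A], acc: Map[K, List[List[A]]]): Map[K, List[List[A]]] = rest match {
    case Nil => acc.map { case (k, chunks) => k -> chunks.reverse } // chunks were prepended
    case h :: _ =>
      val k = f(h)
      val (chunk, tail) = rest.span(f(_) == k) // take the contiguous run with the same key
      loop(tail, acc.updated(k, chunk :: acc.getOrElse(k, Nil)))
  }
  loop(xs, Map.empty)
}

// chunkBy(List(1, 2, 2, 3, 4, 4))(_ % 2 == 0)(true) == List(List(2, 2), List(4, 4))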
You probably want to call it splitWith because split is the string operation that more-or-less does that, and it's similar to splitAt.
Incidentally, here's a very compact implementation (though it does a lot of unnecessary work, so it's not a good implementation for speed; yours is fine for that):
def splitWith[A](xs: List[A])(p: A => Boolean) = {
  (xs zip xs.scanLeft(1){ (i, x) => if (p(x) == ((i & 1) == 1)) i + 1 else i }.tail)
    .filter(_._2 % 2 == 0).groupBy(_._2).toList.sortBy(_._1).map(_._2.map(_._1))
}
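A quick check of its behavior (my example):

splitWith(List(1, 2, 2, 3, 4, 4))(_ % 2 == 0) // List(List(2, 2), List(4, 4))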
Just a little refinement of oxbow's code; this way the signature is lighter:
import scala.annotation.tailrec

def chunkBy[A](xs: List[A])(p: A => Boolean): List[List[A]] = {
  @tailrec
  def recurse(todo: List[A], acc: List[List[A]]): List[List[A]] = todo match {
    case Nil => acc
    case _ =>
      val next = todo dropWhile (!p(_))
      val (chunk, rest) = next span p
      if (chunk.isEmpty) acc // guard against appending an empty trailing chunk
      else recurse(rest, acc ::: List(chunk))
  }
  recurse(xs, Nil)
}
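And a quick check of this version (my example):

chunkBy(List(1, 2, 2, 3, 4, 4))(_ % 2 == 0) // List(List(2, 2), List(4, 4))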