Scala - merging multiple iterators

I have multiple iterators which return items in a sorted manner according to some sorting criterion. Now, I would like to merge (multiplex) the iterators into one, combined iterator. I know how to do it in Java style, with e.g. tree-map, but I was wondering if there is a more functional approach? I want to preserve the laziness of the iterators as much as possible.

You can just do:
val it = iter1 ++ iter2
It creates another iterator and does not evaluate the elements, but wraps the two existing iterators.
It is fully lazy, so you are not supposed to use iter1 or iter2 once you do this.
In general, if you have more iterators to merge, you can use folding:
val iterators: Seq[Iterator[T]] = ???
val it = iterators.foldLeft(Iterator[T]())(_ ++ _)
If you have some ordering on the elements that you would like to maintain in the resulting iterator, but you still want laziness, you can convert them to streams:
def merge[T: Ordering](iter1: Iterator[T], iter2: Iterator[T]): Iterator[T] = {
  // the context bound alone does not provide `<`, so compare through the Ordering instance
  val ord = implicitly[Ordering[T]]
  val s1 = iter1.toStream
  val s2 = iter2.toStream
  def mergeStreams(s1: Stream[T], s2: Stream[T]): Stream[T] = {
    if (s1.isEmpty) s2
    else if (s2.isEmpty) s1
    else if (ord.lt(s1.head, s2.head)) s1.head #:: mergeStreams(s1.tail, s2)
    else s2.head #:: mergeStreams(s1, s2.tail)
  }
  mergeStreams(s1, s2).iterator
}
This is not necessarily faster, though; you should microbenchmark it.
A possible alternative is to use buffered iterators to achieve the same effect.

Like @axel22 mentioned, you can do this with BufferedIterators. Here's one Stream-free solution:
def combine[T](rawIterators: List[Iterator[T]])(implicit cmp: Ordering[T]): Iterator[T] = {
  new Iterator[T] {
    private val iterators: List[BufferedIterator[T]] = rawIterators.map(_.buffered)
    def hasNext: Boolean = iterators.exists(_.hasNext)
    def next(): T = if (hasNext) {
      // pick the iterator whose buffered head is smallest and advance only that one
      iterators.filter(_.hasNext).map(x => (x.head, x)).minBy(_._1)(cmp)._2.next()
    } else {
      throw new NoSuchElementException("Cannot call next on an exhausted iterator!")
    }
  }
}
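The linear scan over all iterators on every call to next() could be replaced by a heap keyed on each iterator's buffered head. The following is only a sketch of that idea, not a benchmarked implementation; the name mergeSorted is hypothetical, and it assumes every input iterator is already sorted by the given Ordering. Note that building the heap peeks at the first element of each input.

```scala
import scala.collection.mutable

// Hypothetical helper: lazily merges already-sorted iterators with a heap.
def mergeSorted[T](iterators: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] =
  new Iterator[T] {
    // PriorityQueue dequeues the largest element, so reverse the ordering
    // to get the iterator with the smallest head first.
    private val heap = mutable.PriorityQueue.empty[BufferedIterator[T]](
      Ordering.by[BufferedIterator[T], T](_.head)(ord).reverse
    )
    // Only non-empty iterators ever live in the heap.
    iterators.iterator.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))

    def hasNext: Boolean = heap.nonEmpty

    def next(): T = {
      if (heap.isEmpty) throw new NoSuchElementException("next on empty iterator")
      val it = heap.dequeue()
      val value = it.next()
      if (it.hasNext) heap.enqueue(it) // re-insert under its new head
      value
    }
  }

val mergedExample = mergeSorted(Seq(Iterator(1, 4, 7), Iterator(2, 5, 8), Iterator(3, 6, 9))).toList
```

With k input iterators, each next() then costs O(log k) comparisons instead of O(k).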

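You could try the following, although note that sorted has to force the whole stream before it can return the first element, so laziness is lost: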
(iterA ++ iterB).toStream.sorted.toIterator
For example:
val i1 = (1 to 100 by 3).toIterator
val i2 = (2 to 100 by 3).toIterator
val i3 = (3 to 100 by 3).toIterator
val merged = (i1 ++ i2 ++ i3).toStream.sorted.toIterator
merged.next // results in: 1
merged.next // results in: 2
merged.next // results in: 3

Related

What is the best way to merge two Future[Map[T1, T2]] in Scala

I have a list of fileNames and I want to load the corresponding pages in batches (and not all at once). To do so, I'm using foldLeft and writing an aggregate function which accumulates a Future[Map[T1, T2]].
def loadPagesInBatches[T1, T2](fileNames: Set[FileName]): Future[Map[T1, T2]] = {
val fileNameToPageId: Map[FileName, PageId] = ... //invokes a function that returns the pageId correlated to the fileName.
val batches: Iterator[Set[FileName]] = fileNames.grouped(10) //batches of 10;
batches.foldLeft(Future(Map.empty[T1, T2]))(aggregate(fileNameToPageId))
}
And the signature of aggregate is as follows:
def aggregate(fileNameToPageId: Map[FileName, PageId]): (Future[Map[T1, T2]], Set[FileName]) => Future[Map[T1, T2]] = {..}
I'm trying to find out the best way to merge these Future[Map]s.
Thanks in advance!
P.S.: FileName and PageId are just type aliases for String.
In case you have exactly 2 futures, zipWith would probably be the most idiomatic.
val future1 = ???
val future2 = ???
future1.zipWith(future2)(_ ++ _)
Which is a shorter way of writing a for comprehension:
for {
map1 <- future1
map2 <- future2
} yield map1 ++ map2
Although zipWith could potentially implement some kind of optimization.
My solution was putting the two maps into a list and using Future.reduceLeft.
def aggregate(fileNameToPageId: Map[FileName, PageId]): (Future[Map[T1, T2]], Set[FileName]) => Future[Map[T1, T2]] = {
case (all, filesBatch) =>
val mapOfPages: Future[Map[NodeId, T]] = for {
... //Some logic
} yield "TheBatchMap"
Future.reduceLeft(List(all, mapOfPages))(_ ++ _)
}
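For the batching use case, a minimal sketch of how the fold and the merge might fit together is shown below. The names loadBatch and loadInBatches are hypothetical stand-ins (the real page-loading logic is elided in the question), and the key and value types are fixed to String and Int purely for illustration.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for the real page loader: maps each name to its length.
def loadBatch(batch: Set[String]): Future[Map[String, Int]] =
  Future(batch.map(name => name -> name.length).toMap)

def loadInBatches(fileNames: Set[String], batchSize: Int): Future[Map[String, Int]] =
  fileNames.grouped(batchSize).foldLeft(Future.successful(Map.empty[String, Int])) {
    (acc, batch) =>
      // loadBatch is only called inside flatMap, so a batch does not start
      // until the accumulated future has completed.
      acc.flatMap(all => loadBatch(batch).map(all ++ _))
  }

val result = Await.result(loadInBatches(Set("a", "bb", "ccc"), 2), 5.seconds)
```

Because the next batch is started inside flatMap, the batches run one after another; zipWith or Future.reduceLeft over futures that were already created would instead let them run concurrently.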

Scala streams tailrecursive construction

Is there a benefit to building a Stream in a tail-recursive way with an accumulator? The @tailrec annotation will turn a recursion into a loop, but the loop would be evaluated strictly.
I can figure out a simple way to add at the head of the stream in a tail-recursive way
@tailrec
def toStream[A, B](current: A, f: A => B, next: A => A, previous: Stream[B]): Stream[B] = {
  // exitCondition is assumed to be defined elsewhere
  if (exitCondition(current))
    previous
  else {
    val newItem = f(current)
    val newCurrent = next(current)
    toStream(newCurrent, f, next, Stream.cons(newItem, previous))
  }
}
And how to add at the end (building a real lazy stream) without tail-recursion
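def toStream[A, B](current: A, f: A => B, next: A => A): Stream[B] = {
  // exitCondition is assumed to be defined elsewhere
  if (exitCondition(current))
    Stream.empty[B]
  else {
    val newItem = f(current)
    val newCurrent = next(current)
    // the tail is passed by name, so the recursive call is deferred until the tail is forced
    Stream.cons(newItem, toStream(newCurrent, f, next))
  }
}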
How would you code the duals of these functions:
Adding at the head, without tail recursion?
Adding at the end, with tail recursion?
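For comparison, the lazy "add at the end" construction can also be written without explicit recursion by composing the standard Stream combinators. This is only a sketch: the name toStreamLazy is hypothetical, and exitCondition is passed in as a parameter here (an assumption, since the original snippets refer to it as a free name).

```scala
// Hypothetical combinator-based variant: each element is produced only on demand.
def toStreamLazy[A, B](start: A)(f: A => B, next: A => A, exitCondition: A => Boolean): Stream[B] =
  Stream.iterate(start)(next)            // start, next(start), next(next(start)), ...
    .takeWhile(a => !exitCondition(a))   // stop at the first state that satisfies the exit condition
    .map(f)                              // transform lazily

val doubled = toStreamLazy(1)(_ * 2, _ + 1, _ > 3).toList
```

Since Stream already defers the evaluation of its tail, no stack builds up during construction, which is the usual reason for wanting tail recursion in the first place.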

What is the ideal collection for incremental (with multiple passings) filtering of collection?

I've seen many questions about Scala collections and could not decide.
This question was the most useful until now.
I think the core of the question is twofold:
1) Which are the best collections for this use case?
2) Which are the recommended ways to use them?
Details:
I am implementing an algorithm that iterates over all elements in a collection
searching for the one that matches a certain criterion.
After the search, the next step is to search again with a new criterion, but without the chosen element among the possibilities.
The idea is to create a sequence with all original elements ordered by the criterion (which changes at every new selection).
The original sequence doesn't really need to be ordered, but there can be duplicates (the algorithm will only pick one at a time).
Example with a small sequence of Ints (just to simplify):
object Foo extends App {
def f(already_selected: Seq[Int])(element: Int): Double =
// something more complex happens here,
// especially something that takes 'already_selected' into account
math.sqrt(element)
//call to the algorithm
val (result, ti) = Tempo.time(recur(Seq.fill(9900)(Random.nextInt), Seq()))
println("ti = " + ti)
//algorithm
def recur(collection: Seq[Int], already_selected: Seq[Int]): (Seq[Int], Seq[Int]) =
if (collection.isEmpty) (Seq(), already_selected)
else {
val selected = collection maxBy f(already_selected)
val rest = collection diff Seq(selected) //this part doesn't seem to be efficient
recur(rest, selected +: already_selected)
}
}
object Tempo {
def time[T](f: => T): (T, Double) = {
val s = System.currentTimeMillis
(f, (System.currentTimeMillis - s) / 1000d)
}
}
Try @inline and, as icn suggested, the approach from "How can I idiomatically "remove" a single element from a list in Scala and close the gap?":
import scala.annotation.tailrec
import scala.util.Random

object Foo extends App {
  @inline
  def f(already_selected: Seq[Int])(element: Int): Double =
    // something more complex happens here,
    // especially something that takes 'already_selected' into account
    math.sqrt(element)
  // call to the algorithm
  val (result, ti) = Tempo.time(recur(Seq.fill(9900)(Random.nextInt()), Seq()))
  println("ti = " + ti)
  // algorithm
  @tailrec
  def recur(collection: Seq[Int], already_selected: Seq[Int]): Seq[Int] =
    if (collection.isEmpty) already_selected
    else {
      // index against the *current* collection; indices computed once up front go stale after each removal
      val (selected, i) = collection.zipWithIndex.maxBy(x => f(already_selected)(x._1))
      val rest = collection.patch(i, Nil, 1) // removes only the element at position i
      recur(rest, selected +: already_selected)
    }
}
object Tempo {
def time[T](f: => T): (T, Double) = {
val s = System.currentTimeMillis
(f, (System.currentTimeMillis - s) / 1000d)
}
}
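Since the question states that the order of the remaining elements does not matter, one option is a mutable buffer with "swap-remove": overwrite the chosen slot with the last element and drop the last slot, which avoids shifting or copying the rest. The sketch below is an assumption-laden illustration rather than a measured improvement; selectAll is a hypothetical name, and score plays the role of f from the code above.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical variant of recur: repeatedly picks the best remaining element.
def selectAll(input: Seq[Int])(score: Seq[Int] => Int => Double): List[Int] = {
  val pool = ArrayBuffer.empty[Int]
  pool ++= input
  var selected = List.empty[Int]
  while (pool.nonEmpty) {
    val scoreOf = score(selected) // the criterion may depend on what was already selected
    var best = 0
    var bestScore = scoreOf(pool(0))
    var i = 1
    while (i < pool.length) {
      val s = scoreOf(pool(i))
      if (s > bestScore) { best = i; bestScore = s }
      i += 1
    }
    selected = pool(best) :: selected
    // swap-remove: constant time, but does not preserve the order of the pool
    pool(best) = pool(pool.length - 1)
    pool.remove(pool.length - 1)
  }
  selected
}

val picked = selectAll(Seq(4, 9, 1, 9))(_ => x => math.sqrt(x.toDouble))
```

The search for the maximum is still linear per round, so the whole selection remains quadratic; only the removal step becomes cheap.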

How to use takeWhile with an Iterator in Scala

I have an Iterator of elements and I want to consume them until a condition is met in the next element, like:
val it = List(1,1,1,1,2,2,2).iterator
val res1 = it.takeWhile( _ == 1).toList
val res2 = it.takeWhile(_ == 2).toList
res1 gives the expected List(1,1,1,1), but res2 returns List(2,2) because the first takeWhile had to consume the element at position 4 (the first 2) in order to check the condition.
I know that the list will be ordered, so there is no point in traversing the whole list as partition does. I would like to stop as soon as the condition is not met. Is there any clever way to do this with Iterators? I cannot call toList on the iterator because it comes from a very big file.
The simplest solution I found:
val it = List(1,1,1,1,2,2,2).iterator
val (r1, it2) = it.span( _ == 1)
println(s"group taken is: ${r1.toList}\n rest is: ${it2.toList}")
output:
group taken is: List(1, 1, 1, 1)
rest is: List(2, 2, 2)
It is very short, but from then on you have to use the new iterator.
With any immutable collection it would be similar:
use takeWhile when you want only some prefix of the collection,
use span when you also need the rest.
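If you would rather keep working with a single iterator, a BufferedIterator lets you peek at the next element with head before deciding whether to consume it. A minimal sketch, where takeWhileBuffered is a hypothetical helper name:

```scala
// Hypothetical helper: consumes elements only while the *next* one satisfies p.
def takeWhileBuffered[A](it: BufferedIterator[A])(p: A => Boolean): List[A] = {
  val out = List.newBuilder[A]
  // head peeks without advancing, so the first failing element stays in the iterator
  while (it.hasNext && p(it.head)) out += it.next()
  out.result()
}

val buffered = List(1, 1, 1, 1, 2, 2, 2).iterator.buffered
val ones = takeWhileBuffered(buffered)(_ == 1)
val twos = takeWhileBuffered(buffered)(_ == 2)
```

Each group is still materialized as a List here, but the underlying source is only read as far as needed.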
With my other answer (which I've left separate as they are largely unrelated), I think you can implement groupWhen on Iterator as follows:
def groupWhen[A](itr: Iterator[A])(p: (A, A) => Boolean): Iterator[List[A]] = {
@annotation.tailrec
def groupWhen0(acc: Iterator[List[A]], itr: Iterator[A])(p: (A, A) => Boolean): Iterator[List[A]] = {
val (dup1, dup2) = itr.duplicate
val pref = ((dup1.sliding(2) takeWhile { case Seq(a1, a2) => p(a1, a2) }).zipWithIndex collect {
case (seq, 0) => seq
case (Seq(_, a), _) => Seq(a)
}).flatten.toList
val newAcc = if (pref.isEmpty) acc else acc ++ Iterator(pref)
if (dup2.nonEmpty)
groupWhen0(newAcc, dup2 drop (pref.length max 1))(p)
else newAcc
}
groupWhen0(Iterator.empty, itr)(p)
}
When I run it on an example:
println( groupWhen(List(1,1,1,1,3,4,3,2,2,2).iterator)(_ == _).toList )
I get List(List(1, 1, 1, 1), List(2, 2, 2))
I had a similar need, but the solution from @oxbow_lakes does not handle a list with only one element, or a list containing elements that are not repeated. Also, that solution doesn't lend itself well to an infinite iterator (it wants to "see" all the elements before it gives you a result).
What I needed was the ability to group sequential elements that match a predicate, but also include the single elements (I can always filter them out if I don't need them). I needed those groups to be delivered continuously, without having to wait for the original iterator to be completely consumed before they are produced.
I came up with the following approach which works for my needs, and thought I should share:
implicit class IteratorEx[+A](itr: Iterator[A]) {
def groupWhen(p: (A, A) => Boolean): Iterator[List[A]] = new AbstractIterator[List[A]] {
val (it1, it2) = itr.duplicate
val ritr = new RewindableIterator(it1, 1)
override def hasNext = it2.hasNext
override def next() = {
val count = (ritr.rewind().sliding(2) takeWhile {
case Seq(a1, a2) => p(a1, a2)
case _ => false
}).length
(it2 take (count + 1)).toList
}
}
}
The above is using a few helper classes:
abstract class AbstractIterator[A] extends Iterator[A]
/**
* Wraps a given iterator to add the ability to remember the last 'remember' values
* From any position the iterator can be rewound (can go back) at most 'remember' values,
* such that when calling 'next()' the memoized values will be provided as if they have not
* been iterated over before.
*/
class RewindableIterator[A](it: Iterator[A], remember: Int) extends Iterator[A] {
private var memory = List.empty[A]
private var memoryIndex = 0
override def next() = {
if (memoryIndex < memory.length) {
val next = memory(memoryIndex)
memoryIndex += 1
next
} else {
val next = it.next()
memory = memory :+ next
if (memory.length > remember)
memory = memory drop 1
memoryIndex = memory.length
next
}
}
def canRewind(n: Int) = memoryIndex - n >= 0
def rewind(n: Int) = {
require(memoryIndex - n >= 0, "Attempted to rewind past 'remember' limit")
memoryIndex -= n
this
}
def rewind() = {
memoryIndex = 0
this
}
override def hasNext = it.hasNext
}
Example use:
List(1,2,2,3,3,3,4,5,5).iterator.groupWhen(_ == _).toList
gives: List(List(1), List(2, 2), List(3, 3, 3), List(4), List(5, 5))
If you want to filter out the single elements, just apply a filter or withFilter after groupWhen
Stream.continually(Random.nextInt(100)).iterator
.groupWhen(_ + _ == 100).withFilter(_.length > 1).take(3).toList
gives: List(List(34, 66), List(87, 13), List(97, 3))
You could use the toStream method on Iterator.
Stream is a lazy equivalent of List.
As you can see from the implementation of toStream, it creates a Stream without traversing the whole Iterator.
Stream keeps all evaluated elements in memory. You should confine references to the Stream to a local scope to prevent memory leaks.
With Stream you should use span like this:
val (res1, rest1) = stream.span(_ == 1)
val (res2, rest2) = rest1.span(_ == 2)
I'm guessing a bit here but by the statement "until a condition is met in the next element", it sounds like you might want to look at the groupWhen method on ListOps in scalaz
scala> import scalaz.syntax.std.list._
import scalaz.syntax.std.list._
scala> List(1,1,1,1,2,2,2) groupWhen (_ == _)
res1: List[List[Int]] = List(List(1, 1, 1, 1), List(2, 2, 2))
Basically this "chunks" up the input sequence upon a condition (a (A, A) => Boolean) being met between an element and its successor. In the example above the condition is equality, so, as long as an element is equal to its successor, they will be in the same chunk.
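The same chunking idea can be sketched on a plain Iterator without scalaz by peeking at the successor through a BufferedIterator. This is a hypothetical standalone helper (named groupWhenLazy to avoid confusion with the scalaz method), not the scalaz implementation; it produces each chunk as soon as its end is detected, so it also works on iterators that are too large to hold in memory.

```scala
// Hypothetical helper: groups consecutive elements while p(previous, next) holds.
def groupWhenLazy[A](source: Iterator[A])(p: (A, A) => Boolean): Iterator[List[A]] = {
  val it = source.buffered
  new Iterator[List[A]] {
    def hasNext: Boolean = it.hasNext
    def next(): List[A] = {
      val chunk = List.newBuilder[A]
      var last = it.next()
      chunk += last
      // extend the chunk while the current element and its successor satisfy p
      while (it.hasNext && p(last, it.head)) {
        last = it.next()
        chunk += last
      }
      chunk.result()
    }
  }
}

val groups = groupWhenLazy(Iterator(1, 2, 2, 3, 3, 3, 4, 5, 5))(_ == _).toList
```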

Creating a repeating true/false List in scala

I want to generate a Seq/List of true/false values which I can zip with some input in order to do the equivalent of checking whether a for loop index is odd/even.
Is there a better way than
input.zip((1 to n).map(_ % 2 == 0))
or
input.zip(List.tabulate(n)(_ % 2 != 0))
I would have thought something like (true, false).repeat(n/2) is more obvious
Using @DaveGriffith's idea:
input.zip(Stream.iterate(false)(!_))
Or, if you use this pattern in several places:
def falseTrueStream = Stream.iterate(false)(!_)
input.zip(falseTrueStream)
This has the distinct advantage of not needing to specify the size of the false-true list.
Edit:
Of course, def falseTrueStream creates the stream of true/false objects every time you use it, and as @DanielCSobral mentioned, making it a val will cause the objects to be held in memory (until the program ends if the val is on an object).
If you're slightly evil and want to prematurely optimize it, you can build the Stream objects yourself.
object TrueFalseStream extends Stream[Boolean] {
val tailDefined = true
override val isEmpty = false
override val head = true
override val tail = FalseTrueStream
}
object FalseTrueStream extends Stream[Boolean] {
val tailDefined = true
override val isEmpty = false
override val head = false
override val tail = TrueFalseStream
}
If you want a list of alternating true/false of size n:
List.iterate(false, n)(!_)
So then you could do:
val input = List("a", "b", "c", "d")
input.zip(List.iterate(false, input.length)(!_))
//List[(java.lang.String, Boolean)] = List((a,false), (b,true), (c,false), (d,true))
Haskell has a function, cycle, which is very useful for such purposes:
haskell> zip [1..7] $ cycle [True, False]
[(1,True),(2,False),(3,True),(4,False),(5,True),(6,False),(7,True)]
For some reason, the Scala standard library doesn't have it. You can define it on your own, and then use it.
scala> def cycle[A](s: Stream[A]): Stream[A] = Stream.continually(s).flatten
cycle: [A](s: Stream[A])Stream[A]
scala> (1 to 7) zip cycle(Stream(true, false))
res13: scala.collection.immutable.IndexedSeq[(Int, Boolean)] = Vector((1,true), (2,false), (3,true), (4,false), (5,true), (6,false), (7,true))
You want
input.indices.map(_%2==0)
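If the goal is to have each element paired with its flag, the index-based approach can be sketched with zipWithIndex, which avoids building a separate list of booleans. The sample input below is only for illustration.

```scala
val sample = List("a", "b", "c", "d")

// Pair every element with a flag saying whether its index is even.
val flagged = sample.zipWithIndex.map { case (x, i) => (x, i % 2 == 0) }
```

Here the first element gets true because index 0 is even; negate the test if the opposite convention is wanted.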
I couldn't come up with anything simpler (and this is far from simple):
(for(_ <- 1 to n/2) yield List(true, false)).flatten
and:
(1 to n/2).foldLeft(List[Boolean]()) {(cur,_) => List(true, false) ++ cur}
Watch for odd n!
However based on your requirements it looks like you might want to have something lazy:
def oddEven(init: Boolean): Stream[Boolean] = Stream.cons(init, oddEven(!init))
...and it never ends (try: oddEven(true) foreach println). Now you can take as much as you want:
oddEven(true).take(10).toList
"...in order to do the equivalent of checking whether a for loop index is odd/even."
I'm ignoring your specific request, and addressing your main concern in a different way.
You can make your own control function, like so:
def for2[A](xs: List[A])(f: A => Unit, g: A => Unit): Unit = xs match {
  case y :: ys =>
    f(y)
    for2(ys)(g, f) // swap f and g so that they alternate
  case _ => () // the unit value, not the Unit companion object
}
Testing
> for2(List(0,1,2,3,4,5))((x) => println("E: " + x), (x) => println("O: " + x))
E: 0
O: 1
E: 2
O: 3
E: 4
O: 5