Scala: efficient iteration over multiple iterators

I'm writing an application server that has a message sending loop. A message is composed of fields and can thus be viewed as an iterator over those fields. There is a message queue processed by the message loop, but the loop can be broken at any time (e.g. when the socket buffer is full) and resumed later. The current implementation looks like this:
private val messageQueue: Queue[Iterator[Field]]

var sent = 0
try {
  breakable {
    for (iterator <- messageQueue) {
      for (field <- iterator) {
        ... // may call break() at any point
      }
      sent += 1
    }
  }
} finally messageQueue.trimStart(sent)
This works and is not bad, but then I thought I could make the code a bit cleaner by replacing the queue with a single iterator that concatenates the per-message iterators using the ++ operator. That is:
private val messageQueue: Iterator[Field] = message1.iterator ++ message2.iterator ++ ...

breakable {
  for (field <- messageQueue) {
    ... // may call break() at any point
  }
}
Now the code looks much cleaner, but there's a performance issue. Concatenated iterators form an (unbalanced) tree internally, so the next() operation takes O(n) time and the whole iteration takes O(n^2) time.
To summarize: the messages need to be processed just once, so the queue doesn't need to be a Traversable; an Iterator (TraversableOnce) would do. I'd like to view the message queue as a collection of consecutive iterators, but ++ has the performance issue above. Is there a nice solution that makes the code cleaner while staying efficient?

What if you just flatten them?
def flattenIterator[T](l: List[Iterator[T]]): Iterator[T] = l.iterator.flatten
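Unlike a chain of ++ calls, the flattened iterator keeps a single cursor into the outer list plus the current inner iterator, so next() stays O(1). A quick sketch of partial consumption (using Int in place of Field):

val m1 = Iterator(1, 2)
val m2 = Iterator(3, 4)
val queue: Iterator[Int] = flattenIterator(List(m1, m2))

queue.next()         // 1
queue.take(2).toList // List(2, 3): iteration can stop and resume at any point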

Have you thought about using Stream and #::: to lazily concatenate your messages together?
private val messageQueue: Stream[Field] = message1.toStream #::: message2.toStream #::: ...
breakable {
  for (field <- messageQueue) {
    ... // may call break() at any point
  }
}
As for the time complexity here, I believe it would be O(n) in the number of iterators you're concatenating (you need to call toStream for each iterator and #::: them together). However, the individual toStream and #::: operations should be O(1) since they're lazy. Here's the toStream implementation for Iterator:
def toStream: Stream[A] =
  if (self.hasNext) Stream.cons(self.next, self.toStream)
  else Stream.empty[A]
This will take constant time because the 2nd argument to Stream.cons is call-by-name, so it won't get evaluated until you actually access the tail.
However, the conversion to Stream will add a constant factor of overhead for each element access, i.e. instead of just calling next on the iterator it will have to do a few extra method calls to force the lazy tail of the stream and access the contained value.
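A small REPL-style sketch of that laziness (the println is just instrumentation): only the head is evaluated when toStream is called, and the rest is produced on demand as the stream is forced.

val it = Iterator.tabulate(5) { i => println(s"producing $i"); i }
val s = it.toStream  // prints "producing 0" only: Stream.cons evaluates its head eagerly
s.take(3).toList     // prints "producing 1" and "producing 2" while forcing the tail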

Related

Lazy Pagination in Scala (Stream/Iterator of Iterators?)

I'm reading a very large number of records sequentially from a database API, one page at a time (with an unknown number of records per page), via calls to def readPage(pageNumber: Int): Iterator[Record]
I'm trying to wrap this API in something like Stream[Iterator[Record]] or Iterator[Iterator[Record]], lazily and in a functional way, ideally with no mutable state and a constant memory footprint, so that I can treat it as an infinite stream of pages (or a sequence of iterators) and abstract the pagination away from the client. The client can iterate over the result; each call to next() retrieves the next page (an Iterator[Record]).
What is the most idiomatic and efficient way to implement this in Scala?
Edit: I need to fetch & process the records one page at a time and cannot keep all the records from all pages in memory. If one page fails, an exception should be thrown. The large number of pages/records is effectively infinite for all practical purposes. I want to treat this as an infinite stream (or iterator) of pages, with each page being an iterator over a finite number of records (e.g. fewer than 1000, but the exact number is unknown ahead of time).
I looked at BatchCursor in Monix but it serves a different purpose.
Edit 2: this is the current version, using Tomer's answer below as a starting point, but using Stream instead of Iterator.
This eliminates the need for tail recursion (as per https://stackoverflow.com/a/10525539/165130) and makes the Stream prepend operation #:: O(1) (whereas concatenating iterators via ++ would be O(n)).
Note: while Streams are lazily evaluated, Stream memoization can still cause memory blow-up, and memory management gets tricky. Defining the Stream with def instead of val (def pages = readAllPages) doesn't seem to have any effect.
def readAllPages(pageNumber: Int = 0): Stream[Iterator[Record]] = {
  val iter: Iterator[Record] = readPage(pageNumber)
  if (iter.isEmpty)
    Stream.empty
  else
    iter #:: readAllPages(pageNumber + 1)
}
// usage
val pages = readAllPages()
for {
  page <- pages
  record <- page
  if isValid(record)
} process(record)
Edit 3:
Tomer's second suggestion seems to be the best; its runtime and memory footprint are similar to the above solution, but it is much more concise and less error-prone.
val pages = Stream.from(1).map(readPage).takeWhile(_.nonEmpty)
Note: Stream.from(1) creates a stream starting from 1 and incrementing by 1; see the API docs.
You can try implementing logic like this:
def readPage(pageNumber: Int): Iterator[Record] = ???

// Note: not tail-recursive (the recursive call sits inside ++), but
// Iterator.++ takes its argument by-name, so the next page is only
// requested once the current one is exhausted.
def readAllPages(pageNumber: Int): Iterator[Iterator[Record]] = {
  val iter = readPage(pageNumber)
  if (iter.nonEmpty) {
    // Compute on records
    // When finished computing:
    Iterator(iter) ++ readAllPages(pageNumber + 1)
  } else {
    Iterator.empty
  }
}

readAllPages(0)
A shorter option would be:
Stream.from(1).map(readPage).takeWhile(_.nonEmpty)
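If the client ultimately wants a flat stream of records rather than pages, the same pipeline can be flattened lazily; a sketch, reusing the readPage and Record from the question:

val records: Iterator[Record] =
  Stream.from(1).map(readPage).takeWhile(_.nonEmpty).iterator.flatten

Going through .iterator means the pages are consumed once, in order, without the quadratic cost of chained ++.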

Scala: dynamic programming recursion using iterators

I'm learning how to do dynamic programming in Scala, and I often find myself in a situation where I want to recursively proceed over an array (or some other iterable) of items. When I do this, I tend to write cumbersome functions like this:
def arraySum(array: Array[Int], index: Int, accumulator: Int): Int = {
  if (index == array.length) {
    accumulator
  } else {
    arraySum(array, index + 1, accumulator + array(index))
  }
}

arraySum(Array(1, 2, 3), 0, 0)
(Ignore for a moment that I could just call sum on the array or do a .reduce(_ + _); I'm trying to learn programming principles.)
But this seems like I'm passing a lot of variables around, and what exactly is the point of passing the array to each function call? This seems unclean.
So instead I got the idea to do this with iterators and not worry about passing indexes:
def arraySum(iter: Iterator[Int])(implicit accumulator: Int = 0): Int = {
  try {
    val nextInt = iter.next()
    arraySum(iter)(accumulator + nextInt)
  } catch {
    case nee: NoSuchElementException => accumulator
  }
}

arraySum(Array(1, 2, 3).toIterator)
This seems like a much cleaner solution. However, it falls apart when you need to use dynamic programming to explore some outcome space and you don't need to advance the iterator at every function call. E.g.
def explore(iter: Iterator[Int])(implicit accumulator: Int = 0): Int = {
  if (someCase) {
    explore(iter)(accumulator)
  } else if (someOtherCase) {
    val nextInt = iter.next()
    explore(iter)(accumulator + nextInt)
  } else {
    // Some kind of aggregation/selection of explore results
  }
}
My understanding is that the iter iterator here is effectively passed by reference, so when this function calls iter.next() it advances the one instance of iter shared by all the recursive calls. To get around that, I'm now cloning the iterator at every call of the explore function, e.g.:
def explore(iter: Iterator[Int])(implicit accumulator: Int = 0): Int = {
  if (someCase) {
    explore(iter)(accumulator)
  } else if (someOtherCase) {
    val iterClone = iter.toList.toIterator
    explore(iterClone)(accumulator + iterClone.next())
  } else {
    // Some kind of aggregation/selection of explore results
  }
}
But this seems pretty stupid, and the stupidity escalates when I have multiple iterators that may or may not need cloning in multiple else if cases. What is the right way to handle situations like this? How can I elegantly solve these kinds of problems?
Suppose that you want to write a back-tracking recursive function that needs some complex data structure as an argument, so that the recursive calls receive a slightly modified version of the data structure. You have several options for how to do it:
1. Clone the entire data structure, modify it, and pass it to the recursive call. This is very simple, but usually very expensive.
2. Modify the mutable structure in place, pass it to the recursive call, then revert the modification when backtracking. You have to ensure that every possible call of your recursive function always restores the original state of the data structure exactly. This is much more efficient, but hard to implement correctly and very error-prone.
3. Subdivide the structure into a large immutable part and a small mutable part. For example, you could pass an index (or a pair of indices) that specifies a slice of an array explicitly, along with an array that is never mutated. You could then "clone" and save only the mutable part, and restore it when backtracking. If it works, it is both simple and fast, but it doesn't always work, because substructures can be hard to describe with just a few integer indices.
4. Rely on persistent immutable data structures whenever you can.
I'd like to elaborate on the last point, because this is the preferred way to do it in Scala and in functional programming in general.
Here is your original code, that uses the third strategy:
def arraySum(array: Array[Int], index: Int, accumulator: Int): Int = {
  if (index == array.length) {
    accumulator
  } else {
    arraySum(array, index + 1, accumulator + array(index))
  }
}
If you used a List instead of an Array, you could rewrite it like this:
@annotation.tailrec
def listSum(list: List[Int], acc: Int): Int = list match {
  case Nil => acc
  case h :: t => listSum(t, acc + h)
}
Here, h :: t is a pattern that deconstructs the list into the head and the tail.
Note that you don't need an explicit index any more, because accessing the tail t of the list is a constant-time operation, so that only the relevant remaining sublist is passed to the recursive call of listSum.
There is no backtracking here, but if the recursive method did backtrack, using lists would bring another advantage: extracting a sublist is almost free (a constant-time operation), yet the sublist is still guaranteed to be immutable. You can just pass it into the recursive call without caring whether the call modifies it, and you don't have to undo any modifications afterwards. This is the advantage of persistent immutable data structures: related lists can share most of their structure while still appearing immutable from the outside, so it's impossible to break anything in the parent list just because you have access to its tail. This would not be the case with a view over a mutable array.
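To make the backtracking point concrete, here is a small hypothetical example: counting the subsets of a list that sum to a target. Both recursive calls receive sublists of the same list; nothing is copied and nothing needs to be undone.

def countSubsets(xs: List[Int], target: Int): Int = xs match {
  case Nil => if (target == 0) 1 else 0
  case h :: t =>
    countSubsets(t, target - h) + // branch that takes the head
      countSubsets(t, target)     // branch that skips it
}

countSubsets(List(1, 2, 3), 3) // 2, for {1, 2} and {3}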

Losing types on sequencing Futures

I'm trying to do this:
case class ConversationData(members: Seq[ConversationMemberModel], messages: Seq[MessageModel])

val membersFuture: Future[Seq[ConversationMemberModel]] =
  ConversationMemberPersistence.searchByConversationId(conversationId)
val messagesFuture: Future[Seq[MessageModel]] =
  MessagePersistence.searchByConversationId(conversationId)

Future.sequence(List(membersFuture, messagesFuture)).map { result =>
  // some magic here
  self ! ConversationData(members, messages)
}
But when I sequence the two futures, the compiler loses the types: it says the type of result is List[Seq[Product with Serializable]]. At first I expected to do something like
Future.sequence(List(membersFuture, messagesFuture)).map{ members, messages => ...
But it looks like sequencing futures doesn't work like this... I also tried using a collect inside the map, but I get similar errors.
Thanks for your help
When using Future.sequence, the assumption is that the underlying types produced by the multiple Futures are the same (or extend from the same parent type). With sequence, you basically invert a Seq of Futures for a particular type to a single Future for a Seq of that particular type. A concrete example is probably more illustrative of that point:
val f1: Future[Foo] = ...
val f2: Future[Foo] = ...
val f3: Future[Foo] = ...

val futures: List[Future[Foo]] = List(f1, f2, f3)
val aggregateFuture: Future[List[Foo]] = Future.sequence(futures)
So you can see that I went from a List of Future[Foo] to a single Future wrapping a List[Foo]. You use this when you already have a bunch of Futures producing results of the same type (or base type) and you want to aggregate all of the results for the next processing step. The sequence method produces a new Future that won't be completed until all of the aggregated Futures are done, and it will then contain the aggregated results of all of those Futures. This works especially well when you have an indeterminate or variable number of Futures to process.
For your case, it seems that you have a fixed number of Futures to handle. As @Zoltan suggested, a simple for-comprehension is probably a better fit here because the number of Futures is known. So solving your problem like so:
for {
  members <- membersFuture
  messages <- messagesFuture
} {
  self ! ConversationData(members, messages)
}
is probably the best way to go for this specific example.
What are you trying to achieve with the sequence call? I'd just use a for-comprehension instead:
val membersFuture: Future[Seq[ConversationMemberModel]] =
  ConversationMemberPersistence.searchByConversationId(conversationId)
val messagesFuture: Future[Seq[MessageModel]] =
  MessagePersistence.searchByConversationId(conversationId)

for {
  members <- membersFuture
  messages <- messagesFuture
} yield (self ! ConversationData(members, messages))
Note that it is important to declare the two futures outside the for-comprehension, because otherwise messagesFuture wouldn't be started until membersFuture has completed, as the sketch below shows.
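For illustration, here is the sequential version spelled out; because a for-comprehension desugars to flatMap/map, the second query would not even be issued until the first future has completed:

// Sequential by construction: the messages query only starts after
// the members future has completed.
for {
  members <- ConversationMemberPersistence.searchByConversationId(conversationId)
  messages <- MessagePersistence.searchByConversationId(conversationId)
} yield (self ! ConversationData(members, messages))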
You could also use zip:
membersFuture.zip(messagesFuture).map {
  case (members, messages) => self ! ConversationData(members, messages)
}
but I'd prefer the for-comprehension.
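On Scala 2.12 or later you could also reach for zipWith, which fuses the zip and the map into one step; a sketch of the same wiring:

membersFuture.zipWith(messagesFuture)(ConversationData.apply).foreach { data =>
  self ! data // needs an implicit ExecutionContext in scope, as above
}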

Scala: Thread safe mutable lazy Iterator with append

For an immutable flavour, Iterator does the job.
val x = Iterator.fill(100000)(someFn)
Now I want to implement a mutable version of Iterator, with three guarantees:
thread-safe on all transformations (fold, foldLeft, ...) and append
lazy evaluated
traversable only once! Once used, an object from this Iterator should be destroyed.
Is there an existing implementation to give me these guarantees? Any library or framework example would be great.
Update
To illustrate the desired behaviour.
class SomeThing {}

class Test(val list: Iterator[SomeThing] = Iterator.empty) {
  def add(thing: SomeThing): Test = {
    new Test(list ++ Iterator(thing))
  }
}

(new Test()).add(new SomeThing).add(new SomeThing)
In this example, SomeThing is expensive to construct, so it needs to be lazy.
Re-iterating over list is never required, so Iterator is a good fit.
This is supposed to asynchronously and lazily sequence 10 million SomeThing instances without depleting the executor (a cached thread pool executor) or running out of memory.
You don't need a mutable Iterator for this, just daisy-chain the immutable form:
class SomeThing {}

case class Test(list: Iterator[SomeThing] = Iterator.empty) {
  def add(thing: => SomeThing) = Test(list ++ Iterator(thing))
}

(new Test()).add(new SomeThing).add(new SomeThing)
Although you don't really need the extra boilerplate of Test here:
Iterator(new SomeThing) ++ Iterator(new SomeThing)
Note that Iterator.++ takes a by-name param, so the ++ operation is already lazy.
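A quick way to convince yourself of that laziness (the println is just instrumentation):

val it = Iterator(1) ++ { println("building tail"); Iterator(2) }
it.next() // returns 1; nothing printed yet
it.next() // prints "building tail", then returns 2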
You might also want to try this, to avoid building intermediate Iterators:
Iterator.continually(new SomeThing) take 2
UPDATE
If you don't know the size in advance, then I'll often use a tactic like this:
def mkSomething = if (cond) Some(new SomeThing) else None

Iterator.continually(mkSomething) takeWhile (_.isDefined) map { _.get }
The trick is to have your generator function wrap its output in an Option, which then gives you a way to flag that the iteration is finished by returning None.
Of course... if you're really pushing the boat out, you can even use the dreaded null:

def mkSomething = if (cond) new SomeThing else null

Iterator.continually(mkSomething) takeWhile (_ != null)
Seems like you need to hide the fact that the iterator is mutable but at the same time allow it to grow mutably. What I'm going to propose is the same sort of trick I've used to speed up ::: in the past:
abstract class AppendableIterator[A] extends Iterator[A] {
  protected var inner: Iterator[A]

  def hasNext = inner.hasNext
  def next() = inner.next()

  def append(that: Iterator[A]) = synchronized {
    inner = new JoinedIterator(inner, that)
  }
}

// You might need to add some more things, this is a skeleton
class JoinedIterator[A](first: Iterator[A], second: Iterator[A]) extends Iterator[A] {
  def hasNext = first.hasNext || second.hasNext
  def next() =
    if (first.hasNext) first.next()
    else if (second.hasNext) second.next()
    else Iterator.empty.next() // throws NoSuchElementException
}
So what you're really doing is leaving the Iterator at whatever point in its iteration it happens to be, while preserving the thread safety of append by non-destructively "joining" another Iterator onto it. You avoid the need to recompute the two together because you never actually force them through a CanBuildFrom.
This is also a generalization of just adding one item. You can always wrap some A in an Iterator[A] of one element if you so choose.
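A minimal usage sketch of the skeleton above, supplying the abstract inner in an anonymous subclass:

val it = new AppendableIterator[Int] {
  protected var inner: Iterator[Int] = Iterator(1, 2)
}
it.append(Iterator(3, 4))
it.toList // List(1, 2, 3, 4)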
Have you looked at the mutable.ParIterable in the collection.parallel package?
To access an iterator over the elements you can do something like

import scala.collection.parallel.mutable.ParIterable

val x = ParIterable.fill(100000)(someFn).iterator
From the docs:
Parallel operations are implemented with divide and conquer style algorithms that parallelize well. The basic idea is to split the collection into smaller parts until they are small enough to be operated on sequentially.
...
The higher-order functions passed to certain operations may contain side-effects. Since implementations of bulk operations may not be sequential, this means that side-effects may not be predictable and may produce data-races, deadlocks or invalidation of state if care is not taken. It is up to the programmer to either avoid using side-effects or to use some form of synchronization when accessing mutable data.

Scala: What is the difference between Traversable and Iterable traits in Scala collections?

I have looked at this question but still don't understand the difference between the Iterable and Traversable traits. Can someone explain?
Think of it as the difference between blowing and sucking.
When you call a Traversable's foreach, or one of its derived methods, it blows its values into your function one at a time - so it has control over the iteration.
With the Iterator returned by an Iterable though, you suck the values out of it, controlling when to move to the next one yourself.
To put it simply, iterators keep state, traversables don't.
A Traversable has one abstract method: foreach. When you call foreach, the collection will feed the passed function all the elements it keeps, one after the other.
On the other hand, an Iterable has as abstract method iterator, which returns an Iterator. You can call next on an Iterator to get the next element at the time of your choosing. Until you do, it has to keep track of where it was in the collection, and what's next.
tl;dr Iterables are Traversables that can produce stateful Iterators
First, know that Iterable is a subtrait of Traversable.
Second,
Traversable requires implementing the foreach method, which is used by everything else.
Iterable requires implementing the iterator method, which is used by everything else.
For example, the implementation of find for Traversable uses foreach (via a for comprehension) and throws a BreakControl exception to halt iteration once a satisfactory element has been found.
trait TraversableLike[+A] {
  def find(p: A => Boolean): Option[A] = {
    var result: Option[A] = None
    breakable {
      for (x <- this)
        if (p(x)) { result = Some(x); break }
    }
    result
  }
}
In contrast, the Iterable subtrait overrides this implementation and calls find on its Iterator, which simply stops iterating once the element is found:
trait Iterable[+A] {
  override /*TraversableLike*/ def find(p: A => Boolean): Option[A] =
    iterator.find(p)
}

trait Iterator[+A] {
  def find(p: A => Boolean): Option[A] = {
    var res: Option[A] = None
    while (res.isEmpty && hasNext) {
      val e = next()
      if (p(e)) res = Some(e)
    }
    res
  }
}
It'd be nice not to throw exceptions for Traversable iteration, but that's the only way to partially iterate when using just foreach.
From one perspective, Iterable is the more demanding/powerful trait, as you can easily implement foreach using iterator, but you can't really implement iterator using foreach.
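The easy direction takes only a few lines; a sketch:

def foreachViaIterator[A](xs: Iterable[A])(f: A => Unit): Unit = {
  val it = xs.iterator
  while (it.hasNext) f(it.next())
}

The reverse would require suspending foreach mid-flight, which ordinary method calls can't do (short of threads or continuations).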
In summary, Iterable provides a way to pause, resume, or stop iteration via a stateful Iterator. With Traversable, it's all or nothing (sans exceptions for flow control).
Most of the time it doesn't matter, and you'll want the more general interface. But if you ever need more customized control over iteration, you'll need an Iterator, which you can retrieve from an Iterable.
Daniel's answer sounds good. Let me see if I can put it in my own words.
So an Iterable can give you an iterator, that lets you traverse the elements one at a time (using next()), and stop and go as you please. To do that the iterator needs to keep an internal "pointer" to the element's position. But a Traversable gives you the method, foreach, to traverse all elements at once without stopping.
Something like Range(1, 10) needs to have only 2 integers as state as a Traversable. But Range(1, 10) as an Iterable gives you an iterator which needs to use 3 integers for state, one of which is an index.
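To make that concrete, here is a simplified sketch (not the real Range implementation; it assumes pre-2.13 collections, where Traversable still exists):

class MyRange(start: Int, end: Int) extends Traversable[Int] {
  def foreach[U](f: Int => U): Unit = {
    var i = start // the cursor lives only for the duration of this call
    while (i < end) { f(i); i += 1 }
  }
}

class MyIterableRange(start: Int, end: Int) extends Iterable[Int] {
  def iterator: Iterator[Int] = new Iterator[Int] {
    private var i = start // the cursor persists between next() calls
    def hasNext = i < end
    def next() = { val v = i; i += 1; v }
  }
}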
Considering that Traversable also offers foldLeft, foldRight, etc., its foreach must traverse the elements in a known and fixed order. Therefore it's possible to implement an iterator for a Traversable, e.g.

def iterator = toList.iterator

Note, though, that toList is itself built via foreach, so this materializes the whole collection up front rather than producing elements on demand.