Scala - iterators and takeWhile

I am running the following piece of code:
val it = List(1,1,1,2,2,3,3).iterator.buffered
val compare = it.head
it.takeWhile(_ == compare).toList
and it returns List(1,1,1). However, if I run this as:
val it = List(1,1,1,2,2,3,3).iterator.buffered
it.takeWhile(_ == it.head).toList
I get List(1,1). Why is this the case? Isn't head evaluated when takeWhile is called, so that the result should be the same?

Because the iterator is mutable, the value of it.head depends on when it is evaluated.
Inspecting the implementation of takeWhile reveals that it consumes the next element from the iterator before applying the predicate.
So, on the third iteration, it.head evaluated from within the predicate is 2, because all three 1s have already been consumed at that point.
This is an illustration of why you should prefer immutability. It rules out a whole class of non-obvious behaviour like this.
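For contrast, a minimal sketch of the same comparison against the immutable list itself rather than an iterator over it (my own illustration):
val xs = List(1, 1, 1, 2, 2, 3, 3)
xs.takeWhile(_ == xs.head) // List(1, 1, 1) - nothing is consumed, so it does not matter when head is evaluated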

Adding to Ben James's answer above. Below is the hasNext method of the iterator returned by takeWhile (credit: Ben):
def hasNext = hdDefined || tail.hasNext && {
  hd = tail.next() // line 2
  if (p(hd)) hdDefined = true
  else tail = Iterator.empty
  hdDefined
}
In the third iteration, after line 2, hd is 1 and the remaining iterator is (2,2,3,3). When p(hd) is called, the predicate reads it.head, which is now 2, so the comparison fails and the iterator stops there.
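One way to watch the head move is to print it from inside the predicate. This small sketch (my own, same data as above) reproduces the behaviour described in the question:
val it = List(1, 1, 1, 2, 2, 3, 3).iterator.buffered
it.takeWhile { x => println(s"testing $x against head ${it.head}"); x == it.head }.toList
// testing 1 against head 1
// testing 1 against head 1
// testing 1 against head 2   <- predicate fails here
// result: List(1, 1)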

Related

How to collect paginated results in Scala

I am trying to collect paginated results by implementing the following logic in Scala, and failed pathetically:
def python_version():
    cursor
    books, cursor = fetch_results()
    while (cursor != null) {
        new_books = fetch_results(cursor)
        books = books + new_books
    }
    return books

def fetch_results(cursor=None):
    # do some fetchings...
    return books, next_cursor
Here is an alternative solution using a recursive function, which avoids mutable values:
def fetchResults(c: Option[Cursor]=None): (List[Book], Option[Cursor]) = ...
import scala.annotation.tailrec // for @tailrec

def fetchAllResults(): List[Book] = {
  @tailrec
  def loop(cursor: Option[Cursor], res: List[Book]): List[Book] = {
    val (books, newCursor) = fetchResults(cursor)
    val newBooks = res ::: books
    newCursor match {
      case Some(_) =>
        loop(newCursor, newBooks)
      case None =>
        newBooks
    }
  }
  loop(None, Nil)
}
This is a fairly standard pattern for recursive functions in Scala where the actual recursion is done in an internal function. The result of the previous iteration is passed down to the next iteration and then returned from the function. This means that loop is a tail-recursive function and can be optimised by the compiler into a while loop. (The @tailrec annotation tells the compiler to report an error if the function is not actually tail-recursive.)
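In case it helps to see the whole thing run, here is one hypothetical way the elided fetchResults could be stubbed out; the Book and Cursor types and the in-memory pages are my own stand-ins, not part of the question's API:
case class Book(title: String)
case class Cursor(page: Int)

// Three hypothetical pages of results held in memory.
val pages: Vector[List[Book]] =
  Vector(List(Book("a"), Book("b")), List(Book("c")), List(Book("d")))

def fetchResults(c: Option[Cursor] = None): (List[Book], Option[Cursor]) = {
  val i = c.map(_.page).getOrElse(0)
  val next = if (i + 1 < pages.length) Some(Cursor(i + 1)) else None
  (pages(i), next)
}
With this in place of the elided fetchResults above, fetchAllResults() returns List(Book(a), Book(b), Book(c), Book(d)).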
Something like this, perhaps:
Iterator.iterate(fetch_results()) {
  case (_, None) => (Nil, None)
  case (books, cursor) => fetch_results(cursor)
}.map(_._1).takeWhile(_.nonEmpty).flatten.toList
.iterate takes its first parameter to be the initial element of the iterator; the second one is a function that, given the previous element, computes the next one.
So this creates an iterator of tuples (Seq[Book], Cursor), starting with the initial return of fetch_results, and then keeps fetching more results and accumulating them until the nextCursor is None (I used None instead of null, because nulls are evil and shouldn't really be used in a normal language, like Scala, that provides enough facilities to avoid them).
Then .map(_._1) discards the cursors (we don't need them any more), so we now have an iterator of pages; .takeWhile truncates the iterator at the first page that is empty; .flatten concatenates all the inner Seqs together; and finally toList materializes all the elements and returns the entire list of books.
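As an aside, here is a tiny, self-contained illustration of Iterator.iterate itself, independent of the paging API:
// Each element is computed from the previous one; nothing runs until the iterator is consumed.
Iterator.iterate(1)(_ * 2).take(5).toList // List(1, 2, 4, 8, 16)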

Scala: what is the interest in using Iterators?

I have used Iterators after having worked with Regexes in Scala, but I don't really understand their interest.
I know that an iterator has state and that calling next() on it outputs a different result every time, but I don't see anything I can do with it that is not possible with an Iterable.
And it doesn't seem to work like Akka Streams (for example), since the following example directly prints all the numbers (without waiting one second between them, as I would expect):
lazy val a = Iterator({Thread.sleep(1000); 1}, {Thread.sleep(1000); 2}, {Thread.sleep(1000); 3})
while(a.hasNext){ println(a.next()) }
So what is the purpose of using Iterators?
Perhaps, the most useful property of iterators is that they are lazy.
Consider something like this:
(1 to 10000)
.map { x => x * x }
.map { _.toString }
.find { _ == "4" }
This snippet will square 10000 numbers, then generate 10000 strings, and only then search through them to return the second one.
This on the other hand:
(1 to 10000)
.iterator
.map { x => x * x }
.map { _.toString }
.find { _ == "4" }
... only computes two squares, and generates two strings.
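A quick way to convince yourself is to count how many squares actually get computed (the counter is mine, not part of the original snippet):
var squaresComputed = 0
(1 to 10000)
  .iterator
  .map { x => squaresComputed += 1; x * x }
  .map { _.toString }
  .find { _ == "4" } // Some("4")
println(squaresComputed) // 2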
Iterators are also often useful when you need to wrap around some poorly designed (java?) objects in order to be able to handle them in functional style:
val rs: ResultSet = jdbcQuery.executeQuery()
new Iterator[ResultSet] {
  def hasNext = rs.next()
  def next() = rs
}.map { rs =>
  fetchData(rs)
}
Streams are similar to iterators - they are also lazy, and also useful for wrapping:
Stream.continually(rs).takeWhile { _.next }.map(fetchData)
The main difference though is that streams remember the data that gets materialized, so that you can traverse them more than once. This is convenient, but may be costly if the original amount of data is very large, especially, if it gets filtered down to much smaller size:
Source
.fromFile("huge_file.txt")
.getLines
.filter(_ == "")
.toList
This only uses, roughly (ignoring buffering, object overhead, and other implementation-specific details), the amount of memory necessary to keep one line in memory, plus however many empty lines there are in the file.
This on the other hand:
val reader = new BufferedReader(new FileReader("huge_file.txt"))
Stream
  .continually(reader.readLine)
  .takeWhile(_ != null)
  .filter(_ == "")
  .toList
... will end up with the entire content of the huge_file.txt in memory.
Finally, if I understand the intent of your example correctly, here is how you could do it with iterators:
val iterator = Seq(1,2,3).iterator.map { n => Thread.sleep(1000); n }
iterator.foreach(println)
// Or while(iterator.hasNext) { println(iterator.next) } as you had it.
There is a good explanation of what an iterator is at http://www.scala-lang.org/docu/files/collections-api/collections_43.html:
An iterator is not a collection, but rather a way to access the
elements of a collection one by one. The two basic operations on an
iterator it are next and hasNext. A call to it.next() will return the
next element of the iterator and advance the state of the iterator.
Calling next again on the same iterator will then yield the element
one beyond the one returned previously. If there are no more elements
to return, a call to next will throw a NoSuchElementException.
First of all you should understand what is wrong with your example:
lazy val a = Iterator({Thread.sleep(1); 1}, {Thread.sleep(1); 2}, {Thread.sleep(2); 3})
while (a.hasNext) { println(a.next()) }
If you look at the apply method of Iterator, you'll see that its parameters are not by-name, so all the Thread.sleep calls are evaluated at the moment apply is called. Also, Thread.sleep takes the time to sleep in milliseconds, so if you want to sleep your thread for one second you should pass Thread.sleep(1000).
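A small demonstration of that eagerness (my own example): the arguments to Iterator(...) are evaluated as soon as apply is called, not when the elements are requested.
val it = Iterator({ println("evaluating 1"); 1 }, { println("evaluating 2"); 2 })
// Both "evaluating 1" and "evaluating 2" are printed immediately,
// before next() has ever been called.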
The companion object has additional methods which let you do the following:
val a = Iterator.iterate(1)(x => {Thread.sleep(1000); x+1})
Iterators are very useful when you need to work with large amounts of data. You can also implement your own:
val it = new Iterator[Int] {
  var i = -1
  def hasNext = true
  def next(): Int = { i += 1; i }
}
I don't see anything I can do with it and that is not possible with an Iterable
In fact, what most collections can do can also be done with Array, but we don't do that because it's much less convenient.
The same reasoning applies to iterators: if you want to model mutable state, then an iterator makes more sense.
For example, Random is implemented in a way that resembles an iterator, because its use case fits an iterator more naturally than an iterable.
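Here is a small sketch (my own) of the kind of use that comparison has in mind: an endless source of random values modelled as an iterator:
import scala.util.Random

// An endless, lazily evaluated supply of die rolls.
val rolls: Iterator[Int] = Iterator.continually(Random.nextInt(6) + 1)
rolls.take(3).toList // e.g. List(4, 1, 6)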

Scala - weird behaviour with Iterator.toList

I am new to Scala and I have a function as follows:
def selectSame(messages: BufferedIterator[Int]) = {
  val head = messages.head
  messages.takeWhile(_ == head)
}
This selects from a buffered iterator only the elements matching the head. I am subsequently using this code:
val messageStream = List(1,1,1,2,2,3,3).iterator.buffered
if (!messageStream.isEmpty) {
  var lastTimeStamp = messageStream.head // .timestamp on the real message type
  while (!messageStream.isEmpty) {
    val messages = selectSame(messageStream).toList
    println(messages)
  }
}
Upon first execution I get List(1,1,1) as expected, but then I only get List(2), as if I had lost one element down the line... Probably I am doing something wrong with the iterators/lists, but I am a bit lost here.
Scaladoc of Iterator says about takeWhile:
Reuse: After calling this method, one should discard the iterator it
was called on, and use only the iterator that was returned. Using the
old iterator is undefined, subject to change, and may result in
changes to the new iterator as well.
So that's why. This basically means you cannot directly do what you want with Iterators and takeWhile. IMHO, easiest would be to quickly write your own recursive function to do that.
If you want to stick with Iterators, you could use the duplicate method on the Iterator to generate a copy on which you'd call dropWhile.
Even better: Use span repeatedly:
def selectSame(messages: BufferedIterator[Int]) = {
  val head = messages.head
  messages.span(_ == head)
}

def iter(msgStream: BufferedIterator[Int]): Unit = if (!msgStream.isEmpty) {
  val (msgs, rest) = selectSame(msgStream)
  println(msgs.toList)
  iter(rest.buffered)
}

val messageStream = List(1,1,1,2,2,3,3).iterator.buffered
if (!messageStream.isEmpty) {
  var lastTimeStamp = messageStream.head // .timestamp on the real message type
  iter(messageStream)
}
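For what it's worth, running the snippet above on the sample data should print each group in turn (assuming standard iterator semantics):
iter(List(1, 1, 1, 2, 2, 3, 3).iterator.buffered)
// List(1, 1, 1)
// List(2, 2)
// List(3, 3)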

Functional style early exit from depth-first recursion

I have a question about writing recursive algorithms in a functional style. I will use Scala for my example here, but the question applies to any functional language.
I am doing a depth-first enumeration of an n-ary tree where each node has a label and a variable number of children. Here is a simple implementation that prints the labels of the leaf nodes.
case class Node[T](label:T, ns:Node[T]*)
def dfs[T](r:Node[T]):Seq[T] = {
if (r.ns.isEmpty) Seq(r.label) else for (n<-r.ns;c<-dfs(n)) yield c
}
val r = Node('a, Node('b, Node('d), Node('e, Node('f))), Node('c))
dfs(r) // returns Seq[Symbol] = ArrayBuffer('d, 'f, 'c)
Now say that sometimes I want to be able to give up on parsing oversize trees by throwing an exception. Is this possible in a functional language? Specifically is this possible without using mutable state? That seems to depend on what you mean by "oversize". Here is a purely functional version of the algorithm that throws an exception when it tries to handle a tree with a depth of 3 or greater.
def dfs[T](r:Node[T], d:Int = 0):Seq[T] = {
require(d < 3)
if (r.ns.isEmpty) Seq(r.label) else for (n<-r.ns;c<-dfs(n, d+1)) yield c
}
But what if a tree is oversized because it is too broad rather than too deep? Specifically what if I want to throw an exception the n-th time the dfs() function is called recursively regardless of how deep the recursion goes? The only way I can see how to do this is to have a mutable counter that is incremented with each call. I can't see how to do it without a mutable variable.
I'm new to functional programming and have been working under the assumption that anything you can do with mutable state can be done without, but I don't see the answer here. The only thing I can think to do is write a version of dfs() that returns a view over all the nodes in the tree in depth-first order.
dfs[T](r:Node[T]):TraversableView[T, Traversable[_]] = ...
Then I could impose my limit by saying dfs(r).take(n), but I don't see how to write this function. In Python I'd just create a generator by yielding nodes as I visited them, but I don't see how to achieve the same effect in Scala. (Scala's equivalent to a Python-style yield statement appears to be a visitor function passed in as a parameter, but I can't figure out how to write one of these that will generate a sequence view.)
EDIT: Getting close to the answer.
Here is a function that returns a Stream of nodes in depth-first order.
def dfs[T](r: Node[T]): Stream[Node[T]] = {
  (r #:: Stream.empty /: r.ns)(_ ++ dfs(_))
}
That is almost it. The only problem is that Stream memoizes all results, which is a waste of memory. I want a traversable view. The following is the idea, but does not compile.
def dfs[T](r: Node[T]): TraversableView[Node[T], Traversable[Node[T]]] = {
  (Traversable(r).view /: r.ns)(_ ++ dfs(_))
}
It gives a "found TraversableView[Node[T], Traversable[Node[T]]], required TraversableView[Node[T], Traversable[_]] error for the ++ operator. If I change the return type to TraversableView[Node[T], Traversable[_]], I get the same problem with the "found" and "required" clauses switched. So there's some magic type variance incantation I haven't lit upon yet, but this is close.
It can be done: you just have to write some code to actually iterate through the children in the way you want (as opposed to relying on for).
More explicitly, you'll have to write code to iterate through a list of children and check if the "depth" crossed your threshold. Here's some Haskell code (I'm really sorry, I'm not fluent in Scala, but this can probably be easily transliterated):
http://ideone.com/O5gvhM
In this code, I've basically replaced the for loop with an explicit recursive version. This allows me to stop the recursion if the number of visited nodes is already too large (i.e., limit is not positive). When I recurse to examine the next child, I subtract the number of nodes the dfs of the previous child visited and set this as the limit for the next child.
Functional languages are fun, but they're a huge leap from imperative programming. It really makes you pay attention to the concept of state, because all of it is excruciatingly explicit in the arguments when you go functional.
EDIT: Explaining this a bit more.
I ended up converting from "print just the leaf nodes" (which was the original algorithm from the OP) to "print all nodes". This enabled me to have access to the number of nodes the subcall visited through the length of the resulting list. If you want to stick to the leaf nodes, you'll have to carry around how many nodes you have already visited:
http://ideone.com/cIQrna
EDIT again: To clear up this answer, I'm putting all the Haskell code on ideone, and I've transliterated my Haskell code to Scala, so this can stay here as the definitive answer to the question:
case class Node[T](label: T, children: Seq[Node[T]])
case class TraversalResult[T](num_visited: Int, labels: Seq[T])

def dfs[T](node: Node[T], limit: Int): TraversalResult[T] =
  limit match {
    case 0 => TraversalResult(0, Nil)
    case limit =>
      node.children match {
        case Nil => TraversalResult(1, List(node.label))
        case children => {
          val result = traverse(node.children, limit - 1)
          TraversalResult(result.num_visited + 1, result.labels)
        }
      }
  }

def traverse[T](children: Seq[Node[T]], limit: Int): TraversalResult[T] =
  limit match {
    case 0 => TraversalResult(0, Nil)
    case limit =>
      children match {
        case Nil => TraversalResult(0, Nil)
        case first :: rest => {
          val trav_first = dfs(first, limit)
          val trav_rest =
            traverse(rest, limit - trav_first.num_visited)
          TraversalResult(
            trav_first.num_visited + trav_rest.num_visited,
            trav_first.labels ++ trav_rest.labels
          )
        }
      }
  }
val n = Node(0, List(
  Node(1, List(Node(2, Nil), Node(3, Nil))),
  Node(4, List(Node(5, List(Node(6, Nil))))),
  Node(7, Nil)
))

for (i <- 1 to 8)
  println(dfs(n, i))
Output:
TraversalResult(1,List())
TraversalResult(2,List())
TraversalResult(3,List(2))
TraversalResult(4,List(2, 3))
TraversalResult(5,List(2, 3))
TraversalResult(6,List(2, 3))
TraversalResult(7,List(2, 3, 6))
TraversalResult(8,List(2, 3, 6, 7))
P.S. this is my first attempt at Scala, so the above probably contains some horrid non-idiomatic code. I'm sorry.
You can convert breadth into depth by passing along an index or taking the tail:
def suml(xs: List[Int], total: Int = 0): Int = xs match {
  case Nil => total
  case x :: rest => suml(rest, total + x)
}

def suma(xs: Array[Int], from: Int = 0, total: Int = 0): Int = {
  if (from >= xs.length) total
  else suma(xs, from + 1, total + xs(from))
}
In the latter case, you already have something to limit your breadth if you want; in the former, just add a width or somesuch.
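For instance, a hypothetical variant of suml with a width limit, in the spirit of "just add a width" (the name and cutoff are mine):
// Stops once limit elements have been consumed.
def sumlLimited(xs: List[Int], limit: Int, total: Int = 0): Int = xs match {
  case _ if limit <= 0 => total
  case Nil => total
  case x :: rest => sumlLimited(rest, limit - 1, total + x)
}

sumlLimited(List(1, 2, 3, 4, 5), limit = 3) // 6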
The following implements a lazy depth-first search over nodes in a tree.
import collection.TraversableView
case class Node[T](label: T, ns: Node[T]*)
def dfs[T](r: Node[T]): TraversableView[Node[T], Traversable[Node[T]]] =
  (Traversable[Node[T]](r).view /: r.ns) {
    (a, b) => (a ++ dfs(b)).asInstanceOf[TraversableView[Node[T], Traversable[Node[T]]]]
  }
This prints the labels of all the nodes in depth-first order.
val r = Node('a, Node('b, Node('d), Node('e, Node('f))), Node('c))
dfs(r).map(_.label).force
// returns Traversable[Symbol] = List('a, 'b, 'd, 'e, 'f, 'c)
This does the same thing, quitting after 3 nodes have been visited.
dfs(r).take(3).map(_.label).force
// returns Traversable[Symbol] = List('a, 'b, 'd)
If you want only leaf nodes you can use filter, and so forth.
Note that the fold clause of the dfs function requires an explicit asInstanceOf cast. See "Type variance error in Scala when doing a foldLeft over Traversable views" for a discussion of the Scala typing issues that necessitate this.

Is there a way to handle the last case differently in a Scala for loop?

For example, suppose I have
for (line <- myData) {
  println("}, {")
}
Is there a way to get the last line to print
println("}")
Can you refactor your code to take advantage of built-in mkString?
scala> List(1, 2, 3).mkString("{", "}, {", "}")
res1: String = {1}, {2}, {3}
Before going any further, I'd recommend you avoid println in a for-comprehension. It can sometimes be useful for tracking down a bug that occurs in the middle of a collection, but otherwise leads to code that's harder to refactor and test.
More generally, life usually becomes easier if you can restrict where any sort of side-effect occurs. So instead of:
for (line <- myData) {
  println("}, {")
}
You can write:
val lines = for (line <- myData) yield "}, {"
println(lines mkString "\n")
I'm also going to take a guess here that you wanted the content of each line in the output!
val lines = for (line <- myData) yield (line + "}, {")
println(lines mkString "\n")
Though you'd be better off still if you just used mkString directly - that's what it's for!
val lines = myData.mkString("{", "\n}, {", "}")
println(lines)
Note how we're first producing a String, then printing it in a single operation. This approach can easily be split into separate methods and used to implement toString on your class, or to inspect the generated String in tests.
I fully agree with what has been said before about using mkString, and about distinguishing the first iteration rather than the last one. Should you still need to distinguish the last one, Scala collections have an init method, which returns all elements but the last.
So you can do
for(x <- coll.init) workOnNonLast(x)
workOnLast(coll.last)
(init and last are sort of the opposite of head and tail, which are the first element and all but the first.) Note however that, depending on the structure, they may be costly. On Vector, all of them are fast. On List, while head and tail are basically free, init and last are both linear in the length of the list. headOption and lastOption may help you when the collection may be empty, replacing workOnLast by
for (x <- coll.lastOption) workOnLast(x)
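Tying that back to the question, a quick sketch of the init/last approach (my own adaptation, assuming a non-empty collection):
val lines = List("line1", "line2", "line3")
lines.init.foreach(line => println(line + "}, {"))
println(lines.last + "}")
// line1}, {
// line2}, {
// line3}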
You may take the addString function of the TraversableOnce trait as an example.
def addString(b: StringBuilder, start: String, sep: String, end: String): StringBuilder = {
  var first = true
  b append start
  for (x <- self) {
    if (first) {
      b append x
      first = false
    } else {
      b append sep
      b append x
    }
  }
  b append end
  b
}
In your case, the separator is }, { and the end is }
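A quick usage sketch: applied to a small list, addString produces the same shape as the mkString call shown earlier.
List(1, 2, 3).addString(new StringBuilder, "{", "}, {", "}").toString
// {1}, {2}, {3}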
If you don't want to use built-in mkString function, you can make something like
for (line <- lines)
if (line == lines.last) println("last")
else println(line)
UPDATE: As didierd mentioned in the comments, this solution is wrong because the last value can occur several times; he provides a better solution in his answer.
It is fine for Vectors, because the last function takes "effectively constant time" for them; for Lists it takes linear time, so you can use pattern matching:
@tailrec
def printLines[A](l: List[A]) {
  l match {
    case Nil =>
    case x :: Nil => println("last")
    case x :: xs => println(x); printLines(xs)
  }
}
Other answers rightfully point to mkString, and for a normal amount of data I would also use that.
However, mkString builds (accumulates) the end result in memory through a StringBuilder. This is not always desirable, depending on the amount of data we have.
In this case, if all we want is to print, we don't need to build the big result first (and maybe we even want to avoid doing so).
Consider the implementation of this helper function:
def forEachIsLast[A](iterator: Iterator[A])(operation: (A, Boolean) => Unit): Unit = {
  while (iterator.hasNext) {
    val element = iterator.next()
    val isLast = !iterator.hasNext // if there is no "next", this is the last one
    operation(element, isLast)
  }
}
It iterates over all elements and invokes operation passing each element in turn, with a boolean value. The value is true if the element passed is the last one.
In your case it could be used like this:
forEachIsLast(myData) { (line, isLast) =>
  if (isLast)
    println("}")
  else
    println("}, {")
}
We have the following advantages here:
It operates on each element, one by one, without necessarily accumulating the result in memory (unless you want to).
Because it does not need to load the whole collection into memory to check its size, it's enough to ask the Iterator if it's exhausted or not. You could read data from a big file, or from the network, etc.