Scala Iterator overwritten [duplicate] - scala

I am trying to learn Scala through a sample project that I have. In it there is a variable record defined as:
val records: Iterator[Product2[K, V]]
It is passed around in different methods. I explore its contents using :
records.foreach(println)
However, when I try to print the contents using this iterator again, even in successive lines of code, I get no results. It seems as if the iterator is consumed. How do prevent it from happening and be able to explore the contents of the iterator without rendering it useless for the rest of the code?

An Iterator extends TraversableOnce and hence can only be iterated over once, as it represents a mutating pointer into an Iterable. If you want something that can be traversable repeatedly and without affecting multiple, parallel accesses, you need to use the Iterable instead, which extends Traversable and on foreach creates a new Iterator for that specific context

Related

Why `Source.fromFile(...).getLines()` is empty after I've iterated over it?

It was quite a surprise for me that (line <- lines) is so devastating! It completely unwinds lines iterator. So running the following snippet will make size = 0 :
val lines = Source.fromFile(args(0)).getLines()
var cnt = 0
for (line <- lines) {
cnt = readLines(line, cnt)
}
val size = lines.size
Is it a normal Scala practice to have well-hidden side-effects like this?
Source.getLines() returns an iterator. For every iterator, if you invoke a bulk operation such as foreach above, or map, take, toList, etc., then the iterator is no longer in a usable state.
That is the contract for Iterators and, more generally, classes that inherit TraversableOnce.
It is of particular importance to note that, unless stated otherwise, one should never use an iterator after calling a method on it. The two most important exceptions are also the sole abstract methods: next and hasNext.
This is not the case for classes that inherit Traversable -- for those you can invoke the bulk traversal operations as many times as you want.
Source.getLines() returns an Iterator, and walking through an Iterator will mutate it. This is made quite clear in the Scala documentation
An iterator is mutable: most operations on it change its state. While it is often used to iterate through the elements of a collection, it can also be used without being backed by any collection (see constructors on the companion object).
It is of particular importance to note that, unless stated otherwise, one should never use an iterator after calling a method on it. The two most important exceptions are also the sole abstract methods: next and hasNext.
Using for notation is just syntactic sugar for calling map, flatMap and foreach methods on the Iterator, which again have quite clear documentation stating not to use the iterator:
Reuse: After calling this method, one should discard the iterator it was called on, and use only the iterator that was returned. Using the old iterator is undefined, subject to change, and may result in changes to the new iterator as well.
Scala generally aims to be a 'pragmatic' language - mutation and side effects are allowed for performance and inter-operability reasons, although not encouraged. To call it 'well-hidden' is, however, something of a stretch.

Scala Iterable Memory Leaks

I recently started playing with Scala and ran across the following. Below are 4 different ways to iterate through the lines of a file, do some stuff, and write the result to another file. Some of these methods work as I would think (though using a lot of memory to do so) and some eat memory to no end.
The idea was to wrap Scala's getLines Iterator as an Iterable. I don't care if it reads the file multiple times - that's what I expect it to do.
Here's my repro code:
class FileIterable(file: java.io.File) extends Iterable[String] {
override def iterator = io.Source.fromFile(file).getLines
}
// Iterator
// Option 1: Direct iterator - holds at 100MB
def lines = io.Source.fromFile(file).getLines
// Option 2: Get iterator via method - holds at 100MB
def lines = new FileIterable(file).iterator
// Iterable
// Option 3: TraversableOnce wrapper - holds at 2GB
def lines = io.Source.fromFile(file).getLines.toIterable
// Option 4: Iterable wrapper - leaks like a sieve
def lines = new FileIterable(file)
def values = lines
.drop(1)
//.map(l => l.split("\t")).map(l => l.reduceLeft(_ + "|" + _))
//.filter(l => l.startsWith("*"))
val writer = new java.io.PrintWriter(new File("out.tsv"))
values.foreach(v => writer.println(v))
writer.close()
The file it's reading is ~10GB with 1MB lines.
The first two options iterate the file using a constant amount of memory (~100MB). This is what I would expect. The downside here is that an iterator can only be used once and it's using Scala's call-by-name convention as a psuedo-iterable. (For reference, the equivalent c# code uses ~14MB)
The third method calls toIterable defined in TraverableOnce. This one works, but it uses about 2GB to do the same work. No idea where the memory is going because it can't cache the entire Iterable.
The fourth is the most alarming - it immediately uses all available memory and throws an OOM exception. Even weirder is that it does this for all of the operations I've tested: drop, map, and filter. Looking at the implementations, none of them seem to maintain much state (though the drop looks a little suspect - why does it not just count the items?). If I do no operations, it works fine.
My guess is that somewhere it's maintaining references to each of the lines read, though I can't imagine how. I've seen the same memory usage when passing Iterables around in Scala. For example if I take case 3 (.toIterable) and pass that to a method that writes an Iterable[String] to a file, I see the same explosion.
Any ideas?
Note how the ScalaDoc of Iterable says:
Implementations of this trait need to provide a concrete method with
signature:
def iterator: Iterator[A]
They also need to provide a method newBuilder which creates a builder
for collections of the same kind.
Since you don't provide an implementation for newBuilder, you get the default implementation, which uses a ListBuffer and thus tries to fit everything into memory.
You might want to implement Iterable.drop as
def drop(n: Int) = iterator.drop(n).toIterable
but that would break with the representation invariance of the collection library (i.e. iterator.toIterable returns a Stream, while you want List.drop to return a List etc - thus the need for the Builder concept).

Scala immutable map, when to go mutable?

My present use case is pretty trivial, either mutable or immutable Map will do the trick.
Have a method that takes an immutable Map, which then calls a 3rd party API method that takes an immutable Map as well
def doFoo(foo: String = "default", params: Map[String, Any] = Map()) {
val newMap =
if(someCondition) params + ("foo" -> foo) else params
api.doSomething(newMap)
}
The Map in question will generally be quite small, at most there might be an embedded List of case class instances, a few thousand entries max. So, again, assume little impact in going immutable in this case (i.e. having essentially 2 instances of the Map via the newMap val copy).
Still, it nags me a bit, copying the map just to get a new map with a few k->v entries tacked onto it.
I could go mutable and params.put("bar", bar), etc. for the entries I want to tack on, and then params.toMap to convert to immutable for the api call, that is an option. but then I have to import and pass around mutable maps, which is a bit of hassle compared to going with Scala's default immutable Map.
So, what are the general guidelines for when it is justified/good practice to use mutable Map over immutable Maps?
Thanks
EDIT
so, it appears that an add operation on an immutable map takes near constant time, confirming #dhg's and #Nicolas's assertion that a full copy is not made, which solves the problem for the concrete case presented.
Depending on the immutable Map implementation, adding a few entries may not actually copy the entire original Map. This is one of the advantages to the immutable data structure approach: Scala will try to get away with copying as little as possible.
This kind of behavior is easiest to see with a List. If I have a val a = List(1,2,3), then that list is stored in memory. However, if I prepend an additional element like val b = 0 :: a, I do get a new 4-element List back, but Scala did not copy the orignal list a. Instead, we just created one new link, called it b, and gave it a pointer to the existing List a.
You can envision strategies like this for other kinds of collections as well. For example, if I add one element to a Map, the collection could simply wrap the existing map, falling back to it when needed, all while providing an API as if it were a single Map.
Using a mutable object is not bad in itself, it becomes bad in a functional programming environment, where you try to avoid side-effects by keeping functions pure and objects immutable.
However, if you create a mutable object inside a function and modify this object, the function is still pure if you don't release a reference to this object outside the function. It is acceptable to have code like:
def buildVector( x: Double, y: Double, z: Double ): Vector[Double] = {
val ary = Array.ofDim[Double]( 3 )
ary( 0 ) = x
ary( 1 ) = y
ary( 2 ) = z
ary.toVector
}
Now, I think this approach is useful/recommended in two cases: (1) Performance, if creating and modifying an immutable object is a bottleneck of your whole application; (2) Code readability, because sometimes it's easier to modify a complex object in place (rather than resorting to lenses, zippers, etc.)
In addition to dhg's answer, you can take a look to the performance of the scala collections. If an add/remove operation doesn't take a linear time, it must do something else than just simply copying the entire structure. (Note that the converse is not true: it's not beacuase it takes linear time that your copying the whole structure)
I like to use collections.maps as the declared parameter types (input or return values) rather than mutable or immutable maps. The Collections maps are immutable interfaces that work for both types of implementations. A consumer method using a map really doesn't need to know about a map implementation or how it was constructed. (It's really none of its business anyway).
If you go with the approach of hiding a map's particular construction (be it mutable or immutable) from the consumers who use it then you're still getting an essentially immutable map downstream. And by using collection.Map as an immutable interface you completely remove all the ".toMap" inefficiency that you would have with consumers written to use immutable.Map typed objects. Having to convert a completely constructed map into another one simply to comply to an interface not supported by the first one really is absolutely unnecessary overhead when you think about it.
I suspect in a few years from now we'll look back at the three separate sets of interfaces (mutable maps, immutable maps, and collections maps) and realize that 99% of the time only 2 are really needed (mutable and collections) and that using the (unfortunately) default immutable map interface really adds a lot of unnecessary overhead for the "Scalable Language".

Create an immutable list from a java.lang.Iterator

I'm using a library (JXPath) to query a graph of beans in order to extract matching elements. However, JXPath returns groups of matching elements as an instance of java.lang.Iterator and I'd rather like to convert it into an immutable scala list. Is there any simpler way of doing than iterating over the iterator and creating a new immutable list at each iteration step ?
You might want to rethink the need for a List, although it feels very familiar when coming from Java, and List is the default implementation of an immutable Seq, it often isn't the best choice of collection.
The operations that list is optimal for are those already available via an iterator (basically taking consecutive head elements and prepending elements). If an iterator doesn't already give you what you need, then I can pretty much guarantee that a List won't be your best choice - a vector would be more appropriate.
Having got that out the way... The recommended technique to convert between Java and Scala collections (since Scala 2.8.1) is via scala.collection.JavaConverters. This gives you more control than JavaConversions and avoids some possible implicit conflicts.
You won't have a direct implicit conversion this way. Instead, you get asScala and asJava methods pimped onto collections, allowing you to perform the conversions explicitly.
To convert a Java iterator to a Scala iterator:
javaIterator.asScala
To convert a Java iterator to a Scala List (via the scala iterator):
javaIterator.asScala.toList
You may also want to consider converting toSeq instead of toList. In the case of iterators, this'll return a Stream - allowing you to retain the lazy behaviour of iterators within the richer Seq interface.
EDIT:
There's no toVector method, but (as Daniel pointed out) there's a toIndexedSeq method that will return a Vector as the default IndexedSeq subclass (just as List is the default Seq).
javaIterator.asScala.toIndexedSeq
EDIT: You should probably look at Kevin Wright's answer, which provides a better solution available since Scala 2.8.1, with less implicit magic.
You can import the implicit conversions from scala.collection.JavaConversions and then create a new Scala collection seamlessly, e.g. like this:
import collection.JavaConversions._
println(List() ++ javaIterator)
Your Java iterator is converted to a Scala iterator by JavaConversions.asScalaIterator. A Scala iterator with elements of type A implements TraversableOnce[A], which is the argument type needed to concatenate collections with ++.
If you need another collection type, just change List() to whatever you need (e.g., IndexedSeq() or collection.mutable.Seq(), etc.).

Difference between MutableList and ListBuffer

What is the difference between Scala's MutableList and ListBuffer classes in scala.collection.mutable? When would you use one vs the other?
My use case is having a linear sequence where I can efficiently remove the first element, prepend, and append. What's the best structure for this?
A little explanation on how they work.
ListBuffer uses internally Nil and :: to build an immutable List and allows constant-time removal of the first and last elements. To do so, it keeps a pointer on the first and last element of the list, and is actually allowed to change the head and tail of the (otherwise immutable) :: class (nice trick allowed by the private[scala] var members of ::). Its toList method returns the normal immutable List in constant time as well, as it can directly return the structure maintained internally. It is also the default builder for immutable Lists (and thus can indeed be reasonably expected to have constant-time append). If you call toList and then again append an element to the buffer, it takes linear time with respect to the current number of elements in the buffer to recreate a new structure, as it must not mutate the exported list any more.
MutableList works internally with LinkedList instead, an (openly, not like ::) mutable linked list implementation which knows of its element and successor (like ::). MutableList also keeps pointers to the first and last element, but toList returns in linear time, as the resulting List is constructed from the LinkedList. Thus, it doesn't need to reinitialize the buffer after a List has been exported.
Given your requirements, I'd say ListBuffer and MutableList are equivalent. If you want to export their internal list at some point, then ask yourself where you want the overhead: when you export the list, and then no overhead if you go on mutating buffer (then go for MutableList), or only if you mutable the buffer again, and none at export time (then go for ListBuffer).
My guess is that in the 2.8 collection overhaul, MutableList predated ListBuffer and the whole Builder system. Actually, MutableList is predominantly useful from within the collection.mutable package: it has a private[mutable] def toLinkedList method which returns in constant time, and can thus efficiently be used as a delegated builder for all structures that maintain a LinkedList internally.
So I'd also recommend ListBuffer, as it may also get attention and optimization in the future than “purely mutable” structures like MutableList and LinkedList.
This gives you an overview of the performance characteristics: http://www.scala-lang.org/docu/files/collections-api/collections.html ; interestingly, MutableList and ListBuffer do not differ there. The documentation of MutableList says it is used internally as base class for Stack and Queue, so maybe ListBuffer is more the official class from the user perspective?
You want a list (why a list?) that is growable and shrinkable, and you want constant append and prepend. Well, Buffer, a trait, has constant append and prepend, with most other operations linear. I'm guessing that ListBuffer, a class that implements Buffer, has constant time removal of the first element.
So, my own recommendation is for ListBuffer.
First, lets go over some of the relevant types in Scala
List - An Immutable collection. A Recursive implementation i.e . i.e An instance of list has two primary elements the head and the tail, where the tail references another List.
List[T]
head: T
tail: List[T] //recursive
LinkedList - A mutable collection defined as a series of linked nodes, where each node contains a value and a pointer to the next node.
Node[T]
value: T
next: Node[T] //sequential
LinkedList[T]
first: Node[T]
List is a functional data structure (immutability) compared to LinkedList which is more standard in imperative languages.
Now, lets look at
ListBuffer - A mutable buffer implementation backed by a List.
MutableList - An implementation based on LinkedList ( Would have been more self explanatory if it had been named LinkedListBuffer instead )
They both offer similar complexity bounds on most operations.
However, if you request a List from a MutableList, then it has to convert the existing linear representation into the recursive representation which takes O(n) which is what #Jean-Philippe Pellet points out. But, if you request a Seq from MutableList the complexity is O(1).
So, IMO the choice narrows down to the specifics of your code and your preference. Though, I suspect there is a lot more List and ListBuffer out there.
Note that ListBuffer is final/sealed, while you can extend MutableList.
Depending on your application, extensibility may be useful.