Scala Iterable Memory Leaks

I recently started playing with Scala and ran across the following. Below are 4 different ways to iterate through the lines of a file, do some stuff, and write the result to another file. Some of these methods work as I would think (though using a lot of memory to do so) and some eat memory to no end.
The idea was to wrap Scala's getLines Iterator as an Iterable. I don't care if it reads the file multiple times - that's what I expect it to do.
Here's my repro code:
class FileIterable(file: java.io.File) extends Iterable[String] {
  override def iterator = io.Source.fromFile(file).getLines
}
// Iterator
// Option 1: Direct iterator - holds at 100MB
def lines = io.Source.fromFile(file).getLines
// Option 2: Get iterator via method - holds at 100MB
def lines = new FileIterable(file).iterator
// Iterable
// Option 3: TraversableOnce wrapper - holds at 2GB
def lines = io.Source.fromFile(file).getLines.toIterable
// Option 4: Iterable wrapper - leaks like a sieve
def lines = new FileIterable(file)
def values = lines
  .drop(1)
  //.map(l => l.split("\t")).map(l => l.reduceLeft(_ + "|" + _))
  //.filter(l => l.startsWith("*"))
val writer = new java.io.PrintWriter(new java.io.File("out.tsv"))
values.foreach(v => writer.println(v))
writer.close()
The file it's reading is ~10GB with 1MB lines.
The first two options iterate the file using a constant amount of memory (~100MB). This is what I would expect. The downside is that an iterator can only be used once, and it relies on Scala's call-by-name convention as a pseudo-iterable. (For reference, the equivalent C# code uses ~14MB.)
The third method calls toIterable, defined in TraversableOnce. This one works, but it uses about 2GB to do the same work. No idea where the memory is going, because it can't be caching the entire Iterable.
The fourth is the most alarming - it immediately uses all available memory and throws an OOM exception. Even weirder is that it does this for all of the operations I've tested: drop, map, and filter. Looking at the implementations, none of them seem to maintain much state (though drop looks a little suspect - why doesn't it just count the items?). If I do no operations, it works fine.
My guess is that somewhere it's maintaining references to each of the lines read, though I can't imagine how. I've seen the same memory usage when passing Iterables around in Scala. For example if I take case 3 (.toIterable) and pass that to a method that writes an Iterable[String] to a file, I see the same explosion.
Any ideas?

Note how the ScalaDoc of Iterable says:
Implementations of this trait need to provide a concrete method with signature:
def iterator: Iterator[A]
They also need to provide a method newBuilder which creates a builder for collections of the same kind.
Since you don't provide an implementation for newBuilder, you get the default implementation, which uses a ListBuffer and thus tries to fit everything into memory.
You might want to implement Iterable.drop as
def drop(n: Int) = iterator.drop(n).toIterable
but that would break the uniform representation of the collection library (i.e. iterator.toIterable returns a Stream, while you want List.drop to return a List, etc. - thus the need for the Builder concept).
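If you really want an Iterable that re-reads the file and stays lazy, one workaround is to route the bulk operations through a fresh iterator yourself. Here is a minimal sketch (my own, assuming the 2.x collection library; the class name is illustrative):

import java.io.File
import scala.io.Source

// Sketch: every traversal re-opens the file, and drop is overridden to
// stay on the iterator instead of the default builder-based version,
// which would buffer lines into a ListBuffer.
class LazyFileIterable(file: File) extends Iterable[String] {
  override def iterator: Iterator[String] = Source.fromFile(file).getLines
  override def drop(n: Int): Iterable[String] = {
    val outer = this
    new Iterable[String] {
      override def iterator: Iterator[String] = outer.iterator.drop(n)
    }
  }
}

The same trick would be needed for map, filter, and friends; at that point it is usually simpler to work with the Iterator directly, as in options 1 and 2.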

Is it safe to remove elements from a collection.mutable.HashSet during iteration?

A mutable Set's retain method is implemented as follows:
def retain(p: A => Boolean): Unit =
  for (elem <- this.toList) // SI-7269 toList avoids ConcurrentModificationException
    if (!p(elem)) this -= elem
But if I implement my own method that doesn't make a copy for iterating, nothing blows up.
import scala.collection.mutable

def dumbRetain[A](self: mutable.Set[A], p: A => Boolean): Unit =
  for (elem <- self)
    if (!p(elem)) self -= elem
dumbRetain(mutable.HashSet(1,2,3,4,5,6), Set(2,4,6))
// everything is ok
I see that SI-7269's test case uses the JavaConversions wrapper around a java Set/Map, and it seems like the issue arises from the underlying java collection.
I know there will never be a java collection passed to my algorithm, so can I use dumbRetain without worrying about the ConcurrentModificationException? Or is this "coincidental behavior" that I shouldn't rely on?
Edit: to clarify, I would be using dumbRetain as an implementation detail in an algorithm which would be in full control of what it passes to dumbRetain. And this would be run in a single-threaded context.
This seems to rely on the specific implementation of mutable.HashSet, and there is nothing in the API that guarantees that it would work for all other implementations of mutable.Set, even if we exclude all wrappers for the Java collections.
The for-loop
for (elem <- self) {
  ...
}
is desugared into foreach, which for mutable.HashSet is implemented as follows:
override def foreach[U](f: A => U) {
  var i = 0
  val len = table.length
  while (i < len) {
    val curEntry = table(i)
    if (curEntry ne null) f(entryToElem(curEntry))
    i += 1
  }
}
Essentially, it simply loops through the Array of the underlying FlatHashTable and invokes the passed function f on every element. The foreach itself simply does not contain any line that could throw anything; it doesn't check for concurrent [footnote-1] modifications at all.
A ConcurrentModificationException seems to be the less troubling case: at least your program fails fast, and even returns a detailed stack trace that points to the line where the problem occurred. It would actually be much worse if it simply deteriorated into undefined behavior without throwing anything - that would be the worst case. However, this worst case shouldn't occur for collections from the standard library: Throw ConcurrentModificationException exception's in scala collections? #188
Quote:
In scala/scala#5295 (merged in to 2.12.x) I made sure that removing the element last returned from an iterator would not cause a problem for the iterator.
So, as long as you clearly state in the documentation that only the collections from standard library are supported, you will most likely not have any problems using it in your own code. But if you use it in a public interface, this would be an invitation for a bug analogous to "SI-7269" quoted in your question.
[footnote-1] "concurrent" as in "ConcurrentModificationException", not as in "concurrently executed threads".
EDIT: I've tried to choose less ambiguous formulations. Many thanks to @Dima for the feedback and the numerous suggestions.
Yeah, you can do it, as long as you are sure this is Scala's native HashSet implementation, not a wrapper around a Java one... and with the understanding that this is not thread-safe and should never be used concurrently (the original HashSet.retain is that way too, as are the other mutators).
Better yet, just use immutable Set.filter, unless you actually have real hard evidence (not just intuition) demonstrating that your specific case absolutely requires a mutable container.
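For reference, a minimal sketch of that immutable alternative (the values are just an example):

// filter returns a new immutable Set and leaves the original untouched,
// so there is no iterate-while-mutating hazard in the first place.
val s = Set(1, 2, 3, 4, 5, 6)
val retained = s.filter(Set(2, 4, 6)) // Set(2, 4, 6)

Note that a Set can be used as the predicate here, just as in the dumbRetain example above.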

Scala large list allocation causes heavy garbage collection

EDIT, Summary: So, in the long chain of back and forth, I think the "final answer" is a little hard to find. In essence, however, Yuval pointed out that the incremental allocation of a large amount of memory forces a heap resize (actually, two by the look of the graph). A heap resize on a normal JVM involves a full GC, the most expensive, most time-consuming collection possible. So the reality is that my process isn't collecting lots of garbage per se; rather, it's doing heap resizes which inherently trigger an expensive GC as part of the heap reorganization process. Those of us more familiar with Java than Scala are more likely to have allocated a simple ArrayList, which even if it causes heap resizing is only a few objects (and likely allocated directly into old-gen if it's a big array), which would be far less work - because it's far fewer objects! - for a full GC anyway. The moral is likely that some other structure would be more appropriate for very large "lists".
I was trying to experiment with some of Scala's data structures (actually, with the parallel stuff, but that's not relevant to the problem I bumped into). I'm trying to create a fairly long list (with the intention of processing it purely sequentially). But try as I might, I'm failing to create a simple list without invoking vast quantities of garbage collection. I'm fairly sure that I'm simply prepending the new items to the existing tail, but the GC load suggests that I'm not. I've tried a couple of techniques so far (I'm starting to suspect that I'm misunderstanding something truly fundamental about this structure :( )
Here's the first effort:
val myList = {
  import scala.annotation.tailrec

  @tailrec
  def addToList(remaining: Long, acc: List[Double]): List[Double] =
    if (remaining > 0) addToList(remaining - 1, 0 :: acc)
    else acc

  addToList(10000000, Nil)
}
And when I began to doubt I knew how to do recursion after all, I came up with this mutating beast.
val myList = {
  var rv: List[Double] = Nil
  var count = 10000000
  while (count > 0) {
    rv = 0.0 :: rv
    count -= 1
  }
  rv
}
They both give the same effect: 8 cores running flat out doing GC (according to jvisualvm), and memory allocation peaking at just over 1GB, which I assume is the real allocated space required for the data - but on the way, it creates a seemingly vast amount of trash.
Am I doing something horribly wrong here? Am I somehow forcing the recreation of the entire list with every new element (I'm trying very hard to only do "prepend" type operations, which I thought should avoid that)?
Or maybe I have half a memory of hearing that Scala's List does something odd to help it transform into a mutable list, or a parallel list, or something - I really don't recall what. Is this something to do with that? And if so, what the heck was "that" anyway?
Oh, and here's the image of the GC process. Notice the front-loading on the otherwise triangular rise of the memory that represents the "real" allocated data. That huge hump, and the associated CPU usage are my problem:
EDIT: I should clarify, I'm interested in two things. First, if my creation of the list is intrinsically faulty (i.e. if I'm not in fact only performing prepend operations) then I'd like to understand why, and how I should do this "right". Second, if my construction is sound and the odd behavior is intrinsic in the List, I'd like to understand the List better, so I know what it's doing, and why. I'm not particularly interested (at this point) in alternative ways to build a sequential data structure that sidesteps this problem. I anticipate using List a lot, and would like to know what's happening. (Later, I might well want to investigate other structures in this level of detail, but not right now).
First, if my creation of the list is intrinsically faulty (i.e. if I'm not in fact only performing prepend operations) then I'd like to understand why
You are constructing the list properly, there's no problem there.
Second, if my construction is sound and the odd behavior is intrinsic in the List, I'd like to understand the List better, so I know what it's doing, and why
List[A] in Scala is based on a linked-list implementation, where you have a head of type A and a tail of type List[A]. List[A] is an abstract class with two implementations: one representing the empty list, called Nil, and one called "Cons", or ::, representing a list which has a head value and a tail, which can in turn be either full or empty:
def ::[B >: A](x: B): List[B] =
  new scala.collection.immutable.::(x, this)
If we look at the implementation for ::, we can see that it is a simple case class with two fields:
final case class ::[B](override val head: B, private[scala] var tl: List[B]) extends List[B] {
  override def tail: List[B] = tl
  override def isEmpty: Boolean = false
}
A quick look using the memory tab in IntelliJ shows that we have ten million Double values and ten million instances of the :: case class, which in itself has additional overhead for being a case class (the compiler "enhances" these classes with additional structure).
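For a rough sense of scale, a back-of-the-envelope estimate (my numbers, assuming a 64-bit JVM with compressed oops): each :: cell takes about 24 bytes (a 12-byte object header plus two 4-byte references, padded to an 8-byte boundary), and each boxed java.lang.Double takes about 24 bytes as well (12-byte header plus an 8-byte double, padded). Ten million elements therefore come to roughly (24 + 24) × 10^7 ≈ 480 MB of live data, before counting any copying done by the collector.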
Your JVisualVM instance doesn't show the GC graph being fully utilized; rather, it shows that your CPU is overworked by generating the large list of items. During the allocation process, you generate a lot of intermediate lists until you reach your fully generated list, which means data has to be evicted between the different GC levels (Eden, Survivor and Old, assuming you're running the JVM flavor of Scala).
If we want a bit more information, we can use Mission Control to see what's causing the memory pressure. This is a sample generated from a 30-second profile running:
def main(args: Array[String]): Unit = {
  import scala.annotation.tailrec

  def myList: List[Double] = {
    @tailrec
    def addToList(remaining: Long, acc: List[Double]): List[Double] =
      if (remaining > 0) addToList(remaining - 1, 0 :: acc)
      else acc
    addToList(10000000, Nil)
  }

  while (true) {
    myList
  }
}
We see that we have a call to BoxesRunTime.boxToDouble, which happens because :: is a generic class and doesn't have an @specialized annotation for Double. We go scala.Int -> scala.Double -> java.lang.Double.
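For contrast, a minimal sketch (my addition, not part of the original answer) of an unboxed alternative: an Array[Double] stores primitive doubles in one contiguous allocation, so there are no per-element :: cells and no boxing:

// A single backing array of primitive doubles: roughly 80 MB of data
// (8 bytes x 10^7) plus one object header, instead of ten million
// cons cells and ten million boxed java.lang.Double instances.
val arr: Array[Double] = Array.fill(10000000)(0.0)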

Lazy eager map evaluation

There are basically two options to evaluate a map in Scala.
Lazy evaluation computes the function that is passed as a parameter when the next value is needed. If the function takes one hour to execute, then there is an hour to wait when the value is needed (e.g. Stream and Iterator).
Eager evaluation computes the function when the map is defined. It produces a new list (Vector or whatever) and stores the results, keeping the program busy during that time.
With Future we can obtain the list (Seq or whatever) in a separate thread; this means that our thread doesn't block, but the results still have to be stored.
So I did something different, please check it here.
This was a while ago, so I don't remember whether I tested it. The point is to have a map that applies concurrently (non-blocking) and somewhat eagerly to a set of elements, filling a buffer (sized to the number of cores in the computer, and no more). This means that:
The invocation of the map doesn't block the current thread.
Obtaining an element doesn't block the current thread (in case there was time to calculate it before and store the result in the buffer).
Infinite lists can be handled because we only prefetch a few results (about 8, depending on the number of cores).
So this all sounds very good, and you may be wondering what the problem is. The problem is that this solution is not particularly elegant, IMHO. Assuming the code I shared works in Java and/or Scala, to iterate over the elements in the iterable produced by the map I would only need to write:
new CFMap(whateverFunction).apply(whateverIterable)
However what I would like to write is something like:
whateverIterable.bmap(whateverFunction)
As is usual in Scala (the 'b' is for buffered), or perhaps something like:
whateverThing.toBuffered.map(whateverFunction)
Either of them works for me. So the question is, how can I do this in an idiomatic way in Scala? Some options:
Monads: create a new collection "Buffered" so that I can use the toBuffered method (that should be added to the previous ones as an implicit) and implement map and everything else for this Buffered thing (sounds like quite some work).
Implicits: create an implicit method that transforms the usual collections, or the superclass of them (I'm not sure which one it should be; Iterable maybe?), into something else to which I can apply the .bmap method and obtain something from it, probably an iterable.
Other: there are probably many options I have not considered so far. It's possible that some library already implements this (I'd actually be surprised by the opposite; I can't believe nobody thought of this before). Using something that has already been done is usually a good idea.
Please let me know if something is unclear.
What you're looking for is the "pimp-my-library" pattern. Check it out:
object CFMapExtensions {
  import sanity.commons.functional.CFMap
  import scala.collection.JavaConversions._

  implicit class IterableExtensions[I](i: Iterable[I]) {
    def bmap[O](f: Function1[I, O]): Iterable[O] = new CFMap(f).apply(asJavaIterable(i))
  }

  implicit class JavaIterableExtensions[I](i: java.lang.Iterable[I]) {
    def bmap[O](f: Function1[I, O]): Iterable[O] = new CFMap(f).apply(i)
  }

  // Add an implicit conversion to a java function.
  import java.util.function.{Function => JFunction}
  implicit def toJFunction[I, O](f: Function1[I, O]): JFunction[I, O] = {
    new JFunction[I, O]() {
      def apply(t: I): O = f(t)
    }
  }
}

object Test extends App {
  import CFMapExtensions._
  List(1,2,3,4).bmap(_ + 5).foreach(println)
}
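If you'd rather avoid the external CFMap dependency, here is a self-contained sketch of the same idea (my own, with illustrative names, and simplified: it processes fixed-size chunks concurrently rather than keeping a sliding buffer topped up):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object BufferedMapSketch {
  implicit class BufferedOps[I](it: Iterable[I]) {
    // Evaluate f on up to `buffer` elements concurrently, yielding results
    // in their original order. grouped is lazy on the underlying iterator,
    // so this also works for infinite iterables.
    def bmap[O](f: I => O, buffer: Int = Runtime.getRuntime.availableProcessors()): Iterator[O] =
      it.iterator.grouped(buffer).flatMap { chunk =>
        val inFlight = chunk.map(x => Future(f(x)))           // start up to `buffer` computations
        inFlight.map(fut => Await.result(fut, Duration.Inf))  // collect, preserving order
      }
  }
}

The trade-off of the chunked simplification is that the consumer waits at each chunk boundary; a true sliding buffer would keep all cores busy continuously.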

Why is `Source.fromFile(...).getLines()` empty after I've iterated over it?

It was quite a surprise for me that for (line <- lines) is so devastating! It completely exhausts the lines iterator. So running the following snippet will result in size = 0:
val lines = Source.fromFile(args(0)).getLines()
var cnt = 0
for (line <- lines) {
  cnt = readLines(line, cnt)
}
val size = lines.size
Is it a normal Scala practice to have well-hidden side-effects like this?
Source.getLines() returns an iterator. For every iterator, if you invoke a bulk operation such as foreach above, or map, take, toList, etc., then the iterator is no longer in a usable state.
That is the contract for Iterators and, more generally, classes that inherit TraversableOnce.
It is of particular importance to note that, unless stated otherwise, one should never use an iterator after calling a method on it. The two most important exceptions are also the sole abstract methods: next and hasNext.
This is not the case for classes that inherit Traversable -- for those you can invoke the bulk traversal operations as many times as you want.
Source.getLines() returns an Iterator, and walking through an Iterator will mutate it. This is made quite clear in the Scala documentation:
An iterator is mutable: most operations on it change its state. While it is often used to iterate through the elements of a collection, it can also be used without being backed by any collection (see constructors on the companion object).
It is of particular importance to note that, unless stated otherwise, one should never use an iterator after calling a method on it. The two most important exceptions are also the sole abstract methods: next and hasNext.
Using for notation is just syntactic sugar for calling map, flatMap and foreach methods on the Iterator, which again have quite clear documentation stating not to use the iterator:
Reuse: After calling this method, one should discard the iterator it was called on, and use only the iterator that was returned. Using the old iterator is undefined, subject to change, and may result in changes to the new iterator as well.
Scala generally aims to be a 'pragmatic' language - mutation and side effects are allowed for performance and interoperability reasons, although not encouraged. Calling this 'well-hidden' is, however, something of a stretch.
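For completeness, a small sketch of the two usual workarounds (my addition; the file name is illustrative). The first assumes the file fits in memory:

import scala.io.Source

// Workaround 1: materialize the lines once; a List can be traversed
// any number of times, unlike the one-shot iterator.
val lines = Source.fromFile("input.txt").getLines().toList
val cnt = lines.count(_.nonEmpty)
val size = lines.size // still correct after the traversal above

// Workaround 2: re-open the source, getting a fresh iterator per traversal.
def freshLines = Source.fromFile("input.txt").getLines()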

Scala type that is Iterable and has a length?

Writing Scala code, I regularly encounter cases where I have "processor" functions that operate iteratively on a collection of elements and also need to know the length of the collection.
On the other hand I have "provider" functions that generate collections and so already know the length. The generated collections may be List[T], Array[T] or Set[T], etc., but even in the case of List[T], my generator knows the size (even if the List type does not store it).
So I would naturally declare the "processor" functions as taking the most generic type that seems to fit all collection types, Iterable[T], as a parameter. However, they then internally need to find out the size via iterative collection traversal at a cost of O(N), which is undesirable.
So my naive solution would be to create a new type like IterableWithSize[T] and have the provider and processor functions create and take this type. Neither Seq[T] nor IndexedSeq[T] seems to fit the bill. But this seems like a relatively common use case, so I suspect there is a more idiomatic way to do this. What would that be?
In Scala collections, performance-sensitive methods like size are not inherited from traits but overridden in the bottom type. For example, see the implementation of immutable.HashSet:
https://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_9_0_1/src//library/scala/collection/immutable/HashSet.scala
So you don't need to worry about it. Just use a high-level common trait like Traversable or Iterable and you're done.
Actually, there's no idiomatic way around that. Scala collections were really meant to be traversed or used in other prescribed manners (such as Set.contains or Map.get). Checking the size is not part of them, and some of them are not even finite.
Now, IndexedSeq is a relatively safe bet -- it guarantees O(log n) indexed access, which is only possible if you have O(log n) size. Also, Set and Map are reasonably safe as well, for similar reasons. But if you are looking for a trait that gives you a guarantee on the speed of size, there isn't one.
I don't think there is an idiomatic way to do this. But here are two alternatives:
(1) Extend Scala's List/Set/Array collections and override the size method. This is not as difficult as it seems at first glance.
(2) Wrap your List/Set/Array collections together with the size and define an implicit unwrapper like:
class IterableWithSizeWrapper[E](private val c: Iterable[E], val size: Int)

object IterableWithSizeWrapper {
  implicit def unwrap[E](iws: IterableWithSizeWrapper[E]): Iterable[E] = iws.c
}

object ListWithSizeTest {
  def process[E](iws: IterableWithSizeWrapper[E]) {
    // iws.size uses your cached size value
    // iws.take(i) forces the unwrap to the original collection,
    // so iws.take(i).size uses the calculated size
    for (i <- 0 to iws.size) assert(iws.take(i).size == i)
  }

  def main(args: Array[String]) {
    process(new IterableWithSizeWrapper(List(1,2,3), 3))
    process(new IterableWithSizeWrapper(Set(1,2,3), 3))
    process(new IterableWithSizeWrapper(Array(1,2,3), 3))
  }
}
How about Traversable? All the collections you mention inherit from it (Array indirectly via WrappedArray), and it provides size and toIterable (or toIterator) for traversal.
Your processor functions should accept Seq[T]. A Seq is precisely an Iterable that "has a length". Your only remaining problem is making length efficient. AFAIK it is already efficient in all cases except List. To make List.length efficient, just do as the others describe: create an implementation of Seq that wraps a List and stores its length.
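A minimal sketch of that last suggestion (the class name is mine, and it assumes the provider passes the correct length):

// A Seq view over a List with a provider-supplied, O(1) length.
// Everything else delegates to the underlying list.
final class SizedList[T](underlying: List[T], knownLength: Int) extends Seq[T] {
  override def length: Int = knownLength
  override def apply(idx: Int): T = underlying(idx)
  override def iterator: Iterator[T] = underlying.iterator
}

val s: Seq[Int] = new SizedList(List(1, 2, 3), 3)
assert(s.length == 3) // no O(N) traversal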