I'll ask this with a Scala example, but it may well be that this affects other languages which allow hybrid imperative and functional styles.
Here's a short example (UPDATED, see below):
def method: Iterator[Int] = {
  // construct some large intermediate value
  val huge = (1 to 1000000).toList
  val small = List.fill(5)(scala.util.Random.nextInt)
  // accidentally capture huge in a function literal
  small.iterator filterNot (huge contains _)
}
Now Iterator's filterNot works lazily, which is great! As a result, we'd expect the returned iterator to consume very little memory (indeed, O(1)). Sadly, however, we've made a terrible mistake: since filterNot is lazy, the iterator keeps a reference to the function literal huge contains _, which in turn captures huge.
Thus while we thought the method would require a large amount of memory only while it was running, and that the memory could be freed immediately after the method returned, in fact that memory cannot be reclaimed until we drop the returned Iterator.
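To make the capture concrete: the function literal desugars to roughly the following (a sketch; pred is a name introduced here for illustration):
def method: Iterator[Int] = {
  val huge = (1 to 1000000).toList
  val small = List.fill(5)(scala.util.Random.nextInt)
  // `huge contains _` desugars to roughly this:
  val pred: Int => Boolean = (x: Int) => huge contains x
  // pred holds a reference to huge, and the lazy iterator holds pred,
  // so huge stays reachable for as long as the returned iterator does
  small.iterator filterNot pred
}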
(I just made such a mistake, and it took a long time to track down! You can catch such things by looking at heap dumps...)
What are best practices for avoiding this problem?
It seems that the only solution is to carefully check for function literals that survive the end of the scope and that captured intermediate variables. This is a bit awkward if you're constructing a non-strict collection and planning to return it. Can anyone think of some nice tricks, Scala-specific or otherwise, that avoid this problem and let me write nice code?
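The best I've come up with so far, for this particular example, is to force the computation while huge is still in scope, since small is tiny anyway (a sketch; the returned iterator then references only the five-element result):
def method: Iterator[Int] = {
  val huge = (1 to 1000000).toList
  val small = List.fill(5)(scala.util.Random.nextInt)
  // filter strictly here; huge is no longer reachable from the result
  (small filterNot huge.contains).iterator
}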
UPDATE: the example I'd given previously was stupid, as huynhjl's answer below demonstrates. It had been:
def method: Iterator[Int] = {
  val huge = (1 to 1000000).toList // construct some large intermediate value
  val n = huge.last // do some calculation based on it
  (1 to n).iterator map (_ + 1) // return some small value
}
In fact, now that I understand a bit better how these things work, I'm not so worried!
Are you sure you're not oversimplifying the test case? Here is what I run:
object Clos {
  def method: Iterator[Int] = {
    val huge = (1 to 2000000).toList
    val n = huge.last
    (1 to n).iterator map (_ + 1)
  }

  def gc() { println("GC!!"); Runtime.getRuntime.gc }

  def main(args: Array[String]) {
    val list = List(method, method, method)
    list.foreach(m => println(m.next))
    gc()
    list.foreach(m => println(m.next))
    list.foreach(m => println(m.next))
  }
}
If I understand you correctly, because main is using the iterators even after the gc() call, the JVM would be holding onto the huge objects.
This is how I run it:
JAVA_OPTS="-verbose:gc" scala -cp classes Clos
This is what it prints towards the end:
[Full GC 57077K->57077K(60916K), 0.3340941 secs]
[Full GC 60852K->60851K(65088K), 0.3653304 secs]
2
2
2
GC!!
[Full GC 62959K->247K(65088K), 0.0610994 secs]
3
3
3
4
4
4
So it looks to me as if the huge objects were reclaimed...
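By contrast, if I swap in your updated example, where the returned iterator closes over huge, I'd expect the same harness to show that the memory is not reclaimed after gc() (a sketch to try in the Clos object above):
def capturing: Iterator[Int] = {
  val huge = (1 to 2000000).toList
  val small = List.fill(5)(scala.util.Random.nextInt)
  // the function literal captures huge, so the returned iterator keeps it alive
  small.iterator filterNot (huge contains _)
}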
I'm reading a very large number of records sequentially from a database API, one page at a time (with an unknown number of records per page), via a call to def readPage(pageNumber: Int): Iterator[Record].
I'm trying to wrap this API in something like Stream[Iterator[Record]] or Iterator[Iterator[Record]], lazily and in a functional way, ideally with no mutable state and a constant memory footprint, so that I can treat it as an infinite stream of pages or a sequence of Iterators and abstract the pagination away from the client. The client can iterate over the result; each call to next() retrieves the next page (Iterator[Record]).
What is the most idiomatic and efficient way to implement this in Scala?
Edit: I need to fetch and process the records one page at a time and cannot hold all the records from all pages in memory. If one page fails, throw an exception. The large number of pages/records means it is infinite for all practical purposes. I want to treat it as an infinite stream (or iterator) of pages, with each page being an iterator over a finite number of records (e.g. fewer than 1000, but the exact number is unknown ahead of time).
I looked at BatchCursor in Monix but it serves a different purpose.
Edit 2: this is the current version, using Tomer's answer below as a starting point but using Stream instead of Iterator.
This eliminates the need for tail recursion, as per https://stackoverflow.com/a/10525539/165130, and gives O(1) time for the Stream prepend operation #:: (whereas concatenating iterators via the ++ operation would be O(n)).
Note: while Streams are lazily evaluated, Stream memoization may still cause memory blow-up, and memory management gets tricky. Changing from val to def in the definition of the Stream (def pages = readAllPages below) doesn't seem to have any effect.
def readAllPages(pageNumber: Int = 0): Stream[Iterator[Record]] = {
  val iter: Iterator[Record] = readPage(pageNumber)
  if (iter.isEmpty)
    Stream.empty
  else
    iter #:: readAllPages(pageNumber + 1)
}
// usage
val pages = readAllPages()
for {
  page <- pages
  record <- page
  if isValid(record)
} process(record)
Edit 3:
the second suggestion by Tomer seems to be the best; its runtime and memory footprint are similar to the above solution, but it is much more concise and less error-prone.
val pages = Stream.from(1).map(readPage).takeWhile(_.nonEmpty)
Note: Stream.from(1) creates a stream starting from 1 and incrementing by 1, as described in the API docs.
You can try implementing logic like this:
def readPage(pageNumber: Int): Iterator[Record] = ???

// Note: the recursive call is not in tail position (a @tailrec annotation
// would not compile here), but Iterator.++ takes its argument by name, so
// the recursion is deferred until the earlier pages are exhausted.
def readAllPages(pageNumber: Int): Iterator[Iterator[Record]] = {
  val iter = readPage(pageNumber)
  if (iter.nonEmpty) {
    // Compute on records
    // When finished computing:
    Iterator(iter) ++ readAllPages(pageNumber + 1)
  } else {
    Iterator.empty
  }
}

readAllPages(0)
A shorter option will be:
Stream.from(1).map(readPage).takeWhile(_.nonEmpty)
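If the client would rather see one flat sequence of records than a stream of pages, the same expression can be flattened lazily (a sketch, assuming the readPage signature from the question; the memoization caveat from the question's note still applies):
val records: Iterator[Record] =
  Stream.from(1).map(readPage).takeWhile(_.nonEmpty).iterator.flatMap(page => page)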
Doing some home project, I encountered an interesting effect, which now seems obvious to me, but I still don't see a way to get away from it.
Here is the gist (I am using Scalaz, but in Haskell the result would probably be the same):
def askAndReadResponse(question: String): IO[String] = {
  putStrLn(question) >> readLn
}

def core: IO[String] = {
  val answer: IO[String] = askAndReadResponse("enter something")
  val cond: IO[Boolean] = answer map (_.length > 2)
  IO.ioMonad.ifM(cond, answer, core)
}
When I try to get an input from core, askAndReadResponse evaluates twice: once to evaluate the condition, and then again in ifM (so I get the message and the readLn one more time than necessary).
What I need is just the validated value (to print it later, for instance).
Is there an elegant way to do this, in particular to pass along the result of the IO without repeating the preceding IO actions, i.e. avoiding executing askAndReadResponse twice?
You can sequence the effects using monadic binding with flatMap:
def core: IO[String] = askAndReadResponse("enter something").flatMap {
  case response if response.length > 2 => response.point[IO]
  case response => core
}
This lets you take the result of one computation (the user entering text after being prompted) and use it in subsequent computations (the calculation about whether to return or loop, and the result if returning).
ifM just isn't going to be useful in your case—it would only work here if your condition and your successful branch were independent computations.
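To see why ifM re-runs the effect, here is roughly what your core expands to (a sketch of the semantics, not Scalaz's exact implementation; coreExpanded is a name introduced here):
def coreExpanded: IO[String] = {
  val answer: IO[String] = askAndReadResponse("enter something")
  // ifM(cond, ifTrue, ifFalse) is essentially cond.flatMap(b => if (b) ifTrue else ifFalse)
  answer.map(_.length > 2).flatMap { b =>
    if (b) answer // describes the prompt/read action again, so it runs a second time
    else coreExpanded
  }
}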
In a Scala program I wrote I have a scala.collection.Map that maps a String to some calculated values (in detail it's Map[String, (Double, immutable.Map[String, Double], Double)] - I know that's ugly and should (and will be) wrapped). Now, if I do this:
stats.map { case (c, (prior, pwc, denom)) =>
  println(c)
  ...
}
it takes about 30 seconds to print the value of c roughly 50 times! The println is just a test statement; the actual calculation I need was even slower (I aborted after a minute of complete silence). However, if I do it like this:
stats.mapValues { case (prior, pwc, denom) =>
  println(prior)
  ...
}
I don't run into these performance issues ... Can anyone explain why this is happening? Am I not following some important Scala guidelines?
Thanks for the help!
Edit:
I investigated the behaviour further. My guess is that the bottleneck comes from accessing the Map data structure. If I do the following, I have the same performance issues:
classes.foreach { c =>
  println(c)
  val ps = stats(c)
}
Here classes is a List[String] that stores the keys of the Map externally. Without the access to stats(c) no performance losses occur.
mapValues actually returns a view on the original map, which can lead to unexpected performance issues. From this blog post:
...here is a catch: map and mapValues are different in a not-so-subtle way. mapValues, unlike map, returns a view on the original map. This view holds references to both the original map and to the transformation function (here (_ + 1)). Every time the returned map (view) is queried, the original map is first queried and the transformation function is called on the result.
I recommend reading the rest of that post for some more details.
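To see the real cost of the mapValues version, you can materialize the view once, so the transformation runs a single time instead of on every query; map(identity) is one common idiom for forcing it (a sketch using the tuple names from your question):
// force the lazy view into a strict map
val strictPriors = stats.mapValues { case (prior, pwc, denom) => prior }.map(identity)
// or skip the view entirely and build a strict map directly
val strictPriors2 = stats.map { case (c, (prior, pwc, denom)) => c -> prior }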
For an immutable flavour, Iterator does the job.
val x = Iterator.fill(100000)(someFn)
Now I want to implement a mutable version of Iterator, with three guarantees:
thread-safe on all transformations (fold, foldLeft, ...) and append
lazy evaluated
traversable only once! Once used, an object from this Iterator should be destroyed.
Is there an existing implementation to give me these guarantees? Any library or framework example would be great.
Update
To illustrate the desired behaviour.
class SomeThing {}

class Test(val list: Iterator[SomeThing] = Iterator.empty) { // default lets us start empty
  def add(thing: SomeThing): Test =
    new Test(list ++ Iterator(thing))
}

(new Test()).add(new SomeThing).add(new SomeThing)
In this example, SomeThing is expensive to construct, so it needs to be lazy. Re-iterating over list is never required, so Iterator is a good fit.
This is supposed to asynchronously and lazily sequence 10 million SomeThing instances without depleting the executor (a cached thread pool executor) or running out of memory.
You don't need a mutable Iterator for this, just daisy-chain the immutable form:
class SomeThing {}

case class Test(list: Iterator[SomeThing] = Iterator.empty) {
  def add(thing: => SomeThing) = Test(list ++ Iterator(thing))
}

Test().add(new SomeThing).add(new SomeThing)
Although you don't really need the extra boilerplate of Test here:
Iterator(new SomeThing) ++ Iterator(new SomeThing)
Note that Iterator.++ takes a by-name param, so the ++ operation is already lazy.
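A quick way to convince yourself of that laziness (a sketch to try in the REPL):
val it = Iterator(1) ++ { println("tail evaluated"); Iterator(2) }
it.next() // returns 1; nothing printed yet
it.next() // prints "tail evaluated", then returns 2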
You might also want to try this, to avoid building intermediate Iterators:
Iterator.continually(new SomeThing) take 2
UPDATE
If you don't know the size in advance, I'll often use a tactic like this:
def mkSomething = if(cond) Some(new Something) else None
Iterator.continually(mkSomething) takeWhile (_.isDefined) map { _.get }
The trick is to have your generator function wrap its output in an Option, which gives you a way to flag that the iteration is finished by returning None.
Of course... If you're really pushing out the boat, you can even use the dreaded null:
def mkSomething = if(cond) { new Something } else null
Iterator.continually(mkSomething) takeWhile (_ != null)
Seems like you need to hide the fact that the iterator is mutable but at the same time allow it to grow mutably. What I'm going to propose is the same sort of trick I've used to speed up ::: in the past:
abstract class AppendableIterator[A] extends Iterator[A] {
  protected var inner: Iterator[A]

  def hasNext = inner.hasNext
  def next() = inner.next()

  def append(that: Iterator[A]) = synchronized {
    inner = new JoinedIterator(inner, that)
  }
}

// You might need to add some more things, this is a skeleton
class JoinedIterator[A](first: Iterator[A], second: Iterator[A]) extends Iterator[A] {
  def hasNext = first.hasNext || second.hasNext
  def next() =
    if (first.hasNext) first.next()
    else if (second.hasNext) second.next()
    else Iterator.empty.next() // throws NoSuchElementException
}
So what you're really doing is leaving the Iterator at whatever place in its iteration you might have it while still preserving the thread safety of the append by "joining" another Iterator in non-destructively. You avoid the need to recompute the two together because you never actually force them through a CanBuildFrom.
This is also a generalization of just adding one item. You can always wrap some A in an Iterator[A] of one element if you so choose.
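A hypothetical usage sketch of the skeleton above, starting from a concrete inner iterator and growing it in place:
val it = new AppendableIterator[Int] {
  protected var inner: Iterator[Int] = Iterator(1, 2)
}
it.append(Iterator(3, 4))
it.foreach(println) // prints 1, 2, 3, 4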
Have you looked at the mutable.ParIterable in the collection.parallel package?
To access an iterator over elements you can do something like
val x = ParIterable.fill(100000)(someFn).iterator
From the docs:
Parallel operations are implemented with divide and conquer style algorithms that parallelize well. The basic idea is to split the collection into smaller parts until they are small enough to be operated on sequentially.
...
The higher-order functions passed to certain operations may contain side-effects. Since implementations of bulk operations may not be sequential, this means that side-effects may not be predictable and may produce data-races, deadlocks or invalidation of state if care is not taken. It is up to the programmer to either avoid using side-effects or to use some form of synchronization when accessing mutable data.
I have heard that iteratees are lazy, but how lazy exactly are they? Alternatively, can iteratees be fused with a postprocessing function, so that an intermediate data structure does not have to be built?
Can I in my iteratee for example build a 1 million item Stream[Option[String]] from a java.io.BufferedReader, and then subsequently filter out the Nones, in a compositional way, without requiring the entire Stream to be held in memory? And at the same time guarantee that I don't blow the stack? Or something like that - it doesn't have to use a Stream.
I'm currently using Scalaz 6 but if other iteratee implementations are able to do this in a better way, I'd be interested to know.
Please provide a complete solution, including closing the BufferedReader and calling unsafePerformIO, if applicable.
Here's a quick iteratee example using the Scalaz 7 library that demonstrates the properties you're interested in: constant memory and stack usage.
The problem
First assume we've got a big text file with a string of decimal digits on each line, and we want to find all the lines that contain at least twenty zeros. We can generate some sample data like this:
val w = new java.io.PrintWriter("numbers.txt")
val r = new scala.util.Random(0)
(1 to 1000000).foreach(_ =>
  w.println((1 to 100).map(_ => r.nextInt(10)).mkString)
)
w.close()
Now we've got a file named numbers.txt. Let's open it with a BufferedReader:
val reader = new java.io.BufferedReader(new java.io.FileReader("numbers.txt"))
It's not excessively large (~97 megabytes), but it's big enough for us to see easily whether our memory use is actually staying constant while we process it.
Setting up our enumerator
First for some imports:
import scalaz._, Scalaz._, effect.IO, iteratee.{ Iteratee => I }
And an enumerator (note that I'm changing the IoExceptionOrs into Options for the sake of convenience):
val enum = I.enumReader(reader).map(_.toOption)
Scalaz 7 doesn't currently provide a nice way to enumerate a file's lines, so we're chunking through the file one character at a time. This will of course be painfully slow, but I'm not going to worry about that here, since the goal of this demo is to show that we can process this large-ish file in constant memory and without blowing the stack. The final section of this answer gives an approach with better performance, but here we'll just split on line breaks:
val split = I.splitOn[Option[Char], List, IO](_.cata(_ != '\n', false))
And if the fact that splitOn takes a predicate that specifies where not to split confuses you, you're not alone. split is our first example of an enumeratee. We'll go ahead and wrap our enumerator in it:
val lines = split.run(enum).map(_.sequence.map(_.mkString))
Now we've got an enumerator of Option[String]s in the IO monad.
Filtering the file with an enumeratee
Next for our predicate—remember that we said we wanted lines with at least twenty zeros:
val pred = (_: String).count(_ == '0') >= 20
We can turn this into a filtering enumeratee and wrap our enumerator in that:
val filtered = I.filter[Option[String], IO](_.cata(pred, true)).run(lines)
We'll set up a simple action that just prints everything that makes it through this filter:
val printAction = (I.putStrTo[Option[String]](System.out) &= filtered).run
Of course we've not actually read anything yet. To do that we use unsafePerformIO:
printAction.unsafePerformIO()
Now we can watch the Some("0946943140969200621607610...")s slowly scroll by while our memory usage remains constant. It's slow and the error handling and output are a little clunky, but not too bad I think for about nine lines of code.
Getting output from an iteratee
That was the foreach-ish usage. We can also create an iteratee that works more like a fold—for example gathering up the elements that make it through the filter and returning them in a list. Just repeat everything above up until the printAction definition, and then write this instead:
val gatherAction = (I.consume[Option[String], IO, List] &= filtered).run
Kick that action off:
val xs: Option[List[String]] = gatherAction.unsafePerformIO().sequence
Now go get a coffee (it might need to be pretty far away). When you come back you'll either have a None (in the case of an IOException somewhere along the way) or a Some containing a list of 1,943 strings.
Complete (faster) example that automatically closes the file
To answer your question about closing the reader, here's a complete working example that's roughly equivalent to the second program above, but with an enumerator that takes responsibility for opening and closing the reader. It's also much, much faster, since it reads lines, not characters. First for imports and a couple of helper methods:
import java.io.{ BufferedReader, File, FileReader }
import scalaz._, Scalaz._, effect._, iteratee.{ Iteratee => I, _ }
def tryIO[A, B](action: IO[B]) = I.iterateeT[A, IO, Either[Throwable, B]](
  action.catchLeft.map(
    r => I.sdone(r, r.fold(_ => I.eofInput, _ => I.emptyInput))
  )
)
def enumBuffered(r: => BufferedReader) =
  new EnumeratorT[Either[Throwable, String], IO] {
    lazy val reader = r
    def apply[A] = (s: StepT[Either[Throwable, String], IO, A]) => s.mapCont(
      k =>
        tryIO(IO(reader.readLine())).flatMap {
          case Right(null) => s.pointI
          case Right(line) => k(I.elInput(Right(line))) >>== apply[A]
          case e           => k(I.elInput(e))
        }
    )
  }
And now the enumerator:
def enumFile(f: File): EnumeratorT[Either[Throwable, String], IO] =
  new EnumeratorT[Either[Throwable, String], IO] {
    def apply[A] = (s: StepT[Either[Throwable, String], IO, A]) => s.mapCont(
      k =>
        tryIO(IO(new BufferedReader(new FileReader(f)))).flatMap {
          case Right(reader) => I.iterateeT(
            enumBuffered(reader).apply(s).value.ensuring(IO(reader.close()))
          )
          case Left(e) => k(I.elInput(Left(e)))
        }
    )
  }
And we're ready to go:
val action = (
  I.consume[Either[Throwable, String], IO, List] %=
  I.filter(_.fold(_ => true, _.count(_ == '0') >= 20)) &=
  enumFile(new File("numbers.txt"))
).run
Now the reader will be closed when the processing is done.
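As with the earlier examples, nothing has actually run yet; kicking it off is one call (the result type follows from using consume over Either[Throwable, String] inputs):
val results: List[Either[Throwable, String]] = action.unsafePerformIO()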
I should have read a little bit further... this is precisely what enumeratees are for. Enumeratees are defined in Scalaz 7 and Play 2, but not in Scalaz 6.
Enumeratees are for "vertical" composition (in the sense of "vertically integrated industry") while ordinary iteratees compose monadically in a "horizontal" way.
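A schematic sketch of the two styles, reusing the combinators from the answer above (type parameters follow that answer's conventions and are not tested against any specific Scalaz 7 release):
// "vertical": an enumeratee (filter) stacked under an iteratee (consume) with %=
val vertical = I.consume[String, IO, List] %= I.filter[String, IO](_.nonEmpty)
// "horizontal": iteratees sequenced monadically, one consuming after the other
val horizontal = I.consume[String, IO, List].flatMap { first =>
  I.consume[String, IO, List].map(second => (first, second))
}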