Stream#filter Runs out of Memory for 1,000,000 items - scala

Let's say I have a Stream of length 1,000,000 with all 1's.
scala> val million = Stream.fill(100000000)(1)
million: scala.collection.immutable.Stream[Int] = Stream(1, ?)
scala> million filter (x => x % 2 == 0)
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
I get an Out of Memory exception.
Then, I tried the same filter call with List.
scala> val y = List.fill(1000000)(1)
y: List[Int] = List(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ...
scala> y.filter(x => x % 2 == 0)
res2: List[Int] = List()
Yet it succeeds.
Why does the Stream#filter run out of memory here, but the List#filter completes just fine?
Lastly, with a large stream, will filter result in the non-lazy evaluation of the entire stream?

Overhead of List - one object (an instance of ::) with 2 fields (2 pointers) per element.
Overhead of Stream - an instance of Cons (with 3 pointers) plus an instance of a Function0 (for the by-name tl: => Stream[A] that lazily evaluates Stream#tail) per element.
So you'll spend roughly twice as much memory on Stream.
You have defined your Stream as a val. Alternatively you could define million as a def - in that case, once filter has passed over the elements, nothing references them any more, the GC can delete them, and you get your memory back.
Note that only the tail of a Stream is lazy; the head is strict. So filter evaluates strictly until it finds the first element that satisfies the given predicate, and since there are no such elements in your Stream, filter iterates over your entire million-element stream and keeps all of its elements in memory.
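For illustration, a minimal sketch of the def variant (same size as in the question; whether it finishes still depends on heap size and on the cost of building that many cons cells):
def million = Stream.fill(100000000)(1) // a def keeps no reference to the stream's head
million filter (x => x % 2 == 0)        // cells the filter has already passed become unreachable,
                                        // so the GC can reclaim them during the traversal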

Related

scalaz stream structure for growing lists

I have a hunch that I can (should?) be using scalaz-streams for solving my problem which is like this.
I have a starting item A. I have a function that takes an A and returns a list of A.
def doSomething(a : A) : List[A]
I have a work queue that starts with 1 item (the starting item). When we process (doSomething) each item, it may add many items to the end of the same work queue. At some point, however (after many millions of items), each subsequent item that we doSomething on will add fewer and fewer items to the work queue, and eventually no new items will be added at all (doSomething will return Nil for these items). This is how we know the computation will eventually terminate.
Assuming scalaz-streams is appropriate for this, could someone please give me some tips as to which overall structure or types I should be looking at to implement this?
Once a simple implementation with a single "worker" is done, I would also like to use multiple workers to process queue items in parallel, e.g. having a pool of 5 workers (each worker farming its task out to an agent to calculate doSomething), so I would need to handle effects (such as worker failure) in this algorithm as well.
So the answer to the "how?" is:
import scalaz.stream._
import scalaz.stream.async._
import Process._
def doSomething(i: Int) = if (i == 0) Nil else List(i - 1)
val q = unboundedQueue[Int]
val out = unboundedQueue[Int]
q.dequeue
.flatMap(e => emitAll(doSomething(e)))
.observe(out.enqueue)
.to(q.enqueue).run.runAsync(_ => ()) //runAsync can process failures, there is `.onFailure` as well
q.enqueueAll(List(3,5,7)).run
q.size.continuous
.filter(0==)
.map(_ => -1)
.to(out.enqueue).once.run.runAsync(_ => ()) //call it only after enqueueAll
import scalaz._, Scalaz._
val result = out
.dequeue
.takeWhile(_ != -1)
.map(_.point[List])
.foldMonoid.runLast.run.get //run synchronously
Result:
result: List[Int] = List(2, 4, 6, 1, 3, 5, 0, 2, 4, 1, 3, 0, 2, 1, 0)
However, you might notice that:
1) I had to solve the termination problem. The same problem exists for akka-stream and is much harder to resolve there, as you don't have access to the Queue and there is no natural back-pressure to guarantee that the queue won't appear empty just because of fast readers.
2) I had to introduce another queue for the output (and convert it to a List), as the working queue becomes empty at the end of the computation.
So, neither library is well adapted to such requirements (a finite stream); however scalaz-stream (which is going to become "fs2" after removing the scalaz dependency) is flexible enough to implement your idea. The big "but" is that it will run sequentially by default. There are (at least) two ways to make it faster:
1) split your doSomething into stages, like .flatMap(doSomething1).flatMap(doSomething2).map(doSomething3), and then put additional queues between them (about 3 times faster if the stages take equal time).
2) parallelize the queue processing. Akka has mapAsync for that - it can do maps in parallel automatically. Scalaz-stream has chunks - you can group your q into chunks of, say, 5 and then process each element inside a chunk in parallel. Either way, neither solution (akka vs scalaz) is well adapted to using one queue as both input and output.
But, again, it's all too complex and pointless, as there is a classic, simple way:
import scala.annotation.tailrec

@tailrec def calculate(l: List[Int], acc: List[Int]): List[Int] =
  if (l.isEmpty) acc else {
    val processed = l.flatMap(doSomething)
    calculate(processed, acc ++ processed)
  }
scala> calculate(List(3,5,7), Nil)
res5: List[Int] = List(2, 4, 6, 1, 3, 5, 0, 2, 4, 1, 3, 0, 2, 1, 0)
And here is the parallelized one:
@tailrec def calculate(l: List[Int], acc: List[Int]): List[Int] =
  if (l.isEmpty) acc else {
    val processed = l.par.flatMap(doSomething).toList
    calculate(processed, acc ++ processed)
  }
scala> calculate(List(3,5,7), Nil)
res6: List[Int] = List(2, 4, 6, 1, 3, 5, 0, 2, 4, 1, 3, 0, 2, 1, 0)
So, yes, I would say that neither scalaz-stream nor akka-streams fits your requirements; classic Scala parallel collections, however, fit perfectly.
If you need distributed calculation across multiple JVMs - take a look at Apache Spark; its Scala DSL uses the same map/flatMap/fold style. It allows you to work with big collections (by scaling them across JVMs) that don't fit into a single JVM's memory, so you can improve the @tailrec def calculate by using an RDD instead of a List. It will also give you instruments to handle failures inside doSomething.
P.S. So here is why I don't like the idea of using streaming libraries for such tasks: streaming is more about infinite streams coming from external systems (like HttpRequests), not about computation over predefined (even if big) data.
P.S.2 If you need reactive-style (non-blocking) processing, you might use Future (or scalaz.concurrent.Task) + Future.sequence.
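For example, a minimal non-blocking sketch of that last suggestion with plain scala.concurrent.Future (the name calculateAsync and the timeout are illustrative, not part of the answer above):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

def doSomething(i: Int): List[Int] = if (i == 0) Nil else List(i - 1)

// Process one "level" of the work queue in parallel, then recurse on its output
// until doSomething stops producing new items.
def calculateAsync(l: List[Int], acc: List[Int] = Nil): Future[List[Int]] =
  if (l.isEmpty) Future.successful(acc)
  else Future.sequence(l.map(i => Future(doSomething(i))))
    .map(_.flatten)
    .flatMap(processed => calculateAsync(processed, acc ++ processed))

// Block only at the very edge of the program, e.g. in a test:
// Await.result(calculateAsync(List(3, 5, 7)), 10.seconds)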

Seq with maximal elements

I have a Seq and a function Int => Int. What I need to achieve is to take from the original Seq only those elements that are mapped to the maximum of the resulting sequence (the one I'll have after applying the given function):
def mapper: Int => Int = x => x * x
val s = Seq(-2, -2, 2, 2)
val themax = s.map(mapper).max
s.filter(mapper(_) == themax)
But this seems wasteful, since it has to map twice (once for the filter, once for the maximum).
Is there a better way to do this? (without using a cycle, hopefully)
EDIT
The code has since been edited; in the original, this was the filter line: s.filter( mapper(_)==s.map(mapper).max ). As om-nom-nom has pointed out, this evaluates s.map(mapper).max on each (filter) iteration, leading to quadratic complexity.
Here is a solution that does the mapping only once, using the foldLeft function:
The principle is to go through the seq and, for each element, compare its mapped value with the maximum mapped so far: if it is greater, begin a new sequence with it; if it is equal, append it (with its mapped value) to the sequence of maximums collected so far; if it is less, keep the previously computed Seq of maximums.
def getMaxElems1(s: Seq[Int])(mapper: Int => Int): Seq[Int] =
  s.foldLeft(Seq[(Int, Int)]()) { (res, elem) =>
    val e2 = mapper(elem)
    if (res.isEmpty || e2 > res.head._2)
      Seq((elem, e2))
    else if (e2 == res.head._2)
      res ++ Seq((elem, e2))
    else res
  }.map(_._1) // keep only the original elements
// test with your list
scala> getMaxElems1(s)(mapper)
res14: Seq[Int] = List(-2, -2, 2, 2)
//test with a list containing also non maximal elements
scala> getMaxElems1(Seq(-1, 2,0, -2, 1,-2))(mapper)
res15: Seq[Int] = List(2, -2, -2)
Remark: About complexity
The algorithm I present above has a complexity of O(N) for a list with N elements. However, your approach, done with a single map pass, is also O(N):
the operation of mapping all elements is of complexity O(N)
the operation of computing the max is of complexity O(N)
the operation of zipping is of complexity O(N)
the operation of filtering the list according to the max is also of complexity O(N)
the final extraction of the original elements is of complexity O(M), with M the number of maximal elements
So, finally, the algorithm you presented in your question has the same complexity (quality) as mine; moreover, the solution you present is clearer than mine. So, even though foldLeft is more powerful, for this operation I would recommend your idea, but zipping the original list and computing the map only once (especially if your map is more complicated than a simple square). Here is that solution, worked out with the help of scala_newbie in the question/chat/comments.
def getMaxElems2(s: Seq[Int])(mapper: Int => Int): Seq[Int] = {
  val mappedS = s.map(mapper)                 // map done only once
  val m = mappedS.max                         // find the max
  s.zip(mappedS).filter(_._2 == m).unzip._1   // keep only the maximal elements
}
// test with your list
scala> getMaxElems2(s)(mapper)
res16: Seq[Int] = List(-2, -2, 2, 2)
//test with a list containing also non maximal elements
scala> getMaxElems2(Seq(-1, 2,0, -2, 1,-2))(mapper)
res17: Seq[Int] = List(2, -2, -2)

Dividing Scala iterators leads to GCoverhead/JavaHeapSpace problems

I am processing large data with Scala, so memory and time are even more important companions than they usually are to me. I am trying to increase the speed of some evaluation by subdividing the initial Iterator[String] obtained by getLines on a large source file, in order to do some sub-evaluation in parallel and merge the results. I do this by recursively slice-ing the iterator into two halves and calling the recursive function on each sub-iterator.
Now, I am wondering why I get GC overhead or Java heap space exceptions, although the "critical" elements are only evaluated once before the recursion step (in order to get the size of the iterator) and, in my opinion, not in the recursion step itself, because slice returns an iterator again (which is non-strict by implementation). The following (reduced!) code will fail when applied to a ~15 GB file, before even concatenating the sub-lists.
I use .duplicate in each step. I looked up the API; the doc of .duplicate says "The implementation may allocate temporary storage for elements iterated by one iterator but not yet by the other.", but no element has been iterated yet. Could someone give me a hint about what is going wrong there and how to solve this problem? Thank you so much!
type itType = Iterator[String]
def src = io.Source.fromFile(args(0)).getLines
// recursively divide into equal size blocks in divide&conquer fashion
def getSubItsDC(it: itType, depth: Int = 4) = {
println("Getting length of file..")
val totalSize = src.length
println(totalSize)
def rec(it_rec: itType = it, depth_rec: Int = depth, size: Int = totalSize):
List[itType] = depth_rec match {
case n if n > 0 =>
println(n)
val (it1, it2) = it_rec.duplicate
val newSize = size/2
rec(it1 slice (0,newSize), n-1, newSize) ++
rec(it2 slice (newSize,size), n-1, newSize)
case n if n == 0 => List(it_rec)
}
println("Starting recursion..")
rec()
}
getSubItsDC(src)
In the REPL the code runs equally fast with iterators of arbitrary size (when hard-coding the totalSize), so I assumed the laziness was working correctly.
I think you might be better off using itr grouped size to get an Iterator of chunks (a GroupedIterator, i.e. effectively an Iterator[Seq[String]]):
scala> val itr = (1 to 100000000).iterator grouped 1000000
itr: Iterator[Int]#GroupedIterator[Int] = non-empty iterator
This will allow you to chunk the processing of parts of your file.
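For example, a rough sketch of that idea (the file path and the per-chunk work are placeholders, not taken from the question):
// Read the file lazily and process it in chunks of 1,000,000 lines;
// each chunk is a plain Seq[String] that can be handled in parallel.
val lines = io.Source.fromFile("big.txt").getLines
val partialResults = lines.grouped(1000000).map { chunk =>
  chunk.par.count(_.nonEmpty) // placeholder sub-evaluation on one chunk
}
println(partialResults.sum)   // merge the per-chunk results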
Why your solution uses too much memory
Duplicating an Iterator means that the Iterator may have to cache its computed values, because one copy can run ahead of the other. For example:
scala> val itr = (1 to 100000000).iterator
itr: Iterator[Int] = non-empty iterator
scala> itr filter (_ % 10000000 == 0) foreach println
10000000
....
100000000
But when I take a duplicate:
scala> val (a, b) = (1 to 100000000).iterator.duplicate
a: Iterator[Int] = non-empty iterator
b: Iterator[Int] = non-empty iterator
scala> a filter (_ % 10000000 == 0) foreach println
//oh dear, garbage collecting
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
In this example, as I run through a, the elements that a has iterated over but which b has not yet seen need to be cached, in order for b to still be a duplicate.

Scala: what is the most appropriate data structure for sorted subsets?

Given a large collection (let's call it 'a') of elements of type T (say, a Vector or List) and an evaluation function 'f' (say, (T) => Double), I would like to derive from 'a' a result collection 'b' that contains the N elements of 'a' that yield the highest values under f. The collection 'a' may contain duplicates. It is not sorted.
Maybe leaving the question of parallelizability (map/reduce etc.) aside for a moment, what would be the appropriate Scala data structure for compiling the result collection 'b'? Thanks for any pointers / ideas.
Notes:
(1) I guess my use case can be most concisely expressed as
val a = Vector( 9,2,6,1,7,5,2,6,9 ) // just an example
val f : (Int)=>Double = (n)=>n // evaluation function
val b = a.sortBy( f ).take( N ) // sort, then clip
except that I do not want to sort the entire set.
(2) one option might be an iteration over 'a' that fills a TreeSet with 'manual' size bounding (reject anything worse than the worst item in the set, don't let the set grow beyond N). However, I would like to retain duplicates present in the original set in the result set, and so this may not work.
(3) if a sorted multi-set is the right data structure, is there a Scala implementation of this? Or a binary-sorted Vector or Array, if the result set is reasonably small?
You can use a priority queue:
def firstK[A](xs: Seq[A], k: Int)(implicit ord: Ordering[A]) = {
  val q = new scala.collection.mutable.PriorityQueue[A]()(ord.reverse)
  val (before, after) = xs.splitAt(k)
  q ++= before
  after.foreach(x => q += ord.max(x, q.dequeue))
  q.dequeueAll
}
We fill the queue with the first k elements and then compare each additional element to the head of the queue, swapping as necessary. This works as expected and retains duplicates:
scala> firstK(Vector(9, 2, 6, 1, 7, 5, 2, 6, 9), 4)
res14: scala.collection.mutable.Buffer[Int] = ArrayBuffer(6, 7, 9, 9)
And it doesn't sort the complete list. I've got an Ordering in this implementation, but adapting it to use an evaluation function would be pretty trivial.
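For instance, a sketch of that adaptation using the question's own f (the identity here, so the result is the same as above): build the Ordering from the evaluation function and pass it explicitly.
val f: Int => Double = n => n
firstK(Vector(9, 2, 6, 1, 7, 5, 2, 6, 9), 4)(Ordering.by(f))
// => ArrayBuffer(6, 7, 9, 9), unchanged, since ordering by the identity
//    is the same as the natural ordering used above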

Incrementing the for loop (loop variable) in scala by power of 5

I had asked this question on Javaranch, but couldn't get a response there. So posting it here as well:
I have this particular requirement where the increment of the loop variable is to be done by multiplying it by 5 after each iteration. In Java we could implement it this way:
for(int i=1;i<100;i=i*5){}
In scala I was trying the following code-
var j=1
for(i<-1.to(100).by(scala.math.pow(5,j).toInt))
{
println(i+" "+j)
j=j+1
}
But its printing the following output:
1 1
6 2
11 3
16 4
21 5
26 6
31 7
36 8
....
....
It's always incrementing by 5. So how do I go about actually multiplying by 5 instead of adding it?
Let's first explain the problem. This code:
var j=1
for(i<-1.to(100).by(scala.math.pow(5,j).toInt))
{
println(i+" "+j)
j=j+1
}
is equivalent to this:
var j = 1
val range: Range = Predef.intWrapper(1).to(100)
val increment: Int = scala.math.pow(5, j).toInt
val byRange: Range = range.by(increment)
byRange.foreach { i =>
  println(i+" "+j)
  j=j+1
}
So, by the time you get to mutate j, increment and byRange have already been computed. And Range is an immutable object -- you can't change it. Even if you produced new ranges while you did the foreach, the object doing the foreach would still be the same.
Now, to the solution. Simply put, Range is not adequate for your needs. You want a geometric progression, not an arithmetic one. To me (and pretty much everyone else answering, it seems), the natural solution would be to use a Stream or Iterator created with iterate, which computes the next value based on the previous one.
for(i <- Iterator.iterate(1)(_ * 5) takeWhile (_ < 100)) {
println(i)
}
EDIT: About Stream vs Iterator
Stream and Iterator are very different data structures that share the property of being non-strict. This property is what enables iterate to exist at all, since this method creates an infinite collection (1), from which takeWhile will then create a new (2), finite collection. Let's see here:
val s1 = Stream.iterate(1)(_ * 5) // s1 is infinite
val s2 = s1.takeWhile(_ < 100) // s2 is finite
val i1 = Iterator.iterate(1)(_ * 5) // i1 is infinite
val i2 = i1.takeWhile(_ < 100) // i2 is finite
These infinite collections are possible because the collection is not pre-computed. On a List, all elements inside the list are actually stored somewhere by the time the list has been created. On the above examples, however, only the first element of each collection is known in advance. All others will only be computed if and when required.
As I mentioned, though, these are very different collections in other respects. Stream is an immutable data structure. For instance, you can print the contents of s2 as many times as you wish, and it will show the same output every time. On the other hand, Iterator is a mutable data structure: once you have used a value, that value is gone forever. Print the contents of i2 twice, and it will be empty the second time around:
scala> s2 foreach println
1
5
25
scala> s2 foreach println
1
5
25
scala> i2 foreach println
1
5
25
scala> i2 foreach println
scala>
Stream, on the other hand, is a lazy collection. Once a value has been computed, it will stay computed, instead of being discarded or recomputed every time. See below one example of that behavior in action:
scala> val s2 = s1.takeWhile(_ < 100) // s2 is finite
s2: scala.collection.immutable.Stream[Int] = Stream(1, ?)
scala> println(s2)
Stream(1, ?)
scala> s2 foreach println
1
5
25
scala> println(s2)
Stream(1, 5, 25)
So Stream can actually fill up the memory if one is not careful, whereas Iterator occupies constant space. On the other hand, one can be surprised by Iterator, because of its side effects.
(1) As a matter of fact, Iterator is not a collection at all, even though it shares a lot of the methods provided by collections. On the other hand, from the problem description you gave, you are not really interested in having a collection of numbers, just in iterating through them.
(2) Actually, though takeWhile will create a new Iterator on Scala 2.8.0, this new iterator will still be linked to the old one, and changes in one have side effects on the other. This is subject to discussion, and they might end up being truly independent in the future.
In a more functional style:
scala> Stream.iterate(1)(i => i * 5).takeWhile(i => i < 100).toList
res0: List[Int] = List(1, 5, 25)
And with more syntactic sugar:
scala> Stream.iterate(1)(_ * 5).takeWhile(_ < 100).toList
res1: List[Int] = List(1, 5, 25)
Maybe a simple while-loop would do?
var i=1;
while (i < 100)
{
println(i);
i*=5;
}
or if you want to also print the number of iterations
var i=1;
var j=1;
while (i < 100)
{
println(j + " : " + i);
i*=5;
j+=1;
}
It seems you guys like functional, so how about a recursive solution?
import scala.annotation.tailrec

@tailrec def quints(n: Int): Unit = {
  println(n)
  if (n * 5 < 100) quints(n * 5)
}
Update: Thanks for spotting the error... it should of course be power, not multiply:
Annoyingly, there doesn't seem to be an integer pow function in the standard library!
Try this:
def pow5(i:Int) = math.pow(5,i).toInt
Iterator from 1 map pow5 takeWhile (100>=) toList
Or if you want to use it in-place:
Iterator from 1 map pow5 takeWhile (100>=) foreach {
j => println("number:" + j)
}
and with the indices:
val iter = Iterator from 1 map pow5 takeWhile (100>=)
iter.zipWithIndex foreach { case (j, i) => println(i + " = " + j) }
(0 to 2).map (math.pow (5, _).toInt).zipWithIndex
res25: scala.collection.immutable.IndexedSeq[(Int, Int)] = Vector((1,0), (5,1), (25,2))
produces a Vector, but with (i, j) in reversed order.
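If the (i, j) order matters, a small tweak (just a sketch) restores it:
(0 to 2).map(i => (i, math.pow(5, i).toInt))
// or: (0 to 2).map(math.pow(5, _).toInt).zipWithIndex.map(_.swap)
// => Vector((0,1), (1,5), (2,25))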