The following Scala code (on 2.9.2):
var a = (0 until 100000).toStream
for (i <- 0 until 100000) {
  val memTot = Runtime.getRuntime().totalMemory().toDouble / (1024.0 * 1024.0)
  println((i, a.size, memTot))
  a = a.map(identity)
}
uses an ever-increasing amount of memory on every iteration of the loop. If a is defined as ( 0 until 100000 ).toList, the memory usage is stable (give or take GC).
I understand that streams evaluate lazily but retain elements once they are generated. But it appears that in my code above, each new stream (generated by the last line of the loop body) somehow keeps a reference to the previous streams. Can someone help explain?
Here is what happens: a Stream is always evaluated lazily, but already-computed elements are "cached" for later. Lazy evaluation is crucial. Look at this piece of code:
a = a.flatMap( v => Some( v ) )
Although it looks as if you were transforming one Stream into another and discarding the old one, that is not what happens. The new Stream still keeps a reference to the old one. That's because the resulting Stream must not eagerly compute all elements of the underlying stream, but do so on demand. Take this as an example:
io.Source.fromFile("very-large.file").getLines().toStream.
  map(_.trim).
  filter(_.contains("X")).
  map(_.substring(0, 10)).
  map(_.toUpperCase)
You can chain as many operations as you want, but the file is barely touched (just enough to read the first line). Each subsequent operation just wraps the previous Stream, holding a reference to its child stream. Evaluation starts the moment you ask for size or call foreach.
Back to your code. In the second iteration you create a third stream, holding a reference to the second one, which in turn keeps a reference to the one you initially defined. Basically, you have a growing stack of pretty big objects.
But this doesn't explain why memory leaks so fast. The crucial part is... println(), or a.size to be precise. Without printing (and thus evaluating the whole Stream), the Stream remains unevaluated. An unevaluated stream doesn't cache any values, so it's very slim. Memory would still leak due to the growing chain of streams wrapping one another, but much, much more slowly.
This raises the question: why does it work with toList? It's quite simple: List.map() eagerly creates a new List. Period. The previous one is no longer referenced and becomes eligible for GC.
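To see the contrast, here is the toList variant mentioned in the question, written out in full. Since List.map is eager, each iteration's old list becomes unreachable and can be collected, so memory stays stable:
var a = (0 until 100000).toList
for (i <- 0 until 100000) {
  val memTot = Runtime.getRuntime().totalMemory().toDouble / (1024.0 * 1024.0)
  println((i, a.size, memTot))
  a = a.map(identity) // eager: the previous list is garbage after this line
}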
I have the following code (a simplification of a complex situation):
val newRDD = prevRDD.flatMap { a =>
  Array.fill[Int](scala.util.Random.nextInt(10)) { scala.util.Random.nextInt(2) }
}.persist()
val a = newRDD.count
val b = newRDD.count
and even though the RDD is supposed to be persisted (and therefore consistent), a and b are not identical in most cases.
Is there a way to keep the results of the first action consistent, so that when the second action is called, the results of the first action will be returned?
* Edit *
The issue I have is apparently caused by the zipWithIndex method in my code, which creates indices higher than the count. I'll ask about it in a different thread. Thanks
There is no way to make it 100% consistent.
When you call persist, it will try to cache all of the partitions in memory, if they fit.
Otherwise, it will recompute any partitions that do not fit in memory.
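If the root cause really is the unseeded randomness (rather than the zipWithIndex issue mentioned in the edit), one workaround sketch, assuming a stable partitioning and iteration order for prevRDD from the question, is to derive a fixed seed per partition, so that a recomputed partition replays exactly the same pseudo-random sequence:
val newRDD = prevRDD.mapPartitionsWithIndex { (partIdx, iter) =>
  val rng = new scala.util.Random(partIdx) // deterministic seed per partition
  iter.flatMap { a =>
    Array.fill[Int](rng.nextInt(10))(rng.nextInt(2))
  }
}.persist()
With this, losing a cached partition is harmless: recomputation reproduces the same data.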
This is a follow-up to my previous question.
I understand that we can use streams to generate an approximation of 'pi' (and other numbers), the n-th Fibonacci number, etc. However, I doubt that streams are the right approach for that.
The main drawback (as I see it) is memory consumption: e.g. the stream will retain all Fibonacci numbers for i < n while I need only the n-th one. Of course, I can use drop, but it makes the solution a bit more complicated. Tail recursion looks like a more suitable approach to tasks like that.
What do you think?
If you need to go fast, travel light. That means: avoid allocating any unnecessary memory. If you need memory, use the fastest collections available. If you know how much memory you need, preallocate. Allocation is the absolute performance killer... for computation. Your code may not look nice anymore, but it will go fast.
However, if you're working with IO (disk, network) or any user interaction, then allocation pales in comparison. It's then better to shift priority from code performance to maintainability.
Use Iterator. It does not retain intermediate values.
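For instance, a minimal sketch of the n-th Fibonacci number via an Iterator over (current, next) pairs; the iterator keeps only its current position, so nothing behind it is retained:
def fib(n: Int): BigInt =
  Iterator.iterate((BigInt(0), BigInt(1))) { case (a, b) => (b, a + b) }
    .drop(n).next()._1
A call like fib(100000) runs in constant space, aside from the size of the BigInt values themselves.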
If you want the n-th Fibonacci number and use a stream just as a temporary data structure (i.e. you do not hold references to previously computed elements of the stream), then your algorithm will run in constant space.
Previously computed elements of a Stream (which are not used anymore) are going to be garbage collected. And as they were allocated in the youngest generation and immediately collected, almost all allocations might stay in cache.
Update:
It seems that the current implementation of Stream is not as space-efficient as it could be, mainly because it inherits the implementation of the apply method from the LinearSeqOptimized trait, where it is defined as
def apply(n: Int): A = {
  val rest = drop(n)
  if (n < 0 || rest.isEmpty) throw new IndexOutOfBoundsException("" + n)
  rest.head
}
A reference to the head of the stream is held here by this, which prevents the stream from being gc'ed. So a combination of the drop and head methods (as in f.drop(100).head) may be better for situations where dropping intermediate results is feasible. (Thanks to Sebastien Bocq for explaining this stuff on scala-user.)
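For example, a sketch of a non-memoizing Fibonacci stream that relies on exactly this drop-then-head combination; because the stream is not self-referential, the cells that drop walks past become garbage immediately:
def fibFrom(a: BigInt, b: BigInt): Stream[BigInt] = a #:: fibFrom(b, a + b)
def fib(n: Int): BigInt = fibFrom(0, 1).drop(n).head
Contrast this with a self-referential lazy val fibs: Stream[BigInt] definition, whose self-reference pins the head and so caches every computed element.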
As I understand it, Stream retains the recently evaluated elements. I guess it does not retain all evaluated elements (that would not be feasible), so it probably uses some internal "cache".
Is this correct? Can I control the size and policies of this cache?
Streams are like lists that generate their members as they are required. Once an element has been generated, it is retained in the stream and reused.
For example:
lazy val naturals: Stream[Int] = Stream.cons(0, naturals.map{_ + 1})
will give you a stream of the natural numbers. If I call
naturals(5)
it will generate elements 0 through 5 and return 5. If I then call
naturals(8)
it will reuse the first 6 elements and generate 3 more.
If you are concerned about memory usage, you can use Stream.drop(num) to produce a new stream with num elements removed from the beginning, allowing the truncated elements to be garbage collected along with the old stream (provided nothing else still references it). For example:
naturals(5) //returns 5
val truncated = naturals.drop(4)
truncated(5) //returns 9
The Stream object retains references to all elements that have been evaluated or accessed so far. Stream works like a List: every element that can be reached from a held reference, and which has already been accessed at least once, won't be garbage collected.
So basically your pointers into the stream and what you have evaluated so far define what will get cached.
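A small sketch of the difference (a hypothetical illustration, names are mine):
val cached = Stream.from(1) // the val pins the head of the stream
cached(100000)              // forces and caches 100001 cells; all stay reachable via cached

def fresh = Stream.from(1)  // a def builds a new stream on every call
fresh(100000)               // the whole stream becomes garbage once this expression finishes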
I want to use parallel arrays for a task, and before I start with the coding, I'd be interested in knowing if this small snippet is thread-safe:
import collection.mutable._
val listBuffer = ListBuffer[String]("one","two","three","four","five","six","seven","eight","nine")
val jSyncList = java.util.Collections.synchronizedList(new java.util.ArrayList[String]())
listBuffer.par.foreach { e =>
  println("processed :" + e)
  // using sleep here to simulate a random delay
  Thread.sleep((scala.math.random * 1000).toLong)
  jSyncList.add(e)
}
jSyncList.toArray.foreach(println)
Are there better ways of processing something with parallel collections and accumulating the results elsewhere?
The code you posted is perfectly safe; I'm not sure about the premise, though: why do you need to accumulate the results of a parallel collection in a non-parallel one? One of the whole points of the parallel collections is that they look like other collections.
I think that parallel collections will also provide a seq method to switch to sequential ones, so you should probably use that!
For this pattern to be safe:
listBuffer.par.foreach { e => f(e) }
f has to be able to run concurrently in a safe way. I think the same rules you need for safe multi-threading apply: access to shared state needs to be thread-safe, the order of the f calls for different e won't be deterministic, and you may run into deadlocks as you start synchronizing your statements in f.
Additionally, I'm not clear what guarantees the parallel collections give you about the underlying collection being modified while it is being processed, so a mutable ListBuffer, which can have elements added or removed, is possibly a poor choice. You never know when the next coder will call something like foo(listBuffer) before your foreach and pass that reference to another thread which may mutate the list while it's being processed.
Other than that, I think this is a fine pattern for any f that takes a long time, can be called concurrently, and where e can be processed out of order:
immutCol.par.foreach { e => threadSafeOutOfOrderProcessingOf(e) }
Disclaimer: I have not tried parallel collections myself, but I'm looking forward to having SO questions/answers show us what works well.
The synchronizedList should be safe, though the println may give unexpected results - you have no guarantees of the order in which items will be printed, or even that your printlns won't be interleaved mid-character.
A synchronized list is also unlikely to be the fastest way you can do this. A safer solution is to map over an immutable collection (Vector is probably your best bet here), then print all the lines (in order) afterwards:
val input = Vector("one","two","three","four","five","six","seven","eight","nine")
val output = input.par.map { e =>
  val msg = "processed :" + e
  // using sleep here to simulate a random delay
  Thread.sleep((math.random * 1000).toLong)
  msg
}
println(output mkString "\n")
You'll also note that this code has about as much practical usefulness as your example :)
This code is plain weird -- why add stuff in parallel to something that needs to be synchronized? You'll add contention and gain absolutely nothing in return.
The principle of the thing -- accumulating results from parallel processing -- is better served by operations like fold, reduce or aggregate.
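For example, a sketch using aggregate (illustrative data, not from the question): each worker folds elements into its own accumulator, and the accumulators are then combined, so no shared mutable state is needed:
val words = Vector("one", "two", "three", "four", "five").par
val totalLength = words.aggregate(0)(
  (acc, w) => acc + w.length, // fold an element into a per-worker accumulator
  (a, b) => a + b             // combine accumulators from different workers
)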
The code you've posted is safe - there will be no errors due to inconsistent state of your array list, because access to it is synchronized.
However, parallel collections process items concurrently (at the same time) AND out of order. Out of order means that the 54th element may be processed before the 2nd element - so your synchronized array list will contain items in no predefined order.
In general, it's better to use map, filter and other functional combinators to transform one collection into another - these ensure that ordering guarantees are preserved if the collection has them (as Seqs do). For example:
ParArray(1, 2, 3, 4).map(_ + 1)
always returns ParArray(2, 3, 4, 5).
However, if you need a specific thread-safe collection type, such as a ConcurrentSkipListMap or a synchronized collection, to be passed to some method in some API, modifying it from a parallel foreach is safe.
Finally, a note - parallel collections provide parallel bulk operations on data. Mutable parallel collections are not thread-safe in the sense that you can add elements to them from different threads. Mutable operations like inserting into a map or appending to a buffer still have to be synchronized.
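A small sketch of that last point (illustrative only): appending to a shared mutable buffer from a parallel foreach must be guarded by a lock, otherwise it is a data race:
import scala.collection.mutable.ArrayBuffer
val results = ArrayBuffer.empty[Int]
(1 to 100).par.foreach { i =>
  results.synchronized { results += i * i } // the lock makes the concurrent appends safe
}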
In the Scala 2.8 collections framework, what is the difference between view and toStream?
In a view elements are recomputed each time they are accessed. In a stream elements are retained as they are evaluated.
For example:
val doubled = List(1,2,3,4,5,6,7,8,9,10).view.map(_*2)
println(doubled.mkString(" "))
println(doubled.mkString(" "))
will re-evaluate the map for each element twice: once for the first println, and again for the second. In contrast,
val doubled = List(1,2,3,4,5,6,7,8,9,10).toStream.map(_*2)
println(doubled.mkString(" "))
println(doubled.mkString(" "))
will only double the elements once.
A view is like a recipe to create a collection. When you ask for elements of a view it carries out the recipe each time.
A stream is like a guy with a bunch of dry-erase cards. The guy knows how to compute subsequent elements of the collection. You can ask him for the next element of the collection, and he gives you a card with the element written on it and a string tied from the card to his finger (to help him remember). Also, before he gives you a card, he unties the string from his finger and ties it to the new card.
If you hold onto the first card (i.e. keep a reference to the head of the stream) you might eventually run out of cards (i.e. memory) when you ask for the next element, but if you don't need to go back to the first elements you can cut the string and hand the unneeded cards back to the guy and he can re-use them (they're dry-erase, after all). This is how a stream can represent an infinite sequence without running out of memory.
Geoff's answer covers almost everything, but I want to add that a Stream is a List-like sequence, while every kind of collection (maps, sets, indexed seqs) has a view.
Another way to explain this, if you know Apache Spark, is that using a stream is like caching a Spark dataset, whereas using a view is like using an uncached dataset: every time you call an action on it, it will re-evaluate everything in the DAG.