As I understand it, Stream retains recently evaluated elements. I guess it does not retain all evaluated elements (that would not be feasible), so it probably uses some internal "cache".
Is that correct? Can I control the size and policies of this cache?
Streams are like lists that generate their members as they are required. Once an element has been generated, it is retained in the stream and reused.
For example:
lazy val naturals: Stream[Int] = Stream.cons(0, naturals.map{_ + 1})
will give you a stream of the natural numbers. If I call
naturals(5)
it will generate elements 0-5 and return 5. If I then call
naturals(8)
It will reuse the first 6 elements and generate 3 more.
If you are concerned about memory usage, you can use drop(num) to produce a new stream with the first num elements removed, allowing the truncated elements to be garbage collected along with the old stream. For example:
naturals(5) //returns 5
val truncated = naturals.drop(4)
truncated(5) //returns 9
The Stream object retains all elements that have been evaluated/accessed so far. A Stream works like a List: every element that can be reached from a reference you hold, and that has already been accessed at least once, won't be garbage collected.
So basically the references you keep into the stream, together with what you have evaluated so far, define what gets cached.
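For illustration, a minimal sketch of that point (assuming a fresh Stream.from(0), so nothing else holds on to the head): evaluated elements stay cached only while some reference can still reach them, and re-pointing your only reference at a dropped suffix lets the evaluated prefix be garbage collected.
var window: Stream[Int] = Stream.from(0)   // the only reference to the head
window(1000)                               // forces and caches elements 0..1000
window = window.drop(1000)                 // the evaluated prefix is now unreachable
println(window(5))                         // 1005; only the suffix is retained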
I'm looking for a data structure that I can use for Snapshot (a sequence) in the following example:
val oldSnapshot = Snapshot(3, 2, 1)
val newSnapshot = 4 +: (oldSnapshot dropRight 1) // basically a shift
// newSnapshot is now passed to another function that also knows the old snapshot
// this check should be fast if true
if (newSnapshot.tail == (oldSnapshot dropRight 1))
Background: I need an immutable data structure that stores a snapshot of the last n items that appeared in a stream. It is updated when a new item appears in the stream, and the oldest item is dropped, so the length is always at most n and the snapshots resemble a sliding window over the last n elements. In rare cases the stream can be interrupted and restarted. In case of a restart, the stream first emits at least n older elements before it continues to emit new "live" elements. However, some elements may be different, so I cannot be sure that a new snapshot of the recent history can be derived from an older snapshot just by appending new elements.
I further have a component that consumes a stream of these snapshots and does some incremental processing of the elements. It might for instance keep track of the sum of the elements. For a new snapshot it has to decide whether it was derived by appending one or a few elements to the end of the last known snapshot (and dropping the oldest elements) so it doesn't have to process all the old items again but can reuse some intermediate results.
By far the most common case is that the snapshot was shifted to include a single new element and the oldest one was dropped. This case should be recognized really fast. If I kept track of the whole history of elements and did not drop the oldest, I could use a List and just compare the tail of the new list to the last known list. Internally, it would compare object identity, and in most cases this would be enough to see that the lists are identical.
I'm thinking about using a Vector or a similar data structure for the snapshots, and I'm wondering whether such comparisons would also be guaranteed to be efficient in this sense, or whether there is a better-suited data structure that internally uses object identity checks on subcollections to quickly determine whether two instances are identical.
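For reference, a minimal sketch of the List-based approach I have in mind (the "keep the whole history" variant, names are only illustrative): prepending shares the old list as the tail of the new one, so the common case can be recognized with a constant-time reference check before falling back to a structural comparison.
val oldHistory: List[Int] = List(3, 2, 1)
val newHistory: List[Int] = 4 :: oldHistory   // structural sharing: the tail IS oldHistory

// fast path: O(1) identity check; structural comparison only as a fallback
val isShiftByOne = (newHistory.tail eq oldHistory) || newHistory.tail == oldHistory
println(isShiftByOne)   // true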
I have the following code (a simplification of a complex situation):
val newRDD = prevRDD.flatMap { a =>
  Array.fill[Int](scala.util.Random.nextInt(10)) { scala.util.Random.nextInt(2) }
}.persist()
val a = newRDD.count
val b = newRDD.count
and even though the RDD is supposed to be persisted (and therefore consistent), a and b are not identical in most cases.
Is there a way to keep the results of the first action consistent, so when the second "action" will be called, the results of the first action will be returned?
* Edit *
The issue I have is apparently caused by the zipWithIndex method in my code, which creates indices higher than the count. I'll ask about it in a different thread. Thanks
There is no way to make it 100% consistent.
When you call persist, Spark will try to cache all of the partitions in memory if they fit.
Otherwise, it will recompute the partitions that do not fit in memory.
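If the inconsistency comes from unseeded randomness, as in the snippet above, one workaround is to make the computation itself deterministic so that a recomputed partition reproduces the same values. A rough sketch (assuming prevRDD is an ordinary RDD, and using a per-partition seed purely as an illustration):
import scala.util.Random

val newRDD = prevRDD.mapPartitionsWithIndex { (partitionIndex, iter) =>
  val rng = new Random(partitionIndex)   // deterministic seed per partition
  iter.flatMap(_ => Array.fill(rng.nextInt(10))(rng.nextInt(2)))
}.persist()
Checkpointing the RDD, or writing it out and reading it back, are heavier alternatives that also pin the computed values.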
Suppose I want to do a groupBy on an iterator; the compiler complains that "value groupBy is not a member of Iterator[Int]". One way would be to convert the iterator to a list, which I want to avoid. I want to do the groupBy such that the input is Iterator[A] and the output is Map[B, Iterator[A]], so that a part of the iterator is loaded into memory only when its elements are accessed, instead of loading the whole list into memory. I also know the possible set of keys, so I can tell whether a particular key exists.
def groupBy[A, B](iter: Iterator[A], f: A => B): Map[B, Iterator[A]] = {
  ??? // to be implemented
}
One possibility is to convert the Iterator to a view and then call groupBy on it:
iter.toTraversable.view.groupBy(_.whatever)
I don't think this is doable without storing results in memory (and in this case switching to a list would be much easier). Iterator implies that you can make only one pass over the whole collection.
For instance, let's say you have the sequence 1 2 3 4 5 6 and you want to groupBy odd and even numbers:
groupBy(it, v => v % 2 == 0)
Then you could query the result with either true or false to get an iterator. The problem is that if you loop over one of those two iterators until the end, you cannot do the same for the other one afterwards (you cannot reset an iterator in Scala).
This would be doable if the elements were sorted according to the same rule you're using in groupBy.
As said in other responses, the only way to achieve a lazy groupBy on an Iterator is to buffer elements internally. The worst case for memory is O(n). If you know in advance that the keys are well distributed in your iterator, the buffer can be a viable solution.
The solution is relatively complex, but a good starting point is a few methods from the Iterator trait in the Scala source code:
The partition method, which uses the buffered method to keep the head value in memory, plus two internal queues (lookahead), one for each of the produced iterators.
The span method, which also uses buffered, this time with a single queue for the leading iterator.
The duplicate method. Perhaps less interesting, but we can again observe another use of a queue, here storing the gap between the two produced iterators.
In the groupBy case, we will have a variable number of produced iterators instead of the two in the examples above. A rough sketch of such a method is given below.
Note that you have to know the list of keys in advance. Otherwise, you will need to traverse (and buffer) the entire iterator to collect the different keys to build your Map.
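For what it's worth, here is a rough sketch of such a buffered groupBy (a hypothetical helper, not library code), assuming the set of possible keys is known in advance as in the question. Elements belonging to keys you are not currently reading are parked in per-key queues, so worst-case memory is O(n):
import scala.collection.mutable

def lazyGroupBy[A, B](iter: Iterator[A], f: A => B, keys: Set[B]): Map[B, Iterator[A]] = {
  // one pending-element queue per possible key
  val buffers: Map[B, mutable.Queue[A]] = keys.map(k => k -> mutable.Queue.empty[A]).toMap

  // pull from the source until an element for `key` is buffered (or the source is exhausted)
  def advance(key: B): Boolean = {
    while (buffers(key).isEmpty && iter.hasNext) {
      val a = iter.next()
      buffers.get(f(a)).foreach(_.enqueue(a))   // elements with unexpected keys are ignored
    }
    buffers(key).nonEmpty
  }

  keys.map { key =>
    key -> new Iterator[A] {
      def hasNext: Boolean = advance(key)
      def next(): A =
        if (hasNext) buffers(key).dequeue() else throw new NoSuchElementException
    }
  }.toMap
}
For example, lazyGroupBy(Iterator(1, 2, 3, 4, 5, 6), (v: Int) => v % 2 == 0, Set(true, false)) lets you take a few even numbers without consuming the odd ones first; the odd elements seen along the way are simply buffered.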
The following Scala code (on 2.9.2):
var a = ( 0 until 100000 ).toStream
for ( i <- 0 until 100000 )
{
  val memTot = Runtime.getRuntime().totalMemory().toDouble / ( 1024.0 * 1024.0 )
  println( i, a.size, memTot )
  a = a.map(identity)
}
uses an ever-increasing amount of memory on every iteration of the loop. If a is defined as ( 0 until 100000 ).toList, then the memory usage is stable (give or take GC).
I understand that streams evaluate lazily but retain elements once they are generated. But it appears that in my code above, each new stream (generated by the last line of code) somehow keeps a reference to previous streams. Can someone help explain?
Here is what happens. Stream is always evaluated lazily but already calculated elements are "cached" for later. Lazy evaluation is crucial. Look at this piece of code:
a = a.flatMap( v => Some( v ) )
Although it looks as if you were transforming one Stream into another and discarding the old one, this is not what happens. The new Stream still keeps a reference to the old one. That's because the resulting Stream is not supposed to eagerly compute all elements of the underlying stream; it does that on demand. Take this as an example:
io.Source.fromFile("very-large.file").getLines().toStream.
  map(_.trim).
  filter(_.contains("X")).
  map(_.substring(0, 10)).
  map(_.toUpperCase)
You can chain as many operations as you want, but the file is barely touched (just enough to read the first line). Each subsequent operation just wraps the previous Stream, holding a reference to the child stream. The moment you ask for size or call foreach, evaluation starts.
Back to your code. In the second iteration you create a third stream, holding a reference to the second one, which in turn keeps a reference to the one you initially defined. Basically you have a growing stack of pretty big objects.
But this doesn't explain why memory leaks so fast. The crucial part is... println(), or a.size to be precise. Without printing (and thus evaluating the whole Stream), the Stream remains "unevaluated". An unevaluated stream doesn't cache any values, so it's very slim. Memory would still leak because of the growing chain of streams wrapped in one another, but much, much more slowly.
This raises a question: why does it work with toList? It's quite simple. List.map() eagerly creates a new List. Period. The previous one is no longer referenced and becomes eligible for GC.
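To make that wrapping concrete, a tiny sketch (assuming Scala 2.x Stream semantics): each map returns immediately and merely wraps its source, so the last stream transitively references all of its predecessors until it gets evaluated.
val s1: Stream[Int] = (0 until 5).toStream
val s2 = s1.map(_ + 1)   // not evaluated yet; keeps a reference to s1
val s3 = s2.map(_ * 2)   // not evaluated yet; keeps a reference to s2 (and thus s1)
println(s3.toList)       // evaluation happens here: List(2, 4, 6, 8, 10)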
I am curious about List.updated. What is its runtime? And how does it compare to just changing one element in an ArrayBuffer? In the background, how does it deal with copying all of the list? Is this an O(n) procedure? If so, is there an immutable data structure that has an updated-like method without being so slow?
An example is:
val list = List(1, 2, 3)
val list2 = list.updated(1, 5)   // list2 == List(1, 5, 3)

import scala.collection.mutable.ArrayBuffer
val abuf = ArrayBuffer(1, 2, 3)
abuf(1) = 5                      // abuf == ArrayBuffer(1, 5, 3)
The time and memory complexity of the updated(index, value) method is linear in the index (not in the size of the list): the first index cons cells are recreated, and the suffix after the updated element is shared.
Changing an element in an ArrayBuffer has constant time and memory complexity: the backing array is updated in place, and no copying occurs.
This updated method is not slow if you update elements near the beginning of the list. For larger sequences, Vector has a different way of sharing common parts of the sequence and will probably do less copying.
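To see why the cost is proportional to the index, here is a rough sketch of an updated-style function on List (an illustration only, not the library's actual implementation): the first index cells are rebuilt and the suffix after the updated position is shared unchanged.
def updatedSketch[A](xs: List[A], index: Int, value: A): List[A] = xs match {
  case head :: tail if index > 0 => head :: updatedSketch(tail, index - 1, value) // rebuild prefix cell
  case _ :: tail if index == 0   => value :: tail                                 // share the suffix as-is
  case _                         => throw new IndexOutOfBoundsException(index.toString)
}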
List.updated is an O(n) operation (linear).
It calls the linear List.splitAt operation to split the list at the index into two lists (before, rest), then builds a new list by appending the elements of before, the updated element, and finally the elements of rest.tail.
I'm not sure (this would have to be tested), but it seems that if the updated element is at the start of the list, it may be pretty efficient, as in theory obtaining rest and appending rest.tail could be done in constant time.
I suppose the performance would be O(n), since a List doesn't store an index to each element; it is implemented as a chain of links (el1 -> el2 -> el3), so only list.head is O(1).
You should use an IndexedSeq for that purpose; the most common implementation is Vector.
Vector.updated doesn't copy the whole collection: it rebuilds only the small path of internal nodes leading to the updated element and shares the rest of the structure.
In general, Scala's immutable collections don't copy all of their data when you create a modified instance; they share structure with the original. That is a key difference from Java's collections.
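A quick sketch of what that sharing looks like for Vector (the numbers are only for illustration): updating one element rebuilds just the small path of internal tree nodes leading to it, and the rest of the structure is shared between the old and the new vector.
val v  = Vector.tabulate(1000000)(identity)   // 0, 1, ..., 999999
val v2 = v.updated(500000, -1)                // rebuilds only a path of nodes (O(log32 n))
println(v(500000))    // 500000: the original is unchanged
println(v2(500000))   // -1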