Lazy eager map evaluation - scala

There are basically two options to evaluate a map in Scala.
Lazy evaluation computers the function that is passed as a parameter when the next value is needed. IF the function takes one hour to execute then it's one hour to wait when the value is needed. (e.g. Stream and Iterator)
Eager evaluation computes the function when the map is defined. It produces a new list (Vector or whatever) and stores the results, making the program to be busy in that time.
With Future we can obtain the list (Seq or whatever) in a separate thread, this means that our thread doesn't block, but the results have to be stored.
So I did something different, please check it here.
This was a while ago so I don't remember whether I tested it. The point is to have a map that applies concurrently (non-blocking) and kind of eagerly to a set of elements, filling a buffer (the size of the number of cores in the computer, and not more). This means that:
The invocation of the map doesn't block the current thread.
Obtaining an element doesn't block the current thread (in case there was time to calculate it before and store the result in the buffer).
Infinite lists can be handled because we only prefetch a few results (about 8, depending on the number of cores).
So this all sounds very good and you may be wondering what's the problem. The problem is that this solution is not particularly elegant IMHO. Assuming the code I shared works in Java and or Scala, to iterate over the elements in the iterable produced by the map I would only need to write:
new CFMap(whateverFunction).apply(whateverIterable)
However what I would like to write is something like:
whateverIterable.bmap(whateverFunction)
As it is usual in Scala (the 'b' is for buffered), or perhaps something like:
whateverThing.toBuffered.map(whateverFunction)
Either of them works for me. So the question is, how can I do this in an idiomatic way in Scala? Some options:
Monads: create a new collection "Buffered" so that I can use the toBuffered method (that should be added to the previous ones as an implicit) and implement map and everything else for this Buffered thing (sounds like quite some work).
Implicits: create an implicit method that transforms the usual collections or the superclass of them (I'm not sure about which one should it be, Iterable maybe?) to something else to which I can apply the .bmap method and obtain something from it, probably an iterable.
Other: there are probably many options I have not considered so far. It's possible that some library does already implement this (I'd be actually surprised of the opposite, I can't believe nobody thought of this before). Using something that has already been done is usually a good idea.
Please let me know if something is unclear.

What you're looking for is the "pimp-my-library" pattern. Check it out:
object CFMapExtensions {
import sanity.commons.functional.CFMap
import scala.collection.JavaConversions._
implicit class IterableExtensions[I](i: Iterable[I]) {
def bmap[O](f: Function1[I, O]): Iterable[O] = new CFMap(f).apply(asJavaIterable(i))
}
implicit class JavaIterableExtensions[I](i: java.lang.Iterable[I]) {
def bmap[O](f: Function1[I, O]): Iterable[O] = new CFMap(f).apply(i)
}
// Add an implicit conversion to a java function.
import java.util.function.{Function => JFunction}
implicit def toJFunction[I, O](f: Function1[I, O]): JFunction[I, O] = {
new JFunction[I, O]() {
def apply(t: I): O = f(t)
}
}
}
object Test extends App {
import CFMapExtensions._
List(1,2,3,4).bmap(_ + 5).foreach(println)
}

Related

Is it safe to remove elements from a collection.mutable.HashSet during iteration?

A mutable Set's retain method is implemented as follows:
def retain(p: A => Boolean): Unit =
for (elem <- this.toList) // SI-7269 toList avoids ConcurrentModificationException
if (!p(elem)) this -= elem
But if I implement my own method that doesn't make a copy for iterating, nothing blows up.
def dumbRetain[A](self: mutable.Set[A], p: A => Boolean): Unit =
for (elem <- self)
if (!p(elem)) self -= elem
dumbRetain(mutable.HashSet(1,2,3,4,5,6), Set(2,4,6))
// everything is ok
I see that SI-7269's test case uses the JavaConversions wrapper around a java Set/Map, and it seems like the issue arises from the underlying java collection.
I know there will never be a java collection passed to my algorithm, so can I use dumbRetain without worrying about the ConcurrentModificationException? Or is this "coincidental behavior" that I shouldn't rely on?
edit to clarify, I would be using dumbRetain as an implementation detail in an algorithm which would be in full control of what it passes to dumbRetain. And this would be run in a single-threaded context.
This seems to rely on the specific implementation of mutable.HashSet, and there is nothing in the API that guarantees that it would work for all other implementations of mutable.Set, even if we exclude all wrappers for the Java collections.
The for-loop
for (elem <- self) {
...
}
is desugared into foreach, which for mutable.HashSet is implemented as follows:
override def foreach[U](f: A => U) {
var i = 0
val len = table.length
while (i < len) {
val curEntry = table(i)
if (curEntry ne null) f(entryToElem(curEntry))
i += 1
}
}
Essentially, it simply loops through the Array of the underlying FlatHashTable, and invokes the passed function f on every element. The whole foreach simply does not have any lines which could throw anything, it doesn't check for concurrent [footnote-1] modifications at all.
A ConcurrentModificationException seems to be the less troubling case: at least, your program fails fast, and even returns a detailed stack trace that points to the line in which the problem occurred. It would be actually much worse if it simply deteriorated into undefined behavior without throwing anything. This would be the worst case. However, this worst case shouldn't occur for collections from the standard library: Throw ConcurrentModificationException exception's in scala collections? #188
Quote:
In scala/scala#5295 (merged in to 2.12.x) I made sure that removing the element last returned from an iterator would not cause a problem for the iterator.
So, as long as you clearly state in the documentation that only the collections from standard library are supported, you will most likely not have any problems using it in your own code. But if you use it in a public interface, this would be an invitation for a bug analogous to "SI-7269" quoted in your question.
[footnote-1] "concurrent" as in "ConcurrentModificationException", not as in "concurrently executed threads".
EDIT: I've tried to choose less ambiguous formulations. Great Thanks #Dima for the feedback and the numerous suggestions.
Yeah, you can do it, as long as you are sure this is the scala's native HashSet implementation, not a wrapper around java ... and with understanding, that this is not thread-safe, and should never be used concurrently (the original HashSet.retain is that way too as well as the other mutators).
Better yet, just use immutable Set.filter, unless you actually have real hard evidence (not just intuition) demonstrating that your specific case absolutely requires mutable container.

Can Try be lazy or eager in Scala?

AFAIK, Iterator.map is lazy while Vector.map is eager, basically because they are different types of monads.
I would like to know if there is any chance of having a EagerTry and LazyTry that behave just like the current Try, but with the latter (LazyTry) delaying the execution of the closure passed until the result is needed (if it is needed).
Please note that declaring stuff as lazy doesn't work quite well in Scala, in particular it works for a given scope. An alternative exists when passing parameters (parameters by name). The question is how to achieve lazy behaviour when returning (lazy) values to an outer scope. Option is basically a collection of length 0 or 1, this would be an equivalent case for lazy collections (Iterator, Sequence) but limited to length 0 or 1 (like Option and Either). I'm particularly interested in Try, i.e. using LazyTry exactly as Try would be used. I guess this should be analogous in other cases (Option and Either).
Please note that we already have EagerTry, as the current standard Try is eager. Unfortunately, the class is sealed, therefore, to have eager and lazy versions of the same class we would need to define the three of them and implement two of them (as opposed to defining and implementing one). The point is returning a Try without other software layers worrying about the time of execution of that code, i.e. abstraction.
Yes, it wouldn't be hard to write LazyTry. One possible approach:
sealed class LazyTry[A](block: => A) {
// the only place block is used
private lazy val underlying: Try[A] = Try(block)
def get = underlying.get
def isSuccess = underlying.isSuccess
...
}
object LazyTry {
def apply[A](block: => A): LazyTry[A] = new LazyTry[A](block)
...
}
Note that you don't have LazySuccess and LazyFailure, since you don't know which class to use before running block.

Scala (method / function).hashCode static value?

I' was curious if the hashCode() function returns always the same value for a lambda or something in Scala?
My tests have shown to me some static value that does not change even over builds. Is this intended behavior or may it change in future?
If this was some static behavior it would help me a lot building my library.
EDIT:
Let's take this source code:
object Main {
def main(args: Array[String]): Unit = {
val x = (s: String) => 1
val y = (s: String) => 2
println(x.hashCode())
println(y.hashCode())
}
}
It's output on console is for me always 1792393294 and 226170135.
What I'm currently doing is implementing a parser combinator library which I implemented in several languages. And I need to know when wrapper classes are the same (e.g. the underlying functions are the same) so I can implement something like a call stack, which I need to parse as far as possible on failure but prevent endless recursion on error.
Thanks in advance!
The default implementation for hashCode (at least in the Oracle JVM) is in terms of the (initial) memory address of that particular object. So if your program has constructed exactly the same sized objects in exactly the same order before constructing that object, it will in practice return the same value every time.
But this is not at all reliable; most programs do not do exactly the same things every time they run. As soon as you're doing something like responding to user input, that perfect reproducibility will disappear - e.g. maybe you sometimes add enough entries to a HashMap to trigger an enlargement of the table, and sometimes not. And if you construct the same value later in the program, it will of course have a different address; try doing
val z = (s: String) => 1
and observe that it will have a different hashCode from x. Not to mention that the numbers may well be different across different JVMs, different versions of the same JVM, or even when the same JVM is launched with a different -Xms setting.
Computers are often a lot more deterministic in practice than in theory. But this is not the kind of thing that's specified to happen, and certainly not something to rely on in your programs.

Scala Iterable Memory Leaks

I recently started playing with Scala and ran across the following. Below are 4 different ways to iterate through the lines of a file, do some stuff, and write the result to another file. Some of these methods work as I would think (though using a lot of memory to do so) and some eat memory to no end.
The idea was to wrap Scala's getLines Iterator as an Iterable. I don't care if it reads the file multiple times - that's what I expect it to do.
Here's my repro code:
class FileIterable(file: java.io.File) extends Iterable[String] {
override def iterator = io.Source.fromFile(file).getLines
}
// Iterator
// Option 1: Direct iterator - holds at 100MB
def lines = io.Source.fromFile(file).getLines
// Option 2: Get iterator via method - holds at 100MB
def lines = new FileIterable(file).iterator
// Iterable
// Option 3: TraversableOnce wrapper - holds at 2GB
def lines = io.Source.fromFile(file).getLines.toIterable
// Option 4: Iterable wrapper - leaks like a sieve
def lines = new FileIterable(file)
def values = lines
.drop(1)
//.map(l => l.split("\t")).map(l => l.reduceLeft(_ + "|" + _))
//.filter(l => l.startsWith("*"))
val writer = new java.io.PrintWriter(new File("out.tsv"))
values.foreach(v => writer.println(v))
writer.close()
The file it's reading is ~10GB with 1MB lines.
The first two options iterate the file using a constant amount of memory (~100MB). This is what I would expect. The downside here is that an iterator can only be used once and it's using Scala's call-by-name convention as a psuedo-iterable. (For reference, the equivalent c# code uses ~14MB)
The third method calls toIterable defined in TraverableOnce. This one works, but it uses about 2GB to do the same work. No idea where the memory is going because it can't cache the entire Iterable.
The fourth is the most alarming - it immediately uses all available memory and throws an OOM exception. Even weirder is that it does this for all of the operations I've tested: drop, map, and filter. Looking at the implementations, none of them seem to maintain much state (though the drop looks a little suspect - why does it not just count the items?). If I do no operations, it works fine.
My guess is that somewhere it's maintaining references to each of the lines read, though I can't imagine how. I've seen the same memory usage when passing Iterables around in Scala. For example if I take case 3 (.toIterable) and pass that to a method that writes an Iterable[String] to a file, I see the same explosion.
Any ideas?
Note how the ScalaDoc of Iterable says:
Implementations of this trait need to provide a concrete method with
signature:
def iterator: Iterator[A]
They also need to provide a method newBuilder which creates a builder
for collections of the same kind.
Since you don't provide an implementation for newBuilder, you get the default implementation, which uses a ListBuffer and thus tries to fit everything into memory.
You might want to implement Iterable.drop as
def drop(n: Int) = iterator.drop(n).toIterable
but that would break with the representation invariance of the collection library (i.e. iterator.toIterable returns a Stream, while you want List.drop to return a List etc - thus the need for the Builder concept).

val-mutable versus var-immutable in Scala

Are there any guidelines in Scala on when to use val with a mutable collection versus using var with an immutable collection? Or should you really aim for val with an immutable collection?
The fact that there are both types of collection gives me a lot of choice, and often I don't
know how to make that choice.
Pretty common question, this one. The hard thing is finding the duplicates.
You should strive for referential transparency. What that means is that, if I have an expression "e", I could make a val x = e, and replace e with x. This is the property that mutability break. Whenever you need to make a design decision, maximize for referential transparency.
As a practical matter, a method-local var is the safest var that exists, since it doesn't escape the method. If the method is short, even better. If it isn't, try to reduce it by extracting other methods.
On the other hand, a mutable collection has the potential to escape, even if it doesn't. When changing code, you might then want to pass it to other methods, or return it. That's the kind of thing that breaks referential transparency.
On an object (a field), pretty much the same thing happens, but with more dire consequences. Either way the object will have state and, therefore, break referential transparency. But having a mutable collection means even the object itself might lose control of who's changing it.
If you work with immutable collections and you need to "modify" them, for example, add elements to them in a loop, then you have to use vars because you need to store the resulting collection somewhere. If you only read from immutable collections, then use vals.
In general, make sure that you don't confuse references and objects. vals are immutable references (constant pointers in C). That is, when you use val x = new MutableFoo(), you'll be able to change the object that x points to, but you won't be able to change to which object x points. The opposite holds if you use var x = new ImmutableFoo(). Picking up my initial advice: if you don't need to change to which object a reference points, use vals.
The best way to answer this is with an example. Suppose we have some process simply collecting numbers for some reason. We wish to log these numbers, and will send the collection to another process to do this.
Of course, we are still collecting numbers after we send the collection to the logger. And let's say there is some overhead in the logging process that delays the actual logging. Hopefully you can see where this is going.
If we store this collection in a mutable val, (mutable because we are continuously adding to it), this means that the process doing the logging will be looking at the same object that's still being updated by our collection process. That collection may be updated at any time, and so when it's time to log we may not actually be logging the collection we sent.
If we use an immutable var, we send an immutable data structure to the logger. When we add more numbers to our collection, we will be replacing our var with a new immutable data structure. This doesn't mean collection sent to the logger is replaced! It's still referencing the collection it was sent. So our logger will indeed log the collection it received.
I think the examples in this blog post will shed more light, as the question of which combo to use becomes even more important in concurrency scenarios: importance of immutability for concurrency. And while we're at it, note the preferred use of synchronised vs #volatile vs something like AtomicReference: three tools
var immutable vs. val mutable
In addition to many excellent answers to this question. Here is a simple example, that illustrates potential dangers of val mutable:
Mutable objects can be modified inside methods, that take them as parameters, while reassignment is not allowed.
import scala.collection.mutable.ArrayBuffer
object MyObject {
def main(args: Array[String]) {
val a = ArrayBuffer(1,2,3,4)
silly(a)
println(a) // a has been modified here
}
def silly(a: ArrayBuffer[Int]): Unit = {
a += 10
println(s"length: ${a.length}")
}
}
Result:
length: 5
ArrayBuffer(1, 2, 3, 4, 10)
Something like this cannot happen with var immutable, because reassignment is not allowed:
object MyObject {
def main(args: Array[String]) {
var v = Vector(1,2,3,4)
silly(v)
println(v)
}
def silly(v: Vector[Int]): Unit = {
v = v :+ 10 // This line is not valid
println(s"length of v: ${v.length}")
}
}
Results in:
error: reassignment to val
Since function parameters are treated as val this reassignment is not allowed.