val-mutable versus var-immutable in Scala - scala

Are there any guidelines in Scala on when to use val with a mutable collection versus using var with an immutable collection? Or should you really aim for val with an immutable collection?
The fact that there are both types of collection gives me a lot of choice, and often I don't
know how to make that choice.

Pretty common question, this one. The hard thing is finding the duplicates.
You should strive for referential transparency. What that means is that, if I have an expression "e", I could make a val x = e, and replace e with x. This is the property that mutability break. Whenever you need to make a design decision, maximize for referential transparency.
As a practical matter, a method-local var is the safest var that exists, since it doesn't escape the method. If the method is short, even better. If it isn't, try to reduce it by extracting other methods.
On the other hand, a mutable collection has the potential to escape, even if it doesn't. When changing code, you might then want to pass it to other methods, or return it. That's the kind of thing that breaks referential transparency.
On an object (a field), pretty much the same thing happens, but with more dire consequences. Either way the object will have state and, therefore, break referential transparency. But having a mutable collection means even the object itself might lose control of who's changing it.

If you work with immutable collections and you need to "modify" them, for example, add elements to them in a loop, then you have to use vars because you need to store the resulting collection somewhere. If you only read from immutable collections, then use vals.
In general, make sure that you don't confuse references and objects. vals are immutable references (constant pointers in C). That is, when you use val x = new MutableFoo(), you'll be able to change the object that x points to, but you won't be able to change to which object x points. The opposite holds if you use var x = new ImmutableFoo(). Picking up my initial advice: if you don't need to change to which object a reference points, use vals.

The best way to answer this is with an example. Suppose we have some process simply collecting numbers for some reason. We wish to log these numbers, and will send the collection to another process to do this.
Of course, we are still collecting numbers after we send the collection to the logger. And let's say there is some overhead in the logging process that delays the actual logging. Hopefully you can see where this is going.
If we store this collection in a mutable val, (mutable because we are continuously adding to it), this means that the process doing the logging will be looking at the same object that's still being updated by our collection process. That collection may be updated at any time, and so when it's time to log we may not actually be logging the collection we sent.
If we use an immutable var, we send an immutable data structure to the logger. When we add more numbers to our collection, we will be replacing our var with a new immutable data structure. This doesn't mean collection sent to the logger is replaced! It's still referencing the collection it was sent. So our logger will indeed log the collection it received.

I think the examples in this blog post will shed more light, as the question of which combo to use becomes even more important in concurrency scenarios: importance of immutability for concurrency. And while we're at it, note the preferred use of synchronised vs #volatile vs something like AtomicReference: three tools

var immutable vs. val mutable
In addition to many excellent answers to this question. Here is a simple example, that illustrates potential dangers of val mutable:
Mutable objects can be modified inside methods, that take them as parameters, while reassignment is not allowed.
import scala.collection.mutable.ArrayBuffer
object MyObject {
def main(args: Array[String]) {
val a = ArrayBuffer(1,2,3,4)
silly(a)
println(a) // a has been modified here
}
def silly(a: ArrayBuffer[Int]): Unit = {
a += 10
println(s"length: ${a.length}")
}
}
Result:
length: 5
ArrayBuffer(1, 2, 3, 4, 10)
Something like this cannot happen with var immutable, because reassignment is not allowed:
object MyObject {
def main(args: Array[String]) {
var v = Vector(1,2,3,4)
silly(v)
println(v)
}
def silly(v: Vector[Int]): Unit = {
v = v :+ 10 // This line is not valid
println(s"length of v: ${v.length}")
}
}
Results in:
error: reassignment to val
Since function parameters are treated as val this reassignment is not allowed.

Related

Is it possible for a structure made of immutable types to have a cycle?

Consider the following:
case class Node(var left: Option[Node], var right: Option[Node])
It's easy to see how you could traverse this, search it, whatever. But now imagine you did this:
val root = Node(None, None)
root.left = root
Now, this is bad, catastrophic. In fact, you type it into a REPL, you'll get a StackOverflow (hey, that would be a good name for a band!) and a stack trace a thousand lines long. If you want to try it, do this:
{ root.left = root }: Unit
to suppress the REPL well-intentioned attempt to print out the results.
But to construct that, I had to specifically give the case-class mutable members, something I would never do in real life. If I use ordinary mutable members, I get a problem with construction. The closest I can come is
case class Node(left: Option[Node], right: Option[Node])
val root: Node = Node(Some(loop), None)
Then root has the rather ugly value Node(Some(null),None), but it's still not cyclic.
So my question is, if a data-structure is transitively immutable (that is, all of its members are either immutable values or references to other data-structures that are themselves transitively immutable), is it guaranteed to be acyclic?
It would be cool if it were.
Yes, it is possible to create cyclic data structures even with purely immutable data structures in a pure, referentially transparent, effect-free language.
The "obvious" solution is to pull out the potentially cyclic references into a separate data structure. For example, if you represent a graph as an adjacency matrix, then you don't need cycles in your data structure to represent cycles in your graph. But that's cheating: every problem can be solved by adding a layer of indirection (except the problem of having too many layers of indirection).
Another cheat would be to circumvent Scala's immutability guarantees from the outside, e.g. on the default Scala-JVM implementation by using Java reflection methods.
It is possible to create actual cyclic references. The technique is called Tying the Knot, and it relies on laziness: you can actually set the reference to an object that you haven't created yet because the reference will be evaluated lazily, by which time the object will have been created. Scala has support for laziness in various forms: lazy vals, by-name parameters, and the now deprecated DelayedInit. Also, you can "fake" laziness using functions or method: wrap the thing you want to make lazy in a function or method which produces the thing, and it won't be created until you call the function or method.
So, the same techniques should be possible in Scala as well.
How about using lazy with call by name ?
scala> class Node(l: => Node, r: => Node, v: Int)
// defined class Node
scala> lazy val root: Node = new Node(root, root, 5)
// root: Node = <lazy>

scala.collection.breakOut vs views

This SO answer describes how scala.collection.breakOut can be used to prevent creating wasteful intermediate collections. For example, here we create an intermediate Seq[(String,String)]:
val m = List("A", "B", "C").map(x => x -> x).toMap
By using breakOut we can prevent the creation of this intermediate Seq:
val m: Map[String,String] = List("A", "B", "C").map(x => x -> x)(breakOut)
Views solve the same problem and in addition access elements lazily:
val m = (List("A", "B", "C").view map (x => x -> x)).toMap
I am assuming the creation of the View wrappers is fairly cheap, so my question is: Is there any real reason to use breakOut over Views?
You're going to make a trip from England to France.
With view: you're taking a set of notes in your notebook and boom, once you've called .force() you start making all of them: buy a ticket, board on the plane, ....
With breakOut: you're departing and boom, you in the Paris looking at the Eiffel tower. You don't remember how exactly you've arrived there, but you did this trip actually, just didn't make any memories.
Bad analogy, but I hope this give you a taste of what is the difference between them.
I don't think views and breakOut are identical.
A breakOut is a CanBuildFrom implementation used to simplify transformation operations by eliminating intermediary steps. E.g get from A to B without the intermediary collection. A breakOut means letting Scala choose the appropriate builder object for maximum efficiency of producing new items in a given scenario. More details here.
views deal with a different type of efficiency, the main sale pitch being: "No more new objects". Views store light references to objects to tackle different usage scenarios: lazy access etc.
Bottom line:
If you map on a view you may still get an intermediary collection of references created before the expected result can be produced. You could still have superior performance from:
collection.view.map(somefn)(breakOut)
Than from:
collection.view.map(someFn)
As of Scala 2.13, this is no longer a concern. Breakout has been removed and views are the recommended replacement.
Scala 2.13 Collections Rework
Views are also the recommended replacement for collection.breakOut.
For example,
val s: Seq[Int] = ...
val set: Set[String] = s.map(_.toString)(collection.breakOut)
can be expressed with the same performance characteristics as:
val s: Seq[Int] = ...
val set = s.view.map(_.toString).to(Set)
What flavian said.
One use case for views is to conserve memory. For example, if you had a million-character-long string original, and needed to use, one by one, all of the million suffixes of that string, you might use a collection of
val v = original.view
val suffixes = v.tails
views on the original string. Then you might loop over the suffixes one by one, using suffix.force() to convert them back to strings within the loop, thus only holding one in memory at a time. Of course, you could do the same thing by iterating with your own loop over the indices of the original string, rather than creating any kind of collection of the suffixes.
Another use-case is when creation of the derived objects is expensive, you need them in a collection (say, as values in a map), but you only will access a few, and you don't know which ones.
If you really have a case where picking between them makes sense, prefer breakOut unless there's a good argument for using view (like those above).
Views require more code changes and care than breakOut, in that you need to add force() where needed. Depending on context, failure to do so is
often only detected at run-time. With breakOut, generally if it
compiles, it's right.
In cases where view does not apply, breakOut
will be faster, since view generation and forcing is skipped.
If you use a debugger, you can inspect the collection contents, which you
can't meaningfully do with a collection of views.

Scala lazy collection growth

This question is a bit more theoretical.
I have an object which holds a private mutable list or map, whose growth is append only. I believe I could argue that the object itself is functional, being functionally transparent and to all appearances, immutable. So for example I have
import scala.collection.mutable.Map
case class Foo(bar:String)
object FooContainer {
private val foos = Map.empty[String, Foo]
def getFoo(fooName:String) = foos.getOrElseUpdate(fooName, Foo(fooName))
}
I could do a similar case with lists. Now I imagine to be truly functional, I would need to ensure thread safety, which I suppose I could do simply using locks, synchronise or atomic references. My questions being is this all necessary & sufficient, is it a common practice, are there established patterns for this behaviour?
I would say that "functional" is really the wrong term. The exact example shown above could be called "pure" in the sense that its output is only ever defined by its input, thus not having any (visible) side-effects. In reality it is impure though, since it has hidden internal state.
The function always evaluates the same result value given the same argument value(s). The function result value cannot depend on any hidden information or state that may change as program execution proceeds or between different executions of the program, nor can it depend on any external input from I/O devices. Wikipedia
The apparent impurity vanishes though, if you make the Foo class mutable:
case class Foo(bar:String) {
private var mutable:Int = 1
def setFoo(x:Int) = mutable = x
def foo = mutable
}
The mutability of the Foo class results in getFoo being impure:
scala> FooContainer.getFoo("test").foo
res5: Int = 1
scala> FooContainer.getFoo("bla").setFoo(5)
scala> FooContainer.getFoo("bla").foo
res7: Int = 5
In order for the function to be apparently pure, FooContainer.getFoo("bla").foo must always return the same value. Therefore, the "purity" of the described construct is fragile at best.
As a general rule, you're better off with a var holding an immutable collection type which is replaced when updated than you are with a val holding a mutable type. The former never deals in values that may change after being created (other than the class which holds the evolving value itself). Then you only need to be moderately careful about when updated values held by the var are published to it, but since updating a reference is generally atomic (apart from things like storing a long or a double to RAM on a 32-bit machine), there's little risk.

Scala immutable map, when to go mutable?

My present use case is pretty trivial, either mutable or immutable Map will do the trick.
Have a method that takes an immutable Map, which then calls a 3rd party API method that takes an immutable Map as well
def doFoo(foo: String = "default", params: Map[String, Any] = Map()) {
val newMap =
if(someCondition) params + ("foo" -> foo) else params
api.doSomething(newMap)
}
The Map in question will generally be quite small, at most there might be an embedded List of case class instances, a few thousand entries max. So, again, assume little impact in going immutable in this case (i.e. having essentially 2 instances of the Map via the newMap val copy).
Still, it nags me a bit, copying the map just to get a new map with a few k->v entries tacked onto it.
I could go mutable and params.put("bar", bar), etc. for the entries I want to tack on, and then params.toMap to convert to immutable for the api call, that is an option. but then I have to import and pass around mutable maps, which is a bit of hassle compared to going with Scala's default immutable Map.
So, what are the general guidelines for when it is justified/good practice to use mutable Map over immutable Maps?
Thanks
EDIT
so, it appears that an add operation on an immutable map takes near constant time, confirming #dhg's and #Nicolas's assertion that a full copy is not made, which solves the problem for the concrete case presented.
Depending on the immutable Map implementation, adding a few entries may not actually copy the entire original Map. This is one of the advantages to the immutable data structure approach: Scala will try to get away with copying as little as possible.
This kind of behavior is easiest to see with a List. If I have a val a = List(1,2,3), then that list is stored in memory. However, if I prepend an additional element like val b = 0 :: a, I do get a new 4-element List back, but Scala did not copy the orignal list a. Instead, we just created one new link, called it b, and gave it a pointer to the existing List a.
You can envision strategies like this for other kinds of collections as well. For example, if I add one element to a Map, the collection could simply wrap the existing map, falling back to it when needed, all while providing an API as if it were a single Map.
Using a mutable object is not bad in itself, it becomes bad in a functional programming environment, where you try to avoid side-effects by keeping functions pure and objects immutable.
However, if you create a mutable object inside a function and modify this object, the function is still pure if you don't release a reference to this object outside the function. It is acceptable to have code like:
def buildVector( x: Double, y: Double, z: Double ): Vector[Double] = {
val ary = Array.ofDim[Double]( 3 )
ary( 0 ) = x
ary( 1 ) = y
ary( 2 ) = z
ary.toVector
}
Now, I think this approach is useful/recommended in two cases: (1) Performance, if creating and modifying an immutable object is a bottleneck of your whole application; (2) Code readability, because sometimes it's easier to modify a complex object in place (rather than resorting to lenses, zippers, etc.)
In addition to dhg's answer, you can take a look to the performance of the scala collections. If an add/remove operation doesn't take a linear time, it must do something else than just simply copying the entire structure. (Note that the converse is not true: it's not beacuase it takes linear time that your copying the whole structure)
I like to use collections.maps as the declared parameter types (input or return values) rather than mutable or immutable maps. The Collections maps are immutable interfaces that work for both types of implementations. A consumer method using a map really doesn't need to know about a map implementation or how it was constructed. (It's really none of its business anyway).
If you go with the approach of hiding a map's particular construction (be it mutable or immutable) from the consumers who use it then you're still getting an essentially immutable map downstream. And by using collection.Map as an immutable interface you completely remove all the ".toMap" inefficiency that you would have with consumers written to use immutable.Map typed objects. Having to convert a completely constructed map into another one simply to comply to an interface not supported by the first one really is absolutely unnecessary overhead when you think about it.
I suspect in a few years from now we'll look back at the three separate sets of interfaces (mutable maps, immutable maps, and collections maps) and realize that 99% of the time only 2 are really needed (mutable and collections) and that using the (unfortunately) default immutable map interface really adds a lot of unnecessary overhead for the "Scalable Language".

Using alternative comparison in HashSet

I stumbled across this problem when creating a HashSet[Array[Byte]] to use in a kind of HatTrie.
Apparently, the standard equals() method on arrays checks for identity. How can I provide the HashSet with an alternative Comparator that uses .deepEquals() for checking if an element is contained in the set?
Basically, I want this test to pass:
describe ("A HashSet of Byte Array") {
it("must contain arrays that are equivalent to one that has been added") {
val set = new HashSet[Array[Byte]]()
set += "ab".getBytes("UTF-8")
set must contain ("ab".getBytes("UTF-8"))
}
}
I cannot feasibly wrap the Array[Byte] into another object because there's a lot of them. Short of writing a new HashSet implementation for this purpose is there anything I can do?
Mutable data structures, such as Arrays, are contra-indicated for usage in places where the hash code is used. This is because the data structure can change, thus changing the hash code of the data, thus making access to the data inaccurate.
For instance, let's say I have a binary tree to store elements based on their hash code. If the hash is even, I store the data on the left side, if odd on the right side. Then I divide the hash by two, and repeat the process until the hash is 0, at which point I store the data in the node.
Now, I use this structure as base for HashSet, and then store an array on it. The array has an even hash code, so it goes to the left side of the tree. Let's ignore it's exact position.
Later, I change the array, and then look it up on the set. Now the hash code is odd, and I go look on the right side of the tree, and thus can't find it, even though it is stored int he tree -- just on the other side.
So, don't use arrays with hash-based collections. Which doesn't answer your question, of course.
As for your question, you'd have to subclass HashSet, and then override the equals method. I don't know if HashSet is final or descendent from a sealed class, so I don't know if this is viable.
Another option would be creating an alternate comparision method -- not named equals or "==", based specifically on deepEquals, and then using the Pimp My Class method to add it to HashSet.
Edit
I did mean subclass HashSet, but I did not pay enough attention to the question. I thought you were comparing the entire HashSet, instead of just using contains. You could do this:
class MyHashSet[A] extends scala.collection.mutable.HashSet[A] {
override def contains(elem: A): Boolean = elem match {
case arr : Array[_] => this.elements exists (arr deepEquals _)
case _ => super.contains(elem)
}
}
This isn't actually working here, as the first case is not being followed. I'm really lost here, as simple tests on REPL seems to indicate it ought to work. I'm thinking it might have something to do with boxing, but I'm not real clear on what -- or I'd have it working. :-)