What goes on behind the scenes when adding to immutable collections in Scala?

Working in Scala, I have encountered immutable items, in this example immutable.Map. There are times when code that I do not control (Spark) returns an immutable.Map that I want to process and add elements to. I am using the following approach because it compiles and runs. I am hoping the computer is smart enough to do this efficiently, but I do not believe I should make that assumption.
var map: immutable.Map[Int, Double] = getMapFromSomewhere()
var i = 0
while (i < 5) {
  map += (i -> 0.0)
  i += 1
}
I am hoping that this places my new map item into memory and does not make a copy of the Map that then has to be cleaned up by garbage collection. Should I be creating a mutable.Map from my immutable.Map to do these types of operations instead?

When you "add" to an immutable collection, you are really creating a new collection, which ideally and usually shares most of its memory and data with the old one. This is safe: because the collections are immutable, you don't need to worry that a change in one will corrupt the other.
Your code is... not so great. That's a terribly ugly style for Scala, and your types are off. (There's no such thing as "immutable.Map[Double]", since Map takes two type parameters. You are building an immutable.Map[Int,Double], I guess.)
Here's a less ugly way to build what you are trying to build:
(0 until 5).map( i => (i, 0.0) ).toMap
or, more precisely, since you may be starting with a nonempty map
getMapFromSomewhere() ++ (0 until 5).map(i => (i, 0.0))
Reserve mutable data structures for special cases where you really need them, and use them only if you have carefully thought through how you will manage any concurrency or if you can guarantee there will be no concurrent access. Your default in Scala should be immutable datastructures built and manipulated in a functional style, avoiding explicit external iteration of the sort in your example. You should use the keyword "var" only rarely, like mutable datastructures, only for special cases you have thought through carefully.
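To make the suggestion above concrete, here is a minimal sketch. getMapFromSomewhere is a hypothetical stand-in for the external code (the real one is not shown in the question), and the object name is only for illustration.

```scala
object MapBuildDemo {
  // Hypothetical stand-in for the map returned by code you do not control.
  def getMapFromSomewhere(): Map[Int, Double] = Map(10 -> 1.5)

  val start: Map[Int, Double] = getMapFromSomewhere()

  // One bulk operation instead of a while loop over a var.
  val viaConcat: Map[Int, Double] = start ++ (0 until 5).map(i => i -> 0.0)

  // Equivalent fold: each step returns a new map and never mutates `start`.
  val viaFold: Map[Int, Double] =
    (0 until 5).foldLeft(start)((m, i) => m + (i -> 0.0))
}
```

Both versions produce the same map, and start still holds only its original entry afterwards.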

The data structures in functional programming languages are not simply immutable (they can't be changed once created) but also persistent. Persistent means that a new version reuses parts of the existing collection for some operations. For example, in Scala prepending an element to a List is optimized (so when you use a List, think of adding an element as pushing it onto a stack).
Similarly, other collections are optimized as well for other operations.
Here are a few references that should help you understand persistent data structures in functional programming:
1. Persistent data structures in Scala
2. https://www.packtpub.com/mapt/book/application_development/9781783985845/3/ch03lvl1sec25/persistent-data-structures
3. https://www.youtube.com/watch?v=pNhBQJN44YQ
4. https://www.youtube.com/watch?v=T0yzrZL1py0

Related

How is Scala so efficient with lists?

It's usually considered bad practice to create unnecessary collections in Java, as that consumes memory and CPU. Scala seems to be pretty efficient with it and encourages the use of immutable data structures.
How is Scala so efficient with Lists? What techniques are used to achieve that?
While the comments are correct that the claim that List is particularly efficient is a dubious one, it does much better than making a full copy of the collection for every operation, as you would have to do with Java's standard collections.
The reason is that List and the other immutable collections are not just mutable collections whose mutation methods return a copy; they are designed differently, with immutability in mind. They take advantage of something called "structural sharing". If parts of a collection remain the same after a change, those parts don't need to be copied and the same object can be shared across multiple collections. This works because of immutability: there is no chance that the shared parts could be altered, so it's safe to share them.
Imagine the simplest example, prepending to a list.
You have a List(1,2,3) and you want to prepend 0
val original = List(1,2,3)
val updated = 0 :: original
Your list would then look something like this:
updated   original
   \         \
    0 - - - - 1 - - - 2 - - - 3
All that's needed is to create a new node and point its tail to the head of your original list. Nothing needs to be copied. Similarly, the tail and drop operations just need to return a reference to the appropriate node, and nothing needs to be copied. This is why List can be quite good with the prepend and tail operations: it doesn't do any copying even though it creates a "new" List.
Other List operations do require some amount of copying, but always as little as possible. As long as part of the tail of a list is unchanged, it doesn't need to be copied. For example, when concatenating lists, the first list needs to be copied, but then its tail can just point to the head of the second, so the second list doesn't need to be copied at all. This is why, when concatenating a long and a short list, it's better to put the shorter list on the "left", as it is the only one that needs to be copied.
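The sharing described above can be observed with reference equality (eq). This is a small sketch rather than a statement about every implementation detail; the object and value names are made up for the example.

```scala
object ListSharingDemo {
  val original = List(1, 2, 3)
  val updated = 0 :: original

  // Prepending allocates one new cell whose tail IS the original list.
  val prependShares: Boolean = updated.tail eq original

  val left = List(1, 2)
  val right = List(3, 4, 5, 6)
  val joined = left ::: right

  // Concatenation copies the cells of `left` only; after dropping them
  // we arrive at the very same object as `right`.
  val concatShares: Boolean = joined.drop(left.length) eq right
}
```

Here eq compares object identity, so a true result means no copy of the shared part was made.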
Other types of collections are better at different operations. Vector, for example, can both prepend and append in amortized constant time, as well as having good random access and update capabilities (though still much worse than a raw mutable array). In most cases it will be more efficient than List while still being immutable. Its implementation is quite complicated. It uses a trie data structure, with many internal arrays to store data. The unchanged arrays can be shared, and only the ones that need to be altered by an update operation need to be copied.
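As a quick illustration of the Vector operations mentioned above, the sketch below appends, prepends and updates, and the original Vector is left untouched each time. It shows the API only; the internal array sharing is not observable from here.

```scala
object VectorDemo {
  val v = Vector(1, 2, 3)

  val appended = v :+ 4          // new Vector with 4 at the end
  val prepended = 0 +: v         // new Vector with 0 at the front
  val replaced = v.updated(1, 20) // new Vector with index 1 replaced
}
```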

Efficient way to build collections from other collections

In Scala, as in many other languages, it is possible to build collections using the elements contained in other collections.
For example, it is possible to heapify a list:
import scala.collection.mutable.PriorityQueue
val l = List(1,2,3,4)
With:
val pq = PriorityQueue(l:_*)
or:
val pq = PriorityQueue[Int]() ++ l
These are, from my point of view, two quite different approaches:
Use a variadic constructor and collection:_* which, at the end of the day, dumps the collection in an intermediate array.
Build an empty target collection and use the ++ method to add all the source collection elements.
From an aesthetic point of view I prefer the first option, but I am worried about collection:_*. I understand from "Programming in Scala" that variadic functions are translated into functions receiving an array.
Is the second option, in general, a better solution in terms of efficiency?
The second one might be faster in some cases, but apparently when the original collection is a Seq (such as your List), Scala tries to avoid the array creation.
But, realistically, it probably will not ever make a difference anyway unless you are dealing with huge collections in tight loops. These kinds of things aren't worth worrying about, so do whichever one you like; you can spare the milliseconds.
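If you want to convince yourself that the two styles end up with the same heap contents, a sketch like the following works. Note that it uses the in-place ++= rather than the ++ from the question, and the drained helper is a made-up name for this example.

```scala
import scala.collection.mutable.PriorityQueue

object HeapifyDemo {
  val l = List(3, 1, 4, 2)

  // Option 1: varargs factory.
  val viaVarargs: PriorityQueue[Int] = PriorityQueue(l: _*)

  // Option 2 (variation): start empty and add everything in place.
  val viaAppend: PriorityQueue[Int] = PriorityQueue.empty[Int]
  viaAppend ++= l

  // Dequeue from a clone so the queue passed in is left untouched.
  def drained(pq: PriorityQueue[Int]): List[Int] = {
    val copy = pq.clone()
    List.fill(copy.size)(copy.dequeue())
  }
}
```

Both queues dequeue their elements in the same (descending) order, so the choice is about style and, at most, a small constant cost.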

Scala's toList function appears to be slow

I was under the impression that calling seq.toList on an immutable Seq would make a new list that shares structural state with the original sequence. We're finding that this can be really slow and I'm not sure why. It is just sharing the structural state, correct? I can't see why it would make an O(n) copy of all the elements when it knows they'll never change.
A List in Scala is a particular data structure: instances of :: each containing a value, followed by Nil at the end of the chain.
If you toList a List, it will take O(1) time. If you toList on anything else then it must be converted into a List, which involves O(n) object allocations (all the :: instances).
So you have to ask whether you really want a scala.collection.immutable.List. That's what toList gives you.
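A short sketch of the difference described above: toList on a List hands back the same object, while toList on another sequence type (a Vector here, chosen only as an example) has to allocate a fresh chain of cells.

```scala
object ToListDemo {
  val list = List(1, 2, 3)
  val vector = Vector(1, 2, 3)

  // Already a List: no work, same object comes back.
  val listAgain = list.toList
  val sameObject: Boolean = listAgain eq list

  // Not a List: a new List with one cell per element is built, O(n).
  val fromVector: List[Int] = vector.toList
}
```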
Sharing structural state is possible for particular operations on particular data structures.
With the List data structure in Scala, my understanding is that every element refers to the next, starting from the head through the tail, so a singly linked list.
From a structural state sharing perspective, consider the restrictions placed on this from the internal data structure perspective. Adding an element to the head of a List (X) effectively creates a new list (X') with the new element as the head of X' and the old list (X) as the tail. For this particular operation, internal state can be shared completely.
The same operation above can be applied to create a new List (X'), with the new element as the head of X' and any element from X as the tail, as long as you accept that the tail will be the element you choose from X, plus all additional elements it already has in its data structure.
When you think about it logically, each data structure has an internal structure that allows some operations to be performed with simple shared internal structure and other operations requiring more invasive and costly computations.
The key from my perspective here is having an understanding of the constraints placed on the operations by the internal data structure itself.
For example, consider the same operations above on a doubly linked list data structure and you will see that there are quite different restrictions.
Personally, I find drawing out an understanding of the internal structure can be helpful in understanding the consequences of particular operations.
In the case of the toList operation on an arbitrary sequence, with no knowledge of that sequence's internal data structure, one therefore has to assume O(n). List.toList has the obvious performance advantage of already being a list.

Scala: var List vs val MutableList

In Odersky et al's Scala book, they say use lists. I haven't read the book cover to cover but all the examples seem to use val List. As I understand it one also is encouraged to use vals over vars. But in most applications is there not a trade-off between using a var List and a val MutableList? Obviously we use a val List when we can. But is it good practice to be using a lot of var Lists (or var Vectors etc)?
I'm pretty new to Scala coming from C#. There I had a lot of:
public List<T> myList {get; private set;}
collections which could easily have been declared as vals if C# had immutability built in, because the collection itself never changed after construction, even though elements would be added to and removed from the collection during its lifetime. So declaring a var collection almost feels like a step away from immutability.
In response to answers and comments: one of the strong selling points of Scala is that it offers many benefits without requiring one to completely change the way one writes code, as is the case with, say, Lisp or Haskell.
Is it good practice to be using a lot of var Lists (or var Vectors etc)?
I would say it's better practice to use var with immutable collections than it is to use val with mutable ones. Off the top of my head, because
You have more guarantees about behaviour: if your object has a mutable list, you never know if some other external object is going to update it
You limit the extent of mutability; methods returning a collection will yield an immutable one, so you only have mutability within your one object
It's easy to immutabilize a var by simply assigning it to a val, whereas to make a mutable collection immutable you have to use a different collection type and rebuild it
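The points above can be sketched with two small classes. Inbox and LeakyInbox are hypothetical names invented for this illustration; the first keeps a var pointing at an immutable List, the second a val pointing at a mutable ListBuffer.

```scala
import scala.collection.mutable.ListBuffer

object VarVsValDemo {
  // var + immutable collection: callers only ever see immutable snapshots.
  final class Inbox {
    private var items: List[String] = Nil
    def add(item: String): Unit = { items = item :: items }
    def all: List[String] = items
  }

  // val + mutable collection: anyone holding the buffer sees later changes.
  final class LeakyInbox {
    val items: ListBuffer[String] = ListBuffer.empty[String]
    def add(item: String): Unit = { items += item; () }
  }

  val inbox = new Inbox
  inbox.add("a")
  val snapshot = inbox.all // stays List("a") no matter what happens next
  inbox.add("b")

  val leaky = new LeakyInbox
  leaky.add("a")
  val alias = leaky.items // same buffer object, not a snapshot
  leaky.add("b")
}
```

After this runs, snapshot still contains only "a", while alias has silently grown to contain both elements.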
In some circumstances, such as time-dependent applications with extensive I/O, the simplest solution is to use mutable state. And in some circumstances, a mutable solution is just more elegant. However in most code you don't need mutability at all. The key is to use collections with higher order functions instead of looping, or recursion if a suitable function doesn't exist. This is simpler than it sounds. You just need to spend some time getting to know the methods on List (and other collections, which are mostly the same). The most important ones are:
map: applies your supplied function to each element in the collection - use instead of looping and updating values in an array
foldLeft: returns a single result from a collection - use instead of looping and updating an accumulator variable
for-yield expressions: simplify your mapping and filtering especially for nested-loop type problems
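A minimal sketch of the three tools listed above, using made-up sample data:

```scala
object HigherOrderDemo {
  val xs = List(1, 2, 3, 4)

  // map: transform every element instead of looping over an array.
  val doubled: List[Int] = xs.map(_ * 2)

  // foldLeft: combine the elements instead of updating an accumulator var.
  val total: Int = xs.foldLeft(0)(_ + _)

  // for-yield: nested iteration plus filtering, replacing nested loops.
  val pairsSummingToFive: List[(Int, Int)] =
    for {
      x <- xs
      y <- xs
      if x < y && x + y == 5
    } yield (x, y)
}
```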
Ultimately, much of functional programming is a consequence of immutability and you can't really have one without the other; however, local vars are mostly an implementation detail: there's nothing wrong with a bit of mutability so long as it cannot escape from the local scope. So use vars with immutable collections since the local vars are not what will be exported.
You are assuming either the List must be mutable, or whatever is pointing to it must be mutable. In other words, that you need to pick one of the two choices below:
val list: collection.mutable.LinkedList[T]
var list: List[T]
That is a false dichotomy. You can have both:
val list: List[T]
So, the question you ought to be asking is how do I do that?, but we can only answer that when you try it out and face a specific problem. There's no generic answer that will solve all your problems (well, there is -- monads and recursion -- but that's too generic).
So... give it a try. You might be interested in looking at Code Review, where most Scala questions pertain precisely to how to make some code more functional. But, ultimately, I think the best way to go about it is to try, and resort to Stack Overflow when you just can't figure out some specific problem.
Here is how I see this problem of mutability in functional programming.
Best solution: Values are best, so the best usage in functional programming is values and recursive functions:
def func(n: Int): List[Int] = if (n > 0) n :: func(n - 1) else Nil
val myList = func(4)
Need mutable stuff: Sometimes mutable stuff is needed or makes everything a lot easier. In that situation my inclination is to use the mutable structures, that is, val list: collection.mutable.LinkedList[T] instead of var list: List[T]. This is not because of a real performance improvement but because of the mutating methods that are already defined on the mutable collection.
This advice is personal and maybe not recommended when you want performance but it is a guideline I use for daily programming in scala.
I believe you can't separate the question of mutable val / immutable var from the specific use case. Let me deepen a bit: there are two questions you want to ask yourself:
How am I exposing my collection to the outside?
I want a snapshot of the current state, and this snapshot should not change regardless of the changes that are made to the entity hosting the collection. In such case, you should prefer immutable var. The reason is that the only way to do so with a mutable val is through a defensive copy.
I want a view on the state that should change to reflect changes to the original object state. In this case, you should opt for a mutable val, and return it wrapped in an unmodifiable wrapper (much like what Collections.unmodifiableList() does in Java; there is a separate question asking how to do so in Scala). I see no way to achieve this with an immutable var, so I believe your choice here is mandatory.
I only care about the absence of side effects. In this case the two approaches are very similar. With an immutable var you can directly return the internal representation, so it is slightly clearer maybe.
How am I modifying my collection?
I usually make bulk changes, namely, I set the whole collection at once. This makes immutable var a better choice: you can just assign the input to your current state, and you are done. With a mutable val, you need to first clear your collection, then copy the new contents in. Definitely worse.
I usually make pointwise changes, namely, I add/remove a single element (or few of them) to/from the collection. This is what I actually see most of the time, the collection being just an implementation detail and the trait only exposing methods for the pointwise manipulation of its status. In this case, a mutable val may be generally better performance-wise, but in case this is an issue I'd recommend taking a look at Scala's collections performance.
If it is necessary to use var lists, why not? To avoid problems you could for example limit the scope of the variable. There was a similar question a while ago with a pretty good answer: scala's mutable and immutable set when to use val and var.

Which scala mutable list to use?

This is a followup question to No Scala mutable list
I want to use a mutable list in Scala. I can choose from
scala.collection.mutable.DoubleLinkedList
scala.collection.mutable.LinkedList
scala.collection.mutable.ListBuffer
scala.collection.mutable.MutableList
Which is nice, but what is the "standard", recommended, idiomatic scala way? I just want to use a list that I can add things to on the back.
In my case, I am using a HashMap, where the "lists" (I mean this in a general sense) will be on the value side. Then I am reading something from a file, and for every line I want to find the right list in the HashMap and append the value to that list.
Depends what you need.
DoubleLinkedList is a linked list which allows you to traverse back-and-forth through the list of nodes. Use its prev and next references to go to the previous or the next node, respectively.
LinkedList is a singly linked list, so there are no prev pointers - if you only ever traverse to the next element of the list, this is what you need.
EDIT: Note that the two above are meant to be used internally as building blocks for more complicated list structures like MutableLists which support efficient append, and mutable.Queues.
The two collections above both have linear-time append operations.
ListBuffer is a buffer class. Although it is backed by a singly linked list data structure, it does not expose the next pointer to the client, so you can only traverse it using iterators and the foreach.
Its main use is, however, as a buffer and an immutable list builder - you append elements to it via +=, and when you call result, you very efficiently get back a functional immutable.List. Unlike mutable and immutable lists, both append and prepend operations are constant-time - you can append at the end via += very efficiently.
MutableList is used internally, you usually do not use it unless you plan to implement a custom collection class based on the singly linked list data structure. Mutable queues, for example, inherit this class. MutableList class also has an efficient constant-time append operation, because it maintains a reference to the last node in the list.
The documentation's Concrete Mutable Collection Classes page (or the one for 2.12) has an overview of mutable list classes, including explanations on when to use which one.
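The ListBuffer usage described above looks roughly like this. It is a sketch only; toList is used to obtain the immutable List because it is available on ListBuffer regardless of how result is declared in a given Scala version.

```scala
import scala.collection.mutable.ListBuffer

object ListBufferDemo {
  val buf = ListBuffer.empty[Int]

  buf += 2   // append at the end
  buf += 3   // append at the end
  1 +=: buf  // prepend at the front

  // Hand out an immutable List built from the buffer's contents.
  val frozen: List[Int] = buf.toList

  // Later appends to the buffer do not affect the List handed out above.
  buf += 4
}
```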
If you want to append items you shouldn't use a List at all. Lists are good when you want to prepend items. Use ArrayBuffer instead.
I just want to use a list that I can add things to on the back.
Then choose something that implements Growable. I personally suggest one of the Buffer implementations.
I stay away from LinkedList and DoubleLinkedList, as they are present mainly as underlying implementation of other collections, but have quite a few bugs up to Scala 2.9.x. Starting with Scala 2.10.0, I expect the various bug fixes have brought them up to standard. Still, they lack some methods people expect, such as +=, which you'll find on collections based on them.
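For the use case in the question (a HashMap whose values are growable "lists", filled line by line), a sketch with ArrayBuffer could look like the following. The input lines and the "key value" line format are assumptions made up for the example, since the question does not show the file.

```scala
import scala.collection.mutable

object GroupLinesDemo {
  // Hypothetical input: each line is "key value", separated by one space.
  val lines = List("a 1", "b 2", "a 3")

  val groups = mutable.HashMap.empty[String, mutable.ArrayBuffer[String]]

  for (line <- lines) {
    val parts = line.split(" ", 2)
    // Find the buffer for this key (creating it on first use) and append.
    groups.getOrElseUpdate(parts(0), mutable.ArrayBuffer.empty[String]) += parts(1)
  }
}
```

getOrElseUpdate keeps the lookup-or-create step in one call, and += appends at the back of the buffer.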