How is Scala so efficient with lists?

It's usually considered bad practice to create unnecessary collections in Java, since doing so costs memory and CPU time. Scala seems to handle this efficiently and encourages the use of immutable data structures.
How is Scala so efficient with Lists? What techniques are used to achieve that?

While the comments are correct that the claim that List is particularly efficient is a dubious one, it does much better than making a full copy of the collection for every operation, as you would have to with Java's standard collections.
The reason is that List and the other immutable collections are not just mutable collections whose mutation methods return a copy; they are designed with immutability in mind. They take advantage of something called "structural sharing". If parts of a collection remain the same after a change, those parts don't need to be copied, and the same object can be shared across multiple collections. This works because of immutability: there is no chance that the shared parts could be altered, so it's safe to share them.
Imagine the simplest example, prepending to a list.
You have a List(1,2,3) and you want to prepend 0
val original = List(1,2,3)
val updated = 0 :: original
Your list would then look something like this:
updated     original
    \           \
     0 - - - - - 1 - - - 2 - - - 3
All that's needed is to create a new node and point its tail to the head of your original list. Nothing needs to be copied. Similarly, the tail and drop operations just return a reference to the appropriate node, so again nothing is copied. This is why List can be quite good at prepend and tail operations: it doesn't do any copying even though it creates a "new" List.
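The sharing can be checked directly with reference equality (`eq`). This is a minimal sketch; the object name `PrependSharing` and its field names are only for illustration:

```scala
object PrependSharing {
  val original = List(1, 2, 3)
  val updated  = 0 :: original

  // The new list's tail is the very same object as the original list,
  // so prepending allocated a single new node.
  val tailIsShared: Boolean = updated.tail eq original

  // drop walks the existing nodes and returns one of them; nothing is copied.
  val dropIsShared: Boolean = updated.drop(1) eq original
}
```

Both flags are true, while `original` still contains only 1, 2, 3.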
Other List operations do require some amount of copying, but always as little as possible. As long as part of the tail of a list is unchanged, it doesn't need to be copied. For example, when concatenating lists, the first list needs to be copied, but then its tail can just point to the head of the second, so the second list doesn't need to be copied at all. This is why, when concatenating a long and a short list, it's better to put the shorter list on the "left", as it is the only one that needs to be copied.
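A small sketch of the concatenation case, assuming the standard library's `:::` on List; the names `ConcatSharing`, `short` and `long` are made up for the example:

```scala
object ConcatSharing {
  val short = List(1, 2)
  val long  = List(3, 4, 5, 6, 7)

  // Only the two nodes of `short` are copied here.
  val joined = short ::: long

  // Skipping past the copied prefix lands on the original `long` list itself.
  val suffixIsShared: Boolean = joined.drop(short.length) eq long
}
```

With the operands swapped (`long ::: short`), all five nodes of `long` would have to be copied instead.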
Other types of collections are better at different operations. Vector, for example, can both prepend and append in amortized constant time, and it has good random access and update capabilities (though still much worse than a raw mutable array). In most cases it will be more efficient than List while still being immutable. Its implementation is quite complicated: it uses a trie data structure with many internal arrays to store the data. The unchanged arrays can be shared, and only the ones that an update operation alters need to be copied.
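For comparison, here is what those Vector operations look like in use. This only shows the API and that the original stays untouched; it does not demonstrate the internal sharing. `VectorDemo` is an illustrative name:

```scala
object VectorDemo {
  val v = Vector(1, 2, 3)

  val prepended = 0 +: v            // new Vector with 0 in front
  val appended  = v :+ 4            // new Vector with 4 at the end
  val replaced  = v.updated(1, 20)  // new Vector with index 1 replaced
  val second    = v(1)              // indexed access on the original
}
```

Every operation returns a new Vector, and `v` itself is still `Vector(1, 2, 3)` afterwards.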

Related

Is there any benefit of working with an Iterator over a List

Is there any benefit to manipulating an Iterator rather than a List?
I need to know whether concatenating two Iterators is better than concatenating two Lists.
In short, what is the fundamental difference between working with an iterator and working with the actual collection?
An Iterator isn't an actual data structure, although it behaves similarly to one. It is just a traversal pointer into some actual data structure. Thus, unlike an actual data structure, an Iterator can't "go back", that is, access old elements. Once you've gone through an Iterator, you're done.
What's cool about Iterator is that you can apply map, filter, or other transformations to it, and instead of modifying any existing data structure, it applies the transformation the next time you ask for an element.
"Concatenating" two Iterators creates a new Iterator that wraps both of them.
On the other hand, Lists are actual collections and can be re-traversed.
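A short sketch of these points, using a counter to observe when the mapping function actually runs. The object and field names are illustrative:

```scala
object IteratorDemo {
  var evaluated = 0

  // map on an Iterator does no work yet; it only wraps the source iterator.
  val doubled = Iterator(1, 2, 3).map { x => evaluated += 1; x * 2 }
  val countBeforeNext = evaluated

  // The function runs only when an element is requested.
  val first = doubled.next()
  val countAfterNext = evaluated

  // Concatenation wraps both iterators without copying any elements.
  val combined = Iterator(1, 2) ++ Iterator(3, 4)
  val all = combined.toList

  // After one full traversal the iterator is used up.
  val stillHasElements = combined.hasNext
}
```

The List stored in `all`, by contrast, can be traversed as many times as needed.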

What goes on behind the scenes when adding immutable collections in Scala?

Working in Scala, I have encountered immutable items, in this example an immutable.Map. There are times when code that I do not control (Spark) returns an immutable.Map that I want to process and add elements to. I am using the following approach because it compiles and runs. I am hoping the computer is smart enough to do this efficiently, but I do not believe I should make that assumption.
var map: immutable.Map[Int, Double] = getMapFromSomewhere()
var i = 0
while (i < 5) {
  map += (i -> 0.0)
  i += 1
}
I am hoping that this takes my new map item, places it into memory and does not make a copy of Map, that has to be cleaned up by garbage collection. Should I be creating a mutable.Map from my immutable.Map to do these types of operations instead?
When you "add" to an immutable collection, you are really creating a new collection, which ideally and usually shares most of its memory and data with the old one. This is safe because the collections are immutable, so you don't need to worry that a change in one will corrupt the other.
Your code is... not so great. That's a terribly ugly style for Scala, and your types are off. (There's no such thing as "immutable.Map[Double]", since Map takes two type parameters. You are building an immutable.Map[Int,Double], I guess.)
Here's a less ugly way to build what you are trying to build:
(0 until 5).map( i => (i, 0.0) ).toMap
or, more precisely, since you may be starting with a nonempty map
getMapFromSomewhere() ++ (0 until 5).map(i => (i, 0.0))
Reserve mutable data structures for special cases where you really need them, and use them only if you have carefully thought through how you will manage any concurrency or if you can guarantee there will be no concurrent access. Your default in Scala should be immutable data structures built and manipulated in a functional style, avoiding explicit external iteration of the sort in your example. You should use the keyword "var" only rarely; like mutable data structures, it is for special cases you have thought through carefully.
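To make the "new collection, old one untouched" behaviour concrete, here is a minimal sketch. `getMapFromSomewhere` is a hypothetical stand-in for the function in the question, returning a made-up single-entry map:

```scala
object ImmutableMapDemo {
  // Hypothetical stand-in for the question's getMapFromSomewhere().
  def getMapFromSomewhere(): Map[Int, Double] = Map(10 -> 1.5)

  val original = getMapFromSomewhere()

  // `+` never modifies `original`; it returns a new map.
  val plusOne = original + (0 -> 0.0)

  // Adding several entries at once, with no var and no while loop.
  val extended = original ++ (0 until 5).map(i => i -> 0.0)
}
```

After this runs, `original` still has its single entry, while `extended` has that entry plus the five new ones.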
The data structures in functional programming languages are not simply immutable (they cannot be changed once created) but also persistent. Persistent means that a new version reuses the structure of the existing collection for some operations. For example, in Scala prepending an element to a List is optimized this way (so when you use a List, think of adding an element as pushing it onto a stack).
Similarly, other collections are optimized as well for other operations.
Here are a few references that should help you understand persistent data structures in functional programming:
Persistent data structures in Scala
https://www.packtpub.com/mapt/book/application_development/9781783985845/3/ch03lvl1sec25/persistent-data-structures
https://www.youtube.com/watch?v=pNhBQJN44YQ
https://www.youtube.com/watch?v=T0yzrZL1py0

Scala's toList function appears to be slow

I was under the impression that calling seq.toList() on an immutable Seq would be making a new list which is sharing the structural state from the first list. We're finding that this could be really slow and I'm not sure why. It is just sharing the structural state, correct? I can't see why it'd be making an n-time copy of all the elements when it knows they'll never change.
A List in Scala is a particular data structure: instances of :: each containing a value, followed by Nil at the end of the chain.
If you toList a List, it will take O(1) time. If you toList on anything else then it must be converted into a List, which involves O(n) object allocations (all the :: instances).
So you have to ask whether you really want a scala.collection.immutable.List. That's what toList gives you.
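The difference is easy to observe with reference equality. A small sketch with illustrative names:

```scala
object ToListDemo {
  val list   = List(1, 2, 3)
  val vector = Vector(1, 2, 3)

  // A List is already a List: toList hands back the same instance.
  val sameInstance: Boolean = list.toList eq list

  // A Vector has no :: cells to share, so a new cell is allocated per element.
  val converted: List[Int] = vector.toList
}
```

If the conversion shows up in a profile, it may be worth asking whether the code really needs a List, or whether a more general type such as Seq would do.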
Sharing structural state is possible for particular operations on particular data structures.
With the List data structure in Scala, my understanding is that every element refers to the next, starting from the head through the tail, so a singly linked list.
From a structural state sharing perspective, consider the restrictions placed on this from the internal data structure perspective. Adding an element to the head of a List (X) effectively creates a new list (X') with the new element as the head of X' and the old list (X) as the tail. For this particular operation, internal state can be shared completely.
The same operation can be applied to create a new List (X') with the new element as the head of X' and any element of X as the start of its tail, as long as you accept that the tail will be the element you choose from X plus all the elements that already follow it in its data structure.
When you think about it logically, each data structure has an internal structure that allows some operations to be performed with simple shared internal structure and other operations requiring more invasive and costly computations.
The key from my perspective here is having an understanding of the constraints placed on the operations by the internal data structure itself.
For example, consider the same operations above on a doubly linked list data structure and you will see that there are quite different restrictions.
Personally, I find drawing out an understanding of the internal structure can be helpful in understanding the consequences of particular operations.
In the case of the toList operation on an arbitrary sequence, with no knowledge of that sequence's internal data structure, one therefore has to assume O(n). List.toList has the obvious performance advantage of already being a list.

Which scala mutable list to use?

This is a followup question to No Scala mutable list
I want to use a mutable list in Scala. I can choose from
scala.collection.mutable.DoubleLinkedList
scala.collection.mutable.LinkedList
scala.collection.mutable.ListBuffer
scala.collection.mutable.MutableList
Which is nice, but what is the "standard", recommended, idiomatic scala way? I just want to use a list that I can add things to on the back.
In my case, I am using a HashMap, where the "lists" (I mean this in a general sense) will be on the value side. Then I am reading something from a file, and for every line I want to find the right list in the hashmap and append the value to it.
Depends what you need.
DoubleLinkedList is a linked list which allows you to traverse back-and-forth through the list of nodes. Use its prev and next references to go to the previous or the next node, respectively.
LinkedList is a singly linked list, so there are no prev pointers. If you only ever traverse to the next element of the list, this is what you need.
EDIT: Note that the two above are meant to be used internally as building blocks for more complicated list structures like MutableLists which support efficient append, and mutable.Queues.
The two collections above both have linear-time append operations.
ListBuffer is a buffer class. Although it is backed by a singly linked list data structure, it does not expose the next pointer to the client, so you can only traverse it using iterators and the foreach.
Its main use is, however, as a buffer and an immutable list builder - you append elements to it via +=, and when you call result, you very efficiently get back a functional immutable.List. Unlike mutable and immutable lists, both append and prepend operations are constant-time - you can append at the end via += very efficiently.
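A minimal sketch of that builder usage, assuming only the standard `ListBuffer` API (`+=`, `+=:`, `toList`):

```scala
import scala.collection.mutable.ListBuffer

object ListBufferDemo {
  val buf = ListBuffer.empty[Int]

  buf += 1    // append at the end
  buf += 2
  0 +=: buf   // prepend at the front

  // Freeze the accumulated elements into an immutable List.
  val frozen: List[Int] = buf.toList
}
```

Here `frozen` is `List(0, 1, 2)`, and later changes to `buf` do not affect it.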
MutableList is used internally, you usually do not use it unless you plan to implement a custom collection class based on the singly linked list data structure. Mutable queues, for example, inherit this class. MutableList class also has an efficient constant-time append operation, because it maintains a reference to the last node in the list.
The documentation's Concrete Mutable Collection Classes page (or the one for 2.12) has an overview of mutable list classes, including explanations on when to use which one.
If you want to append items you shouldn't use a List at all. Lists are good when you want to prepend items. Use ArrayBuffer instead.
I just want to use a list that I can add things to on the back.
Then choose something that implements Growable. I personally suggest one of the Buffer implementations.
I stay away from LinkedList and DoubleLinkedList, as they are present mainly as the underlying implementation of other collections, and they had quite a few bugs up to Scala 2.9.x. Starting with Scala 2.10.0, I expect the various bug fixes have brought them up to standard. Still, they lack some methods people expect, such as +=, which you'll find on the collections based on them.
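For the scenario in the question (a HashMap whose values are growable lists), one possible sketch uses `getOrElseUpdate` with an ArrayBuffer per key. The `lines` value and its "key value" format are made up here to stand in for the file:

```scala
import scala.collection.mutable

object GroupLinesDemo {
  // Hypothetical input standing in for the lines read from the file.
  val lines = List("a 1", "b 2", "a 3")

  val groups = mutable.HashMap.empty[String, mutable.ArrayBuffer[Int]]

  for (line <- lines) {
    val parts = line.split(" ")
    // Look up the buffer for this key, creating it on first use, then append.
    groups.getOrElseUpdate(parts(0), mutable.ArrayBuffer.empty[Int]) += parts(1).toInt
  }
}
```

A ListBuffer would work the same way in place of ArrayBuffer if an immutable List is wanted at the end.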

Why no immutable double linked list in Scala collections?

Looking at this question, where the questioner is interested in the first and last instances of some element in a List, it seems a more efficient solution would be to use a DoubleLinkedList that could search backwards from the end of the list. However there is only one implementation in the collections API and it's mutable.
Why is there no immutable version?
Because you would have to copy the whole list each time you want to make a change. With a normal linked list, you can at least prepend to the list without having to copy everything. And if you do want to copy everything on every change, you don't need a linked list for that. You can just use an immutable array.
There are many impediments to such a structure, but one is very pressing: a doubly linked list cannot be persistent.
The logic behind this is pretty simple: from any node on the list, you can reach any other node. So, if I added an element X to this list DL, and tried to use a part of DL, I'd face this contradiction: from the node pointing to X one can reach every element in part(DL), but, by the properties of the doubly linked list, that means from any element of part(DL) I can reach the node pointing to X. Since part(DL) is supposed to be immutable and part of DL, and since DL did not include the node pointing to X, that just cannot be.
Non-persistent immutable data structures might have some uses, but they are generally bad for most operations, since they need to be recreated whenever a derivative is produced.
Now, there's the minor matter of creating mutually referencing strict objects, but this is surmountable. One can use by-name parameters and lazy vals, or one can do as Scala's List does: actually create a mutable collection, and then "freeze" it in an immutable state (see ListBuffer and its toList method).
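The by-name and lazy val approach can be sketched as follows. The `Node` class is hypothetical and written only for this illustration; it is not part of the standard library:

```scala
object CircularDemo {
  // The neighbours are passed by name, so they are not evaluated when the
  // node is constructed, only when prev or next is first accessed.
  class Node(val value: Int, prevNode: => Option[Node], nextNode: => Option[Node]) {
    lazy val prev: Option[Node] = prevNode
    lazy val next: Option[Node] = nextNode
  }

  // Two nodes that refer to each other, with no visible mutation.
  lazy val a: Node = new Node(1, None, Some(b))
  lazy val b: Node = new Node(2, Some(a), None)
}
```

This gives two mutually referencing nodes, but it does not make the list persistent: adding a node would still require rebuilding every other node, for the reason given above.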
Because it is logically impossible to create a mutually (circular) referential data-structure with strict immutability.
You cannot create two nodes that point to each other due to simple existential ordering priority, in that at least one of the nodes will not exist when the other is created.
It is possible to get this circularity with tricks involving laziness (which is implemented with mutation), but the real question then becomes why you would want this thing in the first place?
As others have noted, there is no persistent implementation of a double-linked list. You will need some kind of tree to get close to the characteristics you want.
In particular, you may want to look at finger trees, which provide O(1) access to the front and back, amortized O(1) insertion to the front and back, and O(log n) insertion elsewhere. (That's in contrast to most other commonly-used trees which have O(log n) access and insertion everywhere.)
See also:
video explanation of finger trees (by the implementor of finger trees in clojure.contrib)
finger tree implementation in Scala (I haven't used it personally, but it's the top google hit)
As a supplement to the answer of @KimStebel, I'd like to add:
If you are searching for a data structure suitable for the problem that motivated this question, then you might have a look at Extreme Cleverness: Functional Data Structures in Scala by @DanielSpiewak.