Immutable DataStructures In Scala - scala

We know that Scala supports immutable data structures..i.e each time u update the list it will create a new object and reference in the heap.
Example
val xs:List[Int] = List.apply(22)
val newList = xs ++ (33)
So when i append the second element to a list it will create a new list which will contain both 22 and 33.This exactly works like how immutable String works in Java.
So the question is each time I append a element in the list a new object will be created each time..This ldoes not look efficient to me.
is there some special data structures like persistent data structures are used when dealing with this..Does anyone know about this?

Appending to a list has O(n) complexity and is inefficient. A general approach is to prepend to a list while building it, and finally reverse it.
Now, your question on creating new object still applies to the prepend. Note that since xs is immutable, newList just points to xs for the rest of the data after the prepend.

While #manojlds is correct in his analysis, the original post asked about the efficiency of duplicating list nodes whenever you do an operation.
As #manojlds said, constructing lists often require thinking backwards, i.e., building a list and then reversing it. There are a number of other situations where list building requires "needless" copying.
To that end, there is a mutable data structure available in Scala called ListBuffer which you can use to build up your list and then extract the result as an immutable list:
val xsa = ListBuffer[Int](22)
xsa += 33
val newList = xsa.toList
However, the fact that the list data structure is, in general, immutable means that you have some very useful tools to analyze, de-compose and re-compose the list. Many builtin operations take advantage of the immutability. By extension, your own programs can also take advantage of this immutability.

Related

An efficient and composable foreach in Scala

Goal: Efficiently do something for each element in a list, and then return the original list, so that I can do something else with original list. For example, let lst be a very large list, and suppose we do many operations to it before applying our foreach. What I want to do is something like this:
lst.many_operations().foreach(x => f(x)).something_else()
However, foreach returns a unit. I seek a way to iterate through the list and return the original list supplied, so that I can do something_else() to it. To reduce the memory impact, I need to avoid saving the result of lst.many_operations() to a variable.
An obvious, but imperfect, solution is to replace foreach with map. Then the code looks like:
lst
.many_operations()
.map(x => {
f(x)
x
}).something_else()
However, this is not good because map constructs a new list, effectively duplicating the very large list that it iterated through.
What is the right way to do this in Scala?
The simplest way seems to be:
lst.foreach(many_operations)
lst.foreach(something_else)
However: using side effects is really not a good idea. I would urge you to revisit your design to use explicit pure transformations rather than side effects and mutations.
To address your concern about having multiple lists in memory at the same time, you can use view or iterator to emulate streaming processing, and discard intermediate results you do not need to use again:
val newList = lst.iterator
.map(foo)
.map(bar)
.map(baz)
.toList
(lst will get garbage collected if you do not reference it again).

scala.collection.breakOut vs views

This SO answer describes how scala.collection.breakOut can be used to prevent creating wasteful intermediate collections. For example, here we create an intermediate Seq[(String,String)]:
val m = List("A", "B", "C").map(x => x -> x).toMap
By using breakOut we can prevent the creation of this intermediate Seq:
val m: Map[String,String] = List("A", "B", "C").map(x => x -> x)(breakOut)
Views solve the same problem and in addition access elements lazily:
val m = (List("A", "B", "C").view map (x => x -> x)).toMap
I am assuming the creation of the View wrappers is fairly cheap, so my question is: Is there any real reason to use breakOut over Views?
You're going to make a trip from England to France.
With view: you're taking a set of notes in your notebook and boom, once you've called .force() you start making all of them: buy a ticket, board on the plane, ....
With breakOut: you're departing and boom, you in the Paris looking at the Eiffel tower. You don't remember how exactly you've arrived there, but you did this trip actually, just didn't make any memories.
Bad analogy, but I hope this give you a taste of what is the difference between them.
I don't think views and breakOut are identical.
A breakOut is a CanBuildFrom implementation used to simplify transformation operations by eliminating intermediary steps. E.g get from A to B without the intermediary collection. A breakOut means letting Scala choose the appropriate builder object for maximum efficiency of producing new items in a given scenario. More details here.
views deal with a different type of efficiency, the main sale pitch being: "No more new objects". Views store light references to objects to tackle different usage scenarios: lazy access etc.
Bottom line:
If you map on a view you may still get an intermediary collection of references created before the expected result can be produced. You could still have superior performance from:
collection.view.map(somefn)(breakOut)
Than from:
collection.view.map(someFn)
As of Scala 2.13, this is no longer a concern. Breakout has been removed and views are the recommended replacement.
Scala 2.13 Collections Rework
Views are also the recommended replacement for collection.breakOut.
For example,
val s: Seq[Int] = ...
val set: Set[String] = s.map(_.toString)(collection.breakOut)
can be expressed with the same performance characteristics as:
val s: Seq[Int] = ...
val set = s.view.map(_.toString).to(Set)
What flavian said.
One use case for views is to conserve memory. For example, if you had a million-character-long string original, and needed to use, one by one, all of the million suffixes of that string, you might use a collection of
val v = original.view
val suffixes = v.tails
views on the original string. Then you might loop over the suffixes one by one, using suffix.force() to convert them back to strings within the loop, thus only holding one in memory at a time. Of course, you could do the same thing by iterating with your own loop over the indices of the original string, rather than creating any kind of collection of the suffixes.
Another use-case is when creation of the derived objects is expensive, you need them in a collection (say, as values in a map), but you only will access a few, and you don't know which ones.
If you really have a case where picking between them makes sense, prefer breakOut unless there's a good argument for using view (like those above).
Views require more code changes and care than breakOut, in that you need to add force() where needed. Depending on context, failure to do so is
often only detected at run-time. With breakOut, generally if it
compiles, it's right.
In cases where view does not apply, breakOut
will be faster, since view generation and forcing is skipped.
If you use a debugger, you can inspect the collection contents, which you
can't meaningfully do with a collection of views.

Scala immutable map, when to go mutable?

My present use case is pretty trivial, either mutable or immutable Map will do the trick.
Have a method that takes an immutable Map, which then calls a 3rd party API method that takes an immutable Map as well
def doFoo(foo: String = "default", params: Map[String, Any] = Map()) {
val newMap =
if(someCondition) params + ("foo" -> foo) else params
api.doSomething(newMap)
}
The Map in question will generally be quite small, at most there might be an embedded List of case class instances, a few thousand entries max. So, again, assume little impact in going immutable in this case (i.e. having essentially 2 instances of the Map via the newMap val copy).
Still, it nags me a bit, copying the map just to get a new map with a few k->v entries tacked onto it.
I could go mutable and params.put("bar", bar), etc. for the entries I want to tack on, and then params.toMap to convert to immutable for the api call, that is an option. but then I have to import and pass around mutable maps, which is a bit of hassle compared to going with Scala's default immutable Map.
So, what are the general guidelines for when it is justified/good practice to use mutable Map over immutable Maps?
Thanks
EDIT
so, it appears that an add operation on an immutable map takes near constant time, confirming #dhg's and #Nicolas's assertion that a full copy is not made, which solves the problem for the concrete case presented.
Depending on the immutable Map implementation, adding a few entries may not actually copy the entire original Map. This is one of the advantages to the immutable data structure approach: Scala will try to get away with copying as little as possible.
This kind of behavior is easiest to see with a List. If I have a val a = List(1,2,3), then that list is stored in memory. However, if I prepend an additional element like val b = 0 :: a, I do get a new 4-element List back, but Scala did not copy the orignal list a. Instead, we just created one new link, called it b, and gave it a pointer to the existing List a.
You can envision strategies like this for other kinds of collections as well. For example, if I add one element to a Map, the collection could simply wrap the existing map, falling back to it when needed, all while providing an API as if it were a single Map.
Using a mutable object is not bad in itself, it becomes bad in a functional programming environment, where you try to avoid side-effects by keeping functions pure and objects immutable.
However, if you create a mutable object inside a function and modify this object, the function is still pure if you don't release a reference to this object outside the function. It is acceptable to have code like:
def buildVector( x: Double, y: Double, z: Double ): Vector[Double] = {
val ary = Array.ofDim[Double]( 3 )
ary( 0 ) = x
ary( 1 ) = y
ary( 2 ) = z
ary.toVector
}
Now, I think this approach is useful/recommended in two cases: (1) Performance, if creating and modifying an immutable object is a bottleneck of your whole application; (2) Code readability, because sometimes it's easier to modify a complex object in place (rather than resorting to lenses, zippers, etc.)
In addition to dhg's answer, you can take a look to the performance of the scala collections. If an add/remove operation doesn't take a linear time, it must do something else than just simply copying the entire structure. (Note that the converse is not true: it's not beacuase it takes linear time that your copying the whole structure)
I like to use collections.maps as the declared parameter types (input or return values) rather than mutable or immutable maps. The Collections maps are immutable interfaces that work for both types of implementations. A consumer method using a map really doesn't need to know about a map implementation or how it was constructed. (It's really none of its business anyway).
If you go with the approach of hiding a map's particular construction (be it mutable or immutable) from the consumers who use it then you're still getting an essentially immutable map downstream. And by using collection.Map as an immutable interface you completely remove all the ".toMap" inefficiency that you would have with consumers written to use immutable.Map typed objects. Having to convert a completely constructed map into another one simply to comply to an interface not supported by the first one really is absolutely unnecessary overhead when you think about it.
I suspect in a few years from now we'll look back at the three separate sets of interfaces (mutable maps, immutable maps, and collections maps) and realize that 99% of the time only 2 are really needed (mutable and collections) and that using the (unfortunately) default immutable map interface really adds a lot of unnecessary overhead for the "Scalable Language".

How is immutability practically implemented in the design of Scala applications?

Being new to scala and a current java developer, scala was designed to encourage the use of immutability to class design.
How does this translate practically to the design of classes? The only thing that is brought to my mind is case classes. Are case classes strongly encouraged for defining data? Example? How else is immutability encouraged in Scala design of classes?
As a java developer, classes defining data were mutable. The equivalent Scala classes should be defined as case classes?
Well, case classes certainly help, but the biggest contributor is probably the collection library. The default collections are immutable, and the methods are geared toward manipulating collections by producing new ones instead of mutating. Since the immutable collections are persistent, that doesn't require copying the whole collection, which is something one often has to do in Java.
Beyond that, for-comprehensions are monadic comprehensions, which is helpful in doing immutable tasks, there's tail recursion optimization, which is very important in immutable algorithms, and general attention to immutability in many libraries, such as parser combinators and xml.
Finally, note that you have to ask for a var to get some mutability. Parameters are immutable, and val is just as short as var. Contrast this with Java, where parameters are mutable, and you need to add a final keyword to get immutability. Whereas in Scala it is as easy or easier to stay immutable, in Java it is easier to stay mutable.
Addendum
Persistent data structures are data structures that share parts between modified versions of it. This might be a bit difficult to understand, so let's consider Scala's List, which is pretty basic and easy to understand.
A Scala List is composed of two classes, known as cons and Nil. The former is actually written :: in Scala, but I'll refer to it by the traditional name.
Nil is the empty list. It doesn't contain anything. Methods that depend on the list not being empty, such as head and tail throw exceptions, while others work ok.
Naturally, cons must then represent a non-empty list. In fact, cons has exactly two elements: a value, and a list. These elements are known as head and tail.
So a list with three elements is composed of three cons, since each cons will hold only one value, plus a Nil. It must have a Nil because a cons must point to a list. As lists are not circular, then one of the cons must point to something other than a cons.
One example of such list is this:
val list = 1 :: 2 :: 3 :: Nil
Now, the components of a Scala List are immutable. One cannot change neither the value nor the list of a cons. One benefit of immutability is that you never need to copy the collection before passing or after receiving it from some other method: you know that list cannot change.
Now, let's consider what would happen if I modified that list. Let's consider two modifications: removing the first element and prepending a new element.
We can remove one element with the method tail, whose name is not a coincidence at all. So, we write:
val list2 = list.tail
And list2 will point to the same list that list's tail is pointing. Nothing at all was created: we simply reused part of list. So, let's prepend an element to list2 then:
val list3 = 0 :: list2
We created a new cons there. This new cons has a value (a head) equal to 0, and its tail points to list2. Note that both list and list3 point to the same list2. These elements are being shared by both list and list3.
There are many other persistent data structures. The very fact that the data you are manipulating is immutable makes it easy to share components.
One can find more information about this subject on the book by Chris Okasaki, Purely Functional Data Structures, or on his freely available thesis by the same name.

Difference between MutableList and ListBuffer

What is the difference between Scala's MutableList and ListBuffer classes in scala.collection.mutable? When would you use one vs the other?
My use case is having a linear sequence where I can efficiently remove the first element, prepend, and append. What's the best structure for this?
A little explanation on how they work.
ListBuffer uses internally Nil and :: to build an immutable List and allows constant-time removal of the first and last elements. To do so, it keeps a pointer on the first and last element of the list, and is actually allowed to change the head and tail of the (otherwise immutable) :: class (nice trick allowed by the private[scala] var members of ::). Its toList method returns the normal immutable List in constant time as well, as it can directly return the structure maintained internally. It is also the default builder for immutable Lists (and thus can indeed be reasonably expected to have constant-time append). If you call toList and then again append an element to the buffer, it takes linear time with respect to the current number of elements in the buffer to recreate a new structure, as it must not mutate the exported list any more.
MutableList works internally with LinkedList instead, an (openly, not like ::) mutable linked list implementation which knows of its element and successor (like ::). MutableList also keeps pointers to the first and last element, but toList returns in linear time, as the resulting List is constructed from the LinkedList. Thus, it doesn't need to reinitialize the buffer after a List has been exported.
Given your requirements, I'd say ListBuffer and MutableList are equivalent. If you want to export their internal list at some point, then ask yourself where you want the overhead: when you export the list, and then no overhead if you go on mutating buffer (then go for MutableList), or only if you mutable the buffer again, and none at export time (then go for ListBuffer).
My guess is that in the 2.8 collection overhaul, MutableList predated ListBuffer and the whole Builder system. Actually, MutableList is predominantly useful from within the collection.mutable package: it has a private[mutable] def toLinkedList method which returns in constant time, and can thus efficiently be used as a delegated builder for all structures that maintain a LinkedList internally.
So I'd also recommend ListBuffer, as it may also get attention and optimization in the future than “purely mutable” structures like MutableList and LinkedList.
This gives you an overview of the performance characteristics: http://www.scala-lang.org/docu/files/collections-api/collections.html ; interestingly, MutableList and ListBuffer do not differ there. The documentation of MutableList says it is used internally as base class for Stack and Queue, so maybe ListBuffer is more the official class from the user perspective?
You want a list (why a list?) that is growable and shrinkable, and you want constant append and prepend. Well, Buffer, a trait, has constant append and prepend, with most other operations linear. I'm guessing that ListBuffer, a class that implements Buffer, has constant time removal of the first element.
So, my own recommendation is for ListBuffer.
First, lets go over some of the relevant types in Scala
List - An Immutable collection. A Recursive implementation i.e . i.e An instance of list has two primary elements the head and the tail, where the tail references another List.
List[T]
head: T
tail: List[T] //recursive
LinkedList - A mutable collection defined as a series of linked nodes, where each node contains a value and a pointer to the next node.
Node[T]
value: T
next: Node[T] //sequential
LinkedList[T]
first: Node[T]
List is a functional data structure (immutability) compared to LinkedList which is more standard in imperative languages.
Now, lets look at
ListBuffer - A mutable buffer implementation backed by a List.
MutableList - An implementation based on LinkedList ( Would have been more self explanatory if it had been named LinkedListBuffer instead )
They both offer similar complexity bounds on most operations.
However, if you request a List from a MutableList, then it has to convert the existing linear representation into the recursive representation which takes O(n) which is what #Jean-Philippe Pellet points out. But, if you request a Seq from MutableList the complexity is O(1).
So, IMO the choice narrows down to the specifics of your code and your preference. Though, I suspect there is a lot more List and ListBuffer out there.
Note that ListBuffer is final/sealed, while you can extend MutableList.
Depending on your application, extensibility may be useful.