Efficient way to build collections from other collections - scala

In Scala, as in many other languages, it is possible to build collections using the elements contained in other collections.
For example, it is possible to heapify a list:
import scala.collection.mutable.PriorityQueue
val l = List(1,2,3,4)
With:
val pq = PriorityQueue(l:_*)
or:
val pq = PriorityQueue[Int]() ++ l
These are, from my point of view, two quite different approaches:
Use a variadic constructor and collection:_* which, at the end of the day, dumps the collection in an intermediate array.
Build an empty target collection and use the ++ method to add all the source collection elements.
From an aesthetic point of view I do prefer the first option, but I am worried about collection: _*. I understand from "Programming in Scala" that variadic functions are translated into functions receiving an array.
Is the second option, in general, a better solution in terms of efficiency?

The second one might be faster in some cases, but apparently when the original collection is a Seq (such as your List), Scala tries to avoid the array creation; see here.
But, realistically, it probably will not ever make a difference anyway unless you are dealing with huge collections in tight loops. These kinds of things aren't worth worrying about, so do whichever one you like; you can spare the milliseconds.
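As a small illustration of the mechanism (howMany here is just a made-up example method, not part of either snippet above): a repeated parameter such as Int* is a Seq[Int] inside the method, so passing a List with : _* can hand the sequence through without first packing it into an array.
def howMany(xs: Int*): Int = xs.length   // xs is a Seq[Int] inside the method
val l = List(1, 2, 3, 4)
println(howMany(l: _*))                  // 4 -- the List itself is passed as the Seq argument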

Related

What goes on behind the scenes when adding immutable collections in Scala?

Working in Scala, I have encountered immutable items, for this example immutable.Map. There are times when code that I do not control (Spark) returns an immutable.Map that I want to process and add elements to. I am using the following approach because it compiles and runs. I am hoping the computer is smart enough to do this efficiently, but I do not believe I should make that assumption.
var map: immutable.Map[Int, Double] = getMapFromSomewhere()
var i = 0
while (i < 5) {
  map += (i -> 0.0)
  i += 1
}
I am hoping that this takes my new map entry and places it into memory without making a copy of the Map that then has to be cleaned up by garbage collection. Should I be creating a mutable.Map from my immutable.Map to do these types of operations instead?
When you "add" to an immutable collection, you are really creating a new Collection, which ideally and usually shares the same memory and data with the old Collection. This is safe, because since the Collections are immutable, you don't need to worry that a change in one will corrupt the other.
Your code is... not so great. That's a terribly ugly style for Scala. (Note that Map always takes two type parameters: what you are building here is an immutable.Map[Int, Double].)
Here's a less ugly way to build what you are trying to build:
(0 until 5).map(i => (i, 0.0)).toMap
or, more precisely, since you may be starting with a nonempty map:
getMapFromSomewhere() ++ (0 until 5).map(i => (i, 0.0))
Reserve mutable data structures for special cases where you really need them, and use them only if you have carefully thought through how you will manage any concurrency, or if you can guarantee there will be no concurrent access. Your default in Scala should be immutable data structures, built and manipulated in a functional style, avoiding the kind of explicit external iteration in your example. Like mutable data structures, the keyword var should be used only rarely, for special cases you have thought through carefully.
The data structures in functional programming languages are not just immutable (their contents cannot be changed once created) but also persistent, meaning that many operations reuse parts of the existing collection instead of copying it. For example, in Scala prepending an element to a List is optimized, so you should think of prepending as pushing an element onto a stack.
Similarly, other collections are optimized for other operations.
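As a quick illustrative sketch of that sharing (example values are made up): prepending with :: builds a new List whose tail is literally the old List, so nothing is copied.
val base = List(2, 3, 4)
val extended = 1 :: base          // O(1): the old list becomes the tail of the new one

println(extended)                 // List(1, 2, 3, 4)
println(extended.tail eq base)    // true -- the tail is the very same object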
Here are a few references to help you get a better understanding of persistent data structures in functional programming:
Persistent data structures in Scala
https://www.packtpub.com/mapt/book/application_development/9781783985845/3/ch03lvl1sec25/persistent-data-structures
https://www.youtube.com/watch?v=pNhBQJN44YQ
https://www.youtube.com/watch?v=T0yzrZL1py0

What is the fastest way to join two iterables (or sequences) in Scala?

Say I want to create an Iterable by joining multiple Iterables. What would be the fastest way to do so?
Documentation for ListBuffer's ++= says that it
Appends all elements produced by a TraversableOnce to this list buffer.
which sounds like it appends the elements one by one and hence should take Ω(size of iterable to append) time. I want something that joins the two Iterables in O(1) time. Does the ::: method of List run in O(1) time? The documentation simply states that it
Adds the elements of a given list in front of this list.
I also checked out the performance characteristics of Scala collections, but it says nothing about joining two collections.
You can create a Stream:
Stream(a, b, c).flatten
or concatenate iterators (this gives you Stream at runtime anyway)
(a.iterator ++ b.iterator ++ c.iterator).toIterable
(assuming that a, b and c are Iterable[Int])
That will give you something along the lines of O(number of collections to join)
Streams are evaluated on demand, so only a little memory is allocated (for closures) until you actually request elements. Converting to anything else is unavoidably O(n).
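A small sketch (with concrete example values for a, b and c) of how little work the concatenation itself does; elements are only pulled when you request them:
val a = List(1, 2, 3)
val b = Vector(4, 5)
val c = Seq(6)

// ++ on iterators just chains them; nothing is copied up front.
val it = a.iterator ++ b.iterator ++ c.iterator
println(it.take(2).toList)   // List(1, 2) -- only the requested prefix is evaluated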

What are Builder, Combiner, and Splitter in scala?

In the parallel programming course from EPFL, four abstractions for data parallelism are mentioned: Iterator, Builder, Combiner, and Splitter.
I am familiar with Iterator, but have never used the other three. I have seen the traits Builder, Combiner, and Splitter under the scala.collection package. However, I have no idea how to use them in real-world development, particularly how to use them in collaboration with other collections like List, Array, ParArray, etc. Could anyone please give me some guidance and examples?
Thanks!
The two traits Iterator and Builder are not specific to parallelism, however, they provide the basis for Combiner and Splitter.
You already know that an Iterator can help you with iterating over sequential collections by providing the methods hasNext and next. A Splitter is a special case of an Iterator and helps to partition a collection into multiple disjoint subsets. The idea is that after the splitting, these subsets can be processed in parallel. You can obtain a Splitter from a parallel collection by invoking .splitter on it. The two important methods of the Splitter trait are as follows:
remaining: Int: returns the number of elements of the current collection, or at least an approximation of that number. This information is important, since it is used to decide whether or not it's worth it to split the collection. If your collection contains only a small amount of elements, then you want to process these elements sequentially instead of splitting the collection into even smaller subsets.
split: Seq[Splitter[A]]: the method that actually splits the current collection. It returns the disjoint subsets (represented as Splitters), which can recursively be split again if that is worth it. Once the subsets are small enough, they can finally be processed (e.g. filtered or mapped).
Builders are used internally to create new (sequential) collections. A Combiner is a special case of a Builder and at the same time the counterpart to Splitter. While a Splitter splits your collection before it is processed in parallel, a Combiner puts the results back together afterwards. You can obtain a Combiner from a parallel collection (subset) by invoking .newCombiner on it. The combining itself is done via the following method:
combine(that: Combiner[A, B]): Combiner[A, B]: combines your current collection with another collection by "merging" both Combiners. The result is a new Combiner, which either represents the final result or gets combined again with another subset (by the way: the type parameters A and B represent the element type and the type of the resulting collection).
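To make the shapes of these abstractions more tangible, here is a simplified, illustrative sketch of the two traits (the real definitions in scala.collection.parallel carry more type parameters and methods):
// Simplified signatures -- not the exact library definitions.
trait Splitter[A] extends Iterator[A] {
  def remaining: Int            // (approximate) number of elements left
  def split: Seq[Splitter[A]]   // partition into disjoint sub-splitters
}

trait Combiner[A, B] {
  def +=(elem: A): this.type                          // accumulate elements, like a Builder
  def combine(that: Combiner[A, B]): Combiner[A, B]   // merge two partial results
  def result(): B                                     // produce the final collection
}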
The thing is that you don't need to implement or even use these methods directly if you don't define a new parallel collection. The idea is that people implementing new parallel collections only need to define splitters and combiners and get a whole bunch of other operations for free, because those operations are already implemented and make use of splitters and combiners.
Of course this is only a superficial description of how those things work. For further reading, I recommend reading Architecture of the Parallel Collections Library as well as Creating Custom Parallel Collections.

Scala: var List vs val MutableList

In Odersky et al's Scala book, they say to use lists. I haven't read the book cover to cover, but all the examples seem to use val List. As I understand it, one is also encouraged to use vals over vars. But in most applications is there not a trade-off between using a var List and a val MutableList? Obviously we use a val List when we can. But is it good practice to be using a lot of var Lists (or var Vectors etc)?
I'm pretty new to Scala coming from C#. There I had a lot of:
public List<T> myList {get; private set;}
collections which could easily have been declared as vals if C# had immutability built in, because the collection reference itself never changed after construction even though elements were added to and removed from the collection over its lifetime. So declaring a var collection almost feels like a step away from immutability.
In response to answers and comments: one of the strong selling points of Scala is that it can deliver many of these benefits without forcing you to completely change the way you write code, as is the case with, say, Lisp or Haskell.
Is it good practice to be using a lot of var Lists (or var Vectors etc)?
I would say it's better practice to use var with immutable collections than it is to use val with mutable ones. Off the top of my head, because
You have more guarantees about behaviour: if your object has a mutable list, you never know if some other external object is going to update it
You limit the extent of mutability; methods returning a collection will yield an immutable one, so you only have mutability within your one object
It's easy to immutabilize a var by simply assigning it to a val, whereas to make a mutable collection immutable you have to use a different collection type and rebuild it
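A short sketch of the third point (the values are just examples): freezing a var that holds an immutable List is a single assignment, whereas a mutable buffer has to be converted into a different collection type.
var working = List.empty[Int]     // var + immutable collection
working = 1 :: working
working = 2 :: working
val frozen = working              // frozen can never change; no rebuilding needed

import scala.collection.mutable.ListBuffer
val buf = ListBuffer(2, 1)        // val + mutable collection
val alsoFrozen = buf.toList       // must build a new, different collection to "freeze" it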
In some circumstances, such as time-dependent applications with extensive I/O, the simplest solution is to use mutable state. And in some circumstances, a mutable solution is just more elegant. However in most code you don't need mutability at all. The key is to use collections with higher order functions instead of looping, or recursion if a suitable function doesn't exist. This is simpler than it sounds. You just need to spend some time getting to know the methods on List (and other collections, which are mostly the same). The most important ones are:
map: applies your supplied function to each element in the collection - use instead of looping and updating values in an array
foldLeft: returns a single result from a collection - use instead of looping and updating an accumulator variable
for-yield expressions: simplify your mapping and filtering especially for nested-loop type problems
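For example, a minimal sketch (with made-up data) of those methods replacing explicit loops:
val xs = List(1, 2, 3, 4)

val doubled = xs.map(_ * 2)           // instead of looping and updating an array
val sum     = xs.foldLeft(0)(_ + _)   // instead of a mutable accumulator variable

// instead of nested loops with filtering:
val pairs = for (x <- xs; y <- xs if x < y) yield (x, y)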
Ultimately, much of functional programming is a consequence of immutability and you can't really have one without the other; however, local vars are mostly an implementation detail: there's nothing wrong with a bit of mutability so long as it cannot escape from the local scope. So use vars with immutable collections since the local vars are not what will be exported.
You are assuming either the List must be mutable, or whatever is pointing to it must be mutable. In other words, that you need to pick one of the two choices below:
val list: collection.mutable.LinkedList[T]
var list: List[T]
That is a false dichotomy. You can have both:
val list: List[T]
So the question you ought to be asking is "how do I do that?", but we can only answer that once you try it out and face a specific problem. There's no generic answer that will solve all your problems (well, there is -- monads and recursion -- but that's too generic).
So... give it a try. You might be interested in looking at Code Review, where most Scala questions pertain precisely to how to make some code more functional. But, ultimately, I think the best way to go about it is to try, and resort to Stack Overflow when you just can't figure out some specific problem.
Here is how I see this problem of mutability in functional programming.
Best solution: values are best, so the preferred style in functional programming is values and recursive functions:
def func(n: Int): List[Int] = if (n > 0) n :: func(n - 1) else Nil
val myList = func(4)   // List(4, 3, 2, 1)
Need mutable stuff: sometimes mutable structures are needed or make everything a lot easier. My inclination when facing this situation is to use the mutable structure itself, i.e. val list: collection.mutable.LinkedList[T] rather than var list: List[T]; not because of any real performance improvement, but because the mutating operations are already defined on the mutable collection.
This advice is personal and maybe not recommended when you want performance, but it is a guideline I use for daily programming in Scala.
I believe you can't separate the question of mutable val vs. immutable var from the specific use case. Let me go into a bit more depth: there are two questions you want to ask yourself:
How am I exposing my collection to the outside?
I want a snapshot of the current state, and this snapshot should not change regardless of the changes that are made to the entity hosting the collection. In such a case, you should prefer an immutable var. The reason is that the only way to achieve this with a mutable val is through a defensive copy.
I want a view on the state that should change to reflect changes to the original object's state. In this case, you should opt for a mutable val and return it wrapped through an unmodifiable wrapper (much like what Collections.unmodifiableList() does in Java; here is a question where it's asked how to do so in Scala). I see no way to achieve this with an immutable var, so I believe your choice here is mandatory.
I only care about the absence of side effects. In this case the two approaches are very similar. With an immutable var you can directly return the internal representation, so it is slightly clearer maybe.
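A rough sketch of the snapshot case from the first point (Registry is a hypothetical class, just for illustration): with an immutable Map in a var, simply returning the field already gives callers a safe snapshot, with no defensive copy needed.
class Registry {
  private var entries: Map[Int, Double] = Map.empty       // immutable collection in a var
  def add(k: Int, v: Double): Unit = entries += (k -> v)
  def snapshot: Map[Int, Double] = entries                 // later adds never affect what callers already hold
}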
How am I modifying my collection?
I usually make bulk changes, namely, I set the whole collection at once. This makes an immutable var the better choice: you can just assign what comes in to your current state, and you are done. With a mutable val, you would first need to clear your collection and then copy the new contents in. Definitely worse.
I usually make pointwise changes, namely, I add or remove a single element (or a few of them) to/from the collection. This is what I actually see most of the time: the collection is just an implementation detail and the trait only exposes methods for the pointwise manipulation of its state. In this case, a mutable val may generally be better performance-wise, but in case this is an issue I'd recommend taking a look at the performance characteristics of Scala's collections.
If it is necessary to use var lists, why not? To avoid problems you could for example limit the scope of the variable. There was a similar question a while ago with a pretty good answer: scala's mutable and immutable set when to use val and var.

What is the preferred way of using the parallel collections in Scala?

At first I assumed that every collection class would receive an additional par method which would convert the collection to a fitting parallel data structure (just as map returns the best collection for the element type in Scala 2.8).
Now it seems that some collection classes support a par method (e.g. Array) but others have toParSeq, toParIterable methods (e.g. List). This is a bit weird, since Array isn't used or recommended that often.
What is the reason for that? Wouldn't it be better to just have a par available on all collection classes doing the "right thing"?
If I have data which might be processed in parallel, what types should I use? The traits in scala.collection or the type of the implementation directly?
Or should I prefer Arrays now, because they seem to be cheaper to parallelize?
Lists aren't that well suited for parallel processing. The reason is that to get to the end of the list, you have to walk through every single element. Thus, you may as well just treat the list as an iterator, and thus may as well just use something more generic like toParIterable.
Any collection that has a fast index is a good candidate for parallel processing. This includes anything implementing IndexedSeqOptimized, plus trees and hash tables. Array has as fast an index as you can get, so it's a fairly natural choice. You can also use things like ArrayBuffer (which has a par method returning a ParArray).
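A hedged example of the difference in practice (this assumes Scala 2.9-2.12, where .par is available on ordinary collections; in 2.13+ the parallel collections live in the separate scala-parallel-collections module):
val arr = Array.tabulate(1000)(identity)
val fromArray = arr.par.map(_ * 2)   // cheap: the array already supports fast random access

val xs = List.range(0, 1000)
val fromList = xs.par.map(_ * 2)     // the list must first be copied into an indexed parallel collection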