What is the prefered way in using the parallel collections in Scala? - scala

At first I assumed that every collection class would receive an additional par method which would convert the collection to a fitting parallel data structure (like map returns the best collection for the element type in Scala 2.8).
Now it seems that some collection classes support a par method (e. g. Array) but others have toParSeq, toParIterable methods (e. g. List). This is a bit weird, since Array isn't used or recommended that often.
What is the reason for that? Wouldn't it be better to just have a par available on all collection classes doing the "right thing"?
If I have data which might be processed in parallel, what types should I use? The traits in scala.collection or the type of the implementation directly?
Or should I prefer Arrays now, because they seem to be cheaper to parallelize?

Lists aren't that well suited for parallel processing. The reason is that to get to the end of the list, you have to walk through every single element. Thus, you may as well just treat the list as an iterator, and thus may as well just use something more generic like toParIterable.
Any collection that has a fast index is a good candidate for parallel processing. This includes anything implementing LinearSeqOptimized, plus trees and hash tables. Array has as fast of an index as you can get, so it's a fairly natural choice. You can also use things like ArrayBuffer (which has a par method returning a ParArray).

Related

What are Builder, Combiner, and Splitter in scala?

In the parallel programming course from EPFL, four abstractions for data parallelism are mentioned: Iterator, Builder, Combiner, and Splitter.
I am familiar with Iterator, but have never used the other three. I have seen other traits Builder, Combiner, and Splitter under scala.collection package. However, I have idea how to use them in real-world development, particularly how to use them in collaboration with other collections like List, Array, ParArray, etc. Could anyone please give me some guidance and examples?
Thanks!
The two traits Iterator and Builder are not specific to parallelism, however, they provide the basis for Combiner and Splitter.
You already know that an Iterator can help you with iterating over sequential collections by providing the methods hasNext and next. A Splitter is a special case of an Iterator and helps to partition a collection into multiple disjoint subsets. The idea is that after the splitting, these subsets can be processed in parallel. You can obtain a Splitter from a parallel collection by invoking .splitter on it. The two important methods of the Splitter trait are as follows:
remaining: Int: returns the number of elements of the current collection, or at least an approximation of that number. This information is important, since it is used to decide whether or not it's worth it to split the collection. If your collection contains only a small amount of elements, then you want to process these elements sequentially instead of splitting the collection into even smaller subsets.
split: Seq[Splitter[A]]: the method that actually splits the current collection. It returns the disjoint subsets (represented as Splitters), which recursively can be splitted again if it's worth it. If the subsets are small enough, they finally can be processed (e.g. filtered or mapped).
Builders are used internally to create new (sequential) collections. A Combiner is a special case of a Builder and at the same time represents the counterpart to Splitter. While a Splitter splits your collection before it is processed in parallel, a Combiner puts together the results afterwards. You can obtain a Combiner from a parallel collection (subset) by invoking .newCombiner on it. This is done via the following method:
combine(that: Combiner[A, B]): Combiner[A, B]: combines your current collection with another collection by "merging" both Combiners. The result is a new Combiner, which either represents the final result, or gets combined again with another subset (by the way: the type parameters A and B represent the element type and the type or the resulting collection).
The thing is that you don't need to implement or even use these methods directly if you don't define a new parallel collection. The idea is that people implementing new parallel collections only need to define splitters and combiners and get a whole bunch of other operations for free, because those operations are already implemented and make use of splitters and combiners.
Of course this is only a superficial description of how those things work. For further reading, I recommend reading Architecture of the Parallel Collections Library as well as Creating Custom Parallel Collections.

Seq, SeqLike, GenSeq or GenSeqLike?

When a create a function, should I have it take as an argument Seq, SeqLike, GenSeq, or GenSeqLike? (So many choices!)
My only requirements is that I can map over it and produce a collection with the same number and order of arguments as before.
Typically I "program to interfaces" and choose the most general type possible. In this case, that would be a GenSeqLike.
Is this correct/idiomatic?
SeqLike is just an implementation layer for Seq that allows you to specify return types. There are extremely few things that are SeqLike but not Seq, and those are arguably an error. So you can feel comfortable not worrying about the -Likes. (If you want to build new collections of the type you are given and keep the types straight, use CanBuildFrom instead.)
So then the question is whether to use GenSeq or Seq. The problem with GenSeq is that the processing might be done in parallel, which means you have to avoid using any operation where that could violate your expectations (e.g. summing with a foreach). Furthermore, the general consensus seems to be that the GenX part of the collections hierarchy overcomplicates the collections and makes it more difficult to incorporate alternative choices of parallel collections. So my recommendation would be Seq unless you are pretty sure that you have use-cases where you'd like parallel processing. If you simply don't care, Seq is simpler to reason about for you and for users of the function.

Efficient way to build collections from other collections

In Scala, as in many other languages, it is possible to build collections using the elements contained in other collections.
For example, it is possible to heapify a list:
import scala.collection.mutable.PriorityQueue
val l = List(1,2,3,4)
With:
val pq = PriorityQueue(l:_*)
or:
val pq = PriorityQueue[Int]() ++ l
These are, from my point of view, two quite different approaches:
Use a variadic constructor and collection:_* which, at the end of the day, dumps the collection in an intermediate array.
Build an empty target collection and use the ++ method to add all the source collection elements.
From an aesthetic point of view I do prefer the first option but I am worried about collection:_*. I understand form "Programming In Scala" that variadic functions are translated into functions receiving an array.
Is it, in general, the second option a better solution in terms of efficiency?
The second one might be faster in some cases, but apparently when the original collection is a Seq (such as your List), Scala tries to avoid the array creation; see here.
But, realistically, it probably will not ever make a difference anyway unless you are dealing with huge collections in tight loops. These kinds of things aren't worth worrying about, so do whichever one you like; you can spare the milliseconds.

Partially sorting collections in Scala

I am trying to sort a collection of linked-list nodes. The collection contains nodes from more than one linked list; ordering must be maintained within each list, but ordering across lists does not matter.
PartialOrdering[T] seems like the natural choice, but I cannot find any standard functions within Scala that support it (e.g. .sort only takes Ordering[T]).
I've considered wrapping the former type into the latter, but realise this will actually produce erroneous results. Partial ordering cannot be abstracted-away like this as the underlying sort algorithm needs the additional information to produce correct results.
I would like to represent the elements as a SortedSet - is anyone aware of anything that can get me close?

Asymptotic behaviour of Scala methods

Is there somewhere I can find out the expected time and space complexities of operations on collections like HashSet, TreeSet, List and so on?
Is one just expected to know these from the properties of the abstract-data-types themselves?
I know of Performance characteristics for Scala collections, but this only mentions some very basic operations. Perhaps the rest of the operations for these collections are built purely from a small base-set, but then, it seems I am just expected to know that they have implemented them in this way?
The guide for the other methods should be - just think what an efficient implementation should look like.
Most other bulk-operations on collections (operations that process each element in the collection) are O(n), so they are not mentioned there. Examples are filter, map, foreach, indexOf, reverse, find ...
Methods returning iterators or streams like combinations and permutations are usually O(1).
Methods involving 2 collections are usually O(max(n, m)) or O(min(n, m)). These are zip, zipAll, sameElements, corresponds, ...
Methods union, diff, and intersect are O(n + m).
Sort variants are, naturally, O(nlogn). The groupBy is O(nlogn) in the current implementation. The indexOfSlice uses the KMP algorithm and is O(m + n), where m and n are lengths of the strings.
Methods such as +:, :+ or patch are generally O(n) as well, unless you are dealing with a specific case of an immutable collection for which the operation in question is more efficient - for example, prepending an element on a functional List or appending an element to a Vector.
Methods toX are generally O(n), as they have to iterate all the elements and create a new collection. An exception is toStream which builds the collection lazily - so it's O(1). Also, whenever X is the type of the collection toX just returns this, being O(1).
Iterator implementations should have an O(1) (amortized) next and hasNext operations. Iterator creation should be worst-case O(logn), but O(1) in most cases.
Performance characteristics of the other methods is really difficult to assert. Consider the following:
These methods are all implemented based on foreach or iterator, and at usually very high levels in the hierachy. Vector's map is implemented on collection.TraversableLike, for example.
To add insult to injury, which method implementation is used depends on the linearization of the class inheritance. This also applies to any method called as a helper. It has happened before that changes here caused unforeseen performance problems.
Since foreach and iterator are both O(n), any improved performance depends on specialization at other methods, such as size and slice.
For many of them, there's further dependency on the performance characteristics of the builder that was provided, which depends on the call site instead of the definition site.
So the result is that the place where the method is defined -- and documented -- does not have near enough information to state its performance characteristics, and may depend not only on how other methods are implemented by the inheriting collection, but even by the performance characteristics of an object, Builder, obtained from CanBuildFrom, that is passed at the call site.
At best, any such documentation would be described in terms of other methods. Which doesn't mean it isn't worthwhile, but it isn't easily done -- and hard tasks on open source projects depend on volunteers, who usually work at what they like, not what is needed.