What are Builder, Combiner, and Splitter in scala? - scala

In the parallel programming course from EPFL, four abstractions for data parallelism are mentioned: Iterator, Builder, Combiner, and Splitter.
I am familiar with Iterator, but have never used the other three. I have seen other traits Builder, Combiner, and Splitter under scala.collection package. However, I have idea how to use them in real-world development, particularly how to use them in collaboration with other collections like List, Array, ParArray, etc. Could anyone please give me some guidance and examples?
Thanks!

The two traits Iterator and Builder are not specific to parallelism, however, they provide the basis for Combiner and Splitter.
You already know that an Iterator can help you with iterating over sequential collections by providing the methods hasNext and next. A Splitter is a special case of an Iterator and helps to partition a collection into multiple disjoint subsets. The idea is that after the splitting, these subsets can be processed in parallel. You can obtain a Splitter from a parallel collection by invoking .splitter on it. The two important methods of the Splitter trait are as follows:
remaining: Int: returns the number of elements of the current collection, or at least an approximation of that number. This information is important, since it is used to decide whether or not it's worth it to split the collection. If your collection contains only a small amount of elements, then you want to process these elements sequentially instead of splitting the collection into even smaller subsets.
split: Seq[Splitter[A]]: the method that actually splits the current collection. It returns the disjoint subsets (represented as Splitters), which recursively can be splitted again if it's worth it. If the subsets are small enough, they finally can be processed (e.g. filtered or mapped).
Builders are used internally to create new (sequential) collections. A Combiner is a special case of a Builder and at the same time represents the counterpart to Splitter. While a Splitter splits your collection before it is processed in parallel, a Combiner puts together the results afterwards. You can obtain a Combiner from a parallel collection (subset) by invoking .newCombiner on it. This is done via the following method:
combine(that: Combiner[A, B]): Combiner[A, B]: combines your current collection with another collection by "merging" both Combiners. The result is a new Combiner, which either represents the final result, or gets combined again with another subset (by the way: the type parameters A and B represent the element type and the type or the resulting collection).
The thing is that you don't need to implement or even use these methods directly if you don't define a new parallel collection. The idea is that people implementing new parallel collections only need to define splitters and combiners and get a whole bunch of other operations for free, because those operations are already implemented and make use of splitters and combiners.
Of course this is only a superficial description of how those things work. For further reading, I recommend reading Architecture of the Parallel Collections Library as well as Creating Custom Parallel Collections.

Related

What is the fastest way to join two iterables (or sequences) in Scala?

Say I want to create an Iterable by joining multiple Iterables. What would be the fastest way to do so?
Documentation for ListBuffer's ++= says that it
Appends all elements produced by a TraversableOnce to this list buffer.
which sounds like it appends the elements one by one and hence should take Ω(size of iterable to append) time. I want something that joins the two Iterables in O(1) time. Does ::: method of List runs in O(1) time? Documentation simply states that it
Adds the elements of a given list in front of this list.
I also checked out performance characteristics of scala collections but it says nothing about joining two collections.
You can create a Stream:
Stream(a, b, c).flatten
or concatenate iterators (this gives you Stream at runtime anyway)
(a.iterator ++ b.iterator ++ c.iterator).toIterable
(assuming that a, b and c are Iterable[Int])
That will give you something along the lines of O(number of collections to join)
Streams are evaluated on demand, so only little memory allocation will happen for closures until you actually request any elements. Converting to anything else is unavoidably O(n).

Efficient way to build collections from other collections

In Scala, as in many other languages, it is possible to build collections using the elements contained in other collections.
For example, it is possible to heapify a list:
import scala.collection.mutable.PriorityQueue
val l = List(1,2,3,4)
With:
val pq = PriorityQueue(l:_*)
or:
val pq = PriorityQueue[Int]() ++ l
These are, from my point of view, two quite different approaches:
Use a variadic constructor and collection:_* which, at the end of the day, dumps the collection in an intermediate array.
Build an empty target collection and use the ++ method to add all the source collection elements.
From an aesthetic point of view I do prefer the first option but I am worried about collection:_*. I understand form "Programming In Scala" that variadic functions are translated into functions receiving an array.
Is it, in general, the second option a better solution in terms of efficiency?
The second one might be faster in some cases, but apparently when the original collection is a Seq (such as your List), Scala tries to avoid the array creation; see here.
But, realistically, it probably will not ever make a difference anyway unless you are dealing with huge collections in tight loops. These kinds of things aren't worth worrying about, so do whichever one you like; you can spare the milliseconds.

Asymptotic behaviour of Scala methods

Is there somewhere I can find out the expected time and space complexities of operations on collections like HashSet, TreeSet, List and so on?
Is one just expected to know these from the properties of the abstract-data-types themselves?
I know of Performance characteristics for Scala collections, but this only mentions some very basic operations. Perhaps the rest of the operations for these collections are built purely from a small base-set, but then, it seems I am just expected to know that they have implemented them in this way?
The guide for the other methods should be - just think what an efficient implementation should look like.
Most other bulk-operations on collections (operations that process each element in the collection) are O(n), so they are not mentioned there. Examples are filter, map, foreach, indexOf, reverse, find ...
Methods returning iterators or streams like combinations and permutations are usually O(1).
Methods involving 2 collections are usually O(max(n, m)) or O(min(n, m)). These are zip, zipAll, sameElements, corresponds, ...
Methods union, diff, and intersect are O(n + m).
Sort variants are, naturally, O(nlogn). The groupBy is O(nlogn) in the current implementation. The indexOfSlice uses the KMP algorithm and is O(m + n), where m and n are lengths of the strings.
Methods such as +:, :+ or patch are generally O(n) as well, unless you are dealing with a specific case of an immutable collection for which the operation in question is more efficient - for example, prepending an element on a functional List or appending an element to a Vector.
Methods toX are generally O(n), as they have to iterate all the elements and create a new collection. An exception is toStream which builds the collection lazily - so it's O(1). Also, whenever X is the type of the collection toX just returns this, being O(1).
Iterator implementations should have an O(1) (amortized) next and hasNext operations. Iterator creation should be worst-case O(logn), but O(1) in most cases.
Performance characteristics of the other methods is really difficult to assert. Consider the following:
These methods are all implemented based on foreach or iterator, and at usually very high levels in the hierachy. Vector's map is implemented on collection.TraversableLike, for example.
To add insult to injury, which method implementation is used depends on the linearization of the class inheritance. This also applies to any method called as a helper. It has happened before that changes here caused unforeseen performance problems.
Since foreach and iterator are both O(n), any improved performance depends on specialization at other methods, such as size and slice.
For many of them, there's further dependency on the performance characteristics of the builder that was provided, which depends on the call site instead of the definition site.
So the result is that the place where the method is defined -- and documented -- does not have near enough information to state its performance characteristics, and may depend not only on how other methods are implemented by the inheriting collection, but even by the performance characteristics of an object, Builder, obtained from CanBuildFrom, that is passed at the call site.
At best, any such documentation would be described in terms of other methods. Which doesn't mean it isn't worthwhile, but it isn't easily done -- and hard tasks on open source projects depend on volunteers, who usually work at what they like, not what is needed.

With parallel collection, does aggregate respect order?

in scala, i have a parallel Iterable of items and i want to iterate over them and aggregate the results in some way, but in order. i'll simplify my use case and say that we start with an Iterable of integers and want to concatenate the string representation of them in paralle, with the result in order.
is this possible with either fold or aggregate? it's unclear from the documentation which methods work parallelized but maintain order.
Yes, order is gauranteed to be preserved for fold/aggregate/reduce operations on parallel collections. This is not very well documented. The trick is that the operation you which to fold over must be associative (and thus capable of being arbitrarily split up and recombined), but need not be commutative (and so not capable of being safely reordered). String concatenation is a perfect example of an associative, non-commutative operation, so the fold can be done in parallel.
val concat = myParallelList.map(_.toString).reduce(_+_)
For folds: foldRight and foldLeft cannot be processed in parallel, you'll need to use the new fold method (more info there).
Like fold, aggregate can do its work in parallel: it “traverses the elements in different partitions sequentially” (Scaladoc), though it looks like you have no direct influence on how the partitions are chosen.
I THINK the preservation of 'order' in the sense of the comment to Jean-Philippe Pellets answer is guaranteed due to the way parallel collections are implemented according to a publication of Odersky (http://infoscience.epfl.ch/record/150220/files/pc.pdf) IFF the part that splits your collection is behaving well with respect to order.
i.e. if you have elements a < b < c and a and c end up in one partition it follows that b is in the same partition as well.
I don't remember what exactly was the part responsible for the splitting, but if you find it, you might sufficient information in its documentation or source code in order to answer your question.

What is the prefered way in using the parallel collections in Scala?

At first I assumed that every collection class would receive an additional par method which would convert the collection to a fitting parallel data structure (like map returns the best collection for the element type in Scala 2.8).
Now it seems that some collection classes support a par method (e. g. Array) but others have toParSeq, toParIterable methods (e. g. List). This is a bit weird, since Array isn't used or recommended that often.
What is the reason for that? Wouldn't it be better to just have a par available on all collection classes doing the "right thing"?
If I have data which might be processed in parallel, what types should I use? The traits in scala.collection or the type of the implementation directly?
Or should I prefer Arrays now, because they seem to be cheaper to parallelize?
Lists aren't that well suited for parallel processing. The reason is that to get to the end of the list, you have to walk through every single element. Thus, you may as well just treat the list as an iterator, and thus may as well just use something more generic like toParIterable.
Any collection that has a fast index is a good candidate for parallel processing. This includes anything implementing LinearSeqOptimized, plus trees and hash tables. Array has as fast of an index as you can get, so it's a fairly natural choice. You can also use things like ArrayBuffer (which has a par method returning a ParArray).