Seq, SeqLike, GenSeq or GenSeqLike?

When I create a function, should I have it take as an argument Seq, SeqLike, GenSeq, or GenSeqLike? (So many choices!)
My only requirement is that I can map over it and produce a collection with the same number and order of elements as before.
Typically I "program to interfaces" and choose the most general type possible. In this case, that would be a GenSeqLike.
Is this correct/idiomatic?

SeqLike is just an implementation layer for Seq that allows you to specify return types. There are extremely few things that are SeqLike but not Seq, and those are arguably an error. So you can feel comfortable not worrying about the -Likes. (If you want to build new collections of the type you are given and keep the types straight, use CanBuildFrom instead.)
So then the question is whether to use GenSeq or Seq. The problem with GenSeq is that the processing might be done in parallel, which means you have to avoid using any operation where that could violate your expectations (e.g. summing with a foreach). Furthermore, the general consensus seems to be that the GenX part of the collections hierarchy overcomplicates the collections and makes it more difficult to incorporate alternative choices of parallel collections. So my recommendation would be Seq unless you are pretty sure that you have use-cases where you'd like parallel processing. If you simply don't care, Seq is simpler to reason about for you and for users of the function.
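As a minimal sketch of that recommendation (the function names here are hypothetical, not from the question): a plain Seq parameter already gives you a map that preserves length and element order, and a CanBuildFrom variant can be used when the caller's concrete collection type should also survive in the static result type.

def doubled(xs: Seq[Int]): Seq[Int] = xs.map(_ * 2)

doubled(List(1, 2, 3))   // List(2, 4, 6)
doubled(Vector(1, 2, 3)) // Vector(2, 4, 6) at runtime, typed as Seq[Int]

import scala.collection.generic.CanBuildFrom

// Keeps the concrete type in the signature: Vector in, Vector out.
def doubledKeepType[Repr <: Seq[Int], That](xs: Repr)(implicit bf: CanBuildFrom[Repr, Int, That]): That = {
  val b = bf(xs)
  xs.foreach(x => b += x * 2)
  b.result()
}

doubledKeepType(Vector(1, 2, 3)) // statically a Vector[Int]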

Related

Efficient way to build collections from other collections

In Scala, as in many other languages, it is possible to build collections using the elements contained in other collections.
For example, it is possible to heapify a list:
import scala.collection.mutable.PriorityQueue
val l = List(1,2,3,4)
With:
val pq = PriorityQueue(l:_*)
or:
val pq = PriorityQueue[Int]() ++ l
These are, from my point of view, two quite different approaches:
Use a variadic constructor and collection:_* which, at the end of the day, dumps the collection into an intermediate array.
Build an empty target collection and use the ++ method to add all the source collection elements.
From an aesthetic point of view I do prefer the first option but I am worried about collection:_*. I understand from "Programming in Scala" that variadic functions are translated into functions receiving an array.
Is the second option, in general, a better solution in terms of efficiency?
The second one might be faster in some cases, but apparently when the original collection is a Seq (such as your List), Scala tries to avoid the array creation; see here.
But, realistically, it probably will not ever make a difference anyway unless you are dealing with huge collections in tight loops. These kinds of things aren't worth worrying about, so do whichever one you like; you can spare the milliseconds.

Partially sorting collections in Scala

I am trying to sort a collection of linked-list nodes. The collection contains nodes from more than one linked list; ordering must be maintained within each list, but ordering across lists does not matter.
PartialOrdering[T] seems like the natural choice, but I cannot find any standard functions within Scala that support it (e.g. .sort only takes Ordering[T]).
I've considered wrapping a PartialOrdering[T] in an Ordering[T], but realise this will actually produce erroneous results. A partial ordering cannot be abstracted away like this, as the underlying sort algorithm needs the additional information to produce correct results.
I would like to represent the elements as a SortedSet - is anyone aware of anything that can get me close?
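To illustrate why such a wrapper goes wrong, here is a rough sketch (the wrapper itself is hypothetical): collapsing "incomparable" into "equal" breaks the total-order contract (transitivity of equality) that sorted and sortWith rely on, so the sort is free to reorder elements that the partial order requires to keep in their original relative positions.

import scala.math.{Ordering, PartialOrdering}

// Naive and broken: incomparable elements are reported as equal.
def unsafeTotal[T](po: PartialOrdering[T]): Ordering[T] = new Ordering[T] {
  def compare(a: T, b: T): Int = po.tryCompare(a, b).getOrElse(0)
}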

Asymptotic behaviour of Scala methods

Is there somewhere I can find out the expected time and space complexities of operations on collections like HashSet, TreeSet, List and so on?
Is one just expected to know these from the properties of the abstract data types themselves?
I know of Performance characteristics for Scala collections, but this only mentions some very basic operations. Perhaps the rest of the operations for these collections are built purely from a small base-set, but then, it seems I am just expected to know that they have implemented them in this way?
The guideline for the other methods should be: just think about what an efficient implementation would look like.
Most other bulk operations on collections (operations that process each element in the collection) are O(n), so they are not mentioned there. Examples are filter, map, foreach, indexOf, reverse, find ...
Methods returning iterators or streams like combinations and permutations are usually O(1).
Methods involving 2 collections are usually O(max(n, m)) or O(min(n, m)). These are zip, zipAll, sameElements, corresponds, ...
Methods union, diff, and intersect are O(n + m).
Sort variants are, naturally, O(n log n). groupBy is O(n log n) in the current implementation. indexOfSlice uses the KMP algorithm and is O(m + n), where m and n are the lengths of the two sequences.
Methods such as +:, :+ or patch are generally O(n) as well, unless you are dealing with a specific case of an immutable collection for which the operation in question is more efficient - for example, prepending an element to a functional List or appending an element to a Vector.
Methods toX are generally O(n), as they have to iterate over all the elements and create a new collection. An exception is toStream, which builds the collection lazily - so it's O(1). Also, whenever X is the type of the collection, toX just returns this, being O(1).
Iterator implementations should have O(1) (amortized) next and hasNext operations. Iterator creation should be worst-case O(log n), but O(1) in most cases.
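As a small illustration of the toStream point above (just a sketch): the conversion itself does no traversal, so it returns immediately, and elements are computed only when forced.

val s = (1 to 10000000).toStream // returns at once; nothing is evaluated yet
s.head                           // forces only the first element
s.take(5).toList                 // List(1, 2, 3, 4, 5); still only five elements forced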
Performance characteristics of the other methods are really difficult to assert. Consider the following:
These methods are all implemented based on foreach or iterator, usually at very high levels in the hierarchy. Vector's map is implemented on collection.TraversableLike, for example.
To add insult to injury, which method implementation is used depends on the linearization of the class inheritance. This also applies to any method called as a helper. It has happened before that changes here caused unforeseen performance problems.
Since foreach and iterator are both O(n), any improved performance depends on specialization of other methods, such as size and slice.
For many of them, there's further dependency on the performance characteristics of the builder that was provided, which depends on the call site instead of the definition site.
So the result is that the place where the method is defined -- and documented -- does not have nearly enough information to state its performance characteristics, and these may depend not only on how other methods are implemented by the inheriting collection, but even on the performance characteristics of an object, Builder, obtained from CanBuildFrom, that is passed at the call site.
At best, any such documentation would be described in terms of other methods. Which doesn't mean it isn't worthwhile, but it isn't easily done -- and hard tasks on open source projects depend on volunteers, who usually work at what they like, not what is needed.
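For a sense of what that looks like, here is a simplified sketch (not the actual library source) of the shape such an inherited map has; its cost is governed by foreach plus whatever Builder the implicit CanBuildFrom supplies, rather than by the class that declares it.

import scala.collection.TraversableLike
import scala.collection.generic.CanBuildFrom

def mapLike[A, B, Repr, That](xs: TraversableLike[A, Repr])(f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
  val b = bf(xs.repr)        // cost depends on the Builder chosen at the call site
  xs.foreach(x => b += f(x)) // and on the collection's foreach
  b.result()
}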

Reason for Scala's Map.unzip returning (Iterable, Iterable)

The other day I was wondering why scala.collection.Map defines its unzip method as
def unzip [A1, A2] (implicit asPair: ((A, B)) ⇒ (A1, A2)): (Iterable[A1], Iterable[A2])
Since the method returns "only" a pair of Iterable instead of a pair of Seq, it is not guaranteed that the key/value pairs in the original map occur at matching indices in the returned collections, as Iterable doesn't guarantee the order of traversal. So if I had a
Map((1,A), (2,B))
, then after calling
Map((1,A), (2,B)) unzip
I might end up with
... = (List(1, 2),List(A, B))
just as well as with
... = (List(2, 1),List(B, A))
While I can imagine storage-related reasons behind this (think of HashMaps, for example), I wonder what you guys think about this behavior. It might appear to users of the Map.unzip method that the items were returned in the same pair order (and I bet this is almost always the case), yet since there's no guarantee, this might in turn yield hard-to-find bugs in the library user's code.
Maybe that behavior should be expressed more explicitly in the accompanying scaladoc?
EDIT: Please note that I'm not referring to maps as ordered collections. I'm only interested in "matching" sequences after unzip, i.e. for
val (keys, values) = someMap.unzip
it holds for all i that (keys(i), values(i)) is an element of the original mapping.
Actually, the examples you gave will not occur. The Map will always be unzipped in a pair-wise fashion. Your statement that Iterable does not guarantee the ordering is not entirely true. It is more accurate to say that any given Iterable does not have to guarantee the ordering; this is dependent on the implementation. In the case of Map.unzip, the ordering of pairs is not guaranteed, but items in the pairs will not change the way they are matched up -- that matching is a fundamental property of the Map. You can read the source to GenericTraversableTemplate to verify this is the case.
If you expand unzip's description, you'll get the answer:
definition classes: GenericTraversableTemplate
In other words, it didn't get specialized for Map.
Your argument is sound, though, and I daresay you might get your wishes if you open an enhancement ticket with your reasoning. Especially if you go ahead and produce a patch as well -- if nothing else, at least you'll learn a lot more about Scala collections in doing so.
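For reference, a simplified sketch of what that implementation boils down to: a single pass feeding two builders, which is why the i-th key and the i-th value always come from the same original pair, whatever order the Map's iterator happens to produce.

def unzipSketch[A, B](m: Map[A, B]): (List[A], List[B]) = {
  val keys   = List.newBuilder[A]
  val values = List.newBuilder[B]
  for ((k, v) <- m) { keys += k; values += v } // one pass keeps the halves aligned
  (keys.result(), values.result())
}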
Maps, generally, do not have a natural sequence: they are unordered collections. The fact your keys happen to have a natural order does not change the general case.
(However I am at a loss to explain why Map has a zipWithIndex method. This provides a counter-argument to my point. I guess it is there for consistency with other collections and that, although it provides indices, they are not guaranteed to be the same on subsequent calls.)
If you use a LinkedHashMap or LinkedHashSet the iterators are supposed to return the pairs in the original order of insertion. With other HashMaps, yeah, you have no control. Retaining the original order of insertion is quite useful in UI contexts; it allows you to re-sort tables on any column in a Web application without changing types, for instance.
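A minimal sketch of the LinkedHashMap point: iteration follows insertion order, so the two unzipped halves line up index-for-index.

import scala.collection.mutable.LinkedHashMap

val m = LinkedHashMap(3 -> "c", 1 -> "a", 2 -> "b")
val (ks, vs) = m.unzip
// ks iterates as 3, 1, 2 and vs as "c", "a", "b", matching the insertion order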

What is the preferred way of using the parallel collections in Scala?

At first I assumed that every collection class would receive an additional par method which would convert the collection to a fitting parallel data structure (like map returns the best collection for the element type in Scala 2.8).
Now it seems that some collection classes support a par method (e. g. Array) but others have toParSeq, toParIterable methods (e. g. List). This is a bit weird, since Array isn't used or recommended that often.
What is the reason for that? Wouldn't it be better to just have a par available on all collection classes doing the "right thing"?
If I have data which might be processed in parallel, what types should I use? The traits in scala.collection or the type of the implementation directly?
Or should I prefer Arrays now, because they seem to be cheaper to parallelize?
Lists aren't that well suited for parallel processing. The reason is that to get to the end of the list, you have to walk through every single element. Thus, you may as well just treat the list as an iterator, and so use something more generic like toParIterable.
Any collection that has fast indexed access is a good candidate for parallel processing. This includes anything implementing IndexedSeqOptimized, plus trees and hash tables. Array has as fast an index as you can get, so it's a fairly natural choice. You can also use things like ArrayBuffer (which has a par method returning a ParArray).
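For what it's worth, in the parallel collections as they finally shipped (Scala 2.9 and later) every collection ended up with a par method, and the cost difference described above still applies; a minimal sketch:

val arr = Array.tabulate(1000)(i => i)
val p1  = arr.par.map(_ * 2) // ParArray: splitting the work is just index arithmetic

val lst = List.range(0, 1000)
val p2  = lst.par.map(_ * 2) // the List is first copied into a parallel sequence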