Scala SeqLike distinct preserves order?

The apidoc of distinct in SeqLike says:
Builds a new sequence from this sequence without any duplicate elements.
Returns: A new sequence which contains the first occurrence of every element of this sequence.
Am I correct that no ordering guarantee is provided? More generally, do methods of SeqLike provide any process-in-order (and return-in-order) guarantee?

On the contrary: operations on Seqs guarantee the output order (unless the API says otherwise). This is one of the basic properties of sequences, where the order matters, versus sets, where only containment matters.

It depends on the collection you start with. If you have a List, you'll keep your order. If, on the other hand, you have a Set, then probably not.
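For illustration, a minimal sketch (the values are made up):

// distinct keeps the first occurrence of each element, in encounter order:
val xs = List(3, 1, 3, 2, 1)
assert(xs.distinct == List(3, 1, 2))

// A default (hash-based) Set, by contrast, iterates in an
// implementation-defined order, so there is no order to preserve:
Set(3, 1, 2).foreach(println)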

Related

What's the difference between -- and &~ in Scala

The documentation for -- says: "Creates a new set from this set by removing all elements of another collection."
The documentation for &~ says: "The difference of this set and another set."
Using either of the two symbols, I get the same result.
So is there any difference between these two symbols? For example, do they differ in time complexity or memory usage?
According to the Scaladoc:
-- accepts any kind of IterableOnce, whereas &~ only accepts other Sets.
Arguably, if you have two sets, you should prefer &~ since it is probably optimized.
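A small sketch (with made-up sets) showing both give the same result, while -- also takes any IterableOnce:

val a = Set(1, 2, 3)
val b = Set(2, 3, 4)

// Both compute the set difference:
assert((a -- b) == Set(1))
assert((a &~ b) == Set(1))

// -- additionally accepts any IterableOnce, not just a Set:
assert((a -- List(2, 3)) == Set(1))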

What are Builder, Combiner, and Splitter in scala?

In the parallel programming course from EPFL, four abstractions for data parallelism are mentioned: Iterator, Builder, Combiner, and Splitter.
I am familiar with Iterator, but have never used the other three. I have seen the traits Builder, Combiner, and Splitter under the scala.collection package. However, I have no idea how to use them in real-world development, particularly how to use them in collaboration with other collections like List, Array, ParArray, etc. Could anyone please give me some guidance and examples?
Thanks!
The two traits Iterator and Builder are not specific to parallelism, however, they provide the basis for Combiner and Splitter.
You already know that an Iterator can help you with iterating over sequential collections by providing the methods hasNext and next. A Splitter is a special case of an Iterator and helps to partition a collection into multiple disjoint subsets. The idea is that after the splitting, these subsets can be processed in parallel. You can obtain a Splitter from a parallel collection by invoking .splitter on it. The two important methods of the Splitter trait are as follows:
remaining: Int: returns the number of elements of the current collection, or at least an approximation of that number. This information is important, since it is used to decide whether it's worth splitting the collection at all. If your collection contains only a small number of elements, then you want to process them sequentially instead of splitting the collection into even smaller subsets.
split: Seq[Splitter[A]]: the method that actually splits the current collection. It returns the disjoint subsets (represented as Splitters), which can recursively be split again if it's worth it. Once the subsets are small enough, they can finally be processed (e.g. filtered or mapped).
Builders are used internally to create new (sequential) collections. A Combiner is a special case of a Builder and at the same time the counterpart of Splitter. While a Splitter splits your collection before it is processed in parallel, a Combiner puts the results back together afterwards. You can obtain a Combiner from a parallel collection (subset) by invoking .newCombiner on it. The combining itself is done via the following method:
combine(that: Combiner[A, B]): Combiner[A, B]: combines your current collection with another collection by "merging" both Combiners. The result is a new Combiner, which either represents the final result or gets combined again with another subset (by the way: the type parameters A and B represent the element type and the type of the resulting collection).
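To make the contracts concrete, here is a heavily simplified sketch of both traits plus a toy divide-and-combine map. This is not the real scala.collection.parallel API -- the trait shapes are reduced, and the recursion below runs sequentially where a real implementation would schedule the subsets as parallel tasks:

// Simplified stand-ins for the real traits (illustrative only).
trait Splitter[A] extends Iterator[A] {
  def remaining: Int          // (approximate) number of elements left
  def split: Seq[Splitter[A]] // disjoint subsets of this collection
}

trait Combiner[A, Repr] {
  def +=(elem: A): this.type  // add one result element
  def combine(that: Combiner[A, Repr]): Combiner[A, Repr]
  def result(): Repr
}

// A Splitter over a Vector: splitting just halves the remaining elements.
final class VectorSplitter[A](v: Vector[A]) extends Splitter[A] {
  private var i = 0
  def hasNext: Boolean = i < v.length
  def next(): A = { val a = v(i); i += 1; a }
  def remaining: Int = v.length - i
  def split: Seq[Splitter[A]] = {
    val (l, r) = v.drop(i).splitAt(remaining / 2)
    Seq(new VectorSplitter(l), new VectorSplitter(r))
  }
}

// A Combiner that accumulates into a Vector; combine concatenates buffers.
// (A real Combiner would handle other Combiner types; this sketch assumes
// it is only ever combined with another VectorCombiner.)
final class VectorCombiner[A] extends Combiner[A, Vector[A]] {
  private var buf = Vector.empty[A]
  def +=(elem: A): this.type = { buf = buf :+ elem; this }
  def combine(that: Combiner[A, Vector[A]]): Combiner[A, Vector[A]] =
    that match {
      case vc: VectorCombiner[A] =>
        val c = new VectorCombiner[A]; c.buf = buf ++ vc.buf; c
    }
  def result(): Vector[A] = buf
}

// Split while it's worth it, process small leaves sequentially,
// then combine the partial results back up.
def parMap[A, B](s: Splitter[A])(f: A => B): Vector[B] = {
  def go(sp: Splitter[A]): Combiner[B, Vector[B]] =
    if (sp.remaining <= 2) {
      val c = new VectorCombiner[B]
      sp.foreach(a => c += f(a))
      c
    } else sp.split.map(go).reduce(_ combine _)
  go(s).result()
}

// parMap(new VectorSplitter(Vector(1, 2, 3, 4, 5)))(_ * 10)
// => Vector(10, 20, 30, 40, 50)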
The thing is that you don't need to implement or even use these methods directly unless you are defining a new parallel collection. The idea is that people implementing new parallel collections only need to define splitters and combiners and get a whole bunch of other operations for free, because those operations are already implemented in terms of splitters and combiners.
Of course this is only a superficial description of how those things work. For further reading, I recommend reading Architecture of the Parallel Collections Library as well as Creating Custom Parallel Collections.

Partially sorting collections in Scala

I am trying to sort a collection of linked-list nodes. The collection contains nodes from more than one linked list; ordering must be maintained within each list, but ordering across lists does not matter.
PartialOrdering[T] seems like the natural choice, but I cannot find any standard functions within Scala that support it (e.g. .sort only takes Ordering[T]).
I've considered wrapping the former type into the latter, but realised this would actually produce erroneous results: an Ordering that, for example, reports incomparable elements as equal breaks the transitivity that comparison sorts rely on. Partial ordering cannot be abstracted away like this, as the underlying sort algorithm needs the additional information to produce correct results.
I would like to represent the elements as a SortedSet - is anyone aware of anything that can get me close?
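One workaround, if each node knows which list it came from and its position within it (hypothetical fields below, not given in the question), is to extend the partial order to a total one and use an ordinary sort; any such extension satisfies the requirement, since cross-list order doesn't matter:

// Hypothetical node shape: listId and index are assumptions for illustration.
case class Node(listId: Int, index: Int, payload: String)

val nodes = List(
  Node(1, 1, "b"), Node(0, 0, "x"),
  Node(1, 0, "a"), Node(1, 2, "c")
)

// Sorting by (listId, index) is one total order consistent with the
// per-list partial order; nodes from different lists may interleave
// in any way without violating the requirement.
val sorted = nodes.sortBy(n => (n.listId, n.index))
// => List(Node(0,0,x), Node(1,0,a), Node(1,1,b), Node(1,2,c))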

With parallel collection, does aggregate respect order?

In Scala, I have a parallel Iterable of items and I want to iterate over them and aggregate the results in some way, but in order. I'll simplify my use case and say that we start with an Iterable of integers and want to concatenate their string representations in parallel, with the result in order.
Is this possible with either fold or aggregate? It's unclear from the documentation which methods work in parallel yet maintain order.
Yes, order is guaranteed to be preserved for fold/aggregate/reduce operations on parallel collections. This is not very well documented. The trick is that the operation you wish to fold over must be associative (and thus capable of being arbitrarily split up and recombined), but need not be commutative (and so not capable of being safely reordered). String concatenation is a perfect example of an associative, non-commutative operation, so the fold can be done in parallel.
val concat = myParallelList.map(_.toString).reduce(_ + _)
For folds: foldRight and foldLeft cannot be processed in parallel; you'll need to use the newer fold method instead (see its Scaladoc for details).
Like fold, aggregate can do its work in parallel: it “traverses the elements in different partitions sequentially” (Scaladoc), though it looks like you have no direct influence on how the partitions are chosen.
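For illustration, a minimal sketch (on Scala 2.13+ this needs the separate scala-parallel-collections module; on older versions .par is built in):

import scala.collection.parallel.CollectionConverters._

val nums = (1 to 9).toList.par

// fold: the zero element and the operation must form an associative pair.
val concat = nums.map(_.toString).fold("")(_ + _)

// aggregate: seqop folds within each partition, combop joins the
// per-partition results left-to-right, so order is preserved.
val concat2 = nums.aggregate("")((acc, n) => acc + n.toString, _ + _)

assert(concat == "123456789" && concat2 == "123456789")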
I THINK the preservation of "order", in the sense of the comment on Jean-Philippe Pellet's answer, is guaranteed by the way parallel collections are implemented, according to a publication by Odersky et al. (http://infoscience.epfl.ch/record/150220/files/pc.pdf), if and only if the component that splits your collection behaves well with respect to order.
I.e. if you have elements a < b < c, and a and c end up in one partition, it follows that b is in the same partition as well.
I don't remember which part exactly is responsible for the splitting, but if you find it, you might find sufficient information in its documentation or source code to answer your question.

Reason for Scala's Map.unzip returning (Iterable, Iterable)

The other day I was wondering why scala.collection.Map defines its unzip method as
def unzip [A1, A2] (implicit asPair: ((A, B)) ⇒ (A1, A2)): (Iterable[A1], Iterable[A2])
Since the method returns "only" a pair of Iterables instead of a pair of Seqs, it is not guaranteed that the key/value pairs of the original map occur at matching indices in the returned collections, since Iterable doesn't guarantee the order of traversal. So if I had a
Map((1,A), (2,B))
, then after calling
Map((1,A), (2,B)) unzip
I might end up with
... = (List(1, 2),List(A, B))
just as well as with
... = (List(2, 1),List(B, A))
While I can imagine storage-related reasons behind this (think of HashMaps, for example), I wonder what you guys think about this behavior. It might appear to users of the Map.unzip method that the items are returned in matching pair order (and I bet this is almost always the case), yet since there's no guarantee, this might in turn yield hard-to-find bugs in the library user's code.
Maybe that behavior should be expressed more explicitly in the accompanying scaladoc?
EDIT: Please note that I'm not referring to maps as ordered collections. I'm only interested in "matching" sequences after unzip, i.e. for
val (keys, values) = someMap.unzip
it holds for all i that (keys(i), values(i)) is an element of the original mapping.
Actually, the examples you gave will not occur. The Map will always be unzipped in a pair-wise fashion. Your statement that Iterable does not guarantee the ordering is not entirely true. It is more accurate to say that any given Iterable does not have to guarantee the ordering; whether it does depends on the implementation. In the case of Map.unzip, the ordering of the pairs is not guaranteed, but the items in the pairs will not change the way they are matched up -- that matching is a fundamental property of the Map. You can read the source of GenericTraversableTemplate to verify this is the case.
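A small demonstration of that pairing property (a sketch; the map values are made up):

val m = Map(1 -> "A", 2 -> "B", 3 -> "C")
val (keys, values) = m.unzip

// Both collections come from the same single traversal, so zipping
// them back together recovers exactly the original entries, even if
// the traversal order itself is unspecified:
assert(keys.zip(values).toSet == m.toSet)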
If you expand unzip's description, you'll get the answer:
definition classes: GenericTraversableTemplate
In other words, it didn't get specialized for Map.
Your argument is sound, though, and I daresay you might get your wish if you open an enhancement ticket with your reasoning. Especially if you go ahead and produce a patch as well -- if nothing else, at least you'll learn a lot more about Scala collections in doing so.
Maps, generally, do not have a natural sequence: they are unordered collections. The fact your keys happen to have a natural order does not change the general case.
(However I am at a loss to explain why Map has a zipWithIndex method. This provides a counter-argument to my point. I guess it is there for consistency with other collections and that, although it provides indices, they are not guaranteed to be the same on subsequent calls.)
If you use a LinkedHashMap or LinkedHashSet, the iterators are supposed to return the pairs in the original order of insertion. With other HashMaps, yeah, you have no control. Retaining the original insertion order is quite useful in UI contexts; it allows you to re-sort tables on any column in a web application without changing types, for instance.
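For example (a sketch using scala.collection.mutable.LinkedHashMap):

import scala.collection.mutable.LinkedHashMap

val m = LinkedHashMap(3 -> "c", 1 -> "a", 2 -> "b")

// Iteration (and hence unzip) follows insertion order:
assert(m.keys.toList == List(3, 1, 2))
val (keys, values) = m.unzip
assert(keys.toList == List(3, 1, 2) && values.toList == List("c", "a", "b"))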