Reason for Scala's Map.unzip returning (Iterable, Iterable) - scala

The other day I was wondering why scala.collection.Map defines its unzip method as
def unzip [A1, A2] (implicit asPair: ((A, B)) ⇒ (A1, A2)): (Iterable[A1], Iterable[A2])
Since the method returns "only" a pair of Iterable instead of a pair of Seq, it is not guaranteed that the key/value pairs in the original map occur at matching indices in the returned sequences, because Iterable doesn't guarantee the order of traversal. So if I had a
Map((1,A), (2,B))
, then after calling
Map((1,A), (2,B)) unzip
I might end up with
... = (List(1, 2),List(A, B))
just as well as with
... = (List(2, 1),List(B, A))
While I can imagine storage-related reasons behind this (think of HashMaps, for example), I wonder what you guys think about this behavior. It might appear to users of the Map.unzip method that the items are returned in the same pair order (and I bet this is almost always the case), yet since there's no guarantee, this might in turn yield hard-to-find bugs in the library user's code.
Maybe that behavior should be expressed more explicitly in the accompanying scaladoc?
EDIT: Please note that I'm not referring to maps as ordered collections. I'm only interested in "matching" sequences after unzip, i.e. for
val (keys, values) = someMap.unzip
it holds for all i that (keys(i), values(i)) is an element of the original mapping.

Actually, the examples you gave will not occur. The Map will always be unzipped in a pair-wise fashion. Your statement that Iterable does not guarantee the ordering is not entirely true. It is more accurate to say that any given Iterable does not have to guarantee the ordering; whether it does depends on the implementation. In the case of Map.unzip, the ordering of pairs is not guaranteed, but the items in the pairs will not change the way they are matched up: that matching is a fundamental property of the Map. You can read the source of GenericTraversableTemplate to verify this is the case.
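One way to check the pairwise matching for a concrete map is to zip the two halves back together and compare with the original. This is only a minimal sketch: the object name UnzipCheck and the sample map are made up for illustration, and it checks one map rather than proving the general property.

```scala
object UnzipCheck {
  val m: Map[Int, String] = Map(1 -> "A", 2 -> "B", 3 -> "C")

  // Both halves are produced in the map's own traversal order,
  // so position i of `keys` and position i of `values` come from the same entry.
  val (keys, values) = m.unzip

  // Zipping the halves back together should reproduce the original map.
  val rezipped: Map[Int, String] = keys.zip(values).toMap
}
```

The order in which the entries come out is still up to the Map implementation; only the key/value pairing is what this checks.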

If you expand unzip's description, you'll get the answer:
definition classes: GenericTraversableTemplate
In other words, it didn't get specialized for Map.
Your argument is sound, though, and I daresay you might get your wish if you open an enhancement ticket with your reasoning. Especially if you go ahead and produce a patch as well: if nothing else, you'll learn a lot more about Scala collections in doing so.

Maps, generally, do not have a natural sequence: they are unordered collections. The fact your keys happen to have a natural order does not change the general case.
(However I am at a loss to explain why Map has a zipWithIndex method. This provides a counter-argument to my point. I guess it is there for consistency with other collections and that, although it provides indices, they are not guaranteed to be the same on subsequent calls.)

If you use a LinkedHashMap or LinkedHashSet, the iterators are supposed to return the pairs in the original order of insertion. Other HashMaps, yeah, you have no control. Retaining the original order of insertion is quite useful in UI contexts: it allows you to re-sort tables on any column in a Web application without changing types, for instance.
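To illustrate the insertion-order point, here is a small sketch using scala.collection.mutable.LinkedHashMap. The object name InsertionOrderDemo and the sample entries are assumptions for the example.

```scala
import scala.collection.mutable

object InsertionOrderDemo {
  // Entries are deliberately inserted out of alphabetical order.
  val table = mutable.LinkedHashMap("c" -> 3, "a" -> 1, "b" -> 2)

  // unzip follows the map's iteration order, which here is insertion order.
  val (keys, values) = table.unzip
}
```

With a plain HashMap in place of LinkedHashMap, the pairs would still match up, but the order of the keys would be whatever the hash layout happens to give.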

Related

What is the fastest way to join two iterables (or sequences) in Scala?

Say I want to create an Iterable by joining multiple Iterables. What would be the fastest way to do so?
Documentation for ListBuffer's ++= says that it
Appends all elements produced by a TraversableOnce to this list buffer.
which sounds like it appends the elements one by one and hence should take Ω(size of iterable to append) time. I want something that joins the two Iterables in O(1) time. Does the ::: method of List run in O(1) time? The documentation simply states that it
Adds the elements of a given list in front of this list.
I also checked out performance characteristics of scala collections but it says nothing about joining two collections.
You can create a Stream:
Stream(a, b, c).flatten
or concatenate iterators (this gives you Stream at runtime anyway)
(a.iterator ++ b.iterator ++ c.iterator).toIterable
(assuming that a, b and c are Iterable[Int])
That will give you something along the lines of O(number of collections to join).
Streams are evaluated on demand, so only a little memory is allocated (for closures) until you actually request elements. Converting to anything else is unavoidably O(n).
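The iterator-based variant can be wrapped in a small helper. This is a sketch only: LazyConcat and concat are hypothetical names, and it returns an Iterator rather than a Stream so that it does not depend on Stream, which is deprecated in favour of LazyList from Scala 2.13 on.

```scala
object LazyConcat {
  // Chains any number of Iterables without copying their elements up front.
  // Setting this up costs time proportional to the number of parts, not their sizes;
  // elements are only pulled from each part when the result is consumed.
  def concat[A](parts: Iterable[A]*): Iterator[A] =
    parts.iterator.flatMap(_.iterator)
}
```

For example, LazyConcat.concat(a, b, c) can be traversed once; calling toList (or any other conversion) on it is where the O(n) work happens.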

How to groupBy an iterator without converting it to list in scala?

Suppose I want to groupBy on an iterator; the compiler complains that "value groupBy is not a member of Iterator[Int]". One way would be to convert the iterator to a list, which I want to avoid. I want to do the groupBy such that the input is Iterator[A] and the output is Map[B, Iterator[A]], so that a part of the iterator is loaded only when that part is accessed, without loading the whole list into memory. I also know the possible set of keys, so I can say whether a particular key exists.
def groupBy[A, B](iter: Iterator[A], f: A => B): Map[B, Iterator[A]] = {
.........
}
One possibility is to convert the Iterator to a view and then groupBy:
iter.toTraversable.view.groupBy(_.whatever)
I don't think this is doable without storing results in memory (and in this case switching to a list would be much easier). Iterator implies that you can make only one pass over the whole collection.
For instance, let's say you have a sequence 1 2 3 4 5 6 and you want to groupBy odd and even numbers:
groupBy(it, v => v % 2 == 0)
Then you could query the result with either true or false to get an iterator. The problem is that if you loop over one of those two iterators to the end, you can't do the same for the other one (as you cannot reset an iterator in Scala).
This would be doable if the elements were sorted according to the same rule you're using in groupBy.
As said in other responses, the only way to achieve a lazy groupBy on Iterator is to internally buffer elements. The worst case for the memory will be in O(n). If you know in advance that the keys are well distributed in your iterator, the buffer can be a viable solution.
The solution is relatively complex, but a good starting point is a few methods from the Iterator trait in the Scala source code:
The partition method, which uses the buffered method to keep the head value in memory, plus one internal queue (lookahead) for each of the two produced iterators.
The span method, which also uses the buffered method, this time with a single queue for the leading iterator.
The duplicate method. Perhaps less interesting, but we can again observe another use of a queue to store the gap between the two produced iterators.
In the groupBy case, we will have a variable number of produced iterators instead of two in the above examples. If requested, I can try to write this method.
Note that you have to know the list of keys in advance. Otherwise, you will need to traverse (and buffer) the entire iterator to collect the different keys to build your Map.
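A buffering groupBy along those lines might look like the sketch below. It is not taken from the Scala library: LazyGroupBy and groupByKnownKeys are hypothetical names, the set of keys must be passed in up front, elements whose key is not in that set are silently dropped, and the code is not thread-safe. In the worst case one queue ends up holding almost the whole input.

```scala
import scala.collection.mutable

object LazyGroupBy {
  // Groups a single-pass iterator by key. Elements that belong to a key other
  // than the one currently being read are parked in that key's queue.
  def groupByKnownKeys[A, B](iter: Iterator[A], keys: Set[B])(f: A => B): Map[B, Iterator[A]] = {
    // One queue per known key holds elements seen but not yet consumed.
    val buffers: Map[B, mutable.Queue[A]] =
      keys.iterator.map(k => k -> mutable.Queue.empty[A]).toMap

    // Pulls from the source until an element for `key` is buffered or the source ends.
    def fill(key: B): Unit =
      while (buffers(key).isEmpty && iter.hasNext) {
        val a = iter.next()
        // Elements with an unknown key are dropped in this sketch.
        buffers.get(f(a)).foreach(_.enqueue(a))
      }

    keys.iterator.map { k =>
      val group: Iterator[A] = new Iterator[A] {
        def hasNext: Boolean = { fill(k); buffers(k).nonEmpty }
        def next(): A = { fill(k); buffers(k).dequeue() }
      }
      k -> group
    }.toMap
  }
}
```

With the odd/even example, groupByKnownKeys(it, Set(true, false))(_ % 2 == 0) gives two iterators that can be consumed in any order or interleaved; draining the true iterator first leaves the odd numbers sitting in the false queue.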

Scala's toList function appears to be slow

I was under the impression that calling seq.toList() on an immutable Seq would be making a new list which is sharing the structural state from the first list. We're finding that this could be really slow and I'm not sure why. It is just sharing the structural state, correct? I can't see why it'd be making an n-time copy of all the elements when it knows they'll never change.
A List in Scala is a particular data structure: instances of :: each containing a value, followed by Nil at the end of the chain.
If you toList a List, it will take O(1) time. If you toList on anything else then it must be converted into a List, which involves O(n) object allocations (all the :: instances).
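The difference can be observed with reference equality. A minimal sketch, with ToListCost as a made-up object name:

```scala
object ToListCost {
  val list: List[Int] = List(1, 2, 3)
  val vector: Vector[Int] = Vector(1, 2, 3)

  // List#toList hands back the receiver itself, so nothing is allocated.
  val sameInstance: Boolean = list.toList eq list

  // Vector#toList has to build a fresh chain of :: cells, one per element.
  val converted: List[Int] = vector.toList
}
```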
So you have to ask whether you really want a scala.collection.immutable.List. That's what toList gives you.
Sharing structural state is possible for particular operations on particular data structures.
With the List data structure in Scala, my understanding is that every element refers to the next, starting from the head through the tail, so a singly linked list.
From a structural state sharing perspective, consider the restrictions placed on this from the internal data structure perspective. Adding an element to the head of a List (X) effectively creates a new list (X') with the new element as the head of X' and the old list (X) as the tail. For this particular operation, internal state can be shared completely.
The same operation above can be applied to create a new List (X'), with the new element as the head of X' and any element from X as the tail, as long as you accept that the tail will be the element you choose from X, plus all additional elements it already has in its data structure.
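Both cases can be checked with eq, which compares references rather than contents. This is just an illustrative sketch; SharingDemo and the value names are invented for the example.

```scala
object SharingDemo {
  val x: List[Int] = List(2, 3, 4)

  // Prepending allocates one new cell whose tail is x itself.
  val x1: List[Int] = 1 :: x

  // Prepending onto a suffix of x shares that suffix in the same way.
  val y: List[Int] = 0 :: x.tail

  val sharesWhole: Boolean = x1.tail eq x
  val sharesSuffix: Boolean = y.tail eq x.tail
}
```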
When you think about it logically, each data structure has an internal structure that allows some operations to be performed with simple shared internal structure and other operations requiring more invasive and costly computations.
The key from my perspective here is having an understanding of the constraints placed on the operations by the internal data structure itself.
For example, consider the same operations above on a doubly linked list data structure and you will see that there are quite different restrictions.
Personally, I find drawing out an understanding of the internal structure can be helpful in understanding the consequences of particular operations.
In the case of the toList operation on an arbitrary sequence, with no knowledge of the sequence's internal data structure, one therefore has to assume O(n). List.toList has the obvious performance advantage of already being a list.

Why no immutable double linked list in Scala collections?

Looking at this question, where the questioner is interested in the first and last instances of some element in a List, it seems a more efficient solution would be to use a DoubleLinkedList that could search backwards from the end of the list. However there is only one implementation in the collections API and it's mutable.
Why is there no immutable version?
Because you would have to copy the whole list each time you want to make a change. With a normal linked list, you can at least prepend to the list without having to copy everything. And if you do want to copy everything on every change, you don't need a linked list for that. You can just use an immutable array.
There are many impediments to such a structure, but one is very pressing: a doubly linked list cannot be persistent.
The logic behind this is pretty simple: from any node on the list, you can reach any other node. So, if I added an element X to this list DL, and tried to use a part of DL, I'd face this contradiction: from the node pointing to X one can reach every element in part(DL), but, by the properties of the doubly linked list, that means from any element of part(DL) I can reach the node pointing to X. Since part(DL) is supposed to be immutable and part of DL, and since DL did not include the node pointing to X, that just cannot be.
Non-persistent immutable data structures might have some uses, but they are generally bad for most operations, since they need to be recreated whenever a derivative is produced.
Now, there's the minor matter of creating mutually referencing strict objects, but this is surmountable. One can use by-name parameters and lazy vals, or one can do as Scala's List does: actually create a mutable collection, and then "freeze" it in an immutable state (see ListBuffer and its toList method).
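The by-name-plus-lazy-val route could look roughly like this for a two-node list. It is a sketch under assumptions: LazyDoublyLinked, Node and pair are hypothetical names, and it only shows how the mutual references get tied, not a usable collection.

```scala
object LazyDoublyLinked {
  // The neighbours are passed by name and cached in lazy vals,
  // so neither node needs the other to exist at construction time.
  final class Node[A](val value: A, p: => Option[Node[A]], n: => Option[Node[A]]) {
    lazy val prev: Option[Node[A]] = p
    lazy val next: Option[Node[A]] = n
  }

  // Builds first <-> second and returns the first node.
  def pair[A](a: A, b: A): Node[A] = {
    lazy val first: Node[A] = new Node[A](a, None, Some(second))
    lazy val second: Node[A] = new Node[A](b, Some(first), None)
    first
  }
}
```

Note that this does nothing about the persistence problem described above: adding a node still means rebuilding every node of the list.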
Because it is logically impossible to create a mutually (circular) referential data-structure with strict immutability.
You cannot create two nodes that point to each other due to simple existential ordering priority, in that at least one of the nodes will not exist when the other is created.
It is possible to get this circularity with tricks involving laziness (which is implemented with mutation), but the real question then becomes why you would want this thing in the first place?
As others have noted, there is no persistent implementation of a double-linked list. You will need some kind of tree to get close to the characteristics you want.
In particular, you may want to look at finger trees, which provide O(1) access to the front and back, amortized O(1) insertion to the front and back, and O(log n) insertion elsewhere. (That's in contrast to most other commonly-used trees which have O(log n) access and insertion everywhere.)
See also:
video explanation of finger trees (by the implementor of finger trees in clojure.contrib)
finger tree implementation in Scala (I haven't used it personally, but it's the top google hit)
As a supplement to the answer of @KimStebel, I'd like to add:
If you are searching for a data structure suitable for the question that motivated you to ask this one, then you might have a look at Extreme Cleverness: Functional Data Structures in Scala by @DanielSpiewak.

Scala SeqLike distinct preserves order?

The apidoc of distinct in SeqLike says:
Builds a new sequence from this sequence without any duplicate elements.
Returns: A new sequence which contains the first occurrence of every element of this sequence.
Am I correct that no ordering guarantee is provided? More generally, do methods of SeqLike provide any process-in-order (and return-in-order) guarantee?
On the contrary: operations on Seqs guarantee the output order (unless the API says otherwise). This is one of the basic properties of sequences, where the order matters, versus sets, where only containment matters.
It depends on the collection you were using in the first place. If you had a list you'll get your order. If on the other hand you had a set, then probably not.
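A small sketch of the difference, with DistinctOrder as an invented object name and an arbitrary sample list:

```scala
object DistinctOrder {
  val xs: List[Int] = List(3, 1, 3, 2, 1)

  // On a Seq, distinct keeps the first occurrence of each element, in the original order.
  val deduped: List[Int] = xs.distinct

  // A Set holds the same elements but makes no promise about iteration order.
  val asSet: Set[Int] = xs.toSet
}
```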