How to groupBy an iterator without converting it to a list in Scala?

Suppose I want to groupBy on an iterator; the compiler complains that "value groupBy is not a member of Iterator[Int]". One way would be to convert the iterator to a list, which I want to avoid. I want to do the groupBy such that the input is Iterator[A] and the output is Map[B, Iterator[A]], so that part of the iterator is loaded only when those elements are accessed, rather than loading the whole list into memory. I also know the possible set of keys, so I can tell whether a particular key exists.
def groupBy[A, B](iter: Iterator[A], f: A => B): Map[B, Iterator[A]] = {
.........
}

One possibility is to convert the Iterator to a view and then call groupBy, as in:
iter.toTraversable.view.groupBy(_.whatever)

I don't think this is doable without storing results in memory (and in that case switching to a list would be much easier). An Iterator implies that you can make only one pass over the whole collection.
For instance, let's say you have the sequence 1 2 3 4 5 6 and you want to groupBy odd and even numbers:
groupBy(it, v => v % 2 == 0)
You could then query the result with either true or false to get an iterator. The problem is that if you loop over one of those two iterators to the end, you can't do the same for the other one (you cannot reset an iterator in Scala).
This would be doable if the elements were sorted according to the same rule you're using in groupBy.
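At the cost of storing elements in memory, the standard library's Iterator.partition shows what such buffering looks like: it hands back two iterators and internally queues elements belonging to one side while you consume the other, which is exactly why both can be drained:

val (evens, odds) = Iterator(1, 2, 3, 4, 5, 6).partition(_ % 2 == 0)
println(evens.toList) // List(2, 4, 6): odd elements met along the way were buffered
println(odds.toList)  // List(1, 3, 5): served from that internal buffer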

As said in other responses, the only way to achieve a lazy groupBy on an Iterator is to buffer elements internally. The worst-case memory usage is O(n). If you know in advance that the keys are well distributed in your iterator, buffering can be a viable solution.
The solution is relatively complex, but some methods from the Iterator trait in the Scala source code are a good start:
The partition method, which uses the buffered method to keep the head value in memory, plus two internal queues (lookahead), one for each of the produced iterators.
The span method, which also uses buffered, this time with a single queue for the leading iterator.
The duplicate method. Perhaps less interesting, but it is another example of a queue used to store the gap between the two produced iterators.
In the groupBy case, we would have a variable number of produced iterators instead of the two in the examples above; a sketch of this method is given below.
Note that you have to know the set of keys in advance; otherwise, you would need to traverse (and buffer) the entire iterator just to collect the different keys to build your Map.
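Here is a minimal sketch of such a buffered groupBy, assuming the set of keys is known in advance (lazyGroupBy and its internals are illustrative names, not library code):

import scala.collection.mutable

// Each per-key iterator pulls from the shared source on demand; elements
// that belong to other keys are parked in that key's queue. Worst-case
// memory is O(n), as noted above. Assumes f never returns a key outside `keys`.
def lazyGroupBy[A, B](iter: Iterator[A], f: A => B, keys: Set[B]): Map[B, Iterator[A]] = {
  val buffers: Map[B, mutable.Queue[A]] =
    keys.iterator.map(k => k -> mutable.Queue.empty[A]).toMap

  def iteratorFor(key: B): Iterator[A] = new Iterator[A] {
    // Pull from the source until an element for `key` is available
    // (or the source is exhausted); park everything else.
    private def fill(): Boolean = {
      while (buffers(key).isEmpty && iter.hasNext) {
        val a = iter.next()
        buffers(f(a)) += a
      }
      buffers(key).nonEmpty
    }
    def hasNext: Boolean = fill()
    def next(): A = if (fill()) buffers(key).dequeue() else Iterator.empty.next()
  }

  keys.iterator.map(k => k -> iteratorFor(k)).toMap
}

// Example: grouping by parity, per the odd/even scenario above.
val grouped = lazyGroupBy(Iterator(1, 2, 3, 4, 5, 6), (v: Int) => v % 2 == 0, Set(true, false))
println(grouped(true).toList)  // List(2, 4, 6)
println(grouped(false).toList) // List(1, 3, 5), still available thanks to buffering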

Related

Is there any benefit of working with an Iterator over a List

Is there any benefit of manipulating an Iterator over a List?
I need to know if concatenating 2 iterators is better than concatenating 2 Lists.
In essence, what is the fundamental difference between working with an iterator and working with the actual collection?
An Iterator isn't an actual data structure, although it behaves similarly to one. It is just a traversal pointer into some actual data structure. Thus, unlike an actual data structure, an Iterator can't "go back," that is, access old elements. Once you've gone through an Iterator, you're done.
What's cool about Iterator is that you can apply map, filter, and other transformations to it, and instead of actually modifying any existing data structure, it will apply the transformation the next time you ask for an element.
"Concatenating" two Iterators creates a new Iterator that wraps both of them.
On the other hand, Lists are actual collections and can be re-traversed.
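For instance, the laziness is easy to observe in the REPL (throwaway values):

val it = Iterator(1, 2, 3).map { x => println(s"mapping $x"); x * 10 }
// Nothing has been printed yet: the transformation runs only on demand.
println(it.next())                         // prints "mapping 1", then 10

val joined = Iterator(4, 5) ++ Iterator(6) // "concatenation": a new Iterator wrapping both
println(joined.toList)                     // List(4, 5, 6)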

What is the fastest way to join two iterables (or sequences) in Scala?

Say I want to create an Iterable by joining multiple Iterables. What would be the fastest way to do so?
Documentation for ListBuffer's ++= says that it
Appends all elements produced by a TraversableOnce to this list buffer.
which sounds like it appends the elements one by one and hence should take Ω(size of the iterable to append) time. I want something that joins the two Iterables in O(1) time. Does the ::: method of List run in O(1) time? The documentation simply states that it
Adds the elements of a given list in front of this list.
I also checked the performance characteristics of Scala collections, but it says nothing about joining two collections.
You can create a Stream:
Stream(a, b, c).flatten
or concatenate iterators (this gives you a Stream at runtime anyway):
(a.iterator ++ b.iterator ++ c.iterator).toIterable
(assuming that a, b and c are Iterable[Int])
That will give you something along the lines of O(number of collections to join)
Streams are evaluated on demand, so only a little memory will be allocated for closures until you actually request elements. Converting to anything else is unavoidably O(n).
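A small check of that on-demand behavior (a and b here are throwaway examples):

val a = Iterable(1, 2)
val b = Iterable(3, 4)
val ai = a.iterator.map { x => println(s"pulling $x"); x }
val joined = ai ++ b.iterator // O(1): wraps the two iterators, copies nothing
println(joined.next())        // only now is "pulling 1" printed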

What are Builder, Combiner, and Splitter in scala?

In the parallel programming course from EPFL, four abstractions for data parallelism are mentioned: Iterator, Builder, Combiner, and Splitter.
I am familiar with Iterator, but have never used the other three. I have seen the traits Builder, Combiner, and Splitter under the scala.collection package. However, I have no idea how to use them in real-world development, particularly in collaboration with other collections like List, Array, ParArray, etc. Could anyone please give me some guidance and examples?
Thanks!
The two traits Iterator and Builder are not specific to parallelism; however, they provide the basis for Combiner and Splitter.
You already know that an Iterator can help you with iterating over sequential collections by providing the methods hasNext and next. A Splitter is a special case of an Iterator and helps to partition a collection into multiple disjoint subsets. The idea is that after the splitting, these subsets can be processed in parallel. You can obtain a Splitter from a parallel collection by invoking .splitter on it. The two important methods of the Splitter trait are as follows:
remaining: Int: returns the number of elements of the current collection, or at least an approximation of that number. This information is important, since it is used to decide whether or not it's worth it to split the collection. If your collection contains only a small amount of elements, then you want to process these elements sequentially instead of splitting the collection into even smaller subsets.
split: Seq[Splitter[A]]: the method that actually splits the current collection. It returns the disjoint subsets (represented as Splitters), which can recursively be split again if that is worth it. If the subsets are small enough, they can finally be processed (e.g. filtered or mapped).
Builders are used internally to create new (sequential) collections. A Combiner is a special case of a Builder and at the same time the counterpart of a Splitter. While a Splitter splits your collection before it is processed in parallel, a Combiner puts the results back together afterwards. You can obtain a Combiner from a parallel collection (subset) by invoking .newCombiner on it. Combining happens via the following method:
combine(that: Combiner[A, B]): Combiner[A, B]: combines your current collection with another collection by "merging" both Combiners. The result is a new Combiner, which either represents the final result or gets combined again with another subset. (By the way, the type parameters A and B represent the element type and the type of the resulting collection.)
The thing is that you don't need to implement or even use these methods directly if you don't define a new parallel collection. The idea is that people implementing new parallel collections only need to define splitters and combiners and get a whole bunch of other operations for free, because those operations are already implemented and make use of splitters and combiners.
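To make the division of labor concrete, here is a toy, sequential model of the two abstractions. The simplified traits and the VectorSplitter/VectorCombiner/parMap names are illustrative, not the real scala.collection.parallel API:

// Simplified signatures, for illustration only.
trait Splitter[A] extends Iterator[A] {
  def remaining: Int
  def split: Seq[Splitter[A]]
}

trait Combiner[A, Repr] {
  def +=(elem: A): this.type
  def combine(that: Combiner[A, Repr]): Combiner[A, Repr]
  def result(): Repr
}

class VectorSplitter[A](v: Vector[A]) extends Splitter[A] {
  private val it = v.iterator
  def hasNext: Boolean = it.hasNext
  def next(): A = it.next()
  def remaining: Int = v.size
  def split: Seq[Splitter[A]] = {
    val (left, right) = v.splitAt(v.size / 2)
    Seq(new VectorSplitter(left), new VectorSplitter(right))
  }
}

class VectorCombiner[A](init: Vector[A] = Vector.empty[A]) extends Combiner[A, Vector[A]] {
  private var buf = init
  def +=(elem: A): this.type = { buf = buf :+ elem; this }
  def combine(that: Combiner[A, Vector[A]]): Combiner[A, Vector[A]] =
    new VectorCombiner(buf ++ that.result())
  def result(): Vector[A] = buf
}

// Split while the chunk is large, process small chunks sequentially,
// then merge partial results with `combine`. In the real library the
// recursive calls would be scheduled in parallel.
def parMap[A, B](s: Splitter[A], f: A => B, threshold: Int = 2): Combiner[B, Vector[B]] =
  if (s.remaining <= threshold) {
    val c = new VectorCombiner[B]()
    s.foreach(a => c += f(a))
    c
  } else s.split.map(parMap(_, f, threshold)).reduce(_ combine _)

println(parMap(new VectorSplitter(Vector(1, 2, 3, 4, 5)), (_: Int) * 10).result())
// Vector(10, 20, 30, 40, 50)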
Of course this is only a superficial description of how those things work. For further reading, I recommend reading Architecture of the Parallel Collections Library as well as Creating Custom Parallel Collections.

Converting a large sequence to a sequence with no repeats in scala really fast

So I have this large sequence, with a lot of repeats, and I need to convert it into a sequence with no repeats. What I have been doing so far is converting the sequence to a set and then back to a sequence; the conversion to a set gets rid of the duplicates. However, this is very slow. I'm given to understand that when converting to a set, every pair of elements is compared, which makes the complexity O(n^2), and that is not acceptable. Since I have access to a computer with thousands of cores (through my university), I was wondering whether making things parallel would help.
Initially I thought I'd use Scala Futures to parallelize the code in the following manner: group the elements of the sequence into smaller subgroups by their hash code. That way, I have subcollections of the original sequence such that no element appears in two different subcollections and every element is covered. Now I convert these smaller subcollections to sets, then back to sequences, and concatenate them. This way I'm guaranteed to get a sequence with no repeats.
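In code, that plan would look roughly like this (parDistinct, the bucket count, and the use of the global execution context are all illustrative choices, not a library API):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Bucket by hash code so equal elements share a bucket, deduplicate each
// bucket in its own Future, then concatenate. Note: the global order of
// elements is not preserved.
def parDistinct[A](xs: Seq[A], buckets: Int = 8): Seq[A] = {
  val groups = xs.groupBy(x => (x.## % buckets + buckets) % buckets).values.toSeq
  val deduped = Future.traverse(groups)(g => Future(g.distinct))
  Await.result(deduped, Duration.Inf).flatten
}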
But I was wondering if applying the toSet method on a parallel sequence already does this. I thought I'd test this in the Scala interpreter, but I got roughly the same time for the conversion to a parallel set as for the conversion to a non-parallel set.
I was hoping someone could tell me whether conversion to parallel sets works this way or not. I'd be much obliged. Thanks.
EDIT: Is performing toSet on a parallel collection faster than performing toSet on a non-parallel collection?
.distinct on some of the Scala collection types is O(n) (as of Scala 2.11). It uses a hash set to record what has already been seen, and with it builds up the result linearly:
def distinct: Repr = {
  val b = newBuilder
  val seen = mutable.HashSet[A]()
  for (x <- this) {
    if (!seen(x)) {
      b += x
      seen += x
    }
  }
  b.result()
}
(newBuilder is like a mutable list.)
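So on a recent Scala version, the direct route is simply (values here are illustrative):

val xs = Seq(3, 1, 3, 2, 1, 2)
println(xs.distinct) // List(3, 1, 2): first occurrences kept, O(n) thanks to hashing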
Just thinking outside the box: would it be possible to prevent the creation of these duplicates in the first place, instead of trying to get rid of them afterwards?

Reason for Scala's Map.unzip returning (Iterable, Iterable)

The other day I was wondering why scala.collection.Map defines its unzip method as
def unzip [A1, A2] (implicit asPair: ((A, B)) ⇒ (A1, A2)): (Iterable[A1], Iterable[A2])
Since the method returns "only" a pair of Iterables instead of a pair of Seqs, it is not guaranteed that the key/value pairs of the original map occur at matching indices in the returned sequences, since Iterable doesn't guarantee the order of traversal. So if I had a
Map((1,A), (2,B))
, then after calling
Map((1,A), (2,B)) unzip
I might end up with
... = (List(1, 2),List(A, B))
just as well as with
... = (List(2, 1),List(B, A))
While I can imagine storage-related reasons behind this (think of HashMaps, for example), I wonder what you think about this behavior. It might appear to users of the Map.unzip method that the items are returned in the same pair order (and I bet this is almost always the case), yet since there is no guarantee, this might yield hard-to-find bugs in the library user's code.
Maybe that behavior should be expressed more explicitly in the accompanying scaladoc?
EDIT: Please note that I'm not referring to maps as ordered collections. I'm only interested in "matching" sequences after unzip, i.e. for
val (keys, values) = someMap.unzip
it holds for all i that (keys(i), values(i)) is an element of the original mapping.
Actually, the examples you gave will not occur. The Map will always be unzipped in a pair-wise fashion. Your statement that Iterable does not guarantee the ordering is not entirely true. It is more accurate to say that any given Iterable does not have to guarantee the ordering, but whether it does depends on the implementation. In the case of Map.unzip, the ordering of pairs is not guaranteed, but the items in the pairs will not change the way they are matched up -- that matching is a fundamental property of the Map. You can read the source of GenericTraversableTemplate to verify this is the case.
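You can check the pairwise matching directly (an illustrative map):

val m = Map(1 -> "A", 2 -> "B", 3 -> "C")
val (keys, values) = m.unzip
// Traversal order is unspecified, but positions stay matched up:
println(keys.zip(values).toSet == m.toSet) // true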
If you expand unzip's description, you'll get the answer:
definition classes: GenericTraversableTemplate
In other words, it didn't get specialized for Map.
Your argument is sound, though, and I daresay you might get your wish if you open an enhancement ticket with your reasoning. Especially if you go ahead and produce a patch as well -- if nothing else, you'll learn a lot more about Scala collections in doing so.
Maps, generally, do not have a natural sequence: they are unordered collections. The fact your keys happen to have a natural order does not change the general case.
(However, I am at a loss to explain why Map has a zipWithIndex method. This provides a counter-argument to my point. I guess it is there for consistency with other collections, and although it provides indices, they are not guaranteed to be the same on subsequent calls.)
If you use a LinkedHashMap or LinkedHashSet, the iterators are supposed to return the pairs in the original order of insertion. With other HashMaps, yeah, you have no control. Retaining the original insertion order is quite useful in UI contexts; it allows you to re-sort tables on any column in a web application without changing types, for instance.
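For example, with a mutable LinkedHashMap (so insertion order is retained):

import scala.collection.mutable.LinkedHashMap
val m = LinkedHashMap(3 -> "c", 1 -> "a", 2 -> "b")
val (keys, values) = m.unzip
println(keys.toList)   // List(3, 1, 2): insertion order preserved
println(values.toList) // List(c, a, b), in matching order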