What are the drawbacks of a group_by iterator adaptor - group-by

Imagine an iterator over (key,value) pairs where the key isn't unique. What downsides would there be to creating an iterator adapter that splits the iterator into many "streams" which any subsequent adapters would operate on?
I'm imagining something where the following pseudocode would return the 10th element for each key:
let res: HashMap<key_type, value_type> =
my_input_stream.iter()
.group_by(|(key, value)| key)
.skip(9)
.take(1)
.collect();
I'm aware of itertools' group_by, but it only works for consecutive groups. I am talking about something that would work when the groups are nonconsecutive as well. I think my idea could run into a lot of problems with the design of Rust's iterators.
To be more concrete, the two implementations that I am considering are:
an iterator that produces many iterators, somewhat akin to unzip.
This has the problem that you probably couldn't chain any more
adapters after that point. (similar to: Splitting Iterator<(A,B)> into Iterator<A> and Iterator<B>)
an iterator that you call next with a group key argument.
That would potentially have to buffer a lot of data and breaks the
signature of next.

Related

Convert tuple to array in Scala

What is the best way to convert a tuple into an array in Scala? Here "best" means in as few lines of code as possible. I was shocked to search Google and StackOverflow only to find nothing on this topic, which seems like it should be trivial and common. Lists have a a toArray function; why don't tuples?
Use productIterator, immediately followed by toArray:
(42, 3.14, "hello", true).productIterator.toArray
gives:
res0: Array[Any] = Array(42, 3.14, hello, true)
The type of the result shows the main reason why it's rarely used: in tuples, the types of the elements can be heterogeneous, in arrays they must be homogeneous, so that often too much type information is lost during this conversion. If you want to do this, then you probably shouldn't have stored your information in tuples in the first place.
There is simply almost nothing you can (safely) do with an Array[Any], except printing it out, or converting it to an even more degenerate Set[Any]. Instead you could use:
Lists of case classes belonging to a common sealed trait,
shapeless HLists,
a carefully chosen base class with a bit of inheritance,
or something that at least keeps some kind of schema at runtime (like Apache Spark Datasets)
they would all be better alternatives.
In the somewhat less likely case that the elements of the "tuples" that you are processing frequently turn out to have an informative least upper bound type, then it might be because you aren't working with plain tuples, but with some kind of traversable data structure that puts restrictions on the number of substructures in the nodes. In this case, you should consider implementing something like Traverse interface for the structure, instead of messing with some "tuples" manually.

Chain-signing elements of an Akka Stream

I have a Source[ByteString] that needs to be transformed in the following manner:
Each ByteString element needs to have a cryptographic signature concatenated to it.
The signature to be concatenated to the first element in the stream is known beforehand.
Each signature after that is computed from the data in the preceding element (the unsigned version), then concatenated to its corresponding element.
The desired result is conceptually a Source of data chunks that are chained together via signatures. The overall functional signature would be (Source[ByteString]) => Source[ByteString]. Almost like a map() but computationally tying later elements to their predecessors.
I had hoped to use Source.fold() to achieve this, but I can't find the right zero that will let me carry signatures into each iteration of the fold, but also emit the signed chunks as if the stream had simply been mapped over.
(Concretely, I'm trying to implement the AWS S3 algorithm for signing a chunked object upload, as documented at http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-streaming.html#sigv4-chunked-body-definition.)
What's a good approach to take here?
You were almost there with your idea to use fold here - you need scan which is essentially the same but keeps the stream of transformed elements, and passes the result to the next iteration.
So, given you have a method def sign(prevElement: ByteString, thisElement: ByteString): ByteString, your flow can be described as follows:
Flow[String]
.scan(zeroSignature)(sign)
.drop(1)
scan keeps zero element which you probably don't need, so drop(1) is needed to discard it.

Why doesn't scala's parallel sequences have a contains method?

Why does
List.range(0,100).contains(2)
Work, while
List.range(0,100).par.contains(2)
Does not?
This is planned for the future?
The non-teleological answer is that it's because contains is defined in SeqLike but not in ParSeqLike.
If that doesn't satisfy your curiosity, you can find that SeqLike's contains is defined thus:
def contains(elem: Any): Boolean = exists (_ == elem)
So for your example you can write
List.range(0,100).par.exists(_ == 2)
ParSeqLike is missing a few other methods as well, some of which would be hard to implement efficiently (e.g. indexOfSlice) and some for less obvious reasons (e.g. combinations - maybe because that's only useful on small datasets). But if you have a parallel collection you can also use .seq to get back to the linear version and get your methods back:
List.range(0,100).par.seq.contains(2)
As for why the library designers left it out... I'm totally guessing, but maybe they wanted to reduce the number of methods for simplicity's sake, and it's nearly as easy to use exists.
This also raises the question, why is contains defined on SeqLike rather than on the granddaddy of all collections, GenTraversableOnce, where you find exists? A possible reason is that contains for Map is semantically a different method to that on Set and Seq. A Map[A,B] is a Traversable[(A,B)], so if contains were defined for Traversable, contains would need to take a tuple (A,B) argument; however Map's contains takes just an A argument. Given this, I think contains should be defined in GenSeqLike - maybe this is an oversight that will be corrected.
(I thought at first maybe parallel sequences don't have contains because searching where you intend to stop after finding your target on parallel collections is a lot less efficient than the linear version (the various threads do a lot of unnecessary work after the value is found: see this question), but that can't be right because exists is there.)

How do I deal with Scala collections generically?

I have realized that my typical way of passing Scala collections around could use some improvement.
def doSomethingCool(theFoos: List[Foo]) = { /* insert cool stuff here */ }
// if I happen to have a List
doSomethingCool(theFoos)
// but elsewhere I may have a Vector, Set, Option, ...
doSomethingCool(theFoos.toList)
I tend to write my library functions to take a List as the parameter type, but I'm certain that there's something more general I can put there to avoid all the occasional .toList calls I have in the application code. This is especially annoying since my doSomethingCool function typically only needs to call map, flatMap and filter, which are defined on all the collection types.
What are my options for that 'something more general'?
Here are more general traits, each of which extends the previous one:
GenTraversableOnce
GenTraversable
GenIterable
GenSeq
The traits above do not specify whether the collection is sequential or parallel. If your code requires that things be executed sequentially (typically, if your code has side effects of any kind), they are too general for it.
The following traits mandate sequential execution:
TraversableOnce
Traversable
Iterable
Seq
LinearSeq
The first one, TraversableOnce only allows you to call one method on the collection. After that, the collection has been "used". In exchange, it is general enough to accept iterators as well as collections.
Traversable is a pretty general collection that has most methods. There are some things it cannot do, however, in which case you need to go to Iterable.
All Iterable implement the iterator method, which allows you to get an Iterator for that collection. This gives it the capability for a few methods not present in Traversable.
A Seq[A] implements the function Int => A, which means you can access any element by its index. This is not guaranteed to be efficient, but it is a guarantee that each element has an index, and that you can make assertions about what that index is going to be. Contrast this with Map and Set, where you cannot tell what the index of an element is.
A LinearSeq is a Seq that provides fast head, tail, isEmpty and prepend. This is as close as you can get to a List without actually using a List explicitly.
Alternatively, you could have an IndexedSeq, which has fast indexed access (something List does not provide).
See also this question and this FAQ based on it.
The most obvious one is to use Traversable as the most general trait which will have the goodies you want. However, I think you are generally better sticking to:
Seq
IndexedSeq
Set
Map
A Seq will cover List, Vector etc, IndexedSeq will cover Vector etc etc. I found myself not using Iterable because I often need (or want) to know the size of the thing I have and back pre scala-2.8 Iterable did not provide access to this, so I kept having to turn things into sequences anyway!
Looks like Traversable and Iterable now have size methods so maybe I should go back to using them! Of course you could start "going mad" with GenTraversableOnce but that is not likely to aid in readability.

Complexity of List.reverse?

In Scala, there is reverse method for lists. What is the complexity of this method? Is it better to simply use the original list and always remember that the list is the reverse of what we expect, or to explicitly use reverse before operating on it.
EDIT: What I am really interested in is to get the last two elements of the original list (or the first two of the reversed list).
So I would do something like:
val myList = origList.reverse
val a = myList(0)
val b = myList(1)
This is not in a loop, just a one-time thing in my library... but if someone else uses the library and puts it in a loop, it is not under my control.
Looking at the source, it's O(n) as you might reasonably expect:
override def reverse: List[A] = {
var result: List[A] = Nil
var these = this
while (!these.isEmpty) {
result = these.head :: result
these = these.tail
}
result
}
If in your code you're able to iterate through the list in reverse order at the same cost of iterating in forward order, then it would be more efficient to do this rather than reversing the List.
In fact, if your alternative operation which involves using the original list works in less than O(n) time, then there's a real argument for going with that. Making an algorithm asymptotically faster will make a huge difference if you ever rely on it more (especially if used inside other loops, as oxbow_lakes points out below).
On the whole though I'd expect that anything where you're reversing a list means that you care about the relative ordering of a non-trivial number of elements, and so whatever you're doing is inherently O(n) anyway. (This might not be true for other data structures such as a binary tree; but lists are linear, and in the extreme case even reverse . head can't be done in O(1) time with a singly-linked list.)
So if you're choosing between two O(n) options - for the vast majority of applications, shaving a few nanoseconds off the iteration time isn't going to really gain you anything. Hence it would be "best" to make your code as readable as possible - which means calling reverse and then iterating, if that's closest to your intention.
(And if your app is too slow, and profiling shows that this list manipulation is a hotspot, then you can think about how to make it more efficient. Which by that point may well involve a different option to both of your current candidates, given the extra context you'll have at that point.)