Scala for comprehension efficiency? - scala

In the book "Programming In Scala", chapter 23, the author gives an example like:
case class Book(title: String, authors: String*)
val books: List[Book] = // list of books, omitted here
// find all authors who have published at least two books
for (b1 <- books; b2 <- books if b1 != b2;
     a1 <- b1.authors; a2 <- b2.authors if a1 == a2)
yield a1
The author says this will be translated into:
books flatMap (b1 =>
  books filter (b2 => b1 != b2) flatMap (b2 =>
    b1.authors flatMap (a1 =>
      b2.authors filter (a2 => a1 == a2) map (a2 =>
        a1))))
But if you look at the map and flatMap method definitions (TraversableLike.scala), you will find that they are themselves defined using for loops:
def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
  val b = bf(repr)
  b.sizeHint(this)
  for (x <- this) b += f(x)
  b.result
}

def flatMap[B, That](f: A => Traversable[B])(implicit bf: CanBuildFrom[Repr, B, That]): That = {
  val b = bf(repr)
  for (x <- this) b ++= f(x)
  b.result
}
Well, I guess this for will in turn be translated to foreach and then to a while statement, which is a construct rather than an expression; Scala doesn't have a plain for statement, because it wants for to always yield something.
So, what I want to discuss is: why does Scala do this "for translation"?
The author's example used 4 generators, which will be translated into a 4-level nested loop in the end; I think it'll have really horrible performance when books is large.
Scala encourages people to use this kind of "syntactic sugar"; you constantly see code that makes heavy use of filter, map and flatMap, as if programmers were forgetting that what they are really doing is nesting one loop inside another, and all that is achieved is making the code look a bit shorter. What do you think?

For comprehensions are syntactic sugar for monadic transformation, and, as such, are useful in all sorts of places. At that, they are much more verbose in Scala than the equivalent Haskell construct (of course, Haskell is non-strict by default, so one can't talk about performance of the construct like in Scala).
Also important, this construct keeps what is being done clear, and avoids quickly escalating indentation or unnecessary private method nesting.
As to the final consideration, whether that hides the complexity or not, I'll posit this:
for {
  b1 <- books
  b2 <- books
  if b1 != b2
  a1 <- b1.authors
  a2 <- b2.authors
  if a1 == a2
} yield a1
It is very easy to see what is being done, and the complexity is clear: b^2 * a^2 (the filter won't alter the complexity), where b is the number of books and a the number of authors per book. Now, write the same code in Java, either with deep indentation or with private methods, and try to ascertain, in a quick look, what the complexity of the code is.
So, imho, this doesn't hide the complexity, but, on the contrary, makes it clear.
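For comparison, here is a sketch (my own, not from the answer) of the same query written imperatively with explicit nesting and a mutable buffer; the b^2 * a^2 shape is still there, but you have to read the whole nest to see it:
import scala.collection.mutable.ListBuffer

def authorsOfAtLeastTwoBooks(books: List[Book]): List[String] = {
  val result = ListBuffer.empty[String]
  for (b1 <- books) {
    for (b2 <- books) {
      if (b1 != b2) {
        for (a1 <- b1.authors) {
          for (a2 <- b2.authors) {
            if (a1 == a2) result += a1
          }
        }
      }
    }
  }
  result.toList
}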
As for the map/flatMap/filter definitions you mention, they do not belong to List or any other class, so they won't be applied. Basically,
for(x <- List(1, 2, 3)) yield x * 2
is translated into
List(1, 2, 3) map (x => x * 2)
and that is not the same thing as
map(List(1, 2, 3), ((x: Int) => x * 2))
which is how the definition you passed would be called. For the record, the actual implementation of map on List is:
def map[B, That](f: A => B)(implicit bf: CanBuildFrom[Repr, B, That]): That = {
  val b = bf(repr)
  b.sizeHint(this)
  for (x <- this) b += f(x)
  b.result
}

I write code so that it's easy to understand and maintain. I then profile. If there's a bottleneck that's where I devote my attention. If it's in something like you've described I'll attack the problem in a different manner. Until then, I love the "sugar." It saves me the trouble of writing things out or thinking hard about it.

There are actually 6 loops. One loop for each filter/flatMap/map
The filter->map pairs can be done in one loop by using lazy views of the collections (the iterator method); see the sketch at the end of this answer.
In general, it is running two nested loops over books to find all book pairs, and then two nested loops to find whether an author of one book is in the author list of the other.
Using simple data structures, you would do the same when coding explicitly.
And of course, the example here is to show a complex 'for' loop, not to write the most efficient code. E.g., instead of a sequence of authors, one could use a Set and then find if the intersection is non-empty:
for (b1 <- books; b2 <- books; a <- (b1.authors & b2.authors)) yield a
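To make the earlier remark about lazy views concrete, here is a sketch (my own code, reusing the Book type from the question) that runs each filter -> flatMap pair through iterators, so the filtered books and the matching authors are never materialised as intermediate lists:
val commonAuthors =
  books.iterator.flatMap { b1 =>
    books.iterator.filter(_ != b1).flatMap { b2 =>
      b1.authors.iterator.flatMap { a1 =>
        b2.authors.iterator.filter(_ == a1).map(_ => a1)
      }
    }
  }.toList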

Note that in 2.8, the filter call was changed to withFilter, which is lazy and avoids constructing an intermediate structure. See the question "guide to move from filter to withFilter?".
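For reference, a sketch of what the translation looks like once the guards use withFilter (my own rendering, following the shape of the expansion quoted in the question):
books flatMap (b1 =>
  books withFilter (b2 => b1 != b2) flatMap (b2 =>
    b1.authors flatMap (a1 =>
      b2.authors withFilter (a2 => a1 == a2) map (a2 =>
        a1))))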
I believe the reason that for is translated to map, flatMap and withFilter (as well as value definitions if present) is to make the use of monads easier.
In general I think if the computation you are doing involves looping 4 times, it is fine using the for loop. If the computation can be done more efficiently and performance is important then you should use the more efficient algorithm.

One follow-up to #IttayD's answer on the algorithm's efficiency. It's worth noting that the algorithm in the original post (and in the book) is a nested loop join. In practice, this isn't an efficient algorithm for large datasets, and most databases would use a hash aggregate here instead. In Scala, a hash aggregate would look something like:
(for (book <- books;
      author <- book.authors) yield (book, author)
).groupBy(_._2).filter(_._2.size > 1).keys


How to implement memoization in Scala without mutability?

I was recently reading Category Theory for Programmers and in one of the challenges, Bartosz proposed to write a function called memoize which takes a function as an argument and returns the same function, with the difference that, the first time this new function is called with some argument, it stores the result and from then on returns that stored result whenever it is called with that argument again.
def memoize[A, B](f: A => B): A => B = ???
The problem is, I can't think of any way to implement this function without resorting to mutability. Moreover, the implementations I have seen use mutable data structures to accomplish the task.
My question is, is there a purely functional way of accomplishing this? Maybe without mutability or by using some functional trick?
Thanks for reading my question and for any future help. Have a nice day!
is there a purely functional way of accomplishing this?
No. Not in the narrowest sense of pure functions and using the given signature.
TLDR: Use mutable collections, it's okay!
Impurity of g
val g = memoize(f)
// state 1
g(a)
// state 2
What would you expect to happen for the call g(a)?
If g(a) memoizes the result, an (internal) state has to change, so the state is different after the call g(a) than before.
As this could be observed from the outside, the call to g has side effects, which makes your program impure.
From the Book you referenced, 2.5 Pure and Dirty Functions:
[...] functions that
always produce the same result given the same input and
have no side effects
are called pure functions.
Is this really a side effect?
Normally, at least in Scala, internal state changes are not considered side effects.
See the definition in the Scala Book
A pure function is a function that depends only on its declared inputs and its internal algorithm to produce its output. It does not read any other values from “the outside world” — the world outside of the function’s scope — and it does not modify any values in the outside world.
The following examples of lazy computations both change their internal states, but are normally still considered purely functional as they always yield the same result and have no side effects apart from internal state:
lazy val x = 1
// state 1: x is not computed
x
// state 2: x is 1
val ll = LazyList.continually(0)
// state 1: ll = LazyList(<not computed>)
ll(0)
// state 2: ll = LazyList(0, <not computed>)
In your case, the equivalent would be something using a private, mutable Map (as the implementations you may have found) like:
import scala.collection.mutable

def memoize[A, B](f: A => B): A => B = {
  val cache = mutable.Map.empty[A, B]
  (a: A) => cache.getOrElseUpdate(a, f(a))
}
Note that the cache is not public.
So, for a pure function f and without looking at memory consumption, timings, reflection or other evil stuff, you won't be able to tell from the outside whether f was called twice or g cached the result of f.
In this sense, side effects are only things like printing output, writing to public variables, files etc.
Thus, this implementation is considered pure (at least in Scala).
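A tiny usage sketch (f here is a made-up pure function, not from the question):
val f: Int => Int = n => n * n // assume f is pure (and possibly expensive)
val g = memoize(f)
g(3) // f is evaluated once and the result is cached
g(3) // same result, served from the private cache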
Avoiding mutable collections
If you really want to avoid var and mutable collections, you need to change the signature of your memoize method.
This is because, if g cannot change internal state, it won't be able to memoize anything new after it has been initialized.
An (inefficient but simple) example would be
def memoizeOneValue[A, B](f: A => B)(a: A): (B, A => B) = {
  val b = f(a)
  val g = (v: A) => if (v == a) b else f(v)
  (b, g)
}
val (b1, g) = memoizeOneValue(f)(a1)
val (b2, h) = memoizeOneValue(g)(a2)
// ...
The result of f(a1) would be cached in g, but nothing else. Then, you could chain this and always get a new function.
If you are interested in a faster version of that, see @esse's answer, which does the same thing but more efficiently (using an immutable map, so O(log(n)) lookups instead of the O(n) chain of functions above).
Let's try (note: I have changed the return type of memoize to store the cached data):
import scala.language.existentials

type M[A, B] = A => T forSome { type T <: (B, A => T) }

def memoize[A, B](f: A => B): M[A, B] = {
  import scala.collection.immutable
  def withCache(cache: immutable.Map[A, B]): M[A, B] = a => cache.get(a) match {
    case Some(b) => (b, withCache(cache))
    case None =>
      val b = f(a)
      (b, withCache(cache + (a -> b)))
  }
  withCache(immutable.Map.empty)
}
def f(i: Int): Int = { print(s"Invoke f($i)"); i }
val (i0, m0) = memoize(f)(1) // f only invoked at first time
val (i1, m1) = m0(1)
val (i2, m2) = m1(1)
Yes, there are purely functional ways to implement polymorphic function memoization. The topic is surprisingly deep and even summons the Yoneda Lemma, which is likely what Bartosz had in mind with this exercise.
The blog post Memoization in Haskell gives a nice introduction by simplifying the problem a bit: instead of looking at arbitrary functions it restricts the problem to functions from the integers.
The following memoize function takes a function of type Int -> a and
returns a memoized version of the same function. The trick is to turn
a function into a value because, in Haskell, functions are not
memoized but values are. memoize converts a function f :: Int -> a
into an infinite list [a] whose nth element contains the value of f n.
Thus each element of the list is evaluated when it is first accessed
and cached automatically by the Haskell runtime thanks to lazy
evaluation.
memoize :: (Int -> a) -> (Int -> a)
memoize f = (map f [0 ..] !!)
Apparently the approach can be generalised to functions with arbitrary domains. The trick is to come up with a way to use the type of the domain as an index into a lazy data structure used for "storing" previous values. And this is where the Yoneda Lemma comes in and my own understanding of the topic becomes flimsy.
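For what it's worth, here is a rough Scala analogue of that Haskell trick (my own sketch, using Scala 2.13's LazyList; it only handles non-negative Int arguments, and lookups are O(n), just like the Haskell list version):
def memoizeNat[A](f: Int => A): Int => A = {
  val cache = LazyList.from(0).map(f) // elements are computed on demand and cached by the LazyList
  n => cache(n)
}

val slowSquare = (n: Int) => { Thread.sleep(100); n * n }
val fastSquare = memoizeNat(slowSquare)
fastSquare(10) // slow the first time
fastSquare(10) // answered from the cached element afterwards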

What kind of morphism is `filter` in category theory?

In category theory, is the filter operation considered a morphism? If yes, what kind of morphism is it? Example (in Scala)
val myNums: Seq[Int] = Seq(-1, 3, -4, 2)
myNums.filter(_ > 0)
// Seq[Int] = List(3, 2) // result = subset, same type
myNums.filter(_ > -99)
// Seq[Int] = List(-1, 3, -4, 2) // result = identical to original
myNums.filter(_ > 99)
// Seq[Int] = List() // result = empty, same type
One interesting way of looking at this matter involves not picking filter as a primitive notion. There is a Haskell type class called Filterable which is aptly described as:
Like Functor, but it [includes] Maybe effects.
Formally, the class Filterable represents a functor from Kleisli Maybe to Hask.
The morphism mapping of the "functor from Kleisli Maybe to Hask" is captured by the mapMaybe method of the class, which is indeed a generalisation of the homonymous Data.Maybe function:
mapMaybe :: Filterable f => (a -> Maybe b) -> f a -> f b
The class laws are simply the appropriate functor laws (note that Just and (<=<) are, respectively, identity and composition in Kleisli Maybe):
mapMaybe Just = id
mapMaybe (g <=< f) = mapMaybe g . mapMaybe f
The class can also be expressed in terms of catMaybes...
catMaybes :: Filterable f => f (Maybe a) -> f a
... which is interdefinable with mapMaybe (cf. the analogous relationship between sequenceA and traverse)...
catMaybes = mapMaybe id
mapMaybe g = catMaybes . fmap g
... and amounts to a natural transformation between the Hask endofunctors Compose f Maybe and f.
What does all of that have to do with your question? Firstly, a functor is a morphism between categories, and a natural transformation is a morphism between functors. That being so, it is possible to talk of morphisms here in a sense that is less boring than the "morphisms in Hask" one. You won't necessarily want to do so, but in any case it is an existing vantage point.
Secondly, filter is, unsurprisingly, also a method of Filterable, its default definition being:
filter :: Filterable f => (a -> Bool) -> f a -> f a
filter p = mapMaybe $ \a -> if p a then Just a else Nothing
Or, to spell it using another cute combinator:
filter p = mapMaybe (ensure p)
That indirectly gives filter a place in this particular constellation of categorical notions.
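For readers more comfortable in Scala, a rough rendering of the same idea (my sketch, with List standing in for the Filterable f):
// mapMaybe is flatMap over Option...
def mapMaybe[A, B](f: A => Option[B])(xs: List[A]): List[B] =
  xs.flatMap(f(_).toList)

// ...and filter is the special case where the function keeps or drops its input unchanged.
def filterViaMapMaybe[A](p: A => Boolean)(xs: List[A]): List[A] =
  mapMaybe((a: A) => if (p(a)) Some(a) else None)(xs)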
To answer a question like this, I'd like to first understand what the essence of filtering is.
For instance, does it matter that the input is a list? Could you filter a tree? I don't see why not! You'd apply a predicate to each node of the tree and discard the ones that fail the test.
But what would be the shape of the result? Node deletion is not always defined or it's ambiguous. You could return a list. But why a list? Any data structure that supports appending would work. You also need an empty member of your data structure to start the appending process. So any unital magma would do. If you insist on associativity, you get a monoid. Looking back at the definition of filter, the result is a list, which is indeed a monoid. So we are on the right track.
So filter is just a special case of what's called Foldable: a data structure over which you can fold while accumulating the results in a monoid. In particular, you could use the predicate to either output a singleton list, if it's true; or an empty list (identity element), if it's false.
If you want a categorical answer, then a fold is an example of a catamorphism, an example of a morphism in the category of algebras. The (recursive) data structure you're folding over (a list, in the case of filter) is an initial algebra for some functor (the list functor, in this case), and your predicate is used to define an algebra for this functor.
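To make the tree example concrete, here is a small sketch (my own types, not from the answer) that folds a binary tree into the list monoid, emitting a singleton for elements that pass the predicate and the empty list otherwise:
sealed trait Tree[+A]
case object Leaf extends Tree[Nothing]
final case class Node[A](left: Tree[A], value: A, right: Tree[A]) extends Tree[A]

def filterTree[A](p: A => Boolean)(t: Tree[A]): List[A] = t match {
  case Leaf          => Nil // the monoid's identity element
  case Node(l, v, r) => filterTree(p)(l) ::: (if (p(v)) List(v) else Nil) ::: filterTree(p)(r)
}

filterTree[Int](_ > 0)(Node(Node(Leaf, -1, Leaf), 3, Node(Leaf, 2, Leaf))) // List(3, 2)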
In this answer, I will assume that you are talking about filter on Set (the situation seems messier for other datatypes).
Let's first fix what we are talking about. I will talk specifically about the following function (in Scala):
def filter[A](p: A => Boolean): Set[A] => Set[A] =
s => s filter p
When we write it down this way, we see clearly that it's a polymorphic function with type parameter A that maps predicates A => Boolean to functions that map Set[A] to other Set[A]. To make it a "morphism", we would have to find some categories first, in which this thing could be a "morphism". One might hope that it's natural transformation, and therefore a morphism in the category of endofunctors on the "default ambient category-esque structure" usually referred to as "Hask" (or "Scal"? "Scala"?). To show that it's natural, we would have to check that the following diagram commutes for every f: B => A:
                          - o f
Hom[A, Boolean] ---------------------> Hom[B, Boolean]
        |                                      |
        |                                      |
        | filter[A]                            | filter[B]
        |                                      |
        V                  ???                 V
Hom[Set[A], Set[A]] ---------------> Hom[Set[B], Set[B]]
however, here we fail immediately, because it's not clear what to even put on the horizontal arrow at the bottom, since the assignment A -> Hom[Set[A], Set[A]] doesn't even seem functorial (for the same reasons why A -> End[A] is not functorial, see here and also here).
The only "categorical" structure that I see here for a fixed type A is the following:
Predicates on A can be considered to be a partially ordered set with implication, that is p LEQ q if p implies q (i.e. for every x: A, either p(x) is false or q(x) is true).
Analogously, on functions Set[A] => Set[A], we can define a partial order with f LEQ g whenever for each set s: Set[A] it holds that f(s) is subset of g(s).
Then filter[A] would be monotonic, and therefore a functor between poset-categories. But that's somewhat boring.
Of course, for each fixed A, it (or rather its eta-expansion) is also just a function from A => Boolean to Set[A] => Set[A], so it's automatically a "morphism" in the "Hask-category". But that's even more boring.
filter can be written in terms of foldRight as:
filter p ys = foldRight(nil)( (x, xs) => if (p(x)) x::xs else xs ) ys
foldRight on lists is a map of T-algebras (where here T is the List datatype functor), so filter is a map of T-algebras.
The two algebras in question here are the initial list algebra
[nil, cons]: 1 + A x List(A) ----> List(A)
and, let's say the "filter" algebra,
[nil, f]: 1 + A x List(A) ----> List(A)
where f(x, xs) = if p(x) x::xs else xs.
Let's call filter(p, _) the unique map from the initial algebra to the filter algebra in this case (it is called fold in the general case). The fact that it is a map of algebras means that the following equations are satisfied:
filter(p, nil) = nil
filter(p, x::xs) = f(x, filter(p, xs))

What is meant by parallelism in this case?

In this blog post about parallel collections in Scala http://beust.com/weblog/2011/08/15/scalas-parallel-collections/
This is mentioned on a comment by Daniel Spiewak:
Other people have already commented about the mkString example, so I’m
going to leave it alone. It does reflect a much larger point about
collection semantics in general. Basically, this is it: in the absence
of side-effects (in user code), parallel collections have the
exact same semantics as the sequential collections. Put another way:
forAll { (xs: Vector[A], f: A => B) =>
  (xs.par map f) == (xs map f)
}
Does this mean that if there are no side effects parallelism is not achieved? If this is true, can this point be expanded to explain why this is the case?
Does this mean that if there are no side effects parallelism is not
achieved?
No, that's not what it means. When Daniel Spiewak says that
Basically, this is it: in the absence of side-effects (in user code),
parallel collections have the exact same semantics as the sequential
collections.
It means that if your function has no side-effects then using it to map over a simple collection or a parallel collection will yield the same outcome. Which is why:
Put another way: forAll { (xs: Vector[A], f: A => B) => (xs.par map f)
== (xs map f) }
if f is side-effect free.
So, it's actually the opposite: if there are side effects parallelism is not a good idea since the outcome will be inconsistent.
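As a minimal check of the quoted property for one concrete, pure f (my own snippet; on Scala 2.13 the .par method comes from the separate scala-parallel-collections module):
import scala.collection.parallel.CollectionConverters._ // provides .par on Scala 2.13+

val xs = Vector.tabulate(1000)(identity)
val f  = (i: Int) => i * 2
assert(xs.par.map(f).seq == xs.map(f)) // always holds, because f has no side effects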
It will always be run in parallel, but the result may differ when side-effects are there.
Let's say A = Int and B = Int with following code:
var tmp = false
def f(i: Int) = if(!tmp) {tmp = true; 0} else i + 1
So here we have a function f with a side-effect. I assume that tmp is false before running the code.
Running Vector(1,2,3).map(f) will always result in Vector(0,3,4)
But Vector(1,2,3).par.map(f) can have different results. It can be Vector(0,3,4), but since it's parallel, the second element might be mapped first, and so on. So something like Vector(2,0,4) might happen.
You can only be sure that the result will always be the same when f has no side effects.

Why is zipWithIndex implemented in Iterable and not Traversable?

I'm reading "Programming in Scala 2ed". In section 24.4, it's noted that Iterable contains many methods that cannot be written efficiently without an iterator. Table 24.2 lists these methods. However, I don't understand why some of them could not be implemented efficiently without an iterator. For example, consider zipWithIndex.
def zipWithIndex[A1 >: A, That](implicit bf: CanBuildFrom[Repr, (A1, Int), That]): That = {
  val b = bf(repr)
  var i = 0
  for (x <- this) {
    b += ((x, i))
    i += 1
  }
  b.result
}
Why not move this definition to Traversable? It seems to me that the code could be exactly the same and there would be no difference in efficiency.
You're completely correct, and your implementation should work. No good reason to have zipWithIndex defined in Iterable and not Traversable; neither makes any guarantee about the ordering of the elements under traversal.
(This is my first answer on StackOverflow. Hope I've been helpful. :) If I've not, please tell me.)
Traversable does not guarantee the order in which the elements will be visited and only requires you to define a foreach method with the following signature:
def foreach[U](f: Elem => U): Unit
Since this method just needs to call f for each element in any order, it doesn't make sense to have an index on elements since the order could be different for each invocation of foreach.
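A sketch of that minimal contract (my own example, written against the pre-2.13 collections API the book describes): all you have to supply is foreach, and nothing in it pins down a visiting order.
class EvenNumbersBelow(limit: Int) extends Traversable[Int] {
  def foreach[U](f: Int => U): Unit = {
    var i = 0
    while (i < limit) { if (i % 2 == 0) f(i); i += 1 }
  }
}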
Edit: this is really just an explanation of why it's not on Traversable. As Luigi pointed out in the comments, zipWithIndex would make more sense on Seq.

Example of the Scala aggregate function

I have been looking and I cannot find an example or discussion of the aggregate function in Scala that I can understand. It seems pretty powerful.
Can this function be used to reduce the values of tuples to make a multimap-type collection? For example:
val list = Seq(("one", "i"), ("two", "2"), ("two", "ii"), ("one", "1"), ("four", "iv"))
After applying aggregate:
Seq(("one" -> Seq("i","1")), ("two" -> Seq("2", "ii")), ("four" -> Seq("iv"))
Also, can you give examples of the parameters z, seqop, and combop? I'm unclear on what these parameters do.
Let's see if some ascii art doesn't help. Consider the type signature of aggregate:
def aggregate [B] (z: B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
Also, note that A refers to the element type of the collection. So, let's say we have 4 elements in this collection; then aggregate might work like this:
    z   A   z   A   z   A   z   A
     \ /     \ /seqop\ /     \ /
      B       B       B       B
        \   / combop    \   /
          B _          _ B
             \ combop /
                  B
Let's see a practical example of that. Say I have a GenSeq("This", "is", "an", "example"), and I want to know how many characters there are in it. I can write the following:
Note the use of par in the snippet below. The second function passed to aggregate is what is called after the individual sub-results are computed; Scala is only able to do this for collections that can be parallelized.
import scala.collection.GenSeq
val seq = GenSeq("This", "is", "an", "example")
val chars = seq.par.aggregate(0)(_ + _.length, _ + _)
So, first it would compute this:
0 + "This".length // 4
0 + "is".length // 2
0 + "an".length // 2
0 + "example".length // 7
What it does next cannot be predicted (there are more than one way of combining the results), but it might do this (like in the ascii art above):
4 + 2 // 6
2 + 7 // 9
At which point it concludes with
6 + 9 // 15
which gives the final result. Now, this is a bit similar in structure to foldLeft, but it has an additional function (B, B) => B, which fold doesn't have. This function, however, enables it to work in parallel!
Consider, for example, that each of the four initial computations is independent of the others, and they can be done in parallel. The next two (resulting in 6 and 9) can be started once the computations they depend on are finished, and these two can also run in parallel.
The 7 computations, parallelized as above, could take as little time as 3 serial computations.
Actually, with such a small collection the cost in synchronizing computation would be big enough to wipe out any gains. Furthermore, if you folded this, it would only take 4 computations total. Once your collections get larger, however, you start to see some real gains.
Consider, on the other hand, foldLeft. Because it doesn't have the additional function, it cannot parallelize any computation:
(((0 + "This".length) + "is".length) + "an".length) + "example".length
Each of the inner parentheses must be computed before the outer one can proceed.
The aggregate function does not do that (except that it is a very general function, and it could be used to do that). You want groupBy. Or close to it, at least. As you start with a Seq[(String, String)] and group by taking the first item in the tuple (which is (String, String) => String), it returns a Map[String, Seq[(String, String)]]. You then have to discard the first element in the Seq[(String, String)] values.
So
list.groupBy(_._1).mapValues(_.map(_._2))
There you get a Map[String, Seq[String]]. If you want a Seq instead of a Map, call toSeq on the result. I don't think you have a guarantee on the order in the resulting Seq, though.
Aggregate is a more difficult function.
Consider first reduceLeft and reduceRight.
Let as be a non-empty sequence as = Seq(a1, ... an) of elements of type A, and f: (A,A) => A be some way to combine two elements of type A into one. I will write it as a binary operator #: a1 # a2 rather than f(a1, a2). as.reduceLeft(#) will compute (((a1 # a2) # a3) ... # an). reduceRight will put the parentheses the other way: (a1 # (a2 # (... # an))). If # happens to be associative, one does not care about the parentheses. One could compute it as (a1 # ... # ap) # (ap+1 # ... # an) (there would be parentheses inside the two big parentheses too, but let's not care about that). Then one could do the two parts in parallel, while the nested bracketing in reduceLeft or reduceRight forces a fully sequential computation. But parallel computation is only possible when # is known to be associative, and the reduceLeft method cannot know that.
Still, there could be a method reduce, whose caller would be responsible for ensuring that the operation is associative. Then reduce would order the calls as it sees fit, possibly doing them in parallel. Indeed, there is such a method.
There is a limitation with the various reduce methods, however. The elements of the Seq can only be combined into a result of the same type: # has to be (A,A) => A. But one could have the more general problem of combining them into a B. One starts with a value b of type B and combines it with every element of the sequence. The operator # is (B,A) => B, and one computes (((b # a1) # a2) ... # an). foldLeft does that. foldRight does the same thing but starting from an. There, the # operation has no chance of being associative. When one writes b # a1 # a2, it must mean (b # a1) # a2, as (a1 # a2) would be ill-typed. So foldLeft and foldRight have to be sequential.
Suppose, however, that each A can be turned into a B; let's write it with !, so a! is of type B. Suppose moreover that there is a + operation (B,B) => B, and that # is such that b # a is in fact b + a!. Rather than combining elements with #, one could first transform all of them to B with !, then combine them with +. That would be as.map(!).reduceLeft(+). And if + is associative, then that can be done with reduce, and not be sequential: as.map(!).reduce(+). There could be a hypothetical method as.associativeFold(b, !, +).
Aggregate is very close to that. It may be, however, that there is a more efficient way to implement b # a than b + a!. For instance, if type B is List[A], and b # a is a::b, then a! will be a::Nil, and b1 + b2 will be b2 ::: b1. a::b is way better than (a::Nil):::b. To benefit from associativity, but still use #, one first splits b + a1! + ... + an! into (b + a1! + ... + ap!) + (ap+1! + ... + an!), then goes back to using # with (b # a1 # ... # ap) + (ap+1! # ... # an). One still needs the ! on ap+1, because one must start with some b. And the + is still necessary too, appearing between the parentheses. To do that, the hypothetical as.associativeFold(b, !, +) could be changed to as.optimizedAssociativeFold(b, !, #, +).
Back to +. + is associative, or equivalently, (B, +) is a semigroup. In practice, most of the semigroups used in programming happen to be monoids too, i.e. they contain a neutral element z (for zero) in B, so that for each b, z + b = b + z = b. In that case, the ! operation that makes sense is likely to be a! = z # a. Moreover, as z is a neutral element, b # a1 # ... # an = (b + z) # a1 # ... # an, which is b + (z # a1 # ... # an). So it is always possible to start the aggregation with z; if b is wanted instead, you do b + result at the end. With all those hypotheses, we can do as.aggregate(z)(#, +). That is what aggregate does: # is the seqop argument (applied in a sequence, z # a1 # a2 # ... # ap), and + is combop (applied to already partially combined results, as in (z # a1 # ... # ap) + (z # ap+1 # ... # an)).
To sum it up, as.aggregate(z)(seqop, combop) computes the same thing as as.foldLeft(z)(seqop) provided that
(B, combop, z) is a monoid
seqop(b,a) = combop(b, seqop(z,a))
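As a concrete instance of those two conditions, here is the character-count example from earlier in the thread spelled out (my own check; aggregate on sequential collections is the pre-2.13 API):
val words  = List("This", "is", "an", "example")
val z      = 0
val seqop  = (b: Int, a: String) => b + a.length // seqop(b, a) == combop(b, seqop(z, a))
val combop = (x: Int, y: Int) => x + y           // (Int, +, 0) is a monoid
words.aggregate(z)(seqop, combop) == words.foldLeft(z)(seqop) // true, both are 15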
The aggregate implementation may use the associativity of combop to group the computations as it likes (without swapping elements, however; + does not have to be commutative, and ::: is not). It may run them in parallel.
Finally, solving the initial problem using aggregate is left as an exercise to the reader. A hint: implement it using foldLeft, then find z and combop that will satisfy the conditions stated above.
The signature for a collection with elements of type A is:
def aggregate [B] (z: B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
z is an object of type B acting as a neutral element. If you want to count something, you can use 0, if you want to build a list, start with an empty list, etc.
seqop is analogous to the function you pass to fold methods. It takes two arguments: the first one has the same type as the neutral element you passed and represents the stuff that was already aggregated in the previous iteration; the second one is the next element of your collection. The result must also be of type B.
combop is a function combining two results into one.
In most collections, aggregate is implemented in TraversableOnce as:
def aggregate[B](z: B)(seqop: (B, A) => B, combop: (B, B) => B): B =
  foldLeft(z)(seqop)
Thus combop is ignored. However, it makes sense for parallel collections, because seqop will first be applied locally in parallel, and then combop is called to finish the aggregation.
So for your example, you can try with a fold first:
val seqOp =
  (map: Map[String, Set[String]], tuple: (String, String)) =>
    map + (tuple._1 -> (map.getOrElse(tuple._1, Set[String]()) + tuple._2))

list.foldLeft(Map[String, Set[String]]())(seqOp)
// returns: Map(one -> Set(i, 1), two -> Set(2, ii), four -> Set(iv))
Then you have to find a way of collapsing two multimaps:
val combOp = (map1: Map[String, Set[String]], map2: Map[String, Set[String]]) =>
  (map1.keySet ++ map2.keySet).foldLeft(Map[String, Set[String]]()) { (result, k) =>
    result + (k -> (map1.getOrElse(k, Set[String]()) ++ map2.getOrElse(k, Set[String]())))
  }
Now, you can use aggregate in parallel:
list.par.aggregate( Map[String,Set[String]]() )( seqOp, combOp )
//Returns: Map(one -> Set(i, 1), two -> Set(2, ii), four -> Set(iv))
Applying the method par to list uses the parallel collection (scala.collection.parallel.immutable.ParSeq) backing the list, so as to really take advantage of multi-core processors. Without par there won't be any performance gain, since the aggregate is not done on a parallel collection.
aggregate is like foldLeft but may be executed in parallel.
As missingfactor says, the linear version of aggregate(z)(seqop, combop) is equivalent to foldLeft(z)(seqop). This is however impractical in the parallel case, where we not only need to combine the next element with the previous result (as in a normal fold) but want to split the iterable into sub-iterables on which we call aggregate and then need to combine those again (in left-to-right order, but not associatively, as we might have combined the last parts before the first parts of the iterable). This re-combining is in general non-trivial, and therefore one needs a method (S, S) => S to accomplish that.
The definition in ParIterableLike is:
def aggregate[S](z: S)(seqop: (S, T) => S, combop: (S, S) => S): S = {
  executeAndWaitResult(new Aggregate(z, seqop, combop, splitter))
}
which indeed uses combop.
For reference, Aggregate is defined as:
protected[this] class Aggregate[S](z: S, seqop: (S, T) => S, combop: (S, S) => S, protected[this] val pit: IterableSplitter[T])
  extends Accessor[S, Aggregate[S]] {
  @volatile var result: S = null.asInstanceOf[S]
  def leaf(prevr: Option[S]) = result = pit.foldLeft(z)(seqop)
  protected[this] def newSubtask(p: IterableSplitter[T]) = new Aggregate(z, seqop, combop, p)
  override def merge(that: Aggregate[S]) = result = combop(result, that.result)
}
The important part is merge where combop is applied with two sub-results.
Here is a blog post on how aggregate enables performance gains on multi-core processors, with benchmarks.
http://markusjais.com/scalas-parallel-collections-and-the-aggregate-method/
Here is a video of the "Scala Parallel Collections" talk from Scala Days 2011.
http://days2011.scala-lang.org/node/138/272
The description of the video:
Scala Parallel Collections
Aleksandar Prokopec
Parallel programming abstractions become increasingly important as the number of processor cores grows. A high-level programming model enables the programmer to focus more on the program and less on low-level details such as synchronization and load-balancing. Scala parallel collections extend the programming model of the Scala collection framework, providing parallel operations on datasets.
The talk will describe the architecture of the parallel collection framework, explaining their implementation and design decisions. Concrete collection implementations such as parallel hash maps and parallel hash tries will be described. Finally, several example applications will be shown, demonstrating the programming model in practice.
The definition of aggregate in TraversableOnce source is:
def aggregate[B](z: B)(seqop: (B, A) => B, combop: (B, B) => B): B =
  foldLeft(z)(seqop)
which is no different than a simple foldLeft. combop doesn't seem to be used anywhere. I am myself confused as to what the purpose of this method is.
Just to clarify the explanations of those before me: in theory the idea is that
aggregate should work like this (I have changed the names of the parameters to make them clearer):
Seq(1, 2, 3, 4).aggregate(0)(
  addToPrev = (prev, curr) => prev + curr,
  combineSums = (sumA, sumB) => sumA + sumB)
Should logically translate to
Seq(1, 2, 3, 4)
  .grouped(2) // split into groups of 2 members each
  .map(prevAndCurrList => prevAndCurrList(0) + prevAndCurrList(1))
  .foldLeft(0)((sumA, sumB) => sumA + sumB)
Because the aggregation and mapping are separate, the original list could theoretically be split into different groups of different sizes and run in parallel, or even on different machines.
In practice, Scala's current implementation does not support this feature by default, but you can do this in your own code.
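A rough sketch of what doing it in your own code could look like (my own code, not the library's implementation): split the data into chunks, fold each chunk with the element-level function, and combine the partial results with the combining function.
def chunkedAggregate[A, B](xs: Seq[A], chunkSize: Int)(z: B)(
    seqop: (B, A) => B, combop: (B, B) => B): B =
  xs.grouped(chunkSize)
    .map(chunk => chunk.foldLeft(z)(seqop)) // each chunk could, in principle, run in parallel
    .foldLeft(z)(combop)

chunkedAggregate(Seq(1, 2, 3, 4), 2)(0)(_ + _, _ + _) // 10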