How does aggregate work in scala? - scala

I knew how a normal aggregate works in scala and its use over fold. Tried a lot to know how the below code works, but couldn't. Could someone help me in explaining how it works and gives me a output of (10,4)
val input=List(1,2,3,4)
val result = input.aggregate((0, 0))(
(acc, value) => (acc._1 + value, acc._2 + 1),
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))

Could someone help me in explaining how it works and gives me a output
of (10,4)
When using aggregate, you provide three parameters:
the initial value from which you accumulate elements from a partition, often it's the neutral element
a function that given a partition, will accumulate the result within it
a function that will combine two partitions
So in your case, the initial value for a partition is the tuple (0, 0).
Then the accumulator function you defined will sum the current element you're traversing with the first element of the tuple and increment the second element of the tuple by one. In fact, it will compute the sum of the elements in a partition and its number of elements.
The combiner function combined two tuples. As you defined it, it will sum the sums and count the number of elements of 2 partitions. It's not used in your case because you traverse the pipeline sequentially. You could call .par on the List so that you get a parallel implementation to see the combiner in action (note that it has to be an associative function).
Thus you get (10, 4) because 1+2+3+4=10 and there was 4 elements in the list (you did 4 additions).
You could add a print statement in the accumulator function (running on a sequential input), to see how it behaves:
Acc: (0,0) - value:1
Acc: (1,1) - value:2
Acc: (3,2) - value:3
Acc: (6,3) - value:4
I knew how a normal aggregate works in scala and its use over fold.
For a sequential input, aggregate is a foldLeft:
def aggregate[B](z: =>B)(seqop: (B, A) => B, combop: (B, B) => B): B = foldLeft(z)(seqop)
For a parallel input, the list is split into chunks so that multiple threads can work separately. The accumulator function is run on each chunk, using the initial value. When two threads need to merge their results, the combine function is used:
def aggregate[S](z: =>S)(seqop: (S, T) => S, combop: (S, S) => S): S = {
tasksupport.executeAndWaitResult(new Aggregate(() => z, seqop, combop, splitter))
}
This is the principle of the fork-join model but it requires that your task can be parallelizable well. It's the case here, because a thread does not need to know the result of another thread to do its job.

Related

Why does scala reduce() and reduceLeft() reduce the sequence of the values in the same way?

If I run:
List(1, 4, 3, 9).reduce {(a, b) => println(s"$a + $b"); a + b}
The result is:
1 + 4
5 + 3
8 + 9
However, If I use reduceLeft instead of reduce, it also prints:
1 + 4
5 + 3
8 + 9
I have thought that reduce reduces the sequence of values in this way:
(1+4) + (3+9)
(5) + (12)
(17)
What is the real difference between reduceLeft and reduce?
Take a look at the type of the two functions:
def reduce[B >: A](op: (B, B) => B): B
def reduceLeft[B >: A](op: (B, A) => B): B
Notice, how the second argument to op in reduceLeft is A (same as the element type), while in reduce it is B (same as the return value).
This is an important difference. The operation in reduce must be associative, meaning that you can split the list into several segments, apply operation to each of them separately, and then apply it again to the results to combine.
This is important when you want to do things in parallel: reduce can be applied in parallel to different parts of the list, and then the results can be combined.
reduceLeft on the other hand, can only work sequentially (left to right, as the name implies), operation does not have to be associative, and can expect its second parameter to always be an element of the sequence (and the first one – result of the operation so far);
Consider
val sum = (1 to 100).reduce(_ + _)
This yields the same result as (1 to 100).reduceLeft(_ + _), except that the former can be parallelized.
On the other hand:
val strings = (1 to 100).reduceLeft[Any](_.toString + ";" + _.toString)
Should not be written as reduce (even though, in this contrived example, it can, and will even work as long as you stick with serial processing), because the result depends on the order in which elements are fed to the operation.
The difference is:
reduce is allowed to reduce in any order; for a List it happens to be the same as reduceLeft because it's more efficient (and it's the default implementation for IterableOnceOps subtypes, so most of them will be the same). For another collection it could be equivalent to reduceRight, and for a tree type it may be more as you expected.
different type signatures:
reduce[B >: A](op: (B, B) => B): B
versus
reduceLeft[B >: A](op: (B, A) => B): B
because the first parameter to op in reduceLeft is always an element of the collection it's called on, but for reduce it could be the result of a recursive reduce call.

How reduceLeft works on sequence of Functions returning Future

This might be a naive question and I am sorry for that. I am studying Scala Futures and stumbled on below code:
object Main extends App {
def await[T](f: Future[T]) = Await.result(f, 10.seconds)
def f(n: Int): Future[Int] = Future {n + 1}
def g(n: Int): Future[Int] = Future {n * 2}
def h(n: Int): Future[Int] = Future {n - 1}
def doAllInOrder[T](f: (T => Future[T])*): T => Future[T] = {
f.reduceLeft((a,b) => x => a(x).flatMap(y => b(y)))
}
println(await(doAllInOrder(f, g, h)(10))) // 21
}
I know how reduceLeft works when it is applied on Collections. But in the above example, as I understand, in the first pass of reduceLeft value x i.e. 10 is applied to Function a and the result is applied to Function b using flatMap which eventually return Future[Int] (say, I call it result). In the next pass of reduceLeft the result and Function h has to be used, but here is I am troubled.
The result is actually an already executed Future, but the reduceLeft next pass expects a Function which returns a Future[Int].Then how it is working?
Another thing I am not able to understand how each pass sending its result to next pass of reduceLeft, i.e. how x is getting it's value in subsequent passes.
Though both of my confusions are interrelated and a good explanation may help clear my doubt.
Thanks in advance.
You have to think about reduceLeft to be independent from Future execution. reduceLeft creates a new function by combining two given ones and that's it.
reduceLeft is applied to a Seq of T => Future[T]. So, it's just simple iteration from left to right over sequence of functions taking first and second elements and reducing it to one single value, reducing this value with the 3rd element and so on. Eventually, having just two element left that are reduced to a single one.
The result of reduceLeft has to be of the same type as the type of elements in the collection. In your case, it's function T => Future[T].
Let's understand what (a,b) => x => a(x).flatMap(y => b(y)) is doing
This means the following. Given functions a and b, create a function that combines function a with b. Mathematically it's c(x)=b(a(x)).
Now, a and b are functions returning futures. And futures can be chained with the help of map/flatMap methods.
You should read x => a(x).flatMap(y => b(y)) as
Given an input x, apply function a(x), this results in a Future, when this future is completed, take result y and apply function b(y), this results in a new Future. This is the result of function c.
Note: value x is Int at all times. It is the input parameter for your new reduced function.
If it's still not clear, let's address the points of confusions
The result is actually an already executed Future.
No future is guaranteed to be executed at any point here. map and flatMap are non blocking operation and it applies functions to a Future. The result of this application is still a Future.
Another thing I am not able to understand how each pass sending its
result to next pass of reduceLeft, i.e. how x is getting it's value in
subsequent passes.
This is easier to understand when just having integers in a collection. Given following code
Seq(1, 2, 5, 10).reduceLeft(_ - _)
It will take 1, 2 and apply - function, this will result in -1. It will then combine -1 and 5 resulting in -6. And finally, -6 with 10 resulting in -16.

Can someone explain this scala aggregate function with two initial values

I am very new to Scala this problem I was trying to solve in spark, which also uses Scala for performing operation on RDD's.
Till now, I have only seen aggregate functions with only one initial value (i,e some-input.aggregate(Initial-value)((acc,value)=>(acc+value))), but this program has two initial values (0,0).
As per my understanding this program is for calculating the running average and keeping track of the count so far.
val result = input.aggregate((0, 0))(
(acc, value) => (acc._1 + value, acc._2 + 1),
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
val avg = result._1 / result._2.toDouble
I know that in foldLeft / aggregate we supply initial values, so that in case of empty collection we get the default value, and both have accumulator and value part.
But in this case, we have two initial values, and accumulator is accessing tuple values. Where is this tuple defined?
Can someone please explain this whole program line by line.
but this program has two initial values (0,0).
They aren't two parameters, they're one Tuple2:
input.aggregate((0, 0))
The value passed to aggregate is surrounded by additional round brackets, (( )), which are used as syntactic sugar for Tuple2.apply. This is where you're seeing the tuple come from.
If you look a the method definition (I'm assuming this is RDD.aggregate), you'll see it takes a single parameter in the first argument list:
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)
(implicit arg0: ClassTag[U]): U

Explanation of the aggregate scala function

I do not get to understand yet the aggregate function:
For example, having:
val x = List(1,2,3,4,5,6)
val y = x.par.aggregate((0, 0))((x, y) => (x._1 + y, x._2 + 1), (x,y) => (x._1 + y._1, x._2 + y._2))
The result will be: (21,6)
Well, I think that (x,y) => (x._1 + y._1, x._2 + y._2) is to get the result in parallel, for example it will be (1 + 2, 1 + 1) and so on.
But exactly this part that leaves me confused:
(x, y) => (x._1 + y, x._2 + 1)
why x._1 + y? and here x._2 is 0?
Thanks in advance.
First of all Thanks to Diego's reply which helped me connect the dots in understanding aggregate() function..
Let me confess that I couldn't sleep last night properly because I couldn't get how aggregate() works internally, I'll get good sleep tonight definitely :-)
Let's start understanding it
val result = List(1,2,3,4,5,6,7,8,9,10).par.aggregate((0, 0))
(
(x, y) => (x._1 + y, x._2 + 1),
(x,y) =>(x._1 + y._1, x._2 + y._2)
)
result: (Int, Int) = (55,10)
aggregate function has 3 parts :
initial value of accumulators : tuple(0,0) here
seqop : It works like foldLeft with initial value of 0
combop : It combines the result generated through parallelization (this part was difficult for me to understand)
Let's understand all 3 parts independently :
part-1 : Initial tuple (0,0)
Aggregate() starts with initial value of accumulators x which is (0,0) here. First tuple x._1 which is initially 0 is used to compute the sum, Second tuple x._2 is used to compute total number of elements in the list.
part-2 : (x, y) => (x._1 + y, x._2 + 1)
If you know how foldLeft works in scala then it should be easy to understand this part. Above function works just like foldLeft on our List(1,2,3,4...10).
Iteration# (x._1 + y, x._2 + 1)
1 (0+1, 0+1)
2 (1+2, 1+1)
3 (3+3, 2+1)
4 (6+4, 3+1)
. ....
. ....
10 (45+10, 9+1)
thus after all 10 iteration you'll get the result (55,10).
If you understand this part the rest is very easy but for me it was the most difficult part in understanding if all the required computation are finished then what is the use of second part i.e. compop - stay tuned :-)
part 3 : (x,y) =>(x._1 + y._1, x._2 + y._2)
Well this 3rd part is combOp which combines the result generated by different threads during parallelization, remember we used 'par' in our code to enable parallel computation of list :
List(1,2,3,4,5,6,7,8,9,10).par.aggregate(....)
Apache spark is effectively using aggregate function to do parallel computation of RDD.
Let's assume that our List(1,2,3,4,5,6,7,8,9,10) is being computed by 3 threads in parallel. Here each thread is working on partial list and then our aggregate() combOp will combine the result of each thread's computation using the below code :
(x,y) =>(x._1 + y._1, x._2 + y._2)
Original list : List(1,2,3,4,5,6,7,8,9,10)
Thread1 start computing on partial list say (1,2,3,4), Thread2 computes (5,6,7,8) and Thread3 computes partial list say (9,10)
At the end of computation, Thread-1 result will be (10,4), Thread-2 result will be (26,4) and Thread-3 result will be (19,2).
At the end of parallel computation, we'll have ((10,4),(26,4),(19,2))
Iteration# (x._1 + y._1, x._2 + y._2)
1 (0+10, 0+4)
2 (10+26, 4+4)
3 (36+19, 8+2)
which is (55,10).
Finally let me re-iterate that seqOp job is to compute the sum of all the elements of list and total number of list whereas combine function's job is to combine different partial result generated during parallelization.
I hope above explanation help you understand the aggregate().
From the documentation:
def aggregate[B](z: ⇒ B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
Aggregates the results of applying an operator to subsequent elements.
This is a more general form of fold and reduce. It has similar
semantics, but does not require the result to be a supertype of the
element type. It traverses the elements in different partitions
sequentially, using seqop to update the result, and then applies
combop to results from different partitions. The implementation of
this operation may operate on an arbitrary number of collection
partitions, so combop may be invoked an arbitrary number of times.
For example, one might want to process some elements and then produce
a Set. In this case, seqop would process an element and append it to
the list, while combop would concatenate two lists from different
partitions together. The initial value z would be an empty set.
pc.aggregate(Set[Int]())(_ += process(_), _ ++ _)
Another example is
calculating geometric mean from a collection of doubles (one would
typically require big doubles for this). B the type of accumulated
results z the initial value for the accumulated result of the
partition - this will typically be the neutral element for the seqop
operator (e.g. Nil for list concatenation or 0 for summation) and may
be evaluated more than once seqop an operator used to accumulate
results within a partition combop an associative operator used to
combine results from different partitions
In your example B is a Tuple2[Int, Int]. The method seqop then takes a single element from the list, scoped as y, and updates the aggregate B to (x._1 + y, x._2 + 1). So it increments the second element in the tuple. This effectively puts the sum of elements into the first element of the tuple and the number of elements into the second element of the tuple.
The method combop then takes the results from each parallel execution thread and combines them. Combination by addition provides the same results as if it were run on the list sequentially.
Using B as a tuple is likely the confusing piece of this. You can break the problem down into two sub problems to get a better idea of what this is doing. res0 is the first element in the result tuple, and res1 is the second element in the result tuple.
// Sums all elements in parallel.
scala> x.par.aggregate(0)((x, y) => x + y, (x, y) => x + y)
res0: Int = 21
// Counts all elements in parallel.
scala> x.par.aggregate(0)((x, y) => x + 1, (x, y) => x + y)
res1: Int = 6
aggregate takes 3 parameters: a seed value, a computation function and a combination function.
What it does is basically split the collection in a number of threads, compute partial results using the computation function and then combine all these partial results using the combination function.
From what I can tell, your example function will return a pair (a, b) where a is the sum of the values in the list, b is the number of values in the list. Indeed, (21, 6).
How does this work? The seed value is the (0,0) pair. For an empty list, we have a sum of 0 and a number of items 0, so this is correct.
Your computation function takes an (Int, Int) pair x, which is your partial result, and a Int y, which is the next value in the list. This is your:
(x, y) => (x._1 + y, x._2 + 1)
Indeed, the result that we want is to increase the left element of x (the accumulator) by y, and the right element of x (the counter) by 1 for each y.
Your combination function takes an (Int, Int) pair x and an (Int, Int) pair y, which are your two partial results from different parallel computations, and combines them together as:
(x,y) => (x._1 + y._1, x._2 + y._2)
Indeed, we sum independently the left parts of the pairs and right parts of the pairs.
Your confusion comes from the fact that x and y in the first function ARE NOT the same x and y of the second function. In the first function, you have x of the type of the seed value, and y of the type of the collection elements, and you return a result of the type of x. In the second function, your two parameters are both of the same type of your seed value.
Hope it's clearer now!
Adding to Rashmit answer.
CombOp is called only if the collection is processed in parallel mode.
See below example :
val listP: ParSeq[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).par
val aggregateOp1 = listP.aggregate[String]("Aggregate-")((a, b) => a + b, (s1, s2) => {
println("Combiner called , if collections is processed parallel mode")
s1 + "," + s2
})
println(aggregateOp1)
OP : Aggregate-1,Aggregate-2,Aggregate-3,Aggregate-45,Aggregate-6,Aggregate-7,Aggregate-8,Aggregate-910
val list: Seq[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val aggregateOp2 = list.aggregate[String]("Aggregate-")((a, b) => a + b, (s1, s2) => {
println("Combiner called , if collections is processed parallel mode")
s1 + "," + s2
})
println(aggregateOp2)
}
OP : Aggregate-12345678910
In above example, combiner operation is called only if collection is operated in parallel
def aggregate[B](z: ⇒ B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
Breaking that down a little :
aggregate(accumulator)(accumulator+first_elem_of_list, (seq1,seq2)=>seq1+seq2)
Now looking at the example:
val x = List(1,2,3,4,5,6)
val y = x.par.aggregate((0, 0))((x, y) => (x._1 + y, x._2 + 1), (x,y) => (x._1 + y._1, x._2 + y._2))
Here:
Accumulator is (0,0)
Defined list is x
First elem of x is 1
So for each iteration, we are taking the accumulator and adding the elements of x to position 1 of the accumulator to get the sum and increasing position 2 of the accumulator by 1 to get the count. (y is the elements of the list)
(x, y) => (x._1 + y, x._2 + 1)
Now, since this is a parallel implementation, the first portion will give rise to a list of tuples like (3,2) (7,2) and (11,2). index 1 = Sum, index 2 = count of elements used to generate sum. Now the second portion comes into play. The elements of each sequence are added in a reduce fashion.
(x,y) =>(x._1 + y._1, x._2 + y._2)
rewriting with more meaningful variables:
val arr = Array(1,2,3,4,5,6)
arr.par.aggregate((0,0))((accumulator,list_elem)=>(accumulator._1+list_elem, accumulator._2+1), (seq1, seq2)=> (seq1._1+seq2._1, seq1._2+seq2._2))

Example of the Scala aggregate function

I have been looking and I cannot find an example or discussion of the aggregate function in Scala that I can understand. It seems pretty powerful.
Can this function be used to reduce the values of tuples to make a multimap-type collection? For example:
val list = Seq(("one", "i"), ("two", "2"), ("two", "ii"), ("one", "1"), ("four", "iv"))
After applying aggregate:
Seq(("one" -> Seq("i","1")), ("two" -> Seq("2", "ii")), ("four" -> Seq("iv"))
Also, can you give example of parameters z, segop, and combop? I'm unclear on what these parameters do.
Let's see if some ascii art doesn't help. Consider the type signature of aggregate:
def aggregate [B] (z: B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
Also, note that A refers to the type of the collection. So, let's say we have 4 elements in this collection, then aggregate might work like this:
z A z A z A z A
\ / \ /seqop\ / \ /
B B B B
\ / combop \ /
B _ _ B
\ combop /
B
Let's see a practical example of that. Say I have a GenSeq("This", "is", "an", "example"), and I want to know how many characters there are in it. I can write the following:
Note the use of par in the below snippet of code. The second function passed to aggregate is what is called after the individual sequences are computed. Scala is only able to do this for sets that can be parallelized.
import scala.collection.GenSeq
val seq = GenSeq("This", "is", "an", "example")
val chars = seq.par.aggregate(0)(_ + _.length, _ + _)
So, first it would compute this:
0 + "This".length // 4
0 + "is".length // 2
0 + "an".length // 2
0 + "example".length // 7
What it does next cannot be predicted (there are more than one way of combining the results), but it might do this (like in the ascii art above):
4 + 2 // 6
2 + 7 // 9
At which point it concludes with
6 + 9 // 15
which gives the final result. Now, this is a bit similar in structure to foldLeft, but it has an additional function (B, B) => B, which fold doesn't have. This function, however, enables it to work in parallel!
Consider, for example, that each of the four computations initial computations are independent of each other, and can be done in parallel. The next two (resulting in 6 and 9) can be started once their computations on which they depend are finished, but these two can also run in parallel.
The 7 computations, parallelized as above, could take as little as the same time 3 serial computations.
Actually, with such a small collection the cost in synchronizing computation would be big enough to wipe out any gains. Furthermore, if you folded this, it would only take 4 computations total. Once your collections get larger, however, you start to see some real gains.
Consider, on the other hand, foldLeft. Because it doesn't have the additional function, it cannot parallelize any computation:
(((0 + "This".length) + "is".length) + "an".length) + "example".length
Each of the inner parenthesis must be computed before the outer one can proceed.
The aggregate function does not do that (except that it is a very general function, and it could be used to do that). You want groupBy. Close to at least. As you start with a Seq[(String, String)], and you group by taking the first item in the tuple (which is (String, String) => String), it would return a Map[String, Seq[(String, String)]). You then have to discard the first parameter in the Seq[String, String)] values.
So
list.groupBy(_._1).mapValues(_.map(_._2))
There you get a Map[String, Seq[(String, String)]. If you want a Seq instead of Map, call toSeq on the result. I don't think you have a guarantee on the order in the resulting Seq though
Aggregate is a more difficult function.
Consider first reduceLeft and reduceRight.
Let as be a non empty sequence as = Seq(a1, ... an) of elements of type A, and f: (A,A) => A be some way to combine two elements of type A into one. I will note it as a binary operator #, a1 # a2 rather than f(a1, a2). as.reduceLeft(#) will compute (((a1 # a2) # a3)... # an). reduceRight will put the parentheses the other way, (a1 # (a2 #... # an)))). If # happens to be associative, one does not care about the parentheses. One could compute it as (a1 #... # ap) # (ap+1 #...#an) (there would be parantheses inside the 2 big parantheses too, but let's not care about that). Then one could do the two parts in parallel, while the nested bracketing in reduceLeft or reduceRight force a fully sequential computation. But parallel computation is only possible when # is known to be associative, and the reduceLeft method cannot know that.
Still, there could be method reduce, whose caller would be responsible for ensuring that the operation is associative. Then reduce would order the calls as it sees fit, possibly doing them in parallel. Indeed, there is such a method.
There is a limitation with the various reduce methods however. The elements of the Seq can only be combined to a result of the same type: # has to be (A,A) => A. But one could have the more general problem of combining them into a B. One starts with a value b of type B, and combine it with every elements of the sequence. The operator # is (B,A) => B, and one computes (((b # a1) # a2) ... # an). foldLeft does that. foldRight does the same thing but starting with an. There, the # operation has no chance to be associative. When one writes b # a1 # a2, it must mean (b # a1) # a2, as (a1 # a2) would be ill-typed. So foldLeft and foldRight have to be sequential.
Suppose however, that each A can be turned into a B, let's write it with !, a! is of type B. Suppose moreover that there is a + operation (B,B) => B, and that # is such that b # a is in fact b + a!. Rather than combining elements with #, one could first transform all of them to B with !, then combine them with +. That would be as.map(!).reduceLeft(+). And if + is associative, then that can be done with reduce, and not be sequential: as.map(!).reduce(+). There could be an hypothetical method as.associativeFold(b, !, +).
Aggregate is very close to that. It may be however, that there is a more efficient way to implement b#a than b+a! For instance, if type B is List[A], and b#a is a::b, then a! will be a::Nil, and b1 + b2 will be b2 ::: b1. a::b is way better than (a::Nil):::b. To benefit from associativity, but still use #, one first splits b + a1! + ... + an!, into (b + a1! + ap!) + (ap+1! + ..+ an!), then go back to using # with (b # a1 # an) + (ap+1! # # an). One still needs the ! on ap+1, because one must start with some b. And the + is still necessary too, appearing between the parantheses. To do that, as.associativeFold(!, +) could be changed to as.optimizedAssociativeFold(b, !, #, +).
Back to +. + is associative, or equivalently, (B, +) is a semigroup. In practice, most of the semigroups used in programming happen to be monoids too, i.e they contain a neutral element z (for zero) in B, so that for each b, z + b = b + z = b. In that case, the ! operation that make sense is likely to be be a! = z # a. Moreover, as z is a neutral element b # a1 ..# an = (b + z) # a1 # an which is b + (z + a1 # an). So is is always possible to start the aggregation with z. If b is wanted instead, you do b + result at the end. With all those hypotheses, we can do as.aggregate(z, #, +). That is what aggregate does. # is the seqop argument (applied in a sequence z # a1 # a2 # ap), and + is combop (applied to already partially combined results, as in (z + a1#...#ap) + (z + ap+1#...#an)).
To sum it up, as.aggregate(z)(seqop, combop) computes the same thing as as.foldLeft(z)( seqop) provided that
(B, combop, z) is a monoid
seqop(b,a) = combop(b, seqop(z,a))
aggregate implementation may use the associativity of combop to group the computations as it likes (not swapping elements however, + has not to be commutative, ::: is not). It may run them in parallel.
Finally, solving the initial problem using aggregate is left as an exercise to the reader. A hint: implement using foldLeft, then find z and combo that will satisfy the conditions stated above.
The signature for a collection with elements of type A is:
def aggregate [B] (z: B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
z is an object of type B acting as a neutral element. If you want to count something, you can use 0, if you want to build a list, start with an empty list, etc.
segop is analoguous to the function you pass to fold methods. It takes two argument, the first one is the same type as the neutral element you passed and represent the stuff which was already aggregated on previous iteration, the second one is the next element of your collection. The result must also by of type B.
combop: is a function combining two results in one.
In most collections, aggregate is implemented in TraversableOnce as:
def aggregate[B](z: B)(seqop: (B, A) => B, combop: (B, B) => B): B
= foldLeft(z)(seqop)
Thus combop is ignored. However, it makes sense for parallel collections, becauseseqop will first be applied locally in parallel, and then combopis called to finish the aggregation.
So for your example, you can try with a fold first:
val seqOp =
(map:Map[String,Set[String]],tuple: (String,String)) =>
map + ( tuple._1 -> ( map.getOrElse( tuple._1, Set[String]() ) + tuple._2 ) )
list.foldLeft( Map[String,Set[String]]() )( seqOp )
// returns: Map(one -> Set(i, 1), two -> Set(2, ii), four -> Set(iv))
Then you have to find a way of collapsing two multimaps:
val combOp = (map1: Map[String,Set[String]], map2: Map[String,Set[String]]) =>
(map1.keySet ++ map2.keySet).foldLeft( Map[String,Set[String]]() ) {
(result,k) =>
result + ( k -> ( map1.getOrElse(k,Set[String]() ) ++ map2.getOrElse(k,Set[String]() ) ) )
}
Now, you can use aggregate in parallel:
list.par.aggregate( Map[String,Set[String]]() )( seqOp, combOp )
//Returns: Map(one -> Set(i, 1), two -> Set(2, ii), four -> Set(iv))
Applying the method "par" to list, thus using the parallel collection(scala.collection.parallel.immutable.ParSeq) of the list to really take advantage of the multi core processors. Without "par", there won't be any performance gain since the aggregate is not done on the parallel collection.
aggregate is like foldLeft but may executed in parallel.
As missingfactor says, the linear version of aggregate(z)(seqop, combop) is equivalent to foldleft(z)(seqop). This is however impractical in the parallel case, where we would need to combine not only the next element with the previous result (as in a normal fold) but we want to split the iterable into sub-iterables on which we call aggregate and need to combine those again. (In left-to-right order but not associative as we might have combined the last parts before the fist parts of the iterable.) This re-combining in in general non-trivial, and therefore, one needs a method (S, S) => S to accomplish that.
The definition in ParIterableLike is:
def aggregate[S](z: S)(seqop: (S, T) => S, combop: (S, S) => S): S = {
executeAndWaitResult(new Aggregate(z, seqop, combop, splitter))
}
which indeed uses combop.
For reference, Aggregate is defined as:
protected[this] class Aggregate[S](z: S, seqop: (S, T) => S, combop: (S, S) => S, protected[this] val pit: IterableSplitter[T])
extends Accessor[S, Aggregate[S]] {
#volatile var result: S = null.asInstanceOf[S]
def leaf(prevr: Option[S]) = result = pit.foldLeft(z)(seqop)
protected[this] def newSubtask(p: IterableSplitter[T]) = new Aggregate(z, seqop, combop, p)
override def merge(that: Aggregate[S]) = result = combop(result, that.result)
}
The important part is merge where combop is applied with two sub-results.
Here is the blog on how aggregate enable performance on the multi cores processor with bench mark.
http://markusjais.com/scalas-parallel-collections-and-the-aggregate-method/
Here is video on "Scala parallel collections" talk from "Scala Days 2011".
http://days2011.scala-lang.org/node/138/272
The description on the video
Scala Parallel Collections
Aleksandar Prokopec
Parallel programming abstractions become increasingly important as the number of processor cores grows. A high-level programming model enables the programmer to focus more on the program and less on low-level details such as synchronization and load-balancing. Scala parallel collections extend the programming model of the Scala collection framework, providing parallel operations on datasets.
The talk will describe the architecture of the parallel collection framework, explaining their implementation and design decisions. Concrete collection implementations such as parallel hash maps and parallel hash tries will be described. Finally, several example applications will be shown, demonstrating the programming model in practice.
The definition of aggregate in TraversableOnce source is:
def aggregate[B](z: B)(seqop: (B, A) => B, combop: (B, B) => B): B =
foldLeft(z)(seqop)
which is no different than a simple foldLeft. combop doesn't seem to be used anywhere. I am myself confused as to what the purpose of this method is.
Just to clarify explanations of those before me, in theory the idea is that
aggregate should work like this, (I have changed the names of the parameters to make them clearer):
Seq(1,2,3,4).aggragate(0)(
addToPrev = (prev,curr) => prev + curr,
combineSums = (sumA,sumB) => sumA + sumB)
Should logically translate to
Seq(1,2,3,4)
.grouped(2) // split into groups of 2 members each
.map(prevAndCurrList => prevAndCurrList(0) + prevAndCurrList(1))
.foldLeft(0)(sumA,sumB => sumA + sumB)
Because the aggregation and mapping are separate, the original list could theoretically be split into different groups of different sizes and run in parallel or even on different machine.
In practice scala current implementation does not support this feature by default but you can do this in your own code.