Explanation of Scala's aggregate function

I don't yet understand the aggregate function:
For example, having:
val x = List(1,2,3,4,5,6)
val y = x.par.aggregate((0, 0))((x, y) => (x._1 + y, x._2 + 1), (x,y) => (x._1 + y._1, x._2 + y._2))
The result will be: (21,6)
Well, I think that (x,y) => (x._1 + y._1, x._2 + y._2) is what combines the results computed in parallel, for example it will be (1 + 2, 1 + 1) and so on.
But exactly this part that leaves me confused:
(x, y) => (x._1 + y, x._2 + 1)
Why x._1 + y? And is x._2 0 here?
Thanks in advance.

First of all, thanks to Diego's reply, which helped me connect the dots in understanding the aggregate() function.
Let me confess that I couldn't sleep properly last night because I couldn't work out how aggregate() works internally. I'll definitely sleep well tonight :-)
Let's start understanding it:
val result = List(1,2,3,4,5,6,7,8,9,10).par.aggregate((0, 0))(
  (x, y) => (x._1 + y, x._2 + 1),
  (x, y) => (x._1 + y._1, x._2 + y._2)
)
result: (Int, Int) = (55,10)
The aggregate function has 3 parts:
Initial value of the accumulator: the tuple (0, 0) here
seqOp: works like foldLeft with that initial accumulator value
combOp: combines the results generated by parallel execution (this part was difficult for me to understand)
Let's understand all 3 parts independently:
part-1: the initial tuple (0, 0)
aggregate() starts with the initial accumulator value x, which is (0, 0) here. The first tuple element x._1, initially 0, accumulates the sum; the second element x._2 counts the total number of elements in the list.
part-2: (x, y) => (x._1 + y, x._2 + 1)
If you know how foldLeft works in Scala, this part should be easy to understand. The function above works just like foldLeft over our List(1,2,3,...,10).
Iteration#    (x._1 + y, x._2 + 1)
1             (0+1, 0+1)
2             (1+2, 1+1)
3             (3+3, 2+1)
4             (6+4, 3+1)
...           ...
10            (45+10, 9+1)
Thus, after all 10 iterations, you get the result (55,10).
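You can verify this part in isolation: since seqOp behaves like foldLeft within a single chunk of work, the sequential equivalent (a sketch, no parallelism involved) is:
List(1,2,3,4,5,6,7,8,9,10).foldLeft((0, 0))((x, y) => (x._1 + y, x._2 + 1))
// res: (Int, Int) = (55,10)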
If you understand this part, the rest is very easy. For me, though, it was the most difficult part to grasp: if all the required computation is already finished, then what is the use of the second part, i.e. combOp? Stay tuned :-)
part-3: (x, y) => (x._1 + y._1, x._2 + y._2)
Well, this 3rd part is combOp, which combines the results generated by different threads during parallel execution. Remember we used 'par' in our code to enable parallel computation of the list:
List(1,2,3,4,5,6,7,8,9,10).par.aggregate(....)
Apache Spark effectively uses this aggregate pattern to do parallel computation over an RDD.
Let's assume that our List(1,2,3,4,5,6,7,8,9,10) is being computed by 3 threads in parallel. Each thread works on a partial list, and then our aggregate() combOp combines the result of each thread's computation using the code below:
(x,y) =>(x._1 + y._1, x._2 + y._2)
Original list : List(1,2,3,4,5,6,7,8,9,10)
Thread-1 starts computing on the partial list (1,2,3,4), Thread-2 computes (5,6,7,8), and Thread-3 computes (9,10).
At the end of the computation, Thread-1's result will be (10,4), Thread-2's will be (26,4), and Thread-3's will be (19,2).
At the end of parallel computation, we'll have ((10,4),(26,4),(19,2))
Iteration#    (x._1 + y._1, x._2 + y._2)
1             (0+10, 0+4)
2             (10+26, 4+4)
3             (36+19, 8+2)
which is (55,10).
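You can simulate this 3-thread scenario sequentially to check the arithmetic (a sketch; the real scheduler may split the list differently):
val partials = List(List(1,2,3,4), List(5,6,7,8), List(9,10))
  .map(_.foldLeft((0, 0))((x, y) => (x._1 + y, x._2 + 1)))
// partials: List((10,4), (26,4), (19,2))
partials.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
// res: (Int, Int) = (55,10)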
Finally, let me reiterate that seqOp's job is to compute the sum of all the elements of the list and the total number of elements, whereas combOp's job is to combine the partial results generated during parallel execution.
I hope the above explanation helps you understand aggregate().

From the documentation:
def aggregate[B](z: ⇒ B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
Aggregates the results of applying an operator to subsequent elements.
This is a more general form of fold and reduce. It has similar
semantics, but does not require the result to be a supertype of the
element type. It traverses the elements in different partitions
sequentially, using seqop to update the result, and then applies
combop to results from different partitions. The implementation of
this operation may operate on an arbitrary number of collection
partitions, so combop may be invoked an arbitrary number of times.
For example, one might want to process some elements and then produce
a Set. In this case, seqop would process an element and append it to
the list, while combop would concatenate two lists from different
partitions together. The initial value z would be an empty set.
pc.aggregate(Set[Int]())(_ += process(_), _ ++ _)
Another example is calculating the geometric mean from a collection of doubles (one would typically require big doubles for this).
B - the type of accumulated results
z - the initial value for the accumulated result of the partition; this will typically be the neutral element for the seqop operator (e.g. Nil for list concatenation or 0 for summation) and may be evaluated more than once
seqop - an operator used to accumulate results within a partition
combop - an associative operator used to combine results from different partitions
In your example B is a Tuple2[Int, Int]. The method seqop then takes a single element from the list, scoped as y, and updates the aggregate B to (x._1 + y, x._2 + 1). So it increments the second element in the tuple. This effectively puts the sum of elements into the first element of the tuple and the number of elements into the second element of the tuple.
The method combop then takes the results from each parallel execution thread and combines them. Combination by addition provides the same results as if it were run on the list sequentially.
Using B as a tuple is likely the confusing piece of this. You can break the problem down into two subproblems to get a better idea of what this is doing. res0 is the first element in the result tuple, and res1 is the second element in the result tuple.
// Sums all elements in parallel.
scala> x.par.aggregate(0)((x, y) => x + y, (x, y) => x + y)
res0: Int = 21
// Counts all elements in parallel.
scala> x.par.aggregate(0)((x, y) => x + 1, (x, y) => x + y)
res1: Int = 6

aggregate takes 3 parameters: a seed value, a computation function and a combination function.
What it does is basically split the collection across a number of threads, compute partial results using the computation function, and then combine all these partial results using the combination function.
From what I can tell, your example function will return a pair (a, b) where a is the sum of the values in the list and b is the number of values in the list. Indeed, (21, 6).
How does this work? The seed value is the (0,0) pair. For an empty list, we have a sum of 0 and a number of items 0, so this is correct.
Your computation function takes an (Int, Int) pair x, which is your partial result, and a Int y, which is the next value in the list. This is your:
(x, y) => (x._1 + y, x._2 + 1)
Indeed, the result that we want is to increase the left element of x (the accumulator) by y, and the right element of x (the counter) by 1 for each y.
Your combination function takes an (Int, Int) pair x and an (Int, Int) pair y, which are your two partial results from different parallel computations, and combines them together as:
(x,y) => (x._1 + y._1, x._2 + y._2)
Indeed, we sum independently the left parts of the pairs and right parts of the pairs.
Your confusion comes from the fact that x and y in the first function ARE NOT the same x and y of the second function. In the first function, you have x of the type of the seed value, and y of the type of the collection elements, and you return a result of the type of x. In the second function, your two parameters are both of the same type of your seed value.
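To make the difference concrete, here is the same call with both functions pulled out and explicitly typed (a sketch):
val seqop: ((Int, Int), Int) => (Int, Int) = (acc, elem) => (acc._1 + elem, acc._2 + 1)
val combop: ((Int, Int), (Int, Int)) => (Int, Int) = (a, b) => (a._1 + b._1, a._2 + b._2)
List(1,2,3,4,5,6).par.aggregate((0, 0))(seqop, combop)
// res: (Int, Int) = (21,6)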
Hope it's clearer now!

Adding to Rashmit's answer:
combOp is called only if the collection is processed in parallel mode.
See the example below:
import scala.collection.parallel.ParSeq

val listP: ParSeq[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).par
val aggregateOp1 = listP.aggregate[String]("Aggregate-")((a, b) => a + b, (s1, s2) => {
  println("Combiner called if the collection is processed in parallel mode")
  s1 + "," + s2
})
println(aggregateOp1)
Output: Aggregate-1,Aggregate-2,Aggregate-3,Aggregate-45,Aggregate-6,Aggregate-7,Aggregate-8,Aggregate-910
val list: Seq[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val aggregateOp2 = list.aggregate[String]("Aggregate-")((a, b) => a + b, (s1, s2) => {
  println("Combiner called if the collection is processed in parallel mode")
  s1 + "," + s2
})
println(aggregateOp2)
Output: Aggregate-12345678910
In the above example, the combiner operation is called only when the collection is processed in parallel.

def aggregate[B](z: ⇒ B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
Breaking that down a little:
aggregate(accumulator)(accumulator+first_elem_of_list, (seq1,seq2)=>seq1+seq2)
Now looking at the example:
val x = List(1,2,3,4,5,6)
val y = x.par.aggregate((0, 0))((x, y) => (x._1 + y, x._2 + 1), (x,y) => (x._1 + y._1, x._2 + y._2))
Here:
Accumulator is (0,0)
Defined list is x
First elem of x is 1
So on each iteration, we take the accumulator, add the current element of x to position 1 of the accumulator to get the sum, and increase position 2 of the accumulator by 1 to get the count (y is the current element of the list).
(x, y) => (x._1 + y, x._2 + 1)
Now, since this is a parallel implementation, the first portion gives rise to a list of tuples like (3,2), (7,2) and (11,2), where index 1 = sum and index 2 = count of elements used to generate that sum. Now the second portion comes into play: the partial results are added together in a reduce fashion.
(x,y) =>(x._1 + y._1, x._2 + y._2)
Rewriting with more meaningful variable names:
val arr = Array(1,2,3,4,5,6)
arr.par.aggregate((0, 0))(
  (accumulator, list_elem) => (accumulator._1 + list_elem, accumulator._2 + 1),
  (seq1, seq2) => (seq1._1 + seq2._1, seq1._2 + seq2._2))


How does this use of aggregate work in Scala?

I have been reading a Spark book and this example is from the book:
val input = List(1,2,3,4,5,6)
val result = input.aggregate((0, 0))(
(acc, value) => (acc._1 + value, acc._2 + 1),
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
val avg = result._1 / result._2.toDouble
I am trying to understand how this works and what _1 and _2 are at each step.
(0,0) is the seed value or initial value
This list gets split into separate RDDs;
let's say rdd1 contains List(1,2).
Loop through this list:
(acc, value)
acc = ??? during each iteration of the loop
value = ??? during each iteration of the loop
(acc, value) => (acc._1 + value, acc._2 + 1)
During the first iteration of List(1,2), what are the values of acc._1, acc._2 and value?
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
acc1 (for 1,2) is something like (3,2) and acc2 (for 3,4) is (7,2),
and this function adds 3+7 and 2+2 = (10,4) and adds this value to the next group.
Dear kind-hearted helpers,
please do not use Scala jargon; I have already read the documentation and did not understand it, hence I came here for help.
For List(1,2), what will be the values of acc._1 and acc._2 during the first iteration, what is the value of 'value' during that iteration, and what are their values during the second iteration?
The first parameter list of aggregate takes an initial value, which in this example is the tuple (0, 0); the next parameter is seqop, which is a function (B, A) => B - in your example ((Int, Int), Int) => (Int, Int).
What happens here is that this function is applied to every element of the list, one by one. The tuple holds the running sum on the left side and the count of elements processed so far on the right side. The result of the aggregation is (21, 6).
A side note: the TraversableOnce implementation in Scala doesn't actually use the combop parameter, which in this example is (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2), so you can just ignore it in this case. If you are familiar with Scala, the code that gets executed is:
input.foldLeft((0, 0))((acc, value) => (acc._1 + value, acc._2 + 1))
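To answer the iteration question directly, here is a sketch that prints acc and value at each step for List(1, 2):
List(1, 2).foldLeft((0, 0)) { (acc, value) =>
  println(s"acc = $acc, value = $value")
  (acc._1 + value, acc._2 + 1)
}
// acc = (0,0), value = 1
// acc = (1,1), value = 2
// res: (Int, Int) = (3,2)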
aggregate works by taking in two functions, one which combines values within a partition and one which combines partitions.
The first function (the one for a single partition) could be more clearly written as
((sum, count), value) => (sum + value, count + 1)
The second function (to combine partitions) could be written as
((partition1Sum, partition1Count), (partition2Sum, partition2Count)) =>
(partition1Sum + partition2Sum, partition1Count + partition2Count)
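In runnable Scala, those two functions can be written with pattern-matching function literals (a sketch):
val seqOp: ((Int, Int), Int) => (Int, Int) = {
  case ((sum, count), value) => (sum + value, count + 1)
}
val combOp: ((Int, Int), (Int, Int)) => (Int, Int) = {
  case ((sum1, count1), (sum2, count2)) => (sum1 + sum2, count1 + count2)
}
List(1, 2, 3, 4).par.aggregate((0, 0))(seqOp, combOp)
// res: (Int, Int) = (10,4)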
Note on tuple notation:
In Scala (a, b, c)._1 == a, (a, b, c)._2 == b and so on. _n gives you the nth element of the tuple.

scala parallel collections not consistent

I am getting inconsistent answers from the following code which I find odd.
import scala.math.pow
val p = 2
val a = Array(1,2,3)
println(a.par
  .aggregate("0")((x, y) => s"$y pow $p; ", (x, y) => x + y))
for (i <- 1 to 100) {
  println(a.par
    .aggregate(0.0)((x, y) => pow(y, p), (x, y) => x + y) == 14)
}
a.map(x => pow(x,p)).sum
In this code, the a.par ... expression computes 14 or 10. Can anyone explain why it computes inconsistently?
In your "seqop" function, that is the first function you pass to aggregate, you define the logic that is used to combine elements within the same partition. Your function looks like this:
(x, y) => pow(y, p)
The problem is that you don't accumulate the results of a partition. Instead, you throw away your accumulator x. Every time you get 10 as the result, the calculation 2^2 was dropped (14 - 4 = 10).
If you change your function to take the accumulated value into account, you will get 14 every time:
(x, y) => x + pow(y, p)
The correct way to use aggregate is
a.par.aggregate(0.0)(
(acc, value) => acc + pow(value, 2), (acc1, acc2) => acc1 + acc2
)
By using (x, y) => pow(y, 2), you did not add the item to the accumulator; you simply replaced the accumulator with pow(y, 2).
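A sketch that makes the bug visible side by side (output of the broken version varies from run to run on a multi-core machine):
import scala.math.pow
val a = Array(1, 2, 3)
// Broken: the accumulator x is discarded, so the result depends on how the work is split.
for (_ <- 1 to 5) println(a.par.aggregate(0.0)((x, y) => pow(y, 2), (x, y) => x + y))
// Fixed: accumulate into x.
a.par.aggregate(0.0)((x, y) => x + pow(y, 2), (x, y) => x + y) // always 14.0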

RDD Aggregate in spark

I am an Apache Spark learner and have come across the RDD action aggregate, which I have no clue how it functions. Can someone spell out and explain in detail, step by step, how we arrive at the result below for the code here?
RDD input = {1,2,3,3}
RDD aggregate function:
rdd.aggregate((0, 0))(
  (x, y) => (x._1 + y, x._2 + 1),
  (x, y) => (x._1 + y._1, x._2 + y._2))
Output: (9,4)
Thanks
If you are not sure what is going on, it is best to follow the types. Omitting the implicit ClassTag for brevity, we start with something like this:
abstract class RDD[T] extends Serializable with Logging
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U): U
If you ignore all the additional parameters, you'll see that aggregate is a function which maps from RDD[T] to U. This means that the type of the values in the input RDD doesn't have to be the same as the type of the output value. So it is clearly different from, for example, reduce:
def reduce(func: (T, T) ⇒ T): T
or fold:
def fold(zeroValue: T)(op: (T, T) => T): T
The same as fold, aggregate requires a zeroValue. How to choose it? It should be an identity (neutral) element with respect to combOp.
You also have to provide two functions:
seqOp which maps from (U, T) to U
combOp which maps from (U, U) to U
Based on these signatures alone, you should already see that only seqOp may access the raw data. It takes some value of type U and another of type T, and returns a value of type U. In your case it is a function with the following signature:
((Int, Int), Int) => (Int, Int)
At this point you probably suspect it is used for some kind of fold-like operation.
The second function takes two arguments of type U and returns a value of type U. As stated before, it should be clear that it doesn't touch the original data and can operate only on the values already processed by seqOp. In your case this function has the following signature:
((Int, Int), (Int, Int)) => (Int, Int)
So how can we get all of that together?
First, each partition is aggregated using the standard Iterator.aggregate, with zeroValue, seqOp and combOp passed as z, seqop and combop respectively. Since the InterruptibleIterator used internally doesn't override aggregate, this is executed as a simple foldLeft(zeroValue)(seqOp).
Next, the partial results collected from each partition are aggregated using combOp.
Let's assume that the input RDD has three partitions with the following distribution of values:
Iterator(1, 2)
Iterator(3, 3)
Iterator()
You can expect that execution, ignoring absolute order, will be equivalent to something like this:
val seqOp = (x: (Int, Int), y: Int) => (x._1 + y, x._2 + 1)
val combOp = (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2)
Seq(Iterator(1, 2), Iterator(3, 3), Iterator())
.map(_.foldLeft((0, 0))(seqOp))
.reduce(combOp)
foldLeft for a single partition can look like this:
Iterator(1, 2).foldLeft((0, 0))(seqOp)
Iterator(2).foldLeft((1, 1))(seqOp)
(3, 2)
and over all partitions
Seq((3,2), (6,2), (0,0))
which, combined, gives you the observed result:
(3 + 6 + 0, 2 + 2 + 0)
(9, 4)
In general, this is a common pattern you will find all over Spark: you pass a neutral value, a function used to process values per partition, and a function used to merge partial aggregates from different partitions (see the aggregateByKey sketch after this list). Some other examples include:
aggregateByKey
User Defined Aggregate Functions
Aggregators on Spark Datasets.
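For example, aggregateByKey follows the same zeroValue/seqOp/combOp shape, here computing a per-key (sum, count) pair (a sketch, assuming an existing SparkContext named sc):
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
val sumCount = pairs.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // merge a value into the per-key accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2))   // merge per-partition accumulators
// sumCount.collect(): Array((a,(3,2)), (b,(3,1)))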
Here is my understanding, for your reference:
Imagine you have two nodes; one takes the input of the first two list elements {1,2}, and the other takes {3,3}. (The partitioning here is just for convenience.)
At the first node:
"(x, y) => (x._1 + y, x._2 + 1)" , the first x is (0,0) as given, and y is your first element 1, and you will have output (0+1, 0+1), then comes your second element y=2, and output (1 + 2, 1 + 1), which is (3, 2)
At the second node, same procedure happens in parallel, and you'll have (6, 2).
"(x, y) => (x._1 + y._1, x._2 + y._2)", tells you to merge two nodes, and you'll get (9,4)
One thing worth noticing: the zero value (0,0) is actually folded into the result once per partition, plus once more when the partial results are merged. In this run the RDD had 4 partitions, so a non-neutral zero value gets counted 5 times:
scala> rdd.aggregate((1,1))((x, y) => (x._1 + y, x._2 + 1), (x, y) => (x._1 + y._1, x._2 + y._2))
res1: (Int, Int) = (14,9)
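A quick way to check this yourself (a sketch; numSlices fixes the partition count so the arithmetic is reproducible):
val rdd = sc.parallelize(Seq(1, 2, 3, 3), numSlices = 4)
rdd.aggregate((1, 1))((x, y) => (x._1 + y, x._2 + 1), (x, y) => (x._1 + y._1, x._2 + y._2))
// (14,9): the zero (1,1) is folded in once per partition (4x) plus once in the final merge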

How does aggregate work in scala?

I know how a normal aggregate works in Scala and its use over fold. I tried hard to understand how the code below works, but couldn't. Could someone explain how it works and why it gives an output of (10,4)?
val input=List(1,2,3,4)
val result = input.aggregate((0, 0))(
(acc, value) => (acc._1 + value, acc._2 + 1),
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
Could someone explain how it works and why it gives an output of (10,4)?
When using aggregate, you provide three parameters:
the initial value from which you accumulate elements within a partition; often it's the neutral element
a function that, given a partition, accumulates a result within it
a function that combines the results of two partitions
So in your case, the initial value for a partition is the tuple (0, 0).
Then the accumulator function you defined will sum the current element you're traversing with the first element of the tuple and increment the second element of the tuple by one. In fact, it will compute the sum of the elements in a partition and its number of elements.
The combiner function combines two tuples. As you defined it, it sums the sums and the counts of two partitions. It's not used in your case because you traverse the pipeline sequentially. You could call .par on the List to get a parallel implementation and see the combiner in action (note that it has to be an associative function).
Thus you get (10, 4), because 1+2+3+4=10 and there were 4 elements in the list (you did 4 additions).
You could add a print statement in the accumulator function (running on a sequential input), to see how it behaves:
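For example (a sketch of the instrumented accumulator; the lines below are what it prints):
input.aggregate((0, 0))(
  (acc, value) => { println(s"Acc: $acc - value:$value"); (acc._1 + value, acc._2 + 1) },
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))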
Acc: (0,0) - value:1
Acc: (1,1) - value:2
Acc: (3,2) - value:3
Acc: (6,3) - value:4
I know how a normal aggregate works in Scala and its use over fold.
For a sequential input, aggregate is a foldLeft:
def aggregate[B](z: =>B)(seqop: (B, A) => B, combop: (B, B) => B): B = foldLeft(z)(seqop)
For a parallel input, the list is split into chunks so that multiple threads can work separately. The accumulator function is run on each chunk, using the initial value. When two threads need to merge their results, the combine function is used:
def aggregate[S](z: =>S)(seqop: (S, T) => S, combop: (S, S) => S): S = {
  tasksupport.executeAndWaitResult(new Aggregate(() => z, seqop, combop, splitter))
}
This is the principle of the fork-join model, but it requires that your task parallelizes well. That's the case here, because one thread does not need to know the result of another thread to do its job.

Scala foldLeft too many parameters

I have a list of tuples called item; each element of the list is a tuple of 2 Doubles, e.g.:
val item = List((1.0, 2.0), (3.0, 4.0), (10.0, 100.0))
I want to perform a calculation on each index within the list item and I'm trying to do it with foldLeft. This is my code:
item.foldLeft(0.0)(_ + myMethod(_._2, _._1, item.size))
_._2 accesses the second element of the current tuple and _._1 accesses the first element. E.g. for the first fold step it should effectively be:
item.foldLeft(0.0)(_ + myMethod(2.0, 1.0, item.size))
The Second Fold:
item.foldLeft(0.0)(_ + myMethod(4.0, 3.0, item.size))
The Third Fold:
item.foldLeft(0.0)(_ + myMethod(100.0, 10.0, item.size))
where myMethod:
def myMethod(i: Double, j: Double, size: Integer): Double = {
  (j - i) / size
}
It is giving me an error which says that there are too many parameters for foldLeft as it requires 2 parameters.
myMethod returns a Double, and _ is a Double. So, where is this extra parameter the compiler is seeing?
If I do this:
item.foldLeft(0.0)(_ + _._1)
It sums up all the first Doubles in each index of item - replacing _._1 with _._2 sums up all the second Doubles in each index of item.
Any help is greatly appreciated!
Each _ is equivalent to a new argument, so (_ + myMethod(_._2, _._1, item.size)) is an anonymous function with 3 arguments: (x, y, z) => x + myMethod(y._2, z._1, item.size).
What you want is (acc, x) => acc + myMethod(x._2, x._1, item.size).
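Putting it together, a runnable version of the whole fold (a sketch using the question's myMethod):
val item = List((1.0, 2.0), (3.0, 4.0), (10.0, 100.0))
def myMethod(i: Double, j: Double, size: Int): Double = (j - i) / size
// acc is the running Double total; x is the current (Double, Double) tuple.
item.foldLeft(0.0)((acc, x) => acc + myMethod(x._2, x._1, item.size))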