How does this use of aggregate work in Scala?

I have been reading a spark book and this example is from the book
val input = List(1, 2, 3, 4, 5, 6)
val result = input.aggregate((0, 0))(
(acc, value) => (acc._1 + value, acc._2 + 1),
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
val avg = result._1 / result._2.toDouble
I am trying to understand how this works and what is the _1 and _2 at each step
(0,0) is the seed value or initial value
This list gets split into separate RDDs (partitions).
Let's say rdd1 contains List(1,2).
We loop through this list:
(acc, value)
acc = ??? during each iteration of the loop
value = ??? during each iteration of the loop
(acc, value) => (acc._1 + value, acc._2 + 1)
during the first iteration of List(1,2) what is the value of acc._1 and _2 and value
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
acc1 (for 1,2) is something like (3,2) and acc2 (for 3,4) is (7,2),
and this function adds 3+7 and 2+2 = (10,4), and adds this value to the next group.
Dear kind-hearted helpers,
please do not use Scala jargon; I have already read about it and did not understand it, hence I came here for help.
For a List(1,2), what will be the values of acc._1 and acc._2 during the first iteration of the list, and what is the value of 'value' during that iteration? And during the second iteration, what are their values?

The first parameter list of the aggregate function takes an initial value, which in this example is the tuple (0, 0). The next parameter is seqop, which is a function (B, A) => B; in your example it would be ((Int, Int), Int) => (Int, Int).
What happens here is that this function is applied to every element of the list, one by one. The tuple holds the running sum of the list on the left and the count of elements processed so far on the right. The result of the aggregation is (21, 6).
A side note: the implementation of TraversableOnce in Scala doesn't actually use the combop parameter, which in this example is (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2), so you can just ignore it in this case. If you are familiar with Scala, the code that effectively gets executed is:
input.foldLeft((0, 0))((acc, value) => (acc._1 + value, acc._2 + 1))
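To answer the literal question for a partition holding List(1, 2), here is a small trace of that foldLeft (a minimal sketch; acc._1 is the running sum, acc._2 is the running count, and value is the current element):
// iteration 1: acc = (0, 0), value = 1  =>  (acc._1 + value, acc._2 + 1) = (1, 1)
// iteration 2: acc = (1, 1), value = 2  =>  (acc._1 + value, acc._2 + 1) = (3, 2)
List(1, 2).foldLeft((0, 0))((acc, value) => (acc._1 + value, acc._2 + 1))  // (3, 2)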

aggregate works by taking in two functions, one which combines values within a partition and one which combines partitions.
The first function (the one for a single partition) could be more clearly written as
((sum, count), value) => (sum + value, count + 1)
The second function (to combine partitions) could be written as
((partition1Sum, partition1Count), (partition2Sum, partition2Count)) =>
(partition1Sum + partition2Sum, partition1Count + partition2Count)
Note on tuple notation:
In Scala (a, b, c)._1 == a, (a, b, c)._2 == b and so on. _n gives you the nth element of the tuple.
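Putting the two clearer functions together, here is a sketch of the whole call using pattern-matching names instead of _1/_2 (plain Scala collections, no Spark; aggregate on sequential collections is deprecated in Scala 2.13 in favour of foldLeft, but the shape of the call is the same):
val input = List(1, 2, 3, 4, 5, 6)
val result = input.aggregate((0, 0))(
  { case ((sum, count), value) => (sum + value, count + 1) },                  // within one partition
  { case ((sum1, count1), (sum2, count2)) => (sum1 + sum2, count1 + count2) }  // combine partitions
)
val avg = result._1 / result._2.toDouble  // 21 / 6.0 = 3.5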

Scala underscore notation for map and filter

Say I have the following code:
val a: List[(Int, String)] = List((1,"A"),(2,"B"),(3,"C"))
val b: List[String] = List("A","C","E")
I can do:
a.map{case (fst,snd) => (fst,snd + "a")}
a.filter{case (_,snd) => b.contains(snd)}
But why can't I do:
a.map((_._1,_._2 + "a"))
a.filter(b.contains(_._2))
Is there a way to accomplish this using underscore notation, or am I forced to use the explicit forms here?
For the example:
a.map((_._1,_._2 + "a"))
Each placeholder (i.e. each underscore/_) introduces a new parameter in the argument expression.
To cite the Scala spec
An expression (of syntactic category Expr)
may contain embedded underscore symbols _ at places where identifiers
are legal. Such an expression represents an anonymous function where subsequent
occurrences of underscores denote successive parameters.
[...]
The anonymous functions in the left column use placeholder
syntax. Each of these is equivalent to the anonymous function on its right.
| Placeholder syntax | Anonymous function |
|---------------------------|----------------------------|
|`_ + 1` | `x => x + 1` |
|`_ * _` | `(x1, x2) => x1 * x2` |
|`(_: Int) * 2` | `(x: Int) => (x: Int) * 2` |
|`if (_) x else y` | `z => if (z) x else y` |
|`_.map(f)` | `x => x.map(f)` |
|`_.map(_ + 1)` | `x => x.map(y => y + 1)` |
You'll have to use the expanded forms when you need to use a given parameter more than once. So your example has to be rewritten as:
a.map(x => (x._1, x._2 + "a"))
For the example
a.filter(b.contains(_._2))
The problem is that you are effectively passing in an anonymous function to contains rather than filter, so you won't be able to use underscore notation here either. Instead you'll have to write
a.filter(x => b.contains(x._2))
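A quick sketch demonstrating both points; the non-compiling lines are left as comments, and the results shown are what the working forms produce:
val a: List[(Int, String)] = List((1, "A"), (2, "B"), (3, "C"))
val b: List[String] = List("A", "C", "E")

// a.map((_._1, _._2 + "a"))    // does not compile: the underscores do not expand to the one-parameter function map expects
// a.filter(b.contains(_._2))   // does not compile: the anonymous function is passed to contains, not to filter

a.map(x => (x._1, x._2 + "a"))     // List((1,"Aa"), (2,"Ba"), (3,"Ca"))
a.filter(x => b.contains(x._2))    // List((1,"A"), (3,"C"))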
You can't do
a.map((_._1,_._2 + "a"))
because each _ introduces a separate argument. The first _ would match the elements of a first iterable and the second _ would match the elements of a second iterable, and so on. _._1 would take the first element of the tupled elements of the first iterable, but _._2 would try to take the second element of the tupled elements of a second iterable. As there is no second iterable, the Scala compiler throws a compilation error.
In your second line of code
a.filter(b.contains(_._2))
_._2 tries to get the second element of the tupled elements of b, but b is not an iterable of tuples; b is simply an iterable of String.
To make it work you can do:
a.map(x => (x._1, x._2 + "a"))
a.filter(x => b.contains(x._2))

Sum values of each unique key in Apache Spark RDD

I have an RDD[(String, (Long, Long))] where each element is not unique:
(com.instagram.android,(2,0))
(com.android.contacts,(6,1))
(com.android.contacts,(3,4))
(com.instagram.android,(8,3))
...
So I want to obtain an RDD where each element is the sum of the two values for every unique key:
(com.instagram.android,(10,3))
(com.android.contacts,(9,5))
...
Here is my code:
val appNamesAndPropertiesRdd = appNodesRdd.map({
  case Row(_, appName, totalUsageTime, usageFrequency, _, _, _, _) =>
    (appName, (totalUsageTime, usageFrequency))
})
Use reduceByKey:
val rdd = appNamesAndPropertiesRdd.reduceByKey(
  (acc, elem) => (acc._1 + elem._1, acc._2 + elem._2)
)
reduceByKey uses aggregateByKey, described by SCouto, but offers a more readable usage. For your case, the more advanced features of aggregateByKey, hidden behind the simpler API of reduceByKey, are not necessary.
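A minimal sketch of that reduceByKey on the sample data, assuming an existing SparkContext named sc:
val appNamesAndPropertiesRdd = sc.parallelize(Seq(
  ("com.instagram.android", (2L, 0L)),
  ("com.android.contacts", (6L, 1L)),
  ("com.android.contacts", (3L, 4L)),
  ("com.instagram.android", (8L, 3L))
))

val summed = appNamesAndPropertiesRdd.reduceByKey(
  (acc, elem) => (acc._1 + elem._1, acc._2 + elem._2)
)

summed.collect().foreach(println)
// (com.instagram.android,(10,3))
// (com.android.contacts,(9,5))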
First of all, I don't think that usageFrequency should be simply added up.
Now, let's come to what you want to do: you want to add things up by key. You can do it in either of these ways:
1. Using groupBy and then reducing the groups to sum things up:
val requiredRdd = appNamesAndPropertiesRdd
  .groupBy({ case (an, (tut, uf)) => an })
  .map({
    case (an, iter) => (
      an,
      iter
        .map({ case (_, (tut, uf)) => (tut, uf) })
        .reduce({ case ((tut1, uf1), (tut2, uf2)) => (tut1 + tut2, uf1 + uf2) })
    )
  })
Or by using reduceByKey
val requiredRdd = appNamesAndPropertiesRdd
  .reduceByKey({
    case ((tut1, uf1), (tut2, uf2)) => (tut1 + tut2, uf1 + uf2)
  })
And reduceByKey is a better choice for two reasons:
1. It saves a group operation that is not really required.
2. The groupBy approach can lead to a reshuffle, which will be expensive.
The function aggregateByKey is the best one for this purpose
appNamesAndPropertiesRdd.aggregateByKey((0, 0))(
  (acc, elem) => (acc._1 + elem._1, acc._2 + elem._2),
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
)
Explained here:
aggregateByKey((0, 0)) => This is the zero value, the value everything starts from. In your case, since you want an addition, (0, 0) is the initial value (use (0.0, 0.0) if you want Double instead of Int).
(acc, elem) => (acc._1 + elem._1, acc._2 + elem._2) => The first function, used to accumulate the elements within the same partition. The accumulator holds the partial value. Since elem is a tuple, you need to add each part of it to the corresponding part of the accumulator.
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2) => The second function, used to combine the accumulators from each partition.
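For comparison, a sketch of the same aggregateByKey laid out over several lines, assuming the same appNamesAndPropertiesRdd of (String, (Long, Long)) pairs as above (hence the (0L, 0L) zero value):
val aggregated = appNamesAndPropertiesRdd.aggregateByKey((0L, 0L))(
  (acc, elem) => (acc._1 + elem._1, acc._2 + elem._2),    // within a partition
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)  // across partitions
)
// aggregated.collect() gives the same result as reduceByKey here:
// (com.instagram.android,(10,3)), (com.android.contacts,(9,5))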
Try this logic:
rdd.groupBy(_._1).map(x => (x._1, x._2.map(_._2).foldLeft((0L, 0L)) { case ((acc1, acc2), (a, b)) => (acc1 + a, acc2 + b) }))

Can someone explain this scala aggregate function with two initial values

I am very new to Scala. I was trying to solve this problem in Spark, which also uses Scala for performing operations on RDDs.
Until now, I had only seen aggregate functions with a single initial value (i.e. someInput.aggregate(initialValue)((acc, value) => acc + value)), but this program has two initial values (0, 0).
As per my understanding this program is for calculating the running average and keeping track of the count so far.
val result = input.aggregate((0, 0))(
(acc, value) => (acc._1 + value, acc._2 + 1),
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
val avg = result._1 / result._2.toDouble
I know that in foldLeft / aggregate we supply initial values so that, in the case of an empty collection, we get the default value, and that both have an accumulator and a value part.
But in this case we have two initial values, and the accumulator is accessing tuple values. Where is this tuple defined?
Can someone please explain this whole program line by line.
but this program has two initial values (0,0).
They aren't two parameters, they're one Tuple2:
input.aggregate((0, 0))
The value passed to aggregate is surrounded by additional round brackets, (( )), which are used as syntactic sugar for Tuple2.apply. This is where you're seeing the tuple come from.
If you look at the method definition (I'm assuming this is RDD.aggregate), you'll see it takes a single parameter in the first argument list:
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)
(implicit arg0: ClassTag[U]): U
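A small illustration on plain Scala collections (no Spark needed); the outer brackets are just the argument list, and the inner (0, 0) is the single Tuple2 zero value:
val zero: (Int, Int) = (0, 0)   // one value: a Tuple2, not two separate parameters
val input = List(1, 2, 3, 4, 5, 6)

val result = input.aggregate(zero)(
  (acc, value) => (acc._1 + value, acc._2 + 1),           // acc = (runningSum, runningCount)
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)  // merge two partial (sum, count) pairs
)
// result == (21, 6), so result._1 / result._2.toDouble == 3.5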

RDD Aggregate in spark

I am an Apache Spark learner and have come across an RDD action, aggregate, which I have no clue how it functions. Can someone spell out and explain in detail, step by step, how we arrive at the result below for the code here?
RDD input = {1,2,3,3}
RDD Aggregate function :
rdd.aggregate((0, 0))(
  (x, y) => (x._1 + y, x._2 + 1),
  (x, y) => (x._1 + y._1, x._2 + y._2)
)
output : {9,4}
Thanks
If you are not sure what is going on, it is best to follow the types. Omitting the implicit ClassTag for brevity, we start with something like this:
abstract class RDD[T] extends Serializable with Logging
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U): U
If you ignore all the additional parameters you'll see that aggregate is a function which maps from RDD[T] to U. It means that the type of the values in the input RDD doesn't have to be the same as the type of the output value. So it is clearly different than for example reduce:
def reduce(func: (T, T) ⇒ T): T
or fold:
def fold(zeroValue: T)(op: (T, T) => T): T
The same as fold, aggregate requires a zeroValue. How to choose it? It should be an identity (neutral) element with respect to combOp.
You also have to provide two functions:
seqOp which maps from (U, T) to U
combOp which maps from (U, U) to U
Just based on these signatures you should already see that only seqOp may access the raw data. It takes some value of type U and another one of type T, and returns a value of type U. In your case it is a function with the following signature:
((Int, Int), Int) => (Int, Int)
At this point you probably suspect it is used for some kind of fold-like operation.
The second function takes two arguments of type U and returns a value of type U. As stated before it should be clear that it doesn't touch the original data and can operate only on the values already processed by the seqOp. In your case this function has a signature as follows:
((Int, Int), (Int, Int)) => (Int, Int)
So how can we get all of that together?
First, each partition is aggregated using the standard Iterator.aggregate with zeroValue, seqOp and combOp passed as z, seqop and combop respectively. Since the InterruptibleIterator used internally doesn't override aggregate, this is executed as a simple foldLeft(zeroValue)(seqOp).
Next, the partial results collected from each partition are aggregated using combOp.
Let's assume that the input RDD has three partitions with the following distribution of values:
Iterator(1, 2)
Iterator(3, 3)
Iterator()
You can expect that execution, ignoring absolute order, will be equivalent to something like this:
val seqOp = (x: (Int, Int), y: Int) => (x._1 + y, x._2 + 1)
val combOp = (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2)
Seq(Iterator(1, 2), Iterator(3, 3), Iterator())
.map(_.foldLeft((0, 0))(seqOp))
.reduce(combOp)
foldLeft for a single partition can look like this:
Iterator(1, 2).foldLeft((0, 0))(seqOp)
Iterator(2).foldLeft((1, 1))(seqOp)
(3, 2)
and over all partitions
Seq((3,2), (6,2), (0,0))
which combined will give you observed result:
(3 + 6 + 0, 2 + 2 + 0)
(9, 4)
In general this is a common pattern you will find all over Spark, where you pass a neutral value, a function used to process values per partition, and a function used to merge partial aggregates from different partitions. Some other examples include:
aggregateByKey
User Defined Aggregate Functions
Aggregators on Spark Datasets.
Here is my understanding for your reference:
Imagine you have two nodes: one takes the input of the first two list elements {1, 2}, and the other takes {3, 3}. (The partitioning here is only for convenience.)
At the first node:
"(x, y) => (x._1 + y, x._2 + 1)": the first x is (0, 0) as given and y is your first element 1, so you get the output (0+1, 0+1). Then comes your second element y = 2, giving the output (1+2, 1+1), which is (3, 2).
At the second node, the same procedure happens in parallel, and you'll get (6, 2).
"(x, y) => (x._1 + y._1, x._2 + y._2)" tells you how to merge the two nodes, and you'll get (9, 4).
One thing worth noticing is that the zero value (0, 0) is actually folded into the result more than once: once per partition and once more in the final combine. In this run that happened to be length(rdd) + 1 = 5 times, which you can see by using (1, 1) as the zero value instead:
scala> rdd.aggregate((1, 1))((x, y) => (x._1 + y, x._2 + 1), (x, y) => (x._1 + y._1, x._2 + y._2))
res1: (Int, Int) = (14, 9)

Explanation of the aggregate scala function

I do not yet understand the aggregate function:
For example, having:
val x = List(1,2,3,4,5,6)
val y = x.par.aggregate((0, 0))((x, y) => (x._1 + y, x._2 + 1), (x,y) => (x._1 + y._1, x._2 + y._2))
The result will be: (21,6)
Well, I think that (x,y) => (x._1 + y._1, x._2 + y._2) is to get the result in parallel, for example it will be (1 + 2, 1 + 1) and so on.
But exactly this part that leaves me confused:
(x, y) => (x._1 + y, x._2 + 1)
why x._1 + y? and here x._2 is 0?
Thanks in advance.
First of all, thanks to Diego's reply, which helped me connect the dots in understanding the aggregate() function.
Let me confess that I couldn't sleep properly last night because I couldn't work out how aggregate() works internally. I'll definitely get good sleep tonight :-)
Let's start understanding it
val result = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).par.aggregate((0, 0))(
  (x, y) => (x._1 + y, x._2 + 1),
  (x, y) => (x._1 + y._1, x._2 + y._2)
)
result: (Int, Int) = (55,10)
The aggregate function has 3 parts:
initial value of the accumulator: the tuple (0, 0) here
seqop: works like foldLeft, starting from that initial accumulator value
combop: combines the results generated through parallelization (this part was difficult for me to understand)
Let's understand all 3 parts independently :
part-1 : Initial tuple (0,0)
aggregate() starts with the initial value of the accumulator x, which is (0, 0) here. The first tuple element x._1, initially 0, is used to compute the running sum; the second tuple element x._2 is used to count the total number of elements in the list.
part-2 : (x, y) => (x._1 + y, x._2 + 1)
If you know how foldLeft works in Scala, this part should be easy to understand. The function above works just like foldLeft on our List(1,2,3,4...10).
Iteration# (x._1 + y, x._2 + 1)
1 (0+1, 0+1)
2 (1+2, 1+1)
3 (3+3, 2+1)
4 (6+4, 3+1)
. ....
. ....
10 (45+10, 9+1)
Thus, after all 10 iterations you'll get the result (55, 10).
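A quick sequential check of just that part (treating the whole list as a single partition):
List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).foldLeft((0, 0))((x, y) => (x._1 + y, x._2 + 1))  // (55, 10)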
If you understand this part, the rest is very easy. For me the most difficult thing to understand was: if all the required computation is already finished, then what is the use of the second part, i.e. combOp? Stay tuned :-)
part 3 : (x,y) =>(x._1 + y._1, x._2 + y._2)
Well, this third part is combOp, which combines the results generated by different threads during parallelization. Remember we used 'par' in our code to enable parallel computation of the list:
List(1,2,3,4,5,6,7,8,9,10).par.aggregate(....)
Apache Spark effectively uses the aggregate function in the same way to do parallel computation over an RDD.
Let's assume that our List(1,2,3,4,5,6,7,8,9,10) is being computed by 3 threads in parallel. Each thread works on a partial list, and then our aggregate() combOp combines the results of each thread's computation using the code below:
(x,y) =>(x._1 + y._1, x._2 + y._2)
Original list : List(1,2,3,4,5,6,7,8,9,10)
Thread 1 starts computing on a partial list, say (1,2,3,4); Thread 2 computes (5,6,7,8); and Thread 3 computes the partial list (9,10).
At the end of computation, Thread-1 result will be (10,4), Thread-2 result will be (26,4) and Thread-3 result will be (19,2).
At the end of parallel computation, we'll have ((10,4),(26,4),(19,2))
Iteration# (x._1 + y._1, x._2 + y._2)
1 (0+10, 0+4)
2 (10+26, 4+4)
3 (36+19, 8+2)
which is (55,10).
Finally, let me reiterate that seqOp's job is to compute the sum of all the elements of the list and the total number of elements, whereas the combine function's job is to combine the different partial results generated during parallelization.
I hope the above explanation helps you understand aggregate().
From the documentation:
def aggregate[B](z: ⇒ B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
Aggregates the results of applying an operator to subsequent elements.
This is a more general form of fold and reduce. It has similar
semantics, but does not require the result to be a supertype of the
element type. It traverses the elements in different partitions
sequentially, using seqop to update the result, and then applies
combop to results from different partitions. The implementation of
this operation may operate on an arbitrary number of collection
partitions, so combop may be invoked an arbitrary number of times.
For example, one might want to process some elements and then produce
a Set. In this case, seqop would process an element and append it to
the list, while combop would concatenate two lists from different
partitions together. The initial value z would be an empty set.
pc.aggregate(Set[Int]())(_ += process(_), _ ++ _)
Another example is calculating geometric mean from a collection of doubles (one would typically require big doubles for this).
B - the type of accumulated results
z - the initial value for the accumulated result of the partition - this will typically be the neutral element for the seqop operator (e.g. Nil for list concatenation or 0 for summation) and may be evaluated more than once
seqop - an operator used to accumulate results within a partition
combop - an associative operator used to combine results from different partitions
In your example B is a Tuple2[Int, Int]. The method seqop then takes a single element from the list, scoped as y, and updates the aggregate B to (x._1 + y, x._2 + 1). So it increments the second element in the tuple. This effectively puts the sum of elements into the first element of the tuple and the number of elements into the second element of the tuple.
The method combop then takes the results from each parallel execution thread and combines them. Combination by addition provides the same results as if it were run on the list sequentially.
Using B as a tuple is likely the confusing piece of this. You can break the problem down into two subproblems to get a better idea of what this is doing. res0 is the first element in the result tuple, and res1 is the second element in the result tuple.
// Sums all elements in parallel.
scala> x.par.aggregate(0)((x, y) => x + y, (x, y) => x + y)
res0: Int = 21
// Counts all elements in parallel.
scala> x.par.aggregate(0)((x, y) => x + 1, (x, y) => x + y)
res1: Int = 6
aggregate takes 3 parameters: a seed value, a computation function and a combination function.
What it does is basically split the collection across a number of threads, compute partial results using the computation function, and then combine all these partial results using the combination function.
From what I can tell, your example function will return a pair (a, b) where a is the sum of the values in the list and b is the number of values in the list. Indeed, (21, 6).
How does this work? The seed value is the (0,0) pair. For an empty list, we have a sum of 0 and a number of items 0, so this is correct.
Your computation function takes an (Int, Int) pair x, which is your partial result, and a Int y, which is the next value in the list. This is your:
(x, y) => (x._1 + y, x._2 + 1)
Indeed, the result that we want is to increase the left element of x (the accumulator) by y, and the right element of x (the counter) by 1 for each y.
Your combination function takes an (Int, Int) pair x and an (Int, Int) pair y, which are your two partial results from different parallel computations, and combines them together as:
(x,y) => (x._1 + y._1, x._2 + y._2)
Indeed, we sum independently the left parts of the pairs and right parts of the pairs.
Your confusion comes from the fact that x and y in the first function ARE NOT the same x and y as in the second function. In the first function, x has the type of the seed value and y has the type of the collection elements, and you return a result of the type of x. In the second function, your two parameters are both of the same type as your seed value.
Hope it's clearer now!
Adding to Rashmit's answer:
combOp is called only if the collection is processed in parallel mode.
See the example below (the import is an assumption for Scala 2.12; on Scala 2.13+ the parallel collections live in the separate scala-parallel-collections module):
import scala.collection.parallel.ParSeq

val listP: ParSeq[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).par
val aggregateOp1 = listP.aggregate[String]("Aggregate-")((a, b) => a + b, (s1, s2) => {
  println("Combiner called if the collection is processed in parallel mode")
  s1 + "," + s2
})
println(aggregateOp1)
Output: Aggregate-1,Aggregate-2,Aggregate-3,Aggregate-45,Aggregate-6,Aggregate-7,Aggregate-8,Aggregate-910

val list: Seq[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val aggregateOp2 = list.aggregate[String]("Aggregate-")((a, b) => a + b, (s1, s2) => {
  println("Combiner called if the collection is processed in parallel mode")
  s1 + "," + s2
})
println(aggregateOp2)
Output: Aggregate-12345678910
In the above example, the combiner operation is called only if the collection is operated on in parallel.
def aggregate[B](z: ⇒ B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
Breaking that down a little :
aggregate(accumulator)(accumulator+first_elem_of_list, (seq1,seq2)=>seq1+seq2)
Now looking at the example:
val x = List(1,2,3,4,5,6)
val y = x.par.aggregate((0, 0))((x, y) => (x._1 + y, x._2 + 1), (x,y) => (x._1 + y._1, x._2 + y._2))
Here:
Accumulator is (0,0)
Defined list is x
First elem of x is 1
So for each iteration, we take the accumulator, add the current element of x to position 1 of the accumulator to get the sum, and increase position 2 of the accumulator by 1 to get the count. (y is the current element of the list.)
(x, y) => (x._1 + y, x._2 + 1)
Now, since this is a parallel implementation, the first portion will give rise to a list of tuples like (3,2), (7,2) and (11,2), where index 1 = sum and index 2 = count of elements used to generate that sum. Now the second portion comes into play: the partial results are added together in a reduce fashion.
(x,y) =>(x._1 + y._1, x._2 + y._2)
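A quick check of that reduce step over the hypothetical per-chunk results from above:
val chunks = Seq((3, 2), (7, 2), (11, 2))
chunks.reduce((x, y) => (x._1 + y._1, x._2 + y._2))  // (21, 6)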
Rewriting with more meaningful variable names:
val arr = Array(1, 2, 3, 4, 5, 6)
arr.par.aggregate((0, 0))(
  (accumulator, list_elem) => (accumulator._1 + list_elem, accumulator._2 + 1),
  (seq1, seq2) => (seq1._1 + seq2._1, seq1._2 + seq2._2)
)