RDD Aggregate in spark - scala

I am an Apache Spark learner and have come across a RDD action aggregate which I have no clue of how it functions. Can some one spell out and explain in detail step by step how did we arrive at the below result for the code here
RDD input = {1,2,3,3}
RDD Aggregate function :
rdd.aggregate((0, 0))
((x, y) =>
(x._1 + y, x._2 + 1),
(x, y) =>
(x._1 + y._1, x._2 + y._2))
output : {9,4}
Thanks

If you are not sure what is going on it is best to follow the types. Omitting implicit ClassTag for brevity we start with something like this
abstract class RDD[T] extends Serializable with Logging
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U): U
If you ignore all the additional parameters you'll see that aggregate is a function which maps from RDD[T] to U. It means that the type of the values in the input RDD doesn't have to be the same as the type of the output value. So it is clearly different than for example reduce:
def reduce(func: (T, T) ⇒ T): T
or fold:
def fold(zeroValue: T)(op: (T, T) => T): T
The same as fold, aggregate requires a zeroValue. How to choose it? It should be an identity (neutral) element with respect to combOp.
You also have to provide two functions:
seqOp which maps from (U, T) to U
combOp which maps from (U, U) to U
Just based on this signatures you should already see that only seqOp may access the raw data. It takes some value of type U another one of type T and returns a value of type U. In your case it is a function with a following signature
((Int, Int), Int) => (Int, Int)
At this point you probably suspect it is used for some kind of fold-like operation.
The second function takes two arguments of type U and returns a value of type U. As stated before it should be clear that it doesn't touch the original data and can operate only on the values already processed by the seqOp. In your case this function has a signature as follows:
((Int, Int), (Int, Int)) => (Int, Int)
So how can we get all of that together?
First each partition is aggregated using standard Iterator.aggregate with zeroValue, seqOp and combOp passed as z, seqop and combop respectivelly. Since InterruptibleIterator used internally doesn't override aggregate it should be executed as a simple foldLeft(zeroValue)(seqOp)
Next partial results collected from each partition are aggregated using combOp
Lets assume that input RDD has three partitions with following distribution of values:
Iterator(1, 2)
Iterator(2, 3)
Iterator()
You can expect that execution, ignoring absolute order, will be equivalent to something like this:
val seqOp = (x: (Int, Int), y: Int) => (x._1 + y, x._2 + 1)
val combOp = (x: (Int, Int), y: (Int, Int)) => (x._1 + y._1, x._2 + y._2)
Seq(Iterator(1, 2), Iterator(3, 3), Iterator())
.map(_.foldLeft((0, 0))(seqOp))
.reduce(combOp)
foldLeft for a single partition can look like this:
Iterator(1, 2).foldLeft((0, 0))(seqOp)
Iterator(2).foldLeft((1, 1))(seqOp)
(3, 2)
and over all partitions
Seq((3,2), (6,2), (0,0))
which combined will give you observed result:
(3 + 6 + 0, 2 + 2 + 0)
(9, 4)
In general this is a common pattern you will find all over Spark where you pass neutral value, a function used to process values per partition and a function used to merge partial aggregates from different partitions. Some other examples include:
aggregateByKey
User Defined Aggregate Functions
Aggregators on Spark Datasets.

Here is my understanding for your reference:
Imagine you have two nodes, one take the input of the first two list elements {1,2}, and another takes {3, 3}. (The partition here is only for convenient)
At the first node:
"(x, y) => (x._1 + y, x._2 + 1)" , the first x is (0,0) as given, and y is your first element 1, and you will have output (0+1, 0+1), then comes your second element y=2, and output (1 + 2, 1 + 1), which is (3, 2)
At the second node, same procedure happens in parallel, and you'll have (6, 2).
"(x, y) => (x._1 + y._1, x._2 + y._2)", tells you to merge two nodes, and you'll get (9,4)
one thing worth noticing is (0,0) is actually added to the result
length(rdd)+1 times.
"scala> rdd.aggregate((1,1)) ((x, y) =>(x._1 + y, x._2 + 1), (x, y) => (x._1 + y._1, x._2 + y._2))
res1: (Int, Int) = (14,9)"

Related

How does this use of aggregate work in Scala?

I have been reading a spark book and this example is from the book
input = List(1,2,3,4,5,6)
val result = input.aggregate((0, 0))(
(acc, value) => (acc._1 + value, acc._2 + 1),
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
val avg = result._1 / result._2.toDouble
I am trying to understand how this works and what is the _1 and _2 at each step
(0,0) is the seed value or initial value
This list gets split into sep rdd's
lets say rdd1 contains List(1,2)
loop through this list
(acc, value)
acc = ??? during each iteration of the loop
value = ??? during each iteration of the loop
(acc, value) => (acc._1 + value, acc._2 + 1)
during the first iteration of List(1,2) what is the value of acc._1 and _2 and value
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
acc1 (for 1,2) is something like 3,2 and acc2 (for 3,4) is 7,2
and this function adds 3+7 and 2+2 = 10,4 and add this value to the next group
Dear kind hearted Helpers,
please do not use jargons used in scala, I already read it and did not understand it hence came for help.
For a List(1,2) what will be the value of acc._1 and acc._2 during the first iteration of the list and during that iteration what is the value of 'value' and during the second iteration what are their values?
The first parameter of the aggregation function takes an initial value which in this example is a Tuple (0,0), then the next parameter is seqop which is a function (B, A) => A, in your example it would (Tuple, Int) => Tuple
What is happening here is this function is applied on every parameter of the list one by one. The tuple actually holds on the left side the sum of the list and on the right side the amount of the list passed so far. The result of the aggregation function is (21, 6).
A side note: The implementation of TraversableOnce in Scala doesn't really use the combop parameter which in this example is (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)) So you can just ignore it in this case. If you are familiar with Scala the code that gets executed is:
input.foldLeft((0, 0))((acc, value) => (acc._1 + value, acc._2 + 1))
aggregate works by taking in two functions, one which combines values within a partition and one which combines partitions.
The first function (the one for a single partition) could be more clearly written as
((sum, count), value) => (sum + value, count + 1)
The second function (to combine partitions) could be written as
((partition1Sum, partition1Count), (partition2Sum, partition2Count)) =>
(partition1Sum + partition2Sum, partition1Count + partition2Count)
Note on tuple notation:
In Scala (a, b, c)._1 == a, (a, b, c)._2 == b and so on. _n gives you the nth element of the tuple.

Tuple element is not getting converted to float during reduceByKey

I am preparing for CCA175, I am using the oldest available version of spark, Spark 1.3.0.
As shown below, I am converting the element to Float while mapping but while reducing it is showing a compile time error.
scala> val revenuePerDay = ordersJoinOrderItems.map(x => (x._2._1, (x._1, (x._2._2).toFloat)))
revenuePerDay: org.apache.spark.rdd.RDD[(String, (Int, Float))] =
MapPartitionsRDD[21] at map at <console>:31
After mapping I can see that it is mapped as Float but when I am running the below command it is showing an error:
scala> revenuePerDay.reduceByKey((x,y) => x._2._2 + y._2._2)
<console>:34: error: value _2 is not a member of Float
revenuePerDay.reduceByKey((x,y) => x._2._2 + y._2._2)
^
PairRDDFunctions.reduceByKey works on a pair of values:
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
Since your tuple is of the form: (String, (Int, Float)), the key (String) isn't part of the method signature.
reduceByKey expects a function of type (V, V) => V. Since your input is of type (Int, Float), and result is of type Float, this won't work.
Instead, we'll need to use the more verbose PairRDDFunctions.combineByKey:
revenuePerDay.combineByKey[Float](_._2, (acc, x) => acc + x._2, (x, y) => x + y)
Or, you can use the slightly similar PairRDDFunctions.aggregateByKey:
revenuePerDay.aggregateByKey(0F)((acc, x) => acc + x._2, (x, y) => x + y)
Edit
Another suggestion by #zero323 is to use mapValues with reduceByKey:
revenuePerDay.mapValues(_._2).reduceByKey(_ + _)

SortByValue for a RDD of tuples

Recently I was asked (in a class assignment) to find the top 10 occurring words inside RDD. I submitted my assignment with a working solution which looks like
wordsRdd
.map(x => (x, 1))
.reduceByKey(_ + _)
.map(case (x, y) => (y, x))
.sortByKey(false)
.map(case (x, y) => (y, x))
.take(10)
So basically, I swap the tuple, sort by key, and then swap again. Then finally take 10. I don't find the repeated swapping very elegant.
So I wonder if there is a more elegant way of doing this.
I searched and found some people using Scala implicits to convert the RDD into a Scala Sequence and then doing the sortByValue, but I don't want to convert RDD to a Scala Seq, because that will kill the distributed nature of the RDD.
So is there a better way?
How about this:
wordsRdd.
map(x => (x, 1)).
reduceByKey(_ + _).
takeOrdered(10)(Ordering.by(-1 * _._2))
or a little bit more verbose:
object WordCountPairsOrdering extends Ordering[(String, Int)] {
def compare(a: (String, Int), b: (String, Int)) = b._2.compare(a._2)
}
wordsRdd.
map(x => (x, 1)).
reduceByKey(_ + _).
takeOrdered(10)(WordCountPairsOrdering)

Is it possible to user reduceByKey((x, y, z) => ...)?

Is it possible to have a reduceByKey of the way: reduceByKey((x, y, z) => ...)?
Because I have a RDD:
RDD[((String, String, Double), (Double, Double, scala.collection.immutable.Map[String,Double]))]
And I want reduce by key and I tried with this operation:
reduceByKey((x, y, z) => (x._1 + y._1 + z._1, x._2 + y._2 + z._2, (((x._3)++y._3)++z._3)))
and it shows me a error message: missing parameter type
Before I tested with two elements and it works, but with 3 I really don't know which is my error. What is the way to do that?
Here's what you're missing, reduceByKey is telling you that you have a Key-Value pairing. Conceptually there can only ever be 2 items in a pair, it's part of the what makes a pair a pair. Hence, the full signature of reduceByKey can only ever be a 2-Tuple as it's signature. So, no, you can't directly have a function of arity 3, only of arity 2.
Here's how I'd handle your situation:
reduceByKey((key,value) =>
val (one, two, three) = key
val (dub1, dub2, nameName) = value
// rest of work
}
However, let me make one slight suggestion? Use a case class for your value. It's easier to grok and is essentially equivalent to your 3-tuple.
if you see the reduceByKey function on PairRDDFunctions, it looks like,
def reduceByKey(func: (V, V) => V): RDD[(K, V)]
hence, its not possible to have it work on a 3-tuple.
However, you can wrap your 3-tuple into a model and still keep your first string as the key making your RDD as RDD[(string, your-model)] and now you can aggregate the model in whatever way you want.
Hope this helps.

Explanation of the aggregate scala function

I do not get to understand yet the aggregate function:
For example, having:
val x = List(1,2,3,4,5,6)
val y = x.par.aggregate((0, 0))((x, y) => (x._1 + y, x._2 + 1), (x,y) => (x._1 + y._1, x._2 + y._2))
The result will be: (21,6)
Well, I think that (x,y) => (x._1 + y._1, x._2 + y._2) is to get the result in parallel, for example it will be (1 + 2, 1 + 1) and so on.
But exactly this part that leaves me confused:
(x, y) => (x._1 + y, x._2 + 1)
why x._1 + y? and here x._2 is 0?
Thanks in advance.
First of all Thanks to Diego's reply which helped me connect the dots in understanding aggregate() function..
Let me confess that I couldn't sleep last night properly because I couldn't get how aggregate() works internally, I'll get good sleep tonight definitely :-)
Let's start understanding it
val result = List(1,2,3,4,5,6,7,8,9,10).par.aggregate((0, 0))
(
(x, y) => (x._1 + y, x._2 + 1),
(x,y) =>(x._1 + y._1, x._2 + y._2)
)
result: (Int, Int) = (55,10)
aggregate function has 3 parts :
initial value of accumulators : tuple(0,0) here
seqop : It works like foldLeft with initial value of 0
combop : It combines the result generated through parallelization (this part was difficult for me to understand)
Let's understand all 3 parts independently :
part-1 : Initial tuple (0,0)
Aggregate() starts with initial value of accumulators x which is (0,0) here. First tuple x._1 which is initially 0 is used to compute the sum, Second tuple x._2 is used to compute total number of elements in the list.
part-2 : (x, y) => (x._1 + y, x._2 + 1)
If you know how foldLeft works in scala then it should be easy to understand this part. Above function works just like foldLeft on our List(1,2,3,4...10).
Iteration# (x._1 + y, x._2 + 1)
1 (0+1, 0+1)
2 (1+2, 1+1)
3 (3+3, 2+1)
4 (6+4, 3+1)
. ....
. ....
10 (45+10, 9+1)
thus after all 10 iteration you'll get the result (55,10).
If you understand this part the rest is very easy but for me it was the most difficult part in understanding if all the required computation are finished then what is the use of second part i.e. compop - stay tuned :-)
part 3 : (x,y) =>(x._1 + y._1, x._2 + y._2)
Well this 3rd part is combOp which combines the result generated by different threads during parallelization, remember we used 'par' in our code to enable parallel computation of list :
List(1,2,3,4,5,6,7,8,9,10).par.aggregate(....)
Apache spark is effectively using aggregate function to do parallel computation of RDD.
Let's assume that our List(1,2,3,4,5,6,7,8,9,10) is being computed by 3 threads in parallel. Here each thread is working on partial list and then our aggregate() combOp will combine the result of each thread's computation using the below code :
(x,y) =>(x._1 + y._1, x._2 + y._2)
Original list : List(1,2,3,4,5,6,7,8,9,10)
Thread1 start computing on partial list say (1,2,3,4), Thread2 computes (5,6,7,8) and Thread3 computes partial list say (9,10)
At the end of computation, Thread-1 result will be (10,4), Thread-2 result will be (26,4) and Thread-3 result will be (19,2).
At the end of parallel computation, we'll have ((10,4),(26,4),(19,2))
Iteration# (x._1 + y._1, x._2 + y._2)
1 (0+10, 0+4)
2 (10+26, 4+4)
3 (36+19, 8+2)
which is (55,10).
Finally let me re-iterate that seqOp job is to compute the sum of all the elements of list and total number of list whereas combine function's job is to combine different partial result generated during parallelization.
I hope above explanation help you understand the aggregate().
From the documentation:
def aggregate[B](z: ⇒ B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
Aggregates the results of applying an operator to subsequent elements.
This is a more general form of fold and reduce. It has similar
semantics, but does not require the result to be a supertype of the
element type. It traverses the elements in different partitions
sequentially, using seqop to update the result, and then applies
combop to results from different partitions. The implementation of
this operation may operate on an arbitrary number of collection
partitions, so combop may be invoked an arbitrary number of times.
For example, one might want to process some elements and then produce
a Set. In this case, seqop would process an element and append it to
the list, while combop would concatenate two lists from different
partitions together. The initial value z would be an empty set.
pc.aggregate(Set[Int]())(_ += process(_), _ ++ _)
Another example is
calculating geometric mean from a collection of doubles (one would
typically require big doubles for this). B the type of accumulated
results z the initial value for the accumulated result of the
partition - this will typically be the neutral element for the seqop
operator (e.g. Nil for list concatenation or 0 for summation) and may
be evaluated more than once seqop an operator used to accumulate
results within a partition combop an associative operator used to
combine results from different partitions
In your example B is a Tuple2[Int, Int]. The method seqop then takes a single element from the list, scoped as y, and updates the aggregate B to (x._1 + y, x._2 + 1). So it increments the second element in the tuple. This effectively puts the sum of elements into the first element of the tuple and the number of elements into the second element of the tuple.
The method combop then takes the results from each parallel execution thread and combines them. Combination by addition provides the same results as if it were run on the list sequentially.
Using B as a tuple is likely the confusing piece of this. You can break the problem down into two sub problems to get a better idea of what this is doing. res0 is the first element in the result tuple, and res1 is the second element in the result tuple.
// Sums all elements in parallel.
scala> x.par.aggregate(0)((x, y) => x + y, (x, y) => x + y)
res0: Int = 21
// Counts all elements in parallel.
scala> x.par.aggregate(0)((x, y) => x + 1, (x, y) => x + y)
res1: Int = 6
aggregate takes 3 parameters: a seed value, a computation function and a combination function.
What it does is basically split the collection in a number of threads, compute partial results using the computation function and then combine all these partial results using the combination function.
From what I can tell, your example function will return a pair (a, b) where a is the sum of the values in the list, b is the number of values in the list. Indeed, (21, 6).
How does this work? The seed value is the (0,0) pair. For an empty list, we have a sum of 0 and a number of items 0, so this is correct.
Your computation function takes an (Int, Int) pair x, which is your partial result, and a Int y, which is the next value in the list. This is your:
(x, y) => (x._1 + y, x._2 + 1)
Indeed, the result that we want is to increase the left element of x (the accumulator) by y, and the right element of x (the counter) by 1 for each y.
Your combination function takes an (Int, Int) pair x and an (Int, Int) pair y, which are your two partial results from different parallel computations, and combines them together as:
(x,y) => (x._1 + y._1, x._2 + y._2)
Indeed, we sum independently the left parts of the pairs and right parts of the pairs.
Your confusion comes from the fact that x and y in the first function ARE NOT the same x and y of the second function. In the first function, you have x of the type of the seed value, and y of the type of the collection elements, and you return a result of the type of x. In the second function, your two parameters are both of the same type of your seed value.
Hope it's clearer now!
Adding to Rashmit answer.
CombOp is called only if the collection is processed in parallel mode.
See below example :
val listP: ParSeq[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).par
val aggregateOp1 = listP.aggregate[String]("Aggregate-")((a, b) => a + b, (s1, s2) => {
println("Combiner called , if collections is processed parallel mode")
s1 + "," + s2
})
println(aggregateOp1)
OP : Aggregate-1,Aggregate-2,Aggregate-3,Aggregate-45,Aggregate-6,Aggregate-7,Aggregate-8,Aggregate-910
val list: Seq[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val aggregateOp2 = list.aggregate[String]("Aggregate-")((a, b) => a + b, (s1, s2) => {
println("Combiner called , if collections is processed parallel mode")
s1 + "," + s2
})
println(aggregateOp2)
}
OP : Aggregate-12345678910
In above example, combiner operation is called only if collection is operated in parallel
def aggregate[B](z: ⇒ B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
Breaking that down a little :
aggregate(accumulator)(accumulator+first_elem_of_list, (seq1,seq2)=>seq1+seq2)
Now looking at the example:
val x = List(1,2,3,4,5,6)
val y = x.par.aggregate((0, 0))((x, y) => (x._1 + y, x._2 + 1), (x,y) => (x._1 + y._1, x._2 + y._2))
Here:
Accumulator is (0,0)
Defined list is x
First elem of x is 1
So for each iteration, we are taking the accumulator and adding the elements of x to position 1 of the accumulator to get the sum and increasing position 2 of the accumulator by 1 to get the count. (y is the elements of the list)
(x, y) => (x._1 + y, x._2 + 1)
Now, since this is a parallel implementation, the first portion will give rise to a list of tuples like (3,2) (7,2) and (11,2). index 1 = Sum, index 2 = count of elements used to generate sum. Now the second portion comes into play. The elements of each sequence are added in a reduce fashion.
(x,y) =>(x._1 + y._1, x._2 + y._2)
rewriting with more meaningful variables:
val arr = Array(1,2,3,4,5,6)
arr.par.aggregate((0,0))((accumulator,list_elem)=>(accumulator._1+list_elem, accumulator._2+1), (seq1, seq2)=> (seq1._1+seq2._1, seq1._2+seq2._2))