Below I have a Scala example of a Spark fold action:
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
rdd1.fold(5)(_ + _)
This produces the output 35. Can somebody explain in detail how this output gets computed?
Taken from the Scaladocs here (emphasis mine):
@param zeroValue the initial value for the accumulated result of each
partition for the op operator, and also the initial value for the
combine results from different
partitions for the op operator - this will typically be the neutral
element (e.g. Nil for list concatenation or 0 for summation)
The zeroValue is in your case added four times (one for each partition, plus one when combining the results from the partitions). So the result is:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5) + 5 = 6 + 10 + 14 + 5 = 35 // the last 5 is the extra one used when combining the partition results
zeroValue is added once for each partition and should be a neutral element - in the case of + it should be 0. The exact result will depend on the number of partitions, but it is equivalent to:
rdd1.mapPartitions(iter => Iterator(iter.foldLeft(zeroValue)(_ + _))).reduce(_ + _) // here zeroValue = 5
so:
val rdd1 = sc.parallelize(List(1,2,3,4,5),3)
distributes data as:
scala> rdd1.glom.collect
res1: Array[Array[Int]] = Array(Array(1), Array(2, 3), Array(4, 5))
and a whole expression is equivalent to:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5)
plus one more 5 for jobResult (the driver-side accumulator used when combining the partition results), for a total of 6 + 10 + 14 + 5 = 35.
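A minimal sketch of the same computation done by hand (using the rdd1 and the zero value 5 from above), which may help show where each 5 enters:
// Fold each partition locally, starting from the zero value 5
val perPartition = rdd1.mapPartitions(iter => Iterator(iter.foldLeft(5)(_ + _))).collect()
// perPartition: Array(6, 10, 14)
// Combine the partition results on the driver, again starting from 5
val total = perPartition.foldLeft(5)(_ + _) // 5 + 6 + 10 + 14 = 35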
You know that Spark RDDs perform distributed computations.
So, this line here,
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
tells Spark to split this RDD into 3 partitions, which enables it to run the computation as 3 independent tasks in parallel.
Now, this line here,
rdd1.fold(5)(_ + _)
tells Spark to fold each of those partitions using 5 as the initial value, and then to fold the 3 partition results together, again using 5 as the initial value.
A rough plain-Scala equivalent can be written as:
val list = List(1, 2, 3, 4, 5)
val listOfList = list.grouped(2).toList
val listOfFolds = listOfList.map(l => l.fold(5)(_ + _))
val fold = listOfFolds.fold(5)(_ + _)
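Tracing the analogy through (grouped(2) happens to also produce 3 groups here, so the numbers line up with the RDD example):
// listOfList:  List(List(1, 2), List(3, 4), List(5))
// listOfFolds: List(5 + 1 + 2, 5 + 3 + 4, 5 + 5) = List(8, 12, 10)
// fold:        5 + 8 + 12 + 10 = 35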
So... if you are using fold on RDDs you need to provide a zero value.
But then you will ask - why or when would someone use fold instead of reduce?
Your confusion lies in your perception of the zero value. The thing is that the zero value for RDD[T] does not depend only on the type T but also on the nature of the computation. So your zero value does not need to be 0.
Let's consider a simple example where we want to compute "the largest number in our RDD, but at least 15".
Can we do that using reduce? The answer is NO. But we can do it using fold.
val n15GT15 = rdd1.fold(15)({ case (acc, i) => Math.max(acc, i) })
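A quick check (a sketch, using the same rdd1 with values 1 to 5): every element is below 15, so each partition's fold stays at 15 and the final combine keeps 15.
// n15GT15 == 15
// reduce has no slot for the "at least 15" floor: rdd1.reduce(Math.max) would simply give 5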
Related
Given the following:
val rdd = sc.parallelize(List(1, 2, 3))
I assumed that rdd.reduce((x,y) => (x - y)) would return -4 (i.e. (1-2)-3=-4), but it returned 2.
Why?
From the RDD source code (and docs):
/**
* Reduces the elements of this RDD using the specified commutative and
* associative binary operator.
*/
def reduce(f: (T, T) => T): T
reduce is a monoidal reduction, thus it assumes the function is commutative and associative, meaning that the order of applying it to the elements is not guaranteed.
Obviously, your function (x,y) => (x-y) is neither commutative nor associative.
In your case, the reduce might have been applied this way:
3 - (2 - 1) = 2
or
1 - (2 - 3) = 2
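A sketch of how partitioning changes the outcome (assuming a SparkContext named sc): within a single partition the elements are reduced left to right, but the order in which partition results are combined is not guaranteed.
sc.parallelize(List(1, 2, 3), 1).reduce(_ - _) // one partition: (1 - 2) - 3 = -4
sc.parallelize(List(1, 2, 3), 3).reduce(_ - _) // three partitions: depends on combine order, e.g. 2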
You can easily replace the subtraction v1 - v2 - ... - vN with v1 - (v2 + ... + vN), so your code can look like:
val v1 = 1
val values = Seq(2, 3)
val sum = sc.parallelize(values).reduce(_ + _)
val result = v1 - sum
As mentioned by @TzachZohar, the function must satisfy those two properties for the parallel computation to be sound; by collecting the RDD first, the reduce runs as a sequential (non-parallel) computation that no longer requires those properties, namely:
val rdd = sc.parallelize(1 to 3)
rdd.collect.reduce((x,y) => (x-y))
Int = -4
I found that Spark's RDD.fold and Scala's List.fold behave differently with the same input.
Scala 2.11.8
List(1, 2, 3, 4).fold(1)(_ + _) // res0: Int = 11
I think this is the correct output because 1 + (1 + 2 + 3 + 4) equals 11. But Spark's RDD.fold looks buggy.
Spark 2.0.1(not clustered)
sc.parallelize(List(1, 2, 3, 4)).fold(1)(_ + _) // res0: Int = 15
Even though an RDD is not a simple collection, this result does not make sense to me. Is this a known bug or the normal result?
It is not buggy, you're just not using it the right way. zeroValue should be neutral, which means it has to satisfy the following condition:
op(x, zeroValue) === op(zeroValue, x) === x
If op is + then the right choice is 0.
Why a restriction like this? If fold is to be executed in parallel, each chunk has to initialize its own zeroValue. In a more formal way, you can think about a Monoid where:
op is equivalent to • (this is a simplification, in practice op in Spark should be commutative, not only associative).
zeroValue is equivalent to the identity element.
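A small sketch of why this matters (assuming a SparkContext named sc): with a non-neutral zeroValue the answer changes with the number of partitions, because the zero is folded in once per partition plus once more when the partition results are combined.
val data = List(1, 2, 3, 4)
sc.parallelize(data, 1).fold(1)(_ + _) // 10 + 1 (one partition) + 1 (combine) = 12
sc.parallelize(data, 2).fold(1)(_ + _) // 10 + 2 + 1 = 13
sc.parallelize(data, 4).fold(1)(_ + _) // 10 + 4 + 1 = 15
sc.parallelize(data, 4).fold(0)(_ + _) // neutral zero: always 10
This would also explain the 15 above: the shell presumably split the list into 4 partitions, so the zero value 1 was added 4 + 1 = 5 times.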
I am trying to understand aggregate in Scala. With one example I understood the logic, but the result of a second one I tried confused me.
Please let me know where I went wrong.
Code:
val list1 = List("This", "is", "an", "example");
val b = list1.aggregate(1)(_ * _.length(), _ * _)
1 * "This".length = 4
1 * "is".length = 2
1 * "an".length = 2
1 * "example".length = 7
4 * 2 = 8 , 2 * 7 = 14
8 * 14 = 112
The output also came out as 112.
but for the below,
val c = list1.aggregate(1)(_ * _.length(), _ + _)
I thought it would be like this:
4, 2, 2, 7
4 + 2 = 6
2 + 7 = 9
6 + 9 = 15,
but the output still came as 112.
It seems to apply only the operation I specified as the seqop, here _ * _.length.
Could you please explain or correct me where I went wrong?
aggregate should only be used with associative and commutative operations. Let's look at the signature of the function:
def aggregate[B](z: ⇒ B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
B can be seen as an accumulator (and will be your output). You give an initial output value, then the first function describes how to add a value A to this accumulator and the second describes how to merge 2 accumulators. Scala "chooses" a way to aggregate your collection, but if your aggregation is not associative and commutative the output is not deterministic because the order matters. Look at this example:
val l = List(1, 2, 3, 4)
l.aggregate(0)(_ + _, _ * _)
If we create one accumulator and then aggregate all the values we get 1 + 2 + 3 + 4 = 10 but if we decide to parallelize the process by splitting the list in halves we could have (1 + 2) * (3 + 4) = 21.
What happens in reality is that for a List, aggregate is the same as foldLeft with your seqop, which explains why changing your second function didn't change the output. Where aggregate becomes useful is in Spark, for example, or other distributed environments, where it makes sense to fold each partition independently and then combine the results with the second function, as sketched below.
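A rough sketch of that difference (assuming a SparkContext named sc; the exact intermediate values depend on how the list is partitioned):
val rdd = sc.parallelize(List("This", "is", "an", "example"), 2)
// Each partition is folded with the seqop first, then the partial results
// are merged with the combop, so the combop now matters:
rdd.aggregate(1)(_ * _.length, _ * _) // e.g. 1 * (1*4*2) * (1*2*7) = 112
rdd.aggregate(1)(_ * _.length, _ + _) // e.g. 1 + (1*4*2) + (1*2*7) = 23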
SICP says that iterative processes (e.g. Newton's method for square root calculation, calculation of pi, etc.) can be formulated in terms of Streams.
Does anybody use streams in Scala to model iterations?
Here is one way to produce the stream of approximations of pi:
val naturals = Stream.from(0) // 0, 1, 2, ...
val odds = naturals.map(_ * 2 + 1) // 1, 3, 5, ...
val oddInverses = odds.map(1.0d / _) // 1/1, 1/3, 1/5, ...
val alternations = Stream.iterate(1)(-_) // 1, -1, 1, ...
val products = (oddInverses zip alternations)
  .map(ia => ia._1 * ia._2) // 1/1, -1/3, 1/5, ...
// Computes a stream representing the cumulative sum of another one
def sumUp(s: Stream[Double], acc: Double = 0.0d): Stream[Double] =
  Stream.cons(s.head + acc, sumUp(s.tail, s.head + acc))
val pi = sumUp(products).map(_ * 4.0) // Approximations of pi.
Now, say you want the 200th iteration:
scala> pi(200)
resN: Double = 3.1465677471829556
...or the 300000th:
scala> pi(300000)
resN : Double = 3.14159598691202
Streams are extremely useful when you are doing a sequence of recursive calculations and each result depends on previous results, such as calculating pi. Here's a simpler example: consider the classic recursive algorithm for calculating Fibonacci numbers (1, 2, 3, 5, 8, 13, ...):
def fib(n: Int): Int = n match {
  case 0 => 1
  case 1 => 2
  case _ => fib(n - 1) + fib(n - 2)
}
The main point of this code is that, while very simple, it is extremely inefficient. fib(100) almost crashed my computer! Each call branches into two recursive calls, so you end up calculating the same values many times over.
Streams allow you to do dynamic programming in a recursive fashion, where once a value is calculated, it is reused every time it is needed again. To implement the above using streams:
val naturals: Stream[Int] = Stream.cons(0, naturals.map(_ + 1))
val fibs: Stream[Int] = naturals.map {
  case 0 => 1
  case 1 => 2
  case n => fibs(n - 1) + fibs(n - 2)
}
fibs(1) //2
fibs(2) //3
fibs(3) //5
fibs(100) //1445263496
Whereas the recursive solution runs in O(2^n) time, the Streams solution runs in O(n^2) time. Since you only need the last 2 generated members, you can easily optimize this using Stream.drop so that the stream size doesn't overflow memory.
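A minimal sketch of that idea (fibFrom is a hypothetical helper, not from the original post): generate the sequence as a stream in which each cell only refers to the two values that produced it, and use drop to discard the prefix you no longer need.
// Each element is built from the two previous values, so nothing forces the
// whole prefix to stay in memory once you have dropped past it.
def fibFrom(a: Int, b: Int): Stream[Int] = a #:: fibFrom(b, a + b)
val fibs2 = fibFrom(1, 2) // 1, 2, 3, 5, 8, 13, ...
fibs2.drop(3).head        // 5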