spark: result of reduceByKey((x,y) => (x-y)) [duplicate]

Given the following:
val rdd = sc.parallelize(List(1,2,3))
I assumed that rdd.reduce((x,y) => (x - y)) would return -4 (i.e. (1-2)-3=-4), but it returned 2.
Why?

From the RDD source code (and docs):
/**
* Reduces the elements of this RDD using the specified commutative and
* associative binary operator.
*/
def reduce(f: (T, T) => T): T
reduce is a monoidal reduction, so it assumes the function is commutative and associative, meaning that the order in which it is applied to the elements is not guaranteed.
Obviously, your function (x,y) => (x-y) is neither commutative nor associative.
In your case, the reduce might have been applied this way:
3 - (2 - 1) = 2
or
1 - (2 - 3) = 2
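A minimal sketch of how partitioning drives this, assuming a live SparkContext sc (the exact result still depends on Spark's merge order):
val rdd = sc.parallelize(Seq(1, 2, 3), numSlices = 3)
rdd.glom().collect()         // Array(Array(1), Array(2), Array(3)) - one element per partition
rdd.reduce((x, y) => x - y)  // e.g. 2: the per-partition results 1, 2, 3 are merged in an unspecified order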

You can easily replace the subtraction v1 - v2 - ... - vN with v1 - (v2 + ... + vN), so your code can look like:
val v1 = 1
val values = Seq(2, 3)
val sum = sc.parallelize(values).reduce(_ + _)
val result = v1 - sum

As @TzachZohar noted, the function must satisfy those two properties for the parallel computation to be sound. By collecting the RDD first you give up the parallelism, so reduce no longer needs those properties and produces the result of a plain sequential computation:
val rdd = sc.parallelize(1 to 3)
rdd.collect.reduce((x,y) => (x-y))
Int = -4

Related

How does the fold action work in Spark?

Below I have a Scala example of a Spark fold action:
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
rdd1.fold(5)(_ + _)
This produces the output 35. Can somebody explain in detail how this output gets computed?
Taken from the Scaladocs here (emphasis mine):
@param zeroValue the initial value for the accumulated result of each partition for the op operator, and also the initial value for the combine results from different partitions for the op operator - this will typically be the neutral element (e.g. Nil for list concatenation or 0 for summation)
The zeroValue is in your case added four times (one for each partition, plus one when combining the results from the partitions). So the result is:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5) + 5 // (extra one for combining results)
zeroValue is added once for each partition and once more when the partition results are combined, and it should be a neutral element - in case of + it should be 0. The exact result will depend on the number of partitions, but (writing 5 for zeroValue) it is equivalent to:
rdd1.mapPartitions(iter => Iterator(iter.foldLeft(5)(_ + _))).reduce(_ + _)
so:
val rdd1 = sc.parallelize(List(1,2,3,4,5),3)
distributes data as:
scala> rdd1.glom.collect
res1: Array[Array[Int]] = Array(Array(1), Array(2, 3), Array(4, 5))
and the whole expression is equivalent to:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5) = 30
plus 5 for the jobResult, giving 35.
Spark RDDs perform distributed computations.
So, this line here,
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
tells Spark to split this RDD into 3 partitions, which enables it to run the computation as 3 independent tasks in parallel.
Now, this line here,
rdd1.fold(5)(_ + _)
tells Spark to fold each of those partitions using 5 as the initial value, and then to fold the partition results of the 3 tasks, again with 5 as the initial value.
A plain Scala equivalent can be written as:
val list = List(1, 2, 3, 4, 5)
val listOfList = list.grouped(2).toList                  // List(List(1, 2), List(3, 4), List(5))
val listOfFolds = listOfList.map(l => l.fold(5)(_ + _))  // List(8, 12, 10)
val fold = listOfFolds.fold(5)(_ + _)                    // 35
So... if you are using fold on RDDs you need to provide a zero value.
But then you may ask - why or when would someone use fold instead of reduce?
Your confusion lies in your perception of the zero value. The thing is that the zero value for RDD[T] does not depend only on the type T but also on the nature of the computation. So your zero value does not need to be 0.
Let's consider a simple example where we want to calculate the largest number in our RDD that is greater than 15, or 15 if there is none.
Can we do that using reduce? The answer is NO. But we can do it using fold:
val n15GT15 = rdd1.fold(15)({ case (acc, i) => Math.max(acc, i) })
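Another argument for fold, sketched below assuming a live SparkContext sc: unlike reduce, fold is also defined on an empty RDD, where it simply returns the zero value.
val empty = sc.parallelize(Seq.empty[Int])
empty.fold(15)(Math.max)   // 15 - the zero value stands on its own
// empty.reduce(Math.max)  // would throw UnsupportedOperationException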

Vector product using map and reduce in scala

I'm trying to calculate the vector product between two vectors using the map and reduce functions.
Let's see what happens in the Scala REPL:
First of all I define 2 vectors with the same length
scala> val v1 = Array(1,4,5,2)
v1: Array[Int] = Array(1, 4, 5, 2)
scala> val v2 = Array (3,5,1,5)
v2: Array[Int] = Array(3, 5, 1, 5)
Now I create a new array vecZip using the zip function
scala> val vecZip = v1 zip v2
vecZip: Array[(Int, Int)] = Array((1,3), (4,5), (5,1), (2,5))
Now I'd like to apply the reduce method to each element of this array (to compute the product within each tuple).
I thought this:
val vecToSum = vecZip.map(x=>(List(x).reduce(_*_)))
I want to get a list (vecToSum) to which I can then apply the reduce method to calculate the total result. However I get this error:
scala> val vecToSum = vecZip.map(x=>(List(x).reduce(_*_)))
<console>:10: error: value * is not a member of (Int, Int)
val vecToSum = vecZip.map(x=>(List(x).reduce(_*_)))
^
You just need to call map and multiply each tuple's values with each other, like this:
val vecToSum = vecZip.map(x => x._1 * x._2)
vecZip is an Array of tuples, so x is a tuple of (Int, Int). Therefore when you call List(x).reduce(...), you're creating a List whose only value is the tuple, which is not really what you want.
What your code actually does is create a list containing a single tuple element and then try to reduce it. That can never work: there is nothing to reduce, since the list already holds just a single element - a tuple.
Instead you need to map your vecZip array elements (tuples) via multiplying their elements:
vecZip.map { case (x, y) => x * y }
You don't need to reduce here. Reducing an Array[(Int, Int)] would mean performing some associative binary operation on all the tuples inside the array. It could perform the operation on the first couple of tuples, then on the result of that and the third tuple, then on the result of that and the fourth tuple, etc. But due to associativity it could also perform the operation on the first and second tuple, simultaneously on the third and fourth tuple, and then on their results, etc., which is nice for parallelization (frameworks such as Spark rely on it heavily).
For example you could sum all first elements and all second elements of each tuple:
val reduced = vecZip.reduce((pair1, pair2) => (pair1._1 + pair2._1, pair1._2 + pair2._2))
What you want however is to simply map each tuple into the product of its elements:
val vecToSum = vecZip.map { case (x, y) => x * y }
Note that I used a partial function (see the case keyword) in order to perform pattern matching on the tuple; without the partial function it would look like this:
val vecToSum = vecZip.map(tuple => tuple._1 * tuple._2)
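For completeness, the full dot product just sums those elementwise products; a small self-contained sketch:
val v1 = Array(1, 4, 5, 2)
val v2 = Array(3, 5, 1, 5)
val dot = (v1 zip v2).map { case (a, b) => a * b }.reduce(_ + _)  // 3 + 20 + 5 + 10 = 38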

Scala: Best way to filter & map in one iteration

I'm new to Scala and trying to figure out the best way to filter & map a collection. Here's a toy example to explain my problem.
Approach 1: This is pretty bad since I'm iterating through the list twice and calculating the same value in each iteration.
val N = 5
val nums = 0 until 10
val sqNumsLargerThanN = nums filter { x: Int => (x * x) > N } map { x: Int => (x * x).toString }
Approach 2: This is slightly better but I still need to calculate (x * x) twice.
val N = 5
val nums = 0 until 10
val sqNumsLargerThanN = nums collect { case x: Int if (x * x) > N => (x * x).toString }
So, is it possible to calculate this without iterating through the collection twice and avoid repeating the same calculations?
Could use a foldRight
nums.foldRight(List.empty[Int]) {
  case (i, is) =>
    val s = i * i
    if (s > N) s :: is else is
}
A foldLeft would also achieve a similar goal, but since it prepends to the list as it goes, the resulting list would be in reverse order, so you would need to reverse it at the end.
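A sketch of that variant, assuming the question's N and nums:
nums.foldLeft(List.empty[Int]) { (is, i) =>
  val s = i * i
  if (s > N) s :: is else is
}.reverse  // List(9, 16, 25, 36, 49, 64, 81)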
Alternatively if you'd like to play with Scalaz
import scalaz.std.list._
import scalaz.syntax.foldable._
nums.foldMap { i =>
  val s = i * i
  if (s > N) List(s) else List()
}
The typical approach is to use an iterator (if possible) or view (if iterator won't work). This doesn't exactly avoid two traversals, but it does avoid creation of a full-sized intermediate collection. You then map first and filter afterwards and then map again if needed:
xs.iterator.map(x => x*x).filter(_ > N).map(_.toString)
The advantage of this approach is that it's really easy to read and, since there are no intermediate collections, it's reasonably efficient.
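A usage sketch with the question's values; the iterator is lazy, so force it when you need a concrete collection:
val N = 5
val nums = 0 until 10
nums.iterator.map(x => x * x).filter(_ > N).map(_.toString).toList
// List("9", "16", "25", "36", "49", "64", "81")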
If you are asking because this is a performance bottleneck, then the answer is usually to write a tail-recursive function or use the old-style while loop method. For instance, in your case
def sumSqBigN(xs: Array[Int], N: Int): Array[String] = {
  val ysb = Array.newBuilder[String]
  def inner(start: Int): Array[String] = {
    if (start >= xs.length) ysb.result
    else {
      val sq = xs(start) * xs(start)
      if (sq > N) ysb += sq.toString
      inner(start + 1)
    }
  }
  inner(0)
}
You can also pass a parameter forward in inner instead of using an external builder (this is especially useful for sums), as sketched below.
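A sketch of that accumulator-threading variant; here it sums the qualifying squares instead of collecting them (the sum is an assumption chosen for illustration):
def sumOfBigSquares(xs: Array[Int], N: Int): Long = {
  @annotation.tailrec
  def inner(i: Int, acc: Long): Long =
    if (i >= xs.length) acc
    else {
      val sq = xs(i).toLong * xs(i)  // widen to Long to avoid overflow
      inner(i + 1, if (sq > N) acc + sq else acc)
    }
  inner(0, 0L)
}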
flatMap also does this in a single pass, since it traverses the collection only once:
val sqNumsLargerThanN = nums flatMap { x =>
  val square = x * x
  if (square > N) Some(square.toString) else None
}
You can use collect which applies a partial function to every value of the collection that it's defined at. Your example could be rewritten as follows:
val sqNumsLargerThanN = nums collect {
  case (x: Int) if (x * x) > N => (x * x).toString
}
A very simple approach that performs the multiplication only once. It's also lazy, so code is executed only when needed.
nums.view.map(x => x * x).withFilter(_ > N).map(_.toString)
Take a look here for differences between filter and withFilter.
Consider this for comprehension,
for (x <- 0 until 10; v = x*x if v > N) yield v.toString
which unfolds into a flatMap over the range and a (lazy) withFilter onto the once-only calculated square, and yields a collection with the filtered results. Note that only one iteration and one calculation of the square are required (in addition to creating the range).
You can use flatMap.
val sqNumsLargerThanN = nums flatMap { x =>
  val square = x * x
  if (square > N) Some(square.toString) else None
}
Or with Scalaz,
import scalaz.Scalaz._
val sqNumsLargerThanN = nums flatMap { x =>
  val square = x * x
  (square > N).option(square.toString)
}
This solves the asked question of how to do it in one iteration. That can be useful when streaming data, like with an Iterator.
However... if you instead want the absolute fastest implementation, this is not it. In fact, I suspect you would use a mutable ArrayBuffer and a while loop. But only after profiling would you know for sure. In any case, that's for another question.
Using a for comprehension would work:
val sqNumsLargerThanN = for {x <- nums if x*x > N } yield (x*x).toString
Note, though, that the Scala compiler does not fuse a filter followed by a map: each step makes its own pass and builds an intermediate collection, so reach for a view or an iterator if a single pass matters.
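A side-effecting sketch that makes the difference visible: the strict version finishes the whole filter pass before mapping, while a view interleaves the two per element:
(0 until 3).filter { x => println(s"filter $x"); x > 0 }
           .map { x => println(s"map $x"); x * x }
// prints: filter 0, filter 1, filter 2, map 1, map 2
(0 until 3).view.filter { x => println(s"filter $x"); x > 0 }
                .map { x => println(s"map $x"); x * x }
                .toList
// prints: filter 0, filter 1, map 1, filter 2, map 2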
I am also a beginner and did it as follows:
for (y <- nums.map(x => x * x) if y > N) println(y)

Suggest a cleaner functional way

Here is some imperative code:
var sum = 0
val spacing = 6
var x = spacing
for (i <- 1 to 10) {
  sum += x * x
  x += spacing
}
Here are two of my attempts to "functionalize" the above code:
// Attempt 1
(1 to 10).foldLeft((0, 6)) {
  case ((sum, x), _) => (sum + x * x, x + spacing)
}
// Attempt 2
Stream.iterate ((0, 6)) { case (sum, x) => (sum + x * x, x + spacing) }.take(11).last
I think there might be a cleaner and better functional way to do this. What would be that?
PS: Please note that the above is just an example code intended to illustrate the problem; it is not from the real application code.
Replacing 10 by N, you have spacing * spacing * N * (N + 1) * (2 * N + 1) / 6.
This follows from noting that you're summing (spacing * i)^2 over the range 1..N. This sum factorizes as spacing^2 * (1^2 + 2^2 + ... + N^2), and the latter sum is well known to be N * (N + 1) * (2 * N + 1) / 6 (see Square Pyramidal Number).
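A quick check of the formula against the loop's result (spacing = 6, N = 10):
val spacing = 6
val n = 10
spacing * spacing * n * (n + 1) * (2 * n + 1) / 6  // 36 * 385 = 13860, same as the imperative loop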
I actually like the idea of lazy sequences in this case. You can split your algorithm into 2 logical steps.
First you want to work on all natural numbers (ok... not all, but up to the maximum Int), so you define them like this:
val naturals = 0 to Int.MaxValue
Then you define how the numbers that you want to sum are calculated:
val myDoubles = (naturals by 6 tail).view map (x => x * x)
And putting this all together:
val naturals = 0 to Int.MaxValue
val myDoubles = (naturals by 6 tail).view map (x => x * x)
val mySum = myDoubles take 10 sum
I think this is the way a mathematician would approach the problem. And because all the collections are lazily evaluated, you will not run out of memory.
Edit
If you want to develop the idea of mathematical notation further, you can actually define this implicit conversion:
implicit def math[T, R](f: T => R) = new {
  def ∀(range: Traversable[T]) = range.view map f
}
and then define myDoubles like this:
val myDoubles = ((x: Int) => x * x) ∀ (naturals by 6 tail)
My personal favourite would have to be:
val x = (6 to 60 by 6) map {x => x*x} sum
Or given spacing as an input variable:
val x = (spacing to 10*spacing by spacing) map {x => x*x} sum
or
val x = (1 to 10) map (spacing*) map {x => x*x} sum
There are two different directions to go. If you want to express yourself, assuming that you can't use the built-in range function (because you actually want something more complicated):
Iterator.iterate(spacing)(x => x+spacing).take(10).map(x => x*x).foldLeft(0)(_ + _)
This is a very general pattern: specify what you start with and how to get the next given the previous; then take the number of items you need; then transform them somehow; then combine them into a single answer. There are shortcuts for almost all of these in simple cases (e.g. the last fold is sum) but this is a way to do it generally.
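With those shortcuts applied, a sketch of the same pipeline in its compact form:
Iterator.iterate(spacing)(_ + spacing).take(10).map(x => x * x).sum  // 13860 for spacing = 6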
But I also wonder: what is wrong with the mutable imperative approach for maximal speed? It's really quite clear, and Scala lets you mix the two styles on purpose:
var x = spacing
val last = spacing * 10
var sum = 0
while (x <= last) {
  sum += x * x
  x += spacing
}
(Note that the for is slower than while since the Scala compiler transforms for loops to a construct of maximum generality, not maximum speed.)
Here's a straightforward translation of the loop you wrote to a tail-recursive function, in an SML-like syntax.
val spacing = 6
fun loop (sum: int, x: int, i: int): int =
  if i > 0 then loop (sum + x*x, x + spacing, i - 1)
  else sum
val sum = loop (0, spacing, 10)
Is this what you were looking for? (What do you mean by a "cleaner" and "better" way?)
What about this?
def toSquare(i: Int) = i * i
val spacing = 6
val spaceMultiples = (1 to 10) map (spacing *)
val squares = spaceMultiples map toSquare
println(squares.sum)
You have to split your code into small parts. This can improve readability a lot.
Here is a one-liner:
(0 to 10).reduceLeft((u,v)=>u + spacing*spacing*v*v)
Note that you need to start with 0 in order to get the correct result: reduceLeft uses the first element as the initial accumulator as-is, so it would enter the sum without being scaled and squared (starting at 0 makes that harmless).
Another option is to generate the squares first, using the fact that the partial sums of the first k odd numbers are exactly k * k:
(1 to 2*10 by 2).scanLeft(0)(_+_).sum*spacing*spacing
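A sketch of the intermediate values, showing that the scanLeft really does produce the squares:
(1 to 2*10 by 2).scanLeft(0)(_ + _)  // 0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100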