Understanding aggregate in Scala

I am trying to understand aggregate in Scala. With one example I understood the logic, but the result of the second one I tried confused me.
Please let me know where I went wrong.
Code:
val list1 = List("This", "is", "an", "example");
val b = list1.aggregate(1)(_ * _.length(), _ * _)
1 * "This".length = 4
1 * "is".length = 2
1 * "an".length = 2
1 * "example".length = 7
4 * 2 = 8 , 2 * 7 = 14
8 * 14 = 112
The output was indeed 112.
But for the following,
val c = list1.aggregate(1)(_ * _.length(), _ + _)
I thought it would work like this:
4, 2, 2, 7
4 + 2 = 6
2 + 7 = 9
6 + 9 = 15,
but the output was still 112.
It seems to apply only the operation I gave as seqop, here _ * _.length.
Could you please explain or correct me where I went wrong?

aggregate should only be used with operations that are associative and commutative. Let's look at the signature of the function:
def aggregate[B](z: ⇒ B)(seqop: (B, A) ⇒ B, combop: (B, B) ⇒ B): B
B can be seen as an accumulator (and will be your output). You give an initial accumulator value z; the first function says how to add a value of type A to this accumulator, and the second says how to merge two accumulators. Scala "chooses" a way to aggregate your collection, but if your aggregation is not associative and commutative the output is not deterministic, because the order matters. Look at this example:
val l = List(1, 2, 3, 4)
l.aggregate(0)(_ + _, _ * _)
If we create one accumulator and then aggregate all the values, we get 0 + 1 + 2 + 3 + 4 = 10, but if we decide to parallelize the process by splitting the list into halves, we could get (0 + 1 + 2) * (0 + 3 + 4) = 21.
What actually happens is that for a List, aggregate is the same as foldLeft, which explains why changing your second function didn't change the output. Where aggregate becomes useful is in Spark or other distributed environments, where the folding can be done on each partition independently and the partial results are then combined with the second function.
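
To make this concrete for the list in the question, here is a minimal sketch. It assumes, as stated above, that aggregate on a List behaves like foldLeft with seqop, and it simulates a two-way split by hand; the names seqop, sequential, partials and so on are illustrative only.

```scala
val words = List("This", "is", "an", "example")

// seqop from the question: multiply the accumulator by the word length
val seqop = (acc: Int, w: String) => acc * w.length

// What a sequential List effectively does: one accumulator, left to right
val sequential = words.foldLeft(1)(seqop) // ((((1 * 4) * 2) * 2) * 7) = 112

// Hand-made simulation of two partitions, each folded on its own
val (left, right) = words.splitAt(2)
val partials = List(left, right).map(_.foldLeft(1)(seqop)) // List(8, 14)

// combop only matters once there are several partial results to merge
val combinedWithTimes = partials.reduce(_ * _) // 8 * 14 = 112
val combinedWithPlus = partials.reduce(_ + _)  // 8 + 14 = 22
```

With a single accumulator there is nothing to merge, so combop never runs; this is why both versions in the question return 112.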

Related

How does the fold action work in Spark?

Below I have a Scala example of a Spark fold action:
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
rdd1.fold(5)(_ + _)
This produces the output 35. Can somebody explain in detail how this output gets computed?

Taken from the Scaladocs here (emphasis mine):
@param zeroValue the initial value for the accumulated result of each partition for the op operator, and also the initial value for the combine results from different partitions for the op operator - this will typically be the neutral element (e.g. Nil for list concatenation or 0 for summation)
The zeroValue is in your case added four times (one for each partition, plus one when combining the results from the partitions). So the result is:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5) + 5 // (extra one for combining results)

zeroValue is added once for each partition and should be a neutral element; in the case of + it should be 0. The exact result will depend on the number of partitions, but the per-partition part is equivalent to:
rdd1.mapPartitions(iter => Iterator(iter.foldLeft(zeroValue)(_ + _))).reduce(_ + _)
so:
val rdd1 = sc.parallelize(List(1,2,3,4,5),3)
distributes data as:
scala> rdd1.glom.collect
res1: Array[Array[Int]] = Array(Array(1), Array(2, 3), Array(4, 5))
and the whole expression is equivalent to:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5)
plus 5 for jobResult, which gives 30 + 5 = 35.
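
The arithmetic above can be checked without a cluster. The following is a plain-Scala sketch, not Spark code: simulateFold is a hypothetical helper that folds each hand-written "partition" with the zero value and then folds the partial results with the zero value once more.

```scala
// Hypothetical helper that mimics the two folding stages described above
def simulateFold(partitions: List[List[Int]], zeroValue: Int)(op: (Int, Int) => Int): Int = {
  val perPartition = partitions.map(_.foldLeft(zeroValue)(op)) // one fold per partition
  perPartition.foldLeft(zeroValue)(op)                         // final combine, zero applied again
}

// Same layout as the glom output: Array(1), Array(2, 3), Array(4, 5)
val threePartitions = List(List(1), List(2, 3), List(4, 5))

val withFive = simulateFold(threePartitions, 5)(_ + _)               // 6 + 10 + 14 + 5 = 35
val withZero = simulateFold(threePartitions, 0)(_ + _)               // 15
val onePartition = simulateFold(List(List(1, 2, 3, 4, 5)), 5)(_ + _) // 20 + 5 = 25
```

Only the neutral element 0 gives a result that is independent of how the data is partitioned.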

You know that Spark RDDs perform distributed computations.
So, this line here,
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
tells Spark to split this RDD into 3 partitions, which lets it process the partitions independently and in parallel.
Now, this line here,
rdd1.fold(5)(_ + _)
tells Spark to fold each of those partitions using 5 as the initial value, and then to fold the 3 partition results again with 5 as the initial value.
A plain Scala equivalent can be written as,
val list = List(1, 2, 3, 4, 5)
val listOfList = list.grouped(2).toList
val listOfFolds = listOfList.map(l => l.fold(5)(_ + _))
val fold = listOfFolds.fold(5)(_ + _)
So... if you are using fold on RDDs you need to provide a zero value.
But then you will ask: why or when would someone use fold instead of reduce?
Your confusion lies in your perception of the zero value. This zero value for an RDD[T] does not depend only on the type T but also on the nature of the computation. So your zero value does not need to be 0.
Let's consider a simple example where we want to calculate "the largest number greater than 15, or else 15" in our RDD.
Can we do that using reduce alone? No. But we can do it using fold, because max gives the same result no matter how many times 15 is applied:
val n15GT15 = rdd1.fold(15)({ case (acc, i) => Math.max(acc, i) })
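
As a plain-Scala sketch of why 15 is a safe zero value here (again simulating partitions by hand, with the hypothetical helper foldPartitions rather than a real RDD): applying max with 15 several times has the same effect as applying it once, so the number of partitions does not change the result.

```scala
// Hypothetical stand-in for the per-partition fold followed by the final combine
def foldPartitions(partitions: List[List[Int]], zeroValue: Int)(op: (Int, Int) => Int): Int =
  partitions.map(_.foldLeft(zeroValue)(op)).foldLeft(zeroValue)(op)

val maxOp = (a: Int, b: Int) => math.max(a, b)

// No element exceeds 15, so the zero value wins
val allSmall = foldPartitions(List(List(1), List(2, 3), List(4, 5)), 15)(maxOp) // 15

// Some elements exceed 15, so the largest of them wins
val someLarge = foldPartitions(List(List(1, 20), List(3), List(42)), 15)(maxOp) // 42

// The same data in a single partition gives the same answer
val singlePartition = foldPartitions(List(List(1, 20, 3, 42)), 15)(maxOp) // 42
```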

How to generate a list in Scala, where each item depends on the preceding item

Say, I have a recursive rule:
f(0) = 2
f(n) = f(n-1) * 3 - 2
I need to generate a list for n ∈ [0, 10].
If I was interested in f(10), I could use foldLeft like this:
(1 to 10).foldLeft(2)((z, _) => z * 3 - 2)
I want to achieve the following in a concise and functional style:
import scala.collection.mutable.ListBuffer
val list = new ListBuffer[Int]
list += 2
(1 to 10).foreach { _ =>
  list += list.last * 3 - 2
}
What's the solution?

You can use a Stream to generate this list lazily and functionally:
val stream: Stream[Int] = {
  def next(i: Int): Stream[Int] = {
    val n = i * 3 - 2
    n #:: next(n)
  }
  2 #:: next(2)
}
println(stream.take(11).toList)
// prints List(2, 4, 10, 28, 82, 244, 730, 2188, 6562, 19684, 59050)

One of several possible approaches is to use scanLeft, as follows:
(1 to 10).scanLeft(2)( (acc,_) => acc*3-2)
This applies the function to the latest (accumulated) result.
Update
Also consider this Iterator
val f = Iterator.iterate(2)(_*3-2)
and so
(1 to 10).map(_ => f.next)
For a large number of iterations, the initial value 2: Int can be replaced with BigInt(2) to avoid overflow, for instance in
(1 to 100).map(_ => f.next)
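
A short sketch of both ideas, with illustrative names. List.iterate builds the eleven values for n in [0, 10] directly, and starting from BigInt(2) keeps the arithmetic exact for larger n. Since f(n) - 1 = 3 * (f(n-1) - 1), the rule has the closed form f(n) = 3^n + 1, which the usage note below relies on.

```scala
// f(0) = 2, f(n) = f(n-1) * 3 - 2, for n in [0, 10]
val firstEleven: List[Int] = List.iterate(2, 11)(_ * 3 - 2)
// List(2, 4, 10, 28, 82, 244, 730, 2188, 6562, 19684, 59050)

// Same rule with BigInt, so f(100) does not overflow
val upTo100: List[BigInt] =
  Iterator.iterate(BigInt(2))(_ * 3 - 2).take(101).toList

// Closed form used as a cross-check: f(n) = 3^n + 1
val f100ClosedForm: BigInt = BigInt(3).pow(100) + 1
```

Here upTo100.last and f100ClosedForm should be equal, whereas the Int version would wrap around long before n = 100.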

Tracing execution of calculation of Fibonacci using Scala Streams

I'm a functional programming/Scala newbie. I have been trying to get my head around the following code snippet and the output it produces.
def fib: Stream[Int] = {
  Stream.cons(1,
    Stream.cons(2,
      (fib zip fib.tail) map { case (x, y) => println("%s + %s".format(x, y)); x + y }))
}
Output Trace:
scala> fib take 4 foreach println
1
2
1 + 2
3
1 + 2 <-- Why this ?????
2 + 3
5
I do not understand why 1 + 2 is evaluated again for the calculation of the result 5.
In theory, I do understand that def should force recalculation of fib, but I'm not able to locate where in the execution trace this could happen.
I would like to walk you through my understanding.
Output (my understanding):
1
This is the head, trivial
2
This is the tail of the first Cons in Cons( 1, Cons( 2, fn ) ). Trivial.
1 + 2
(fib zip fib.tail) map {case (x, y) => println("%s + %s".format(x, y)); x + y}))
first element of fib is 1
first element of fib.tail is 2
Hence 1 + 2 is printed.
The zip operation on the Stream does the following
Cons( ( this.head, that.head), this.tail zip that.tail ) # this is fib and that is fib.tail. Also remember that this.tail starts from 2 and that.tail would start from 3. This new Stream forms an input to the map operation.
The map operation does the following
cons(f(head), tail map f ) # In this case tail is a stream defined in the previous step and it's not evaluated.
So, in the next iteration when tail map f is evaluated shouldn't just 2 + 3 be printed ? I don't understand why 1 + 2 is first printed
:( :( :(
Is there something obvious I'm missing ?

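The repeated 1 + 2 appears because fib is a def: every reference to fib inside its own body builds a new stream, so elements already computed for one copy are not reused by another. A formulation of Fibonacci proposed in https://stackoverflow.com/a/20737241/3189923 defines the stream as a val instead, so each element is computed only once. Here it is with println added for tracing execution: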
val fibs: Stream[Int] = 0 #:: fibs.scanLeft(1)((a, b) => {
  println(s"$a + $b = ${a + b}")
  a + b
})
Then, for instance,
scala> fibs(7)
1 + 0 = 1
1 + 1 = 2
2 + 1 = 3
3 + 2 = 5
5 + 3 = 8
8 + 5 = 13
res38: Int = 13
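
To see the difference between a def and a memoized value for the stream in the question, the following sketch counts how often the addition runs in each case. The names fibDef, fibVal, defAdds and valAdds are made up for illustration, and Stream is kept to match the question (on Scala 2.13 and later it is deprecated in favour of LazyList, whose evaluation details differ).

```scala
// Counters are illustrative only: they record how often the addition runs
var defAdds = 0
def fibDef: Stream[Int] =
  Stream.cons(1, Stream.cons(2,
    (fibDef zip fibDef.tail).map { case (x, y) => defAdds += 1; x + y }))

var valAdds = 0
lazy val fibVal: Stream[Int] =
  Stream.cons(1, Stream.cons(2,
    (fibVal zip fibVal.tail).map { case (x, y) => valAdds += 1; x + y }))

// Both produce the same first four elements
val fromDef = fibDef.take(4).toList
val fromVal = fibVal.take(4).toList
```

After forcing four elements from each, defAdds should be larger than valAdds: each call to fibDef starts an unevaluated copy that has to redo earlier additions (the second 1 + 2 in the question's trace), while fibVal refers back to the one stream whose cells are already computed.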

Understanding the scala substitution model through the use of sumInts method

I'm doing a Scala course and one of the examples given is the sumInts function, which is defined as:
def sumInts(a: Int, b: Int) : Int =
if(a > b) 0
else a + sumInts(a + 1 , b)
I've tried to understand this function better by printing some values as it is being evaluated:
class SumInts {
  def sumInts(a: Int, b: Int): Int =
    if (a > b) 0
    else {
      println(s"$a + sumInts(${a + 1} , $b)")
      val res1 = sumInts(a + 1, b) // result of the recursive call
      val res2 = a
      val res3 = res1 + res2
      println(s"res1 is : $res1, res2 is $res2, res3 is $res3")
      res3
    }
}
So the code:
object SumIntsMain {
  def main(args: Array[String]): Unit = {
    println(new SumInts().sumInts(3, 6))
  }
}
returns the output:
3 + sumInts(4 , 6)
4 + sumInts(5 , 6)
5 + sumInts(6 , 6)
6 + sumInts(7 , 6)
res1 is : 0, res2 is 6, res3 is 6
res1 is : 6, res2 is 5, res3 is 11
res1 is : 11, res2 is 4, res3 is 15
res1 is : 15, res2 is 3, res3 is 18
18
Can someone explain how these values are computed? I've tried printing all of the created variables, but I'm still confused.

manual-human-tracer on:
return sumInts(3, 6) | a = 3, b = 6
3 > 6 ? NO
return 3 + sumInts(3 + 1, 6) | a = 4, b = 6
4 > 6 ? NO
return 3 + (4 + sumInts(4 + 1, 6)) | a = 5, b = 6
5 > 6 ? NO
return 3 + (4 + (5 + sumInts(5 + 1, 6))) | a = 6, b = 6
6 > 6 ? NO
return 3 + (4 + (5 + (6 + sumInts(6 + 1, 6)))) | a = 7, b = 6
7 > 6 ? YEEEEES (return 0)
return 3 + (4 + (5 + (6 + 0))) = return 18.
manual-human-tracer off.

To understand what recursive code does, it's not necessary to analyze the recursion tree. In fact, I believe it's often just confusing.
Pretending it works
Let's think about what we're trying to do: We want to sum all integers starting at a until some integer b.
a + sumInts(a + 1 , b)
Let us just pretend that sumInts(a + 1, b) actually does what we want it to: Summing the integers from a + 1 to b. If we accept this as truth, it's quite clear that our function will handle the larger problem, from a to b correctly. Because clearly, all that is missing from the sum is the additional term a, which is simply added. We conclude that it must work correctly.
A foundation: The base case
However, this sumInts() must be built on something: The base case, where no recursion is involved.
if(a > b) 0
Looking closely at our recursive call, we can see that it makes certain assumptions: we expect a to be less than or equal to b. This implies that the sum will look like this: a + (a + 1) + ... + (b - 1) + b. If a is bigger than b, the sum has no terms and naturally evaluates to 0.
Making sure it works
Since sumInts() always increases a by one in the recursive call, we are guaranteed to hit the base case at some point.
Noticing further that sumInts(b, b) will be called eventually, we can now verify that the code works: since b is not greater than itself, the second case will be invoked: b + sumInts(b + 1, b). From here, it is obvious that this will evaluate to b + 0, which means our algorithm works correctly for all values.

You mentioned the substitution model, so let's apply it to your sumInts method:
We start by calling sumInts(3,4) (you've used 6 as the second argument, but I chose 4, so I can type less), so let's substitute 3 for a and 4 for b in the definition of sumInts. This gives us:
if(3 > 4) 0
else 3 + sumInts(3 + 1, 4)
So, what will the result of this be? Well, 3 > 4 is clearly false, so the end result will be equal to the else clause, i.e. 3 plus the result of sumInts(4, 4) (4 being the result of 3+1). Now we need to know what the result of sumInts(4, 4) will be. For that we can substitute again (this time substituting 4 for a and b):
if(4 > 4) 0
else 4 + sumInts(4 + 1, 4)
Okay, so the result of sumInts(4,4) will be 4 plus the result of sumInts(5,4). So what's sumInts(5,4)? To the substitutionator!
if(5 > 4) 0
else 5 + sumInts(5 + 1, 4)
This time the if condition is true, so the result of sumInts(5,4) is 0. So now we know that the result of sumInts(4,4) must be 4 + 0 which is 4. And thus the result of sumInts(3,4) must be 3 + 4, which is 7.
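
The substitution steps can also be generated mechanically. This is a small sketch in which expansion is a hypothetical helper (not part of the course material) that mirrors sumInts but builds the substituted expression as text instead of evaluating it.

```scala
def sumInts(a: Int, b: Int): Int =
  if (a > b) 0
  else a + sumInts(a + 1, b)

// Same shape as sumInts, but returns the fully substituted expression as a String
def expansion(a: Int, b: Int): String =
  if (a > b) "0"
  else s"($a + ${expansion(a + 1, b)})"

val small = expansion(3, 4)      // "(3 + (4 + 0))"
val original = expansion(3, 6)   // "(3 + (4 + (5 + (6 + 0))))"
val smallSum = sumInts(3, 4)     // 7
val originalSum = sumInts(3, 6)  // 18
```

Reading the expanded string from the innermost parenthesis outwards gives the same order of additions as the res1/res3 lines in the question's output.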

Difference between fold and foldLeft or foldRight?

NOTE: I am on Scala 2.8—can that be a problem?
Why can't I use the fold function the same way as foldLeft or foldRight?
In the Set scaladoc it says that:
The result of folding may only be a supertype of this parallel collection's type parameter T.
But I see no type parameter T in the function signature:
def fold [A1 >: A] (z: A1)(op: (A1, A1) ⇒ A1): A1
What is the difference between the foldLeft-Right and fold, and how do I use the latter?
EDIT: For example how would I write a fold to add all elements in a list? With foldLeft it would be:
val foo = List(1, 2, 3)
foo.foldLeft(0)(_ + _)
// now try fold:
foo.fold(0)(_ + _)
<console>:7: error: value fold is not a member of List[Int]
foo.fold(0)(_ + _)
^

Short answer:
foldRight associates to the right. I.e. elements will be accumulated in right-to-left order:
List(a,b,c).foldRight(z)(f) = f(a, f(b, f(c, z)))
foldLeft associates to the left. I.e. an accumulator will be initialized and elements will be added to the accumulator in left-to-right order:
List(a,b,c).foldLeft(z)(f) = f(f(f(z, a), b), c)
fold requires the operation to be associative, because the order in which the elements are combined is not defined. I.e. the arguments to fold should form a monoid, with z as its neutral element.
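
The two nesting shapes can be made visible by folding with a function that builds a string, and the difference in results shows up as soon as the operation is not associative. A minimal sketch with illustrative names:

```scala
val letters = List("a", "b", "c")

// Build the call structure as text instead of computing a value
val leftShape = letters.foldLeft("z")((acc, x) => s"f($acc, $x)")
// "f(f(f(z, a), b), c)"
val rightShape = letters.foldRight("z")((x, acc) => s"f($x, $acc)")
// "f(a, f(b, f(c, z)))"

// Subtraction is not associative, so the two folds disagree
val nums = List(1, 2, 3)
val leftDiff = nums.foldLeft(0)(_ - _)   // ((0 - 1) - 2) - 3 = -6
val rightDiff = nums.foldRight(0)(_ - _) // 1 - (2 - (3 - 0)) = 2
```

For an associative operation with a neutral z, such as addition with 0, both shapes give the same value, which is what lets fold leave the order unspecified.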

fold, contrary to foldRight and foldLeft, does not offer any guarantee about the order in which the elements of the collection will be processed. You'll probably want to use fold, with its more constrained signature, with parallel collections, where the lack of a guaranteed processing order helps the parallel collection implement folding in a parallel way. The reason for changing the signature is similar: with the additional constraints, it's easier to make a parallel fold.

You're right about the old version of Scala being a problem. If you look at the scaladoc page for Scala 2.8.1, you'll see no fold defined there (which is consistent with your error message). Apparently, fold was introduced in Scala 2.9.

For your particular example you would code it the same way you would with foldLeft.
val ns = List(1, 2, 3, 4)
val s0 = ns.foldLeft (0) (_+_) //10
val s1 = ns.fold (0) (_+_) //10
assert(s0 == s1)

I agree with the other answers. I thought I'd give a simple illustrative example:
object MyClass {
  def main(args: Array[String]): Unit = {
    val numbers = List(5, 4, 8, 6, 2)
    val a = numbers.fold(0) { (z, i) =>
      println("fold val1 " + z + " val2 " + i)
      z + i
    }
    println(a)
    val b = numbers.foldLeft(0) { (z, i) =>
      println("foldleft val1 " + z + " val2 " + i)
      z + i
    }
    println(b)
    // foldRight passes the element first and the accumulator second
    val c = numbers.foldRight(0) { (i, z) =>
      println("fold right val1 " + i + " val2 " + z)
      i + z
    }
    println(c)
  }
}
The result is self-explanatory (note that for foldRight, val1 is the element and val2 is the accumulator):
fold val1 0 val2 5
fold val1 5 val2 4
fold val1 9 val2 8
fold val1 17 val2 6
fold val1 23 val2 2
25
foldleft val1 0 val2 5
foldleft val1 5 val2 4
foldleft val1 9 val2 8
foldleft val1 17 val2 6
foldleft val1 23 val2 2
25
fold right val1 2 val2 0
fold right val1 6 val2 2
fold right val1 8 val2 8
fold right val1 4 val2 16
fold right val1 5 val2 20
25

There are two ways to solve problems: iterative and recursive. Let's understand them through a simple example: a function that sums the numbers up to a given number.
For example, if I give 5 as input, I should get 15 as output, as shown below.
Input: 5
Output: (1+2+3+4+5) = 15
Iterative solution
Iterate from 1 to 5 and add up each element.
def sumNumber(num: Int): Long = {
  var sum = 0L // Long accumulator, so the sum matches the return type and does not overflow Int
  for (i <- 1 to num) {
    sum += i
  }
  sum
}
Recursive solution
Break the bigger problem down into smaller problems and solve them.
def sumNumberRec(num: Int, sum: Long = 0): Long = {
  if (num == 0) {
    sum
  } else {
    val newNum = num - 1
    val newSum = sum + num // carry the running total into the next call
    sumNumberRec(newNum, newSum)
  }
}
foldLeft is an iterative solution.
foldRight is, conceptually, a recursive solution.
I am not sure whether they use memoization to improve the complexity.
So if you run foldRight and foldLeft on a small list, both will give you the result with similar performance.
However, if you run foldRight on a long list, it might throw a StackOverflowError (depending on the available stack size and on how foldRight is implemented for that collection in your Scala version).
In my test, foldLeft ran without error, whereas foldRight on the same list failed with an OutOfMemory error.
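
The stack concern comes from where the work happens relative to the recursive call. The sketch below uses made-up names (sumNaive, sumAcc) and does not reproduce the test described above; it only contrasts a recursion that must keep every frame alive with an accumulator version that the compiler can turn into a loop, which @tailrec asks it to verify.

```scala
import scala.annotation.tailrec

// Not tail-recursive: the addition happens after the recursive call returns,
// so every pending call keeps a stack frame (fine for small n only)
def sumNaive(n: Int): Long =
  if (n == 0) 0L else n + sumNaive(n - 1)

// Tail-recursive: the recursive call is the last action, so it compiles to a loop
@tailrec
def sumAcc(n: Int, acc: Long = 0L): Long =
  if (n == 0) acc else sumAcc(n - 1, acc + n)

val small = sumNaive(5)                             // 15
val large = sumAcc(1000000)                         // 500000500000
val viaFoldLeft = (1 to 1000000).foldLeft(0L)(_ + _) // same total, computed iteratively
```

sumNaive is deliberately called only with a small argument here; with a very large n it is the kind of function that can exhaust the stack.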

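fold() on a parallel collection processes the elements in parallel, so it does not guarantee the processing order,
whereas foldLeft and foldRight process the items sequentially, from left to right (in the case of foldLeft) or from right to left (in the case of foldRight).
Examples that sum a list (on Scala 2.13 and later, .par requires the separate scala-parallel-collections module):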
val numList = List(1, 2, 3, 4, 5)
val r1 = numList.par.fold(0)((acc, value) => {
println("adding accumulator=" + acc + ", value=" + value + " => " + (acc + value))
acc + value
})
println("fold(): " + r1)
println("#######################")
/*
* You can see from the output that
* fold processes the elements of a parallel collection in parallel,
* so it is a parallel operation, not a linear one.
*
* adding accumulator=0, value=4 => 4
* adding accumulator=0, value=3 => 3
* adding accumulator=0, value=1 => 1
* adding accumulator=0, value=5 => 5
* adding accumulator=4, value=5 => 9
* adding accumulator=0, value=2 => 2
* adding accumulator=3, value=9 => 12
* adding accumulator=1, value=2 => 3
* adding accumulator=3, value=12 => 15
* fold(): 15
*/
val r2 = numList.par.foldLeft(0)((acc, value) => {
println("adding accumulator=" + acc + ", value=" + value + " => " + (acc + value))
acc + value
})
println("foldLeft(): " + r2)
println("#######################")
/*
* You can see that foldLeft
* picks elements from left to right.
* This means foldLeft is a sequential operation.
*
* adding accumulator=0, value=1 => 1
* adding accumulator=1, value=2 => 3
* adding accumulator=3, value=3 => 6
* adding accumulator=6, value=4 => 10
* adding accumulator=10, value=5 => 15
* foldLeft(): 15
* #######################
*/
// --> Note: in foldRight the second argument is the accumulator.
val r3 = numList.par.foldRight(0)((value, acc) => {
println("adding value=" + value + ", acc=" + acc + " => " + (value + acc))
acc + value
})
println("foldRight(): " + r3)
println("#######################")
/*
* You can see that foldRight
* picks elements from right to left.
* This means foldRight is a sequential operation.
*
* adding value=5, acc=0 => 5
* adding value=4, acc=5 => 9
* adding value=3, acc=9 => 12
* adding value=2, acc=12 => 14
* adding value=1, acc=14 => 15
* foldRight(): 15
* #######################
*/