Is the Spark RDD.fold method buggy? [duplicate] - scala

This question already has an answer here:
Explanation of fold method of spark RDD
(1 answer)
Closed 6 years ago.
I found that Spark RDD.fold and Scala List.fold behave differently with the same input.
Scala 2.11.8
List(1, 2, 3, 4).fold(1)(_ + _) // res0: Int = 11
I think this is the correct output because 1 + (1 + 2 + 3 + 4) equals 11. But Spark RDD.fold looks buggy.
Spark 2.0.1 (not clustered)
sc.parallelize(List(1, 2, 3, 4)).fold(1)(_ + _) // res0: Int = 15
Although an RDD is not a simple collection, this result does not make sense. Is this a known bug or a normal result?

It is not buggy, you're just not using it the right way. zeroValue should be neutral, meaning it has to satisfy the following condition:
op(x, zeroValue) === op(zeroValue, x) === x
If op is + then the right choice is 0.
Why a restriction like this? If fold is to be executed in parallel, each chunk has to initialize its own zeroValue. In a more formal way you can think about a Monoid where:
op is equivalent to • (this is a simplification; in practice op in Spark should be commutative, not only associative).
zeroValue is equivalent to the identity element.
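For illustration, here is a minimal sketch (assuming a local SparkContext sc, as in the question) of how a neutral versus non-neutral zero value plays out:
// With a neutral zeroValue both folds agree:
List(1, 2, 3, 4).fold(0)(_ + _)                    // res: Int = 10
sc.parallelize(List(1, 2, 3, 4)).fold(0)(_ + _)    // res: Int = 10
// With zeroValue = 1 the zero is applied once per partition and once more
// when the partition results are merged; with 4 partitions:
// (1 + 1) + (1 + 2) + (1 + 3) + (1 + 4) + 1 = 15
sc.parallelize(List(1, 2, 3, 4), 4).fold(1)(_ + _) // res: Int = 15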


Can I use function composition to avoid the "temporary list" in scala?

On page 64 of FPiS ("Functional Programming in Scala") it is said:
List(1,2,3,4).map(_ + 10).filter(_ % 2 == 0).map(_ * 3)
"each transformation
will produce a temporary list that only ever gets used as input to the next transformation
and is then immediately discarded"
So can't the compiler or the library help avoid this?
If so, does this Haskell code also produce a temporary list?
map (*2) (map (+1) [1,2,3])
If it does, can I use function composition to avoid this?
map ((*2).(+1)) [1,2,3]
If function composition can avoid the temporary list in Haskell, can I use function composition to avoid the temporary list in Scala?
I know Scala uses the compose function to compose functions: https://www.geeksforgeeks.org/scala-function-composition/
So can I write this to avoid the temporary lists in Scala?
((map(x:Int=>x+10)) compose (filter(x=>x%2==0)) compose (map(x=>x*3)) (List(1,2,3,4))
(IDEA told me I can't)
Thanks!
The compiler is not supposed to. If you consider map fusion, it works nicely with pure functions:
List(1, 2, 3).map(_ + 1).map(_ * 10)
// can be fused to
List(1, 2, 3).map(x => (x + 1) * 10)
However, Scala is not a purely functional language, nor does it have any notion of purity that the compiler could track. For example, with side effects there is a difference in behavior:
List(1, 2, 3).map { i => println(i); i + 1 }.map { i => println(i); i * 10 }
// prints 1, 2, 3, 2, 3, 4
List(1, 2, 3).map { i =>
println(i)
val j = i + 1
println(j)
j * 10
}
// prints 1, 2, 2, 3, 3, 4
Another thing to note is that Scala's List is a strict collection - if you have a reference to a list, all of its elements are already allocated in memory. A Haskell list, on the contrary, is lazy (like most things in Haskell), so even if a temporary "list shell" is created, its elements are kept unevaluated until needed. That also allows Haskell lists to be infinite (you can write [1..] for increasing numbers).
The closest Scala counterpart to Haskell list is LazyList, which doesn't evaluate its elements until requested, and then caches them. So doing
LazyList(1,2,3,4).map(_ + 10).filter(_ % 2 == 0).map(_ * 3)
would allocate intermediate LazyList instances, but not calculate/allocate any elements in them until they are requested from the final list. LazyList is also suitable for infinite collections (LazyList.from(1) is analogous to the Haskell [1..] example above, except its elements are Ints).
Here, in fact, doing map with side effects twice or fusing it by hand will make no difference.
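A small sketch illustrating that (the printed order follows from LazyList evaluating elements one at a time through the whole chain):
val ll = LazyList(1, 2, 3)
  .map { i => println(i); i + 1 }
  .map { i => println(i); i * 10 }
// nothing is printed yet - the elements are still unevaluated
ll.toList
// prints 1, 2, 2, 3, 3, 4 - the same interleaving as the hand-fused version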
You can switch any collection to be "lazy" by calling .view, or just work with iterators by calling .iterator - they have largely the same API as any collection - and then go back to a concrete collection by calling .to(Collection), so something like:
List(1,2,3,4).view.map(_ + 10).filter(_ % 2 == 0).map(_ * 3).to(List)
would make a List without any intermediaries. The catch is that it's not necessarily faster (though it is usually more memory-efficient).
You can avoid these temporary lists by using views:
https://docs.scala-lang.org/overviews/collections-2.13/views.html
It's also possible to use function composition to express the function that you asked about:
((_: List[Int]).map(_ + 10) andThen (_: List[Int]).filter(_ % 2 == 0) andThen (_: List[Int]).map(_ * 3))(List(1, 2, 3, 4))
But this will not avoid the creation of temporary lists, and due to Scala's limited type inference, it's usually more trouble than it's worth, because you often end up having to annotate types explicitly.
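If you do want the composition style without repeating the annotations, one option (a sketch, not from the original answer) is to name the stages first:
val step1: List[Int] => List[Int] = _.map(_ + 10)
val step2: List[Int] => List[Int] = _.filter(_ % 2 == 0)
val step3: List[Int] => List[Int] = _.map(_ * 3)
(step1 andThen step2 andThen step3)(List(1, 2, 3, 4)) // List(36, 42)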

scala - convert a TupleN of Long datatype to Array[Long] [duplicate]

This question already has answers here:
Are there any methods included in Scala to convert tuples to lists?
(2 answers)
Closed 3 years ago.
I get the counts of DataFrame columns from Spark into a Scala variable as below:
scala> col_counts
res38: (Long, Long, Long) = (3,3,0)
scala>
Now, I want to convert this to Array(3,3,0). I'm doing it in a roundabout way like:
scala> col_counts.toString.replaceAll("""\)|\(""","").split(",")
res47: Array[String] = Array(3, 3, 0)
scala>
But it looks ugly. Is there an elegant way of getting it? I'm looking for a generic solution to convert any tuple of n Longs to an Array.
You can do this:
val tuple: (Long, Long, Long) = (3, 3, 0)
tuple.productIterator.toArray
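Note that productIterator is untyped, so the result above is an Array[Any]. If you specifically need an Array[Long], one option (a sketch that assumes every element really is a Long) is to cast each element:
val tuple: (Long, Long, Long) = (3L, 3L, 0L)
// productIterator yields an Iterator[Any]; cast each element to recover the element type
val longs: Array[Long] = tuple.productIterator.map(_.asInstanceOf[Long]).toArray
// longs: Array[Long] = Array(3, 3, 0)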

How does the fold action work in Spark?

Below I have a Scala example of a Spark fold action:
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
rdd1.fold(5)(_ + _)
This produces the output 35. Can somebody explain in detail how this output gets computed?
Taken from the Scaladocs here (emphasis mine):
@param zeroValue the initial value for the accumulated result of each partition for the op operator, and also the initial value for the combine results from different partitions for the op operator - this will typically be the neutral element (e.g. Nil for list concatenation or 0 for summation)
The zeroValue is in your case added four times (once for each of the three partitions, plus once more when combining the results from the partitions). So the result is:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5) + 5 = 35 // (the extra 5 comes from combining the partition results)
zeroValue is added once for each partition and should be a neutral element - in the case of + it should be 0. The exact result will depend on the number of partitions, but it is equivalent to:
rdd1.mapPartitions(iter => Iterator(iter.foldLeft(zeroValue)(_ + _))).reduce(_ + _)
so:
val rdd1 = sc.parallelize(List(1,2,3,4,5),3)
distributes data as:
scala> rdd1.glom.collect
res1: Array[Array[Int]] = Array(Array(1), Array(2, 3), Array(4, 5))
and the whole expression is equivalent to:
(5 + 1) + (5 + 2 + 3) + (5 + 4 + 5)
plus 5 for the jobResult, which gives 35.
You know that Spark RDDs perform distributed computations.
So, this line here,
val rdd1 = sc.parallelize(List(1,2,3,4,5), 3)
tells Spark that it needs to create 3 partitions in this RDD, which will enable it to run the computation as 3 independent tasks in parallel.
Now, this line here,
rdd1.fold(5)(_ + _)
tells Spark to fold each of those partitions using 5 as the initial value, and then fold those partition results again, also with 5 as the initial value.
A plain Scala equivalent can be written as:
val list = List(1, 2, 3, 4, 5)
val listOfList = list.grouped(2).toList
val listOfFolds = listOfList.map(l => l.fold(5)(_ + _))
val fold = listOfFolds.fold(5)(_ + _)
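Walking through the intermediate values of that sketch:
// listOfList:  List(List(1, 2), List(3, 4), List(5))
// listOfFolds: List(5 + 1 + 2, 5 + 3 + 4, 5 + 5) = List(8, 12, 10)
// fold:        5 + 8 + 12 + 10 = 35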
So... if you are using fold on RDDs you need to provide a zero value.
But then you will ask - why or when would someone use fold instead of reduce?
Your confusion lies in your perception of the zero value. The thing is that this zero value for RDD[T] does not depend only on our type T but also on the nature of the computation. So your zero value does not need to be 0.
Let's consider a simple example where we want to calculate the largest number greater than 15 in our RDD, or 15 if there is none.
Can we do that using reduce? The answer is NO. But we can do it using fold.
val n15GT15 = rdd1.fold(15)({ case (acc, i) => Math.max(acc, i) })
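With the rdd1 from above (elements 1 to 5, all below 15) this evaluates to 15, while an RDD containing larger values yields its maximum. A quick sketch (the second RDD is made up for illustration):
rdd1.fold(15) { case (acc, i) => Math.max(acc, i) } // 15, since no element exceeds 15
val rdd2 = sc.parallelize(List(3, 42, 7, 19), 2)
rdd2.fold(15) { case (acc, i) => Math.max(acc, i) } // 42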

Is there a way in Scala to print the size of a list in the middle of a function chain? [duplicate]

This question already has answers here:
how to keep return value when logging in scala
(6 answers)
Closed 5 years ago.
Of course, I could break this code up by extracting the list after the filter or the map function and printing the size. But for the sake of learning I am wondering whether there is a nicer solution where I could keep the function chaining.
listOfSomething.filter(condition).map(e => e.mapToSomeOther).mkString(DELIMITER)
There are, AFAIK, no methods on immutable sequences that have side effects, but you can enrich the API with side-effecting methods (I don't recommend this) like so:
scala> implicit class PrintSize[T](xs: List[T]){ def printSize = { println(xs.size); xs} }
defined class PrintSize
scala> List(1, 2, 3, 4, 5, 6, 7).filter(_ > 3).printSize.map(_ * 2).mkString(",")
4
res2: String = 8,10,12,14
Your first suggestion about extracting temporary results is much better, because then you can do your side effects before or after the entire computation.
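As a side note, on Scala 2.13+ you can get the same chaining without a custom implicit class by using tap from scala.util.chaining (a sketch along the same lines as the answer above):
import scala.util.chaining._
List(1, 2, 3, 4, 5, 6, 7)
  .filter(_ > 3)
  .tap(xs => println(xs.size)) // prints 4, returns the list unchanged
  .map(_ * 2)
  .mkString(",")               // "8,10,12,14"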

Difference between a function and a method in terms of Functional programming [duplicate]

This question already has answers here:
What's the difference between a method and a function?
(41 answers)
Difference between method and function in Scala
(12 answers)
Closed 8 years ago.
Can anyone please explain the difference between a function and a method from a functional programming perspective?
I am asking this question with a case study of Scala.
We have two things noted down, i.e. a function and a method, which do the same thing.
Method
def add(x:Int, y:Int):Int = x + y
Function
val addFunc: (Int, Int) => Int = (x, y) => x + y
We can see that both of them do the same thing, i.e. add two integers. But we get some additional properties with a function:
As this is a function, it will be treated as a first-class value, like a Double or a Float, and can be passed as a value to any other function or method.
We can store this function within a data structure such as a linked list or a HashMap.
This preserves referential transparency from the functional programming world, i.e. I can guarantee that calling this function N times will always give the same result, as it does not have any side effects.
It can be passed to a higher-order function such as map or reduce and used to do any number of things.
It clearly specifies its type, i.e. (Int, Int) => Int.
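For illustration, a small sketch of those points (the names are made up for the example):
// the function value can be stored in a data structure...
val ops: Map[String, (Int, Int) => Int] = Map("add" -> addFunc)
// ...and passed around like any other value, e.g. used inside a higher-order function:
val pairs = List((1, 2), (3, 4))
val sums = pairs.map { case (a, b) => addFunc(a, b) } // List(3, 7)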
Can anyone explain in detail some other benefits that a function can provide as compared to a method from an imperative programming language?
There aren't many other advantages, but the fact that in functional languages functions are first class citizens (while methods aren't) is a big deal.
If a function is passable to other functions, you get the possibility to create higher order functions like map or filter or reduce, which are much more concise than other non-functional approaches.
For example, let's sum the squares of all the odd numbers in a list:
In a non-functional language you get something like this (note: this is pseudocode):
List[Int] list = new List(1, 2, 3, 4, 5, 6, 7, 8, 9);
Int acc = 0;
for (Int x : list) {
    if (x % 2 != 0) {
        acc += Math.Pow(x, 2);
    }
}
in functional Scala code you have:
val list = List(1, 2, 3, 4, 5, 6, 7, 8, 9)
val acc = list.filter(_ % 2 != 0).map(x => x * x).reduce(_ + _)
which is far more concise even in just this toy example. See how we are passing functions (odd, square, sum) to other functions (filter, map, reduce).
Note that this doesn't give you new powers: you can't do things that are impossible to do in non-functional ways, it's just easier to do ;)