Scala return value calculated in foreach - scala

I am new to Scala and Spark and trying to understand a few basic things here.
Spark version used: 1.5.
Why does the value of sum not get updated in the foreach loop below?
var sum = 1
df.select("column1").distinct().foreach(row => {
  sum = sum + 1
})
println("SUM = " + sum)
--> SUM = 1
I am trying to understand what the scope of the variable referred to in the foreach is. What if I need to do some math inside and get the result outside the loop?
My use case is to get the unique values inside the loop and append them to a List of Strings.

The way you are reasoning about the program is wrong. foreach is executed independently on each executor and modifies its own copy of sum. There is no global shared state here. Just count the values directly:
df.select("column1").distinct.count
If you really want to handle this manually you'll need some type of reduce:
df.select("column1").distinct.rdd.map(_ => 1L).reduce(_ + _)

Read the Programming Guide; it has a section devoted to this: Understanding Closures. If you actually need to collect some state, you can use Accumulators (but note that you can't read the value from the executor nodes, only add to it). Try doing without them first: think in terms of the available transformations instead of mutating state.
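If counting through mutation really is what you want, an Accumulator does it safely. Here is a minimal sketch, assuming a SparkContext named sc and the same df as above, using the classic Spark 1.x accumulator API (newer versions use sc.longAccumulator); the last line also covers the original use case of collecting the unique values into a List of Strings:
// Accumulator lives on the driver; executors can only add to it
val sumAcc = sc.accumulator(0L)
df.select("column1").distinct.foreach(_ => sumAcc += 1L)
println("SUM = " + sumAcc.value)  // read the value on the driver only

// Original use case: collect the distinct values into a List[String]
val uniques: List[String] = df.select("column1").distinct.collect().map(_.getString(0)).toList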

Related

Is it Scala style to use a for loop in Scala/Spark?

I have heard that it is good practice in Scala to eliminate for loops and do things "the Scala way". I even found a Scala style checker at http://www.scalastyle.org. Are for loops a no-no in Scala? In a course at https://www.udemy.com/course/apache-spark-with-scala-hands-on-with-big-data/learn/lecture/5363798#overview I found this example, which makes me think that for loops are okay to use, but using Scala format and syntax of course, on a single line and not like traditional Java for loops spread over multiple lines of code. See this example I found in that Udemy course:
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
for (ship <- shipList) {println(ship)}
That for loop prints this result, as expected:
Enterprise
Defiant
Voyager
Deep Space Nine
I was wondering if using for as in the example above is acceptable Scala style code, or if it is a no-no and why. Thank you!
There is no problem with this for loop, but you can use functions from the List API to do the same work in a more functional way.
e.g. instead of using
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
for (ship <- shipList) {println(ship)}
You can use
val shipList = List("Enterprise", "Defiant", "Voyager", "Deep Space Nine")
shipList.foreach(element => println(element))
or
shipList.foreach(println)
You can use for loops in Scala; there is no problem with that. The difference is that this for loop (without a yield) returns Unit, i.e. no useful value, so you need a mutable variable to get anything out of it. Scala gives preference to working with immutable values.
In your example you print messages to the console, i.e. you rely on a side effect, breaking referential transparency: you depend on an IO operation to extract a value, or you have to mutate a variable in the enclosing scope, which may be accessed by another thread or another concurrent task, so there is no guarantee that the value you collect will be what you expect. All of these concerns relate to concurrent/parallel programming, and that is where Scala and the immutable style help.
To show the elements of a collection you can use a for loop, but if you want, say, to count the total number of characters, in Scala you do that with an expression like:
val chars = shipList.foldLeft(0)((a, b) => a + b.length)
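For comparison, the same count can also be written with a for expression, which, unlike the plain for loop above, does yield a value:
// for/yield is an expression: it produces a List[Int] of lengths, which we then sum
val chars = (for (ship <- shipList) yield ship.length).sum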
To sum up, most of the time the Scala code you will read uses an immutable style of programming, although not always, because Scala supports the other way of coding too; but it is unusual to find something written in a classic Java OOP style, mutating object instances and using getters and setters.

why is the map function inherently parallel?

I was reading the following presentation:
http://www.idt.mdh.se/kurser/DVA201/slides/parallel-4up.pdf
and the author claims that the map function is built very well for parallelism (specifically he supports his claim on page 3 or slides 9 and 10).
If one were given the problem of increasing each value of a list by +1, I can see how looping through the list imperatively would require an index value to change and hence cause potential race condition problems. But I'm curious how the map function better allows a programmer to successfully code in parallel.
Is it due to the way map is recursively defined? So each function call can be thrown to a different thread?
I'm hoping someone can provide some specifics, thanks!
The map function applies the same pure function to the n elements of a collection and aggregates the results. The order in which the function is applied to the members of the collection doesn't matter, because by definition the return value of the function depends purely on its input.
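For example, incrementing every element of a list (a minimal sketch): each application of the pure function depends only on its own input element, so the applications could run in any order, or concurrently, and still assemble the same result.
val xs = List(1, 2, 3, 4)
val ys = xs.map(_ + 1)  // List(2, 3, 4, 5), whatever order the elements are processed in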
The others already explained that the standard map implementation isn't parallel.
But in Scala, since you tagged it, you can get the parallel version as simply as
val list = ... // some list
list.par.map(x => ...) // instead of list.map(x => ...)
See also Parallel Collections Overview and documentation for ParIterable and other types in the scala.collection.parallel package.
You can find the implementation of the parallel map in https://github.com/scala/scala/blob/v2.12.1/src/library/scala/collection/parallel/ParIterableLike.scala, if you want (look for def map and class Map). It requires very non-trivial infrastructure and certainly isn't just taking the recursive definition of sequential map and parallelizing it.
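Just to illustrate the split-and-combine idea (not how the library actually does it), here is a naive sketch using plain Futures: the two halves of the list are mapped independently and the partial results are concatenated in order.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Naive "parallel map": split the input, map each half concurrently, then recombine
def parMapSketch[A, B](xs: List[A])(f: A => B): List[B] = {
  val (left, right) = xs.splitAt(xs.length / 2)
  val fLeft  = Future(left.map(f))
  val fRight = Future(right.map(f))
  Await.result(fLeft.zip(fRight).map { case (l, r) => l ++ r }, Duration.Inf)
}
This only works because f is applied to each element independently; a real implementation splits the work recursively and balances it across a thread pool.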
If one had defined map via a loop how would that break down?
The slides give F# parallel arrays as the example at the end, and at https://github.com/fsharp/fsharp/blob/master/src/fsharp/FSharp.Core/array.fs#L266 you can see that the non-parallel implementation there is a loop:
let inline map (mapping: 'T -> 'U) (array: 'T[]) =
    checkNonNull "array" array
    let res : 'U[] = Microsoft.FSharp.Primitives.Basics.Array.zeroCreateUnchecked array.Length
    for i = 0 to res.Length - 1 do
        res.[i] <- mapping array.[i]
    res

Count number of elements in a text or list using Spark

I know there are different ways to count the number of elements in a text or a list, but I am trying to understand why this one does not work. I am trying to write code equivalent to
A_RDD=sc.parallelize(['a', 1.2, []])
acc = sc.accumulator(0)
acc.value
A_RDD.foreach(lambda _: acc.add(1))
acc.value
Where the result is 3.
To do so I defined the following function called my_count(_), but I don't know how to get the result. A_RDD.foreach(my_count) does not do anything. I didn't receive any error either. What did I do wrong?
counter = 0
# function that counts elements
def my_count(_):
    global counter
    counter += 1

A_RDD.foreach(my_count)
The A_RDD.foreach(my_count) operation doesn't run in your local Python virtual machine; it runs on your remote executor nodes. The driver ships your my_count method to each of the executor nodes along with the counter variable, since the method refers to it. So each executor node gets its own copy of the counter variable, which is updated by the foreach, while the counter variable defined in your driver application is never incremented.
One easy but risky solution would be to collect the RDD on your driver and then compute the count, as below. This is risky because the entire RDD content is downloaded into the driver's memory, which may cause a MemoryError.
>>> len(A_RDD.collect())
3
So what if you were running locally and not on a cluster? In Spark/Scala this behaviour changes between local mode and a cluster: locally it would have the value you expect, but on a cluster it wouldn't; it would happen just as you describe... Does the same thing happen in Spark/Python? My guess is that it does.

recursive variable needs type

I have code where I want to update an RDD as below:
val xRDD = xRDD.zip(tempRDD)
This gave me the error: recursive value xRDD needs type
I want to maintain xRDD across iterations, modifying it with tempRDD in each iteration. How can I achieve this?
Thanks in advance.
The compiler is telling you that you're attempting to define a variable in terms of itself, i.e. to use it in its own definition. Put another way, you're trying to use something that doesn't exist yet in order to define it.
Edit:
If you have a list of actions that produce new RDDs that you'd like to zip together, perhaps you should look at a fold:
listMyActions.foldLeft(origRDD) { (rdd, f) =>
  val tempRDD = f(rdd)
  rdd.zip(tempRDD)
}
Don't forget that vals are immutable, which means you can't reassign something to a previously defined value. If you really want to do this, you can replace the val with a var, although that is not recommended. This question is more about a Scala feature than about Apache Spark. Besides, if you want more information you can consult the post Use of def, val and var in Scala.
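A minimal illustration of the difference, using a plain Int purely for the sake of the example:
val a = 1
// a = 2     // does not compile: reassignment to val
var b = 1
b = 2        // fine: a var can be reassigned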

How to correctly get the current loop count from an Iterator in Scala

I am looping over the lines of a CSV file to parse them. I want to identify the first line, since it's the header. What's the best way of doing this, instead of keeping a var counter?
var counter = 0
for (line <- lines) {
  println(CsvParser.parse(line, counter))
  counter += 1
}
I know there has got to be a better way to do this; I'm a newbie to Scala.
Try zipWithIndex:
for (line <- lines.zipWithIndex) {
  println(CsvParser.parse(line._1, line._2))
}
#tenshi suggested the following improvement with pattern matching:
for ((line, count) <- lines.zipWithIndex) {
  println(CsvParser.parse(line, count))
}
I totally agree with the accepted answer, but I have to point out something important that I initially planned to put in a simple comment.
It would be quite long, though, so let me post it as an additional answer.
It's perfectly true that the zip* methods are helpful for pairing up lists, but they have the drawback that they traverse the list in order to build the result.
A common recommendation is therefore to sequence the operations on the list through a view, so that all of them are combined and applied only when a result is actually required. "Producing a result" here means any operation whose return value isn't an Iterable, such as foreach.
Now, regarding the first answer: if lines is the list of lines of a very big file (or even an enumeratee over it), zipWithIndex will go through all of them and produce a table (an Iterable of tuples). Then the for comprehension will go through the same number of items again.
In the end you've increased the running time by n, where n is the length of lines, and added a memory footprint of roughly m + n*16 bytes, where m is the footprint of lines.
Proposition
lines.view.zipWithIndex map Function.tupled(CsvParser.parse) foreach println
A few final words (I promise): lines.view will create something like a scala.collection.SeqView that holds all the further "mapping" functions that produce a new Iterable, such as zipWithIndex and map.
Moreover, I think the expression is more elegant because it reads logically from left to right:
"For lines, create a view that zips each item with its index, map the result through the parser, and print each parsed line."
HTH.