Count number of elements in a text or list using Spark - pyspark

I know there are different ways to count the number of elements in a text or list, but I am trying to understand why this one does not work. I am trying to write code equivalent to
A_RDD=sc.parallelize(['a', 1.2, []])
acc = sc.accumulator(0)
acc.value
A_RDD.foreach(lambda _: acc.add(1))
acc.value
Where the result is 3.
To do so I defined the following function called my_count(_), but I don't know how to get the result. A_RDD.foreach(my_count) does not do anything. I didn't receive any error either. What did I do wrong?
counter = 0

# function that counts elements by incrementing a global counter
def my_count(_):
    global counter
    counter += 1

A_RDD.foreach(my_count)

The A_RDD.foreach(my_count) operation doesn't run in your local Python virtual machine; it runs on the remote executor nodes. The driver ships your my_count function to each executor node, along with the counter variable, because the function refers to it. Each executor therefore gets its own copy of counter, and that copy is what foreach increments, while the counter defined in your driver application is never incremented.
One easy but risky solution is to collect the RDD on the driver and compute the count there, as below. It is risky because the entire RDD content is pulled into the driver's memory, which may cause a MemoryError; A_RDD.count() gives the same answer without materializing the data on the driver.
>>> len(A_RDD.collect())
3

So what if you were running locally and not on a cluster? In Spark/Scala this behaviour differs between local mode and a cluster: locally the variable may end up with the expected value, but on a cluster it won't, and you see exactly what you describe. Does the same thing happen in Spark/Python? My guess is that it does.
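For comparison, here is a minimal Scala sketch of the same experiment (my assumptions: the Spark 1.x accumulator API and an existing SparkContext sc). The captured var behaves just like the Python counter, while the accumulator is aggregated back to the driver:
var counter = 0
val acc = sc.accumulator(0)
val rdd = sc.parallelize(Seq("a", "b", "c"))
rdd.foreach(_ => counter += 1) // increments executor-side copies of counter only
rdd.foreach(_ => acc += 1)     // adds are aggregated back to the driver
println(counter)               // 0 on a cluster; in local mode the value is not guaranteed
println(acc.value)             // 3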

Related

Spark running time in local mode changes a lot with "println"

A tricky problem came up involving a sharp increase in execution time.
I run my Scala code on local Spark, and part of it builds an n*n matrix.
On a small dataset it takes just 5 s to finish. The most time-consuming part is building a 2000*2000 matrix, and this part is executed within a map, which only deals with an array data structure.
However, just out of curiosity, I added a "println" inside the matrix-building code to see the number of iterations. Suddenly the whole running time increased to 1 min 23 s, yet the final results are the same.
I am new to Spark and have no idea what really causes this.
The code is simply:
val x = someRDD.map(buildMatrix)

def buildMatrix(stringVect: Array[String]): Array[Array[Double]] = {
  // var count = 0
  val num = stringVect.length
  var simi_matrix = Array[Array[Double]]()
  for (i <- 0 until num - 1) {
    for (j <- (i + 1) until num) {
      // build the matrix with some computation
      // println(count)
      // count += 1
    }
  }
  simi_matrix
}
TL;DR
This has nothing to do with Spark. I/O access to the console is synchronized and costly; it will slow down any program on the JVM (Scala/Java/Clojure/...).
println defaults to java.lang.System.out, which is a PrintStream. println delegates to PrintStream#println, and therefore enters the synchronized block of the println implementation to write to the console. There are two expenses:
Getting a synchronized lock
I/O to the console OutputStream
The slowdown observed is to be expected. Just don't use println in hot parts of the code (like a tight loop in this case).
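To see the effect without Spark at all, here is a rough micro-benchmark sketch in plain Scala (just an illustration of the console I/O cost; it ignores JIT warm-up and other benchmarking rigor):
def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}
// the same nested loop, with and without console output in the hot path
time("silent")  { var s = 0L; for (i <- 0 until 2000; j <- i + 1 until 2000) s += 1; s }
time("println") { var s = 0L; for (i <- 0 until 2000; j <- i + 1 until 2000) { s += 1; println(s) }; s }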

Scala return value calculated in foreach

I am new to Scala and Spark and am trying to understand a few basic things here.
The Spark version used is 1.5.
Why does the value of sum not get updated in the foreach loop below?
var sum = 1
df.select("column1").distinct().foreach(row => {
  sum = sum + 1
})
println("SUM = " + sum)
--> SUM = 1
I am trying to understand the scope of a variable referenced inside a foreach. What if I need to do some math inside it and get the result of it outside the loop?
My use case for understanding this is to collect the unique values in the loop and append them to a list of Strings.
The way you reason about the program is wrong. foreach is executed independently on each executor and modifies its own copy of sum. There is no global shared state here. Just count values directly:
df.select("column1").distinct.count
If you really want to handle this manually you'll need some type of reduce:
df.select("column1").distinct.rdd.map(_ => 1L).reduce(_ + _)
Read the Programming Guide; it has a section devoted to this: Understanding Closures. If you actually need to collect some state, you can use Accumulators (but note that you can't read the value from the executor nodes, only add to it). Try to do without them first, though: think in terms of the available transformations instead of mutating state.
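If you do reach for an accumulator, a minimal sketch looks like this (my assumptions: the Spark 1.x accumulator API matching the question's Spark 1.5, an existing SparkContext sc, and the same df). Only the driver can read acc.value:
val acc = sc.accumulator(0L)
df.select("column1").distinct.rdd.foreach(_ => acc += 1L)
println("SUM = " + acc.value) // readable on the driver only; tasks can only add to it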

How can one spread the result of foreachRDD

Let's say I have code like the following:
var model = initiliazeModel(some_params)
dstream.foreachRDD { rdd =>
  model = model.update(rdd)
  println(model)
}
println(model) // or doing something with the model
My problem is that even though the first println gives the desired result, i.e. the up-to-date model, the second println displays the initialized model and not the updated one.
My question is: how can I propagate the updated model outside the foreachRDD block?
I also suspect a synchronization problem, because the second println runs before the first one.
Thanks for the help!
You have a common misconception here. In general, when you call map, filter, foreachRDD, and so on, you are not executing anything just yet. Your closures are sent to the executors and the stages are configured, but everything is evaluated lazily. Your main program proceeds, either adding more configuration or doing other things, without waiting for all the computations to be done. Thus, when your program reaches the second println (milliseconds later), the model has not changed, nor has the other println been called.
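To make the ordering concrete, here is a minimal self-contained sketch (my assumptions: a socket text source on localhost:9999, a 5-second batch interval, and a simple running count standing in for the question's model):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ForeachRDDOrdering {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("foreachRDD-ordering").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    var model = 0L // stand-in for the real model
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD { rdd =>
      model += rdd.count()                  // runs on the driver, once per batch, after start()
      println(s"model after batch: $model") // the "first" println: up to date
    }
    println(s"model at setup time: $model") // the "second" println: runs now, still 0
    ssc.start()                             // only from here on are batches processed
    ssc.awaitTermination()
  }
}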
I don't know about Scala, but in Java you can enclose your foreach and the model variable within a class as static members, and then use the model variable from another class after the foreach has completed.
Accumulators are global in Spark: you can add to an accumulator anywhere in the program, whether on an executor or in the driver, and the updates are aggregated so that the driver always sees the latest value.
val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
// create and initialize the accumulator
val accumulator = sc.accumulator(0)
// add to the accumulator (from the driver or from within tasks)
accumulator.add(1)
// access the latest value (on the driver)
accumulator.value
Hope this helps.

How to define a global scala variable in Spark which will be shared by all workers?

In a Spark program, I want to define a variable, such as an immutable map, that will be accessed by all worker programs synchronously. What can I do? Should I define a Scala object?
Not only an immutable map: what if I want a variable that can be shared and updated synchronously, for example a mutable map, a var Int, a var String, or something else? What can I do? Is a Scala object variable OK? For example:
object SparkObj {
  var x: Int = 0
  var y: String = ""
}
Are x and y maintained by the driver instead of the workers, and shared by all workers?
Do x and y have only one copy instead of several copies?
Are updates to x and y synchronized?
If you refer to a variable inside a closure that runs on the workers, it will be captured, serialized and sent to the workers. For example:
val i = 5
rdd.map(_ + i) // "i" is sent to the workers, they add 5 to each element.
Nothing is sent back from the workers, however. If you add something to a mutable.Seq inside a worker, the change will not be visible from anywhere. You'll be modifying an object that is discarded after the closure is executed.
Apache Spark provides a number of primitives for performing distributed computing. Synchronized mutable state is not one of these.
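As a minimal sketch of the primitives that do exist (my assumption: the Spark 1.x accumulator API, as used elsewhere on this page): a broadcast variable gives every task read-only access to the same immutable map, and an accumulator lets tasks add to a value that only the driver reads.
import org.apache.spark.{SparkConf, SparkContext}

object SharedStateSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared-state").setMaster("local[*]"))
    // read-only shared state: shipped to each executor once, never written back
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val mapped = sc.parallelize(Seq("a", "b", "a")).map(k => lookup.value.getOrElse(k, 0))
    // write-only (from the tasks' point of view) shared state: tasks add, the driver reads
    val acc = sc.accumulator(0)
    mapped.foreach(v => acc += v)
    println(acc.value) // 4
    sc.stop()
  }
}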

GraphX: I've got a NullPointerException inside mapVertices

I want to use GraphX. For now I just launch it locally.
I get a NullPointerException in these few lines: the first println works fine, and the second one fails.
..........
val graph: Graph[Int, Int] = Graph(users, relationships)
println("graph.inDegrees = " + graph.inDegrees.count) // this line works well
graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + graph.inDegrees.count) // but this one fails
  42 // doesn't mean anything
}).vertices.collect
It does not matter which method of the graph object I call, and graph is not null inside mapVertices.
Exception failure in TID 2 on host localhost:
java.lang.NullPointerException
org.apache.spark.graphx.impl.GraphImpl.mapReduceTriplets(GraphImpl.scala:168)
org.apache.spark.graphx.GraphOps.degreesRDD(GraphOps.scala:72)
org.apache.spark.graphx.GraphOps.inDegrees$lzycompute(GraphOps.scala:49)
org.apache.spark.graphx.GraphOps.inDegrees(GraphOps.scala:48)
ololo.MyOwnObject$$anonfun$main$1.apply$mcIJI$sp(Twitter.scala:42)
Reproduced using GraphX 2.10 on Spark 1.0.2. I'll give you a workaround and then explain what I think is happening. This works for me:
val c = graph.inDegrees.count
graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + c)
}).vertices.collect
In general, Spark gets prickly when you try to access an entire RDD or other distributed object (like a Graph) in code that's intended to execute in parallel on a single partition, like the function you're passing into mapVertices. But it's also usually a bad idea even when you can get it to work. (As a separate matter, as you've seen, when it doesn't work it tends to result in really unhelpful behavior.)
The vertices of a Graph are represented as an RDD, and the function you pass into mapVertices runs locally in the appropriate partitions, where it is given access to local vertex data: id and v. You really don't want the entire graph to be copied to each partition. In this case you just need to broadcast a scalar to each partition, so pulling it out solved the problem and the broadcast is really cheap.
There are tricks in the Spark APIs for accessing more complex objects in such a situation, but if you use them carelessly they will destroy your performance because they'll tend to introduce lots of communication. Often people are tempted to use them because they don't understand the computation model, rather than because they really need to, although that does happen too.
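When the value you need inside mapVertices is more than a scalar, one safe pattern is to compute a small summary on the driver and broadcast it explicitly. A minimal sketch, reusing the question's graph and assuming an existing SparkContext sc (and that the in-degree map is small enough to collect):
val inDegreeMap = graph.inDegrees.collect().toMap // VertexId -> in-degree, gathered on the driver
val bcastDegrees = sc.broadcast(inDegreeMap)
graph.mapVertices { (id, v) =>
  bcastDegrees.value.getOrElse(id, 0) // per-vertex lookup; no nested RDD access from the closure
}.vertices.collect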
Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list. In this case, you're attempting to call count() on a Graph (which performs an action on a Spark RDD) from inside of a mapVertices() transformation, leading to a NullPointerException when mapVertices() attempts to access data structures that are only usable by the Spark driver.
In a nutshell, only the Spark driver can launch new Spark jobs; you can't call actions on RDDs from inside of other RDD actions.
See https://stackoverflow.com/a/23793399/590203 for another example of this issue.