Spark Closures with Array [duplicate] - scala

I have an array that works when it is inside the closure (it has some values), but outside the loop the array size is 0. What causes this behavior?
I need hArr to be accessible outside the loop for a batch HBase put.
val hArr = new ArrayBuffer[Put]()

rdd.foreach(row => {
  val hConf = HBaseConfiguration.create()
  val hTable = new HTable(hConf, tablename)
  val hRow = new Put(Bytes.toBytes(row._1.toString))
  hRow.add(...)
  hArr += hRow
  println("hArr: " + hArr.toArray.mkString(","))
})

println("hArr.size: " + hArr.size)

The problem is that any objects referenced inside an RDD closure are copied to the executors, which work on their own local copies; the hArr on the driver is never touched. foreach should only be used for side effects such as saving to disk or something along those lines.
If you want the results in an array on the driver, then you can map and then collect:
rdd.map(row => {
  val hConf = HBaseConfiguration.create()
  val hTable = new HTable(hConf, tablename)
  val hRow = new Put(Bytes.toBytes(row._1.toString))
  hRow.add(...)
  hRow
}).collect()

I've found that quite a few new Spark users are confused about how the mapper and reducer functions get run and how they relate to things defined in the driver program. In general, the mapper/reducer functions you define and register via map, foreach, reduceByKey, or their many variants are not executed in your driver program. In the driver you merely register them for Spark to run remotely and in a distributed fashion. When those functions reference objects you instantiated in your driver program, you have literally created a "closure", which will compile fine most of the time. But usually that's not what you intended, and you will typically run into problems at runtime, seeing either NotSerializable or ClassNotFound exceptions.
You can either do all the output work remotely via the foreach() variants, or collect all the data back to your driver program for output by calling collect(). But be careful with collect(), as it brings all the data from the distributed nodes to your driver program. Only do it when you are absolutely sure your final aggregated data is small.
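For the HBase case in the question, a minimal sketch of the "output remotely" route is to issue the puts from inside foreachPartition, so nothing has to come back to the driver. This assumes the same classic HBase client API the question already uses (HBaseConfiguration, HTable, Put) and the question's rdd, tablename, and row layout; it is a sketch, not a drop-in implementation:

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

rdd.foreachPartition { rows =>
  // one table handle per partition, created on the executor itself
  val hConf = HBaseConfiguration.create()
  val hTable = new HTable(hConf, tablename)

  // build the Puts for this partition and write them as a single batch
  val puts = rows.map { row =>
    val hRow = new Put(Bytes.toBytes(row._1.toString))
    // hRow.add(...) as in the question
    hRow
  }.toList

  hTable.put(puts.asJava)
  hTable.close()
}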

Related

Broadcasting updates on spark jobs

I've seen this question asked here before, but the answers essentially focus on Spark Streaming and I can't find a proper solution for batch. The idea is to loop through several days, and at each iteration/day update the information about the previous day (which is used in the current iteration). The code looks like the following:
var prevIterDataRdd = // some RDD

days.foreach(folder => {
  val previousData: Map[String, Double] = parseResult(prevIterDataRdd)
  val broadcastMap = sc.broadcast(previousData)

  val (result, previousStatus) = processFolder(folder, broadcastMap)

  // store result
  result.write.csv(outputPath)

  // update the RDD that lets me extract previousData to refresh the broadcast
  val passingPrevStatus = prevIterDataRdd.subtractByKey(previousStatus)
  prevIterDataRdd = previousStatus.union(passingPrevStatus)

  broadcastMap.unpersist(true)
  broadcastMap.destroy()
})
Calling broadcastMap.destroy() does not work, because after it I can no longer use broadcastMap (which I actually don't understand, since each iteration creates a new, immutable broadcast and the two should be unrelated).
How should I run this loop and update the broadcast variable at each iteration?
When calling unpersist I pass true to make it blocking. Is sc.broadcast() also blocking?
Do I really need unpersist() if I'm broadcasting again anyway?
Why can't I use the broadcast again after calling destroy, given that I'm creating a new broadcast variable?
Broadcast variables are immutable, but you can create a new broadcast variable and use it in the next iteration.
All you need to do is change the reference so that it points to the newly created broadcast, unpersist the old broadcast from the executors, and destroy it from the driver.
Define the variable at class level; that allows you to change the reference of the broadcast variable in the driver and still call the destroy method on the old one.
object Main {
  // defined and initialized at object level to allow the reference to change
  var previousData: Map[String, Double] = null

  def main(args: Array[String]): Unit = {
    // your code
  }
}
You were not able to call destroy on the variable because the driver no longer held a reference to it. Changing the reference to the new broadcast variable resolves the issue.
unpersist only removes the data from the executors, so when the variable is re-accessed the driver re-sends it to the executors.
blocking = true makes the call wait until the data has been completely removed from the executors before the application moves on.
sc.broadcast() - there is no official documentation saying that it is blocking. However, as soon as it is called the application starts making the data available to the executors before running the next line of code, so if the data is very large it may slow your application down. Be careful how you use it.
It is good practice to call unpersist before destroy; this gets rid of the data completely from both the executors and the driver.
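Putting that together, a minimal sketch of the loop with a swappable broadcast reference could look like the following. It reuses the question's names (days, parseResult, processFolder, outputPath), which are assumptions here; initialRdd stands in for the question's "some RDD"; and the cache()/count() step is an extra precaution added so that destroying the old broadcast cannot break a later recomputation of the lazily defined RDD:

import org.apache.spark.broadcast.Broadcast

// driver-side references that are re-assigned on every iteration
var prevIterDataRdd = initialRdd
var oldBroadcast: Broadcast[Map[String, Double]] = null

days.foreach { folder =>
  // fresh broadcast for this iteration; only the driver-side reference changes
  val broadcastMap = sc.broadcast(parseResult(prevIterDataRdd))

  val (result, previousStatus) = processFolder(folder, broadcastMap)
  result.write.csv(outputPath)   // the jobs that read broadcastMap run here

  val passingPrevStatus = prevIterDataRdd.subtractByKey(previousStatus)
  prevIterDataRdd = previousStatus.union(passingPrevStatus).cache()
  prevIterDataRdd.count()        // materialize so later uses don't recompute through the broadcast

  // the previous iteration's broadcast is no longer referenced anywhere: drop it
  if (oldBroadcast != null) {
    oldBroadcast.unpersist(blocking = true)
    oldBroadcast.destroy()
  }
  oldBroadcast = broadcastMap
}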

How to print a String or String[Array] in Scala(spark)?

I'm trying to unit test the values returned in a String, but when I try to print them, the console just gives something like:
MapPartitionsRDD[32]
My code is as follows:
UPDATED:
val src = exact_bestmatch_src.filter(line => line.split(",")(0).toInt.equals(i))
val dest = exact_bestmatch_Dest.filter(line => line.split(",")(0).toInt.equals(i)).toArray()

for (print1 <- src) {
  var n1: String = src.toString()
  var sourceArr: Array[String] = n1.split(",")
  for (print2 <- dest) {
    var n2: String = dest.toString()
    for (i <- 0 until sourceArr.length) {
      if (n1.split(",")(i).equals(n2.split(",")(i))) {
      }
    }
  }
}
I also tried println(n1.mkstring())
I'm trying to compare both the src and dest RDDs to find the differences between the rows.
If you want to see each record in the RDD printed as a separate line, you can use:
src.foreach(println)
This will run the println function on each record, within the executor that holds it (which might be several different executors). If this runs in some test using Spark's "local" mode, there's only one "executor" and it's the same process as the driver, so that's not a problem.
Alternatively, if you do have more than one executor (non-local mode) and you want to make sure the RDD's elements are printed to the driver console, you can first collect the RDD's elements into a local collection and then print them:
src.collect().foreach(println)
NOTE that this assumes the RDD is small enough to be collected into a single machine's memory.
Calling toString on an RDD does not access the RDD's data (as it might be too large to fit as a String in the driver machine's memory), as you observed it just prints the type of the RDD and its ID.
You don't have a list or array. You'd need to collect() an RDD in order to get one, or you need to iterate it via foreach.
Calling println on any object already calls its toString method, by the way. And an RDD doesn't have a mkString method.
Calling toString on src just gives you a string representation of the RDD object, which can be anything; for an RDD it is not the content (that would require pulling all of the RDD's data to the driver and printing it at once).
As others have mentioned, in order to print the content of the RDD you need to first get all the data to the driver.
Let's consider the simple solution already proposed:
src.collect().foreach(println)
The first part, collect, tells Spark to get all the content of the RDD and bring it to the driver as a sequence of records. The foreach tells Scala to go over each record in the sequence and pass it as an argument to the println function, which prints it. You could of course use mkString instead of foreach to get a single string.
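Since the goal in the question is to compare the src and dest rows field by field, a minimal sketch along those lines (assuming both are RDD[String] of comma-separated values, as in the question, and are small enough to collect) could be:

// bring both sides to the driver; only safe for small data
val srcRows = src.collect()
val destRows = dest.collect()

for ((s, d) <- srcRows.zip(destRows)) {
  // pair up corresponding fields and keep only the ones that differ
  val diffs = s.split(",").zip(d.split(",")).filter { case (a, b) => a != b }
  println(s"differing fields: ${diffs.mkString(", ")}")
}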

modifying RDD of object in spark (scala)

I have:
val rdd1: RDD[myClass]
It has been initialized; I checked while debugging and all the members have their default values.
If I do
rdd1.foreach(x => x.modifier())
where modifier is a member function of myClass that modifies some of the member variables, then after executing this, when I check the values inside the RDD they have not been modified.
Can someone explain what's going on here?
And is it possible to make sure the values are modified inside the RDD?
EDIT:
class myClass(var id: String, var sessions: Buffer[Long], var avgsession: Long) {
  def calcAvg(): Unit = {
    // calculate the average by summing over sessions and dividing by the length
    // store this average in avgsession
  }
}
The avgsession attribute is not updated if I do
myrdd.foreach(x => x.calcAvg())
RDDs are immutable; calling a mutating method on the objects they contain will not have any (visible) effect.
The way to obtain the result you want is to produce new, updated copies of MyClass instead of modifying the instances in place:
case class MyClass(id: String, avgsession: Long) {
  def modifier(a: Int): MyClass =
    this.copy(avgsession = this.avgsession + a)
}
Now you still cannot update rdd1, but you can obtain rdd2, which contains the updated instances:
val rdd2 = rdd1.map(_.modifier(18))
The answer to this question is slightly more nuanced than the original accepted answer here. The original answer is correct only with respect to data that is not cached in memory. RDD data that is cached in memory can be mutated in memory as well and the mutations will remain even though the RDD is supposed to be immutable. Consider the following example:
val rdd = sc.parallelize(Seq(new mutable.HashSet[Int]()))
rdd.foreach(_+=1)
rdd.collect.foreach(println)
If you run that example you will get Set() as the result just like the original answer states.
However if you were to run the exact same thing with a cache call:
val rdd = sc.parallelize(Seq(new mutable.HashSet[Int]()))
rdd.cache
rdd.foreach(_+=1)
rdd.collect.foreach(println)
Now the result will print as Set(1). So it depends on whether the data is being cached in memory. If Spark is recomputing from source or reading from a serialized copy on disk, then it will always reset back to the original object and appear to be immutable; but if it's not loading from a serialized form, then the mutations will in fact stick.
RDDs are immutable. By using map, you can iterate over the RDD and return a new one:
val rdd2 = rdd1.map(x => x.modifier())
I have observed that code like yours will work after calling RDD.persist when running on Spark/YARN. It is probably unsupported/accidental behavior and you should avoid it, but it is a workaround that may help in a pinch. I'm running version 1.5.0.

How can one spread the result of foreachRDD

Let's say I have code as follows:
var model = initiliazeModel(some_params)

dstream.foreachRDD { rdd =>
  model = model.update(rdd)
  println(model)
}

println(model) // or doing something with the model
My problem is that even though the first println gives the desired result, i.e. the up-to-date model, the second println displays the initialized model, not the updated one!
My question is: how can I get the updated model outside the foreachRDD block?
I also suspect a synchronization problem, because the 2nd println is run before the 1st one.
Thanks for the help!
You have a common misconception here. In general, when you call map, filter, foreach, or any other transformation, you are not executing anything just yet. Your closures are sent to the executors and the stages are configured, but everything is evaluated lazily. Your main program proceeds ahead, either adding more configuration or doing other things, without waiting for all the computation to be done. Thus, when your program reaches your second println (milliseconds later), the model has not changed, nor has any other println been called.
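A minimal sketch of that timing, reusing the question's own names (dstream, initiliazeModel, some_params are taken from the question) and assuming a StreamingContext called ssc:

var model = initiliazeModel(some_params)

dstream.foreachRDD { rdd =>
  model = model.update(rdd) // runs on the driver once per batch, after ssc.start()
  println(model)            // prints the up-to-date model, once per batch
}

println(model)              // runs immediately at setup time, before any batch

ssc.start()
ssc.awaitTermination()      // streaming keeps updating `model` until the context is stopped

Anything that needs the updated model therefore has to run inside the foreachRDD block, or only after the streaming context has been stopped.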
In Scala I'm not sure, but in Java you can enclose your foreach and the model variable within a class as static members, and then use the model variable from another class after the foreach has completed.
Accumulators are shared across the Spark application: tasks running on the executors can add to an accumulator from anywhere in the program, and the driver can read the aggregated value.
val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)

Creating and initializing the accumulator:
val accumulator = sc.accumulator(0)

Adding to the accumulator:
accumulator.add(1)

Accessing the latest value (on the driver):
accumulator.value

Hope this helps.

Graphx: I've got NullPointerException inside mapVertices

I want to use GraphX. For now I'm just running it locally.
I get a NullPointerException in these few lines. The first println works fine, and the second one fails.
..........
val graph: Graph[Int, Int] = Graph(users, relationships)

println("graph.inDegrees = " + graph.inDegrees.count) // this line works well

graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + graph.inDegrees.count) // but this one fails
  42 // doesn't mean anything
}).vertices.collect
It does not matter which method of the 'graph' object I call. And 'graph' is not null inside 'mapVertices'.
Exception failure in TID 2 on host localhost:
java.lang.NullPointerException
org.apache.spark.graphx.impl.GraphImpl.mapReduceTriplets(GraphImpl.scala:168)
org.apache.spark.graphx.GraphOps.degreesRDD(GraphOps.scala:72)
org.apache.spark.graphx.GraphOps.inDegrees$lzycompute(GraphOps.scala:49)
org.apache.spark.graphx.GraphOps.inDegrees(GraphOps.scala:48)
ololo.MyOwnObject$$anonfun$main$1.apply$mcIJI$sp(Twitter.scala:42)
Reproduced using GraphX 2.10 on Spark 1.0.2. I'll give you a workaround and then explain what I think is happening. This works for me:
val c = graph.inDegrees.count

graph.mapVertices((id, v) => {
  println("graph.inDegrees = " + c)
}).vertices.collect
In general, Spark gets prickly when you try to access an entire RDD or other distributed object (like a Graph) in code that's intended to execute in parallel on a single partition, like the function you're passing into mapVertices. But it's also usually a bad idea even when you can get it to work. (As a separate matter, as you've seen, when it doesn't work it tends to result in really unhelpful behavior.)
The vertices of a Graph are represented as an RDD, and the function you pass into mapVertices runs locally in the appropriate partitions, where it is given access to local vertex data: id and v. You really don't want the entire graph to be copied to each partition. In this case you just need to broadcast a scalar to each partition, so pulling it out solved the problem and the broadcast is really cheap.
There are tricks in the Spark APIs for accessing more complex objects in such a situation, but if you use them carelessly they will destroy your performance because they'll tend to introduce lots of communication. Often people are tempted to use them because they don't understand the computation model, rather than because they really need to, although that does happen too.
Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list. In this case, you're attempting to call count() on a Graph (which performs an action on a Spark RDD) from inside of a mapVertices() transformation, leading to a NullPointerException when mapVertices() attempts to access data structures that are only callable by the Spark driver.
In a nutshell, only the Spark driver can launch new Spark jobs; you can't call actions on RDDs from inside of other RDD actions.
See https://stackoverflow.com/a/23793399/590203 for another example of this issue.
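As an aside, if what you actually need inside the vertex function is graph-derived, per-vertex data (rather than a single scalar like the count above), the usual GraphX pattern is to compute it as a VertexRDD outside the vertex function and join it in, instead of touching the Graph from within mapVertices. A minimal sketch, assuming the same Graph[Int, Int] as in the question:

// compute per-vertex in-degrees once, as a plain VertexRDD[Int]
val inDeg = graph.inDegrees

// join the degrees onto the graph; vertices with no incoming edges get 0
val withInDegrees: Graph[Int, Int] =
  graph.outerJoinVertices(inDeg) { (id, attr, inDegOpt) =>
    inDegOpt.getOrElse(0)
  }

withInDegrees.vertices.collect()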