Let's say I have code like the following:
var model = initiliazeModel(some_params)
dstream.foreachRDD { rdd =>
model = model.update(rdd)
println(model)
}
println(model) // or doing some thing on the model
My problem is that although the first println gives the desired result (the up-to-date model), the second println displays the initialized model, not the updated one.
My question is: how can I make the updated model available outside the foreachRDD block?
I also suspect a synchronization problem, because the second println runs before the first one.
Thanks for the help!
You have a common misconception here. In general, when you call map, filter, and other transformations, or register an output operation such as foreachRDD on a DStream, nothing is executed yet. Your closures are recorded and the stages configured, but everything is evaluated lazily. Your main program proceeds ahead, adding more configuration or doing other things, without waiting for the computations to finish. Thus, when your program reaches the second println (milliseconds later), the model has not changed, nor has the println inside foreachRDD been called.
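To make the ordering concrete, here is a minimal Spark-free sketch of the "register now, run later" behavior described above. FakeStream, foreachBatch and start are hypothetical stand-ins invented for this illustration (they are not Spark APIs), and the batches are plain integers instead of RDDs.

```scala
import scala.collection.mutable.ArrayBuffer

object DeferredDemo {
  // Hypothetical stand-in for a streaming context; not a Spark class.
  final class FakeStream {
    private val ops = ArrayBuffer.empty[Int => Unit]

    // Only records the callback; nothing runs at this point.
    def foreachBatch(f: Int => Unit): Unit = { ops += f }

    // Runs every registered callback once per batch, like a started stream.
    def start(batches: Seq[Int]): Unit =
      for (batch <- batches; op <- ops) op(batch)
  }

  def run(): (Int, Int) = {
    var model = 0
    val stream = new FakeStream
    stream.foreachBatch(batch => model += batch) // registered, not executed
    val seenBeforeStart = model                  // still the initial value
    stream.start(Seq(1, 2, 3))                   // callbacks run only now
    val seenAfterStart = model
    (seenBeforeStart, seenAfterStart)
  }
}
```

Here run() returns (0, 6): the value read right after registration is still the initial one, and the updates become visible only once the batches have been processed. In the real streaming job the same reasoning suggests reading the model inside foreachRDD, or only after batches have actually run.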
I don't know about Scala, but in Java you can enclose the foreach and the model variable in a class as static members, and then use the model variable from another class once the foreach has completed successfully.
Accumulators are shared variables in Spark. Tasks running on any executor can add to an accumulator, and the driver program can read the merged value.
val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
Creating and initializing the accumulator:
val accumulator = sc.accumulator(0)
Updating the accumulator:
accumulator.add(1)
Accessing the latest value:
accumulator.value
Hope this helps.
I've seen this question asked here, but the answers focus mostly on Spark Streaming, and I can't find a proper solution that works for batch. The idea is to loop through several days and, at each iteration (day), update the information about the previous day, which is used for the current iteration. The code looks like the following:
var prevIterDataRdd = // some RDD
days.foreach(folder => {
val previousData : Map[String, Double] = parseResult(prevIterDataRdd)
val broadcastMap = sc.broadcast(previousData)
val (result, previousStatus) =
processFolder(folder, broadcastMap)
// store result
result.write.csv(outputPath)
// updating the RDD that enables me to extract previousData to update broadcast
val passingPrevStatus = prevIterDataRdd.subtractByKey(previousStatus)
prevIterDataRdd = previousStatus.union(passingPrevStatus)
broadcastMap.unpersist(true)
broadcastMap.destroy()
})
Using broadcastMap.destroy() does not work: it does not let me use broadcastMap again, which I don't understand, because each iteration's broadcast should be totally unrelated (and immutable).
How should I run this loop and update the broadcast variable at each iteration?
When calling unpersist, I pass true in order to make it blocking. Is sc.broadcast() also blocking?
Do I really need unpersist() if I'm anyway broadcasting again?
Why can't I use the broadcast again after using destroy given that I'm creating a new broadcast variable?
Broadcast variables are immutable but you can create a new broadcast variable.
This new broadcast variable can be used in the next iteration.
All you need to do is to change the reference to the newly created broadcast, unpersist the old broadcast from the executors and destroy it from the driver.
Define the variable at class level, which will allow you to change the reference to the broadcast variable in the driver and use the destroy method.
object Main {
  // defined and initialized at object level to allow the reference to change
  var previousData: Map[String, Double] = null
  def main(args: Array[String]): Unit = {
    // your code
  }
}
You were not allowed to use the destroy method on the variable because the reference no longer exists in the driver. Changing the reference to the new broadcast variable can resolve the issue.
Unpersist only removes the data from the executors, so when the variable is accessed again, the driver resends it to the executors.
Passing blocking = true makes the application wait until the data has been completely removed from the executors before the next access.
sc.broadcast(): there is no official documentation saying that it is blocking. However, as soon as it is called, the application starts broadcasting the data to the executors before running the next line of code. So if the data is very large, it may slow down your application. Be careful how you use it.
It is good practice to call unpersist before destroy. This helps you get rid of the data completely from both the executors and the driver.
The following code works
@throws(classOf[IKodaMLException])
def soMergeTarget1( oldTargetIdx: Double, newTargetIdx: Double): RDDLabeledPoint =
{
try
{
logger.trace("\n\n--sparseOperationRenameTargetsInNumeriOrder--\n\n")
val oldTargetIdxb=spark.sparkContext.broadcast(oldTargetIdx)
val newTargetIdxb=spark.sparkContext.broadcast(newTargetIdx)
val newdata:RDD[(LabeledPoint,Int,String)] = sparseData.map
{
r =>
val currentLabel: Double = r._1.label
currentLabel match
{
case x if x == oldTargetIdxb.value =>
val newtrgt=newTargetIdxb.value
(new LabeledPoint(newtrgt, r._1.features), r._2, r._3)
case _ => r
}
}
val newtargetmap=ilp.targetMap.filter(e=> !(e._2 == oldTargetIdx))
oldTargetIdxb.destroy
newTargetIdxb.destroy
new RDDLabeledPoint(newdata,copyColumnMap,newtargetmap,ilp.name)
}
But, having destroyed the broadcast variables at the end of the method, the newtrgt variable in the RDD is also destroyed.
The trouble is that once the RDD is returned from this method it could be used by any analyst in any code. So, I seem to have lost all control of the broadcast variables.
Questions:
If I don't destroy the variables, will spark destroy them when reference to the RDD disappears?
(Perhaps a naive question but....) I tried a little hack val newtrgt=oldTargetIdxb.value + 1 -1 thinking that might create a new reference that is distinct from the broadcast variable. It didn't work. I must admit that surprised me. Can someone explain why the hack didn't work (I'm not suggesting it was a good idea, but I am curious).
I found an answer here
Not my answer but worth sharing on SO...and why can't I see this in Spark documentation. It's important:
Sean Owen:
you want to actively unpersist() or destroy() broadcast
variables when they're no longer needed. They can eventually be
removed when the reference on the driver is garbage collected, but
you usually would not want to rely on that.
Follow up question:
Thank you for the response. The only problem is that actively managing
broadcast variables requires returning the broadcast variables to the
caller if the function that creates the broadcast variables does not
contain any action. That is, the scope that uses the broadcast
variables cannot destroy the broadcast variables in many cases. For
example:
==============
def performTransformation(rdd: RDD[Int]) = {
  val sharedMap = sc.broadcast(map)
  rdd.map { id =>
    val localMap = sharedMap.value
    (id, localMap(id))
  }
}
def main = {
....
performTransformation(rdd).toDF("id", "i").write.parquet("dummy_example")
}
==============
In the example above, we cannot destroy the sharedMap before the
write.parquet is executed because the RDD is evaluated lazily. We will get
an exception.
Sean Owen:
Yes, although there's a difference between unpersist and destroy,
you'll hit the same type of question either way. You do indeed have to
reason about when you know the broadcast variable is no longer needed
in the face of lazy evaluation, and that's hard.
Sometimes it's obvious and you can take advantage of this to
proactively free resources. You may have to consider restructuring the
computation to allow for more resources to be freed, if this is
important to scale.
Keep in mind that things that are computed and cached may be lost and
recomputed even after their parent RDDs were definitely already
computed and don't seem to be needed. This is why unpersist is often
the better thing to call because it allows for variables to be
rebroadcast if needed in this case. Destroy permanently closes the
broadcast.
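The interaction between lazy evaluation and destroy discussed in this thread can be sketched without Spark. In the sketch below, Handle is a hypothetical stand-in for a broadcast variable (it is not Spark's Broadcast class), and a plain function value plays the role of the lazily evaluated RDD; the only assumption is that reading a destroyed handle fails.

```scala
import scala.util.Try

object BroadcastLifetimeDemo {
  // Hypothetical stand-in for a broadcast handle; not Spark's Broadcast class.
  final class Handle[T](payload: T) {
    private var alive = true
    def value: T = {
      if (!alive) throw new IllegalStateException("handle used after destroy")
      payload
    }
    def destroy(): Unit = { alive = false }
  }

  // Returns a deferred computation: the handle is read only when the thunk runs.
  def performTransformation(
      ids: Seq[Int],
      shared: Handle[Map[Int, String]]): () => Seq[(Int, String)] =
    () => ids.map(id => (id, shared.value(id)))

  // Destroying before the deferred work has run makes that work fail.
  def destroyBeforeAction(): Boolean = {
    val shared = new Handle(Map(1 -> "a", 2 -> "b"))
    val pending = performTransformation(Seq(1, 2), shared)
    shared.destroy()
    Try(pending()).isFailure
  }

  // Forcing the result first, then destroying, is safe.
  def destroyAfterAction(): Seq[(Int, String)] = {
    val shared = new Handle(Map(1 -> "a", 2 -> "b"))
    val pending = performTransformation(Seq(1, 2), shared)
    val result = pending()
    shared.destroy()
    result
  }
}
```

The design point is the one made in the thread: whoever triggers the action is the only party that knows when the shared data is no longer needed, so the handle has to be released there (or returned to that caller), not inside the function that merely builds the transformation.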
I have:
val rdd1: RDD[myClass]
It has been initialized; I checked while debugging that all the members have their default values.
If I do
rdd1.foreach(x=>x.modifier())
where modifier is a member function of myClass that modifies some of the member variables,
then after executing this, if I check the values inside the RDD, they have not been modified.
Can someone explain what's going on here?
And is it possible to make sure the values are modified inside the RDD?
EDIT:
class myClass(var id:String,var sessions: Buffer[Long],var avgsession: Long) {
def calcAvg(){
// calculate avg by summing over sessions and dividing by length
// Store this average in avgsession
}
}
The avgsession attribute is not updated if I do
myrdd.foreach(x=>x.calcAvg())
RDDs are immutable; calling a mutating method on the objects they contain will not have any effect.
The way to obtain the result you want is to produce new copies of MyClass instead of modifying the instance:
case class MyClass(id:String, avgsession: Long) {
def modifier(a: Int):MyClass =
this.copy(avgsession = this.avgsession + a)
}
Now you still cannot update rdd1, but you can obtain rdd2 that will contain the updated instances:
val rdd2 = rdd1.map(_.modifier(18))
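Applied to the average calculation from the question, the same copy-instead-of-mutate idea might look like the sketch below. UserSessions and withAverage are hypothetical names chosen for this illustration, and a plain Seq stands in for the RDD so the snippet runs without Spark; with an RDD the map call would have the same shape.

```scala
object CopyInsteadOfMutate {
  // Hypothetical immutable variant of the question's myClass.
  final case class UserSessions(id: String, sessions: Seq[Long], avgSession: Long = 0L) {
    // Returns a new instance carrying the average; the receiver is left untouched.
    def withAverage: UserSessions =
      if (sessions.isEmpty) this
      else copy(avgSession = sessions.sum / sessions.length)
  }

  // A plain Seq stands in for rdd1 here.
  val before: Seq[UserSessions] = Seq(
    UserSessions("u1", Seq(10L, 20L, 30L)),
    UserSessions("u2", Seq.empty)
  )

  // map produces new values instead of modifying the old ones.
  val after: Seq[UserSessions] = before.map(_.withAverage)
}
```

The original collection keeps its default averages, while the mapped one carries the computed values, which is the behavior you can rely on regardless of caching.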
The answer to this question is slightly more nuanced than the original accepted answer here. The original answer is correct only with respect to data that is not cached in memory. RDD data that is cached in memory can be mutated in memory as well and the mutations will remain even though the RDD is supposed to be immutable. Consider the following example:
val rdd = sc.parallelize(Seq(new mutable.HashSet[Int]()))
rdd.foreach(_+=1)
rdd.collect.foreach(println)
If you run that example you will get Set() as the result just like the original answer states.
However if you were to run the exact same thing with a cache call:
val rdd = sc.parallelize(Seq(new mutable.HashSet[Int]()))
rdd.cache
rdd.foreach(_+=1)
rdd.collect.foreach(println)
Now the result will print as Set(1). So it depends on whether the data is being cached in memory. If Spark is recomputing from source or reading from a serialized copy on disk, it will always reset to the original object and appear immutable, but if it is not loading from a serialized form, the mutations will in fact stick.
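The difference between the two runs can be imitated without Spark. The sketch below is only a model of the explanation above, not of Spark's internals: FakeRdd is a hypothetical class that either rebuilds its elements from a source function on every access or, after cache(), keeps handing out the same in-memory objects.

```scala
import scala.collection.mutable

object CacheMutationDemo {
  // Hypothetical stand-in for an RDD: compute() rebuilds the data from source.
  final class FakeRdd(compute: () => Seq[mutable.HashSet[Int]]) {
    private var useCache = false
    private var cached: Option[Seq[mutable.HashSet[Int]]] = None

    def cache(): Unit = { useCache = true }

    // Without cache: fresh objects every time. With cache: the same objects are reused.
    private def data: Seq[mutable.HashSet[Int]] =
      if (!useCache) compute()
      else {
        if (cached.isEmpty) cached = Some(compute())
        cached.get
      }

    def foreach(f: mutable.HashSet[Int] => Unit): Unit = data.foreach(f)
    def collect(): Seq[mutable.HashSet[Int]] = data
  }

  def run(withCache: Boolean): Seq[Set[Int]] = {
    val rdd = new FakeRdd(() => Seq(mutable.HashSet.empty[Int]))
    if (withCache) rdd.cache()
    rdd.foreach(_ += 1)          // mutates whatever objects this pass sees
    rdd.collect().map(_.toSet)   // a second pass over the data
  }
}
```

In this model the mutation survives only when the second pass sees the same objects as the first, which is why relying on it is fragile: anything that causes recomputation or a serialized round trip silently discards the change.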
RDDs are immutable. By using map, you can iterate over the RDD and return a new one.
val rdd2 = rdd1.map(x=>x.modifier())
I have observed that code like yours will work after calling RDD.persist when running in spark/yarn. It is probably unsupported/accidental behavior and you should avoid it - but it is a workaround that may help in a pinch. I'm running version 1.5.0.
I have an array that works when it is inside the closure (it has some values), but outside the loop the array size is 0. What causes this behavior?
I need the hArr to be accessible outside for batch HBase put.
val hArr = new ArrayBuffer[Put]()
rdd.foreach(row => {
val hConf = HBaseConfiguration.create()
val hTable = new HTable(hConf, tablename)
val hRow = new Put(Bytes.toBytes(row._1.toString))
hRow.add(...)
hArr += hRow
println("hArr: " + hArr.toArray.mkString(","))
})
println("hArr.size: " + hArr.size)
The problem is that any objects referenced in an RDD closure are copied, and the tasks work on their own local versions. foreach should only be used for side effects such as saving to disk.
If you want the results in an array, you can map and then collect:
rdd.map(row=> {
val hConf = HBaseConfiguration.create()
val hTable = new HTable(hConf, tablename)
val hRow = new Put(Bytes.toBytes(row._1.toString))
hRow.add(...)
hRow
}).collect()
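To see why the original foreach version left hArr empty on the driver, here is a small Spark-free sketch of the copying involved. It assumes, as the answer above describes, that whatever a closure captures reaches the tasks as a serialized copy; roundTrip is a hypothetical helper that imitates this with plain Java serialization, and strings stand in for the Put objects.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

object ClosureCopyDemo {
  // Serializes and deserializes a value, yielding an independent copy.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def run(): (Int, Int) = {
    val driverSide = ArrayBuffer.empty[String]   // plays the role of hArr
    val taskSide = roundTrip(driverSide)         // what the task would append to
    taskSide += "row-1"
    (driverSide.size, taskSide.size)
  }
}
```

run() returns (0, 1): the append lands in the copy, and the buffer the driver holds never sees it. Returning the values from map and calling collect(), as in the answer, sends the results back explicitly instead.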
I've found that quite a few new Spark users are confused about how the mapper and reducer functions get run and how they relate to things defined in the driver program. In general, the mapper/reducer functions you define and register via map, foreach, reduceByKey, or their many variants are not executed in your driver program. In the driver program you only register them for Spark to run remotely, in a distributed fashion. When those functions reference objects you instantiated in the driver program, you have literally created a "closure", which compiles fine most of the time. But usually that's not what you intended, and you will typically run into problems at runtime, seeing either NotSerializable or ClassNotFound exceptions.
You can either do all the output work remotely with foreach() variants, or collect all the data back to your driver program for output by calling collect(). But be careful with collect(), as it brings all the data from the distributed nodes to your driver program. Only do it when you are absolutely sure your final aggregated data is small.
I want to use GraphX. For now I just launch it locally.
I get a NullPointerException in these few lines. The first println works well, and the second one fails.
..........
val graph: Graph[Int, Int] = Graph(users, relationships)
println("graph.inDegrees = " + graph.inDegrees.count) // this line works well
graph.mapVertices((id, v) => {
println("graph.inDegrees = " + graph.inDegrees.count) // but this one fails
42 // doesn't mean anything
}).vertices.collect
And it does not matter which method of 'graph' object I call. But 'graph' is not null inside 'mapVertices'.
Exception failure in TID 2 on host localhost:
java.lang.NullPointerException
org.apache.spark.graphx.impl.GraphImpl.mapReduceTriplets(GraphImpl.scala:168)
org.apache.spark.graphx.GraphOps.degreesRDD(GraphOps.scala:72)
org.apache.spark.graphx.GraphOps.inDegrees$lzycompute(GraphOps.scala:49)
org.apache.spark.graphx.GraphOps.inDegrees(GraphOps.scala:48)
ololo.MyOwnObject$$anonfun$main$1.apply$mcIJI$sp(Twitter.scala:42)
Reproduced using GraphX 2.10 on Spark 1.0.2. I'll give you a workaround and then explain what I think is happening. This works for me:
val c = graph.inDegrees.count
graph.mapVertices((id, v) => {
println("graph.inDegrees = " + c)
}).vertices.collect
In general, Spark gets prickly when you try to access an entire RDD or other distributed object (like a Graph) in code that's intended to execute in parallel on a single partition, like the function you're passing into mapVertices. But it's also usually a bad idea even when you can get it to work. (As a separate matter, as you've seen, when it doesn't work it tends to result in really unhelpful behavior.)
The vertices of a Graph are represented as an RDD, and the function you pass into mapVertices runs locally in the appropriate partitions, where it is given access to local vertex data: id and v. You really don't want the entire graph to be copied to each partition. In this case you just need to broadcast a scalar to each partition, so pulling it out solved the problem and the broadcast is really cheap.
There are tricks in the Spark APIs for accessing more complex objects in such a situation, but if you use them carelessly they will destroy your performance because they'll tend to introduce lots of communication. Often people are tempted to use them because they don't understand the computation model, rather than because they really need to, although that does happen too.
Spark does not support nested RDDs or user-defined functions that refer to other RDDs, hence the NullPointerException; see this thread on the spark-users mailing list. In this case, you're attempting to call count() on a Graph (which performs an action on a Spark RDD) from inside of a mapVertices() transformation, leading to a NullPointerException when mapVertices() attempts to access data structures that are only callable by the Spark driver.
In a nutshell, only the Spark driver can launch new Spark jobs; you can't call actions on RDDs from inside of other RDD actions.
See https://stackoverflow.com/a/23793399/590203 for another example of this issue.