Modifying an RDD of objects in Spark (Scala)

I have:
val rdd1: RDD[myClass]
It has been initialized; I checked while debugging that all the members have their default values.
If I do
rdd1.foreach(x=>x.modifier())
where modifier is a member function of myClass that modifies some of the member variables,
and then check the values inside the RDD, they have not been modified.
Can someone explain what's going on here?
And is it possible to make sure the values are modified inside the RDD?
EDIT:
class myClass(var id:String,var sessions: Buffer[Long],var avgsession: Long) {
def calcAvg(){
// calculate avg by summing over sessions and dividing by length
// Store this average in avgsession
}
}
The avgsession attribute is not updated if I do
myrdd.foreach(x=>x.calcAvg())

RDDs are immutable; calling a mutating method on the objects an RDD contains will not change the RDD.
The way to obtain the result you want is to produce new copies of MyClass instead of modifying the instance:
case class MyClass(id:String, avgsession: Long) {
def modifier(a: Int):MyClass =
this.copy(avgsession = this.avgsession + a)
}
Now you still cannot update rdd1, but you can obtain rdd2 that will contain the updated instances:
val rdd2 = rdd1.map(_.modifier(18))
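Applied to the class from the question's EDIT, the same idea could look roughly like the sketch below. It is only an illustration: the immutable Seq for sessions, the default of 0 for avgsession, and the method name withAvg are my assumptions, not the asker's code. It uses a plain Seq so it runs without Spark; on an RDD the call would be rdd1.map(_.withAvg).

```scala
// Hypothetical immutable variant of the question's class (names are assumptions).
final case class MyClass(id: String, sessions: Seq[Long], avgsession: Long = 0L) {
  // Returns a NEW instance with the average filled in; `this` is left unchanged.
  def withAvg: MyClass =
    if (sessions.isEmpty) this
    else copy(avgsession = sessions.sum / sessions.size)
}

object ImmutableUpdateDemo {
  def main(args: Array[String]): Unit = {
    val before = Seq(MyClass("a", Seq(10L, 20L, 30L)), MyClass("b", Seq.empty))
    // With Spark this line would be: val rdd2 = rdd1.map(_.withAvg)
    val after = before.map(_.withAvg)
    after.foreach(println)
  }
}
```

The original collection (or RDD) still holds the old values; only the result of map carries the computed averages.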

The answer to this question is slightly more nuanced than the original accepted answer here. The original answer is correct only with respect to data that is not cached in memory. RDD data that is cached in memory can be mutated in memory as well and the mutations will remain even though the RDD is supposed to be immutable. Consider the following example:
val rdd = sc.parallelize(Seq(new mutable.HashSet[Int]()))
rdd.foreach(_+=1)
rdd.collect.foreach(println)
If you run that example you will get Set() as the result just like the original answer states.
However if you were to run the exact same thing with a cache call:
val rdd = sc.parallelize(Seq(new mutable.HashSet[Int]()))
rdd.cache
rdd.foreach(_+=1)
rdd.collect.foreach(println)
Now the result will print as Set(1). So it depends on whether the data is being cached in memory. If Spark is recomputing from source or reading from a serialized copy on disk, it will always reset to the original object and appear to be immutable, but if it is not loading from a serialized form, the mutations will in fact stick.

RDDs are immutable. By using map, you can iterate over the RDD and return a new one (this requires modifier to return the updated object rather than Unit).
val rdd2 = rdd1.map(x=>x.modifier())

I have observed that code like yours will work after calling RDD.persist when running in spark/yarn. It is probably unsupported/accidental behavior and you should avoid it - but it is a workaround that may help in a pinch. I'm running version 1.5.0.

Related

Broadcasting updates on spark jobs

I've seen this question asked here but they essentially focus on spark streaming and I can't find a proper solution to work on batch. The idea is to loop through several days and at each iteration/day it updates the information about the previous day (that is used for the current iteration). The code looks like the following:
var prevIterDataRdd = // some RDD
days.foreach(folder => {
val previousData : Map[String, Double] = parseResult(prevIterDataRdd)
val broadcastMap = sc.broadcast(previousData)
val (result, previousStatus) =
processFolder(folder, broadcastMap)
// store result
result.write.csv(outputPath)
// updating the RDD that enables me to extract previousData to update broadcast
val passingPrevStatus = prevIterDataRdd.subtractByKey(previousStatus)
prevIterDataRdd = previousStatus.union(passingPrevStatus)
broadcastMap.unpersist(true)
broadcastMap.destroy()
})
Calling broadcastMap.destroy() makes the job fail because it does not let me use the broadcastMap again (which I actually don't understand, because the next iteration's broadcast should be totally unrelated and immutable).
How should I run this loop and update the broadcast variable at each iteration?
When using method unpersist I pass the true argument in order to make it blocking. Is sc.broadcast() also blocking?
Do I really need unpersist() if I'm anyway broadcasting again?
Why can't I use the broadcast again after using destroy given that I'm creating a new broadcast variable?
Broadcast variables are immutable but you can create a new broadcast variable.
This new broadcast variable can be used in the next iteration.
All you need to do is to change the reference to the newly created broadcast, unpersist the old broadcast from the executors and destroy it from the driver.
Define the variable at class level which will allow you to change the reference of broadcast variable in driver and use the destroy method.
object Main extends App {
// defined and initialized at class level to allow reference change
var previousData: Map[String, Double] = null
override def main(args: Array[String]): Unit = {
//your code
}
}
You were not allowed to use the destroy method on the variable because the reference no longer exists in the driver. Changing the reference to the new broadcast variable can resolve the issue.
unpersist only removes the data from the executors, so when the variable is accessed again, the driver resends it to the executors.
blocking = true makes the call wait until the data has been completely removed from the executors before the application continues.
sc.broadcast(): there is no official documentation saying that it is blocking. However, as soon as it is called the application will start broadcasting the data to the executors before running the next line of code, so if the data is very large it may slow down your application. Be careful how you use it.
It is good practice to call unpersist before destroy. This helps remove the data completely from both the executors and the driver.
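A rough sketch of the loop with one fresh broadcast per iteration is below. It assumes spark-core is on the classpath and that the state carried to the next day is small enough to be collected to the driver as a Map. BroadcastLoop, run and step are hypothetical names; step stands in for processFolder plus parseResult from the question and must finish its Spark action (write, collect, ...) before it returns. Once the state is a plain Map on the driver, nothing computed later still needs the old broadcast, so destroying it should be safe.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object BroadcastLoop {
  // `step` is a hypothetical callback: it uses the broadcast for one day and
  // returns the fully materialized state for the next day.
  def run(sc: SparkContext, days: Seq[String], initial: Map[String, Double])(
      step: (String, Broadcast[Map[String, Double]]) => Map[String, Double]
  ): Map[String, Double] = {
    var state = initial
    for (day <- days) {
      val bc = sc.broadcast(state)        // new broadcast variable for this day
      try {
        state = step(day, bc)             // all jobs that read bc run in here
      } finally {
        bc.unpersist(blocking = true)     // drop the copies held by executors
        bc.destroy()                      // release it on the driver as well
      }
    }
    state
  }
}
```

If the state has to stay an RDD instead, its lineage may still reference the old broadcast, in which case the destroy call would have to wait until that RDD has been checkpointed or is no longer needed.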

spark: Attempted to use Broadcast after it was destroyed

The following code works
@throws(classOf[IKodaMLException])
def soMergeTarget1( oldTargetIdx: Double, newTargetIdx: Double): RDDLabeledPoint =
{
try
{
logger.trace("\n\n--sparseOperationRenameTargetsInNumeriOrder--\n\n")
val oldTargetIdxb=spark.sparkContext.broadcast(oldTargetIdx)
val newTargetIdxb=spark.sparkContext.broadcast(newTargetIdx)
val newdata:RDD[(LabeledPoint,Int,String)] = sparseData.map
{
r =>
val currentLabel: Double = r._1.label
currentLabel match
{
case x if x == oldTargetIdxb.value =>
val newtrgt=newTargetIdxb.value
(new LabeledPoint(newtrgt, r._1.features), r._2, r._3)
case _ => r
}
}
val newtargetmap=ilp.targetMap.filter(e=> !(e._2 == oldTargetIdx))
oldTargetIdxb.destroy
newTargetIdxb.destroy
new RDDLabeledPoint(newdata,copyColumnMap,newtargetmap,ilp.name)
}
But, having destroyed the broadcast variables at the end of the method, the newtrgt variable in the RDD is also destroyed.
The trouble is that once the RDD is returned from this method it could be used by any analyst in any code. So, I seem to have lost all control of the broadcast variables.
Questions:
If I don't destroy the variables, will spark destroy them when reference to the RDD disappears?
(Perhaps a naive question but....) I tried a little hack val newtrgt=oldTargetIdxb.value + 1 -1 thinking that might create a new reference that is distinct from the broadcast variable. It didn't work. I must admit that surprised me. Can someone explain why the hack didn't work (I'm not suggesting it was a good idea, but I am curious).
I found an answer here
Not my answer but worth sharing on SO...and why can't I see this in Spark documentation. It's important:
Sean Owen:
you want to actively unpersist() or destroy() broadcast
variables when they're no longer needed. They can eventually be
removed when the reference on the driver is garbage collected, but
you usually would not want to rely on that.
Follow up question:
Thank you for the response. The only problem is that actively managing
broadcast variables require to return the broadcast variables to the
caller if the function that creates the broadcast variables does not
contain any action. That is the scope that uses the broadcast
variables cannot destroy the broadcast variables in many cases. For
example:
==============
def performTransformation(rdd: RDD[Int]) = {
val sharedMap = sc.broadcast(map)
rdd.map{id =>
val localMap = sharedMap.value
(id, localMap(id))
}
}
def main = {
....
performTransformation(rdd).toDF("id", "i").write.parquet("dummy_example")
}
==============
In this example above, we cannot destroy the sharedMap before the
write.parquet is executed because RDD is evaluated lazily. We will get
an exception.
Sean Owen:
Yes, although there's a difference between unpersist and destroy,
you'll hit the same type of question either way. You do indeed have to
reason about when you know the broadcast variable is no longer needed
in the face of lazy evaluation, and that's hard.
Sometimes it's obvious and you can take advantage of this to
proactively free resources. You may have to consider restructuring the
computation to allow for more resources to be freed, if this is
important to scale.
Keep in mind that things that are computed and cached may be lost and
recomputed even after their parent RDDs were definitely already
computed and don't seem to be needed. This is why unpersist is often
the better thing to call because it allows for variables to be
rebroadcast if needed in this case. Destroy permanently closes the
broadcast.
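For the specific method in the question, where the two broadcast values are single Doubles, one way to sidestep the lifecycle problem might be to not broadcast them at all. Plain local values referenced inside map are serialized together with the task closure, so the returned RDD carries them wherever it is used and there is nothing to destroy. The sketch below assumes spark-core on the classpath and uses a simplified (label, payload) pair instead of the question's (LabeledPoint, Int, String) tuple; Relabel and mergeTarget are made-up names.

```scala
import org.apache.spark.rdd.RDD

object Relabel {
  // oldLabel and newLabel are captured by the closure and shipped with each task,
  // so the returned RDD does not depend on any broadcast handle.
  def mergeTarget(
      data: RDD[(Double, String)],
      oldLabel: Double,
      newLabel: Double
  ): RDD[(Double, String)] =
    data.map { case (label, payload) =>
      if (label == oldLabel) (newLabel, payload) else (label, payload)
    }
}
```

Broadcasting still makes sense for large lookup structures; for those, the caller that runs the action is the one in a position to unpersist or destroy.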

How to print a String or String[Array] in Scala(spark)?

I'm trying to unit test the values returned in a String, but when I'm trying to print the console gives
MapPartitionsRDD[32]
My code is as follows:
UPDATED:
val src = exact_bestmatch_src.filter(line => line.split(",")(0).toInt.equals(i))
val dest = exact_bestmatch_Dest.filter(line => line.split(",")(0).toInt.equals(i)).toArray()
for (print1 <- src) {
var n1:String = src.toString()
var sourceArr: Array[String] = n1.split(",")
for (print2 <- dest) {
var n2: String = dest.toString()
for (i <- 0 until sourceArr.length) {
if (n1.split(",")(i).equals(n2.split(",")(i))) {
}
}
I also tried println(n1.mkstring())
I'm trying to compare the src and dest RDDs to find the differences between their rows.
If you want to see each record in the RDD printed as a separate line, you can use:
src.foreach(println)
This will run the println function on each record, within the executor that holds it (which might be several different executors). If this runs in some test using Spark's "local" mode, there's only one "executor" and it's the same process as the driver, so that's not a problem.
Alternatively, if you do have more than one executor (non-local mode) and you want to make sure the RDD's elements are printed to the driver console, you can first collect the RDD's elements into a local collection and then print them:
src.collect().foreach(println)
NOTE that this assumes the RDD is small enough to be collected into a single machine's memory.
Calling toString on an RDD does not access the RDD's data (it might be too large to fit as a String in the driver machine's memory). As you observed, it just prints the type of the RDD and its ID.
You don't have a list or array. You'd need to collect() an RDD in order to get one, or you need to iterate it via foreach.
Calling println on any object already calls the toString method for it, by the way. And RDD doesn't have a mkString method.
Calling toString on src just means you are getting a string representation which can be anything. For RDD this is not the content of the RDD (as this would require getting all the content of the RDD to the driver and printing it at once).
As others have mentioned, in order to print the content of the RDD you need to first get all the data to the driver.
Let's consider the simple solution already proposed:
src.collect().foreach(println)
The first part, collect, tells Spark to fetch all the content of the RDD and bring it to the driver as an array of records. foreach then tells Scala to go over each record in that array and pass it as the argument to println, which prints it. You could of course use mkString instead of foreach to get a single string.
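Once the rows are local strings, the field-by-field comparison the question is after can be done with ordinary Scala. A minimal sketch, assuming the rows are comma-separated as in the question; RowDiff and diffFields are made-up names, and no Spark is needed for this part:

```scala
object RowDiff {
  // Returns (fieldIndex, leftValue, rightValue) for every comma-separated
  // field that differs; a missing field is treated as the empty string.
  def diffFields(left: String, right: String): Seq[(Int, String, String)] = {
    val l = left.split(",", -1)   // limit -1 keeps trailing empty fields
    val r = right.split(",", -1)
    (0 until math.max(l.length, r.length)).flatMap { i =>
      val a = if (i < l.length) l(i) else ""
      val b = if (i < r.length) r(i) else ""
      if (a == b) None else Some((i, a, b))
    }
  }

  def main(args: Array[String]): Unit =
    diffFields("1,apple,red", "1,apple,green").foreach(println)
}
```

With the RDDs from the question, src.collect() and dest.collect() would give the local arrays whose elements can be passed to diffFields, again assuming both are small enough for the driver.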

Subtract an RDD from another RDD doesn't work correctly

I want to subtract an RDD from another RDD. I looked into the documentation and I found that subtract can do that. Actually, when I tested subtract, the final RDD remains the same and the values are not removed!
Is there any other function to do that? Or am I using subtract incorrectly?
Here is the code that I used:
val vertexRDD: org.apache.spark.rdd.RDD[(VertexId, Array[Int])]
val clusters = vertexRDD.takeSample(false, 3)
val clustersRDD: RDD[(VertexId, Array[Int])] = sc.parallelize(clusters)
val remaining = vertexRDD.subtract(clustersRDD)
remaining.collect().foreach(println(_))
Performing set operations like subtract on elements that contain mutable types (Array in this example) is usually unsupported, or at least not recommended.
Try using an immutable type instead.
I believe WrappedArray is the relevant container for storing arrays in sets, but I'm not sure.
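The underlying reason, as far as I can tell, is that a JVM array compares by reference and hashes by identity, so two arrays with the same contents are never considered equal by set-like operations. The small sketch below shows this without Spark; ArrayEquality and asKey are made-up names.

```scala
object ArrayEquality {
  // Converting to Seq gives structural equals/hashCode.
  def asKey(a: Array[Int]): Seq[Int] = a.toSeq

  def main(args: Array[String]): Unit = {
    val a = Array(1, 2, 3)
    val b = Array(1, 2, 3)
    println(a.sameElements(b))          // element-wise comparison
    println(Set(a, b).size)             // arrays are kept as two distinct entries
    println(Set(asKey(a), asKey(b)).size) // Seqs with equal contents collapse to one
  }
}
```

For the code in the question, something along the lines of vertexRDD.mapValues(_.toSeq).subtract(clustersRDD.mapValues(_.toSeq)) would compare the values by content instead of by reference.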
If your RDD is composed of mutable objects it won't work. The problem is that it won't show an error either, so this kind of problem is hard to identify. I had a similar one yesterday and used a workaround.
// Key both RDDs by the same immutable value first:
//   rdd.keyBy( someImmutableValue )
val resultRDD = rdd.subtractByKey(otherRDD).values
Recently I tried the subtract operation on 2 RDDs (of array lists) and it works. The important point is the order: the RDD you call .subtract on is the one you are subtracting from, and the argument is the RDD of elements to remove.
Correct: val result = fromList.subtract(theElementYouWantToSubtract)
Incorrect: val result = theElementYouWantToSubtract.subtract(fromList) (will not give any compile/runtime error message)

How can one spread the result of foreachRDD

let's say I have a code as follows :
var model = initiliazeModel(some_params)
dstream.foreachRDD { rdd =>
model = model.update(rdd)
println(model)
}
println(model) // or doing some thing on the model
My problem is that even if the first println gives the desired result ie. the model up-to-date, the second println displays the initialized model and not the updated one !!!
My question is how can I spread the updated model outside the block foreachRDD ?!
I also think of a synchronization problem because the 2nd println is run before the 1st one !!!
Thanks for help !
You have a common misconception here. In general, when you call map, filter, or any other transformation (or register a function with foreachRDD on a DStream), you are not executing anything just yet. Your closures are recorded and the stages configured, but everything is evaluated lazily. Your main program proceeds, either adding more configuration or doing other things, without waiting for the computations to finish. Thus, when your program reaches your second println (milliseconds later), the model has not changed, nor has the other println been called.
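To make the ordering concrete, here is a toy model in plain Scala. It is not Spark's implementation, just an analogy: ToyStream, foreachBatch and start are invented names standing in for a DStream, foreachRDD and the streaming context's start.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for a DStream: foreachBatch only REGISTERS the callback,
// nothing runs until start() is called.
final class ToyStream[A](batches: Seq[Seq[A]]) {
  private val callbacks = ArrayBuffer.empty[Seq[A] => Unit]
  def foreachBatch(f: Seq[A] => Unit): Unit = { callbacks += f; () }
  def start(): Unit = for (b <- batches; f <- callbacks) f(b)
}

object DeferredDemo {
  // Returns the model value seen right after registration and after start().
  def run(): (Int, Int) = {
    var model = 0
    val stream = new ToyStream(Seq(Seq(1, 2), Seq(3)))
    stream.foreachBatch(batch => model += batch.sum)
    val beforeStart = model   // callback has not run yet
    stream.start()
    (beforeStart, model)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

In the same way, code placed directly after dstream.foreachRDD { ... } in the driver runs once, before any batch has been processed, so anything that needs the updated model has to happen inside the foreachRDD block or after batches have actually run.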
In Scala, I have no idea, but in Java, you can enclose your foreach and model variable within a class as a static members and then use model variable after the success of foreach in another class.
Accumulators are shared variables in Spark. Tasks running on any executor can add to an accumulator, and the updates are merged back on the driver; only the driver program can read the accumulated value.
val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
Creating and initializing the accumulator:
val accumulator = sc.accumulator(0)
Adding to the accumulator:
accumulator.add(1)
Accessing the latest value (on the driver):
accumulator.value
hope this helps
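On Spark 2.0 and later, sc.accumulator is deprecated in favour of sc.longAccumulator. A minimal sketch of the same idea with that API, assuming spark-core on the classpath (AccumulatorDemo and countRecords are made-up names):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorDemo {
  // Counts records by adding to an accumulator inside an action,
  // then reads the merged value on the driver.
  def countRecords(sc: SparkContext, n: Int): Long = {
    val counter = sc.longAccumulator("records")
    sc.parallelize(1 to n).foreach(_ => counter.add(1L)) // runs on executors
    counter.value                                        // read on the driver only
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("acc-demo"))
    try println(countRecords(sc, 100)) finally sc.stop()
  }
}
```

Note that this only carries simple, mergeable values such as counters and sums back to the driver; it is not a general way to share an updated model object.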