Broadcasting updates on spark jobs - scala

I've seen this question asked here but they essentially focus on spark streaming and I can't find a proper solution to work on batch. The idea is to loop through several days and at each iteration/day it updates the information about the previous day (that is used for the current iteration). The code looks like the following:
var prevIterDataRdd = // some RDD
days.foreach(folder => {
val previousData : Map[String, Double] = parseResult(prevIterDataRdd)
val broadcastMap = sc.broadcast(previousData)
val (result, previousStatus) =
processFolder(folder, broadcastMap)
// store result
result.write.csv(outputPath)
// updating the RDD that enables me to extract previousData to update broadcast
val passingPrevStatus = prevIterDataRdd.subtractByKey(previousStatus)
prevIterDataRdd = previousStatus.union(passingPrevStatus)
broadcastMap.unpersist(true)
broadcastMap.destroy()
})
Using broadcastMap.destroy() does not run because it does not let me use the broadcastMap again (which I actually don't understand because it should be totally unrelated - immutable).
How should I run this loop and update the broadcast variable at each iteration?
When using method unpersist I pass the true argument in order to make it blocking. Is sc.broadcast() also blocking?
Do I really need unpersist() if I'm anyway broadcasting again?
Why can't I use the broadcast again after using destroy given that I'm creating a new broadcast variable?

Broadcast variables are immutable but you can create a new broadcast variable.
This new broadcast variable can be used in the next iteration.
All you need to do is to change the reference to the newly created broadcast, unpersist the old broadcast from the executors and destroy it from the driver.
Define the variable at class level which will allow you to change the reference of broadcast variable in driver and use the destroy method.
object Main extends App {
// defined and initialized at class level to allow reference change
var previousData: Map[String, Double] = null
override def main(args: Array[String]): Unit = {
//your code
}
}
You were not allowed to use the destroy method on the variable because the reference no longer exists in the driver. Changing the reference to the new broadcast variable can resolve the issue.
Unpersist only removes data from the executors and hence when the variable is re-accessed, the driver resends it to the executors.
blocking = true will allow you let the application completely remove the data from the executor before the next access.
sc.broadcast() - There is no official documentation saying that it is blocking. Although as soon as it is called the application will start broadcasting the data to the executors before running the next line of the code .So if the data is very large it may slow down your application. So be care full on how you are using it .
It is a good practice to call unpersist before destroy.This will help you get rid of data completely from executors and driver.

Related

Apache flink broadcast state gets flushed

I am using the broadcast pattern to connect two streams and read data from one to another. The code looks like this
case class Broadcast extends BroadCastProcessFunction[MyObject,(String,Double), MyObject]{
override def processBroadcastElement(in2: (String, Double),
context: BroadcastProcessFunction[MyObject, (String, Double), MyObject]#Context,
collector:Collector[MyObject]):Unit={
context.getBroadcastState(broadcastStateDescriptor).put(in2._1,in2._2)
}
override def processElement(obj: MyObject,
readOnlyContext:BroadCastProcessFunction[MyObject, (String,Double),
MyObject]#ReadOnlyContext, collector: Collector[MyObject]):Unit={
val theValue = readOnlyContext.getBroadccastState(broadcastStateDesriptor).get(obj.prop)
//If I print the context of the state here sometimes it is empty.
out.collect(MyObject(new, properties, go, here))
}
}
The state descriptor:
val broadcastStateDescriptor: MapStateDescriptor[String, Double) = new MapStateDescriptor[String, Double]("name_for_this", classOf[String], classOf[Double])
My execution code looks like this.
val streamA :DataStream[MyObject] = ...
val streamB :DataStream[(String,Double)] = ...
val broadcastedStream = streamB.broadcast(broadcastStateDescriptor)
streamA.connect(streamB).process(new Broadcast)
The problem is in the processElement function the state sometimes is empty and sometimes not. The state should always contain data since I am constantly streaming from a file that I know it has data. I do not understand why it is flushing the state and I cannot get the data.
I tried adding some printing in the processBroadcastElement before and after putting the data to the state and the result is the following
0 - 1
1 - 2
2 - 3
.. all the way to 48 where it resets back to 0
UPDATE:
something that I noticed is when I decrease the value of the timeout of the streaming execution context, the results are a bit better. when I increase it then the map is always empty.
env.setBufferTimeout(1) //better results
env.setBufferTimeout(200) //worse result (default is 100)
Whenever two streams are connected in Flink, you have no control over the timing with which Flink will deliver events from the two streams to your user function. So, for example, if there is an event available to process from streamA, and an event available to process from streamB, either one might be processed next. You cannot expect the broadcastedStream to somehow take precedence over the other stream.
There are various strategies you might employ to cope with this race between the two streams, depending on your requirements. For example, you could use a KeyedBroadcastProcessFunction and use its applyToKeyedState method to iterate over all existing keyed state whenever a new broadcast event arrives.
As David mentioned the job could be restarting. I disabled the checkpoints so I could see any possible exception thrown instead of flink silently failing and restarting the job.
It turned out that there was an error while trying to parse the file. So the job kept restarting thus the state was empty and flink kept consuming the stream over and over again.

spark: Attempted to use Broadcast after it was destroyed

The following code works
#throws(classOf[IKodaMLException])
def soMergeTarget1( oldTargetIdx: Double, newTargetIdx: Double): RDDLabeledPoint =
{
try
{
logger.trace("\n\n--sparseOperationRenameTargetsInNumeriOrder--\n\n")
val oldTargetIdxb=spark.sparkContext.broadcast(oldTargetIdx)
val newTargetIdxb=spark.sparkContext.broadcast(newTargetIdx)
val newdata:RDD[(LabeledPoint,Int,String)] = sparseData.map
{
r =>
val currentLabel: Double = r._1.label
currentLabel match
{
case x if x == oldTargetIdxb.value =>
val newtrgt=newTargetIdxb.value
(new LabeledPoint(newtrgt, r._1.features), r._2, r._3)
case _ => r
}
}
val newtargetmap=ilp.targetMap.filter(e=> !(e._2 == oldTargetIdx))
oldTargetIdxb.destroy
newTargetIdxb.destroy
new RDDLabeledPoint(newdata,copyColumnMap,newtargetmap,ilp.name)
}
But, having destroyed the broadcast variables at the end of the method, the newtrgt variable in the RDD is also destroyed.
The trouble is that once the RDD is returned from this method it could be used by any analyst in any code. So, I seem to have lost all control of the broadcast variables.
Questions:
If I don't destroy the variables, will spark destroy them when reference to the RDD disappears?
(Perhaps a naive question but....) I tried a little hack val newtrgt=oldTargetIdxb.value + 1 -1 thinking that might create a new reference that is distinct from the broadcast variable. It didn't work. I must admit that surprised me. Can someone explain why the hack didn't work (I'm not suggesting it was a good idea, but I am curious).
I found an answer here
Not my answer but worth sharing on SO...and why can't I see this in Spark documentation. It's important:
Sean Owen:
you want to actively unpersist() or destroy() broadcast
variables when they're no longer needed. They can eventually be
removed when the reference on the driver is garbage collected, but
you usually would not want to rely on that.
Follow up question:
Thank you for the response. The only problem is that actively managing
broadcast variables require to return the broadcast variables to the
caller if the function that creates the broadcast variables does not
contain any action. That is the scope that uses the broadcast
variables cannot destroy the broadcast variables in many cases. For
example:
==============
def perfromTransformation(rdd: RDD[int]) = {
val sharedMap = sc.broadcast(map)
rdd.map{id =>
val localMap = sharedMap.vlaue
(id, localMap(id))
}
}
def main = {
....
performTransformation(rdd).toDF("id", "i").write.parquet("dummy_example")
}
==============
In this example above, we cannot destroy the sharedMap before the
write.parquet is executed because RDD is evaluated lazily. We will get
a exception
Sean Owen:
Yes, although there's a difference between unpersist and destroy,
you'll hit the same type of question either way. You do indeed have to
reason about when you know the broadcast variable is no longer needed
in the face of lazy evaluation, and that's hard.
Sometimes it's obvious and you can take advantage of this to
proactively free resources. You may have to consider restructuring the
computation to allow for more resources to be freed, if this is
important to scale.
Keep in mind that things that are computed and cached may be lost and
recomputed even after their parent RDDs were definitely already
computed and don't seem to be needed. This is why unpersist is often
the better thing to call because it allows for variables to be
rebroadcast if needed in this case. Destroy permanently closes the
broadcast.

modifying RDD of object in spark (scala)

I have:
val rdd1: RDD[myClass]
it has been initialized, i checked while debugging all the members have got thier default values
If i do
rdd1.foreach(x=>x.modifier())
where modifier is a member function of myClass which modifies some of the member variables
After executing this if i check the values inside the RDD they have not been modified.
Can someone explain what's going on here?
And is it possible to make sure the values are modified inside the RDD?
EDIT:
class myClass(var id:String,var sessions: Buffer[Long],var avgsession: Long) {
def calcAvg(){
// calculate avg by summing over sessions and dividing by legnth
// Store this average in avgsession
}
}
The avgsession attribute is not updating if i do
myrdd.foreach(x=>x.calcAvg())
RDD are immutable, calling a mutating method on the objects it contains will not have any effect.
The way to obtain the result you want is to produce new copies of MyClass instead of modifying the instance:
case class MyClass(id:String, avgsession: Long) {
def modifier(a: Int):MyClass =
this.copy(avgsession = this.avgsession + a)
}
Now you still cannot update rdd1, but you can obtain rdd2 that will contain the updated instances:
rdd2 = rdd1.map (_.modifier(18) )
The answer to this question is slightly more nuanced than the original accepted answer here. The original answer is correct only with respect to data that is not cached in memory. RDD data that is cached in memory can be mutated in memory as well and the mutations will remain even though the RDD is supposed to be immutable. Consider the following example:
val rdd = sc.parallelize(Seq(new mutable.HashSet[Int]()))
rdd.foreach(_+=1)
rdd.collect.foreach(println)
If you run that example you will get Set() as the result just like the original answer states.
However if you were to run the exact same thing with a cache call:
val rdd = sc.parallelize(Seq(new mutable.HashSet[Int]()))
rdd.cache
rdd.foreach(_+=1)
rdd.collect.foreach(println)
Now the result will print as Set(1). So it depends on whether the data is being cached in memory. If spark is recomputing from source or reading from a serialized copy on disk, then it will always reset back to the original object and appear to be immutable but if it's not loading from a serialized form then the mutations will in fact stick.
Objects are immutable. By using map, you can iterate over the rdd and return a new one.
val rdd2 = rdd1.map(x=>x.modifier())
I have observed that code like yours will work after calling RDD.persist when running in spark/yarn. It is probably unsupported/accidental behavior and you should avoid it - but it is a workaround that may help in a pinch. I'm running version 1.5.0.

How can one spread the result of foreachRDD

let's say I have a code as follows :
var model = initiliazeModel(some_params)
dstream.foreachRDD { rdd =>
model = model.update(rdd)
println(model)
}
println(model) // or doing some thing on the model
My problem is that even if the first println gives the desired result ie. the model up-to-date, the second println displays the initialized model and not the updated one !!!
My question is how can I spread the updated model outside the block foreachRDD ?!
I also think of a synchronization problem because the 2nd println is run before the 1st one !!!
Thanks for help !
You have a common misconception here. In general, when you call map, filter, foreach, and any other transformation, you are not executing anything just yet. You closures are sent to executors and the stages configured, but all things are evaluated lazily. Your main program proceed ahead, either adding more configuration or other things, not waiting for all computations to be done. Thus, when your program reaches your second println (miliseconds after), the model has not changed nor has any other println been called.
In Scala, I have no idea, but in Java, you can enclose your foreach and model variable within a class as a static members and then use model variable after the success of foreach in another class.
Accumulators are global in Spark. you can update the accumulator variable anywhere in the Program and it gets reflects everywhere regardless of wether it is different executor or driver program.
val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
creating and initilizing Accumulator
val accumulator = sc.accumulator(0)
Initializing the accumulator
accumulator.add(1)
accessing the latest value
accumulator.value
hope this helps

Spark streams: enrich stream with reference data

I have spark streaming set up so that it reads from a socket, does some enrichment of the data before publishing it on a rabbit queue.
The enrichment looks up information from a Map that was instantiated by reading a regular text file (Source.fromFile...) before setting up the streaming context.
I have a feeling that this is not really the way it should be done. On the other hand, when using a StreamingContext, I can only read from streams, not from static files as I would be able to do with a SparkContext.
I could try to allow multiple contexts but I'm not sure if this is the right way either.
Any advice would be greatly appreciated.
Making the assumption that the map being used for enrichment is fairly small to be held in memory, a recommended way to use that data in a Spark job is through Broadcast variables. The content of such variable will be sent once to each executor, avoiding in that way overhead of serializing datasets captured in a closure.
Broadcast variables are wrappers instantiated in the driver and the data is 'unwrapped' using the broadcastVar.value method in a closure.
This would be an example of how to use broadcast variables with a DStream:
// could replace with Source.from File as well. This is just more practical
val data = sc.textFile("loopup.txt").map(toKeyValue).collectAsMap()
// declare the broadcast variable
val bcastData = sc.broadcast(data)
... initialize streams ...
socketDStream.map{ elem =>
// doing every step here explicitly for illustrative purposes. Usually, one would typically just chain these calls
// get the map within the broadcast wrapper
val lookupMap = bcastData.value
// use the map to lookup some data
val lookupValue = lookupMap.getOrElse(elem, "not found")
// create the desired result
(elem, lookupValue)
}
socketDStream.saveTo...
If your file is small and not on a distributed file system, Source.fromFile is fine (whatever gets the job done).
If you want to read files via the SparkContext, you can still access it via streamingContext.sparkContext and combine it with the DStream in transform or foreachRDD.