Can we keep state information across workers through use of CollectionAccumulator[Int]?
The RDD is huge.
Below is a sample code snippet:
val valueColl: CollectionAccumulator[Int] = spark.sparkContext.collectionAccumulator("myValue")
rdd.foreachPartition(partition => {
  partition.foreach { v =>
    val temp: java.util.List[Int] = valueColl.value
    if (!temp.contains(v)) {
      valueColl.add(v)
    }
  }
})
Unfortunately this is not possible. You cannot access the value of an accumulator within an executor process.
From the docs:
Only the driver program can read the accumulator’s value, using its value method.
An accumulator is used to collect data from the executor processes. Each executor contains one or more instances of the accumulator, and each instance only sees the values that have been collected within its own process. Only when the different instances are sent back to the driver process are they reduced to a single final value, which can then be used on the driver.
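As a minimal sketch of the supported pattern, reusing the names from the question (spark, rdd, valueColl): the executors only add to the accumulator, and the driver reads the merged value after the action has finished, which is also where any de-duplication has to happen.
import scala.collection.JavaConverters._
import org.apache.spark.util.CollectionAccumulator

val valueColl: CollectionAccumulator[Int] = spark.sparkContext.collectionAccumulator[Int]("myValue")

rdd.foreachPartition { partition =>
  partition.foreach(v => valueColl.add(v))   // write-only on the executors
}

// Only valid on the driver, after the action has completed; duplicates are removed here.
val distinctValues: Set[Int] = valueColl.value.asScala.toSet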
Related
I use GraphX for processing a graph. I have used GraphLoader to load it, and I created a variable that contains the neighbors of each node using the code below:
val all_neighbors: VertexRDD[Array[VertexId]] = graph.collectNeighborIds(EdgeDirection.Either).cache()
Because I frequently need the neighbors of nodes, I decided to broadcast them. When I use this code, I get an error:
val broadcastVar = sc.broadcast(all_neighbors)
but when I use this code there is no error:
val broadcastVar = sc.broadcast(all_neighbors.collect())
Is it right to use collect() for broadcasting?
And one more question: I want to change this broadcast variable to key/value form. Is this code right?
val nvalues = broadcastVar.value.toMap
Does the above code (I mean nvalues) broadcast to all slaves in the cluster? Should I broadcast nvalues too? I am a little bit confused with the broadcast subject. Please help me with this problem.
There are two questions:
Is it right to use collect() for broadcasting?
all_neighbors is of type VertexRDD, which is essentially an RDD, and there is nothing in an RDD itself that you could broadcast. An RDD is a data structure that describes a distributed computation over some dataset: it specifies what to compute and how, but it is an abstract entity whose values only exist while the executors are processing their data. You can only broadcast a real, local value.
quoting from Broadcast Variables:
Broadcast variables allow the programmer to keep a read-only variable
cached on each machine rather than shipping a copy of it with tasks.
They can be used, for example, to give every node a copy of a large
input dataset in an efficient manner.
This means that explicitly creating broadcast variables is only useful
when tasks across multiple stages need the same data or when caching
the data in deserialized form is important.
That's the reason we need to collect the dataset that the RDD holds, which turns the RDD into a locally available collection that can then be broadcast.
Note: when you perform the collect operation, the data is accumulated on the driver node and only then broadcast. So if the driver node does not have enough memory, it will throw errors.
Does the above code (I mean nvalues) broadcast to all slaves in the cluster? Should I broadcast nvalues too?
It totally depends on your use case. If you only want to use broadcastVar, then broadcast only that; if you want to use nvalues, broadcast only nvalues; or you can broadcast both, though you need to be careful about memory constraints.
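As a minimal sketch of the pattern described above (collect first, convert to a local Map, then broadcast the Map once), reusing the names from the question; sc and graph are assumed from the question's setup:
import org.apache.spark.graphx.{EdgeDirection, VertexId, VertexRDD}

val all_neighbors: VertexRDD[Array[VertexId]] =
  graph.collectNeighborIds(EdgeDirection.Either).cache()

// collect() brings the data to the driver; broadcast the resulting local Map, not the RDD.
val nvalues: Map[VertexId, Array[VertexId]] = all_neighbors.collect().toMap
val broadcastVar = sc.broadcast(nvalues)

// On the executors, read it through .value inside transformations, e.g.:
// someRdd.map(id => broadcastVar.value.getOrElse(id, Array.empty[VertexId]))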
Let me know if it helps!!
So, basically I want multiple tasks running on the same node/executor to read data from shared memory. For that I need some initialization function that would load the data into memory before the tasks are started. If Spark provided a hook for executor startup, I could put this initialization code in that callback function, with the tasks only running after this startup is completed.
So, my question is: does Spark provide such hooks? If not, how else can I achieve the same thing?
Spark's solution for "shared data" is using broadcast, where you load the data once in the driver application and Spark serializes it and sends it to each of the executors (once). If a task uses that data, Spark will make sure it's there before the task is executed. For example:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object MySparkTransformation {
  def transform(rdd: RDD[String], sc: SparkContext): RDD[Int] = {
    val mySharedData: Map[String, Int] = loadDataOnce()   // loaded once, on the driver
    val broadcast = sc.broadcast(mySharedData)
    rdd.map(r => broadcast.value(r))                      // executors read the broadcast copy
  }
}
Alternatively, if you want to avoid reading the data into driver memory and sending it over to the executors, you can use lazy values in a Scala object to create a value that gets populated once per JVM, which in Spark's case is once per executor. For example:
// Must be an object; otherwise it would be serialized and sent from the driver.
object MySharedResource {
  lazy val mySharedData: Map[String, Int] = loadDataOnce()
}

// If you use mySharedData in a Spark transformation,
// the "local" copy in each executor will be used:
object MySparkTransformation {
  def transform(rdd: RDD[String]): RDD[Int] = {
    // Spark won't include MySharedResource.mySharedData in the
    // serialized task sent from the driver, since it's "static".
    rdd.map(r => MySharedResource.mySharedData(r))
  }
}
In practice, you'll have one copy of mySharedData in each executor.
You don't have to run multiple instances of the application to be able to run multiple jobs; the same SparkSession object can be used by multiple threads to submit Spark jobs in parallel.
So it may work like this (a minimal sketch follows below):
The application starts up and runs an initialization function to load the shared data into memory, say into a SharedData class object.
A SparkSession is created.
A thread pool is created; each thread has access to the (SparkSession, SharedData) objects.
Each thread creates Spark jobs using the shared SparkSession and SharedData objects.
Depending on your use case, the application then does one of the following:
waits for all jobs to complete and then closes the SparkSession, or
waits in a loop for new requests to arrive and creates new Spark jobs as necessary, using threads from the thread pool.
SparkContext (sparkSession.sparkContext) is useful when you want to do per-thread things like assigning a job description using setJobDescription or assigning a group to the job using setJobGroup, so that related jobs can be cancelled simultaneously using cancelJobGroup. You can also tweak the priority of jobs that use the same pool; see https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application for details.
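A minimal sketch of this setup, under the assumptions above; SharedData, its contents, and the dummy jobs are hypothetical stand-ins, not from the original post:
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}
import org.apache.spark.sql.SparkSession

// Hypothetical container for the data loaded once at application startup.
case class SharedData(lookup: Map[String, Int])

object MultiThreadedJobs {
  def main(args: Array[String]): Unit = {
    // Load the shared data once, then create the SparkSession.
    val shared = SharedData(Map("a" -> 1, "b" -> 2))
    val spark = SparkSession.builder().appName("shared-session-demo").getOrCreate()

    // Thread pool whose threads all see the same (SparkSession, SharedData).
    val pool = Executors.newFixedThreadPool(4)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    // Each thread submits its own Spark job through the shared SparkSession.
    val jobs = (1 to 4).map { i =>
      Future {
        spark.sparkContext.setJobDescription(s"job-$i")
        spark.sparkContext.parallelize(1 to 100).map(_ + shared.lookup.size).sum()
      }
    }

    // Wait for all jobs to complete, then shut down.
    jobs.foreach(j => Await.result(j, 10.minutes))
    pool.shutdown()
    spark.stop()
  }
}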
The code for testing:
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

object MaxValue extends Serializable {
  var max = 0
}

object Test {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext
    val ssc = new StreamingContext(sc, Seconds(5))
    val seq = Seq("testData")
    val rdd = ssc.sparkContext.parallelize(seq)
    val inputDStream = new ConstantInputDStream(ssc, rdd)
    inputDStream.foreachRDD(rdd => { MaxValue.max = 10 }) // I change MaxValue.max to 10.
    val map = inputDStream.map(a => MaxValue.max)
    map.print() // Why is the result 0? Why not 10?
    ssc.start()
    ssc.awaitTermination()
  }
}
In this case, how can I change the value of MaxValue.max in foreachRDD()? The result of map.print is 0 - why not 10? I want to use RDD.max() in foreachRDD(), so I need to change MaxValue.max inside foreachRDD().
Could you help me? Thank you!
This is not possible. Remember, the functions you pass to RDD operations run distributed on the executors, and each JVM (the driver and every executor) has its own copy of the MaxValue object, so a change made in one process is never seen by the others. Maybe if you say what you are trying to do, that can help lead to a better solution - using accumulators, perhaps?
In general it is better to avoid trying to accumulate values this way; there are mechanisms such as accumulators or updateStateByKey that do this properly.
To give a better perspective of what is happening in your code, let's say you have one driver and multiple partitions distributed over multiple executors (the most typical scenario).
Runs on driver
inputDStream.foreachRDD(rdd => { MaxValue.max = 10 })
The block of code within foreachRDD runs on the driver, so it updates the MaxValue object on the driver.
Runs on executors
val map = inputDStream.map(a => MaxValue.max)
This will run the lambda on each executor individually, and therefore it will read the value of MaxValue on the executors (where it was never updated). Also note that each executor has its own copy of the MaxValue object, since each of them lives in a separate JVM process (and most often on separate nodes within the cluster, too).
When you change your code to
val map = inputDStream.map(a => {MaxValue.max=10; MaxValue.max})
you are actually updating MaxValue on the executors and then reading it on the executors as well - so it works.
This should work as well:
val map = inputDStream.map(a => {MaxValue.max=10; a}).map(a => MaxValue.max)
However if you do something like:
val map = inputDStream.map(a => {MaxValue.max= new Random().nextInt(10); a}).map(a => MaxValue.max)
you would get records with different integers, since each partition can end up with a different MaxValue.
Unexpected results
local mode
A good reason to avoid this is that you can get even less predictable results depending on the situation. For example, if you run your original code, which returns 0 on a cluster, it will return 10 in local mode, because in that case the driver and all partitions live in a single JVM process and share the object. You could even write unit tests against such code and feel safe, but then start getting problems when you deploy to the cluster.
Jobs scheduling order
For this one I'm not 100% sure - I tried to find it in the source code - but there is another problem that might occur. In your code you will have two jobs:
one based on the output of inputDStream.foreachRDD, the other based on the map.print output. Although they use the same stream initially, Spark will generate two separate DAGs for them and schedule two separate jobs that can be treated completely independently. In fact, Spark doesn't even have to guarantee the order of execution of the jobs (it does, obviously, guarantee the order of execution of stages within a job), and if this happens, in theory it can run the 2nd job before the 1st, making the results even less predictable.
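As a minimal sketch of what the question seems to be after (keeping a running maximum with RDD.max()), assuming the same ssc and inputDStream as in the test code above; since the body of foreachRDD runs on the driver, a plain driver-side variable is enough here:
// foreachRDD's body runs on the driver, so a driver-side variable can safely hold
// the result of an action such as max(); the executors never need to read it.
var runningMax = Long.MinValue

inputDStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // max() is an action: computed on the executors, the result comes back to the driver.
    val batchMax = rdd.map(_.length.toLong).max()   // example metric; the test data is Seq("testData")
    runningMax = math.max(runningMax, batchMax)
    println(s"running max on the driver: $runningMax")
  }
}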
Suppose I have an RDD with nPartitions partitions and I'm using the mapPartitionsWithIndex method, while also keeping on the driver an array x of dimension nPartitions.
Now suppose I would like to ship x(i) to partition i so that it may work on it. A naïve way to do so would be to just reference x(i) in the closure, as in the following toy example:
val sc = new SparkContext()
val rdd = sc.parallelize(1 to 1000).repartition(10)
val nPartitions = rdd.partitions.length
val myArray = Array.fill(nPartitions)(math.random)   // array to be shipped to the executors

val result = rdd.mapPartitionsWithIndex((index, data) =>
  Seq(data.map(_ * myArray(index)).sum).iterator
)
(Ignore the logic within mapPartitionsWithIndex; only the myArray(index) part is what interests us here.)
However, if my understanding is correct, this will ship the entire array myArray to all executors, as the array is captured in the closure. Now, if we suppose the array contains large objects which may take up too much memory or serialization time, this becomes a problem.
Is there a way to avoid this, and to ship only the components of the array corresponding to the partitions within a given executor ?
This is a case of premature optimization: sending only the value for the partition, even if it were possible, would not save you much compared with sending an array that has one entry per partition.
However, instead of capturing the array in the closure, you should send it as a broadcast variable: http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
The main difference is that the closure is serialized and sent out for each task, while, from the doc page "Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks".
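A minimal sketch of that change, reusing the names from the toy example above (sc, rdd, myArray):
// Broadcast the whole array once; each executor fetches and caches it,
// and the per-task closure no longer carries a copy.
val myArrayBc = sc.broadcast(myArray)

val result = rdd.mapPartitionsWithIndex { (index, data) =>
  Seq(data.map(_ * myArrayBc.value(index)).sum).iterator
}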
Not exactly sending large objects to partitions, but an inverted approach would be to use mapPartitions in conjunction with partitioning by columns: used in this fashion, mapPartitions pulls in the large object once per partition rather than once per row (see the sketch below).
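A minimal sketch of that per-partition idea; LargeObject and loadLargeObjectFor are hypothetical stand-ins for whatever heavy resource is needed, not names from the original post:
// Hypothetical heavy resource, built (or fetched) once per partition instead of once per row.
case class LargeObject(factor: Double)
def loadLargeObjectFor(partitionIndex: Int): LargeObject = LargeObject(partitionIndex + 1.0)

val perPartitionResult = rdd.mapPartitionsWithIndex { (index, rows) =>
  val large = loadLargeObjectFor(index)   // executed once per partition
  rows.map(r => r * large.factor)
}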
To broadcast a variable such that it occurs exactly once in memory per node of the cluster, one can do val myVarBroadcasted = sc.broadcast(myVar) and then retrieve it in RDD transformations like so:
myRdd.map(blar => {
  val myVarRetrieved = myVarBroadcasted.value
  // some code that uses it
})
.someAction
But suppose now I wish to perform some more actions with a new broadcast variable - what if I haven't got enough heap space due to the old broadcast variables? I want a function like
myVarBroadcasted.remove()
Now I can't seem to find a way of doing this.
Also, a very related question: where do broadcast variables go? Do they go into the cache fraction of the total memory, or just into the heap fraction?
If you want to remove the broadcast variable from both the executors and the driver you have to use destroy; using unpersist only removes it from the executors:
myVarBroadcasted.destroy()
This method is blocking.
You are looking for unpersist, available from Spark 1.0.0:
myVarBroadcasted.unpersist(blocking = true)
Broadcast variables are stored as ArrayBuffers of deserialized Java objects or as serialized ByteBuffers. (Storage-wise they are treated similarly to RDDs - confirmation needed.)
The unpersist method removes them from both memory and disk on each executor node, but the variable stays on the driver node, so it can be re-broadcast.
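A minimal sketch of the lifecycle described in both answers, assuming sc is an active SparkContext; the data and the job are placeholders:
val bc = sc.broadcast(Array.fill(1000)(math.random))

// Use the broadcast in a job; each executor fetches and caches its copy.
sc.parallelize(0 until 10).map(i => bc.value(i)).collect()

// Drop the executor copies only; the driver copy remains, so the variable
// can still be used (and re-shipped) by later jobs.
bc.unpersist(blocking = true)

// Remove it from both executors and driver; the variable must not be used after this.
bc.destroy()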