How to remove / dispose a broadcast variable from heap in Spark? - scala

To broadcast a variable so that it occurs exactly once in memory per node on a cluster, one can do: val myVarBroadcasted = sc.broadcast(myVar), then retrieve it in RDD transformations like so:
myRdd.map(blar => {
  val myVarRetrieved = myVarBroadcasted.value
  // some code that uses it
})
.someAction
But suppose I now wish to perform some more actions with a new broadcast variable. What if I don't have enough heap space because of the old broadcast variables? I want a function like
myVarBroadcasted.remove()
Now I can't seem to find a way of doing this.
Also, a very related question: where do the broadcast variables go? Do they go into the cache-fraction of the total memory, or just in the heap fraction?

If you want to remove the broadcast variable from both the executors and the driver, you have to use destroy. Using unpersist only removes it from the executors:
myVarBroadcasted.destroy()
This method is blocking.

You are looking for unpersist available from Spark 1.0.0
myVarBroadcasted.unpersist(blocking = true)
Broadcast variables are stored as ArrayBuffers of deserialized Java objects or serialized ByteBuffers. (Storage-wise they are treated similar to RDDs - confirmation needed)
The unpersist method removes them from both memory and disk on each executor node.
The value stays on the driver node, though, so it can be re-broadcast.
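As a rough illustration of the two answers above, here is a minimal sketch of the lifecycle of a broadcast variable. It assumes a local Spark session created only for the demo, and the names lookup and lookupBc are made up for this example.

```scala
import org.apache.spark.sql.SparkSession

// Local session purely for illustration; on a cluster the master comes from spark-submit.
val spark = SparkSession.builder().master("local[2]").appName("broadcast-lifecycle").getOrCreate()
val sc = spark.sparkContext

// Hypothetical small lookup table standing in for myVar.
val lookup = Map(1 -> "a", 2 -> "b", 3 -> "c")
val lookupBc = sc.broadcast(lookup)

// Read the broadcast value inside a transformation.
val labelled = sc.parallelize(1 to 3).map(i => lookupBc.value.getOrElse(i, "?")).collect()

// Drop the cached copies on the executors. The driver keeps its copy,
// so the value is sent again if a later task reads it.
lookupBc.unpersist(blocking = true)
val again = sc.parallelize(1 to 3).map(i => lookupBc.value(i)).collect()

// Release everything, including the driver copy.
// Reading lookupBc.value after this point throws an exception.
lookupBc.destroy()
```

In practice you would call unpersist when you may still need the variable later, and destroy only once nothing will read it again.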

Related

In Spark, how objects and variables are kept in memory and across different executors?

I am using:
Spark 3.0.0
Scala 2.12
I am working on writing a Spark Structured Streaming job with a custom Stream Source. Before the execution of the spark query, I create a bunch of metadata which is used by my Spark Streaming Job
I am trying to understand how this metadata is kept in memory across different executors?
Example Code:
case class JobConfig(fieldName: String, displayName: String, castTo: String)
val jobConfigs:List[JobConfig] = build(); //build the job configs
val query = spark
.readStream
.format("custom-streaming")
.load
query
.writeStream
.trigger(Trigger.ProcessingTime(2, TimeUnit.MINUTES))
.foreachBatch { (batchDF: DataFrame, batchId: Long) => {
CustomJobExecutor.start(jobConfigs) //CustomJobExecutor does data frame transformations and save the data in PostgreSQL.
}
}.outputMode(OutputMode.Append()).start().awaitTermination()
I need help understanding the following:
In the sample code, how will Spark keep "jobConfigs" in memory across different executors?
Is there any added advantage to broadcasting?
What is an efficient way of keeping variables which can't be deserialized?
Local variables are copied for each task, whereas broadcast variables are copied only once per executor. From the docs:
Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
This means that broadcast variables can make a difference if your jobConfigs is large and the number of tasks and stages that use it is significantly larger than the number of executors, or if deserialization is time-consuming. In other cases, they don't.
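For comparison, a minimal sketch of broadcasting jobConfigs explicitly. The config values and the field lookup are invented for illustration, and a local session is assumed. Note that the function passed to foreachBatch runs on the driver, so jobConfigs only reaches the executors if it is referenced inside a transformation.

```scala
import org.apache.spark.sql.SparkSession

case class JobConfig(fieldName: String, displayName: String, castTo: String)

// Local session purely for illustration.
val spark = SparkSession.builder().master("local[2]").appName("broadcast-configs").getOrCreate()
val sc = spark.sparkContext

// Hypothetical metadata; in the question this comes from build().
val jobConfigs = List(
  JobConfig("id", "Id", "long"),
  JobConfig("name", "Name", "string")
)

// One copy per executor instead of one copy per task closure.
val jobConfigsBc = sc.broadcast(jobConfigs)

// Executor-side code reads the broadcast copy through .value.
val displayNames = sc.parallelize(Seq("id", "name", "other"))
  .map { field =>
    jobConfigsBc.value.find(_.fieldName == field).map(_.displayName).getOrElse(field)
  }
  .collect()
```

For a list this small the broadcast buys nothing over plain closure capture; it only starts to pay off under the conditions described above.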

How to use Broadcast variable correctly in Spark GraphX?

I use GraphX for processing a graph. I have used GraphLoader to load it, and I made a variable that contains the neighbors of each node by using the code below:
val all_neighbors: VertexRDD[Array[VertexId]] = graph.collectNeighborIds(EdgeDirection.Either).cache()
Because I frequently need the neighbors of nodes, I decided to broadcast them. When I use this code I get an error:
val broadcastVar = sc.broadcast(all_neighbors)
But when I use this code there is no error:
val broadcastVar = sc.broadcast(all_neighbors.collect())
Is it right to use collect() for broadcasting?
And one more question: I want to change this broadcast variable to be key/value. Is this code right?
val nvalues = broadcastVar.value.toMap
Does the above code (I mean nvalues) broadcast to all slaves in the cluster? Should I broadcast nvalues too? I am a little confused about broadcasting. Please help me with this problem.
There are two questions:
is it right to use collect() for broadcasting??
all_neighbors is of type VertexRDD which is essentially an RDD. There is nothing in an RDD you could broadcast. RDD is a data structure that describes a distributed computation on some datasets. By the features of RDD, you can describe what and how to compute. It's an abstract entity. You can only broadcast a real value, but an RDD is just a container of values that are only available when executors process their data.
quoting from Broadcast Variables:
Broadcast variables allow the programmer to keep a read-only variable
cached on each machine rather than shipping a copy of it with tasks.
They can be used, for example, to give every node a copy of a large
input dataset in an efficient manner.
This means that explicitly creating broadcast variables is only useful
when tasks across multiple stages need the same data or when caching
the data in deserialized form is important.
That's the reason we need to call collect on the dataset that the RDD holds: it converts the RDD into a locally available collection, which can then be broadcast.
Note: When you perform the collect operation, the data is accumulated on the driver node and then broadcast. So if the driver node does not have enough memory, it will throw errors.
does the above code(i means nvalues) broadcast to all slaves in
cluster?? should I broadcast nvalues too??
It depends on your use case. nvalues is an ordinary Map built on the driver from broadcastVar.value; calling toMap does not broadcast it. If your tasks only need broadcastVar, broadcast only that; if they need nvalues, broadcast nvalues instead. You can broadcast both, but you need to be careful about memory constraints.
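One way this could look, as a sketch on a toy three-vertex graph (the edges and the names neighborMap, neighborsBc and degrees are made up for illustration): collect the neighbor ids, convert them to a Map on the driver, and broadcast only the Map.

```scala
import org.apache.spark.graphx.{EdgeDirection, Graph, VertexId}
import org.apache.spark.sql.SparkSession

// Local session purely for illustration.
val spark = SparkSession.builder().master("local[2]").appName("broadcast-neighbors").getOrCreate()
val sc = spark.sparkContext

// Toy graph 1 - 2 - 3, standing in for the graph loaded with GraphLoader.
val graph = Graph.fromEdgeTuples(sc.parallelize(Seq((1L, 2L), (2L, 3L))), 0)

// collect() brings the neighbor lists to the driver; toMap makes them key/value.
val neighborMap: Map[VertexId, Array[VertexId]] =
  graph.collectNeighborIds(EdgeDirection.Either).collect().toMap

// Broadcast the Map itself, since that is what the tasks will look things up in.
val neighborsBc = sc.broadcast(neighborMap)

// Example use on the executors: count the neighbors of each vertex.
val degrees = graph.vertices
  .map { case (id, _) => id -> neighborsBc.value.getOrElse(id, Array.empty[VertexId]).length }
  .collect()
  .toMap
```

This only makes sense when the collected neighbor lists fit comfortably in driver memory and in the memory of every executor.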
Let me know if it helps!!

Spark configurations for Out of memory error [duplicate]

Cluster setup -
Driver has 28gb
Workers have 56gb each (8 workers)
Configuration -
spark.memory.offHeap.enabled true
spark.driver.memory 20g
spark.memory.offHeap.size 16gb
spark.executor.memory 40g
My job -
//myFunc just takes a string s and does some transformations on it, they are very small strings, but there's about 10million to process.
//Out of memory failure
data.map(s => myFunc(s)).saveAsTextFile(outFile)
//works fine
data.map(s => myFunc(s))
Also, I de-clustered / removed Spark from my program and it completed just fine (successfully saved to a file) on a single server with 56gb of RAM. This shows that it is just a Spark configuration issue. I reviewed https://spark.apache.org/docs/latest/configuration.html#memory-management and the configurations I currently have seem to be all that should need changing for my job to work. What else should I be changing?
Update -
Data -
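import java.io.{BufferedInputStream, BufferedReader, File, FileInputStream, InputStreamReader}
import org.apache.commons.compress.compressors.{CompressorInputStream, CompressorStreamFactory}
val fis: FileInputStream = new FileInputStream(new File(inputFile))
val bis: BufferedInputStream = new BufferedInputStream(fis)
val input: CompressorInputStream = new CompressorStreamFactory().createCompressorInputStream(bis)
val br = new BufferedReader(new InputStreamReader(input))
// Reads every line of the decompressed file into driver memory
val stringArray = br.lines().toArray()
val data = sc.parallelize(stringArray)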
Note - this does not cause any memory issues, even though it is incredibly inefficient. I can't read from it using spark because it throws some EOF errors.
I can't really post the code for myFunc because it's complex. Basically, the input string is a delimited string, and the function does some delimiter replacement, date/time normalizing and things like that. The output string will be roughly the same size as the input string.
Also, it works fine for smaller data sizes, and the output is correct and roughly the same size as input data file, as it should be.
Your current solution does not take advantage of Spark. You are loading the entire file into an array in memory, then using sc.parallelize to distribute it into an RDD. This is hugely wasteful of memory (even without Spark) and will of course cause out of memory problems for large files.
Instead, use sc.textFile(filePath) to create your RDD. Spark is then able to read and process the file in chunks, so only a small portion of it needs to be in memory at a time. You can also take advantage of parallelism this way, as Spark will be able to read and process the file in parallel, with however many executors and cores you have, instead of needing to read the entire file on a single thread on a single machine.
Assuming that myFunc can look at only a single line at a time, then this program should have a very small memory footprint.
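A minimal sketch of that approach, assuming a local session. The myFunc below is a made-up stand-in for the real one, and the tiny temporary input file exists only so the example can run. Spark's textFile can also read common compressed formats such as gzip directly, although whether that avoids the EOF errors mentioned in the question is not something this sketch can confirm.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session purely for illustration.
val spark = SparkSession.builder().master("local[2]").appName("textfile-demo").getOrCreate()
val sc = spark.sparkContext

// Hypothetical stand-in for myFunc: a simple delimiter replacement.
val myFunc: String => String = s => s.replace('|', ',')

// Tiny input file created only for this demo.
val inDir = Files.createTempDirectory("textfile-demo")
val inFile = inDir.resolve("input.txt")
Files.write(inFile, "a|b\nc|d\n".getBytes("UTF-8"))
val outDir = inDir.resolve("out").toString

// Lines are read, transformed and written partition by partition;
// the whole file is never materialized on the driver.
sc.textFile(inFile.toString).map(myFunc).saveAsTextFile(outDir)

// Read the result back just to inspect it.
val written = sc.textFile(outDir).collect().sorted
```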
It would help if you gave more details of what is going on in your program before and after the map.
The second command (only the map) does not do anything unless an action is triggered. Your file is probably not partitioned, and the driver is doing the work. The code below should force the data out to the workers evenly and protect against OOM on a single node. It will cause shuffling of data, though.
Updated solution after looking at your code; it will be better if you do this:
val data = sc.parallelize(stringArray).repartition(8)
data.map(s => myFunc(s)).saveAsTextFile(outFile)

Send object to specific partition with Spark

Suppose I have a RDD with nPartitions partitions, and I'm using the mapPartitionsWithIndex method, while also keeping on the driver an array x of dimension nPartitions.
Now suppose I would like to ship x(i) to partition i so that it may work on it, a naïve way to do so would be to just call x(i) in the closure, as in the following toy example :
val sc = new SparkContext()
val rdd = sc.parallelize(1 to 1000).repartition(10)
val nPartitions = rdd.partitions.length
val myArray = Array.fill(nPartitions)(math.random) //array to be shipped to executors
val result = rdd.mapPartitionsWithIndex((index,data) =>
Seq(data.map(_ * myArray(index)).sum).iterator
)
(Ignore the logic within mapPartitionsWithIndex; only the myArray(index) is what interests us.)
However if my understanding is correct, this will ship the entire array myArray to all executors, as the array is in the closure. Now if we suppose the array contains large objects which may take up too much memory / serialization time, this becomes a problem.
Is there a way to avoid this, and to ship only the components of the array corresponding to the partitions within a given executor ?
This is a case of premature optimization. Sending only the value for each partition, if that is possible at all, is not going to save you much compared with sending an array that is only as big as the number of partitions.
However, instead of sending the array in a closure, you should send it as a broadcast variable: http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
The main difference is that the closure is serialized and sent out for each task, while, from the doc page "Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks".
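A minimal sketch of the toy example rewritten with a broadcast variable, assuming a local session; the array contents are arbitrary. Note that this still ships the whole array, but once per executor rather than once per task, and each partition then reads only its own element.

```scala
import org.apache.spark.sql.SparkSession

// Local session purely for illustration.
val spark = SparkSession.builder().master("local[2]").appName("broadcast-per-partition").getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(1 to 1000).repartition(10)
val nPartitions = rdd.getNumPartitions

// Array kept on the driver, one entry per partition (arbitrary values).
val myArray = Array.tabulate(nPartitions)(i => (i + 1).toDouble)
val myArrayBc = sc.broadcast(myArray)

val result = rdd.mapPartitionsWithIndex { (index, data) =>
  // Only the element for this partition is actually read.
  val factor = myArrayBc.value(index)
  Iterator(data.map(_ * factor).sum)
}.collect()
```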
This is not exactly sending large objects to partitions, but an inverted approach would be to use mapPartitions in conjunction with partitioning by columns. Namely, using mapPartitions in this fashion would pull in the large object once per partition rather than once per row.

Spark broadcast error: exceeds spark.akka.frameSize Consider using broadcast

I have a large dataset called "edges"
org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[(String, Int)]] = MappedRDD[27] at map at <console>:52
When I was working in standalone mode, I was able to collect, count and save this file. Now, on a cluster, I'm getting this error
edges.count
...
Serialized task 28:0 was 12519797 bytes which exceeds spark.akka.frameSize
(10485760 bytes). Consider using broadcast variables for large values.
Same with .saveAsTextFile("edges")
This is from the spark-shell. I have tried using the option
--driver-java-options "-Dspark.akka.frameSize=15"
But when I do that, it just hangs indefinitely. Any help would be appreciated.
** EDIT **
My standalone mode was on Spark 1.1.0 and my cluster is Spark 1.0.1.
Also, the hanging occurs when I go to count, collect or saveAs* the RDD, but defining it or doing filters on it work just fine.
The "Consider using broadcast variables for large values" error message usually indicates that you've captured some large variables in function closures. For example, you might have written something like
val someBigObject = ...
rdd.mapPartitions { x => doSomething(someBigObject, x) }.count()
which causes someBigObject to be captured and serialized with your task. If you're doing something like that, you can use a broadcast variable instead, which will cause only a reference to the object to be stored in the task itself, while the actual object data will be sent separately.
In Spark 1.1.0+, it isn't strictly necessary to use broadcast variables for this, since tasks will automatically be broadcast (see SPARK-2521 for more details). There are still reasons to use broadcast variables (such as sharing a big object across multiple actions / jobs), but you won't need to use it to avoid frame size errors.
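As a sketch of the "sharing across multiple actions" case, assuming a local session, with a small made-up set standing in for someBigObject:

```scala
import org.apache.spark.sql.SparkSession

// Local session purely for illustration.
val spark = SparkSession.builder().master("local[2]").appName("broadcast-reuse").getOrCreate()
val sc = spark.sparkContext

// Hypothetical stand-in for someBigObject.
val stopWords = Set("a", "the", "of")
val stopWordsBc = sc.broadcast(stopWords)

val words = sc.parallelize(Seq("the", "art", "of", "war"))

// First job: the task closure carries only a reference to the broadcast.
val kept = words.filter(w => !stopWordsBc.value.contains(w)).collect()

// Second job: reuses the copy already cached on the executors.
val droppedCount = words.filter(w => stopWordsBc.value.contains(w)).count()
```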
Another option is to increase the Akka frame size. In any Spark version, you should be able to set the spark.akka.frameSize setting in SparkConf prior to creating your SparkContext. As you may have noticed, though, this is a little harder in spark-shell, where the context is created for you. In newer versions of Spark (1.1.0 and higher), you can pass --conf spark.akka.frameSize=16 when launching spark-shell. In Spark 1.0.1 or 1.0.2, you should be able to pass --driver-java-options "-Dspark.akka.frameSize=16" instead.
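For a standalone application, setting it in SparkConf might look like the sketch below; the app name is arbitrary. The spark.akka.frameSize key only applies to the Akka-based Spark 1.x releases discussed here. As far as I know, Spark 2.0 and later use spark.rpc.message.maxSize instead.

```scala
import org.apache.spark.SparkConf

// Value is in MB. Only meaningful on Spark 1.x, where RPC is Akka-based.
val conf = new SparkConf()
  .setAppName("frame-size-demo")
  .set("spark.akka.frameSize", "16")

// The context would then be created from this conf, before any job runs:
// val sc = new org.apache.spark.SparkContext(conf)
```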