I have a program as follows. First, I read the data
val data = sc
.textFile("...")
.map(_.split(" ").map(_.toInt))
.map(x => (x(0), x(1)))
.groupBy(new HashPartitioner(sc.defaultParallelism))
.persist()
data.count()
Then, I run some algorithms on data. The code is not allowed to be made public.
val beg = System.currentTimeMillis
// run some algorithms on "data"
...
...
val end = System.currentTimeMillis
println((end - beg) / 1000.0)
Here is my question. I observed that if I ran the same Spark program under the same input and same configuration multiple times, the observed running times were quite different. For example, some took only 100 seconds, while some took about 600 seconds. In my opinion, data is created by groupBy the same keys. Therefore, each time I feed the same input, data should always be the same. Moreover, my program has no other randomness sources. Therefore, I should have observed stable running time. I tried to debug this and found that sometimes there were short GCs and sometimes there were very long GCs. I wonder why this happens. Wouldn't the garbage collection behavior be the same across different runs?
Related
I do stream processing from Event Hub using Spark and faced the following problem. For each incoming message I need to do some calculations (stateless). The calculation algorithm is written using Scala and is extremely efficient but needs some data structures constructed in advance. The object size is about 50MB, but in future could be larger. In order not to send the object to the workers each time, I do broadcasting. Then register an UDF. But it doesn't help, batch duration is growing significantly beyond the latency we could dwell. I figured out that batch duration depends solely on the object size. For testing purpose I tried to make the object smaller keeping computation complexity the same, and the batch duration decreased. Also, when the object is large, Spark UI marks GC red (more than 10% of work is due to garbage collection). It contradicts my understanding of broadcasting that when an object is broadcasted, that object should be downloaded into the workers' memory and persisted there without additional overhead.
I managed to write business domain agnostic example. Here when n is small, batch duration is about 0.3 second, but when n = 6000 (144MB), the batch duration becomes 1.5 (x5), and 4 seconds when n=10000. But computation complexity doesn't depend on the size of the object. So, it means, that using broadcast object has huge overhead. Please, help me to find the solution.
// emulate large precalculated object
val n = 10000
val obj = (1 to n).map(i => (1 to n).toArray).toArray
// broadcast it to the workers (should reduce overhead during execution)
val objBd = sc.broadcast(obj)
// register UDF
val myUdf = spark.udf.register("myUdf", (num: Int) => {
// emulate very efficient algorithm that requires large data structure
var i = (num+1)/(num+1)
objBd.value(i)(i)
})
// do stream processing
spark.readStream
.format("rate")
.option("rowsPerSecond", 300)
.load()
.withColumn("result", myUdf($"value"))
.writeStream
.format("memory")
.queryName("locations")
.start()
Cluster setup -
Driver has 28gb
Workers have 56gb each (8 workers)
Configuration -
spark.memory.offHeap.enabled true
spark.driver.memory 20g
spark.memory.offHeap.size 16gb
spark.executor.memory 40g
My job -
//myFunc just takes a string s and does some transformations on it, they are very small strings, but there's about 10million to process.
//Out of memory failure
data.map(s => myFunc(s)).saveAsTextFile(outFile)
//works fine
data.map(s => myFunc(s))
Also, I de-clustered / removed spark from my program and it completed just fine(successfully saved to a file) on a single server with 56gb of ram. This shows that it just a spark configuration issue. I reviewed https://spark.apache.org/docs/latest/configuration.html#memory-management and the configurations I currently have seem to be all that should be needed to be changed for my job to work. What else should I be changing?
Update -
Data -
val fis: FileInputStream = new FileInputStream(new File(inputFile))
val bis: BufferedInputStream = new BufferedInputStream(fis);
val input: CompressorInputStream = new CompressorStreamFactory().createCompressorInputStream(bis);
br = new BufferedReader(new InputStreamReader(input))
val stringArray = br.lines().toArray()
val data = sc.parallelize(stringArray)
Note - this does not cause any memory issues, even though it is incredibly inefficient. I can't read from it using spark because it throws some EOF errors.
myFunc, I can't really post the code for it because it's complex. But basically, the input string is a deliminated string, it does some deliminator replacement, date/time normalizing and things like that. The output string will be roughly the same size as an input string.
Also, it works fine for smaller data sizes, and the output is correct and roughly the same size as input data file, as it should be.
You current solution does not take advantage of spark. You are loading the entire file into an array in memory, then using sc.parallelize to distribute it into an RDD. This is hugely wasteful of memory (even without spark) and will of course cause out of memory problems for large files.
Instead, use sc.textFile(filePath) to create your RDD. Then spark is able to smartly read and process the file in chunks, so only a small portion of it needs to be in memory at a time. You are also able to take advantage of parallelism this way, as spark will be able to read and process the file in parallel, with however many executors and corse your have, instead of needing the read the entire file on a single thread on a single machine.
Assuming that myFunc can look at only a single line at a time, then this program should have a very small memory footprint.
Would help if you put more details of what going on in your program before and after the MAP.
Second command (only Map) does not do anything unless an action is triggered. Your file is probably not partitioned and driver is doing the work. Below should force data to workers evenly and protect OOM on a single node. It will cause shuffling of data though.
Updating solution after looking at your code, will be better if you do this
val data = sc.parallelize(stringArray).repartition(8)
data.map(s => myFunc(s)).saveAsTextFile(outFile)
I have gone through some videos in Youtube regarding Spark architecture.
Even though Lazy evaluation, Resilience of data creation in case of failures, good functional programming concepts are reasons for success of Resilenace Distributed Datasets, one worrying factor is memory overhead due to multiple transformations resulting into memory overheads due data immutability.
If I understand the concept correctly, Every transformations is creating new data sets and hence the memory requirements will gone by those many times. If I use 10 transformations in my code, 10 sets of data sets will be created and my memory consumption will increase by 10 folds.
e.g.
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Above example has three transformations : flatMap, map and reduceByKey. Does it implies I need 3X memory of data for X size of data?
Is my understanding correct? Is caching RDD is only solution to address this issue?
Once I start caching, it may spill over to disk due to large size and performance would be impacted due to disk IO operations. In that case, performance of Hadoop and Spark are comparable?
EDIT:
From the answer and comments, I have understood lazy initialization and pipeline process. My assumption of 3 X memory where X is initial RDD size is not accurate.
But is it possible to cache 1 X RDD in memory and update it over the pipleline? How does cache () works?
First off, the lazy execution means that functional composition can occur:
scala> val rdd = sc.makeRDD(List("This is a test", "This is another test",
"And yet another test"), 1)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at makeRDD at <console>:27
scala> val counts = rdd.flatMap(line => {println(line);line.split(" ")}).
| map(word => {println(word);(word,1)}).
| reduceByKey((x,y) => {println(s"$x+$y");x+y}).
| collect
This is a test
This
is
a
test
This is another test
This
1+1
is
1+1
another
test
1+1
And yet another test
And
yet
another
1+1
test
2+1
counts: Array[(String, Int)] = Array((And,1), (is,2), (another,2), (a,1), (This,2), (yet,1), (test,3))
First note that I force the parallelism down to 1 so that we can see how this looks on a single worker. Then I add a println to each of the transformations so that we can see how the workflow moves. You see that it processes the line, then it processes the output of that line, followed by the reduction. So, there are not separate states stored for each transformation as you suggested. Instead, each piece of data is looped through the entire transformation up until a shuffle is needed, as can be seen by the DAG visualization from the UI:
That is the win from the laziness. As to Spark v Hadoop, there is already a lot out there (just google it), but the gist is that Spark tends to utilize network bandwidth out of the box, giving it a boost right there. Then, there a number of performance improvements gained by laziness, especially if a schema is known and you can utilize the DataFrames API.
So, overall, Spark beats MR hands down in just about every regard.
The memory requirements of Spark not 10 times if you have 10 transformations in your Spark job. When you specify the steps of transformations in a job Spark builds a DAG which will allow it to execute all the steps in the jobs. After that it breaks the job down into stages. A stage is a sequence of transformations which Spark can execute on dataset without shuffling.
When an action is triggered on the RDD, Spark evaluates the DAG. It just applies all the transformations in a stage together until it hits the end of the stage, so it is unlikely for the memory pressure to be 10 time unless each transformation leads to a shuffle (in which case it is probably a badly written job).
I would recommend watching this talk and going through the slides.
the code for testing:
object MaxValue extends Serializable{
var max = 0
}
object Test {
def main(args: Array[String]): Unit = {
val sc = new SparkContext
val ssc = new StreamingContext(sc, Seconds(5))
val seq = Seq("testData")
val rdd = ssc.sparkContext.parallelize(seq)
val inputDStream = new ConstantInputDStream(ssc, rdd)
inputDStream.foreachRDD(rdd => { MaxValue.max = 10 }) //I change MaxValue.max value to 10.
val map = inputDStream.map(a => MaxValue.max)
map.print //Why the result is 0? Why not 10?
ssc.start
ssc.awaitTermination
}
}
In this case, how to change the value of MaxValue.max in foreachRDD()? The result of map.print is 0, why not 10. I want to use RDD.max() in foreachRDD(), so I need change MaxValue.max value in foreachRDD().
Could you help me? Thank you!
This is not possible. Remember, operations inside of an RDD method are run distributed. So, the change to MaxValue.max will only be executed on the worker, not the driver. Maybe if you say what you are trying to do that can help lead to a better solution, using accumulators maybe?
In general it is better to avoid trying to accumulate values this way, there are different ways like accumulators or updateStateByKey that would do this properly.
To give a better perspective of what is happening in your code, let's say you have 1 driver and multiple partitions distributed on multiple executors (most typical scenario)
Runs on driver
inputDStream.foreachRDD(rdd => { MaxValue.max = 10 })
The block of code within foreachRDD runs on driver, so it updates object MaxValue on the driver
Runs on executors
val map = inputDStream.map(a => MaxValue.max)
Will run lambda on each executor individually, therefore will get value from MaxValue on executors (that were never updated before). Also please note that each executor will have their own version of MaxValue object as each of them live in separate JVM process (most often on separate nodes within cluster too).
When you change your code to
val map = inputDStream.map(a => {MaxValue.max=10; MaxValue.max})
you actually updating MaxValue on executors and then getting it on executors as well - so it works.
This should work as well:
val map = inputDStream.map(a => {MaxValue.max=10; a}).map(a => MaxValue.max)
However if you do something like:
val map = inputDStream.map(a => {MaxValue.max= new Random().nextInt(10); a}).map(a => MaxValue.max)
you should get set of records with 4 different integers (each partition will have different MaxValue)
Unexpected results
local mode
The good reason to avoid is that you can get even less predictable results depending on the situation. For example if your run your original code that returns 0 on cluster it will return 10 in local mode as in this case driver and all partitions will live in a single JVM process and will share this object. So you can even create unit tests on such code, feel safe but when deploy to cluster - start getting problems.
Jobs scheduling order
For this one I'm not 100% sure - trying to find in the source code, but there is a possibility of another problem that might occur. In your code you will have 2 jobs:
One is based on your output from
inputDStream.foreachRDD another is based on map.print output. Despite they use same stream initially, Spark will generate two separate DAGs for them and will schedule two separate Jobs that can be treated by spark totally independently, in fact - it doesn't even have to guarantee the order of execution of jobs (it does guarantee order of execution of stages obviously within a job) and if this happens in theory it can run 2nd job before 1st to make results even less predictable
I have scenarios where I will need to process thousands of records at a time. Sometime, it might be in hundreds, may be upto 30000 records. I was thinking of using the scala's parallel collection. So just to understand the difference, I wrote a simple pgm like below:
object Test extends App{
val list = (1 to 100000).toList
Util.seqMap(list)
Util.parMap(list)
}
object Util{
def seqMap(list:List[Int]) = {
val start = System.currentTimeMillis
list.map(x => x + 1).toList.sum
val end = System.currentTimeMillis
println("time taken =" + (end - start))
end - start
}
def parMap(list:List[Int]) = {
val start = System.currentTimeMillis
list.par.map(x => x + 1).toList.sum
val end = System.currentTimeMillis
println("time taken=" + (end - start))
end - start
}
}
I expected that running in parallel will be faster. However, the output I was getting was
time taken =32
time taken=127
machine config :
Intel i7 processor with 8 cores
16GB RAM
64bit Windows 8
What am I doing wrong? Is this not a correct scenario for parallel mapping?
The issue is that the operation you are performing is so fast (just adding two ints) that the overhead of doing the parallelization is more than the benefit. The parallelization only really makes sense if the operations are slower.
Think of it this way: if you had 8 friends and you gave each one an integer on a piece of paper and told them to add one, write the result down, and give it back to you, which you would record before giving them the next integer, you'd spend so much time passing messages back and forth that you could have just done all the adding yourself faster.
ALSO: Never do .par on a List because the parallelization procedure has to copy the entire list into a parallel collection and then copy the whole thing back out. If you use a Vector, then it doesn't have to do this extra work.
The overhead in parallelizing the list proves more time-consuming than the actual processing of the x + 1 operations sequentially.
Yet consider this modification where we include an operation that elapses over 1 millisecond approximately,
case class Delay() {
Thread.sleep(1)
}
and replace
list.map(x => x + 1).toList.sum
with
list.map(_ => Delay()).toList
Now for val list = (1 to 10000).toList (note 10000 instead of 100000), in a quadcore 8GB machine,
scala> Util.parMap(list)
time taken=3451
res4: Long = 3451
scala> Util.seqMap(list)
time taken =10816
res5: Long = 10816
We can infer (better, guess) that for large collections with time-consuming operations, the overhead of parallelizing a collection does not significantly affect the elapsed time, in contrast with a sequential collection processing.
If you are doing benchmarks, consider using something like JMH to avoid all the possible problems you might encounter, if you are measuring it in the way your program shows. For example, JIT may change your results dramatically, but only after some iterations.
In my experience parallel collections are normally slower, if the input is not large enough: If the input is small the initial split and the "putting together" at the end does not pay off.
So benchmark again, using lists of different sizes (try 30 000, 100 000, and 1 000 000).
Moreover if you do numerical processing, consider using Array (instead of List) and while (instead of map). These are "more native" (= faster) to the underlying JVM whereas in your case you are possibly measuring the performance of the garbage collector. As for Array you could store the result of the operation "in place".
Parallel collections initialize threads before performing operation that tooks some time.
So when you perform operations by parallel collections with low number of elements or operations takes low time parallel collections will performs slower