I have scenarios where I will need to process thousands of records at a time. Sometimes it might be hundreds, and it could be up to 30000 records. I was thinking of using Scala's parallel collections. So, just to understand the difference, I wrote a simple program like the one below:
object Test extends App {
  val list = (1 to 100000).toList
  Util.seqMap(list)
  Util.parMap(list)
}

object Util {
  def seqMap(list: List[Int]) = {
    val start = System.currentTimeMillis
    list.map(x => x + 1).toList.sum
    val end = System.currentTimeMillis
    println("time taken =" + (end - start))
    end - start
  }

  def parMap(list: List[Int]) = {
    val start = System.currentTimeMillis
    list.par.map(x => x + 1).toList.sum
    val end = System.currentTimeMillis
    println("time taken=" + (end - start))
    end - start
  }
}
I expected that running in parallel will be faster. However, the output I was getting was
time taken =32
time taken=127
machine config :
Intel i7 processor with 8 cores
16GB RAM
64bit Windows 8
What am I doing wrong? Is this not a correct scenario for parallel mapping?
The issue is that the operation you are performing is so fast (just adding two ints) that the overhead of doing the parallelization is more than the benefit. The parallelization only really makes sense if the operations are slower.
Think of it this way: if you had 8 friends and you gave each one an integer on a piece of paper and told them to add one, write the result down, and give it back to you, which you would record before giving them the next integer, you'd spend so much time passing messages back and forth that you could have just done all the adding yourself faster.
ALSO: Never do .par on a List because the parallelization procedure has to copy the entire list into a parallel collection and then copy the whole thing back out. If you use a Vector, then it doesn't have to do this extra work.
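For example, here is a minimal sketch of that suggestion (the parMapVector name is just illustrative; on Scala 2.13+ the .par call needs the scala-parallel-collections module): convert to a Vector once, then call .par on it, so no List-to-parallel-collection copy is needed.

def parMapVector(v: Vector[Int]) = {
  val start = System.currentTimeMillis
  v.par.map(x => x + 1).sum   // a Vector splits cheaply, so .par does not copy element by element
  val end = System.currentTimeMillis
  println("time taken=" + (end - start))
  end - start
}

val vec = (1 to 100000).toVector
parMapVector(vec)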
The overhead of parallelizing the list costs more than simply performing the x + 1 operations sequentially.
Now consider a modification where each operation takes roughly 1 millisecond:
case class Delay() {
  Thread.sleep(1)
}
and replace
list.map(x => x + 1).toList.sum
with
list.map(_ => Delay()).toList
Now, for val list = (1 to 10000).toList (note 10000 instead of 100000), on a quad-core machine with 8GB of RAM:
scala> Util.parMap(list)
time taken=3451
res4: Long = 3451
scala> Util.seqMap(list)
time taken =10816
res5: Long = 10816
We can infer (better: guess) that for large collections with time-consuming operations, the overhead of parallelizing the collection no longer dominates the elapsed time, and the parallel version comes out well ahead of sequential processing.
If you are doing benchmarks, consider using something like JMH to avoid the many pitfalls of measuring the way your program does. For example, the JIT may change your results dramatically, but only after some iterations.
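As a rough illustration (assuming the sbt-jmh plugin is set up; the class and method names are just illustrative), a JMH benchmark for the same comparison might look like this:

import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
class MapBenchmark {
  // Use a Vector so .par does not have to copy a List first
  val data: Vector[Int] = (1 to 100000).toVector

  @Benchmark
  def seqMap: Int = data.map(_ + 1).sum

  @Benchmark
  def parMap: Int = data.par.map(_ + 1).sum
}

JMH takes care of warm-up iterations, so JIT compilation no longer skews the comparison (run it with something like jmh:run -i 10 -wi 10 -f 1).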
In my experience parallel collections are normally slower if the input is not large enough: when the input is small, the initial split and the "putting together" at the end do not pay off.
So benchmark again, using lists of different sizes (try 30 000, 100 000, and 1 000 000).
Moreover, if you do numerical processing, consider using an Array (instead of a List) and a while loop (instead of map). These are "more native" (= faster) on the underlying JVM, whereas in your current setup you may mostly be measuring the garbage collector. With an Array you could also store the result of the operation "in place", as sketched below.
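A minimal sketch of that idea (the incrementInPlace name is just illustrative): iterate with a while loop over an Array and write the result back in place, so no intermediate collection is allocated.

def incrementInPlace(a: Array[Int]): Unit = {
  var i = 0
  while (i < a.length) {
    a(i) = a(i) + 1   // store the result "in place", no new collection
    i += 1
  }
}

val data = Array.range(1, 100001)
incrementInPlace(data)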
Parallel collections initialize threads before performing the operation, and that initialization takes some time.
So if you run operations through parallel collections on a small number of elements, or if each operation is cheap, parallel collections will perform slower.
Related
I do stream processing from Event Hub using Spark and have run into the following problem. For each incoming message I need to do some calculations (stateless). The calculation algorithm is written in Scala and is extremely efficient, but it needs some data structures constructed in advance. The object size is about 50MB, but in future it could be larger. In order not to send the object to the workers each time, I broadcast it and then register a UDF. But it doesn't help: batch duration grows significantly beyond the latency we can tolerate. I figured out that batch duration depends solely on the object size. For testing purposes I made the object smaller while keeping the computational complexity the same, and the batch duration decreased. Also, when the object is large, the Spark UI marks GC red (more than 10% of the work is spent in garbage collection). This contradicts my understanding of broadcasting: when an object is broadcast, it should be downloaded into the workers' memory and persisted there without additional overhead.
I managed to write a business-domain-agnostic example. When n is small, the batch duration is about 0.3 seconds, but when n = 6000 (144MB) the batch duration becomes 1.5 seconds (5x), and 4 seconds when n = 10000. Yet the computational complexity does not depend on the size of the object. So it means that using the broadcast object has huge overhead. Please help me find a solution.
// emulate large precalculated object
val n = 10000
val obj = (1 to n).map(i => (1 to n).toArray).toArray

// broadcast it to the workers (should reduce overhead during execution)
val objBd = sc.broadcast(obj)

// register UDF
val myUdf = spark.udf.register("myUdf", (num: Int) => {
  // emulate very efficient algorithm that requires large data structure
  var i = (num + 1) / (num + 1)
  objBd.value(i)(i)
})

// do stream processing
spark.readStream
  .format("rate")
  .option("rowsPerSecond", 300)
  .load()
  .withColumn("result", myUdf($"value"))
  .writeStream
  .format("memory")
  .queryName("locations")
  .start()
I have a program as follows. First, I read the data
import org.apache.spark.HashPartitioner

val data = sc
  .textFile("...")
  .map(_.split(" ").map(_.toInt))
  .map(x => (x(0), x(1)))
  .groupByKey(new HashPartitioner(sc.defaultParallelism))
  .persist()

data.count()
Then, I run some algorithms on data. The code is not allowed to be made public.
val beg = System.currentTimeMillis
// run some algorithms on "data"
...
...
val end = System.currentTimeMillis
println((end - beg) / 1000.0)
Here is my question. I observed that if I ran the same Spark program with the same input and the same configuration multiple times, the observed running times were quite different. For example, some runs took only 100 seconds, while others took about 600 seconds. In my opinion, data is created by grouping by the same keys, so each time I feed the same input, data should always be the same. Moreover, my program has no other sources of randomness. Therefore, I should have observed stable running times. I tried to debug this and found that sometimes there were short GCs and sometimes there were very long GCs. I wonder why this happens. Shouldn't the garbage collection behavior be the same across different runs?
I am analysing the performance of my Spark application on small datasets. I have a lineage graph that looks something like the following:
someList.toDS()
  .repartition(x)
  .mapPartitions(func1)
  .mapPartitions(func2)
  .mapPartitions(func3)
  .filter(cond1)
  .count()
I have a cluster of 2 nodes with 8 cores each. Executors are configured to use 4 cores. So when the application runs, four executors come up, each using 4 cores.
I am observing that at least (and usually only) one task on each thread (i.e. 16 tasks in total) takes a lot longer than the other tasks. For example, in one run these tasks take approximately 15-20 seconds, compared to other tasks that finish in a second or less.
On profiling the code, I found the bottleneck to be in func3 above:
def func3 = (partition: Iterator[DomainObject]) => {
  val l = partition.toList // This takes almost all of the time
  val t = doSomething(l)
}
The conversion from an Iterator to a List takes up almost all of the time.
The partition size is very small (fewer than 50 objects in some cases). Even so, partition sizes are almost consistent across partitions, yet only one task per thread takes this long.
I would have assumed that by the time func3 runs on the executor for a task, the data within that partition would already be present on the executor. Is this not the case? (Does it iterate over the entire dataset to filter out data for this partition somehow, during the execution of func3?!)
Otherwise, why should converting an Iterator over fewer than fifty objects to a List take that much time?
Another thing I note (not sure if it is relevant) is that the GC time (as per the Spark UI) for these tasks is also unusually consistent at 2s for all sixteen of them, compared with the other tasks (even then, 2s << 20s).
Update:
Following is how the event timeline looks for the four executors:
The first materialization happens during repartition().
The second happens after the filter operation, when all three mapPartitions calls start executing (once the count action is called). How the DAG is built, and therefore where the time goes and what you can optimize, depends on what doSomething() does in each of those functions.
It appears that the data in the partition is available as soon as the task starts executing (or at least there is no significant cost in iterating through that data, contrary to what the question might suggest).
The bottleneck in the above code is actually in func2 (which I had not investigated properly!), and it is caused by the lazy nature of iterators in Scala. The problem is not related to Spark at all.
Firstly, the functions in the mapPartitions calls above appear to get chained and called like so: func3( func2( func1(Iterator[A]) ) ) : Iterator[B]. So the Iterator produced as output of func2 is fed to func3 directly.
Secondly, in the problematic code, func1 (and func2) are defined roughly as:
def func1(x: Iterator[A]): Iterator[B] = x.map(...).filter(...)
Since these take an iterator and map it to another iterator, they are not executed right away. But when func3 is executed, partition.toList causes the map closure in func2 to actually run. On profiling, it therefore looks as if func3 takes all the time, when in fact it is the code in func2 that is slowing the application down.
(Specific to the above problem: func2 serialises case objects to a JSON string. It appears to execute some time-consuming implicit initialisation code, but only for the first object on each thread. Since this happens once per thread, each thread has exactly one task that takes very long, which explains the event timeline above.)
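A small plain-Scala sketch (no Spark, with hypothetical stand-ins for func1/func2/func3) reproduces the effect: the sleeps placed in func2's closure are only paid when func3 calls toList, so a profiler attributes the time to func3.

def func1(it: Iterator[Int]): Iterator[Int] = it.map(_ + 1)

def func2(it: Iterator[Int]): Iterator[String] = it.map { x =>
  Thread.sleep(10)          // stand-in for the slow serialisation work
  x.toString
}

def func3(it: Iterator[String]): List[String] = {
  val start = System.currentTimeMillis
  val l = it.toList         // func2's sleeps are actually executed here
  println("toList took " + (System.currentTimeMillis - start) + " ms")
  l
}

func3(func2(func1(Iterator.range(0, 100))))   // prints roughly 1000 ms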
I have gone through some videos on YouTube about Spark architecture.
Even though lazy evaluation, resilience (recreating data in case of failures), and good functional programming concepts are reasons for the success of Resilient Distributed Datasets, one worrying factor is the memory overhead caused by multiple transformations, due to data immutability.
If I understand the concept correctly, every transformation creates a new data set, so the memory requirement multiplies accordingly. If I use 10 transformations in my code, 10 data sets will be created and my memory consumption will increase 10-fold.
e.g.
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
The above example has three transformations: flatMap, map, and reduceByKey. Does that imply I need 3X memory for data of size X?
Is my understanding correct? Is caching the RDD the only solution to this issue?
Once I start caching, it may spill to disk due to its large size, and performance would then suffer from disk I/O. In that case, would the performance of Hadoop and Spark be comparable?
EDIT:
From the answer and comments, I have understood lazy evaluation and the pipelining process. My assumption of 3X memory, where X is the initial RDD size, is not accurate.
But is it possible to cache a 1X RDD in memory and reuse it across the pipeline? How does cache() work?
First off, the lazy execution means that functional composition can occur:
scala> val rdd = sc.makeRDD(List("This is a test", "This is another test",
"And yet another test"), 1)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at makeRDD at <console>:27
scala> val counts = rdd.flatMap(line => {println(line);line.split(" ")}).
| map(word => {println(word);(word,1)}).
| reduceByKey((x,y) => {println(s"$x+$y");x+y}).
| collect
This is a test
This
is
a
test
This is another test
This
1+1
is
1+1
another
test
1+1
And yet another test
And
yet
another
1+1
test
2+1
counts: Array[(String, Int)] = Array((And,1), (is,2), (another,2), (a,1), (This,2), (yet,1), (test,3))
First note that I force the parallelism down to 1 so that we can see how this looks on a single worker. Then I add a println to each of the transformations so that we can see how the workflow moves. You see that it processes the line, then the output of that line, followed by the reduction. So there are not separate states stored for each transformation, as you suggested. Instead, each piece of data is run through the entire chain of transformations until a shuffle is needed, as can be seen in the DAG visualization from the UI:
That is the win from the laziness. As for Spark versus Hadoop, there is already a lot out there (just google it), but the gist is that Spark tends to utilize network bandwidth out of the box, giving it a boost right there. Then there are a number of performance improvements gained by laziness, especially if a schema is known and you can utilize the DataFrames API.
So, overall, Spark beats MR hands down in just about every regard.
The memory requirement of Spark is not 10 times the data size just because you have 10 transformations in your Spark job. When you specify the transformation steps of a job, Spark builds a DAG that allows it to execute all the steps in the job. It then breaks the job down into stages. A stage is a sequence of transformations which Spark can execute on a dataset without shuffling.
When an action is triggered on the RDD, Spark evaluates the DAG. It applies all the transformations in a stage together until it hits the end of the stage, so the memory pressure is unlikely to be 10 times higher unless every transformation leads to a shuffle (in which case it is probably a badly written job).
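As a rough way to see this for the word-count example above (assuming a Spark shell with sc available), toDebugString prints the lineage; the indentation step marks the single shuffle boundary introduced by reduceByKey, and everything above it is pipelined within one stage.

val textFile = sc.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

println(counts.toDebugString)   // shows two stages: one before and one after the shuffle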
I would recommend watching this talk and going through the slides.
Here is the job I'm running from the Spark shell:
val l = sc.parallelize((1 to 5000000).toList)
val m = l.map(m => m*23)
m.take(5000000)
Workers appear to be in the "LOADING" state:
What is "LOADING" state ?
Update :
As I understand it, take will run the job on the cluster and then return the results to the driver. So does the "LOADING" state correspond to the data being loaded onto the driver?
I believe that if you do something like this:
(1 to 5000000).toList
you are bound to encounter java.lang.OutOfMemoryError: GC overhead limit exceeded.
This happens when the JVM realizes that it is spending too much time in garbage collection. By default the JVM throws this error if more than 98% of the total time is spent in GC and less than 2% of the heap is recovered afterwards.
In this particular case you are creating a new List instance for every iteration (immutability, so each time a new List is returned). This means each iteration leaves behind a useless List instance, and for a list with millions of elements this takes a lot of memory and triggers GC very frequently. Also, each GC cycle has to free a lot of memory and therefore takes a long time.
This ultimately leads to error - java.lang.OutOfMemoryError: GC overhead limit exceeded.
What would happen without this safeguard? The little memory the GC manages to reclaim would quickly fill up again, forcing the GC to restart the cleaning process. This forms a vicious cycle in which the CPU is 100% busy with GC and no actual work gets done. The application would face extreme slowdowns: operations that used to complete in milliseconds would now likely take minutes to finish.
This is a pre-emptive fail-fast safeguard implemented in the JVM.
You can disable this safeguard with the following Java option:
-XX:-UseGCOverheadLimit
But I will strongly recommend NOT doing this.
And even if you disable this feature (or if your Spark cluster avoids it to some extent by allocating a large heap), something like
(1 to 5000000 ).toList
will take a long long time.
Also, I have a strong feeling that systems like Spark, which are supposed to run multiple jobs, are configured (by default; perhaps you can override this) to pause or reject jobs as soon as they detect extreme GC that could starve other jobs. This may be the main reason your job is always loading.
You can get a lot of relief by using a mutable list and appending values to it in a for loop. You can then parallelise the mutable list:
val mutableList = scala.collection.mutable.MutableList.empty[Int]
for (i <- 1 to 5000000) {
  mutableList.append(i)
}
val l = sc.parallelize(mutableList)
But even this will lead to multiple (though far less severe) memory allocations (and hence GC executions) whenever the list's backing storage fills up, which results in relocating the whole list with double the previously allocated memory.
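Alternatively (a sketch under ordinary Spark shell assumptions), you can avoid materializing the list on the driver at all: a Range is already a Seq and can be passed to parallelize directly, or sc.range can generate the numbers on the executors.

val l = sc.parallelize(1 to 5000000)   // Range is a lightweight Seq, no 5M-element List on the driver
// or
val l2 = sc.range(1L, 5000001L)        // values are generated on the executors

val m = l.map(_ * 23)
m.take(5)                              // take a small sample instead of pulling all 5M rows back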