Apache Spark stuck in "loading" state - scala

Here is the job I'm running from the Spark shell:
val l = sc.parallelize((1 to 5000000).toList)
val m = l.map(m => m*23)
m.take(5000000)
Workers appear to be in the "LOADING" state:
What is the "LOADING" state?
Update :
As I understand it, take will perform the job on the cluster and then return the results to the driver. So does the "LOADING" state equate to the data being loaded onto the driver?

I believe that if you do something like this,
(1 to 5000000 ).toList
you are bound to encounter java.lang.OutOfMemoryError: GC overhead limit exceeded.
This happens when the JVM realizes that it is spending too much time in Garbage Collection. By default the JVM is configured to throw this error if you are spending more than 98% of the total time in GC and after the GC less than 2% of the heap is recovered.
In this particular case you are creating a new instance of List for every iteration (immutability, so each time a new instance of List is returned). This means each iteration leaves behind a useless instance of List, and for a List with a size in the millions this takes a lot of memory and triggers GC very frequently. Also, each GC run has to free a lot of memory and hence takes a lot of time.
This ultimately leads to the error java.lang.OutOfMemoryError: GC overhead limit exceeded.
What happens if this safeguard was not there? The little amount of memory the GC was able to clean would quickly fill up again, forcing the GC to restart the cleaning process. This forms a vicious cycle where the CPU is 100% busy with GC and no actual work can be done. The application faces extreme slowdowns: operations which used to complete in milliseconds will now likely take minutes to finish.
This is a pre-emptive fail-fast safeguard implemented in the JVM.
You can disable this safeguard with the following Java option:
-XX:-UseGCOverheadLimit
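For reference, in a Spark shell session this flag would have to be passed to the driver (and, if needed, executor) JVMs at launch time; a hedged example invocation, with other cluster settings omitted:
spark-shell --driver-java-options "-XX:-UseGCOverheadLimit" --conf "spark.executor.extraJavaOptions=-XX:-UseGCOverheadLimit"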
But I strongly recommend NOT doing this.
And even if you disable this feature (or if your Spark cluster avoids this to some extent by allocating a large heap space), something like
(1 to 5000000 ).toList
will take a long long time.
Also, I have a strong feeling that systems like Spark which are supposed to be running multiple jobs are configured ( by default, may be you can override ) to pause or reject such jobs as soon as they realize extreme GC which can lead to starvation of other jobs. And this may be the main reason your job is always loading.
You can get a lot of relief by using a mutable List and appending values to it with a for loop. Now you can parallelise your mutable list.
val mutableList = scala.collection.mutable.MutableList.empty[Int]
for (i <- 1 to 5000000) {
  mutableList.append(i)
}
val l = sc.parallelize(mutableList)
But even this will lead to multiple (though far less severe) memory allocations (and hence GC executions) whenever the list's backing storage fills up, which results in relocating the whole list into a newly allocated block of double the previous size.
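If that reallocation cost matters, a rough alternative (a sketch, not part of the original answer) is to preallocate an array-backed buffer, or to parallelize the range directly:
import scala.collection.mutable.ArrayBuffer

// Preallocating the capacity up front avoids repeated grow-and-copy cycles.
val buffer = new ArrayBuffer[Int](5000000)
for (i <- 1 to 5000000) {
  buffer.append(i)
}
val l = sc.parallelize(buffer)

// Or skip building the driver-side collection entirely; a Range is cheap to
// hold and Spark can slice it lazily:
// val l = sc.parallelize(1 to 5000000)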

Related

Swift: Limit CPU time of thread, queue or operation

I employ a native C library in my iOS app. Calling the code directly puts it on the main thread (of course depending on where I call it from), so I have created a new queue in order to keep the call off the main thread.
let cProcessingQueue = DispatchQueue(label: "cProcessingQueue")
cProcessingQueue.async {
  // call that launches the C operation
}
This keeps the UI responsive. However, on larger data sets this can cause crashes with reports like the following:
Action taken: Process killed
CPU: 48 seconds cpu time over 54 seconds (88% cpu average), exceeding limit of 80% cpu over 60 seconds
Is there something I can apply to get some kind of hard(ish) limit, like "qos: .background will always limit the usage to 50%", in order to slow down the C lib execution in the first place? I'm fine with using NSOperation or whatever there is, but that would probably be the most convenient solution.
A few other ideas I can think of are:
Feeding smaller chunks of data with some delay in between; however, then we have arbitrary wait times that go unused
Watching the CPU time as shown in Get CPU usage IOS Swift and modifying the C library to support halting executions
but all in all, the "thread-related limit" would definitely be the easiest solution out there.

When is the data in a spark partition actually realised?

I am analysing the performance of my Spark application for small datasets. I have a lineage graph which looks something like the following:
someList.toDS()
  .repartition(x)
  .mapPartitions(func1)
  .mapPartitions(func2)
  .mapPartitions(func3)
  .filter(cond1)
  .count()
I have a cluster of 2 nodes with 8 cores each. Executors are configured to use 4 cores, so when the application is running four executors come up, using 4 cores each.
I am observing that at least (and usually only) 1 task on each thread (i.e. 16 tasks in total) takes a lot longer than the other tasks. For example, in one run these tasks take approx 15-20 seconds, compared to the other tasks finishing in a second or less.
On profiling the code, I found the bottleneck to be in func3 above:
def func3 = (partition: Iterator[DomainObject]) => {
  val l = partition.toList // This takes almost all of the time
  val t = doSomething(l)
}
The conversion from an Iterator to a List takes up almost all of the time.
The partition size is very small (even fewer than 50 objects in some cases). Even then, the partition size is almost consistent across partitions, yet only one task per thread takes up the time.
I would have assumed that by the time func3 runs on the executor for a task, the data within that partition would already be present on the executor. Is this not the case? (Does it iterate over the entire dataset to filter out data for this partition somehow, during the execution of func3?!)
Else, why should the conversion from an Iterator over less than fifty objects to a List take up that much time?
Another thing I note (not sure if it is relevant) is that the GC time (as per the Spark UI) for these tasks is also an unusually consistent 2s for all sixteen of them, compared to the other tasks (and even then, 2s << 20s).
Update:
Following is how the event timeline looks for the four executors:
The first realization is during repartition().
The second is after the filter operation, when all three mapPartitions start executing (once the count action is called). How the DAG is built, and where the time goes, depends on what doSomething() does in each of those functions, and you can optimize accordingly.
It appears that the data in the partition is available as soon as the task starts executing (or, at least there is not any significant cost in iterating through that data, as the question would make it seem.)
The bottleneck in the above code is actually in func2 (which I had not investigated properly!), and it is because of the lazy nature of iterators in Scala. The problem is not related to Spark at all.
Firstly, the functions in the mapPartitions calls above get chained and called like so: func3(func2(func1(Iterator[A]))): Iterator[B]. So the Iterator produced as output of func2 is fed to func3 directly.
Secondly, for the above issue, func1 (and func2) are defined as:
def func1(x: Iterator[A]): Iterator[B] = x.map(...).filter(...)
Since these take an iterator and map it to a different iterator, they are not executed right away. But when func3 is executed, partition.toList causes the map closure in func2 to be executed. On profiling, it appears that func3 takes all the time, when in fact it is func2 that contains the code slowing down the application.
(Specific to the above problem: func2 contains some serialisation of case objects to a JSON string. It appears to execute some time-consuming implicit code, but only for the first object on each thread. Since that happens once per thread, and each thread has just one task that takes very long, this explains the event timeline above.)
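A small self-contained sketch (with illustrative functions, not the asker's actual ones) of that laziness: the work defined in the middle function only runs when the final toList materialises the chain, so a profiler attributes the cost to the last step.
val func1: Iterator[Int] => Iterator[Int] = it => it.map(_ + 1)
val func2: Iterator[Int] => Iterator[String] = it => it.map { x =>
  Thread.sleep(1) // stand-in for the expensive (implicit) serialisation work
  x.toString
}
val func3: Iterator[String] => List[String] = it => it.toList // all the sleeping is "charged" here

val out = func3(func2(func1(Iterator(1, 2, 3)))) // nothing executes until toList inside func3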

Spark loop and long lineage drag down evaluation

In PySpark, I have some operations that need to be looped over and that constantly change a certain dataframe.
df = spark.sql("select * from my_table")
for iter in range(0, num_iter):
    for v in var_list:
        # ....
        # Operations here that change df
        df = f1()
        ...
        df = join(df, another dataset)
        df = f2(df)
        ...
    # There are some actions here that should conclude the DAG of this loop
I notice that as the loop progresses, performance is dragged down. Initially an iteration takes only seconds; after a couple of loops, each iteration takes minutes to hours. Checking in YARN, it seems that the DAG grows as we move along. At some point the executors just fail, with an error showing a very long DAG.
The DAG will grow like this [example only; it's not solely the join operation that grows the DAG].
Is there a way to improve performance in this situation? What is causing the performance hit? If it's objects hogging memory, is there a way to flush them after each loop?
Is cache() (with a follow-up action) at each loop a good workaround to avoid the DAG build-up? I did try caching, and still the performance drags on.
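For concreteness, here is a rough sketch (Scala API shown; PySpark has the same DataFrame.checkpoint method, and all names and values here are illustrative, not from the question) of the lineage-truncation alternative to cache():
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("lineage-sketch").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

def step(df: DataFrame): DataFrame = df.withColumn("v", df("v") * 2) // stand-in for the loop body

var df = spark.range(0, 1000).toDF("v")
for (_ <- 0 until 10) {
  // checkpoint() materialises the result and truncates the lineage,
  // whereas cache() keeps the full DAG and only stores the data.
  df = step(df).checkpoint()
}
df.count()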

Spark Streaming mapWithState seems to rebuild complete state periodically

I am working on a Scala (2.11) / Spark (1.6.1) streaming project and using mapWithState() to keep track of seen data from previous batches.
The state is distributed over 20 partitions on multiple nodes, created with StateSpec.function(trackStateFunc _).numPartitions(20). In this state we have only a few keys (~100) mapped to Sets with up to ~160,000 entries, which grow throughout the application. The entire state is up to 3GB, which can be handled by each node in the cluster. In each batch, some data is added to the state but not deleted until the very end of the process, i.e. after ~15 minutes.
While following the application UI, every 10th batch's processing time is very high compared to the other batches. See images:
The yellow fields represent the high processing time.
A more detailed Job view shows that in these batches the delay occurs at a certain point, exactly when all 20 partitions are "skipped". Or at least this is what the UI says.
My understanding of "skipped" is that each state partition is one possible task which isn't executed, as it doesn't need to be recomputed. However, I don't understand why the number of skips varies in each Job and why the last Job requires so much processing. The higher processing time occurs regardless of the state's size; it only impacts the duration.
Is this a bug in the mapWithState() functionality or is this intended behaviour? Does the underlying data structure require some kind of reshuffling, does the Set in the state need to copy data? Or is it more likely to be a flaw in my application?
Is this a bug in the mapWithState() functionality or is this intended behaviour?
This is intended behavior. The spikes you're seeing are because your data is getting checkpointed at the end of that given batch. If you look at the time on the longer batches, you'll see that it happens persistently every 100 seconds. That's because the checkpoint interval is constant, and is calculated from your batchDuration (how often you talk to your data source to read a batch) multiplied by some constant, unless you explicitly set the DStream.checkpoint interval.
Here is the relevant piece of code from MapWithStateDStream:
override def initialize(time: Time): Unit = {
  if (checkpointDuration == null) {
    checkpointDuration = slideDuration * DEFAULT_CHECKPOINT_DURATION_MULTIPLIER
  }
  super.initialize(time)
}
Where DEFAULT_CHECKPOINT_DURATION_MULTIPLIER is:
private[streaming] object InternalMapWithStateDStream {
  private val DEFAULT_CHECKPOINT_DURATION_MULTIPLIER = 10
}
This lines up exactly with the behavior you're seeing, since your batch duration is 10 seconds => 10 * 10 = 100 seconds.
This is normal, and that is the cost of persisting state with Spark. An optimization on your side could be to think about how you can minimize the size of the state you have to keep in memory, so that this serialization is as quick as possible. Additionally, make sure that the data is spread over enough executors, so that state is distributed uniformly between all nodes. Also, I hope you've turned on Kryo serialization instead of the default Java serialization; that can give you a meaningful performance boost.
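For concreteness, a rough sketch (names, paths and durations are illustrative, not from the question) of where those two knobs live: Kryo serialization and an explicit checkpoint interval for the state stream.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("mapwithstate-sketch")
  // Kryo instead of the default Java serialization for the checkpointed state
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val ssc = new StreamingContext(conf, Seconds(10)) // 10 second batches, as in the question
ssc.checkpoint("/tmp/streaming-checkpoints")      // checkpoint directory required by mapWithState

// After building the mapWithState stream, its checkpoint interval can be set explicitly
// instead of the default slideDuration * 10 (= 100 seconds here), e.g.:
//   stateStream.checkpoint(Seconds(50))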
In addition to the accepted answer, which points out the price of serialization related to checkpointing, there's another, less-known issue which might contribute to the spiky behaviour: eviction of deleted states.
Specifically, 'deleted' or 'timed out' states are not removed immediately from the map, but are marked for deletion and actually removed only in the process of serialization [in Spark 1.6.1, see writeObjectInternal()].
This has two performance implications, which occur only once per 10 batches:
The traversal and deletion process has its price
If you process the stream of timed-out/deleted events, e.g. persist it to external storage, the associated cost for all 10 batches will be paid only at this point (and not, as one might have expected, on each RDD)

Performance of scala parallel collection processing

I have scenarios where I will need to process thousands of records at a time. Sometimes it might be in the hundreds, maybe up to 30,000 records. I was thinking of using Scala's parallel collections. So, just to understand the difference, I wrote a simple program like the one below:
object Test extends App {
  val list = (1 to 100000).toList
  Util.seqMap(list)
  Util.parMap(list)
}
object Util {
  def seqMap(list: List[Int]) = {
    val start = System.currentTimeMillis
    list.map(x => x + 1).toList.sum
    val end = System.currentTimeMillis
    println("time taken =" + (end - start))
    end - start
  }
  def parMap(list: List[Int]) = {
    val start = System.currentTimeMillis
    list.par.map(x => x + 1).toList.sum
    val end = System.currentTimeMillis
    println("time taken=" + (end - start))
    end - start
  }
}
I expected that running in parallel will be faster. However, the output I was getting was
time taken =32
time taken=127
machine config :
Intel i7 processor with 8 cores
16GB RAM
64bit Windows 8
What am I doing wrong? Is this not a correct scenario for parallel mapping?
The issue is that the operation you are performing is so fast (just adding two ints) that the overhead of doing the parallelization is more than the benefit. The parallelization only really makes sense if the operations are slower.
Think of it this way: if you had 8 friends and you gave each one an integer on a piece of paper and told them to add one, write the result down, and give it back to you, which you would record before giving them the next integer, you'd spend so much time passing messages back and forth that you could have just done all the adding yourself faster.
ALSO: Never call .par on a List, because the parallelization procedure has to copy the entire list into a parallel collection and then copy the whole thing back out. If you use a Vector, it doesn't have to do this extra work.
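As a rough sketch of that point (the size is illustrative, and .par here assumes Scala 2.12 or earlier, where parallel collections are built in):
val vec = (1 to 100000).toVector
// A Vector splits cheaply into chunks for parallel processing,
// so no full copy in and out is needed, unlike List.par.
val result = vec.par.map(_ + 1).sum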
The overhead in parallelizing the list proves more time-consuming than the actual processing of the x + 1 operations sequentially.
Yet consider this modification, where we include an operation that takes approximately 1 millisecond:
case class Delay() {
  Thread.sleep(1)
}
and replace
list.map(x => x + 1).toList.sum
with
list.map(_ => Delay()).toList
Now for val list = (1 to 10000).toList (note 10000 instead of 100000), on a quad-core 8GB machine,
scala> Util.parMap(list)
time taken=3451
res4: Long = 3451
scala> Util.seqMap(list)
time taken =10816
res5: Long = 10816
We can infer (or rather guess) that for large collections with time-consuming operations, the overhead of parallelizing a collection does not significantly affect the elapsed time, in contrast with sequential collection processing.
If you are doing benchmarks, consider using something like JMH to avoid all the possible problems you might encounter when measuring the way your program does. For example, the JIT may change your results dramatically, but only after some iterations.
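A hedged sketch of what such a JMH benchmark might look like (it requires the sbt-jmh plugin; the class and method names are illustrative, and .par assumes Scala 2.12 or earlier):
import org.openjdk.jmh.annotations._

@State(Scope.Benchmark)
class MapBenchmark {
  val list: List[Int] = (1 to 100000).toList

  // JMH runs each method repeatedly after warm-up, so JIT effects are accounted for.
  @Benchmark def sequential: Int = list.map(_ + 1).sum
  @Benchmark def parallel: Int = list.par.map(_ + 1).sum
}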
In my experience parallel collections are normally slower if the input is not large enough: if the input is small, the initial split and the "putting together" at the end do not pay off.
So benchmark again, using lists of different sizes (try 30 000, 100 000, and 1 000 000).
Moreover, if you are doing numerical processing, consider using an Array (instead of a List) and while (instead of map). These are "more native" (= faster) on the underlying JVM, whereas in your case you are possibly measuring the performance of the garbage collector. With an Array you could also store the result of the operation "in place".
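A rough sketch of that suggestion (the size is illustrative): the increment is done in place over an Array with a while loop, so no intermediate collections are allocated.
val arr = Array.tabulate(100000)(identity) // 0, 1, 2, ..., 99999
var i = 0
while (i < arr.length) {
  arr(i) = arr(i) + 1 // result stored "in place"; nothing extra for the GC to collect
  i += 1
}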
Parallel collections initialize their threads before performing the operation, and that takes some time.
So when you perform operations on parallel collections with a low number of elements, or when the operations take little time, parallel collections will perform slower.