I am analysing the performance of my Spark application on small datasets. I have a lineage graph which looks something like the following:
someList.toDS()
.repartition(x)
.mapPartitions(func1)
.mapPartitions(func2)
.mapPartitions(func3)
.filter(cond1)
.count()
I have a cluster of 2 nodes with 8 cores each. Executors are configured to use 4 cores, so when the application runs, four executors come up, each using 4 cores.
I am observing that at least one (and usually only one) task on each thread (i.e. 16 tasks in total) takes a lot longer than the other tasks. For example, in one run these tasks take approx. 15-20 seconds, while the other tasks finish in a second or less.
On profiling the code, I found the bottleneck to be in func3 above:
def func3 = (partition: Iterator[DomainObject]) => {
  val l = partition.toList // this conversion takes almost all of the time
  val t = doSomething(l)
  // (the result is turned back into an Iterator here, elided)
}
The conversion from an Iterator to a List takes up almost all of the time.
The partition size is very small (fewer than 50 objects in some cases). Moreover, partition sizes are roughly consistent across partitions, yet only one task per thread takes this long.
I would have assumed that by the time func3 runs on the executor for a task, the data within that partition would already be present on the executor. Is this not the case? (Does it iterate over the entire dataset to filter out data for this partition somehow, during the execution of func3?!)
Otherwise, why should converting an Iterator over fewer than fifty objects to a List take that much time?
Another thing I note (not sure if it is relevant): the GC time (as per the Spark UI) for these tasks is an unusually consistent 2 s for all sixteen of them, compared to the other tasks (and even then, 2 s << 20 s).
Update:
Following is how the event timeline looks for the four executors:
The first materialization happens during repartition().
The second happens after the filter operation, when all three mapPartitions start executing (once the count action is called). How the DAG is built, and where the time goes, depends on the doSomething() inside each of those functions, and you can optimize accordingly.
It appears that the data in the partition is available as soon as the task starts executing (or, at least, there is no significant cost in iterating through that data, as the question made it seem).
The bottleneck in the above code is actually in func2 (which I had not investigated properly!), and it is due to the lazy nature of iterators in Scala. The problem is not related to Spark at all.
Firstly, the functions in the mapPartitions calls above appear to get chained and called like so: func3( func2( func1(Iterator[A]) ) ) : Iterator[B]. So the Iterator produced as the output of func2 is fed directly to func3.
Secondly, for the above issue, func1 (and func2) are defined as:
def func1(x: Iterator[A]): Iterator[B] = x.map(...).filter(...)
Since these take an iterator and map it to another iterator, they are not executed right away. But when func3 runs, partition.toList causes the map closure in func2 to be executed. So profiling shows func3 taking all the time, when in fact it is func2 that contains the code slowing down the application.
(Specific to the above problem: func2 serialises case objects to a JSON string. It appears to execute some time-consuming implicit initialisation code, but only for the first object on each thread. Since this happens once per thread, each thread has exactly one task that takes very long, which explains the event timeline above.)
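To illustrate the effect outside Spark, here is a minimal sketch (slowInit is a hypothetical stand-in for the expensive one-off initialisation hidden inside func2; it is not the original code):

object LazyIteratorDemo extends App {
  // Pretend-expensive work that runs only once (e.g. JSON implicits warming up).
  def slowInit(): Unit = Thread.sleep(2000)
  lazy val initialised = { slowInit(); true }

  // Lazy: nothing is executed when func2 is applied, only an Iterator is wrapped.
  val func2: Iterator[Int] => Iterator[String] =
    it => it.map { x => if (initialised) x.toString else "" }

  // toList forces func2's map closure, so the cost shows up "inside" func3.
  val func3: Iterator[String] => Iterator[Int] = { it =>
    val t0 = System.nanoTime
    val l = it.toList
    println(s"toList took ${(System.nanoTime - t0) / 1e6} ms")
    Iterator(l.size)
  }

  println(func3(func2(Iterator(1, 2, 3))).next())
}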
Related
Let's say we have the following Scala program:
val inputRDD = sc.textFile("log.txt")
inputRDD.persist()
val errorsRDD = inputRDD.filter(x => x.contains("error"))
val warningsRDD = inputRDD.filter(x => x.contains("warning"))
println("Errors: " + errorsRDD.count() + ", Warnings: " + warningsRDD.count())
We create a simple RDD, persist it, perform two transformations on the RDD and finally have an action which uses the RDDs.
When the println is called, the transformations are executed; each transformation is, of course, executed in parallel depending on the cluster management.
My main question is: are the two actions and their transformations executed in parallel or in sequence? Or does errorsRDD.count() execute first and then warningsRDD.count(), in sequence?
I'm also wondering if there is any point in using persist in this example.
All standard RDD methods are blocking (with the exception of AsyncRDDActions), so the actions will be evaluated sequentially. It is possible to execute multiple actions concurrently using non-blocking submission (threads, Futures), with the in-application scheduler correctly configured or with resources explicitly limited for each action.
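For example, a minimal sketch of submitting both counts concurrently from the driver, reusing the RDDs from the question (whether the jobs actually overlap depends on scheduler configuration and available resources):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each count blocks its own driver-side thread, so Spark can schedule the
// two jobs concurrently instead of one after the other.
val errorsCountF   = Future { errorsRDD.count() }
val warningsCountF = Future { warningsRDD.count() }

val (errors, warnings) = Await.result(errorsCountF.zip(warningsCountF), Duration.Inf)
println("Errors: " + errors + ", Warnings: " + warnings)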
Regarding cache, it is impossible to answer without knowing the context. Depending on the cluster configuration, storage, and data locality, it might be cheaper to load the data from disk again, especially when resources are limited and subsequent actions might trigger the cache cleaner.
This will execute errorsRDD.count() first and then warningsRDD.count().
The point of using persist here is that when the first count is executed, inputRDD will be kept in memory.
For the second count, Spark won't need to re-read the whole content of the file from storage, so that count executes much faster than the first.
We are running the following stage DAG and experiencing long shuffle read time for relatively small shuffle data sizes (about 19MB per task)
One interesting aspect is that waiting tasks within each executor/server have equivalent shuffle read times. Here is an example of what that means: for one server, one group of tasks waits about 7.7 minutes and another group waits about 26 s.
Here is another example from the same stage run. The figure shows 3 executors / servers each having uniform groups of tasks with equal shuffle read time. The blue group represents killed tasks due to speculative execution:
Not all executors are like that. There are some that finish all their tasks within seconds, pretty much uniformly, and the size of remote read data for these tasks is the same as for the ones that wait a long time on other servers.
Besides, this type of stage runs twice within our application runtime. The servers/executors that produce these groups of tasks with large shuffle read times are different in each stage run.
Here is an example of the task stats table for one of the servers / hosts:
It looks like the code responsible for this DAG is the following:
val comparison = data.union(output).except(data.intersect(output)).cache()
comparison.filter(_.abc != "M").count()
output.write.parquet("output.parquet")
comparison.write.parquet("comparison.parquet")
output.union(comparison).write.parquet("output_comparison.parquet")
We would highly appreciate your thoughts on this.
Apparently the problem was JVM garbage collection (GC). The tasks had to wait until GC finished on the remote executors. The equivalent shuffle read times resulted from several tasks waiting on a single remote host that was performing GC. We followed the advice posted here and the problem decreased by an order of magnitude. There is still a small correlation between GC time on remote hosts and local shuffle read time. In the future we plan to try the external shuffle service.
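For reference, the kind of executor GC tuning involved looks roughly like this (a sketch only; the flags and values are illustrative, not the exact settings we used):

import org.apache.spark.SparkConf

// Illustrative executor GC settings (placeholder values, tune for your heap size);
// the external shuffle service is the option we have not tried yet and it also
// requires the service to be running on every worker node.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails")
  .set("spark.shuffle.service.enabled", "true")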
Since Google brought me here with the same problem, but I needed another solution...
Another possible reason for a small shuffle taking a long time to read is that the data is split over many partitions. For example (apologies, this is PySpark, as it is all I have used):
my_df_with_many_partitions\ # say has 1000 partitions
.filter(very_specific_filter)\ # only very few rows pass
.groupby('blah')\
.count()
The shuffle write from the filter above will be very small, so the following stage has very little to read. But to read it, you still have to check a lot of (mostly empty) partitions. One way to address this would be:
my_df_with_many_partitions\
.filter(very_specific_filter)\
.repartition(1)\
.groupby('blah')\
.count()
I am working on a Scala (2.11) / Spark (1.6.1) streaming project and using mapWithState() to keep track of seen data from previous batches.
The state is distributed over 20 partitions on multiple nodes, created with StateSpec.function(trackStateFunc _).numPartitions(20). In this state we have only a few keys (~100), mapped to Sets with up to ~160,000 entries, which grow throughout the application. The entire state is up to 3 GB, which can be handled by each node in the cluster. In each batch, some data is added to the state but not deleted until the very end of the process, i.e. after ~15 minutes.
While following the application UI, I noticed that every 10th batch's processing time is very high compared to the other batches. See the images:
The yellow fields represent the high processing time.
A more detailed Job view shows that in these batches the extra time occurs at a certain point, exactly when all 20 partitions are "skipped". Or at least that is what the UI says.
My understanding of "skipped" is that each state partition is one potential task which isn't executed because it doesn't need to be recomputed. However, I don't understand why the number of skips varies in each Job and why the last Job requires so much processing. The higher processing time occurs regardless of the state's size; the size only affects the duration.
Is this a bug in the mapWithState() functionality or is this intended behaviour? Does the underlying data structure require some kind of reshuffling, does the Set in the state need to copy data? Or is it more likely to be a flaw in my application?
Is this a bug in the mapWithState() functionality or is this intended
behaviour?
This is intended behavior. The spikes you're seeing occur because your data is being checkpointed at the end of that particular batch. If you look at the times of the longer batches, you'll see that it happens consistently every 100 seconds. That's because the checkpoint interval is constant: it is calculated as your batchDuration (how often you talk to your data source to read a batch) multiplied by some constant, unless you explicitly set the DStream.checkpoint interval.
Here is the relevant piece of code from MapWithStateDStream:
override def initialize(time: Time): Unit = {
  if (checkpointDuration == null) {
    checkpointDuration = slideDuration * DEFAULT_CHECKPOINT_DURATION_MULTIPLIER
  }
  super.initialize(time)
}
Where DEFAULT_CHECKPOINT_DURATION_MULTIPLIER is:
private[streaming] object InternalMapWithStateDStream {
  private val DEFAULT_CHECKPOINT_DURATION_MULTIPLIER = 10
}
Which lines up exactly with the behavior you're seeing, since your batch duration is 10 seconds => 10 * 10 = 100 seconds.
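If that default is too aggressive for your state size, the interval can be set explicitly on the state stream; a sketch (the stream name and the 100-second value are just placeholders):

import org.apache.spark.streaming.{Seconds, StateSpec}

// `stream` and `trackStateFunc` stand in for the question's own DStream and function.
val stateDStream = stream.mapWithState(
  StateSpec.function(trackStateFunc _).numPartitions(20))

// Checkpoint every 100 s instead of relying on slideDuration * 10; a longer
// interval means fewer expensive batches but a longer recovery on failure.
stateDStream.checkpoint(Seconds(100))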
This is normal, and it is the cost of persisting state with Spark. An optimization on your side could be to think about how to minimize the size of the state you have to keep in memory, so that this serialization is as quick as possible. Additionally, make sure that the data is spread across enough executors, so that state is distributed uniformly between all nodes. Also, I hope you've turned on Kryo serialization instead of the default Java serialization; that can give you a meaningful performance boost.
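For completeness, a minimal sketch of switching to Kryo (MyStateValue is a placeholder for whatever your state Sets contain; registration is optional but keeps the serialized form compact):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering the classes held in state avoids writing full class names.
  .registerKryoClasses(Array(classOf[MyStateValue]))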
In addition to the accepted answer, which points out the serialization price of checkpointing, there is another, less-known issue which might contribute to the spiky behaviour: eviction of deleted states.
Specifically, 'deleted' or 'timed out' states are not removed immediately from the map, but are marked for deletion and actually removed only in the process of serialization [in Spark 1.6.1, see writeObjectInternal()].
This has two performance implications, which occur only once per 10 batches:
1. The traversal and deletion process has its price.
2. If you process the stream of timed-out/deleted events, e.g. persist it to external storage, the associated cost for all 10 batches will be paid only at this point (and not, as one might have expected, on each RDD); see the sketch below.
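As an illustration of the second point, here is a sketch of a state function that forwards timed-out entries downstream (types and names are placeholders modelled on the question's setup, not the original code):

import org.apache.spark.streaming.{Minutes, State, StateSpec, Time}

// Placeholder types: String keys mapped to a Set[String] state.
def trackStateFunc(batchTime: Time, key: String, value: Option[String],
                   state: State[Set[String]]): Option[(String, Set[String])] = {
  if (state.isTimingOut()) {
    // Timed-out states are only physically evicted at checkpoint time, so any
    // downstream processing of them also clusters into that one batch.
    Some((key, state.get()))
  } else {
    val updated = state.getOption().getOrElse(Set.empty[String]) ++ value
    state.update(updated)
    None
  }
}

val spec = StateSpec.function(trackStateFunc _).numPartitions(20).timeout(Minutes(15))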
I have scenarios where I will need to process thousands of records at a time. Sometimes it might be hundreds, maybe up to 30,000 records. I was thinking of using Scala's parallel collections. So, just to understand the difference, I wrote a simple program like the one below:
object Test extends App {
  val list = (1 to 100000).toList
  Util.seqMap(list)
  Util.parMap(list)
}

object Util {
  def seqMap(list: List[Int]) = {
    val start = System.currentTimeMillis
    list.map(x => x + 1).toList.sum
    val end = System.currentTimeMillis
    println("time taken =" + (end - start))
    end - start
  }

  def parMap(list: List[Int]) = {
    val start = System.currentTimeMillis
    list.par.map(x => x + 1).toList.sum
    val end = System.currentTimeMillis
    println("time taken=" + (end - start))
    end - start
  }
}
I expected that running in parallel will be faster. However, the output I was getting was
time taken =32
time taken=127
Machine config:
Intel i7 processor with 8 cores
16GB RAM
64bit Windows 8
What am I doing wrong? Is this not a correct scenario for parallel mapping?
The issue is that the operation you are performing is so fast (just adding two ints) that the overhead of doing the parallelization is more than the benefit. The parallelization only really makes sense if the operations are slower.
Think of it this way: if you had 8 friends and you gave each one an integer on a piece of paper and told them to add one, write the result down, and give it back to you, which you would record before giving them the next integer, you'd spend so much time passing messages back and forth that you could have just done all the adding yourself faster.
ALSO: Never do .par on a List because the parallelization procedure has to copy the entire list into a parallel collection and then copy the whole thing back out. If you use a Vector, then it doesn't have to do this extra work.
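For instance, a sketch of the same micro-benchmark on a Vector, which splits and recombines cheaply (the timings will still be dominated by the trivial x + 1 work, though):

def parMapVector(vec: Vector[Int]): Long = {
  val start = System.currentTimeMillis
  vec.par.map(x => x + 1).sum   // no List-to-parallel-collection copy on the way in or out
  val end = System.currentTimeMillis
  println("time taken (vector, par) = " + (end - start))
  end - start
}

parMapVector((1 to 100000).toVector)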
The overhead in parallelizing the list proves more time-consuming than the actual processing of the x + 1 operations sequentially.
Yet consider this modification, where we include an operation that takes roughly 1 millisecond:
case class Delay() {
  Thread.sleep(1)
}
and replace
list.map(x => x + 1).toList.sum
with
list.map(_ => Delay()).toList
Now for val list = (1 to 10000).toList (note 10000 instead of 100000), on a quad-core 8 GB machine,
scala> Util.parMap(list)
time taken=3451
res4: Long = 3451
scala> Util.seqMap(list)
time taken =10816
res5: Long = 10816
We can infer (or rather guess) that for large collections with time-consuming operations, the overhead of parallelizing the collection becomes insignificant and the parallel version clearly outperforms sequential processing.
If you are doing benchmarks, consider using something like JMH to avoid the pitfalls you might otherwise run into when measuring the way your program does. For example, JIT compilation may change your results dramatically, but only after some warm-up iterations.
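A minimal sketch of what that could look like with JMH (assuming the sbt-jmh plugin; class and method names are illustrative):

import org.openjdk.jmh.annotations.{Benchmark, Scope, State}

@State(Scope.Benchmark)
class MapBench {
  val list: List[Int] = (1 to 100000).toList

  // JMH runs warm-up iterations first, so JIT effects are accounted for;
  // returning the result prevents dead-code elimination.
  @Benchmark def seqMap: Int = list.map(_ + 1).sum
  @Benchmark def parMap: Int = list.par.map(_ + 1).sum
}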
In my experience parallel collections are normally slower if the input is not large enough: if the input is small, the initial split and the "putting together" at the end do not pay off.
So benchmark again, using lists of different sizes (try 30 000, 100 000, and 1 000 000).
Moreover, if you do numerical processing, consider using Array (instead of List) and while loops (instead of map). These are "more native" (= faster) on the underlying JVM, whereas in your case you are possibly measuring the performance of the garbage collector. With an Array you could also store the result of the operation "in place".
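A sketch of what that could look like, writing the incremented values into a pre-allocated array and summing in the same pass:

def incrementAndSum(input: Array[Int]): Long = {
  val out = new Array[Int](input.length)  // result stored "in place"-style, no boxing
  var i = 0
  var sum = 0L
  while (i < input.length) {
    out(i) = input(i) + 1
    sum += out(i)
    i += 1
  }
  sum
}

incrementAndSum((1 to 100000).toArray)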
Parallel collections initialize threads before performing the operation, which takes some time.
So when you operate on collections with few elements, or the operations take very little time, parallel collections will perform slower.
I'm trying to understand the rationale behind the statement
For cases where blocking is absolutely necessary, futures can be blocked on (although it is discouraged)
The idea behind ForkJoinPool is joining tasks, which is a blocking operation, and it is the main executor-context implementation for futures and actors. So it should be efficient at blocking joins.
I wrote a small benchmark and it seems like old-style futures (Scala 2.9) are 2 times faster in this very simple scenario.
@inline
def futureResult[T](future: Future[T]) = Await.result(future, Duration.Inf)

@inline
def futureOld[T](body: => T)(implicit ctx: ExecutionContext): () => T = {
  val f = future(body)
  () => futureResult(f)
}

def main(args: Array[String]) {
  @volatile
  var res = 0d
  CommonUtil.timer("res1") {
    (0 until 100000).foreach { i =>
      val f1 = futureOld(math.exp(1))
      val f2 = futureOld(math.exp(2))
      val f3 = futureOld(math.exp(3))
      res = res + f1() + f2() + f3()
    }
  }
  println("res1 = " + res)

  res = 0
  CommonUtil.timer("res1") {
    (0 until 100000).foreach { i =>
      val f1 = future(math.exp(1))
      val f2 = future(math.exp(2))
      val f3 = future(math.exp(3))
      val f4 = for (r1 <- f1; r2 <- f2; r3 <- f3) yield r1 + r2 + r3
      res = res + futureResult(f4)
    }
  }
  println("res2 = " + res)
}
start:res1
res1 - 1.683 seconds
res1 = 3019287.4850644027
start:res1
res1 - 3.179 seconds
res2 = 3019287.485058338
Most of the point of Futures is that they enable you to create non-blocking, concurrent code that can easily be executed in parallel.
OK, so wrapping a potentially lengthy function in a future returns immediately, so that you can postpone worrying about the return value until you are actually interested in it. But if the part of the code which does care about the value just blocks until the result is actually available, all you gained was a way to make your code a little tidier (and you know, you can do that without futures - using futures to tidy up your code would be a code smell, I think). Unless the functions being wrapped in futures are absolutely trivial, your code is going to spend much more time blocking than evaluating other expressions.
If, on the other hand, you register a callback (e.g. using onComplete or onSuccess) and put in that callback the code which cares about the result, then you can have code which can be organised to run very efficiently and scale well. It becomes event driven rather than having to sit and wait for results.
Your benchmark is of the former type, but since you have some tiny functions there, there is very little to gain from executing them in parallel rather than in sequence. This means that you are mostly measuring the overhead of creating and accessing the futures. Congratulations: you showed that in some circumstances 2.9 futures are faster at doing something trivial than 2.10 futures - something trivial which does not really play to the strengths of either version of the concept.
Try something a little more complex and demanding. I mean, you're requesting the future values almost immediately! At the very least, you could build an array of 100000 futures, then pull out their results in another loop. That would be testing something slightly meaningful. Oh, and have them compute something based on the value of i.
You could progress from there to
Creating an object to store the results.
Registering a callback with each future that inserts the result into the object.
Launching your n calculations
And then benchmarking how long it takes for the actual results to arrive, when you demand them all. That would be rather more meaningful.
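A sketch of that shape (purely illustrative; the per-future work is still trivial, so it mainly demonstrates the structure rather than a fair benchmark):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val n = 100000

// Launch all computations first, without demanding any result...
val futures: IndexedSeq[Future[Double]] = (0 until n).map(i => Future(math.exp(i % 10)))

// ...then combine them into one future and block exactly once, at the end.
val total = Await.result(Future.sequence(futures).map(_.sum), Duration.Inf)
println("total = " + total)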
EDIT
By the way, your benchmark fails both on its own terms and in its understanding of the proper use of futures.
Firstly, you are counting the time it takes to retrieve each individual future result, but not the actual time it takes to evaluate res once all 3 futures have been created, nor the total time it takes to iterate through the loop. Also, your mathematical calculations are so trivial that, in the second test, you might actually be measuring the penalty of a) the for comprehension and b) the fourth future in which the first three futures are wrapped.
Secondly, the only reason these sums probably add up to something roughly proportional to the overall time used is precisely because there is really no concurrency here.
I'm not trying to beat up on you, it's just that these flaws in the benchmark help illuminate the issue. A proper benchmark of the performance of different futures implementations would require very careful thought.
Java7 docs for ForkJoinTask reports:
A ForkJoinTask is a lightweight form of Future. The efficiency of
ForkJoinTasks stems from a set of restrictions (that are only
partially statically enforceable) reflecting their intended use as
computational tasks calculating pure functions or operating on purely
isolated objects. The primary coordination mechanisms are fork(), that
arranges asynchronous execution, and join(), that doesn't proceed
until the task's result has been computed. Computations should avoid
synchronized methods or blocks, and should minimize other blocking
synchronization apart from joining other tasks or using synchronizers
such as Phasers that are advertised to cooperate with fork/join
scheduling. Tasks should also not perform blocking IO, and should
ideally access variables that are completely independent of those
accessed by other running tasks. Minor breaches of these restrictions,
for example using shared output streams, may be tolerable in practice,
but frequent use may result in poor performance, and the potential to
indefinitely stall if the number of threads not waiting for IO or
other external synchronization becomes exhausted. This usage
restriction is in part enforced by not permitting checked exceptions
such as IOExceptions to be thrown. However, computations may still
encounter unchecked exceptions, that are rethrown to callers
attempting to join them. These exceptions may additionally include
RejectedExecutionException stemming from internal resource exhaustion,
such as failure to allocate internal task queues. Rethrown exceptions
behave in the same way as regular exceptions, but, when possible,
contain stack traces (as displayed for example using
ex.printStackTrace()) of both the thread that initiated the
computation as well as the thread actually encountering the exception;
minimally only the latter.
Doug Lea's maintenance repository for JSR166 (targeted at JDK8) expands on this:
A ForkJoinTask is a lightweight form of Future. The efficiency of
ForkJoinTasks stems from a set of restrictions (that are only
partially statically enforceable) reflecting their main use as
computational tasks calculating pure functions or operating on purely
isolated objects. The primary coordination mechanisms are fork(), that
arranges asynchronous execution, and join(), that doesn't proceed
until the task's result has been computed. Computations should ideally
avoid synchronized methods or blocks, and should minimize other
blocking synchronization apart from joining other tasks or using
synchronizers such as Phasers that are advertised to cooperate with
fork/join scheduling. Subdividable tasks should also not perform
blocking I/O, and should ideally access variables that are completely
independent of those accessed by other running tasks. These guidelines
are loosely enforced by not permitting checked exceptions such as
IOExceptions to be thrown. However, computations may still encounter
unchecked exceptions, that are rethrown to callers attempting to join
them. These exceptions may additionally include
RejectedExecutionException stemming from internal resource exhaustion,
such as failure to allocate internal task queues. Rethrown exceptions
behave in the same way as regular exceptions, but, when possible,
contain stack traces (as displayed for example using
ex.printStackTrace()) of both the thread that initiated the
computation as well as the thread actually encountering the exception;
minimally only the latter.
It is possible to define and use ForkJoinTasks that may block, but
doing so requires three further considerations: (1) Completion of few
if any other tasks should be dependent on a task that blocks on
external synchronization or I/O. Event-style async tasks that are
never joined (for example, those subclassing CountedCompleter) often
fall into this category. (2) To minimize resource impact, tasks should
be small; ideally performing only the (possibly) blocking action. (3)
Unless the ForkJoinPool.ManagedBlocker API is used, or the number of
possibly blocked tasks is known to be less than the pool's
ForkJoinPool.getParallelism() level, the pool cannot guarantee that
enough threads will be available to ensure progress or good
performance.
tl;dr;
The "blocking join" operation referred to by the fork-join is not to be confused with calling some "blocking code" within the task.
The first is about coordinating many independent tasks (which are not independent threads) to collect individual outcomes and evaluate an overall result.
The second is about calling a potentially long-blocking operation within a single task: e.g. IO operations over the network, a DB query, accessing the file system, accessing a globally synchronized object or method...
The second kind of blocking is discouraged for both Scala Futures and ForkJoinTasks.
The main risk is that the thread-pool gets exhausted and is unable to complete tasks awaiting in the queue, while all available threads are busy waiting on blocking operations.
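When a task genuinely must block, the Scala standard library at least lets you tell the pool about it; a minimal sketch using scala.concurrent.blocking:

import scala.concurrent.{blocking, Future}
import scala.concurrent.ExecutionContext.Implicits.global

// Wrapping the blocking section in `blocking { ... }` allows the default
// ForkJoinPool-backed ExecutionContext to spawn a compensating thread
// instead of letting the pool starve while this thread waits.
val result: Future[String] = Future {
  blocking {
    Thread.sleep(1000) // stand-in for a network call, DB query, file IO, etc.
    "done"
  }
}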