What is a dead executor in Spark (PySpark)?

I am running a PySpark job where I noticed that there are dead executors, but the job finally succeeded.
What is the impact when an executor is dead?
What happens when an executor, after completing its task, loses its connection and kills itself? Is the task initiated again on a new executor, resulting in the same task being executed twice?
Please help me to understand this.

For starters, I think Wikipedia's Spark page should be enough:
https://en.m.wikipedia.org/wiki/Apache_Spark
Basically, Spark does the data processing in a fault-tolerant manner, which means that some attempts to evaluate your dataframe might fail and some might succeed, and there is nothing bad about that: when an executor dies, the tasks it was running are rescheduled on other executors, so a task can indeed be executed more than once. Only when there are too many failures does Spark fail the whole job.
Spark knows how to process the data intelligently, yet sometimes you can (or, in the case of too many failures, have to) help it with more efficient manual configuration, e.g. setting up the Spark context with proper parameters, or smarter query organisation (rethink joins, broadcasting, maybe breaking bigger tables down into smaller chunks).
From my experience, a better-configured job means fewer failures and faster processing. It sometimes takes a lot of time to understand how to make your processing more efficient, yet you should aim for fewer failures and faster processing as the two indicators of a better Spark job.
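For a concrete idea, here is a minimal sketch of the kind of manual configuration meant above (illustrative values only, shown with the Scala API; the same keys can be set from PySpark via SparkSession.builder.config):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("my-job")
  .config("spark.task.maxFailures", "8")           // how often a task may fail before the whole job is failed
  .config("spark.executor.memory", "8g")           // more headroom usually means fewer dead executors
  .config("spark.sql.shuffle.partitions", "400")   // right-size shuffles for your data volume
  .getOrCreate()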
Hope this helps

Related

Spark shuffle read takes significant time for small data

We are running the following stage DAG and experiencing long shuffle read times for relatively small shuffle data sizes (about 19 MB per task).
One interesting aspect is that waiting tasks within each executor/server have equivalent shuffle read times. Here is an example of what that means: for the following server, one group of tasks waits about 7.7 minutes and another one waits about 26 s.
Here is another example from the same stage run. The figure shows 3 executors/servers, each having uniform groups of tasks with equal shuffle read time. The blue group represents tasks killed due to speculative execution:
Not all executors are like that. There are some that finish all their tasks within seconds, pretty much uniformly, and the size of remote read data for these tasks is the same as for the ones that wait a long time on other servers.
Besides, this type of stage runs 2 times within our application runtime. The servers/executors that produce these groups of tasks with large shuffle read times are different in each stage run.
Here is an example of the task stats table for one of the servers/hosts:
It looks like the code responsible for this DAG is the following:
val comparison = data.union(output).except(data.intersect(output)).cache()
comparison.filter(_.abc != "M").count()
output.write.parquet("output.parquet")
comparison.write.parquet("comparison.parquet")
output.union(comparison).write.parquet("output_comparison.parquet")
We would highly appreciate your thoughts on this.
Apparently the problem was JVM garbage collection (GC). The tasks had to wait until GC was done on the remote executors. The equivalent shuffle read times resulted from the fact that several tasks were waiting on a single remote host performing GC. We followed the advice posted here and the problem decreased by an order of magnitude. There is still a small correlation between GC time on remote hosts and local shuffle read time. In the future we may try the external shuffle service.
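For a rough idea of the knobs involved (illustrative values only, not the exact settings from the linked advice), both executor GC behaviour and the external shuffle service are controlled through the Spark configuration, e.g.:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails")  // low-pause collector plus GC logging on the executors
  .set("spark.shuffle.service.enabled", "true")      // external shuffle service: shuffle files are served outside the executor JVM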
Since Google brought me here with the same problem, but I needed another solution...
Another possible reason for a small shuffle taking a long time to read could be that the data is split over many partitions. For example (apologies, this is PySpark as it is all I have used):
my_df_with_many_partitions\ # say has 1000 partitions
.filter(very_specific_filter)\ # only very few rows pass
.groupby('blah')\
.count()
The shuffle write from the filter above will be very small, so the following stage will have a very small amount to read. But to read it, you need to check a lot of (mostly empty) partitions. One way to address this would be:
my_df_with_many_partitions\
.filter(very_specific_filter)\
.repartition(1)\
.groupby('blah')\
.count()

Using futures in Spark-Streaming & Cassandra (Scala)

I am rather new to Spark, and I wonder what the best practice is when using Spark Streaming with Cassandra.
Usually, when performing IO, it is a good practice to execute it inside a Future (in Scala).
However, a lot of the spark-cassandra-connector seems to operate synchronously.
For example: saveToCassandra (com.datastax.spark.connector.RDDFunctions)
Is there a good reason why those functions are not async?
Should I wrap them in a Future?
While there are legitimate cases when you can benefit from asynchronous execution of driver code, it is not a general rule. You have to remember that the driver itself is not the place where the actual work is performed, and Spark execution is subject to different types of constraints, in particular:
scheduling constraints related to resource allocation and DAG topology
batch order in streaming applications
Moreover, thinking about actions like saveToCassandra as IO operations is a significant oversimplification. Spark actions are just entry points for Spark jobs, where the actual IO activity is typically just the tip of the iceberg.
If you perform multiple actions per batch and have enough resources to do it without negative impact on individual jobs, or you want to perform some type of IO in the driver thread itself, then async execution can be useful. Otherwise you are probably wasting your time.
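For the "multiple actions per batch" case, here is a minimal sketch of what wrapping actions in Futures could look like (the keyspace, table, record type and timeout are made up; it assumes spark.cassandra.connection.host is configured and the stream already exists):

import com.datastax.spark.connector._
import org.apache.spark.streaming.dstream.DStream
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

case class Event(id: Int, value: String)              // hypothetical record type

object BatchActions {
  def handle(stream: DStream[Event]): Unit =
    stream.foreachRDD { rdd =>
      val cached = rdd.cache()
      // two independent Spark jobs submitted concurrently from the driver
      val save  = Future { cached.saveToCassandra("my_keyspace", "events") }
      val count = Future { cached.count() }
      // block before the next batch so jobs from successive batches do not pile up
      Await.result(save, 10.minutes)
      Await.result(count, 10.minutes)
      cached.unpersist()
    }
}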

Scala task parallelization with actors => How does the scheduler work?

I have a task which can easily be broken into parts that can, and should, be processed in parallel to optimize performance.
I wrote a producer actor which prepares each part of the task so it can be processed independently. This preparation is relatively cheap.
I wrote a consumer actor that processes each of the independent tasks. Depending on the parameters, each piece of independent work may take up to a couple of seconds to process. All tasks are quite similar: they run the same algorithm over the same amount of data (but with different values, of course), so processing times are roughly equal.
So the producer is much faster than the consumer. Hence there may quickly be 200 or 2000 tasks prepared (depending on the parameters), all of them consuming memory while only a couple of them can be executed at once.
Now I see two simple strategies to consume and process the tasks:
Create a new consumer actor instance for each task.
Each consumer processes only one task.
I assume there would be many consumer actor instances at the same time, while only a couple of them can be processed at any point in time.
How does the default scheduler work? Can each consumer actor finish processing before the next consumer is scheduled? Or will a consumer be interrupted and replaced by another consumer, resulting in a longer time until the first task is finished? I think this actor scheduling is not the same as process or thread scheduling, but I can imagine that interruption can still have some disadvantages (e.g. more cache misses).
The other strategy is to use N instances of the consumer actor and send the tasks to process as messages to them.
Each consumer processes multiple tasks in sequence.
It is left up to me to find an appropriate value for N (the number of consumers).
The distribution of the tasks over the N consumers is also left up to me.
I could imagine a more sophisticated solution where more coordination is done between the producer and the consumers, but I can't make a good decision without knowledge about the scheduler.
If a manual solution does not result in significantly better performance, I would prefer a default solution (delivered by some part of the Scala world) where the scheduling is not left up to me (like strategy 1).
Question roundup:
How does the default scheduler work?
Can each consumer actor finish processing before the next consumer will be scheduled?
Or will a consumer be interrupted and replaced by another consumer, resulting in a longer time until the first task is finished?
What are the disadvantages when the scheduler frequently interrupts an actor and schedules another one? Cache misses?
Would this interruption and scheduling be like a context switch in process or thread scheduling?
Are there any more advantages or disadvantages comparing these strategies?
In particular, does strategy 1 have disadvantages compared to strategy 2?
Which of these strategies is the best?
Is there a better strategy than I proposed?
I'm afraid that questions like the last two cannot be answered absolutely, but maybe it is possible this time, as I have tried to make the case as concrete as possible.
I think the other questions can be answered without much discussion. With those answers it should be possible to choose the strategy that fits the requirements best.
I did some research and thinking myself and came up with some assumptions. If any of these assumptions are wrong, please tell me.
If I were you, I would go with the 2nd option. A new actor instance for each task would be too tedious. Also, with a smart choice of N, the complete system resources can be used.
This is not a complete solution, but one possible option: can't the producer stop or slow down the rate at which it produces tasks? That would be ideal. The producer would only produce more tasks when a consumer is available.
Assuming you are using Akka (if you aren't, you should ;-) ), you could use a SmallestMailboxRouter to start a number of actors (you can also add a Resizer), and the message distribution will be handled according to some rules. You can read everything about routers here.
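A rough sketch of what that looks like (Akka's routing API has changed names across versions, e.g. SmallestMailboxRouter vs. SmallestMailboxPool, so adjust to your version; Worker and Task are made up here):

import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.SmallestMailboxPool

case class Task(payload: String)

class Worker extends Actor {
  def receive = {
    case Task(p) => /* process one task */ println(s"processed $p")
  }
}

object RouterExample extends App {
  val system  = ActorSystem("tasks")
  // N = 4 routees; the router always picks the routee with the fewest queued messages
  val workers = system.actorOf(SmallestMailboxPool(4).props(Props[Worker]()), "workers")
  (1 to 20).foreach(i => workers ! Task(s"task-$i"))
}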
For such a simple task, actors give no profit at all. Implement the producer as a Thread and each task as a Runnable. Use a thread pool from java.util.concurrent to run the tasks. Use a java.util.concurrent.Semaphore to limit the number of prepared and running tasks: before creating the next task, the producer acquires the semaphore, and each task releases the semaphore at the end of its execution.
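A minimal sketch of that approach (numbers are illustrative), written in Scala but using only java.util.concurrent:

import java.util.concurrent.{Executors, Semaphore}

object PlainThreadsExample extends App {
  val pool     = Executors.newFixedThreadPool(4) // roughly one worker per core
  val inFlight = new Semaphore(8)                // at most 8 tasks prepared or running at once

  val producer = new Thread(() => {
    for (i <- 1 to 100) {
      inFlight.acquire()                         // producer blocks once the limit is reached
      pool.submit(new Runnable {
        def run(): Unit =
          try { /* process task i */ println(s"task $i done") }
          finally inFlight.release()             // free a slot when the task finishes
      })
    }
    pool.shutdown()
  })
  producer.start()
}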

How to limit concurrency when using actors in Scala?

I'm coming from Java, where I'd submit Runnables to an ExecutorService backed by a thread pool. It's very clear in Java how to set limits to the size of the thread pool.
I'm interested in using Scala actors, but I'm unclear on how to limit concurrency.
Let's just say, hypothetically, that I'm creating a web service which accepts "jobs". A job is submitted with POST requests, and I want my service to enqueue the job then immediately return 202 Accepted — i.e. the jobs are handled asynchronously.
If I'm using actors to process the jobs in the queue, how can I limit the number of simultaneous jobs that are processed?
I can think of a few different ways to approach this; I'm wondering if there's a community best practice, or at least, some clearly established approaches that are somewhat standard in the Scala world.
One approach I've thought of is having a single coordinator actor which would manage the job queue and the job-processing actors; I suppose it could use a simple int field to track how many jobs are currently being processed. I'm sure there'd be some gotchas with that approach, however, such as making sure to track when an error occurs, so as to decrement the number. That's why I'm wondering if Scala already provides a simpler or more encapsulated approach to this.
BTW I tried to ask this question a while ago but I asked it badly.
Thanks!
I'd really encourage you to have a look at Akka, an alternative Actor implementation for Scala.
http://www.akkasource.org
Akka already has a JAX-RS[1] integration and you could use that in concert with a LoadBalancer[2] to throttle how many actions can be done in parallel:
[1] http://doc.akkasource.org/rest
[2] http://github.com/jboner/akka/blob/master/akka-patterns/src/main/scala/Patterns.scala
You can override the system properties actors.maxPoolSize and actors.corePoolSize which limit the size of the actor thread pool and then throw as many jobs at the pool as your actors can handle. Why do you think you need to throttle your reactions?
You really have two problems here.
The first is keeping the thread pool used by actors under control. That can be done by setting the system property actors.maxPoolSize.
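For example (old scala.actors thread pool; set these before the first actor starts, or pass them as -D options to the JVM):

System.setProperty("actors.corePoolSize", "8")   // threads kept in the actor thread pool
System.setProperty("actors.maxPoolSize", "16")   // upper bound on the pool size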
The second is runaway growth in the number of tasks that have been submitted to the pool. You may or may not be concerned with this one; however, it is entirely possible to trigger failure conditions such as out-of-memory errors, and in some cases potentially more subtle problems, by generating too many tasks too fast.
Each worker thread maintains a deque of tasks. The deque is implemented as an array that the worker thread will dynamically enlarge up to some maximum size. In 2.7.x the queue can grow quite large, and I've seen that trigger out-of-memory errors when combined with lots of concurrent threads. The maximum deque size is smaller in 2.8. The deque can also fill up.
Addressing this problem requires you to control how many tasks you generate, which probably means some sort of coordinator, as you've outlined. I've encountered this problem when the actors that initiate a kind of data processing pipeline are much faster than the ones later in the pipeline. In order to control the process, I usually have the actors later in the chain ping back the actors earlier in the chain every X messages, and have the ones earlier in the chain stop after X messages and wait for the ping back. You could also do it with a more centralized coordinator.
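Here is a rough sketch of that ping-back scheme in Akka-style code (all names and the window size X are made up): the downstream actor acknowledges every X messages, and the upstream actor sends X messages and then waits for the acknowledgement before continuing.

import akka.actor.{Actor, ActorRef, ActorSystem, Props}

case class Work(n: Int)
case object Ack

class Downstream extends Actor {
  val window = 100
  var seen   = 0
  def receive = {
    case Work(n) =>
      // ... the slow processing happens here ...
      seen += 1
      if (seen % window == 0) sender() ! Ack       // ping back every X messages
  }
}

class Upstream(downstream: ActorRef, jobs: Iterator[Int]) extends Actor {
  val window = 100
  def receive = {
    case "start" => sendWindow()
    case Ack     => sendWindow()                   // resume once downstream has caught up
  }
  def sendWindow(): Unit = {
    var sent = 0
    while (sent < window && jobs.hasNext) { downstream ! Work(jobs.next()); sent += 1 }
  }
}

object PipelineExample extends App {
  val system     = ActorSystem("pipeline")
  val downstream = system.actorOf(Props[Downstream](), "down")
  val upstream   = system.actorOf(Props(new Upstream(downstream, (1 to 10000).iterator)), "up")
  upstream ! "start"
}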

Does it make sense to use a pool of Actors?

I'm just learning, and really liking, the Actor pattern. I'm using Scala right now, but I'm interested in the architectural style in general, as it's used in Scala, Erlang, Groovy, etc.
The case I'm thinking of is where I need to do things concurrently, such as, let's say "run a job".
With threading, I would create a thread pool and a blocking queue, and have each thread poll the blocking queue and process jobs as they come off the queue.
With actors, what's the best way to handle this? Does it make sense to create a pool of actors and somehow send them messages containing the jobs? Maybe with a "coordinator" actor?
Note: An aspect of the case which I forgot to mention was: what if I want to constrain the number of jobs my app will process concurrently? Maybe with a config setting? I was thinking that a pool might make it easy to do this.
Thanks!
A pool is a mechanism you use when the cost of creating and tearing down a resource is high. In Erlang this is not the case so you should not maintain a pool.
You should spawn processes as you need them and destroy them when you have finished with them.
Sometimes it makes sense to limit how many worker processes you have operating concurrently on a large task list, as the task each process is spawned to complete involves resource allocation. At the very least, processes use up memory, but they could also keep open files and/or sockets, which tend to be limited to only thousands and fail miserably and unpredictably once you run out.
To have a pull-driven task pool, one can spawn N linked processes that ask for a task, and hand them a function they can spawn_monitor. As soon as the monitored process has ended, they come back for the next task. Specific needs drive the details, but that is the outline of one approach.
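Since the rest of this thread is Scala, here is a rough Akka analogue of that pull-driven pool (all names are made up; in Erlang the worker would spawn_monitor the task rather than run it inline):

import akka.actor.{Actor, ActorRef, ActorSystem, Props}

case object GiveMeWork
case object NoMoreWork
case class Job(id: Int)

class Coordinator(jobs: Iterator[Job]) extends Actor {
  def receive = {
    case GiveMeWork =>
      if (jobs.hasNext) sender() ! jobs.next() else sender() ! NoMoreWork
  }
}

class Worker(coordinator: ActorRef) extends Actor {
  coordinator ! GiveMeWork                        // ask for the first task on startup
  def receive = {
    case Job(id) =>
      println(s"running job $id")                 // in Erlang this is where spawn_monitor would be used
      coordinator ! GiveMeWork                    // come back for the next task
    case NoMoreWork => context.stop(self)
  }
}

object PullPoolExample extends App {
  val system      = ActorSystem("pool")
  val coordinator = system.actorOf(Props(new Coordinator((1 to 1000).iterator.map(Job(_)))), "coordinator")
  (1 to 4).foreach(i => system.actorOf(Props(new Worker(coordinator)), s"worker-$i"))  // N = 4 workers
}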
The reason I would let each task spawn a new process is that processes do have some state, and it is nice to start off with a clean slate. A common fine-tuning is to set the minimum heap size of processes so as to minimize the number of GCs needed during their lifetime. It is also a very efficient form of garbage collection to free all memory for a process and start a new one for the next task.
Does it feel weird to use twice the number of processes like that? It's a feeling you need to overcome in Erlang programming.
There is no best way for all cases. The decision depends on the number, duration, arrival, and required completion time of the jobs.
The most obvious difference between just spawning off actors, and using pools is that in the former case your jobs will be finished nearly at the same time, while in the latter case completion times will be spread in time. The average completion time will be the same though.
The advantage of using actors is the simplicity of coding, as it requires no extra handling. The trade-off is that your actors will be competing for your CPU cores. You will not be able to have more parallel jobs than CPU cores (or HTs, whatever), no matter what programming paradigm you use.
As an example, imagine that you need to execute 100'000 jobs, each taking one minute, and the results are due next month. You have four cores. Would you spawn off 100'000 actors, each competing over the resources for a month, or would you just queue your jobs up and execute four at a time?
As a counterexample, imagine a web server running on the same machine. If you have five requests, would you prefer to serve four users in T time and one in 2T, or serve all five in 1.2T time?