Does Scala's closure feature help Apache Spark? - scala

I was having a discussion with a colleague the other day and he casually mentioned that, other than in-memory computation, closures in Scala are the reason why executing applications on Spark is so efficient. I did find the text below in the official Spark docs, but didn't quite understand it.
To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor. Prior to execution, Spark computes the task’s closure. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor.
Any help (pointers to other web links, explanations, or references) is highly valued.

The idea behind "computing the task's closure" and sending it to each executor reflects one of the premises of big data: it is faster/easier to send the computation to where the data is than to ship the data to the computation.

TL;DR No. Performance and closure serialization are orthogonal.
The primary advantage of the ability to compute and serialize a closure (hardly a Scala-specific feature) is that it enables a streamlined programming experience, especially in interactive mode.
Nonetheless, a system like Spark could easily be developed without such a feature and without imposing any performance penalty. The caveat is that the user would have to state the dependencies of each task explicitly. There are many examples of projects which use such a model with good results.
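For illustration, here is a minimal local-mode sketch (names and numbers are made up) of what "computing and serializing the closure" looks like from the user's side: a driver-side variable is captured by the function passed to map() and is shipped to the executors together with the task.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// `factor` lives on the driver, but because the function passed to map()
// refers to it, it becomes part of the task's closure and is serialized
// and sent to every executor.
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("closure-demo"))

val factor = 3
val scaled = sc.parallelize(1 to 10).map(_ * factor) // closure captures `factor`

println(scaled.collect().mkString(", "))
sc.stop()
```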

Related

Lazy evaluation and deadlocks

Recently I decided to give it a try and started reading the book "Fast Data Processing Systems with SMACK Stack" by Raul Estrada. After the first two chapters I thought it was a not-so-bad compilation of "hello worlds", until I encountered this:
As we saw, lazy evaluation also prevents deadlocks and bottlenecks, because it prevents a
process waiting for the outcome of another process indefinitely.
I was surprised and tried to find any argument supporting the claim that lazy evaluation prevents deadlocks. The statement was made in regard to Scala and Spark. Unfortunately I didn't find any arguments. As far as I know, in order to avoid deadlock you have to ensure that at least one of the following never happens:
Mutual exclusion
Hold and wait
No preemption
Circular wait
How can lazy evaluation prevent any of them?
Lazy evaluation per se doesn't prevent deadlocks; it is, however, closely tied to another concept: the computation graph. Since Spark describes computations as a lineage of dependencies, it can verify that the computation graph is acyclic (the famous DAG), so there are no situations that could cause a circular wait.
At a high level, Spark enforces this by disallowing nested transformations and actions, which means there are no hidden dependencies between stages.
The core data structure of Spark, the RDD, is immutable, like a String object in Java. Every time a transformation is applied to an existing RDD, a new RDD is created. This new RDD is represented as a vertex, and the applied transformation is represented by a directed edge from the parent RDD to the new RDD. With lazy evaluation, Spark core adds a new vertex to the DAG every time we apply a transformation to an existing RDD; Spark doesn't execute the transformations right away. An action call triggers the evaluation, which is done by executing the DAG (the lineage of the particular RDD on which the action is called). Lazy evaluation makes it impossible to have directed cycles in the lineage graphs of RDDs, hence there is no possibility of the Spark driver getting stuck in an infinite loop. This is what lazy evaluation in Spark is.
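As a small sketch of that behaviour (local mode, made-up data): the transformations below only extend the lineage, and nothing is computed until the action at the end.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("lazy-demo"))

val base     = sc.parallelize(1 to 100)   // first vertex
val doubled  = base.map(_ * 2)            // new RDD, edge base -> doubled
val filtered = doubled.filter(_ % 3 == 0) // new RDD, edge doubled -> filtered

// Nothing has executed yet; this only prints the lineage (the DAG).
println(filtered.toDebugString)

// The action triggers evaluation of the lineage.
println(filtered.count())
sc.stop()
```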
Implementing lazy evaluation with DAGs also brings the benefits of query optimization and fault tolerance to Spark.
Now, when it comes to deadlock prevention, i.e. preventing mutual exclusion, hold and wait, no preemption, and circular wait, that is all about scheduling tasks and allocating resources properly, which I think is a concern of the Spark execution environment. The Spark driver schedules the executor processes as they execute the tasks on the worker nodes of the cluster manager.

How to use DataFrames within SparkListener?

I've written a CustomListener (deriving from SparkListener, etc.) and it works fine; I can intercept the metrics.
The question is about using DataFrames within the listener itself, as that assumes using the same SparkContext; however, as of 2.1.x there is only one context per JVM.
Suppose I want to write some metrics to disk as JSON. Doing it at ApplicationEnd is not possible, only at the last jobEnd (if you have several jobs, the last one).
Is that possible/feasible?
I'm trying to measure the performance of jobs/stages/tasks, record that, and then analyze it programmatically. Maybe that is not the best way? The web UI is good, but I need to make things presentable.
I can force the creation of DataFrames upon the jobEnd event, but a few errors are thrown (basically they refer to not being able to propagate events to the listener), and in general I would like to avoid unnecessary manipulations. I want a clean set of measurements that I can record and write to disk.
SparkListeners should be as fast as possible, as a slow SparkListener would block the others from receiving events. You could use separate threads to release the main event dispatcher thread, but you're still bound to the limitation of having a single SparkContext per JVM.
That limitation is, however, easy to overcome, since you can ask for the current SparkContext using SparkContext.getOrCreate.
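A rough sketch of that approach (the class name, columns and output path are made up, and this is not a recommendation, as explained below):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}
import org.apache.spark.sql.SparkSession

// Hypothetical listener: on every job end it re-acquires the session already
// running in this JVM and appends one JSON record of metrics.
class MetricsListener extends SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    val spark = SparkSession.builder().getOrCreate() // reuses the existing context
    import spark.implicits._

    val record = Seq((jobEnd.jobId, jobEnd.time)).toDF("jobId", "completedAt")
    // Note: this write starts another Spark job, which itself emits listener
    // events -- a likely source of the "cannot propagate events" errors.
    record.write.mode("append").json("/tmp/job-metrics") // placeholder path
  }
}

// Registered e.g. with: spark.sparkContext.addSparkListener(new MetricsListener())
```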
I'd, however, not recommend this architecture. It puts too much pressure on the driver's JVM, which should rather "focus" on the application processing (not on collecting events, which it probably already does for the web UI and/or the Spark History Server).
I'd rather use Kafka or Cassandra or some other persistent storage to store the events, and have some other processing application consume them (just like the Spark History Server works).

Using futures in Spark-Streaming & Cassandra (Scala)

I am rather new to Spark, and I wonder what the best practice is when using Spark Streaming with Cassandra.
Usually, when performing IO, it is good practice to execute it inside a Future (in Scala).
However, a lot of the spark-cassandra-connector API seems to operate synchronously.
For example: saveToCassandra (com.datastax.spark.connector.RDDFunctions)
Is there a good reason why those functions are not async?
Should I wrap them in a Future?
While there are legitimate cases where you can benefit from asynchronous execution of the driver code, it is not a general rule. You have to remember that the driver itself is not the place where the actual work is performed, and Spark execution is subject to different types of constraints, in particular:
scheduling constraints related to resource allocation and DAG topology
batch order in streaming applications
Moreover, thinking about actions like saveToCassandra as IO operations is a significant oversimplification. Spark actions are just entry points for Spark jobs, where the IO activity is typically just the tip of the iceberg.
If you perform multiple actions per batch and have enough resources to do so without a negative impact on the individual jobs, or if you want to perform some type of IO in the driver thread itself, then async execution can be useful. Otherwise you are probably wasting your time.
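For the multiple-actions-per-batch case, a minimal sketch (local mode, made-up data and output path) of submitting two independent jobs concurrently from the driver might look like this:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("futures-demo"))
val cached = sc.parallelize(1 to 1000000).cache()

// Two independent actions submitted concurrently; each one is a separate job.
val countF = Future { cached.count() }
val saveF  = Future { cached.saveAsTextFile("/tmp/futures-demo-out") } // placeholder path

// Worthwhile only if the cluster has spare capacity to run both jobs at once.
println(s"count = ${Await.result(countF, 10.minutes)}")
Await.result(saveF, 10.minutes)
sc.stop()
```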

Apache-Spark Internal Job Scheduling

I came across the feature in Spark that allows you to schedule different tasks within a Spark context.
I want to use this feature in a program where I map my input RDD (from a text source) into a key-value RDD [K, V], subsequently make a composite key-value RDD [(K1, K2), V] and a filtered RDD containing some specific values.
The further pipeline involves calling some statistical methods from MLlib on both RDDs, and a join operation, followed by externalizing the result to disk.
I am trying to understand how Spark's internal fair scheduler will handle these operations. I tried reading the job scheduling documentation but got more confused by the concepts of pools, users and tasks.
What exactly are pools? Are they certain 'tasks' which can be grouped together, or are they Linux users pooled into a group?
What are users in this context? Do they refer to threads? Or is it something like SQL-context queries?
I guess it relates to how tasks are scheduled within a Spark context, but reading the documentation makes it seem like we are dealing with multiple applications with different clients and user groups.
Can someone please clarify this?
The whole pipelined procedure you described in paragraph 2:
map -> map -> map -> filter
will be handled in a single stage, just like a map() in MapReduce, if that is familiar to you. That's because there is no need to repartition or shuffle your data: you place no requirements on the correlation between records, so Spark will chain as many transformations as possible into the same stage before creating a new one, because that is much more lightweight. More information on stage separation can be found in the Spark paper Resilient Distributed Datasets, Section 5.1, Job Scheduling.
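You can see that chaining in the lineage itself; a quick sketch (the input path is a placeholder, and nothing is executed here because toDebugString only prints the lineage):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("stage-demo"))

val pipelined = sc.textFile("input.txt") // placeholder path, never actually read here
  .map(_.split(","))
  .map(parts => (parts(0), parts(1)))
  .filter { case (_, v) => v.nonEmpty }

// All narrow transformations: the lineage shows no shuffle, i.e. a single stage.
println(pipelined.toDebugString)

// A shuffle dependency (reduceByKey) is what introduces a new stage boundary.
println(pipelined.mapValues(_ => 1).reduceByKey(_ + _).toDebugString)
sc.stop()
```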
When the stage gets executed, it becomes one task set (the same task running on different partitions, in different threads), and its tasks are scheduled simultaneously from Spark's perspective.
And the fair scheduler is meant to schedule unrelated task sets, so it is not applicable here.
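To make the pool/user terminology concrete: the fair scheduler only comes into play when independent jobs are submitted concurrently, typically from separate driver threads, each tagged with a pool. A small sketch (the pool names are made up, and without a fairscheduler.xml allocation file the pools just get default settings):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("fair-scheduling-demo")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

val data = sc.parallelize(1 to 1000000).cache()

// Each thread submits its own job and tags it with a pool; the fair scheduler
// then shares cluster resources between the concurrently running jobs.
val t1 = new Thread(new Runnable {
  def run(): Unit = {
    sc.setLocalProperty("spark.scheduler.pool", "reports") // hypothetical pool
    data.map(_ * 2).count()
  }
})
val t2 = new Thread(new Runnable {
  def run(): Unit = {
    sc.setLocalProperty("spark.scheduler.pool", "adhoc")   // hypothetical pool
    data.filter(_ % 3 == 0).count()
  }
})

t1.start(); t2.start()
t1.join(); t2.join()
sc.stop()
```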

What is the benefit of using Futures over parallel collections in scala?

Is there a good reason for the added complexity of Futures (vs parallel collections) when processing a list of items in parallel?
List(...).par.foreach(x=>longRunningAction(x))
vs
Future.traverse(List(...)) (x=>Future(longRunningAction(x)))
I think the main advantage is that you can access the result of each future as soon as it is computed, whereas with a parallel collection you have to wait for the whole computation to be done. A disadvantage might be that you end up creating lots of futures. If you later end up calling Future.sequence, there really is no advantage.
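A small sketch of that difference (longRunningAction is just a stand-in here): each future's result can be handled as soon as it completes, while .par.foreach only returns once everything has been processed.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Random

// Placeholder for the real long-running work
def longRunningAction(x: Int): Int = { Thread.sleep(Random.nextInt(500)); x * x }

val items = List(1, 2, 3, 4, 5)

// Each result becomes available (and can be acted on) as soon as it is ready.
val futures = items.map(x => Future(longRunningAction(x)))
futures.foreach(_.foreach(result => println(s"done: $result")))

// By contrast, items.par.foreach(longRunningAction) only returns after every
// element has been processed.
Await.result(Future.sequence(futures), 1.minute)
```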
Parallel collections will kill off some threads as we get closer to processing all the items, so the last few items might end up being processed by a single thread.
Please see my question for more details on this behavior: Using ThreadPoolTaskSupport as tasksupport for parallel collections in scala.
Futures do no such thing, and all your threads stay in use until all items are processed. Hence, unless your tasks are so small that you don't care about the loss of parallelism for the last few tasks, and you are using a huge number of threads which have to be killed off as soon as possible, Futures are better.
Futures become useful as soon as you want to compose your deferred / concurrent computations. Futures (the good kind, anyway, such as Akka's) are monadic and hence allow you to build arbitrarily complex computational structures with all the concurrency and synchronization handled properly by the Futures library.
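A short sketch of that composability (fetchUser/fetchOrders are placeholders for real async calls): two independent lookups run concurrently, and the for-comprehension composes them without any explicit synchronization.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

def fetchUser(id: Int): Future[String]      = Future { s"user-$id" }   // placeholder
def fetchOrders(id: Int): Future[List[Int]] = Future { List(1, 2, 3) } // placeholder

// Start both computations before composing them, so they run concurrently.
val userF   = fetchUser(42)
val ordersF = fetchOrders(42)

val report: Future[String] = for {
  user   <- userF
  orders <- ordersF
} yield s"$user has ${orders.size} orders"

println(Await.result(report, 5.seconds))
```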