Lazy evaluation and deadlocks - scala

Recently I decided to give it a try and started reading the book "Fast Data Processing Systems with SMACK Stack" by Raul Estrada. After the first two chapters I thought it was a not-so-bad compilation of "hello worlds", until I encountered this:
As we saw, lazy evaluation also prevents deadlocks and bottlenecks, because it prevents a
process waiting for the outcome of another process indefinitely.
I was struck with surprise and tried to find any argumentation for the claim that lazy evaluation prevents deadlocks. The statement was made with regard to Scala and Spark. Unfortunately, I didn't find any arguments. As far as I know, in order to avoid a deadlock you have to ensure that at least one of the following never happens:
Mutual exclusion
Hold and wait
No preemption
Circular wait
How can lazy evaluation prevent any of them?

Lazy evaluation per se doesn't prevent deadlocks; it is, however, closely tied to another concept: the computation graph. Since Spark describes computations as a lineage of dependencies, it can verify that the computation graph is acyclic (the famous DAG), so there are no cases which might cause a circular wait.
At a high level, Spark enforces this by disallowing nested transformations and actions, which means that there are no hidden dependencies between stages.
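To illustrate, here is a minimal sketch (assuming an existing SparkContext sc, e.g. in a spark-shell) of what the "no nested transformations or actions" restriction means in practice:

// Illegal: referencing RDD b inside a transformation of RDD a.
// Spark forbids this and fails at runtime, so no hidden inter-stage
// dependencies (and hence no cycles) can be expressed.
val a = sc.parallelize(1 to 10)
val b = sc.parallelize(1 to 10)
// val broken = a.map(x => b.count() + x)  // would throw at runtime

// The legal pattern: compute the needed value on the driver first.
val bCount = b.count()            // action runs on the driver
val ok = a.map(x => bCount + x)   // closure captures a plain Long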

The core data structure of Spark, the RDD, is immutable, like a String object in Java. Every time a transformation is applied to an existing RDD, a new RDD gets created. This new RDD is represented as a vertex, and the applied transformation is represented by a directed edge from the parent RDD to the new RDD. With lazy evaluation, Spark core adds a new vertex to the DAG every time we apply a transformation to an existing RDD. Spark doesn't execute the transformations right away; an action call triggers the evaluation of the RDDs, which is done by executing the DAG (the lineage of the particular RDD on which the action is called). Lazy evaluation makes it impossible to have directed cycles in the lineage graphs of RDDs, hence there is no possibility of the Spark driver getting stuck in an infinite loop. This is what lazy evaluation in Spark is.
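Here is a minimal sketch (again assuming a SparkContext sc) of how transformations merely extend the lineage while an action triggers execution:

val numbers = sc.parallelize(1 to 10)     // source RDD: first vertex of the DAG
val doubled = numbers.map(_ * 2)          // new vertex + edge; nothing executes yet
val evens   = doubled.filter(_ % 4 == 0)  // another vertex; still nothing executes

println(evens.toDebugString)              // inspect the lineage Spark has recorded
val result  = evens.collect()             // action: only now is the DAG executed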
Implementing lazy evaluation with DAGs also gives Spark the benefits of query optimization and fault tolerance.
Now, when it comes to deadlock prevention, i.e. preventing mutual exclusion, hold and wait, no preemption, and circular wait, that is all about scheduling tasks and allocating resources properly, which I think is a concern of the Spark execution environment. The Spark driver schedules the executor processes as they execute the tasks on the worker nodes of the cluster manager.

Related

Flink: No operators defined in streaming topology. Cannot execute

I am trying to set up a very basic Flink job. When I try to run it, I get the following error:
Caused by: java.lang.IllegalStateException: No operators defined in streaming topology. Cannot execute.
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.getStreamGraph(StreamExecutionEnvironment.java:1535)
at org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:53)
at org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:654)
at com.test.flink.jobs.TestJobRunnable$.run(TestJob.scala:223)
The error is caused by the code below:
val streamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val messageStream = streamExecutionEnvironment.addSource(kafkaConsumer)
messageStream.keyBy(_ => "S")
streamExecutionEnvironment.execute("Test Job")
The error goes away when I add a print() call to the end of the stream:
val streamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val messageStream = streamExecutionEnvironment.addSource(kafkaConsumer)
messageStream.keyBy(_ => "S")
messageStream.print()
streamExecutionEnvironment.execute("Test Job")
I'm confused as to why print() solves this issue. Is the idea that a streaming topology does not process any of its operators until a sink is introduced? Is print() acting as a sink here? Any help would be appreciated. Thanks.
In programming language theory, lazy evaluation, or call-by-need, is an evaluation strategy which delays the evaluation of an expression until its value is needed and which also avoids repeated evaluations. The opposite of lazy evaluation is eager evaluation, sometimes known as strict evaluation.
The benefits of lazy evaluation include:
The ability to define control flow (structures) as abstractions instead of primitives.
The ability to define potentially infinite data structures. This allows for more straightforward implementation of some algorithms.
Performance increases by avoiding needless calculations, and avoiding error conditions when evaluating compound expressions.
Lazy evaluation can lead to a reduction in memory footprint, since values are created only when needed. However, lazy evaluation is difficult to combine with imperative features such as exception handling and input/output, because the order of operations becomes indeterminate.
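A small Scala illustration of these properties (LazyList requires Scala 2.13+; earlier versions used Stream):

// A potentially infinite data structure: only the elements we force are built.
val naturals: LazyList[BigInt] = LazyList.iterate(BigInt(0))(_ + 1)
println(naturals.take(5).toList)  // List(0, 1, 2, 3, 4)

// Call-by-need: the body runs once, on first access, and the result is cached.
lazy val expensive: Int = { println("computed!"); 42 }
println(expensive)  // prints "computed!" then 42
println(expensive)  // prints 42 only; no recomputation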
Generally, Flink divides operations into two classes: transformation operations and sink operations. As you guessed, Flink transformations are lazy, meaning that they are not executed until a sink operation is invoked.
Flink programs are regular programs that implement transformations on
distributed collections (e.g., filtering, mapping, updating state,
joining, grouping, defining windows, aggregating). Collections are
initially created from sources (e.g., by reading from files, Kafka
topics, or from local, in-memory collections). Results are returned
via sinks, which may, for example, write the data to (distributed)
files, or to standard output (for example, the command line terminal).
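Applied to your code, that means the job needs at least one sink before execute() is called, and that keyBy returns a new stream rather than modifying messageStream in place. A hedged sketch of the fix (keeping your kafkaConsumer source):

val streamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val messageStream = streamExecutionEnvironment.addSource(kafkaConsumer)
val keyedStream = messageStream.keyBy(_ => "S") // keep the returned stream
keyedStream.print()                             // print() registers a sink
streamExecutionEnvironment.execute("Test Job")

So yes, print() acts as a sink here: without any sink, the topology has no operators to execute.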

Does Scala's closure feature help Apache Spark?

I was having a discussion with a colleague the other day, and he casually mentioned that, other than in-memory computation, closures in Scala are the reason why executing applications on Spark is so efficient. I did find the text below in the official Spark docs, but didn't quite understand it.
To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor. Prior to execution, Spark computes the task’s closure. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor.
Any help (pointing to other weblinks, explanation, any references) is highly valued.
The idea behind "computing the task's closure" and sending it to each executor reflects one of the premises of big data: it is faster/easier to send the computation to where the data is than to ship the data to the computation.
TL;DR No. Performance and closure serialization are orthogonal.
The primary advantage of the ability to compute and serialize a closure (hardly a Scala-specific feature) is that it enables a streamlined programming experience, especially in interactive mode.
Nonetheless, a system like Spark could easily be developed without such a feature and without imposing any performance penalty. The caveat would be that the user would have to state the dependencies of each task explicitly. There are many examples of projects which use such a model with good results.
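For completeness, a minimal sketch (assuming a SparkContext sc) of what the quoted passage describes: a local driver-side variable captured by a closure and shipped to each executor:

val threshold = 10                     // lives on the driver
val data = sc.parallelize(1 to 100)
val big  = data.filter(_ > threshold)  // the closure captures `threshold`;
                                       // it is serialized and sent to executors
println(big.count())                   // 90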

Are two transformations on the same RDD executed in parallel in Apache Spark?

Let's say we have the following Scala program:
val inputRDD = sc.textFile("log.txt")
inputRDD.persist()
val errorsRDD = inputRDD.filter(x => x.contains("error"))
val warningsRDD = inputRDD.filter(x => x.contains("warning"))
println("Errors: " + errorsRDD.count() + ", Warnings: " + warningsRDD.count())
We create a simple RDD, persist it, perform two transformations on the RDD and finally have an action which uses the RDDs.
When the print is called, the transformations are executed; each transformation is, of course, parallelized depending on the cluster manager.
My main question is: are the two actions and transformations executed in parallel or in sequence? Or does errorsRDD.count() execute first and then warningsRDD.count(), in sequence?
I'm also wondering if there is any point in using persist in this example.
All standard RDD methods are blocking (with the exception of AsyncRDDActions), so actions will be evaluated sequentially. It is possible to execute multiple actions concurrently using non-blocking submission (threads, Futures), with a correct configuration of the in-application scheduler or explicitly limited resources for each action.
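A hedged sketch of such non-blocking submission, reusing the question's RDDs: each count is submitted from its own thread via a Future, so Spark can schedule the two jobs concurrently (subject to available resources and the configured scheduling mode):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val errorsF   = Future { errorsRDD.count() }    // job 1, submitted from thread 1
val warningsF = Future { warningsRDD.count() }  // job 2, submitted from thread 2

val (errors, warnings) = Await.result(errorsF.zip(warningsF), 10.minutes)
println(s"Errors: $errors, Warnings: $warnings")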
Regarding cache, it is impossible to answer without knowing the context. Depending on the cluster configuration, storage, and data locality, it might be cheaper to load the data from disk again, especially when resources are limited and subsequent actions might trigger the cache cleaner.
This will execute errorsRDD.count() first then warningsRDD.count().
The point of using persist here is that when the first count is executed, inputRDD will be in memory.
For the second count, Spark won't need to re-read the whole content of the file from storage again, so the execution time of this count will be much faster than that of the first.

Using futures in Spark-Streaming & Cassandra (Scala)

I am rather new to Spark, and I wonder what the best practice is when using Spark Streaming with Cassandra.
Usually, when performing IO, it is a good practice to execute it inside a Future (in Scala).
However, a lot of the spark-cassandra-connector API seems to operate synchronously.
For example: saveToCassandra (com.datastax.spark.connector.RDDFunctions)
Is there a good reason why those functions are not async?
Should I wrap them with a Future?
While there are legitimate cases when you can benefit from asynchronous execution of the driver code, it is not a general rule. You have to remember that the driver itself is not the place where the actual work is performed, and that Spark execution is subject to different types of constraints, in particular:
scheduling constraints related to resource allocation and DAG topology
batch order in streaming applications
Moreover, thinking about actions like saveToCassandra as IO operations is a significant oversimplification. Spark actions are just entry points for Spark jobs, where IO activity is typically just the tip of the iceberg.
If you perform multiple actions per batch and have enough resources to do it without a negative impact on individual jobs, or you want to perform some type of IO in the driver thread itself, then async execution can be useful. Otherwise you are probably wasting your time.
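If you do decide it is worth it, a minimal sketch of the wrapping you describe might look like this (the keyspace and table names are hypothetical, and `stream` is assumed to be your DStream):

import scala.concurrent.{ExecutionContext, Future}
import com.datastax.spark.connector._  // adds saveToCassandra to RDDs

implicit val ec: ExecutionContext = ExecutionContext.global

stream.foreachRDD { rdd =>
  Future {
    rdd.saveToCassandra("my_keyspace", "my_table")  // blocking Spark job
  }.failed.foreach(t => println(s"Cassandra write failed: $t"))  // log failures
}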

Apache-Spark Internal Job Scheduling

I came across the feature in Spark that allows you to schedule different jobs within a Spark context.
I want to implement this feature in a program where I map my input RDD (from a text source) into a key-value RDD [K,V], subsequently make a composite-key RDD [(K1,K2),V] and a filtered RDD containing some specific values.
Further pipeline involves calling some statistical methods from MLlib on both the RDDs and a join operation followed by externalizing the result to disk.
I am trying to understand how Spark's internal fair scheduler will handle these operations. I tried reading the job scheduling documentation but got more confused by the concepts of pools, users, and tasks.
What exactly are the pools? Are they certain 'tasks' which can be grouped together, or are they Linux users pooled into a group?
What are users in this context? Do they refer to threads, or is it something like SQL context queries?
I guess it relates to how tasks are scheduled within a Spark context. But reading the documentation makes it seem like we are dealing with multiple applications with different clients and user groups.
Can someone please clarify this?
The whole pipelined procedure you described in the second paragraph:
map -> map -> map -> filter
will be handled in a single stage, just like a map() in MapReduce, if that is familiar to you. That's because there is no need to repartition or shuffle your data, since you place no requirements on the correlation between records; Spark will chain as many transformations as possible into the same stage before creating a new one, because that is much more lightweight. More information on stage separation can be found in the Resilient Distributed Datasets paper, Section 5.1, Job Scheduling.
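You can see this chaining for yourself (assuming a SparkContext sc; the file name and key choices are just placeholders): the debug string of a map -> map -> filter pipeline shows a single set of narrow dependencies with no shuffle boundary:

val pipelined = sc.textFile("input.txt")
  .map(line => (line.take(1), line))          // [K, V]
  .map { case (k, v) => ((k, v.length), v) }  // [(K1, K2), V]
  .filter { case (_, v) => v.nonEmpty }       // keep specific values
println(pipelined.toDebugString)  // one stage: no shuffle in the lineage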
When the stage gets executed, it becomes one task set (copies of the same task running in different threads, one per partition), scheduled simultaneously from Spark's perspective.
The fair scheduler, on the other hand, is for scheduling unrelated task sets, and is not suitable here.
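For reference, a hedged sketch of what pools actually are: per-thread labels for unrelated jobs submitted concurrently within one application (the pool names here are hypothetical, and FAIR mode must be enabled explicitly):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("fair-demo").set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)

def submit(pool: String)(job: => Unit): Thread = {
  val t = new Thread {
    override def run(): Unit = {
      // The pool is a local property of the submitting thread.
      sc.setLocalProperty("spark.scheduler.pool", pool)
      job
    }
  }
  t.start()
  t
}

val t1 = submit("pool_a") { sc.parallelize(1 to 1000000).map(_ * 2).count() }
val t2 = submit("pool_b") { sc.parallelize(1 to 1000000).filter(_ % 2 == 0).count() }
t1.join(); t2.join()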