Are Spark variables lazily evaluated? - Scala

I have a spark code of structure:
val a:RDD = readData.someOperations()
a.cache()
val b = a.someOperations1()
val c = a.someOperations2()
val d = a.someOperations3()
val e = a.someOperations4()
a.unpersist()
some other code on many more RDDs (other RDDs are cached in this section and other vals are evaluated)
write variables to disk (a, b, c, d, e and others)
I wanted to know if the variables are calculated in the place they are defined or only when writing to disk. I fear that if they are evaluated only while writing to disk, I will be caching many more RDDs at the same time.

Yes, you are correct. All transformations on an RDD are lazily evaluated until an action such as collect() or save() is performed.
Transformation operations like map() and reduce() only build the logical and physical execution plans; those plans are executed, by tracing back through the parent plans, when an action is triggered.
You can check out JerryLead and JacekLaskowski for more details.
I hope this is helpful.
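For example, if the goal is to make sure a is materialized (and its cache populated) before moving on, a small action can be triggered right after caching. A minimal sketch, reusing the placeholder names from the code above:
val a = readData.someOperations()
a.cache()
a.count()                    // action: evaluates a now and populates the cache

val b = a.someOperations1()  // still lazy; they reuse the cached a when their
val c = a.someOperations2()  // own actions (e.g. the writes to disk) run

// ... write b, c, ... to disk (these are the actions) ...
a.unpersist()                // safe to release once the dependent writes are done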

Related

How to properly apply HashPartitioner before a join in Spark?

To reduce shuffling during the joining of two RDDs, I decided to partition them using HashPartitioner first. Here is how I do it. Am I doing it correctly, or is there a better way to do this?
import org.apache.spark.HashPartitioner

val rddA = ...
val rddB = ...
val numOfPartitions = rddA.getNumPartitions
val rddApartitioned = rddA.partitionBy(new HashPartitioner(numOfPartitions))
val rddBpartitioned = rddB.partitionBy(new HashPartitioner(numOfPartitions))
val rddAB = rddApartitioned.join(rddBpartitioned)
To reduce shuffling during the joining of two RDDs,
It is a surprisingly common misconception that repartitioning reduces or even eliminates shuffles. It doesn't. Repartitioning is a shuffle in its purest form. It doesn't save time, bandwidth or memory.
The rationale behind using a proactive partitioner is different: it allows you to shuffle once and then reuse that state to perform multiple by-key operations without additional shuffles (though, as far as I am aware, not necessarily without additional network traffic, since co-partitioning doesn't imply co-location, excluding cases where the shuffles occurred in a single action).
So your code is correct, but in a case where you join only once it doesn't buy you anything.
Just one comment: it is better to append .persist() after .partitionBy if there are multiple actions on rddApartitioned and rddBpartitioned; otherwise, every action will evaluate the entire lineage of rddApartitioned and rddBpartitioned, which will cause the hash partitioning to take place again and again.
val rddApartitioned = rddA.partitionBy(new HashPartitioner(numOfPartitions)).persist()
val rddBpartitioned = rddB.partitionBy(new HashPartitioner(numOfPartitions)).persist()
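To illustrate why persisting the pre-partitioned RDDs pays off, here is a minimal sketch of reusing them across several by-key operations (assuming rddA and rddB are pair RDDs, so rddApartitioned and rddBpartitioned are as defined above):
// The shuffle happened once, during partitionBy; both operations below reuse
// the persisted, already-partitioned data instead of shuffling it again.
val joined  = rddApartitioned.join(rddBpartitioned)
val reduced = rddApartitioned.reduceByKey((a, b) => a)

joined.count()   // action 1: materializes the partitioned RDDs into the cache
reduced.count()  // action 2: reads the persisted partitions, no re-partitioning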

Is a filtered RDD still in cache when performed on a cached RDD

I'm wondering if we perform the following instructions :
val rdd = sc.textFile("myfile").zipWithIndex.cache
val size = rdd.count
val filter = rdd.filter(_._2 % 2 == 0)
val sizeF = filter.count
Is the action performed on the filtered RDD executed as if it were in cache or not? Even though we create a second RDD from the first one, the information comes from the same place, so I'm wondering whether it is copied into a new object that needs to be cached, or whether the filtered object is directly linked to its parent, allowing faster actions.
Since filter is a transformation and not an action, and since Spark is lazy, nothing is actually done in the following line:
val filter = rdd.filter(_._2 % 2 == 0)
The following line:
val sizeF = filter.count
will use the cached rdd and will perform the filter transformation followed by the count action.
Hence, there is nothing to cache in the filter transformation.
Spark Guide
Transformations
The following table lists some of the common transformations supported
by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair
RDD functions doc (Scala, Java) for details.
filter(func) Return a new dataset formed by selecting those elements
of the source on which func returns true.
Note: if filter were an action and a new RDD were created, it would not be cached; only the RDDs on which the cache() operation was executed are cached.
No.
The child RDD will not be cached; the cache will keep only the original RDD on your workers, and the derived data will not be cached.
If you run another step on this filtered RDD that doesn't change the data, the response will still be fast, because Spark keeps the cached data on the workers until something actually changes.
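If the filtered RDD itself is reused across many actions, it can be worth caching it in its own right. A minimal sketch building on the code from the question:
val rdd = sc.textFile("myfile").zipWithIndex.cache()
val filtered = rdd.filter(_._2 % 2 == 0).cache()  // cache the child RDD explicitly

val sizeF  = filtered.count()  // materializes and caches the filtered data
val firstF = filtered.first()  // served from the filtered cache, no re-filtering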

Split Spark DataFrame based on condition

I need something similar to the randomSplit function:
val Array(df1, df2) = myDataFrame.randomSplit(Array(0.6, 0.4))
However, I need to split myDataFrame based on a boolean condition. Does anything like the following exist?
val Array(df1, df2) = myDataFrame.booleanSplit(col("myColumn") > 100)
I'd prefer not to make two separate .filter calls.
Unfortunately the DataFrame API doesn't have such a method; to split by a condition you'll have to perform two separate filter transformations:
import org.apache.spark.sql.functions.{col, not}

myDataFrame.cache() // recommended to prevent repeating the calculation
val condition = col("myColumn") > 100
val df1 = myDataFrame.filter(condition)
val df2 = myDataFrame.filter(not(condition))
I understand that caching and filtering twice looks a bit ugly, but please bear in mind that DataFrames are translated to RDDs, which are evaluated lazily, i.e. only when they are directly or indirectly used in an action.
If a method booleanSplit as suggested in the question existed, the result would be translated to two RDDs, each of which would be evaluated lazily. One of the two RDDs would be evaluated first and the other second, strictly after the first. At the point the first RDD is evaluated, the second RDD would not yet have "come into existence". (EDIT: Just noticed that there is a similar question for the RDD API, with an answer that gives similar reasoning.)
To actually gain any performance benefit, the second RDD would have to be (partially) persisted during the iteration of the first RDD (or, actually, during the iteration of the parent RDD of both, which is triggered by the iteration of the first RDD). IMO this wouldn't align overly well with the design of the rest of the RDD API. Not sure if the performance gains would justify this.
I think the best you can achieve is to avoid writing the two filter calls directly in your business code, by writing an implicit class with a booleanSplit utility method that does that part in a similar way to Tzach Zohar's answer, perhaps using something along the lines of myDataFrame.withColumn("__condition_value", condition).cache() so that the value of the condition is not calculated twice.
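A minimal sketch of such a utility (booleanSplit is a hypothetical name, not part of the DataFrame API; this variant simply caches the parent DataFrame rather than materializing the condition as a column):
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.not

object DataFrameSplitOps {
  implicit class BooleanSplit(df: DataFrame) {
    // Returns (rows matching the condition, rows not matching it).
    def booleanSplit(condition: Column): (DataFrame, DataFrame) = {
      val cached = df.cache() // avoid recomputing the parent for both filters
      (cached.filter(condition), cached.filter(not(condition)))
    }
  }
}

// Usage:
// import DataFrameSplitOps._
// val (df1, df2) = myDataFrame.booleanSplit(col("myColumn") > 100)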

Actions on the same apache Spark RDD cause all statements re-execution

I am using Apache Spark to process a huge amount of data. I need to execute many Spark actions on the same RDD. My code looks like the following:
val rdd = /* Get the rdd using the SparkContext */
val map1 = rdd.map(/* Some transformation */)
val map2 = map1.map(/* Some other transformation */)
map2.count
val map3 = map2.map(/* More transformation */)
map3.count
The problem is that calling the second action map3.count forces the re-execution of the transformations rdd.map and map1.map.
What the hell is going on? I think the DAG built by Spark is responsible for this behaviour.
This is expected behavior. Unless one of the ancestors can be fetched from the cache (typically this means it has been persisted explicitly, or implicitly during a shuffle), every action will recompute the whole lineage.
Recomputation can also be triggered if the RDD has been persisted but the data has been lost / removed from the cache, or if the amount of available space is too low to store all records.
In this particular case you should cache in the following order
...
val map2 = map1.map(/* Some other transformation */)
map2.cache
map2.count
val map3 = map2.map(/* More transformation */)
...
if you want to avoid repeated evaluation of rdd, map1 and map2.

Spark: Single pipelined scala command better than separate commands?

I am using Spark with Scala. I wanted to know whether a single one-line command is better than separate commands. What are the benefits, if any? Does it gain more efficiency in terms of speed? Why?
For example:
var d = data.filter(_(1)==user).map(f => (f(2),f(5).toInt)).groupByKey().map(f=> (f._1,f._2.count(x=>true), f._2.sum))
against
var a = data.filter(_(1)==user)
var b = a.map(f => (f(2),f(5).toInt))
var c = b.groupByKey()
var d = c.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
There is no performance difference between your two examples; the decision to chain RDD transformations or to explicitly represent the intermediate RDDs is just a matter of style. Spark's lazy evaluation means that no actual distributed computation will be performed until you invoke an RDD action like take() or count().
During execution, Spark will pipeline as many transformations as possible. For your example, Spark won't materialize the entire filtered dataset before it maps it: the filter() and map() transformations will be pipelined together and executed in a single stage. The groupByKey() transformation (usually) needs to shuffle data over the network, so it's executed in a separate stage. Spark would materialize the output of filter() only if it had been cache()d.
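One way to see this pipelining for yourself is to print the RDD's lineage, where indentation marks the stage boundaries introduced by shuffles. A quick sketch, reusing data and user from the question:
val d = data.filter(_(1) == user)
  .map(f => (f(2), f(5).toInt))
  .groupByKey()
  .map(f => (f._1, f._2.count(_ => true), f._2.sum))

// toDebugString shows the lineage: filter and map are pipelined in one stage,
// while groupByKey starts a new stage because it requires a shuffle.
println(d.toDebugString)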
You might need to use the second style if you want to cache an intermediate RDD and perform further processing on it. For example, if I wanted to perform multiple actions on the output of the groupByKey() transformation, I would write something like
val grouped = data.filter(_(1)==user)
.map(f => (f(2),f(5).toInt))
.groupByKey()
.cache()
val mapped = grouped.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
val counted = grouped.count()
There is no difference in terms of execution, but you might want to consider the readability of your code. I would go with your first example but like this:
var d = data.filter(_(1)==user)
.map(f => (f(2),f(5).toInt))
.groupByKey()
.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
Really, this is more of a Scala question than a Spark one. Still, as you can see from Spark's word count implementation, as shown in their documentation,
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
you don't need to worry about those kinds of things. The Scala language (through laziness, etc.) and Spark's RDD implementation handle all that at a higher level of abstraction.
If you find really bad performance, then you should take the time to explore why. As Knuth said, "premature optimization is the root of all evil."