Spark: understanding the DAG and forcing transformations - scala

Hello stackoverflow community.
I ask your help in understanding if my thoughts are correct or I'm missing some points in my Spark job.
I currently have two rdds that I want to subtract.
Both the rdds are built as different transformations on the same father RDD.
First of all, the father RDD is cached after it is obtained:
val fatherRdd = grandFather.repartition(n).mapPartitions(mapping).cache
Then the two rdds are transformed.
One is (pseudocode):
val son1 = fatherRdd.filter(filtering_logic).map(take_only_key).distinct
The other one is:
val son2 = fatherRdd.filter(filtering_logic2).map(take_only_key).distinct
The two sons are then subtracted to obtain only the keys in son1:
son1.subtract(son2)
I would expect the sequence of the transformations to be the following:
mapPartitions
repartition
caching
Then, starting from cached data, map filter map distinct on both rdds and then subtracting.
This is not happening, what I see are the two distinct operations running in parallel, apparently not exploiting the benefits of caching (there are no skipped tasks), and taking almost the same computation time.
Below is the image of the DAG taken from the Spark UI.
Do you have any suggestions for me?

You are correct in your observations. Transformations on RDDs are lazy, so caching will happen after the first time the RDD is actually computed.
If you call an action on your parent RDD, it should be computed and cached. Then your subsequent operations will operate on the cached data.
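A minimal sketch of that suggestion, assuming a local SparkSession and a made-up grandFather RDD of (key, value) pairs; the mapping and filtering logic are hypothetical stand-ins for the ones in the question. The count() on the parent is the action that populates the cache before the two branches are built.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cache-before-branching")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical stand-in for grandFather: (key, value) pairs.
val grandFather = sc.parallelize(1 to 1000).map(i => (i % 100, i))

// Same shape as in the question: repartition, mapPartitions, cache.
val fatherRdd = grandFather
  .repartition(4)
  .mapPartitions(iter => iter.map { case (k, v) => (k, v * 2) })
  .cache()

// cache() is lazy; this action computes fatherRdd and stores its partitions.
val fatherCount = fatherRdd.count()

// Both branches now start from the cached partitions.
val son1 = fatherRdd.filter { case (_, v) => v % 3 == 0 }.map(_._1).distinct()
val son2 = fatherRdd.filter { case (_, v) => v % 5 == 0 }.map(_._1).distinct()

// Keys that appear in son1 but not in son2.
val onlyInSon1 = son1.subtract(son2).collect().sorted
```

The extra count() costs one pass over the parent, so it only pays off when the parent is expensive to compute and is reused afterwards.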

Related

Best practice in Spark to filter dataframe, execute different actions on resulted dataframes and then union the new dataframes back

Since I am new to Spark, I would like to ask about a pattern that I am using but don't know whether it is bad practice: splitting a dataframe in two based on a filter, executing different actions on the two parts and then joining them back.
To give an example, having dataframe df:
val dfFalse = df.filter(df("col") === false).distinct()
val dfTrue = df.filter(df("col") === true).join(otherDf, Seq("id"), "left_anti").distinct()
val newDf = dfFalse union dfTrue
Since my original dataframe has millions of rows, I am curious whether filtering twice like this is bad practice and whether I should use some other Spark pattern that I may not be aware of. In other cases I even need to do 3 or 4 filters, apply different actions to the individual dataframes and then union them all back.
Kind regards,
There are several key points to take into account when you start using Spark to process large amounts of data and want to analyze its performance:
Spark's parallelism depends on the number of partitions in your distributed in-memory representations (RDDs or DataFrames). The work triggered by a Spark action is executed in parallel across the cluster, partition by partition. But note that there are two main kinds of transformations: narrow transformations and wide transformations. Narrow transformations are executed without a shuffle, so the data does not need to be reallocated to different partitions, which avoids data transfer among workers. Wide transformations do require a shuffle: for example, if you want to perform a distinct by a specific key, Spark must reorganize the data in order to detect the duplicates. Take a look at the documentation.
Regarding doing more or fewer filter transformations:
Spark is based on a lazy evaluation model, which means that the transformations you apply to a dataframe are not executed until you call an action, for example a write operation. The Spark optimizer evaluates your transformations in order to create an optimized execution plan. So, if you chain five or six filter operations, it will not traverse the dataframe six times (in contrast to other dataframe frameworks). The optimizer will take your filtering operations and combine them into one. See the documentation for details.
So keep in mind that Spark is a distributed in-memory data processor, and it is essential to know these details because you can spawn hundreds of cores over hundreds of GB.
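One way to check how chained filters are planned is to print the execution plan. The sketch below assumes a local SparkSession and a made-up single-column dataframe; in the optimized plan the three filters would typically show up as a single Filter node, but the exact plan text depends on the Spark version.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("chained-filters")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: ids 0 to 99.
val df = spark.range(0, 100).toDF("id")

// Three chained filters; nothing is executed yet.
val filtered = df
  .filter($"id" > 10)
  .filter($"id" % 2 === 0)
  .filter($"id" < 50)

// Prints the parsed, analyzed, optimized and physical plans.
filtered.explain(true)

// The action that actually runs the (combined) filters.
val n = filtered.count()
```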
The efficiency of this approach highly depends on the ability to reduce the amount of the overlapped data files that are scanned by both the splits.
I will focus on two techniques that allow data-skipping:
Partitions - if the predicates are based on a partitioned column, only the necessary data will be scanned, based on the condition. In your case, if you split the original dataframe in two by filtering on a partitioned column, each dataframe will scan only the corresponding portion of the data. In this case, your approach will perform really well, as no data will be scanned twice.
Filter/predicate pushdown - data stored in a format supporting filter pushdown (Parquet, for example) allows reading only the files that contain records with values matching the condition. If the values of the filtered column are distributed across many files, the filter pushdown will be inefficient, since the data is skipped on a file basis, and if a certain file contains values for both splits, it will be scanned twice. Writing the data sorted by the filtered column might improve the efficiency of the filter pushdown (on read) by gathering the same values into fewer files.
As long as you manage to split your dataframe, using the above techniques, and minimize the amount of the overlap between the splits, this approach will be more efficient.
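The partition-based split could be sketched as follows, assuming a local SparkSession, a temporary directory and a made-up string column named category used as the partition column. The printed plan should list the partition filter, which is how you can check that each split reads only its own directory.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("partitioned-split")
  .getOrCreate()
import spark.implicits._

// Hypothetical output location inside a fresh temporary directory.
val path = Files.createTempDirectory("split-demo").resolve("data").toString

// Write the data partitioned by the column that is later used for the split.
Seq((1, "x"), (2, "y"), (3, "x"), (4, "y"), (5, "x"))
  .toDF("id", "category")
  .write
  .partitionBy("category")
  .parquet(path)

val df = spark.read.parquet(path)

// Each split filters on the partition column only.
val dfX = df.filter($"category" === "x")
val dfY = df.filter($"category" === "y")

// Inspect the physical plan of one split.
dfX.explain()

val newDf = dfX.union(dfY)
val xCount = dfX.count()
val totalCount = newDf.count()
```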

Is it ok to keep multiple DataFrames in Scala List or Map for Iterative processing

I have 3 DataFrames, each with 50 columns and millions of records. I need to apply some common transformations on the above DataFrames.
Currently, I'm keeping those DataFrames in a Scala List and performing the operations on each of them iteratively.
My question is: is it OK to keep big DataFrames in a Scala collection, or will it cause any performance-related issues? If yes, what is the best way to work on multiple DataFrames in an iterative manner?
Thanks in advance.
There is no issue in doing so, as the List only holds references to your DataFrames, and DataFrames in Spark are lazily evaluated.
So until you start working on one of the DataFrames, i.e. calling an action on it, it will not be populated.
And as soon as the action has finished, the data will be cleared up (unless you cached it).
So it is equivalent to processing them separately 3 times, hence there is no issue with your approach.
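A small sketch of this pattern, assuming a local SparkSession, three tiny made-up DataFrames and a hypothetical commonTransform function. Mapping over the List only builds new query plans; a Spark job runs only when count() is called on each element.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, upper}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dataframes-in-a-list")
  .getOrCreate()
import spark.implicits._

// Hypothetical inputs sharing the same schema.
val df1 = Seq((1, "a"), (2, "b")).toDF("id", "name")
val df2 = Seq((2, "c"), (3, "d")).toDF("id", "name")
val df3 = Seq((0, "e"), (1, "f")).toDF("id", "name")

// Hypothetical common transformation applied to every DataFrame.
def commonTransform(df: DataFrame): DataFrame =
  df.filter(col("id") > 1).withColumn("name", upper(col("name")))

// The List holds references to query plans, not the data itself.
val frames = List(df1, df2, df3)
val transformed = frames.map(commonTransform)

// Each count() is an action and triggers one computation per DataFrame.
val counts = transformed.map(_.count())
```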

Map on dataframe takes too long [duplicate]

How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation?
I have tried to put cache() with the map call, but that still doesn't do the trick. My map method actually uploads results to HDFS. So it's not useless, but Spark thinks it is.
Short answer:
To force Spark to execute a transformation, you'll need to require a result. Sometimes a simple count action is sufficient.
Long answer:
Ok, let's review the RDD operations.
RDDs support two types of operations:
transformations - which create a new dataset from an existing one.
actions - which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
Conclusion
To force Spark to execute a call to map, you'll need to require a result. Sometimes a count action is sufficient.
Reference
Spark Programming Guide.
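The laziness described above can be observed with an accumulator. This is only a sketch, assuming a local SparkSession and made-up data; the accumulator is there purely to show when the function passed to map runs.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("lazy-map")
  .getOrCreate()
val sc = spark.sparkContext

// Counts how many times the function passed to map has been invoked.
val calls = sc.longAccumulator("map calls")

// Transformation only: the function is recorded, not executed.
val mapped = sc.parallelize(1 to 10).map { x => calls.add(1); x * 2 }

// No action has run yet, so the function has not been called.
val before: Long = calls.value

// count() is an action: it forces the map to run over every element.
val size = mapped.count()

val after: Long = calls.value
```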
Spark transformations only describe what has to be done. To trigger an execution you need an action.
In your case there is a deeper problem. If the goal is to create some kind of side effect, like storing data on HDFS, the right method to use is foreach. It is an action and it has clean semantics. What is also important is that, unlike map, it doesn't imply referential transparency.
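A sketch of the foreach approach, assuming a local SparkSession. The real HDFS upload is not shown: the body of the closure is a hypothetical stand-in that only increments an accumulator, so that the side effect can be observed from the driver.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("foreach-side-effect")
  .getOrCreate()
val sc = spark.sparkContext

// Tracks how many records the side effect has handled.
val uploaded = sc.longAccumulator("uploaded records")

val records = sc.parallelize(Seq("a", "b", "c"))

// foreach is an action, so it runs immediately for every element.
records.foreach { record =>
  // Hypothetical placeholder: the real upload of `record` would go here.
  uploaded.add(1)
}

val uploadedCount: Long = uploaded.value
```

When the side effect needs an expensive resource such as a connection, foreachPartition lets you create it once per partition instead of once per record.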

Bring data of DataFrame back to local node for further actions (count / show) in spark/scala

I'm using Spark 1.6 in Scala.
I know this touches on some of the core ideas behind the Spark framework, but I couldn't answer it for myself by reading different tutorials (maybe the wrong ones).
I joined two DataFrames into a new one (nDF). I know that it is not actually computed until I call show, first or count.
But since I want to do exactly this, I want to inspect nDF in different ways:
nDF.show
nDF.count
nDF.filter()
...and so on. Each of these would take a long time, since the original DataFrames are big. Couldn't I bring or copy the data into this new DataFrame, so that these new actions run as quickly as on the original sets? (At first I thought of collect, but it only returns an Array, not a DataFrame.)
This is a classic scenario. When you join 2 DataFrames, Spark doesn't perform any operation, as it evaluates lazily: work happens only when an action is called on the resulting DataFrame. Actions are operations such as show, count, collect, etc.
Now, when show and count are called on nDF, Spark evaluates the resulting DataFrame every time, i.e. once when you call show, again when you call count, and so on. This means that internally it performs the whole join every time an action is called on the resulting DataFrame.
Spark doesn't cache the resulting DataFrame in memory unless it is told to do so with df.cache / df.persist.
So when you do
val nDF = a.join(b).persist
and then call count/show, Spark will evaluate nDF once and store the resulting DataFrame in memory. Hence subsequent actions will be faster.
However, the first evaluation might be a little slower, and you will need a little more executor memory.
If the memory available to you is sufficient for the size of your dataset, what you're probably looking for is df.cache(). If your dataset is too large for that, consider using df.persist(), as it allows different levels of persistence.
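A sketch of persist with an explicit storage level, assuming a local SparkSession and two tiny made-up DataFrames joined on a hypothetical id column. It uses the SparkSession API from Spark 2.x and later; on Spark 1.6 the same calls would go through a SQLContext instead.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("persist-join")
  .getOrCreate()
import spark.implicits._

// Hypothetical inputs standing in for the two big DataFrames.
val a = Seq((1, "x"), (2, "y"), (3, "z")).toDF("id", "left")
val b = Seq((1, 10), (3, 30)).toDF("id", "right")

// Keep the join result in memory and spill to disk if it does not fit.
val nDF = a.join(b, Seq("id")).persist(StorageLevel.MEMORY_AND_DISK)

// The first action computes the join and fills the cache.
val total = nDF.count()

// Later actions reuse the persisted result instead of re-running the join.
nDF.show()
val large = nDF.filter($"right" > 10).count()

// Release the cached data once it is no longer needed.
nDF.unpersist()
```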
Hope this is what you're looking for. Cheers

(Why) do we need to call cache or persist on a RDD

When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we need to call "cache" or "persist" explicitly to store the RDD data into memory? Or is the RDD data stored in a distributed way in the memory by default?
val textFile = sc.textFile("/user/emp.txt")
As per my understanding, after the above step, textFile is a RDD and is available in all/some of the node's memory.
If so, why do we need to call "cache" or "persist" on textFile RDD then?
Most RDD operations are lazy. Think of an RDD as a description of a series of operations. An RDD is not data. So this line:
val textFile = sc.textFile("/user/emp.txt")
It does nothing. It creates an RDD that says "we will need to load this file". The file is not loaded at this point.
RDD operations that require observing the contents of the data cannot be lazy. (These are called actions.) An example is RDD.count — to tell you the number of lines in the file, the file needs to be read. So if you write textFile.count, at this point the file will be read, the lines will be counted, and the count will be returned.
What if you call textFile.count again? The same thing: the file will be read and counted again. Nothing is stored. An RDD is not data.
So what does RDD.cache do? If you add textFile.cache to the above code:
val textFile = sc.textFile("/user/emp.txt")
textFile.cache
It does nothing. RDD.cache is also a lazy operation. The file is still not read. But now the RDD says "read this file and then cache the contents". If you then run textFile.count the first time, the file will be loaded, cached, and counted. If you call textFile.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines.
The cache behavior depends on the available memory. If the file does not fit in memory, for example, then the partitions that could not be cached are not stored, and textFile.count falls back to the usual behavior and re-reads them from the file.
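The sequence above can be tried locally. This sketch assumes a local SparkSession and writes a small temporary file as a stand-in for /user/emp.txt; the file name and contents are made up.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("textfile-cache")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical stand-in for /user/emp.txt: a temp file with three lines.
val file = Files.createTempFile("emp", ".txt")
Files.write(file, "alice\nbob\ncarol\n".getBytes("UTF-8"))

// Nothing is read here; the RDD only describes the work.
val textFile = sc.textFile(file.toUri.toString)

// Also lazy: this only marks the RDD to be cached when first computed.
textFile.cache()

// First action: reads the file, caches the lines and counts them.
val first = textFile.count()

// Second action: served from the cache if the partitions fit in memory.
val second = textFile.count()
```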
I think the question would be better formulated as:
When do we need to call cache or persist on a RDD?
Spark processes are lazy, that is, nothing will happen until it's required.
To quickly answer the question: after val textFile = sc.textFile("/user/emp.txt") is issued, nothing happens to the data; only a HadoopRDD is constructed, using the file as its source.
Let's say we transform that data a bit:
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
Again, nothing happens to the data. Now there's a new RDD, wordsRDD, that contains a reference to textFile and a function to be applied when needed.
Only when an action such as wordsRDD.count is called on an RDD will the RDD chain, called the lineage, be executed. That is, the data, broken down into partitions, will be loaded by the Spark cluster's executors, the flatMap function will be applied and the result will be calculated.
On a linear lineage, like the one in this example, cache() is not needed. The data will be loaded to the executors, all the transformations will be applied and finally the count will be computed, all in memory - if the data fits in memory.
cache is useful when the lineage of the RDD branches out. Let's say you want to filter the words of the previous example into a count for positive and negative words. You could do this like that:
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
Here, each branch issues a reload of the data. Adding an explicit cache statement will ensure that processing done previously is preserved and reused. The job will look like this:
val textFile = sc.textFile("/user/emp.txt")
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
wordsRDD.cache()
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
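For that reason, cache is sometimes loosely said to 'break the lineage': it creates a reusable point from which further processing can start. Strictly speaking, the lineage is kept, so that lost cached partitions can still be recomputed; it is checkpointing that actually truncates the lineage.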
Rule of thumb: Use cache when the lineage of your RDD branches out or when an RDD is used multiple times like in a loop.
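A runnable variant of the branching example, assuming a local SparkSession. The input lines and the isPositive and isNegative functions are hypothetical, since the original snippet does not define them.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("branching-lineage")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical word lists used to classify words.
val positiveWords = Set("good", "great")
val negativeWords = Set("bad", "awful")
val isPositive = (word: String) => positiveWords.contains(word)
val isNegative = (word: String) => negativeWords.contains(word)

// Made-up input lines standing in for the contents of the text file.
val textFile = sc.parallelize(Seq(
  "good food but bad service",
  "great view",
  "awful awful noise"
))

// Cache the shared parent so that both branches reuse the split words.
val wordsRDD = textFile.flatMap(line => line.split("\\W")).cache()

// Two branches, two actions, one computation of wordsRDD.
val positiveWordsCount = wordsRDD.filter(isPositive).count()
val negativeWordsCount = wordsRDD.filter(isNegative).count()
```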
Do we need to call "cache" or "persist" explicitly to store the RDD data into memory?
Yes, only if needed.
Is the RDD data stored in a distributed way in memory by default?
No!
And these are the reasons why:
Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
For more details please check the Spark programming guide.
Below are the three situations in which you should cache your RDDs:
using an RDD many times
performing multiple actions on the same RDD
for long chains of (or very expensive) transformations
Here is another reason to add (or temporarily add) a cache call:
debugging memory issues
With the cache method, Spark provides debugging information about the size of the RDD, so the integrated Spark UI will show you the RDD's memory consumption. This has proved very helpful when diagnosing memory issues.
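A sketch of that debugging use, assuming a local SparkSession and a made-up RDD. Naming the RDD makes it easier to spot on the Storage tab of the Spark UI (port 4040 by default). The same numbers can also be read with getRDDStorageInfo, a developer API whose contents may lag slightly behind the action that filled the cache.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cache-for-debugging")
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical RDD, named so it is easy to find in the Storage tab.
val words = sc.parallelize(1 to 1000)
  .map(i => s"word-$i")
  .setName("words-debug")
  .cache()

// An action is needed before anything is cached and reported.
val total = words.count()

// Print what Spark currently reports for cached RDDs.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.numCachedPartitions} cached partitions, " +
    s"${info.memSize} bytes in memory")
}
```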