Map on dataframe takes too long [duplicate] - scala

How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation?
I have tried adding cache() to the map call, but that still doesn't do the trick. My map function actually uploads results to HDFS, so it's not useless work, but Spark treats it as if it were.

Short answer:
To force Spark to execute a transformation, you'll need to require a result. Sometimes a simple count action is sufficient.
Longer answer:
Ok, let's review the RDD operations.
RDDs support two types of operations:
transformations - which create a new dataset from an existing one.
actions - which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
Conclusion
To force Spark to execute a call to map, you'll need to require a result. Sometimes a count action is sufficient.
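For example, a minimal sketch of this idea (uploadToHdfs stands in for the question's own upload logic):
// Minimal sketch: uploadToHdfs is a placeholder for your own upload code.
val uploaded = myRDD.map { record =>
  uploadToHdfs(record)   // side effect described in the question
  record
}
uploaded.count()          // an action: forces the lazy map to actually run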
Reference
Spark Programming Guide.

Spark transformations only describe what has to be done. To trigger an execution you need an action.
In your case there is a deeper problem. If the goal is to produce some kind of side effect, like storing data on HDFS, the right method to use is foreach. It is both an action and has clean side-effect semantics. Just as important, unlike map, it doesn't imply referential transparency.
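As a sketch (uploadToHdfs and createHdfsClient are hypothetical placeholders for the question's upload logic):
// foreach is an action, so it runs immediately and is the idiomatic way
// to perform a side effect such as the HDFS upload from the question.
myRDD.foreach(record => uploadToHdfs(record))

// If each partition needs expensive setup (e.g. opening a client once),
// foreachPartition avoids doing it per record:
myRDD.foreachPartition { records =>
  val client = createHdfsClient()   // hypothetical helper
  records.foreach(r => client.upload(r))
  client.close()
}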

Related

Bring data of DataFrame back to local node for further actions (count / show) in spark/scala

I'm using Spark 1.6 in Scala.
I know this touches on some of the ideas behind the Spark framework, but I couldn't answer it for myself by reading different tutorials... (maybe the wrong ones).
I joined two DataFrames into a new one (nDF). I know that nothing is actually computed until I call show, first or count.
But since I want to do exactly this, I want to inspect nDF in different ways:
nDF.show
nDF.count
nDF.filter()
...and so on. Each of these takes a long time, since the original DataFrames are big. Couldn't I bring/copy the data into this new one, so that these actions run as quickly as they would on the original sets? (At first I thought of collect, but it only returns an Array, not a DataFrame.)
This is a classic scenario. When you join two DataFrames, Spark doesn't do any work, because it evaluates lazily and only computes when an action is called on the resulting DataFrame. Actions are operations such as show, count, print etc.
Now, when show or count is called on nDF, Spark evaluates the resulting DataFrame every time, i.e. once when you call show, again when count is called, and so on. This means it internally re-runs the whole computation every time an action is called on the resulting DataFrame.
Spark doesn't cache the resulting dataframe in memory unless it is hinted to do so by doing df.cache / df.persist.
So when you do
val nDF = a.join(b).persist
and then call count/show, Spark will evaluate nDF once and store the resulting DataFrame in memory, so subsequent actions will be faster.
However, the first evaluation may be a little slower, and you will need a little more executor memory.
If the memory available to you is sufficient for the size of your dataset, df.cache() is probably what you're looking for. If the dataset is too large, consider df.persist() instead, as it allows different levels of persistence.
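For instance, a sketch along these lines (the join condition is illustrative; a and b are the two original DataFrames):
import org.apache.spark.storage.StorageLevel

val nDF = a.join(b, a("id") === b("id"))     // illustrative join condition
nDF.persist(StorageLevel.MEMORY_AND_DISK)    // spill to disk if it doesn't fit in memory

nDF.count()      // first action materializes and caches the join
nDF.show()       // subsequent actions reuse the cached data
nDF.unpersist()  // free the memory once you are done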
Hope this is what you're looking for. Cheers

What's the difference between array.map and rdd.map in Spark/Scala?

I found that the map function on an RDD generates map tasks, while the map function on an array doesn't generate any new tasks, and the same goes for the reduce function.
What's the difference between them, and is it encouraged to use the map/reduce functions instead of for/foreach everywhere, all the time?
I found that the map function on an RDD generates map tasks, while the map function on an array doesn't generate any new tasks
This is a bit of an apples to oranges comparison.
An RDD is an abstraction of a distributed dataset. When you're operating on one, a transformation creates a lazily evaluated MapPartitionsRDD, which is itself another RDD.
When you're working on an Array[T], everything is local and in memory; the transformation can go from an Array[T] to an Array[U] (or similar), and it is evaluated strictly (eagerly).
An RDD is divided into partitions, which can themselves be viewed as smaller collections, each processed in a distributed fashion, while an Array[T] has none of these properties, unless the underlying type T is itself an Array[U].
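A small sketch of that difference, assuming an existing SparkContext sc:
// Array[T].map is strict: the squares are computed immediately, in the driver.
val localSquares: Array[Int] = Array(1, 2, 3).map(x => x * x)

// RDD.map is lazy: this only builds a MapPartitionsRDD describing the work.
val distributedSquares = sc.parallelize(1 to 3).map(x => x * x)

// Nothing has run on the cluster yet; an action triggers the tasks.
val total = distributedSquares.reduce(_ + _)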
is it encouraged to use the map/reduce functions instead of for/foreach everywhere, all the time?
Again, it's hard to answer such a question. Map-Reduce is a general programming model used for distributed parallel computations, while for and foreach are programming constructs used for a very specific purpose.
The Spark scheduler (running in the driver process) does not schedule any tasks for arrays or any data structure other than RDDs and DStreams.
It recognizes all operations (either transformations or actions) on RDDs/DStreams and schedules jobs for them, which are divided into stages and further into tasks.
scheduler-->(knows RDD & schedules)-->Jobs-->(run in)-->Stages-->(evaluated in)-->Tasks
scheduler-->(does not know array)-->ignore
When you say map/reduce, I read that as map and reduce, and foreach as foreach. They all serve different purposes, as described in the links. Make sure you are clear about what exactly you want to know here.

Is .parallelize(...) a lazy operation in Apache Spark?

Is parallelize (and other load operations) executed only at the time a Spark action is executed or immediately when it is encountered?
See def parallelize in the Spark source code.
Note the different consequences, for instance for .textFile(...): lazy evaluation would mean that, while possibly saving some memory initially, the text file has to be read every time an action is performed, and a change in the text file would affect all actions after the change.
parallelize is executed lazily: see L726 of the code you cited, which states "@note Parallelize acts lazily."
Execution in Spark is only triggered once you call an action e.g. collect or count.
So, in total, with Spark:
The chain of transformations is set up by you through the user API, e.g. parallelize, map, reduce, ...
Once an action is called, the chain of transformations is handed over for execution (and, in the case of DataFrames/Datasets, passed through the Catalyst optimizer first).
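A minimal sketch, assuming an existing SparkContext sc:
// The chain below is only recorded, not executed, when these lines run.
val rdd = sc.parallelize(1 to 1000)   // lazy: just wraps the local collection
val evens = rdd.filter(_ % 2 == 0)    // lazy transformation
val doubled = evens.map(_ * 2)        // lazy transformation

// Only this action triggers scheduling and execution of the whole chain.
val sum = doubled.reduce(_ + _)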
... (and other load operations)
parallelize is lazy (as already stated by Martin Senne and Chandan), the same as the standard data-loading operations defined on SparkContext, like textFile.
DataFrameReader.load and related methods are in general only partially lazy. Depending on the context, they may require metadata access (JDBC sources, Cassandra) or even a full data scan (CSV with schema inference).
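For example, a sketch in Spark 1.6 style (sqlContext and the path are assumed): reading JSON without a schema scans the data up front to infer one, while supplying an explicit schema keeps the read lazy until an action is called.
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Without a schema, Spark scans the (hypothetical) file to infer one.
val inferred = sqlContext.read.json("/data/events.json")

// With an explicit schema, the DataFrame is defined without touching the data yet.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)
))
val deferred = sqlContext.read.schema(schema).json("/data/events.json")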
Please note that at this point we have only defined the RDD; the data is still not loaded. This means that if you go to access the data in this RDD, it could fail. The computation to create the data in an RDD is only done when the data is referenced; for example, when it is cached or written out.
cited from the link
Not only parallelize(), all transformations are lazy.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program
Have a look at this article for an overview of all the transformations in Scala.
Have a look at this documentation for more details.

Passing RDDs into a utility function

In order to make the code cleaner, I moved some tasks/functions from the main code into a utility class/function, and then pass the entire RDDs to the function, like:
val myResultRDD = MyUtility.processData(myRDD1, myRDD2, myRDD3).saveAsTextFile("output", classOf[GzipCodec])
The code then becomes very slow compared with keeping everything in the main code. I am wondering: if I have 10 executors, does the job copy myRDD1, myRDD2, and myRDD3 to each executor, so that I end up with 10 copies of myRDD1, myRDD2, and myRDD3 in memory?
As long as you don't wastefully cache() or collect() RDDs inside your utility function, then what you have here should not affect performance.
Applying a sequence of transformations (e.g., map, fold, reduce, etc.) to any number of RDDs and combining them into a new RDD (say, through joins) does not execute anything until you call an action such as collect. Therefore, putting your sequence of transformations in a function rather than scattering them through your "main" should not affect performance.
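For illustration, a sketch of what such a utility might look like (the names and the transformations are made up):
import org.apache.spark.rdd.RDD

object MyUtility {
  // Only composes lazy transformations; passing RDDs in is cheap because an
  // RDD is just a description of work, not the data itself.
  def processData(a: RDD[String], b: RDD[String], c: RDD[String]): RDD[String] =
    a.union(b).union(c).map(_.toUpperCase)
}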

(Why) do we need to call cache or persist on an RDD

When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we need to call "cache" or "persist" explicitly to store the RDD data into memory? Or is the RDD data stored in a distributed way in the memory by default?
val textFile = sc.textFile("/user/emp.txt")
As per my understanding, after the above step, textFile is an RDD and is available in all/some of the nodes' memory.
If so, why do we need to call "cache" or "persist" on textFile RDD then?
Most RDD operations are lazy. Think of an RDD as a description of a series of operations. An RDD is not data. So this line:
val textFile = sc.textFile("/user/emp.txt")
It does nothing. It creates an RDD that says "we will need to load this file". The file is not loaded at this point.
RDD operations that require observing the contents of the data cannot be lazy. (These are called actions.) An example is RDD.count — to tell you the number of lines in the file, the file needs to be read. So if you write textFile.count, at this point the file will be read, the lines will be counted, and the count will be returned.
What if you call textFile.count again? The same thing: the file will be read and counted again. Nothing is stored. An RDD is not data.
So what does RDD.cache do? If you add textFile.cache to the above code:
val textFile = sc.textFile("/user/emp.txt")
textFile.cache
It does nothing. RDD.cache is also a lazy operation. The file is still not read. But now the RDD says "read this file and then cache the contents". If you then run textFile.count the first time, the file will be loaded, cached, and counted. If you call textFile.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines.
The cache behavior depends on the available memory. If the file does not fit in the memory, for example, then textFile.count will fall back to the usual behavior and re-read the file.
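Putting that together (same file as above):
val textFile = sc.textFile("/user/emp.txt")
textFile.cache                 // lazy: nothing is read yet

val first = textFile.count     // reads the file, caches the lines, counts them
val second = textFile.count    // answered from the cache, if it fit in memory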
I think the question would be better formulated as:
When do we need to call cache or persist on an RDD?
Spark processes are lazy, that is, nothing will happen until it's required.
To answer the question quickly: after val textFile = sc.textFile("/user/emp.txt") is issued, nothing happens to the data; only a HadoopRDD is constructed, using the file as its source.
Let's say we transform that data a bit:
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
Again, nothing happens to the data. Now there's a new RDD, wordsRDD, that contains a reference to textFile and a function to be applied when needed.
Only when an action is called on an RDD, like wordsRDD.count, will the RDD chain, called the lineage, be executed. That is, the data, broken down into partitions, will be loaded by the Spark cluster's executors, the flatMap function will be applied and the result will be calculated.
On a linear lineage, like the one in this example, cache() is not needed. The data will be loaded to the executors, all the transformations will be applied and finally the count will be computed, all in memory - if the data fits in memory.
cache is useful when the lineage of the RDD branches out. Let's say you want to filter the words of the previous example into counts of positive and negative words. You could do it like this:
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
Here, each branch triggers a reload of the data. Adding an explicit cache statement ensures that the processing done up to that point is preserved and reused. The job will look like this:
val textFile = sc.textFile("/user/emp.txt")
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
wordsRDD.cache()
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
For that reason, cache is sometimes said to 'break the lineage': the cached data can be reused for further processing without recomputing the earlier steps (strictly speaking, only checkpoint truncates the lineage; cache just lets Spark skip the recomputation).
Rule of thumb: Use cache when the lineage of your RDD branches out or when an RDD is used multiple times like in a loop.
Do we need to call "cache" or "persist" explicitly to store the RDD data into memory?
Yes, only if needed.
The RDD data stored in a distributed way in the memory by default?
No!
And these are the reasons why:
Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
For more details please check the Spark programming guide.
Below are the three situations in which you should cache your RDDs:
using an RDD many times
performing multiple actions on the same RDD
for long chains of (or very expensive) transformations
Adding another reason to add (or temporarily add) a cache method call:
for debugging memory issues
With the cache method, Spark will give you debugging information about the size of the RDD. In the integrated Spark UI you will get RDD memory-consumption info, and this has proved very helpful for diagnosing memory issues.
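For example, a small sketch (setName simply labels the cached entry shown in the Storage tab of the Spark UI):
val wordsRDD = sc.textFile("/user/emp.txt")
  .flatMap(_.split("\\W"))
  .setName("words")   // label shown in the Spark UI's Storage tab
  .cache()

wordsRDD.count()      // materializes the cache so the UI can report its size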