Is .parallelize(...) a lazy operation in Apache Spark? - scala

Is parallelize (and other load operations) executed only at the time a Spark action is executed or immediately when it is encountered?
See def parallelize in the Spark source code.
Note the different consequences, for instance for .textFile(...): lazy evaluation would mean that, while possibly saving some memory initially, the text file has to be read every time an action is performed, and that a change in the text file would affect all actions after the change.

parallelize is executed lazily: see L726 of the cited code, which states "@note Parallelize acts lazily."
Execution in Spark is only triggered once you call an action e.g. collect or count.
Thus, in total, with Spark:
A chain of transformations is set up through the user API (by you), e.g. parallelize, map, reduce, ...
Once an action is called, the chain of transformations is handed to Spark (for RDDs via the DAG scheduler; DataFrames/Datasets additionally go through the Catalyst optimizer), gets optimized where applicable, and is then executed.
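A minimal sketch of that flow (the data is made up for illustration):
// parallelize and map only record the lineage; nothing runs on the cluster yet.
val rdd = sc.parallelize(1 to 10)
val squared = rdd.map(x => x * x)
// Only this action triggers the actual computation.
val result = squared.collect()   // Array(1, 4, 9, ..., 100)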

... (and other load operations)
parallelize is lazy (as already stated by Martin Senne and Chandan), the same as the standard data-loading operations defined on SparkContext, like textFile.
DataFrameReader.load and related methods are in general only partially lazy. Depending on the context, they may require metadata access (JDBC sources, Cassandra) or even a full data scan (CSV with schema inference).
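For example, a small sketch using the Spark 2.x DataFrameReader (the people.csv file and its columns are made up): with schema inference the CSV is scanned as soon as the read is defined, while with an explicit schema the data is only read once an action runs.
import org.apache.spark.sql.types._
// Schema inference requires a pass over the file already at this point.
val inferred = spark.read.option("header", "true").option("inferSchema", "true").csv("people.csv")
// With an explicit schema, only the DataFrame definition is created here;
// the file is read when an action such as count() or show() is called.
val schema = StructType(Seq(StructField("name", StringType), StructField("age", IntegerType)))
val lazyDf = spark.read.schema(schema).option("header", "true").csv("people.csv")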

Please note that here we have just defined the RDD; the data is not loaded yet. This means that if you go to access the data in this RDD, it could fail. The computation to create the data in an RDD is only done when the data is referenced; for example, it is created by caching or writing out the RDD.
cited from the link

Not only parallelize(): all transformations are lazy.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program
Have a look at this article to learn about all the transformations in Scala.
Have a look at this documentation for more details.
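A short sketch of this (data.txt is a hypothetical file): the transformations only build a lineage, which you can inspect without triggering any work.
// Transformations only build up the lineage; the file is not read here.
val lines = sc.textFile("data.txt")
val words = lines.flatMap(_.split(" "))
val longWords = words.filter(_.length > 5)
// Prints the recorded lineage without executing anything.
println(longWords.toDebugString)
// Only an action such as count() actually reads the file and computes the result.
println(longWords.count())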

Related

Can't understand how Scala operations function in Apache Spark

Hello everyone,
I have started to learn about the Apache Spark architecture and I understand how the data flow works at a high level.
What I have learned is that Spark jobs work in stages that contain tasks operating on RDDs, which are created with lazy transformations, starting from the Spark shell (correct me if I'm wrong).
What I didn't get:
there are other types of data structures in Spark, DataFrame and Dataset, and there are functions to manipulate them,
so what is the relation between those functions and the tasks applied on RDDs?
Coding in Scala there are operations on RDDs, which makes sense as far as I know, but there are also other types of data structures that I can operate on and manipulate, like List, Stream, Vector, etc., so my question is:
so how can Spark execute these operations if they are not applied on RDDs?
I have an estimate of the time complexity of each algorithm operating on any Scala data structure from the official documentation, but I can't find estimates of the time complexity of operations on RDDs, for example count() or reduceByKey().
Why can't we evaluate exactly the complexity of a Spark app, and is it possible to evaluate the complexity of elementary tasks?
More formally, what are RDDs and what is the relation between them and everything else in Spark?
If someone could clear up this confusion, I'd be grateful.
so what is the relation between those functions and the tasks applied on RDDs?
DataFrames, Datasets, and RDDs are three APIs offered by Spark. Check out this link.
so how can Spark execute these operations if they are not applied on RDDs?
RDDs are Spark's distributed data structures; the actions and transformations specified by Spark are applied to RDDs. Within an RDD action or transformation, we do apply some native Scala operations. Each Spark API has its own set of operations. Read the link shared previously to get a better idea of how parallelism is achieved in these operations.
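For instance, a minimal sketch (the data is made up): ordinary Scala code runs inside the function you pass to a Spark transformation, and Spark applies that function to the elements on the executors in parallel.
val sentences = sc.parallelize(Seq("spark is fast", "scala is fun"))
// split, toList, map and toUpperCase are plain Scala operations on local objects,
// but Spark ships this function to the executors and applies it to every element.
val upperWords = sentences.map(s => s.split(" ").toList.map(_.toUpperCase))
upperWords.collect().foreach(println)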
Why can't we evaluate exactly the complexity of a Spark app, and is it possible to evaluate the complexity of elementary tasks?
This article explains MapReduce complexity:
https://web.stanford.edu/~ashishg/papers/mapreducecomplexity.pdf

Map on dataframe takes too long [duplicate]

How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation?
I have tried to put cache() with the map call but that still doesn't do the trick. My map method actually uploads results to HDFS. So, it's not useless, but Spark thinks it is.
Short answer:
To force Spark to execute a transformation, you'll need to require a result. Sometimes a simple count action is sufficient.
TL;DR:
Ok, let's review the RDD operations.
RDDs support two types of operations:
transformations - which create a new dataset from an existing one.
actions - which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
Conclusion
To force Spark to execute a call to map, you'll need to require a result. Sometimes a count action is sufficient.
Reference
Spark Programming Guide.
Spark transformations only describe what has to be done. To trigger an execution you need an action.
In your case there is a deeper problem. If the goal is to create some kind of side effect, like storing data on HDFS, the right method to use is foreach. It is both an action and has clean semantics. What is also important: unlike map, it doesn't imply referential transparency.
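A minimal sketch of that advice (rdd stands for your existing RDD and uploadToHdfs for your own upload function; neither is a Spark API):
// foreach is an action, so it triggers execution by itself,
// and it is the idiomatic place for side effects such as writing to HDFS.
rdd.foreach(record => uploadToHdfs(record))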

Spark scala coding standards

I am reaching out to the community to understand the impact of coding in certain ways in Scala for Spark. I received some review comments that I feel need discussion. Coming from a traditional Java and OOP background, I am writing my opinions and questions here. I would appreciate it if you could chime in with your wisdom. I am in a Spark 1.3.0 environment.
1. Use of for loops: Is it against the rules to use for loops?
There are distributed data structures like RDDs and DataFrames in Spark. We should not collect them and run for loops over them, as the computation would then happen on the driver node alone. This will have adverse effects, especially if the data is large.
But if I have a utility map that stores parameters for the job, it is fine to use a for loop on it if desired. Using a for loop or a map on such an iterable is a coding choice. It is important to understand that this map is different from a map on a distributed data structure; it will still happen on the driver node alone.
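A small sketch of the distinction (the parameter map and data are made up):
// A small utility map of job parameters: looping over it runs on the driver only, which is fine.
val jobParams = Map("inputPath" -> "/data/in", "threshold" -> "10")
for ((key, value) <- jobParams) println(s"$key = $value")
// Distributed data: keep the work in RDD/DataFrame transformations so it runs on the executors,
// instead of collecting to the driver and looping there.
val numbers = sc.parallelize(1 to 1000000)
val doubled = numbers.map(_ * 2)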
2. Use of var vs val
val is an immutable reference to an object and var is a mutable reference. In the example below
val driverDf = {
  var df = dataLoader.loadDriverInput()
  df = df.sqlContext.createDataFrame(df.rdd, df.schema)
  df.persist(StorageLevel.MEMORY_AND_DISK_SER)
}
Even though we have used var for the df, driverDf is an immutable reference to the originally created data frame. This kind of use for var is perfectly fine.
Similarly the following is also fine.
var driverDf = dataLoader.loadDriverInput()
driverDf = applyTransformations(driverDf)
def applyTransformations(driverDf: DataFrame) = { ... }
Are there any generic rules that say vars cannot be used in Spark environment?
3. Use of if-else vs case, not throw exceptions
Is it against standard practices to not throw exceptions or not to use if-else?
4. Use of hive context vs sql context
Are there any performance implications of using SQLContext vs HiveContext (I know HiveContext extends SQLContext) for underneath Hive tables?
Is it against standards to create multiple HiveContexts in a program? My job iterates through part of a whole data frame of values each time. The whole data frame is cached in one hive context. Each iteration's data frame is created from the whole data using a new hive context and cached, and this cache is purged at the end of the iteration. This approach gave me performance improvements in Spark 1.3.0. Is this approach breaking any standards?
I appreciate the responses.
Regarding loops, as you mentioned correctly, you should prefer the RDD map to perform operations in parallel and on multiple nodes. For smaller iterables, you can go with a for loop. Again, it comes down to the driver memory and the time it takes to iterate.
For smaller sets of around 100 elements, the distributed way of handling things will incur unnecessary network usage rather than give a performance boost.
val versus var is a choice at the Scala level rather than the Spark level. I have never heard of such a rule; it depends on your requirements.
Not sure what you asked. The only major negative of using if-else is that nested if-else blocks become cumbersome to handle. Apart from that, all should be fine. An exception can be thrown based on a condition; I see that as one of many ways to handle issues in an otherwise happy-path flow.
As mentioned here, the compiler generates more bytecode for match..case than for a simple if. So it's a trade-off between a simple condition check and code readability for complex condition checks.
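For illustration (count is a made-up value): a simple if for a single check, and match..case for several alternatives.
// A simple if reads fine for a single condition...
val label1 = if (count == 0) "empty" else "non-empty"
// ...while match..case tends to stay readable as the number of alternatives grows.
val label2 = count match {
  case 0 => "empty"
  case n if n < 100 => "small"
  case _ => "large"
}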
HiveContext gives the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. Please note that in Spark 2.0, both HiveContext and SQLContext are replaced with SparkSession.
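A minimal sketch of the Spark 2.x replacement (the app name and query are made up):
import org.apache.spark.sql.SparkSession
// SparkSession subsumes both SQLContext and HiveContext in Spark 2.0+.
val spark = SparkSession.builder()
  .appName("coding-standards-example")
  .enableHiveSupport()   // HiveQL parser, Hive UDFs and access to Hive tables
  .getOrCreate()
val df = spark.sql("SELECT * FROM some_hive_table")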

How to access broadcasted DataFrame in Spark

I have created two dataframes from Hive tables (PC_ITM and ITEM_SELL) which are big in size, and I am using them
frequently in SQL queries by registering them as tables. But as they are big, it takes much time
to get the query results. So I saved them as parquet files, then read them back and registered them as temporary tables. But I am still not getting good performance, so I broadcast those data frames and then registered them as tables, as below.
val PC_ITM_DF = sqlContext.parquetFile("path")
val PC_ITM_BC = sc.broadcast(PC_ITM_DF)
val PC_ITM_DF1 = PC_ITM_BC.value
PC_ITM_DF1.registerTempTable("PC_ITM")
val ITM_SELL_DF = sqlContext.parquetFile("path")
val ITM_SELL_BC = sc.broadcast(ITM_SELL_DF)
val ITM_SELL_DF1 = ITM_SELL_BC.value
ITM_SELL_DF1.registerTempTable("ITM_SELL")
sqlContext.sql("JOIN Query").show
But still I can't achieve better performance; it takes the same time as when those data frames are not broadcast.
Can anyone tell me if this is the right approach to broadcasting and using it?
You don't really need to 'access' the broadcast dataframe - you just use it, and Spark will implement the broadcast under the hood. The broadcast function works nicely, and makes more sense than the sc.broadcast approach.
It can be hard to understand where the time is being spent if you evaluate everything at once.
You can break your code into steps. The key here will be performing an action and persisting the dataframes you want to broadcast before you use them in your join.
import org.apache.spark.sql.functions.broadcast
// load your dataframe
val PC_ITM_DF = sqlContext.parquetFile("path")
// mark this dataframe to be stored in memory once evaluated
PC_ITM_DF.persist()
// mark this dataframe to be broadcast and register it for use in your SQL query
val PC_ITM_BCAST = broadcast(PC_ITM_DF)
PC_ITM_BCAST.registerTempTable("PC_ITM")
// perform an action to force the evaluation
PC_ITM_DF.count()
Doing this will ensure that the dataframe is:
loaded in memory (persist)
registered as a temp table for use in your SQL query
marked as broadcast, so it will be shipped to all executors
When you now run sqlContext.sql("JOIN Query").show, you should see a 'broadcast hash join' in the SQL tab of your Spark UI.
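If you work with the DataFrame API directly instead of registering temp tables, the same hint can be given inline (the join column item_id is made up for illustration):
import org.apache.spark.sql.functions.broadcast
// Hint that PC_ITM_DF is small enough to be shipped to every executor,
// so the join is planned as a broadcast hash join.
val joined = ITM_SELL_DF.join(broadcast(PC_ITM_DF), "item_id")
joined.show()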
I would cache the dataframes in memory. The next time they are needed, Spark will read them from memory rather than generating them from scratch each time. Here is a link to the quick start docs.
val PC_ITM_DF = sqlContext.parquetFile("path")
PC_ITM_DF.cache()
PC_ITM_DF.registerTempTable("PC_ITM")
val ITM_SELL_DF = sqlContext.parquetFile("path")
ITM_SELL_DF.cache()
ITM_SELL_DF.registerTempTable("ITM_SELL")
sqlContext.sql("JOIN Query").show
rdd.cache() is shorthand for rdd.persist(StorageLevel.MEMORY_ONLY). There are a few levels of persistence you can choose from in case your data is too big for memory-only persistence. Here is a list of persistence options. If you want to manually remove the RDD from the cache, you can call rdd.unpersist().
If you prefer to broadcast the data, you must first collect it on the driver before you broadcast it. This requires that your RDD fits in memory on your driver (and executors).
At this moment you cannot access a broadcast data frame in a SQL query. You can use a broadcast data frame only through the DataFrame API.
Refer: https://issues.apache.org/jira/browse/SPARK-16475

(Why) do we need to call cache or persist on a RDD

When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we need to call "cache" or "persist" explicitly to store the RDD data into memory? Or is the RDD data stored in a distributed way in the memory by default?
val textFile = sc.textFile("/user/emp.txt")
As per my understanding, after the above step, textFile is an RDD and is available in all or some of the nodes' memory.
If so, why do we need to call "cache" or "persist" on the textFile RDD then?
Most RDD operations are lazy. Think of an RDD as a description of a series of operations. An RDD is not data. So this line:
val textFile = sc.textFile("/user/emp.txt")
It does nothing. It creates an RDD that says "we will need to load this file". The file is not loaded at this point.
RDD operations that require observing the contents of the data cannot be lazy. (These are called actions.) An example is RDD.count — to tell you the number of lines in the file, the file needs to be read. So if you write textFile.count, at this point the file will be read, the lines will be counted, and the count will be returned.
What if you call textFile.count again? The same thing: the file will be read and counted again. Nothing is stored. An RDD is not data.
So what does RDD.cache do? If you add textFile.cache to the above code:
val textFile = sc.textFile("/user/emp.txt")
textFile.cache
It does nothing. RDD.cache is also a lazy operation. The file is still not read. But now the RDD says "read this file and then cache the contents". If you then run textFile.count the first time, the file will be loaded, cached, and counted. If you call textFile.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines.
The cache behavior depends on the available memory. If the file does not fit in the memory, for example, then textFile.count will fall back to the usual behavior and re-read the file.
I think the question would be better formulated as:
When do we need to call cache or persist on a RDD?
Spark processes are lazy, that is, nothing will happen until it's required.
To answer the question quickly: after val textFile = sc.textFile("/user/emp.txt") is issued, nothing happens to the data; only a HadoopRDD is constructed, using the file as its source.
Let's say we transform that data a bit:
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
Again, nothing happens to the data. Now there's a new RDD wordsRDD that contains a reference to textFile and a function to be applied when needed.
Only when an action is called on an RDD, like wordsRDD.count, is the RDD chain, called the lineage, executed. That is, the data, broken down into partitions, will be loaded by the Spark cluster's executors, the flatMap function will be applied and the result will be calculated.
On a linear lineage, like the one in this example, cache() is not needed. The data will be loaded to the executors, all the transformations will be applied and finally the count will be computed, all in memory - if the data fits in memory.
cache is useful when the lineage of the RDD branches out. Let's say you want to filter the words of the previous example into a count for positive and negative words. You could do this like that:
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
Here, each branch issues a reload of the data. Adding an explicit cache statement will ensure that processing done previously is preserved and reused. The job will look like this:
val textFile = sc.textFile("/user/emp.txt")
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
wordsRDD.cache()
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
For that reason, cache is sometimes said to 'break the lineage': later computations can reuse the cached data instead of recomputing the whole chain (strictly speaking, only checkpoint actually truncates the lineage).
Rule of thumb: Use cache when the lineage of your RDD branches out or when an RDD is used multiple times like in a loop.
Do we need to call "cache" or "persist" explicitly to store the RDD data into memory?
Yes, only if needed.
Is the RDD data stored in a distributed way in memory by default?
No!
And these are the reasons why:
Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
For more details please check the Spark programming guide.
Below are the three situations in which you should cache your RDDs (see the sketch after this list):
using an RDD many times
performing multiple actions on the same RDD
for long chains of (or very expensive) transformations
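For those situations, a minimal sketch of caching with an explicit storage level and releasing it afterwards (reusing the word-splitting example from above):
import org.apache.spark.storage.StorageLevel
val wordsRDD = sc.textFile("/user/emp.txt").flatMap(_.split("\\W"))
// Spill partitions to disk if they do not fit in memory, instead of recomputing them.
wordsRDD.persist(StorageLevel.MEMORY_AND_DISK)
val totalWords = wordsRDD.count()
val distinctWords = wordsRDD.distinct().count()
// Release the cached data once it is no longer needed.
wordsRDD.unpersist()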
Adding another reason to add (or temporarily add) a cache method call:
to debug memory issues
With the cache method, Spark will give debugging information regarding the size of the RDD. In the Spark UI you will get RDD memory consumption info, which proves very helpful for diagnosing memory issues.