Spark Repeating `DataFrame` Processing Work? - scala

I am using Apache Spark (1.6) for a ML task and I noticed that Spark seems to be repeating processing on a single DataFrame.
My code looks something like this:
val df1 = sqlContext.read.parquet("data.parquet")
val df2 = df1.withColumn("new", explode(expensiveTextProcessing($"text"))
println(df2.count)
... (no changes to df2)
println(df2.count)
So I know that my withColumn is a transformation and count is an action so the count will seem like the longer operation.
However, I noticed that the second time I run df2.count takes just as long as the first df2.count. Additionally, a NLP tool I am using throws a few warnings during expensiveTextProcessing and these warnings show up during both of the count calls.
Is Spark doing all of the expensiveTextProcessing each time I use the data in df2?
(for more context you can see my actual Jupyter Notebook here)

DataFrame like RDD has lineage which used to built resulting DataFrame during action call. As you call count the results from all executors are collected to driver. You can check Spark Web UI DAG representation and staging of DataFrame and also duration and localization of processes in order to implement transformations.

Related

pyspark dataframe writing results

I am working on a project where i need to read 12 files average file size is 3 gb. I read them with RDD and create dataframe with spark.createDataFrame. Now I need to process 30 Sql queries on the dataframe most them need output of previous one like depend on each other so i save all my intermediate state in dataframe and create temp view for that dataframe.
The program takes only 2 minutes for execute part but the problem is while writing them to csv file or show the results or calling count() function takes too much time. I have tries re-partition thing but still it is taking to much time.
1.What could be the solution?
2.Why it is taking too much time to write even all processing taking small amount of time?
I solved above problem with persist and cache in pyspark.
Spark is a lazy programming language. Two types of Spark RDD operations are- Transformations and Actions. A Transformation is a function that produces new RDD from the existing RDDs but when we want to work with the actual dataset, at that point Action is performed. When the action is triggered after the result, new RDD is not formed like transformation.
Every time i do some operation it was just transforming, so if i call that particular dataframe it will it parent query every time since spark is lazy,so adding persist stopped calling parent query multiple time. It saved lots of processing time.

How can I obtain the DAG of an Apache Spark job without running it?

I have some Scala code that I can run with Spark using spark-submit. From what I understood, Spark creates a DAG in order to schedule the operation.
Is there a way to retrieve this DAG without actually performing the heavy operations, e.g. just by analyzing the code ?
I would like a useful representation such as a data structure or at least a written representation, not the DAG visualization.
If you are using dataframes (spark sql) you can use df.explain(true) to get the plan and all operations (before and after optimization).
If you are using rdd you can use rdd.toDebugString to get a string representation and rdd.dependencies to get the tree itself.
If you use these without the actual action you would get a representation of what is going to happen without actually doing the heavy lifting.

How to access broadcasted DataFrame in Spark

I have created two dataframes which are from Hive tables(PC_ITM and ITEM_SELL) and big in size and I am using those
frequently in the SQL query by registering as table.But as those are big, it is taking much time
to get the query result.So I have saved them as parquet file and then read them and registered as temporary table.But still I am not getting good performance so I have broadcasted those data-frames and then registered as tables as below.
PC_ITM_DF=sqlContext.parquetFile("path")
val PC_ITM_BC=sc.broadcast(PC_ITM_DF)
val PC_ITM_DF1=PC_ITM_BC
PC_ITM_DF1.registerAsTempTable("PC_ITM")
ITM_SELL_DF=sqlContext.parquetFile("path")
val ITM_SELL_BC=sc.broadcast(ITM_SELL_DF)
val ITM_SELL_DF1=ITM_SELL_BC.value
ITM_SELL_DF1.registerAsTempTable(ITM_SELL)
sqlContext.sql("JOIN Query").show
But still I cant achieve performance it is taking same time as when those data frames are not broadcasted.
Can anyone tell if this is the right approach of broadcasting and using it?`
You don't really need to 'access' the broadcast dataframe - you just use it, and Spark will implement the broadcast under the hood. The broadcast function works nicely, and makes more sense that the sc.broadcast approach.
It can be hard to understand where the time is being spent if you evaluate everything at once.
You can break your code into steps. The key here will be performing an action and persisting the dataframes you want to broadcast before you use them in your join.
// load your dataframe
PC_ITM_DF=sqlContext.parquetFile("path")
// mark this dataframe to be stored in memory once evaluated
PC_ITM_DF.persist()
// mark this dataframe to be broadcast
broadcast(PC_ITM_DF)
// perform an action to force the evaluation
PC_ITM_DF.count()
Doing this will ensure that the dataframe is
loaded in memory (persist)
registered as temp table for use in your SQL query
marked as broadcast, so will be shipped to all executors
When you now run sqlContext.sql("JOIN Query").show you should now see a 'broadcast hash join' in the SQL tab of your Spark UI.
I would cache the rdds in memory. The next time they are needed, spark will read the RDD from memory rather than generating the RDD from scratch each time. Here is a link to the quick start docs.
val PC_ITM_DF = sqlContext.parquetFile("path")
PC_ITM_DF.cache()
PC_ITM_DF.registerAsTempTable("PC_ITM")
val ITM_SELL_DF=sqlContext.parquetFile("path")
ITM_SELL_DF.cache()
ITM_SELL_DF.registerAsTempTable("ITM_SELL")
sqlContext.sql("JOIN Query").show
rdd.cache() is shorthand for rdd.persist(StorageLevel.MEMORY_ONLY). There are a few levels of persistence you can choose from incase your data is too big for memory only persistence. Here is a list of persistence options. If you want to manually remove the RDD from the cache you can call rdd.unpersist().
If you prefer to broadcast the data. You must first collect it on the driver before you broadcast it. This requires that your RDD fits in memory on your driver (and executers).
At this moment you can not access broadcasted data frame in the SQL query. You can use brocasted data frame through only through data frames.
Refer: https://issues.apache.org/jira/browse/SPARK-16475

Is .parallelize(...) a lazy operation in Apache Spark?

Is parallelize (and other load operations) executed only at the time a Spark action is executed or immediately when it is encountered?
See def parallelize in spark code
Note the different consequences for instance for .textFile(...): Lazy evaluation would mean that while possibly saving some memory initially, the text file has to be read every time an action is performed and that a change in the text file would affect all actions after the change.
parallelize is executed lazily: see L726 of your cited code stating "#note Parallelize acts lazily."
Execution in Spark is only triggered once you call an action e.g. collect or count.
Thus in total with Spark:
Chain of transformations is set up by the user API (you) e.g. parallelize, map, reduce, ...
Once an action is called the chain of transformations is "put" into the Catalyst optimizer, gets optimized and then executed.
... (and other load operations)
parallelize is lazy (as already stated by the Martin Senne and Chandan), same as standard data loading operations defined on SparkContext like textFile.
DataFrameReader.load and related methods are in general only partialy lazy. Depending on a context it may require metadata access (JDBC sources, Cassandra) or even full data scan (CSV with schema inference).
Please note that here we have just defined RDD, data is not loaded still. This means that if you go to access the data in this RDD it could fail. The computation to create the data in an RDD is only done when the data is referenced; for example, it is created by caching or writing out the RDD.
cited from the link
Not only parallelize(), all transformations are lazy.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset.
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program
Have a look at this article to know all transformations in Scala.
Have a look at this documentation for more details.

how to cache data in apache spark that can be used by other spark job

I have a simple spark code in which I read a file using SparkContext.textFile() and then doing some operations on that data, and I am using spark-jobserver for getting output.
In code I am caching the data but after job ends and I execute that spark-job again then it is not taking that same file which is already there in cache. So, every time file is getting loaded which is taking more time.
Sample Code is as:
val sc=new SparkContext("local","test")
val data=sc.textFile("path/to/file.txt").cache()
val lines=data.count()
println(lines)
Here, if I am reading the same file then when I execute it second time then it should take data from cache but it is not taking that data from cache.
Is there any way using which I can share the cached data among multiple spark jobs?
Yes - by calling persist/cache on the RDD you get and submitting additional jobs on the same context