Does Spark do one pass through the data for multiple withColumn? - scala

Does Spark do one or multiple passes through data when multiple withColumn functions are chained?
For example:
val dfnew = df.withColumn("newCol1", f1(col("a")))
.withColumn("newCol2", f2(col("b")))
.withColumn("newCol3", f3(col("c")))
where
df is my input DataFrame containing at least columns a, b, c
dfnew is output DataFrame with three new columns newCol1, newCol2, newCol3
f1, f2, f3 are some user-defined functions or some Spark operations on columns, like cast, etc. In my project I can have even 30 independent withColumn functions chained with foldLeft.
Important
I am assuming here that f2 does not depend on the result of f1, and f3 does not depend on the results of f1 and f2. The functions could be performed in any order. There is no shuffle in any of the functions.
My observations
All functions are in the same stage.
Adding a new withColumn does not increase the execution time in a way that would suggest additional passes through the data.
For example, I have tested a single SQLTransformer with a select statement containing all functions versus multiple separate SQLTransformers, one for each function, and the execution time was similar.
Questions
Will Spark make one pass through the data, or three passes, one for each withColumn?
Does it depend on the type of functions f1, f2, f3? UDF vs generic Spark operations?
If the functions f1, f2, f3 are inside the same stage, does it mean they are in the same data pass?
Does the number of passes depend on shuffles within the functions? What if there is no shuffle?
If I chain the withColumn functions with foldLeft, will it change the number of passes?
I could do something similar with three SQLTransformers, or just one SQLTransformer with all three transformations in the same select_statement. How many passes through the data would that do?
Basically, does it even matter, i.e. will the time of execution be similar for 1 and 3 passes?

Will Spark make one pass through the data, or three passes, one for each withColumn?
Spark will "make one pass" through the data. Why? Because Spark doesn't actually do anything when this code is reached; it just builds an execution plan that tells it what to do once dfnew is used, i.e. once some action (e.g. count, collect, write, etc.) is executed on it. At that point it can compute all the functions at once for each record.
Does it depend on the type of functions f1, f2, f3? UDF vs generic Spark operations?
No.
If the functions f1, f2, f3 are inside the same stage, does it mean they are in the same data pass?
Yes.
Does the number of passes depend on shuffles within the functions? What if there is no shuffle?
Almost. First, as long as no caching / checkpointing is used, the number of passes over the data will be the number of actions executed on the resulting dfnew DataFrame. Then, each shuffle means each record is read, potentially sent between worker nodes, potentially written to disk, and then read again.
If I chain the withColumn functions with foldLeft, will it change the number of passes?
No. It will only change the way the above-mentioned plan is constructed, but it will have no effect on how this plan looks (would be the exact same plan), so the computation will remain the same.
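As a sketch, continuing the toy df from above (the 30 column expressions are made up), the foldLeft version produces exactly the same plan as writing the calls out by hand:
// Hypothetical: 30 independent new columns, each derived from column "a".
val newCols = (1 to 30).map(i => (s"newCol$i", col("a") + i))

val dfFolded = newCols.foldLeft(df) { case (acc, (name, expr)) =>
  acc.withColumn(name, expr)
}

// Same single-stage, single-projection plan as chaining the calls manually.
dfFolded.explain()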
I could do something similar with three SQLTransformers, or just one SQLTransformer with all three transformations in the same select_statement. How many passes through the data would that do?
Again, this won't make any difference, as the execution plan will remain the same.
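A rough sketch of the two SQLTransformer variants (the expressions are made up, and df is the toy DataFrame from above); both reduce to one projection over the source, so one pass either way:
import org.apache.spark.ml.feature.SQLTransformer

val allInOne = new SQLTransformer().setStatement(
  "SELECT *, CAST(a AS STRING) AS newCol1, b * 2 AS newCol2, c + 1 AS newCol3 FROM __THIS__")

val t1 = new SQLTransformer().setStatement("SELECT *, CAST(a AS STRING) AS newCol1 FROM __THIS__")
val t2 = new SQLTransformer().setStatement("SELECT *, b * 2 AS newCol2 FROM __THIS__")
val t3 = new SQLTransformer().setStatement("SELECT *, c + 1 AS newCol3 FROM __THIS__")

// Compare the plans: both collapse to a single projection over the scan.
allInOne.transform(df).explain()
t3.transform(t2.transform(t1.transform(df))).explain()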
Basically, does it even matter, i.e. will the time of execution be similar for 1 and 3 passes?
Not sure exactly what this means, but it sounds like this is not quite right: execution time is mostly a function of the number of shuffles and the number of actions (assuming the same data and the same cluster setup).

Related

Best practice in Spark to filter dataframe, execute different actions on resulted dataframes and then union the new dataframes back

Since I am new to Spark, I would like to ask about a pattern that I am using but don't know if it's bad practice: splitting a dataframe in two based on a filter, executing different actions on the two parts, and then unioning them back together.
To give an example, having dataframe df:
val dfFalse = df.filter(col === false).distinct()
val dfTrue = df.filter(col === true).join(otherDf, Seq(id), "left_anti").distinct()
val newDf = dfFalse union dfTrue
Since my original dataframe has millions of rows, I am curious whether filtering it twice is bad practice and whether I should use some other pattern in Spark that I may not be aware of. In other cases I even need to apply 3 or 4 filters, run different actions on the individual dataframes, and then union them all back.
Kind regards,
There are several key points to take into account when you start to use Spark to process large amounts of data, in order to reason about performance:
Spark parallelism depends on the number of partitions in your distributed in-memory representations (RDDs or DataFrames). That means the processing (Spark actions) will be executed in parallel across the cluster. But note that there are two main kinds of transformations: narrow transformations and wide transformations. The former are operations that execute without a shuffle, so the data does not need to be reallocated to different partitions, thus avoiding data transfer among the workers. The latter do need a shuffle: for instance, if you want to perform a distinct by a specific key, Spark must reorganize the data in order to detect the duplicates. Take a look at the docs, and at the small sketch below.
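A small illustration, using a toy DataFrame with made-up columns:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("narrow-vs-wide").getOrCreate()
import spark.implicits._

val df = Seq((1, false), (2, true), (2, true)).toDF("id", "flag")

// Narrow: filter is applied partition by partition, no data movement.
val narrow = df.filter(col("flag") === false)

// Wide: distinct needs a shuffle so that equal rows meet in the same partition.
val wide = df.filter(col("flag") === true).distinct()

// The physical plan of the wide branch contains an Exchange (shuffle) node.
wide.explain()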
Regarding doing more or fewer filter transformations:
Spark is based on a lazy evaluation model, which means that the transformations you apply to a dataframe are not executed until you call an action, for example a write operation. The Spark optimizer evaluates your transformations in order to create an optimized execution plan. So if you have five or six filter operations, it will never traverse the dataframe six times (in contrast to some other dataframe frameworks): the optimizer will take your filter operations and combine them into one. Here are some details, and a small sketch follows below.
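For example, continuing the toy df from the sketch above:
// Several chained filters...
val filtered = df
  .filter(col("id") > 0)
  .filter(col("id") < 100)
  .filter(col("flag") === true)

// ...show up in the optimized logical plan as a single Filter with the
// conditions ANDed together, so the data is still traversed only once.
filtered.explain(true)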
So keep in mind that Spark is a distributed in-memory data processor, and it is a must to know these details, because you can spawn hundreds of cores over hundreds of GBs.
The efficiency of this approach depends heavily on how much you can reduce the amount of overlapping data files scanned by both splits.
I will focus on two techniques that allow data-skipping:
Partitions - if the predicates are based on a partitioned column, only the necessary data will be scanned, based on the condition. In your case, if you split the original dataframe in 2 by filtering on a partitioned column, each dataframe will scan only the corresponding portion of the data. In this case your approach will perform really well, as no data will be scanned twice.
Filter/predicate pushdown - data stored in a format supporting filter pushdown (Parquet, for example) allows reading only the files that contain records with values matching the condition. If the values of the filtered column are spread across many files, the filter pushdown will be inefficient, since data is skipped on a per-file basis, and if a certain file contains values for both splits it will be scanned twice. Writing the data sorted by the filtered column can improve the efficiency of the filter pushdown (on read) by gathering the same values into fewer files.
As long as you manage to split your dataframe using the above techniques and minimize the overlap between the splits, this approach will be efficient (a small partitioning sketch follows below).
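A partitioning sketch, assuming a hypothetical flag column and made-up paths:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("partition-pruning").getOrCreate()
import spark.implicits._

val df = Seq((1, false), (2, true), (3, false)).toDF("id", "flag")

// Write the data partitioned by the column you filter on.
df.write.mode("overwrite").partitionBy("flag").parquet("/tmp/events_by_flag")

val partitioned = spark.read.parquet("/tmp/events_by_flag")

// Each split now reads only its own partition directory (see PartitionFilters
// in the FileScan node of the plan), so no file is scanned by both splits.
val dfFalse = partitioned.filter(col("flag") === false)
val dfTrue  = partitioned.filter(col("flag") === true)
dfFalse.union(dfTrue).explain()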

Map on dataframe takes too long [duplicate]

How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation?
I have tried adding cache() to the map call, but that still doesn't do the trick. My map method actually uploads results to HDFS, so it's not useless, but Spark thinks it is.
Short answer:
To force Spark to execute a transformation, you'll need to require a result. Sometimes a simple count action is sufficient.
Longer answer:
Ok, let's review the RDD operations.
RDDs support two types of operations:
transformations - which create a new dataset from an existing one.
actions - which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
Conclusion
To force Spark to execute a call to map, you'll need to require a result. Sometimes a count action is sufficient.
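A minimal sketch, assuming an existing SparkContext sc and a hypothetical side-effecting step inside the map:
val mapped = sc.parallelize(1 to 10).map { x =>
  // stand-in for the side-effecting work (e.g. an upload); nothing runs yet
  println(s"processing $x")
  x * 2
}

// The action is what actually pushes every element through the map function.
mapped.count()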
Reference
Spark Programming Guide.
Spark transformations only describe what has to be done. To trigger an execution you need an action.
In your case there is a deeper problem. If the goal is to create some kind of side effect, like storing data on HDFS, the right method to use is foreach. It is an action, and it has clean semantics. What is also important: unlike map, it doesn't imply referential transparency.
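A rough sketch of the foreach approach, assuming an existing RDD and a hypothetical upload helper:
rdd.foreachPartition { partition =>
  // set up any per-partition resources (e.g. a client/connection) here
  partition.foreach { record =>
    // uploadToHdfs(record)  // hypothetical side effect; replace with your own
    println(record)
  }
  // tear down per-partition resources here
}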

Bring data of DataFrame back to local node for further actions (count / show) in spark/scala

I'm using Spark 1.6 in Scala.
I know it's some of the ideas behind the Spark Framework. But I couldn't answer it to myself by reading different tutorials.. (maybe the wrong ones).
I joined two DataFrames into a new one (nDF). Now I know it's not actually computed until I call show, first or count.
But since I want to do exactly this, I want to inspect nDF in different ways:
nDF.show
nDF.count
nDF.filter()
...and so on. Each of these takes a long time every time, since the original DataFrames are big. Couldn't I bring/copy the data into this new one, so that I could run these new actions as quickly as on the original sets? (First I thought of collect, but it only returns an Array, not a DataFrame.)
This is a classic scenario. When you join 2 DataFrames, Spark doesn't do any work, because it evaluates lazily and only runs when an action is called on the resulting dataframe. Actions are things like show, count, print, etc.
Now when show or count is called on nDF, Spark evaluates the resultant dataframe every time, i.e. once when you call show, again when count is called, and so on. This means that internally it performs the map/reduce work every time an action is called on the resultant dataframe.
Spark doesn't cache the resulting dataframe in memory unless it is hinted to do so by doing df.cache / df.persist.
So when you do
val nDF = a.join(b).persist
And then call the count/show it will evaluate the nDF once and store the resulting dataframe in memory. Hence subsequent actions will be faster.
However, the first evaluation might be a little slower, and you also need a little more executor memory.
If the memory available to you is sufficient for the size of your dataset, what you're probably looking for is df.cache(). If your dataset is too large, consider using df.persist(), as it allows different levels of persistence.
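A short sketch, assuming nDF is the joined DataFrame from the question:
import org.apache.spark.storage.StorageLevel

// cache() uses the default storage level; persist() lets you pick one explicitly.
val cached = nDF.persist(StorageLevel.MEMORY_AND_DISK)

cached.count()      // first action materializes the join and stores the result
cached.show()       // subsequent actions reuse the cached data
cached.unpersist()  // release the storage when you are done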
Hope this is what you're looking for. Cheers

Caching an intermediate dataframe when there is only one action

In Spark, let's say that I have a dataframe that undergoes some 100 transformations and then there is a single action applied. Will caching an intermediate dataframe help under any circumstances? I can see that caching will help when there is more than one action applied on a dataframe but how about a single action?
To clarify:
I have a dataframe A using which I obtain 2 different dataframes B and C. Then I do a union of B and C to form D on which I apply an action. Imagine this happening in a very complicated scenario with lots of branches. Will caching A speed up the process?
Caching a DataFrame has no benefit on the first time it needs to be evaluated (in fact it actually has a performance cost and obviously increases memory use). It's only on reuse that caching helps.
If you split A into B and C, then use both B and C, you've just used A twice, so caching it will help.
The number of actions is not the important measure; what matters is the execution path.
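A sketch of the branching scenario described in the question (the source path, columns, and transformations are made up), assuming a SparkSession spark:
import org.apache.spark.sql.functions.col

val A = spark.read.parquet("/data/input")   // hypothetical expensive-to-build dataframe
  .withColumn("derived", col("x") * 2)
  .cache()                                  // A feeds both branches below

val B = A.filter(col("derived") > 0)
val C = A.filter(col("derived") <= 0).dropDuplicates()
val D = B.union(C)

// A single action, but evaluating the union walks both branches,
// so A would be recomputed twice without the cache.
D.write.parquet("/data/output")
A.unpersist()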

passing RDD into an utility function

In order to make the code cleaner, I moved some tasks/functions from the main code into a utility class/function, and then pass the entire RDDs to the function, like:
val myResultRDD = MyUtiltity.processData(myRDD1, myRDD2, myRDD3).saveAsTextFile("output", classOf[GzipCodec])
Then the code becomes very slow compared with keeping everything in the main code. I am wondering: if I have 10 executors, does the job copy myRDD1, myRDD2, and myRDD3 to each executor, so that I end up with 10 copies of myRDD1, myRDD2, and myRDD3 in memory?
As long as you don't wastefully cache() or collect() RDDs inside your utility function, then what you have here should not affect performance.
Applying a sequence of transformations (e.g., map, filter, join, etc.) to any number of RDDs and combining them into a new RDD (say, through joins or unions) does not execute anything until you call an action (such as collect or saveAsTextFile) on the resulting RDD. Therefore, putting your sequence of transformations in a function or having them scattered in your "main" should not affect performance.
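As a sketch (the utility object and its logic here are illustrative, not your actual code): passing RDDs to a function only passes around lightweight plan handles, not the data itself.
import org.apache.spark.rdd.RDD

object MyUtility {
  def processData(a: RDD[String], b: RDD[String], c: RDD[String]): RDD[String] = {
    // only transformations here: nothing executes and nothing is copied to executors
    a.union(b).union(c).map(_.toUpperCase)
  }
}

// The data is only materialized by the action at the call site, e.g.:
// MyUtility.processData(myRDD1, myRDD2, myRDD3).saveAsTextFile("output")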