Spark dataframes comparison - performance optimization in Scala - scala

I am trying to compare 2 dataframes in Scala. Here are the steps I am using: read data from the database into source and target dataframes respectively, get the counts of each and compare them, generate dataframes with the differences (source minus target and target minus source), and compare the counts of those difference dataframes. I am using accumulators to generate the counts, but this is taking a very long time. Any suggestions would help.
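A minimal sketch of those steps, assuming Spark 2.x and a JDBC source; the connection URL and table names below are placeholders, not from the original post. Using except() and plain count() avoids the accumulator bookkeeping:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("df-compare").getOrCreate()

val jdbcUrl = "jdbc:postgresql://dbhost:5432/mydb"   // placeholder connection string

def readTable(name: String) =
  spark.read.format("jdbc").option("url", jdbcUrl).option("dbtable", name).load()

val source = readTable("source_table")   // placeholder table names
val target = readTable("target_table")

// Both sides are reused (a count plus a set difference), so cache them once.
source.cache(); target.cache()

val sourceCount = source.count()
val targetCount = target.count()

// Rows present on one side but missing on the other.
val sourceMinusTarget = source.except(target)
val targetMinusSource = target.except(source)

// Plain count() on the difference dataframes replaces the accumulators.
val diffCounts = (sourceMinusTarget.count(), targetMinusSource.count())

Note that except() is a wide transformation, so the shuffle it triggers will dominate the runtime; caching the inputs at least keeps the database from being read more than once.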

Related

Best practice in Spark to filter dataframe, execute different actions on resulted dataframes and then union the new dataframes back

Since I am new to Spark, I would like to ask a question about a pattern that I am using but don't know whether it is bad practice (splitting a dataframe in two based on a filter, executing different actions on each, and then unioning them back).
To give an example, having dataframe df:
val dfFalse = df.filter(col === false).distinct()
val dfTrue = df.filter(col === true).join(otherDf, Seq(id), "left_anti").distinct()
val newDf = dfFalse union dfTrue
Since my original dataframe has millions of rows, I am curious whether filtering twice like this is bad practice and whether there is some other pattern in Spark that I may not be aware of. In other cases I even need to do 3 or 4 filters, apply different actions to the individual dataframes, and then union them all back.
Kind regards,
There are several key points to take into account when you start using Spark to process large amounts of data and want to reason about performance:
Spark parallelism depends on the number of partitions in your distributed memory representations (RDDs or DataFrames): Spark actions are executed in parallel across the cluster, one task per partition. But note that there are two main kinds of transformations: narrow transformations and wide transformations. The former are operations that execute without a shuffle, so the data does not need to be reallocated to different partitions, avoiding data transfer among workers. The latter do require a shuffle: for example, if you want to perform a distinct by a specific key, Spark must reorganize the data in order to detect the duplicates. Take a look at the docs.
Regarding doing more or less filter transformations:
Spark is based on a lazy evaluation model, which means that the transformations you apply to a dataframe are not executed until you call an action, for example a write operation. The Spark optimizer evaluates your transformations in order to create an optimized execution plan. So, if you have five or six filter operations, it will never traverse the dataframe six times (in contrast to some other dataframe frameworks): the optimizer will take your filter operations and combine them into one. Here are some details.
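As a hypothetical illustration (none of these names come from the question), chained filters end up as a single combined predicate in the physical plan:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("filter-collapse").getOrCreate()
import spark.implicits._

// Toy dataframe with a single numeric column "a".
val df = spark.range(1000).toDF("a")

val filtered = df
  .filter($"a" > 0)
  .filter($"a" < 100)
  .filter($"a" =!= 50)

// The physical plan shows one Filter combining all three conditions,
// not three separate passes over the data.
filtered.explain(true)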
So keep in mind that Spark is a distributed, in-memory data processor, and knowing these details is a must because you can spawn hundreds of cores over hundreds of GBs of data.
The efficiency of this approach depends heavily on your ability to reduce the number of overlapping data files that are scanned by both splits.
I will focus on two techniques that allow data-skipping:
Partitions - if the predicates are based on a partitioned column, only the necessary data will be scanned, based on the condition. In your case, if you split the original dataframe into two by filtering on a partition column, each dataframe will scan only its corresponding portion of the data. In that case your approach will perform really well, as no data will be scanned twice.
Filter/predicate pushdown - data stored in a format supporting filter pushdown (Parquet, for example) allows reading only the files that contain records with values matching the condition. If the values of the filtered column are distributed across many files, the filter pushdown will be inefficient, since data is skipped on a per-file basis, and if a certain file contains values for both splits it will be scanned twice. Writing the data sorted by the filtered column can improve the efficiency of filter pushdown (on read) by gathering the same values into a smaller number of files.
As long as you manage to split your dataframe using the above techniques and minimize the amount of overlap between the splits, this approach will be more efficient.
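As a sketch of what that write layout could look like (the dataframe df, the columns "flag" and "value", the output path, and the SparkSession spark are placeholders, not from the original answer):

import org.apache.spark.sql.functions.col

// Cluster similar values of the pushdown column into fewer files, and
// partition the output by the column later used to split the dataframe.
df.sortWithinPartitions("value")
  .write
  .partitionBy("flag")
  .parquet("/tmp/events_by_flag")

// A filter on the partition column prunes whole directories on read, so each
// split scans only its own files.
val dfTrue  = spark.read.parquet("/tmp/events_by_flag").filter(col("flag") === true)
val dfFalse = spark.read.parquet("/tmp/events_by_flag").filter(col("flag") === false)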

Is it ok to keep multiple DataFrames in Scala List or Map for Iterative processing

I have 3 DataFrames, each with 50 columns and millions of records. I need to apply some common transformations on the above DataFrames.
Currently, I'm keeping those DataFrames in a Scala List and performing the operations on each of them iteratively.
My question is: is it OK to keep big DataFrames in a Scala collection, or will it cause any performance-related issues? If so, what is the best way to work on multiple DataFrames in an iterative manner?
Thanks in advance.
There is no issue in doing so, as the List only holds references to your DataFrames, and DataFrames in Spark are lazily evaluated.
So unless and until you start working on one of the DataFrames, i.e. call an action on it, nothing gets materialized.
And as soon as the action is finished, the materialized data is cleaned up.
So it is equivalent to processing them separately three times; there is no issue with your approach.
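A minimal sketch of that pattern; the names df1, df2, df3 stand for the three DataFrames from the question, and the audit column is just an illustrative "common transformation":

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// The List only stores lazy references; no data is pulled onto the driver.
val dfs: List[DataFrame] = List(df1, df2, df3)

// A common transformation applied to every DataFrame in the collection.
def withLoadTimestamp(df: DataFrame): DataFrame =
  df.withColumn("load_ts", lit(java.time.Instant.now.toString))

// map builds new, still-lazy DataFrames; nothing executes until an action
// (write, count, ...) is called on each result.
val transformed = dfs.map(withLoadTimestamp)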

Spark - How to calculate percentiles in Spark 1.6 dataframe?

I am using Spark 1.6. I need to find multiple percentiles for a column in a dataframe. My data is huge, with at least 10 million records. I tried using the Hive context like below:
hivecontext.sql("select percentile_approx(col,0.25),percentile_approx(col,0.5) from table")
But this approach is very slow and takes a lot of time. I heard about approxQuantile, but it seems it is only available in Spark 2.x. Is there any alternative approach in Spark 1.6 using a Spark dataframe to improve performance?
I saw another approach using a Hive UDAF like below:
import org.apache.spark.sql.functions.{callUDF, lit}
df.agg(callUDF("percentile_approx", $"someColumn", lit(0.8)).as("percentile80"))
Will the above approach improve performance?
I used the percentile_approx(col, array(percentile_value_list)) function and then split the returned array into individual values. It improved performance by not calling the function multiple times.
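A sketch of that single-call approach, reusing the hivecontext and the column/table names from the question; the percentile values 0.25, 0.5 and 0.75 are just an example:

// percentile_approx with an array argument returns array<double>, so the
// data is scanned only once for all percentiles.
val quantiles = hivecontext.sql(
  "select percentile_approx(col, array(0.25, 0.5, 0.75)) as q from table")

// Split the returned array into individual values on the driver.
val Array(p25, p50, p75) = quantiles.first().getSeq[Double](0).toArray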

Spark Dataframe performance for overwrite

Are there any performance differences or considerations between the following two PySpark statements:
df5 = df5.drop("Ratings")
and
df6 = df5.drop("Ratings")
I am not specifically targeting the drop function, but any operation. I was wondering what happens under the hood when you overwrite a variable compared to creating a new one.
Also, are the behavior and performance considerations the same if this were an RDD and not a dataframe?
No, there won't be any difference in the operation.
In the case of NumPy, there is a flags attribute that shows whether an array owns its data or not:
variable_name.flags
In the case of PySpark, the DataFrame is immutable, and every change creates a new DataFrame. How does it do that? Well, a DataFrame is stored in a distributed fashion, and moving data around in memory is costly. Therefore, instead of copying, the new DataFrame just takes over a reference to the existing data, in particular to where the index of the data is stored.
And regarding the RDD question: a DataFrame generally performs way better than an RDD. Here is a good blog post:
Dataframe RDD and dataset
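To see the immutability in practice, here is a small Scala sketch (the behaviour is the same in PySpark; the sample data is made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("drop-demo").getOrCreate()
import spark.implicits._

// Made-up sample data; "Ratings" matches the column from the question.
val df5 = Seq((1, 4.5), (2, 3.0)).toDF("id", "Ratings")

// drop() never mutates df5; it returns a new DataFrame whose plan simply
// omits the column, without copying the distributed data.
val df6 = df5.drop("Ratings")

df5.printSchema()   // still has id and Ratings
df6.printSchema()   // only id

// Rebinding the old name instead (df5 = df5.drop(...) in Python, or a var in
// Scala) builds exactly the same plan; only the local variable changes.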

Group data based on multiple column in spark using scala's API

I have an RDD and want to group the data based on multiple columns. For a large dataset, Spark does not work for me with combineByKey, groupByKey, reduceByKey or aggregateByKey; these give heap space errors. Can you suggest another method for resolving this using Scala's API?
You may want to use treeReduce() for doing an incremental reduce in Spark. However, your hypothesis that Spark cannot work on large datasets is not true, and I suspect you just don't have enough partitions in your data, so maybe a repartition() is what you need.
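As an illustrative sketch (the record fields, the sum aggregation, and the partition count of 200 are made up), grouping by a composite key with reduceByKey and an explicit, larger number of partitions avoids materializing whole groups in memory the way groupByKey does:

case class Record(country: String, city: String, amount: Double)

// sc is the SparkContext; the tiny sample stands in for the real RDD.
val rdd = sc.parallelize(Seq(
  Record("US", "NYC", 10.0), Record("US", "NYC", 5.0), Record("FR", "Paris", 7.0)))

// Key by a tuple of the grouping columns and reduce map-side, spreading the
// shuffle output over more partitions.
val totals = rdd
  .map(r => ((r.country, r.city), r.amount))
  .reduceByKey(_ + _, 200)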