Larger number of Transformations on multiple dataframe in Spark - scala

I have a transform engine built on the spark that is metadata-driven. I perform a set of transformations on multiple data frames stored in memory in a Scala Map[String, DataFrame]. I encounter a condition where I generate a data frame using 84 transforms including(withColumn, Join, union etc). After these, the output data frame is used as an input to another set of transformations.
If I write the intermediate transformation result after first 84 transformations and then load the dataframe into the Map from the output path. The next set of transformations works fine. If I do not this, it takes 30 mins just to evaluate.
My Approach: I tried persisting the Dataframe using:
dfMap(target).cache()
But this approach did not help.

So in these 84 transformations , how many of them are aggregations based on a same key? For example if you are calculating min, max etc., on a particular column value such as user_id, then it makes to store your original dataframe after bucketing it by user id. Also for joining, if you are using the same key, you can partition it by them. If you dont bucket, then for each transformation, there is a spark shuffle.
This answer should help - Spark Data set transformation to array

Related

Best practice in Spark to filter dataframe, execute different actions on resulted dataframes and then union the new dataframes back

Since I am new to Spark I would like to ask a question about a pattern that I am using in Spark but don't know if it's a bad practice ( splitting a dataframe in two based on a filter, execute different actions on them and then joining them back ).
To give an example, having dataframe df:
val dfFalse = df.filter(col === false).distinct()
val dfTrue = df.filter(col === true).join(otherDf, Seq(id), "left_anti").distinct()
val newDf = dfFalse union dfTrue
Since my original dataframe has milions of rows I am curious if this filtering twice is a bad practice and I should use some other pattern in Spark which I may not be aware of. In other cases I even need to do 3,4 filters and then apply different actions to individual data frames and then union them all back.
Kind regards,
There are several key points to take into account when you start to use Spark to process big amounts of data in order to analyze our performance:
Spark parallelism depends of the number of partitions that you have in your distributed memory representations(RDD or Dataframes). That means that the process(Spark actions) will be executed in parallel across the cluster. But note that there are two main different kind of transformations: Narrow transformations and wide transformations. The former represent operations that will be executed without shuffle, so the data donĀ“t need to be reallocated in different partitions thus avoiding data transfer among workers. Consider that if you what to perform a distinct by a specific key Spark must reorganize the data in order to detect the duplicates. Take a look to the doc.
Regarding doing more or less filter transformations:
Spark is based on a lazy evaluation model, it means that all the transformations that you executes on a dataframe are not going to be executed unless you call an action, for example a write operation. And the Spark optimizer evaluates your transformations in order to create an optimized execution plan. So, if you have five or six filter operations it will never traverse the dataframe six times(in contrast to other dataframe frameworks). The optimizer will take your filtering operations and will create one. Here some details.
So have in mind that Spark is a distributed in memory data processor and it is a must to know these details because you can spawn hundreds of cores over hundred of Gbs.
The efficiency of this approach highly depends on the ability to reduce the amount of the overlapped data files that are scanned by both the splits.
I will focus on two techniques that allow data-skipping:
Partitions - if the predicates are based on a partitioned column, only the necessary data will be scanned, based on the condition. In your case, if you split the original dataframe into 2 based on a partitioned column filtering, each dataframe will scan only the corresponding portion of the data. In this case, your approach will be perform really well as no data will be scanned twice.
Filter/predicate pushdown - data stored in a format supporting filter pushdown (Parquet for example) allows reading only the files that contains records with values matching the condition. In case that the values of the filtered column are distributed across many files, the filter pushdown will be inefficient since the data is skipped on a file basis and if a certain file contains values for both the splits, it will be scanned twice. Writing the data sorted by the filtered column might improve the efficiency of the filter pushdown (on read) by gathering the same values into a fewer amount of files.
As long as you manage to split your dataframe, using the above techniques, and minimize the amount of the overlap between the splits, this approach will be more efficient.

Issues while trying to export pyspark pandas dataframe to csv in pyspark

df=df_full[df_fill.part_col.isin(['part_a','part_b'])]
df=df[df.some_other_col =='some_value']
#df has shape of roughly 240k,200
#df_full has shape of roughly 30m, 200
df.to_pandas().reset_index().to_csv('testyyy.csv',index=False)
If I do any groupby operation it is amazingly fast. However the issue lies when I try to export small subset of this large dataset to csv. While I am eventually able to export the dataframe to csv but it is taking too much time.
Warnings:
2022-05-08 13:01:15,948 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df[column_name] = series
Note: part_a and part_b are stored as two separate parquet partitioned files. Also I am using pyspark.pandas in spark3+
So question is what is happening? And what is most efficient wat to export the filtered dataframe to csv?

What does rdd mean in pyspark dataframe

I am new to to pyspark. I am wondering what does rdd mean in pyspark dataframe.
weatherData = spark.read.csv('weather.csv', header=True, inferSchema=True)
These two line of the code has the same output. I am wondering what the effect of having rdd
weatherData.collect()
weatherData.rdd.collect()
A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.
So, a DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.
An RDD, on the other hand, is merely a Resilient Distributed Dataset that is more of a blackbox of data that cannot be optimized as the operations that can be performed against it, are not as constrained.
However, you can go from a DataFrame to an RDD via its .rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the .toDF() method
In general, it is recommended to use a DataFrame where possible due to the built in query optimization.

pyspark dataframe writing results

I am working on a project where i need to read 12 files average file size is 3 gb. I read them with RDD and create dataframe with spark.createDataFrame. Now I need to process 30 Sql queries on the dataframe most them need output of previous one like depend on each other so i save all my intermediate state in dataframe and create temp view for that dataframe.
The program takes only 2 minutes for execute part but the problem is while writing them to csv file or show the results or calling count() function takes too much time. I have tries re-partition thing but still it is taking to much time.
1.What could be the solution?
2.Why it is taking too much time to write even all processing taking small amount of time?
I solved above problem with persist and cache in pyspark.
Spark is a lazy programming language. Two types of Spark RDD operations are- Transformations and Actions. A Transformation is a function that produces new RDD from the existing RDDs but when we want to work with the actual dataset, at that point Action is performed. When the action is triggered after the result, new RDD is not formed like transformation.
Every time i do some operation it was just transforming, so if i call that particular dataframe it will it parent query every time since spark is lazy,so adding persist stopped calling parent query multiple time. It saved lots of processing time.

Creating/accessing dataframe inside the transformation of another dataframe

I'm retrofitting some existing code to use Spark. I've multiple data frames that hold different data sets.
While transforming my main dataframe (or my main data set), I need to use data from the other data frames to complete the transformation. I also have a situation (atleast in the current structure) where I need to create new data frames in the transformation function of another data frames.
I'm trying to determine the following:
Can I access a data frame inside the transformation function of another data frame?
Can a data frame be created on an executor inside the transformation function of a dataframe?
Pointers on how to deal with such a situation would be very helpful.
The answer to both questions is NO:
DataFrames are driver-side abstractions of distributed collections. They cannot be used, created, or referenced in any executor-side transformation.
Why?
DataFrames (like RDDs and Datasets) can only be used within the context of an active SparkSession - without it, the DataFrame cannot "point" to its partitions on the active executors; The SparkSession should be thought of as a live "connection" to the cluster of executors.
Now, if you try using a DataFrame inside another transformation, that DataFrame would have to be serialized on the driver side, sent to the executor(s), and then deserialized there. But this deserialized instance (in a separate JVM) would necessarily lose it's SparkSession - that "connection" was from the driver to the executor, not from this new executor we're now operating in.
So what should you do?
You have a few options for referencing one DataFrame's data in another, and choosing the right one is mostly dependent on the amounts of data that would have to be shuffled (or - transferred between executors):
Collect one of the DataFrames (if you can guarantee it's small!), and then use the resulting local collection (either directly or using spark.broadcast) in any transformation.
Join the two DataFrames on some common fields. This is a very common solution, as the logic of using one DataFrame's data when transforming another usually has to do with some kind of "lookup" for the right value based on some subset of the columns. This usecase translates into a JOIN operation rather naturally
Use set operators like except, intersect and union, if they provide the logical operation you're after.