Optimised way of doing cumulative sum on a large number of columns in PySpark

I have a DataFrame with 752 columns (id, date and 750 feature columns) and around 1.5 million rows. I need to compute a cumulative sum over all 750 feature columns, partitioned by id and ordered by date.
Below is the approach I am currently following:
from pyspark.sql import Window
from pyspark.sql import functions as F
# putting all 750 feature columns in a list
required_columns = ['ts_1','ts_2'....,'ts_750']
# defining the window: one partition per id, rows ordered by date
sumwindow = Window.partitionBy('id').orderBy('date')
# applying the window to compute the cumulative sum of each individual feature column
for current_col in required_columns:
    new_col_name = "sum_{0}".format(current_col)
    # F.sum is Spark's aggregate function, not Python's built-in sum
    df = df.withColumn(new_col_name, F.sum(F.col(current_col)).over(sumwindow))
# saving the result into a parquet file
df.write.format('parquet').save(output_path)
I get the error below when running this approach:
py4j.protocol.Py4JJavaError: An error occurred while calling o2428.save.
: java.lang.StackOverflowError
Please let me know of an alternative solution. It seems a cumulative sum is a bit tricky with a large amount of data. Please suggest any alternative approach, or any Spark configuration I can tune to make it work.

I expect the issue is that the lineage has become too large. Take a look at the explain plan after you have re-assigned the DataFrame that many times.
The standard solution is to checkpoint the DataFrame every so often to truncate the plan. This is somewhat like caching, but for the plan rather than the data, and it is often needed for iterative algorithms that repeatedly modify a DataFrame.
Here is a nice pyspark explanation of caching and checkpointing
I suggest calling df.checkpoint() every 5-10 modifications to start with. Note that checkpoint() returns a new DataFrame, so re-assign the result (df = df.checkpoint()), and a checkpoint directory must be set first with spark.sparkContext.setCheckpointDir(...).
Let us know how it goes

Related

Best practice in Spark to filter dataframe, execute different actions on resulted dataframes and then union the new dataframes back

Since I am new to Spark, I would like to ask about a pattern I am using but am not sure is good practice: splitting a dataframe in two based on a filter, executing different actions on each part, and then combining them back together.
To give an example, having dataframe df:
val dfFalse = df.filter(col === false).distinct()
val dfTrue = df.filter(col === true).join(otherDf, Seq(id), "left_anti").distinct()
val newDf = dfFalse union dfTrue
Since my original dataframe has millions of rows, I am curious whether filtering twice like this is bad practice and whether I should use some other Spark pattern that I may not be aware of. In other cases I even need to apply 3 or 4 filters, apply different actions to the individual dataframes, and then union them all back.
Kind regards,
There are several key points to take into account when you start using Spark to process large amounts of data and want to analyze its performance:
Spark parallelism depends on the number of partitions in your distributed in-memory representations (RDDs or DataFrames). The processing triggered by Spark actions is executed in parallel across the cluster. But note that there are two main kinds of transformations: narrow transformations and wide transformations. The former are operations executed without a shuffle, so the data doesn't need to be moved between partitions, which avoids data transfer among workers. The latter require a shuffle: for example, if you want to perform a distinct by a specific key, Spark must reorganize the data in order to detect the duplicates. Take a look at the docs.
Regarding doing more or less filter transformations:
Spark is based on a lazy evaluation model: the transformations you apply to a dataframe are not executed until you call an action, for example a write operation. The Spark optimizer evaluates your transformations in order to create an optimized execution plan. So if you chain five or six filter operations, Spark will not traverse the dataframe six times (in contrast to some other dataframe frameworks). The optimizer will take your filter operations and combine them into one. Here are some details.
So keep in mind that Spark is a distributed in-memory data processor, and it is essential to know these details because you can end up using hundreds of cores and hundreds of GB of memory.
The efficiency of this approach depends heavily on reducing the amount of overlapping data files that are scanned by both splits.
I will focus on two techniques that allow data skipping:
Partitions - if the predicates are based on a partitioned column, only the necessary data will be scanned, based on the condition. In your case, if you split the original dataframe in two by filtering on a partitioned column, each dataframe will scan only the corresponding portion of the data. In this case, your approach will perform really well, as no data will be scanned twice.
Filter/predicate pushdown - data stored in a format supporting filter pushdown (Parquet, for example) allows reading only the files that contain records with values matching the condition. If the values of the filtered column are distributed across many files, the filter pushdown will be inefficient, since the data is skipped on a file basis: if a certain file contains values for both splits, it will be scanned twice. Writing the data sorted by the filtered column might improve the efficiency of the filter pushdown (on read) by gathering the same values into fewer files.
As long as you manage to split your dataframe using the above techniques and minimize the overlap between the splits, this approach will be more efficient.

Pyspark Dataframe count taking too long

We have a PySpark DataFrame with around 25k records. We are trying to perform a count/empty check on it and it is taking too long. We tried:
df.count()
df.rdd.isEmpty()
len(df.head(1))==0
Converting to pandas and checking pandas_df.empty
Enabling the Arrow option
df.cache() and df.persist() before the counts
df.repartition(n)
Writing the df to DBFS, but the write also takes a long time (cancelled after 20 mins)
Could you please help us understand what we are doing wrong?
Note: there are no duplicate values in df, and we have done multiple joins to form the df.
Without looking at the df.explain() output it's challenging to know the specific issue, but it certainly seems like you could have a skewed data set.
(Skew usually shows up in the Spark UI as one executor taking a lot longer than the others to finish.) If you are on a recent version of Spark, there are tools to help with this out of the box:
spark.sql.adaptive.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
Count is not taking too long. It's taking the time it needs to complete what you asked Spark to do. To reduce that work, do the things you are likely already doing: filter the data before joining so that only the necessary data is transferred to the joins, and review your data for skew, programming around it if you can't use adaptive query execution.
Convince yourself this is a data issue. Limit your source [data/tables] to 1000 or 10000 records and see if it runs fast. Then, one at a time, remove the limit from only one [table/data source] (keeping the limit on all the others) and find the table that is the source of your problem. Then study that [table/data source] and figure out how you can work around the issue (if you can't use adaptive query execution to fix it).
(Finally, if you are using Hive tables, you should make sure the table stats are up to date.)
ANALYZE TABLE mytable COMPUTE STATISTICS;

Issues while trying to export pyspark pandas dataframe to csv in pyspark

df = df_full[df_full.part_col.isin(['part_a','part_b'])]
df = df[df.some_other_col == 'some_value']
# df has a shape of roughly (240k, 200)
# df_full has a shape of roughly (30m, 200)
df.to_pandas().reset_index().to_csv('testyyy.csv',index=False)
Any groupby operation I run is amazingly fast. The issue arises when I try to export a small subset of this large dataset to CSV: I am eventually able to export the dataframe, but it takes far too long.
Warnings:
2022-05-08 13:01:15,948 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
df[column_name] = series
Note: part_a and part_b are stored as two separate partitioned parquet files. I am also using pyspark.pandas on Spark 3+.
So the question is: what is happening? And what is the most efficient way to export the filtered dataframe to CSV?

Performance Improvement in scala dataframe operations

As the source dataset for my use case, I am using a table that is partitioned by the load_date column and optimized weekly with the Delta OPTIMIZE command.
The table schema is as shown below:
+-----------------+--------------------+------------+---------+--------+---------------+
| ID| readout_id|readout_date|load_date|item_txt| item_value_txt|
+-----------------+--------------------+------------+---------+--------+---------------+
Later this table is pivoted on the columns item_txt and item_value_txt, and many operations are applied using multiple window functions, as shown below:
val windowSpec = Window.partitionBy("id","readout_date")
val windowSpec1 = Window.partitionBy("id","readout_date").orderBy(col("readout_id") desc)
val windowSpec2 = Window.partitionBy("id").orderBy("readout_date")
val windowSpec3 = Window.partitionBy("id").orderBy("readout_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val windowSpec4 = Window.partitionBy("id").orderBy("readout_date").rowsBetween(Window.unboundedPreceding, Window.currentRow-1)
These window functions are used to implement several pieces of logic on the data. A few joins are also used to process the data.
The final table is partitioned by readout_date and id, and performance is very poor: it takes a long time even for 100 ids and 100 readout_dates.
If I do not partition the final table, I get the error below:
Job aborted due to stage failure: Total size of serialized results of 129 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.
The expected number of ids in production is in the billions, and I expect far more throttling and performance issues when processing the complete data.
The cluster configuration and utilization metrics are provided below.
Please let me know if I am doing anything wrong with the repartitioning, and whether there are any methods to improve cluster utilization and performance.
Any leads appreciated!
spark.driver.maxResultSize is just a setting, and you can increase it. BUT it is set at 4 GiB to warn you that you are doing something wrong and should optimize your work. You are doing the right thing by asking for help to optimize.
The first thing I suggest, if you care about performance, is to get rid of the windows. The first 3 windows you use could be achieved using group by, and this will perform better. The last two windows are definitely harder to reframe as a group by, but with some reframing of the problem you might be able to do it. The trick could be to use multiple queries instead of one. You might think that would perform worse, but I'm here to tell you that if you can avoid using a window you will get better performance almost every time. Windows aren't bad things; they are a tool to be used, but they do not perform well on unbounded data. Can you do anything as an intermediate step to reduce the data the window needs to examine? Or can you use aggregate functions to complete the work without having to use a window? You should explore your options.
Given your other answers, you should be grouping by ID not windowing by Id. And likely using aggregates(sum) by week of year/month. This would likely give you really speedy performance with the loss of some granularity. This would give you enough insight to decide to look into something deeper... or not.
If you want more accuracy, I'd suggest the following:
Convert your nulls to 0s.
val windowSpec1 = Window.partitionBy("id").orderBy(col("readout_date") asc) // asc is important as it flips the relationship so that it groups the previous nulls
Then create a running total on the SIG_XX VAL or whatever signal you want to look into. Call the new column 'null-partitions'.
This will effectively allow you to group the numbers (by null-partitions), and you can then run aggregate functions using group by to complete your calculations. A window and a group by can do the same thing; windows are just more expensive in how they move data, which slows things down. Group by uses more of the cluster to do the work and speeds up the process.

What is the faster way to count the number of entries in a data frame?

I have a data frame df that contains around 1 GB of data. Why does the command df.count() take a relatively long time to complete, while df.filter(...) is much faster? Is there a better way to estimate the number of entries in df that is faster than df.count()?
df.count() is the correct way.
Note that df.filter(...) is a transformation, which means it is lazy, i.e. the filtering code isn't executed yet. It will only be executed once you add an action like count or collect to the filtered result. At that point the runtime should be similar to the original call to count.