What is the faster way to count the number of entries in a data frame? - scala

I have a data frame df that contains around 1 GB of data. Why does df.count() take a relatively long time to complete, while df.filter(...) is much faster? Is there a way to estimate the number of entries in df that is faster than df.count()?

df.count() is the correct way.
Note that df.filter(...) is a transformation, which means it is lazy: the filtering code is not executed when you call it. It only runs once you add an action such as count or collect to the filtered result, and at that point the runtime should be similar to the original call to count.
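To make the transformation/action distinction concrete, here is a minimal plain-Python sketch. LazyFrame is a hypothetical toy class, not Spark's implementation; it only mimics the behaviour described above, where filter records work and count performs it.

```python
import time


class LazyFrame:
    """Toy stand-in for a DataFrame: transformations are recorded, actions run them."""

    def __init__(self, rows, predicates=()):
        self._rows = rows
        self._predicates = tuple(predicates)

    def filter(self, predicate):
        # Transformation: only remember the predicate, touch no data.
        return LazyFrame(self._rows, self._predicates + (predicate,))

    def count(self):
        # Action: scan every row and apply all recorded predicates now.
        return sum(1 for row in self._rows if all(p(row) for p in self._predicates))


df = LazyFrame(range(1_000_000))

start = time.perf_counter()
evens = df.filter(lambda x: x % 2 == 0)  # returns immediately, nothing is scanned
filter_seconds = time.perf_counter() - start

start = time.perf_counter()
total = df.count()          # full scan
even_total = evens.count()  # full scan plus the recorded filter
count_seconds = time.perf_counter() - start

print(f"filter: {filter_seconds:.6f}s, counts: {count_seconds:.6f}s")
```

The filter call looks fast only because no rows have been read yet; the cost shows up when count runs.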

Related

Pyspark Dataframe count taking too long

We have a PySpark DataFrame with around 25k records. We are trying to perform a count/empty check on it and it is taking too long. We tried:
df.count()
df.rdd.isEmpty()
len(df.head(1))==0
Converted to Pandas and tried pandas_df.empty()
Tried the arrow option
df.cache() and df.persist() before the counts
df.repartition(n)
Tried writing the df to DBFS, but writing also takes quite a long time (cancelled after 20 minutes)
Could you please help us understand what we are doing wrong?
Note: There are no duplicate values in df, and we have done multiple joins to form the df.

Without looking at the output of df.explain() it is hard to pinpoint the issue, but it certainly seems like you could have a skewed data set.
(Skew usually shows up in the Spark UI as one task taking a lot longer than the others to finish.) If you are on a recent version of Spark (adaptive query execution is available from Spark 3.0), there are tools to help with this out of the box:
spark.sql.adaptive.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
Count is not taking too long. It is taking the time it needs to complete what you asked Spark to do, which includes running every join that forms the df. To reduce that work, do the things you are likely already doing: filter the data before joining so that only the rows you need are transferred to the joins, and review your data for skew and program around it if you can't use adaptive query execution.
Convince yourself this is a data issue. Limit each source [data/table] to 1000 or 10000 records and check that the job runs fast. Then, one at a time, remove the limit from a single [table/data source] (keeping the limit on all others) to find the table that is the source of your problem. Then study that [table/data source] and figure out how to work around the issue (if you can't use adaptive query execution to fix it).
(Finally, if you are using Hive tables, make sure the table statistics are up to date.)
ANALYZE TABLE mytable COMPUTE STATISTICS;
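The "limit everything except one source" procedure described above can be sketched as a small helper. This is an illustration under assumptions: time_with_one_full_source and the build callable are hypothetical names, and the helper only relies on each source exposing .limit(n), which Spark DataFrames do.

```python
import time


def time_with_one_full_source(sources, build, limit=1000):
    """Time `build` once per source, leaving only that source unlimited.

    sources: dict of name -> DataFrame-like object exposing .limit(n)
    build:   hypothetical callable that takes the same kind of dict, performs
             the joins and runs an action such as .count()
    Returns a list of (name, seconds), slowest first.
    """
    timings = []
    for full_name in sources:
        # Keep one source at full size and cap every other source.
        trial = {
            name: (frame if name == full_name else frame.limit(limit))
            for name, frame in sources.items()
        }
        start = time.perf_counter()
        build(trial)
        timings.append((full_name, time.perf_counter() - start))
    return sorted(timings, key=lambda item: item[1], reverse=True)
```

The source at the top of the returned list is the first candidate to inspect for skew or missing filters.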

How to calculate number of rows obtained after a join, a filter or a write without using count function - Pyspark

I am using PySpark to join, filter and write large dataframe to a csv.
After each filter, join or write, I count the number of rows with df.count().
However, counting the rows means reloading the data and re-running the various operations.
How can I count the number of rows after each operation without reloading and recomputing, as df.count() does?
I am aware that the cache function could be a solution to not reload and recalculate but I am looking for another solution as it's not always the best one.
Thank you in advance!

Why not look at the Spark UI to get a feel for what is happening instead of using count? That may tell you what you need without actually running counts. The Jobs and Tasks views can help you find the bottlenecks, and the SQL tab shows the plan so you can understand what is actually being executed.
If you want something cheaper than count, countApprox is an option (it is RDD-level tooling). You should cache the DataFrame if you are going to count it and then use it again afterwards. In fact, count is sometimes used to force caching.
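As a rough sketch of the countApprox suggestion: approx_count is a hypothetical wrapper, and the timeout and confidence defaults are illustrative values rather than recommendations. It assumes df is a PySpark DataFrame, whose df.rdd exposes RDD.countApprox(timeout, confidence).

```python
def approx_count(df, timeout_ms=1000, confidence=0.95):
    """Return an approximate row count after at most `timeout_ms` milliseconds."""
    # RDD.countApprox returns a potentially incomplete result if the
    # timeout expires before all tasks have finished.
    return df.rdd.countApprox(timeout_ms, confidence)
```

Going through df.rdd has its own conversion cost, so it is worth comparing against a plain count on your data before relying on it.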

Optimised way of doing cumulative sum on large number of columns in pyspark

I have a DataFrame with 752 columns (id, date and 750 feature columns) and around 1.5 million rows, and I need to compute a cumulative sum on all 750 feature columns, partitioned by id and ordered by date.
Below is the approach I am following currently:
# imports the snippet relies on
from pyspark.sql import Window
from pyspark.sql.functions import col, sum  # note: shadows Python's built-in sum
# putting all 750 feature columns in a list (abbreviated here)
required_columns = ['ts_1', 'ts_2', ..., 'ts_750']
# defining window
sumwindow = Window.partitionBy('id').orderBy('date')
# applying the window to compute the cumulative sum of each feature column
for current_col in required_columns:
    new_col_name = "sum_{0}".format(current_col)
    df = df.withColumn(new_col_name, sum(col(current_col)).over(sumwindow))
# saving the result to a parquet file
df.write.format('parquet').save(output_path)
I am getting below error while running this current approach
py4j.protocol.Py4JJavaError: An error occurred while calling o2428.save.
: java.lang.StackOverflowError
Please let me know of an alternative solution. It seems a cumulative sum is a bit tricky with a large amount of data. Please suggest any alternative approach or any Spark configuration I can tune to make it work.

I expect the issue is that the lineage has grown too large. Take a look at the explain plan after re-assigning the DataFrame that many times.
The standard solution is to checkpoint the DataFrame every so often to truncate the plan. This is sort of like caching, but for the plan rather than the data, and it is often needed for iterative algorithms that repeatedly modify a DataFrame.
Here is a nice pyspark explanation of caching and checkpointing
I suggest df.checkpoint() every 5-10 modifications to start with
Let us know how it goes
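The "checkpoint every few modifications" advice can be sketched as follows. apply_with_checkpoints is a hypothetical helper, and every=10 is just a starting value in the suggested range. It assumes df is a PySpark DataFrame and that a checkpoint directory has already been set with SparkContext.setCheckpointDir.

```python
def apply_with_checkpoints(df, steps, every=10):
    """Apply each step (a callable df -> df) and checkpoint every `every` steps."""
    for index, step in enumerate(steps, start=1):
        df = step(df)
        if index % every == 0:
            # checkpoint() materializes the data and returns a DataFrame
            # whose plan no longer contains the earlier steps.
            df = df.checkpoint()
    return df


# Usage with the names from the question (F is pyspark.sql.functions,
# the checkpoint path is a placeholder):
# spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
# steps = [
#     lambda d, c=c: d.withColumn("sum_" + c, F.sum(F.col(c)).over(sumwindow))
#     for c in required_columns
# ]
# df = apply_with_checkpoints(df, steps, every=10)
```

Each checkpoint writes the intermediate data out, so a smaller `every` trades extra I/O for a shorter plan.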

pyspark dataframe writing results

I am working on a project where I need to read 12 files with an average size of 3 GB. I read them as RDDs and create a DataFrame with spark.createDataFrame. I then need to run 30 SQL queries on the DataFrame. Most of them depend on the output of a previous one, so I keep every intermediate state in a DataFrame and create a temp view for it.
The processing part of the program takes only 2 minutes, but writing the results to a CSV file, showing them, or calling count() takes too much time. I have tried repartitioning, but it still takes too long.
1. What could be the solution?
2. Why does writing take so long when all the processing takes so little time?

I solved the above problem with persist and cache in PySpark.
Spark evaluates lazily. There are two types of Spark RDD operations: transformations and actions. A transformation is a function that produces a new RDD from existing RDDs; it is only when we want to work with the actual dataset that an action is performed. Unlike a transformation, an action does not produce a new RDD; it triggers the computation and returns a result.
Every operation I did was just a transformation, so each time I used a particular DataFrame, Spark re-ran its parent queries, since Spark is lazy. Adding persist stopped the parent queries from being executed multiple times, which saved a lot of processing time.

Bring data of DataFrame back to local node for further actions (count / show) in spark/scala

I'm using Spark 1.6 in Scala.
I know this touches on some of the ideas behind the Spark framework, but I couldn't answer it for myself by reading different tutorials (maybe the wrong ones).
I joined two DataFrames into a new one (nDF). I know it is not evaluated until I call show, first or count.
But since that is exactly what I want to do, I want to inspect nDF in different ways:
nDF.show
nDF.count
nDF.filter()
...and so on. Each of these would take a long time, since the original DataFrames are big. Couldn't I bring/copy the data into this new one, so that these actions run as quickly as on the original sets? (At first I thought 'collect' would do it, but it only returns an Array, not a DataFrame.)

This is a classic scenario. When you join two DataFrames, Spark does not do any work yet, because it evaluates lazily: the join only runs when an action is called on the resulting DataFrame. Actions are operations such as show, count or collect.
When show and count are called on nDF, Spark evaluates the resulting DataFrame every time: once when you call show, again when you call count, and so on. Internally it re-runs the whole computation, including the join, for each action.
Spark doesn't cache the resulting DataFrame in memory unless it is told to do so with df.cache / df.persist.
So when you do
val nDF = a.join(b).persist
And then call count/show, Spark will evaluate nDF once and keep the resulting DataFrame in memory. Subsequent actions will therefore be faster.
However, the first evaluation might be a little slower, and you will need a little more executor memory.
If you have enough memory available for the size of your dataset, what you're probably looking for is df.cache(), which is shorthand for df.persist() with the default storage level. If the dataset is too large for that, consider calling df.persist() with an explicit storage level (for example StorageLevel.DISK_ONLY), as it allows different levels of persistence.
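The persist-then-inspect pattern from this answer can be wrapped so the cache is released afterwards. This is a minimal Python sketch of the same idea (the question itself uses Scala): persisted is a hypothetical helper, and it assumes df is a PySpark DataFrame, where persist() returns the DataFrame and unpersist() frees the cached data.

```python
from contextlib import contextmanager


@contextmanager
def persisted(df):
    """Persist `df` for the duration of the block, then release it."""
    cached = df.persist()  # lazy: nothing is stored until the first action
    try:
        yield cached
    finally:
        cached.unpersist()  # free executor memory once the inspection is done


# Usage (`a`, `b` and the join key "id" are placeholders):
# with persisted(a.join(b, "id")) as nDF:
#     nDF.count()  # first action computes the join and fills the cache
#     nDF.show()   # later actions read the cached result
```

Releasing the cache explicitly matters most when several large intermediate results compete for the same executor memory.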
Hope this is what you're looking for. Cheers