pyspark: evaluate the sum of all elements in a dataframe

I am trying to evaluate, in pyspark, the sum of all elements of a dataframe. I wrote the following function
def sum_all_elements(df):
    df = df.groupBy().sum()
    df = df.withColumn('total', sum(df[colname] for colname in df.columns))
    return df.select('total').collect()[0][0]
To speed up the function, I tried converting to an RDD and summing as follows:
def sum_all_elements_pyspark(df):
    res = df.rdd.map(lambda x: sum(x)).sum()
    return res
But apparently the RDD-based function is slower than the DataFrame one. Is there a way to speed up the RDD function?

DataFrame operations are faster than RDD operations because the Catalyst optimizer optimizes the work performed on DataFrames, but it does not do the same for RDDs.
When you execute an action through the DataFrame API, Spark generates an optimized logical plan; that logical plan is converted into multiple physical plans, which go through cost-based optimization, and the best physical plan is chosen.
The final physical plan is then executed as RDD code, because at the low level RDDs are what Spark runs on.
So using a DataFrame-API-based function will give you the performance boost you need.

Related

What's the difference between takeOrdered and sortBy + take in pySpark?

If I want to get the sorted top-k values of some RDD, what's the difference between the takeOrdered function and sortBy followed by take? Is the former faster?

I want to split a big dataframe into multiple dataframes based on row count, in Spark using Scala. I am not able to figure it out.

For example,
I have a dataframe with 12000 rows, and I define a threshold of 3000 (externally supplied to my code via config), so I would like to split this dataframe into 4 dataframes with 3000 rows each.
If the dataframe has 12500 rows, I will split it into 5 dataframes, 4 with 3000 rows and last one with 500.
The significance of the threshold here is that if the row count of the dataframe is less than it, I will not tinker with that dataframe.
To get close to this behavior (but not exactly what you specified), you can use the function randomSplit: randomSplit(weights: Array[Double])
The weights are the proportions that go into each output dataframe. Hence, you cannot specify the exact number of rows per split.
In your case (12500 rows), df.randomSplit(Array(1.0, 1.0, 1.0, 1.0)) separates it into 4 approximately equal parts. Or, deriving the number of splits from your threshold:
val n = math.ceil(df.count() / 3000.0).toInt
df.randomSplit((1 to n).map(_ => 1.0).toArray)
PS: I don't think you can cut a Spark dataframe into pieces with an "exact number of rows" in Spark.
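Putting that together with the threshold check from the question, here is a minimal sketch, assuming the threshold is supplied externally as in the question and accepting that randomSplit gives only approximately equal pieces rather than exact row counts:
import org.apache.spark.sql.DataFrame

def splitByThreshold(df: DataFrame, threshold: Long = 3000L): Array[DataFrame] = {
  val total = df.count()
  if (total <= threshold) Array(df)    // below the threshold: leave the dataframe untouched
  else {
    val n = math.ceil(total.toDouble / threshold).toInt
    df.randomSplit(Array.fill(n)(1.0)) // n roughly equal parts, each around threshold rows
  }
}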

Efficient histogram on multiple columns in one shot

Here's how I calculate histogram on one column:
val df = spark.read.format("csv").option("header", "true").load("/project/test.csv")
df.map(row => row.getString(2).toDouble).rdd.histogram(10)
I want to calculate histograms on all columns. I can simply repeat the second line (see code above) and call histogram separately on each column. But my concern is that Spark will load data from disk each time I call histogram(), which means that if there are 10 columns, data is loaded 10 times. Is there a more efficient way to do this? How can I calculate histograms on all 10 columns in one shot?
Edit
Here's one way to combine multiple histogram() calls into one expression:
val histograms = {
val a = df.map(row => row.getString(0).toDouble).rdd.histogram(10)
val b = df.map(row => row.getString(1).toDouble).rdd.histogram(15)
(a, b)
}
Does this guarantee that the histograms will be computed with only one pass over the data? Is combining multiple histogram calls into one expression the trick? Or is that even necessary? Doesn't Spark delay evaluation until the result is used in any case, even when separate statements are used?
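Note that histogram() is an action, so wrapping several calls in one expression does not merge them into a single pass over the data; each call still launches its own job(s). A minimal sketch of the usual workaround, assuming the parsed columns fit in cache (column indices and bucket counts are illustrative):
import org.apache.spark.storage.StorageLevel

// Parse the needed columns once and cache the result, so later actions
// read from memory instead of re-reading the CSV from disk.
val parsed = df.rdd
  .map(row => (row.getString(0).toDouble, row.getString(1).toDouble))
  .persist(StorageLevel.MEMORY_AND_DISK)

val histA = parsed.map(_._1).histogram(10)  // still a separate action, but reads the cached RDD
val histB = parsed.map(_._2).histogram(15)  // same here

parsed.unpersist()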

efficient aggregation (sum) on a single column Data Frame in spark scala

I have a Spark DataFrame with a single column and a large number of rows (billions). I am trying to calculate the sum of the values in that column using the code shown below. However, it is very slow. Is there an efficient way to calculate the sum?
val df = sc.parallelize(Array(1,3,5,6,7,10,30)).toDF("colA")
df.show()
df.agg(sum("colA")).first().get(0) //very slow
Similar query was posted here: How to sum the values of one column of a dataframe in spark/scala
The focus of this question, however, is efficiency.

Looking for a way to Calculate Frequency distribution of a dataframe in spark/scala

I want to calculate the frequency distribution (return the most common element in each column and the number of times it appears) of a dataframe using Spark and Scala. I've tried using the DataFrameStatFunctions library, but after I filter my dataframe for only numeric-type columns, I can't apply any functions from it. Is the best way to do this to create a UDF?
You can use
val newDF = df.groupBy("columnName").count()
newDF.show()
It will show you the frequency count for the unique entries in that column.
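If the goal is specifically the most common element and its count for every column, here is a minimal sketch that builds on the same groupBy/count idea (looping over df.columns and keeping only the top row per column is an assumption, not part of the original answer):
import org.apache.spark.sql.functions.desc

val modes = df.columns.map { c =>
  // value counts for this column, highest first; keep only the top row
  val top = df.groupBy(c).count().orderBy(desc("count")).first()
  (c, top.get(0), top.getLong(1))    // (column name, most common value, its frequency)
}
modes.foreach(println)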