Efficient aggregation (sum) on a single-column DataFrame in Spark/Scala

I have a Spark DataFrame with a single column and a large number of rows (billions). I am trying to calculate the sum of all the values in the column using the code shown below. However, it is very slow. Is there an efficient way to calculate the sum?
import org.apache.spark.sql.functions.sum
import spark.implicits._ // needed for toDF outside the spark-shell

val df = sc.parallelize(Array(1, 3, 5, 6, 7, 10, 30)).toDF("colA")
df.show()
df.agg(sum("colA")).first().get(0) // very slow
A similar question was posted here: How to sum the values of one column of a dataframe in spark/scala
The focus of this question, however, is efficiency.
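For reference, the same global aggregation can be written in a couple of equivalent ways, and you can print the physical plan to see the aggregate Spark builds for it. A minimal sketch, assuming the DataFrame above and a Spark 2.x session:

import org.apache.spark.sql.functions.sum

// Both forms express one global aggregate over the whole column.
val viaAgg = df.agg(sum("colA")).first().getLong(0)
val viaSelect = df.select(sum("colA")).first().getLong(0)

// Show the physical plan; it should be a partial aggregate per partition
// followed by a final aggregate, so only partial sums are exchanged,
// not the raw rows.
df.agg(sum("colA")).explain()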

Related

Creating a pivot table in PySpark

I am new to PySpark and would like to make a table that counts the unique pairs of values from two columns and shows the average of another column over all rows with each pair of values. My code so far is:
df1 = df.withColumn('trip_rate', df.total_amount / df.trip_distance)
df1.groupBy('PULocationID', 'DOLocationID').count().orderBy('count', ascending=False).show()
I want to add the average of the trip rate for each unique pair as a column. Can you help me please?
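A rough sketch of one way to get both the pair counts and the per-pair average in a single groupBy/agg (sketched in Scala; PySpark's groupBy, agg, count, and avg use the same names), assuming df1 already has the trip_rate column:

import org.apache.spark.sql.functions.{avg, count, desc}

// Count each (PULocationID, DOLocationID) pair and attach the average
// trip_rate for that pair in the same aggregation.
val pairStats = df1
  .groupBy("PULocationID", "DOLocationID")
  .agg(count("*").as("count"), avg("trip_rate").as("avg_trip_rate"))
  .orderBy(desc("count"))

pairStats.show()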

pyspark: evaluate the sum of all elements in a dataframe

I am trying to evaluate, in PySpark, the sum of all elements of a dataframe. I wrote the following function:
def sum_all_elements(df):
    df = df.groupBy().sum()
    df = df.withColumn('total', sum(df[colname] for colname in df.columns))
    return df.select('total').collect()[0][0]
To speed up the function, I have tried converting to an RDD and summing, as follows:
def sum_all_elements_pyspark(df):
    res = df.rdd.map(lambda x: sum(x)).sum()
    return res
But apparently the RDD function is slower than the DataFrame one. Is there a way to speed up the RDD function?
DataFrame functions are faster than RDD operations because the Catalyst optimizer optimizes the actions performed on DataFrames, but it does not do the same for RDDs.
When you execute an action through the DataFrame API, Spark generates an optimized logical plan; that optimized logical plan is converted into multiple physical plans, which then go through cost-based optimization, and the best physical plan is chosen.
The final physical plan is equivalent RDD code to execute, because at the low level RDDs are used.
So using a DataFrame-API-based function will give you the required performance boost.
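To see the pipeline described above, you can ask Spark to print the plans it builds for the DataFrame version. A small sketch (Scala shown; df.explain(True) is the PySpark equivalent):

// Prints the parsed, analyzed, and optimized logical plans plus the
// chosen physical plan for the DataFrame-based aggregation.
df.groupBy().sum().explain(true)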

I want to split a big dataframe into multiple dataframes on the basis of row count in Spark using Scala. I am not able to figure it out.

For example,
I have a dataframe with 12000 rows, and I define a threshold of 3000 (externally supplied to my code via config), so I would like to split this dataframe into 4 dataframes with 3000 rows each.
If the dataframe has 12500 rows, I will split it into 5 dataframes, 4 with 3000 rows and the last one with 500.
The significance of the threshold here is that if the row count of the dataframe is less than the threshold, I will not tinker with the dataframe.
To get close to this behavior (but not exactly what you specified), you can use the function randomSplit: randomSplit(weights: Array[Double])
The weights are the proportions of rows that go into each output dataframe. Hence, you cannot specify the exact number of rows.
In your case (12000 rows and a threshold of 3000), df.randomSplit(Array(1.0, 1.0, 1.0, 1.0)) separates it into 4 approximately equal parts. More generally:
val n = math.ceil(df.count().toDouble / 3000).toInt // 3000 = the externally supplied threshold
df.randomSplit(Array.fill(n)(1.0))
PS: I don't think you can get cuts of a Spark dataframe with an "exact number of rows" with Spark.
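If exact row counts matter, one common workaround (not part of the answer above) is to attach a row index with zipWithIndex and filter by index range. A rough sketch, assuming a DataFrame df and an externally supplied threshold:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Split df into chunks of at most `threshold` rows each, with exact counts.
def splitByRowCount(df: DataFrame, threshold: Int): Seq[DataFrame] = {
  val spark = df.sparkSession
  // Attach a stable 0-based index to every row.
  val schema = StructType(df.schema.fields :+ StructField("_idx", LongType, nullable = false))
  val indexed = spark.createDataFrame(
    df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
    schema
  )
  val total = indexed.count()
  // One DataFrame per index range of size `threshold`.
  (0L until total by threshold.toLong).map { start =>
    indexed
      .filter(indexed("_idx") >= start && indexed("_idx") < start + threshold)
      .drop("_idx")
  }
}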

Add a column value in a row based on every value of that same row

My question might be dumb, but I was wondering:
I want to do structured streaming
I want to both aggregate and score the data with a Sparkling Water model
So I have this
val data_processed = data_raw
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "1 minute"))
  .agg(
    *** all aggregations ***
  )
What I want to add is something like:
.withColumn("row_scored", scoring(all_others_cols))
So every row in the structured stream would be scored after the aggregation. But I don't think that is possible, so I'm wondering if you can think of another approach.
I'm using Sparkling Water, so the scoring function needs an H2O Frame. I was thinking of creating a UDF that would (see the sketch after this list):
select all the other columns,
create a row and transform it into a dataframe,
convert that one-row dataframe into an H2O Frame,
predict on the one-row H2O Frame,
transform the prediction from the H2O Frame back into a dataframe,
get the score from that dataframe as a double and return it from the UDF.
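Here is a minimal sketch of the UDF wiring described in the steps above, with the Sparkling Water conversion and prediction hidden behind a hypothetical scoreRow function (a placeholder, not an H2O API), so only the Spark side is shown; data_processed is the aggregated stream from the snippet earlier:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

// HYPOTHETICAL: wrap the H2O Frame conversion + prediction steps in a
// plain Scala function that maps one aggregated row to a score.
val scoreRow: Row => Double = row => 0.0 // placeholder

// Pack all existing columns into a struct so the UDF sees the whole row.
val scoringUdf = udf((row: Row) => scoreRow(row))

val data_scored = data_processed
  .withColumn("row_scored", scoringUdf(struct(data_processed.columns.map(col): _*)))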
But I don't think that's very optimised; maybe you have a fresh approach or remarks that will make me see another way to do this.
Thanks in advance

Looking for a way to calculate the frequency distribution of a dataframe in spark/scala

I want to calculate the frequency distribution (return the most common element in each column and the number of times it appears) of a dataframe using Spark and Scala. I've tried using the DataFrameStatFunctions library, but after I filter my dataframe for only numeric-type columns, I can't apply any functions from the library. Is the best way to do this to create a UDF?
You can use
val newDF = df.groupBy("columnName").count()
newDF.show()
It will show you the frequency count of unique entries.
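To get just the most common element per column and how often it appears (rather than the full count table), one rough sketch, assuming df has already been filtered down to the columns you care about:

import org.apache.spark.sql.functions.{col, count, desc}

// For every column, keep the value with the highest count.
val columnModes = df.columns.map { c =>
  val top = df.groupBy(col(c))
    .agg(count("*").as("cnt"))
    .orderBy(desc("cnt"))
    .first()
  (c, top.get(0), top.getLong(1))
}

columnModes.foreach { case (name, value, freq) =>
  println(s"$name: most common value = $value ($freq occurrences)")
}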