I'm using Spark with Scala, and trying to find the best way to group a Dataset by key and get the average and the sum together.
For example,
I have a Dataset[Player], where Player consists of: playerId, yearSignup, level, points.
I want to group this dataset by yearSignup , and to calculate for every year: sum of points, and average level.
So with groupByKey(p => p.yearSignup) and reduceGroups((p1, p2) => ...) I can get the sum of points by reducing with p1.points + p2.points.
But how do I get the average level as well? Should I sum it first, and then group again and divide?
Or is there another way to do both together?
After you groupBy, you can use .agg to get both the sum and the avg (see docs):
import org.apache.spark.sql.functions._
Player
  .groupBy($"yearSignup")
  .agg(
    avg($"level").as("avg_level"),
    sum($"points").as("total_points")
  )
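
For reference, a fully self-contained sketch of the same approach (the Player fields come from the question; the sample rows, the players name, and the local[*] session are made up so the example runs end to end):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class Player(playerId: String, yearSignup: Int, level: Int, points: Int)

object GroupByYearExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("group-by-year").master("local[*]").getOrCreate()
    import spark.implicits._

    // Made-up sample rows, just to make the sketch runnable end to end.
    val players = Seq(
      Player("a", 2020, 3, 100),
      Player("b", 2020, 5, 250),
      Player("c", 2021, 7, 400)
    ).toDS()

    // One pass: both aggregates are computed in the same .agg call.
    players
      .groupBy($"yearSignup")
      .agg(
        avg($"level").as("avg_level"),
        sum($"points").as("total_points")
      )
      .show()

    spark.stop()
  }
}

Both aggregates come out of one shuffle, so there is no need to sum first and then group again to divide.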
Related
I'm new to PySpark and would like to make a table that counts the unique pairs of values from two columns and shows the average of another column over all rows with those pairs of values. My code so far is:
df1 = df.withColumn('trip_rate', df.total_amount / df.trip_distance)
df1.groupBy('PULocationID', 'DOLocationID').count().orderBy('count', ascending=False).show()
I want to add the average of the trip rate for each unique pair as a column. Can you help me please?
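
Both the count and the average can be computed in one .agg call, so a second grouping isn't needed. A minimal sketch of that pattern in the Spark Scala API used elsewhere on this page (it carries over to PySpark almost word for word; df and the column names are taken from the question and assumed to exist):

import org.apache.spark.sql.functions._

// Derive trip_rate as in the question, then compute the count and the
// average together for each (PULocationID, DOLocationID) pair.
val df1 = df.withColumn("trip_rate", col("total_amount") / col("trip_distance"))

val pairStats = df1
  .groupBy("PULocationID", "DOLocationID")
  .agg(
    count("*").as("count"),
    avg("trip_rate").as("avg_trip_rate")
  )
  .orderBy(desc("count"))

pairStats.show()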
I have calculated a Z score at row level in the following way:
AVG = window_avg(sum(measure))
STDEV = window_stdevp(sum(measure))
ZSCORE = (sum(measure) - AVG) / STDEV
This works nicely, and I have a z-score for each row when the data is aggregated at that level. I would now like to aggregate my data monthly, but exclude all rows outside the -2 to 2 z-score range. When I add the z-score to the filter, it is already computed at the monthly level and won't filter individual rows.
How can I change this so the aggregation only includes rows with an acceptable z-score?
I have a large dataset and used this code to group the different products and the different models of every product.
import org.apache.spark.sql.functions._

val tt2 = dTestSample1.groupBy("Product", "model")
  .agg(count("Product").as("countItems"))
  .withColumn("percentage", (col("countItems") / sum("countItems").over()) * 100)
  .sort("Product")
So far, the results in this table are accurate.
Can anyone help me improve the code so it also calculates the percentage of every model out of its product?
To clarify the idea, this table was done manually and can be taken as an example.
Sounds like you're looking for a windowing function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val winSpec = Window.partitionBy("Product")
df.withColumn("totalPerProduct", sum("countItems").over(winSpec))
Then you can calculate the percentages easily after that.
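
Putting the window together with the original aggregation, a hedged sketch of the whole pipeline (dTestSample1, the column names, and the percentagePerProduct name are taken from or modelled on the question and assumed to exist):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val winSpec = Window.partitionBy("Product")

val tt2 = dTestSample1
  .groupBy("Product", "model")
  .agg(count("Product").as("countItems"))
  // total items per product, computed over a window partitioned by Product
  .withColumn("totalPerProduct", sum("countItems").over(winSpec))
  // share of each model within its own product
  .withColumn("percentagePerProduct", col("countItems") / col("totalPerProduct") * 100)
  .sort("Product")

tt2.show()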
I have a Spark (Scala) DataFrame "Marketing" with approximately 17 columns, one of them being "Balance", whose data type is Int. I need to find the median Balance. I can get as far as arranging it in ascending order, but how do I proceed after that? I was given a hint that a percentile function can be used, but I don't have any idea about it. Can anyone help?
Median is the same thing as the 50th percentile. If you do not mind using Hive functions, you can do one of the following:
marketingDF.selectExpr("percentile(CAST(Balance AS BIGINT), 0.5) AS median")
If you do not need an exact figure you can look into using percentile_approx() instead.
Documentation for both functions is located here.
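
For reference, a hedged sketch of the approximate route (marketingDF is assumed to be the DataFrame from the question, with an integer Balance column; the relativeError value is arbitrary):

// Approximate median via the SQL expression...
marketingDF.selectExpr("percentile_approx(Balance, 0.5) AS approx_median").show()

// ...or via DataFrameStatFunctions; approxQuantile returns one value per
// requested probability, and 0.5 is the median.
val approxMedian: Double =
  marketingDF.stat.approxQuantile("Balance", Array(0.5), 0.001).head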
I want to calculate the frequency distribution (i.e., return the most common element in each column and the number of times it appears) of a DataFrame using Spark and Scala. I've tried the DataFrameStatFunctions library, but after filtering my DataFrame to only the numeric columns, I can't apply any functions from that library. Is the best way to do this to create a UDF?
You can use:
val newDF = df.groupBy("columnName").count()
newDF.show()
It will show you the frequency count for the unique entries in that column.
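
If you only want the single most common value per column (rather than the whole distribution), ordering by the count and taking the first row is enough. A hedged sketch, assuming df is the question's DataFrame and "columnName" is one of its columns:

import org.apache.spark.sql.functions._

// Most frequent value in "columnName" and how many times it appears.
val mostCommon = df
  .groupBy("columnName")
  .agg(count("*").as("cnt"))
  .orderBy(desc("cnt"))
  .limit(1)

mostCommon.show()

Repeating this per column (for example with df.columns.map(...)) gives the full frequency summary without writing a UDF.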