Translate SQL query to PySpark DataFrame query (a Percentile Ranking calculation) - pyspark

I'm trying to translate this SQL query to PySpark DataFrame methods:
SELECT id_profile, indications,
       PERCENT_RANK() OVER (PARTITION BY id_profile ORDER BY prediction DESC) AS rank
FROM predictions
So id_profile, indications and prediction are columns from my predictions DataFrame.
I think I have to do this with Window methods, but I can't figure out how.

Try this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window.partitionBy("id_profile").orderBy(F.col("prediction").desc())

df.withColumn("rank", F.percent_rank().over(w)) \
    .select("id_profile", "indications", "rank")
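If you want to double-check the DataFrame version, you can also register the DataFrame as a temporary view and run the original SQL directly and compare the results. This is just a quick verification sketch; it assumes an active SparkSession named spark and uses the view name predictions from the query above:
df.createOrReplaceTempView("predictions")
spark.sql("""
    SELECT id_profile, indications,
           PERCENT_RANK() OVER (PARTITION BY id_profile ORDER BY prediction DESC) AS rank
    FROM predictions
""").show()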

Related

How can I optimize calculation of mean in pyspark while ignoring Null values

I have a PySpark DataFrame with around 4 billion rows, so efficiency in operations is very important. What I want to do seems very simple: I want to calculate the average value of two columns, and if one of them is null I want to return only the non-null value. In plain Python I could easily accomplish this using np.nanmean, but I do not believe anything similar is implemented in PySpark.
To clarify the behavior I am expecting, please see the below example rows:
user_id  col_1  col_2  avg_score
1        32     12     22
2        24     None   24
Below is my current implementation. Note that all values in col_1 are guaranteed to be non-null. I believe this can probably be further optimized:
from pyspark.sql import functions as f_

spark_df = spark_df.na.fill(0, 'col_2')
spark_df = spark_df.withColumn(
    'avg_score',
    sum([spark_df[i] for i in ['col_1', 'col_2']]) /
    sum([f_.when(spark_df[i] > 0, 1).otherwise(0) for i in ['col_1', 'col_2']])
)
If anyone has any suggestions for whether there is a more efficient way to calculate this I would really appreciate it.
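One way to express the intended null handling directly, without the na.fill(0, ...) step, is to branch on the null check. This is only a minimal sketch (not from the original thread) and assumes, as stated above, that only col_2 can be null:
from pyspark.sql import functions as F

# Return col_1 alone when col_2 is null, otherwise the mean of both columns
spark_df = spark_df.withColumn(
    'avg_score',
    F.when(F.col('col_2').isNull(), F.col('col_1'))
     .otherwise((F.col('col_1') + F.col('col_2')) / 2)
)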

pyspark: evaluate the sum of all elements in a dataframe

I am trying to evaluate, in pyspark, the sum of all elements of a dataframe. I wrote the following function
def sum_all_elements(df):
    df = df.groupBy().sum()
    df = df.withColumn('total', sum(df[colname] for colname in df.columns))
    return df.select('total').collect()[0][0]
To speed up the function, I have tried converting to an RDD and summing as follows:
def sum_all_elements_pyspark(df):
    res = df.rdd.map(lambda x: sum(x)).sum()
    return res
But apparently the RDD function is slower than the DataFrame one. Is there a way to speed up the RDD function?
DataFrame functions are faster than RDD operations because the Catalyst optimizer optimizes actions performed on DataFrames, but it does not do the same for RDDs.
When you execute an action through the DataFrame API, Spark generates an optimized logical plan, converts it into multiple candidate physical plans, and then uses cost-based optimization to choose the best physical plan.
The final physical plan is ultimately executed as RDD code, because RDDs are what Spark runs at the lowest level.
So the DataFrame-API-based function will give you the performance boost you need.
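If you want to see the plans described above, explain(True) prints the parsed, analyzed and optimized logical plans plus the chosen physical plan. A small illustrative sketch (it assumes an active SparkSession named spark; the toy DataFrame is made up):
from pyspark.sql import functions as F

# Toy DataFrame just for inspecting the plans
df = spark.range(10).withColumn("value", F.col("id") * 2)

# Prints the logical plans and the selected physical plan for the aggregation
df.groupBy().sum().explain(True)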

Spark Dataset - Average function

I'm using Spark with Scala and I'm trying to find the best way to group a Dataset by key and get the average and the sum together.
For example,
I have a Dataset[Player], and Player consists of: playerId, yearSignup, level, points.
I want to group this dataset by yearSignup and calculate, for every year, the sum of points and the average level.
So with groupByKey(p => p.yearSignup) and reduceGroups(p1, p2) I can get the sum of points: (p1.points ++ p2.points) with reduceLeft.
But how do I get the average level? Should I sum it first, then group again and divide?
Or is there another way to do it all together?
After you groupBy, you can use .agg for both sum and avg (see the docs):
import org.apache.spark.sql.functions._

Player
  .groupBy($"yearSignup")
  .agg(
    avg($"level").as("avg_level"),
    sum($"points").as("total_points")
  )

efficient aggregation (sum) on a single column Data Frame in spark scala

I have a Spark DataFrame with a single column and a large number of rows (in the billions). I am trying to calculate the sum of the values in the column using the code shown below. However, it is very slow. Is there an efficient way to calculate the sum?
val df = sc.parallelize(Array(1,3,5,6,7,10,30)).toDF("colA")
df.show()
df.agg(sum("colA")).first().get(0) //very slow
A similar question was posted here: How to sum the values of one column of a dataframe in spark/scala
The focus of this question, however, is efficiency.

Looking for a way to Calculate Frequency distribution of a dataframe in spark/scala

I want to calculate the frequency distribution (return the most common element in each column and the number of times it appears) of a dataframe using Spark and Scala. I've tried using the DataFrameStatFunctions library, but after I filter my dataframe for only numeric type columns, I can't apply any functions from the library. Is the best way to do this to create a UDF?
You can use:
val newDF = df.groupBy("columnName").count()
newDF.show()
It will show you the frequency count for each unique entry in that column.