PySpark: Pandas UDF for scipy statistical transformations - pyspark

I'm trying to create a column of standardized (z-score) of a column x on a Spark dataframe, but am missing something because none of it is working.
Here's my example:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from scipy.stats import zscore
#pandas_udf('float')
def zscore_udf(x: pd.Series) -> pd.Series:
return zscore(x)
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
columns = ["id","x"]
data = [("a", 81.0),
("b", 36.2),
("c", 12.0),
("d", 81.0),
("e", 36.3),
("f", 12.0),
("g", 111.7)]
df = spark.createDataFrame(data=data,schema=columns)
df.show()
df = df.withColumn('y', zscore_udf(df.x))
df.show()
Which results in obviously wrong calculations:
+---+-----+----+
| id| x| y|
+---+-----+----+
| a| 81.0|null|
| b| 36.2| 1.0|
| c| 12.0|-1.0|
| d| 81.0| 1.0|
| e| 36.3|-1.0|
| f| 12.0|-1.0|
| g|111.7| 1.0|
+---+-----+----+
Thank you for your help.

How to fix:
instead of using a UDF calculate the stddev_pop and the avg of the dataframe and calculate z-score manually.
I suggest using "window function" over the entire dataframe for the first step and then a simple arithmetic to get the z-score.
see suggested code:
from pyspark.sql.functions import avg, col, stddev_pop
from pyspark.sql.window import Window
df2 = df \
.select(
"*",
avg("x").over(Window.partitionBy()).alias("avg_x"),
stddev_pop("x").over(Window.partitionBy()).alias("stddev_x"),
) \
.withColumn("manual_z_score", (col("x") - col("avg_x")) / col("stddev_x"))
Why the UDF didn't work?
Spark is used for distributed computation. When you perform operations on a DataFrame Spark distributes the workload into partitions on the executors/workers available.
pandas_udf is not different. When running a UDF from the type pd.Series -> pd.Series some rows are sent to partition X and some to partition Y, then when zscore is run it calculates the mean and std of the data in the partition and writes the zscore based on that data only.
I'll use spark_partition_id to "prove" this.
rows a,b,c were mapped in partition 0 while d,e,f,g in partition 1. I've calculated manually the mean/stddev_pop of both the entire set and the partitioned data and then calculated the z-score. the UDF z-score was equal to the z-score of the partition.
from pyspark.sql.functions import pandas_udf, spark_partition_id, avg, stddev, col, stddev_pop
from pyspark.sql.window import Window
df2 = df \
.select(
"*",
zscore_udf(df.x).alias("z_score"),
spark_partition_id().alias("partition"),
avg("x").over(Window.partitionBy(spark_partition_id())).alias("avg_partition_x"),
stddev_pop("x").over(Window.partitionBy(spark_partition_id())).alias("stddev_partition_x"),
) \
.withColumn("partition_z_score", (col("x") - col("avg_partition_x")) / col("stddev_partition_x"))
df2.show()
+---+-----+-----------+---------+-----------------+------------------+--------------------+
| id| x| z_score|partition| avg_partition_x|stddev_partition_x| partition_z_score|
+---+-----+-----------+---------+-----------------+------------------+--------------------+
| a| 81.0| 1.327058| 0|43.06666666666666|28.584533502500186| 1.3270579815484989|
| b| 36.2|-0.24022315| 0|43.06666666666666|28.584533502500186|-0.24022314955974558|
| c| 12.0| -1.0868348| 0|43.06666666666666|28.584533502500186| -1.0868348319887526|
| d| 81.0| 0.5366879| 1| 60.25|38.663063768925504| 0.5366879387524718|
| e| 36.3|-0.61945426| 1| 60.25|38.663063768925504| -0.6194542714757446|
| f| 12.0| -1.2479612| 1| 60.25|38.663063768925504| -1.247961110593097|
| g|111.7| 1.3307275| 1| 60.25|38.663063768925504| 1.3307274433163698|
+---+-----+-----------+---------+-----------------+------------------+--------------------+
I also added df.repartition(8) prior to the calculation and managed to get similar results as in the original question.
partitions with 0 stddev --> null z score, partition with 2 rows --> (-1, 1) z scores.
+---+-----+-------+---------+---------------+------------------+-----------------+
| id| x|z_score|partition|avg_partition_x|stddev_partition_x|partition_z_score|
+---+-----+-------+---------+---------------+------------------+-----------------+
| a| 81.0| null| 0| 81.0| 0.0| null|
| d| 81.0| null| 0| 81.0| 0.0| null|
| f| 12.0| null| 1| 12.0| 0.0| null|
| b| 36.2| -1.0| 6| 73.95| 37.75| -1.0|
| g|111.7| 1.0| 6| 73.95| 37.75| 1.0|
| c| 12.0| -1.0| 7| 24.15|12.149999999999999| -1.0|
| e| 36.3| 1.0| 7| 24.15|12.149999999999999| 1.0|
+---+-----+-------+---------+---------------+------------------+-----------------+

Related

Calculate running total in Pyspark dataframes and break the loop when a condition occurs

I have a spark dataframe, where I need to calculate a running total based on the current and previous row sum of amount valued based on the col_x. when ever there is occurance of negative amount in col_y, I should break the running total of previous records, and start doing the running total from current row.
Sample dataset:
The expected output should be like:
How to acheive this with dataframes using pyspark?
Another way
Create Index
df = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
Regenerate Columns
df = df.select('index', 'value.*')#.show()
Create groups bounded by negative values
w=Window.partitionBy().orderBy('index').rowsBetween(-sys.maxsize,0)
df=df.withColumn('cat', f.min('Col_y').over(w))
Cumsum within groups
y=Window.partitionBy('cat').orderBy(f.asc('index')).rowsBetween(Window.unboundedPreceding,0)
df.withColumn('cumsum', f.round(f.sum('Col_y').over(y),2)).sort('index').drop('cat','index').show()
Outcome
+-----+-------------------+------+
|Col_x| Col_y|cumsum|
+-----+-------------------+------+
| ID1|-17.899999618530273| -17.9|
| ID1| 21.899999618530273| 4.0|
| ID1| 236.89999389648438| 240.9|
| ID1| 4.989999771118164|245.89|
| ID1| 610.2000122070312|856.09|
| ID1| -35.79999923706055| -35.8|
| ID1| 21.899999618530273| -13.9|
| ID1| 17.899999618530273| 4.0|
+-----+-------------------+------+
I am hoping in real scenario you will be having a timestamp column to do ordering of the data, I am ordering the data using line number with zipindex for the explanation basis here.
from pyspark.sql.window import Window
import pyspark.sql.functions as f
from pyspark.sql.functions import *
from pyspark.sql.types import *
data = [
("ID1", -17.9),
("ID1", 21.9),
("ID1", 236.9),
("ID1", 4.99),
("ID1", 610.2),
("ID1", -35.8),
("ID1",21.9),
("ID1",17.9)
]
schema = StructType([
StructField('Col_x', StringType(),True), \
StructField('Col_y', FloatType(),True)
])
df = spark.createDataFrame(data=data, schema=schema)
df_1 = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
df_1.createOrReplaceTempView("valuewithorder")
w = Window.partitionBy('Col_x').orderBy('index')
w1 = Window.partitionBy('Col_x','group').orderBy('index')
df_final=spark.sql("select value.Col_x,round(value.Col_y,1) as Col_y, index from valuewithorder")
"""Group The data into different groups based on the negative value existance"""
df_final = df_final.withColumn("valueChange",(f.col('Col_y')<0).cast("int")) \
.fillna(0,subset=["valueChange"])\
.withColumn("indicator",(~((f.col("valueChange") == 0))).cast("int"))\
.withColumn("group",f.sum(f.col("indicator")).over(w.rangeBetween(Window.unboundedPreceding, 0)))
"""Cumlative sum with idfferent parititon of group and col_x"""
df_cum_sum = df_final.withColumn("Col_z", sum('Col_y').over(w1))
df_cum_sum.createOrReplaceTempView("FinalCumSum")
df_cum_sum = spark.sql("select Col_x , Col_y ,round(Col_z,1) as Col_z from FinalCumSum")
df_cum_sum.show()
Results of intermedite data set and results
>>> df_cum_sum.show()
+-----+-----+-----+
|Col_x|Col_y|Col_z|
+-----+-----+-----+
| ID1|-17.9|-17.9|
| ID1| 21.9| 4.0|
| ID1|236.9|240.9|
| ID1| 5.0|245.9|
| ID1|610.2|856.1|
| ID1|-35.8|-35.8|
| ID1| 21.9|-13.9|
| ID1| 17.9| 4.0|
+-----+-----+-----+
>>> df_final.show()
+-----+-----+-----+-----------+---------+-----+
|Col_x|Col_y|index|valueChange|indicator|group|
+-----+-----+-----+-----------+---------+-----+
| ID1|-17.9| 0| 1| 1| 1|
| ID1| 21.9| 1| 0| 0| 1|
| ID1|236.9| 2| 0| 0| 1|
| ID1| 5.0| 3| 0| 0| 1|
| ID1|610.2| 4| 0| 0| 1|
| ID1|-35.8| 5| 1| 1| 2|
| ID1| 21.9| 6| 0| 0| 2|
| ID1| 17.9| 7| 0| 0| 2|
+-----+-----+-----+-----------+---------+-----+

Pyspark sort and get first and last

I used code belopw to sort based on one column. I am wondering how can I get the first element and last element in sorted dataframe?
group_by_dataframe
.count()
.filter("`count` >= 10")
.sort(desc("count"))
The max and min functions need to have a group to work with, to circumvent the issue, you can create a dummy column as below, then call the max and min for the maximum and minimum values.
If that's all you need, you don't really need sort here.
from pyspark.sql import functions as F
df = spark.createDataFrame([("a", 0.694), ("b", -2.669), ("a", 0.245), ("a", 0.1), ("b", 0.3), ("c", 0.3)], ["n", "val"])
df.show()
+---+------+
| n| val|
+---+------+
| a| 0.694|
| b|-2.669|
| a| 0.245|
| a| 0.1|
| b| 0.3|
| c| 0.3|
+---+------+
df = df.groupby('n').count() #.sort(F.desc('count'))
df = df.withColumn('dummy', F.lit(1))
df.show()
+---+-----+-----+
| n|count|dummy|
+---+-----+-----+
| c| 1| 1|
| b| 2| 1|
| a| 3| 1|
+---+-----+-----+
df = df.groupBy('dummy').agg(F.min('count').alias('min'), F.max('count').alias('max')).drop('dummy')
df.show()
+---+---+
|min|max|
+---+---+
| 1| 3|
+---+---+

How get the percentage of totals for each count after a groupBy in pyspark?

Given the following DataFrame:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("test").getOrCreate()
df = spark.createDataFrame([['a',1],['b', 2],['a', 3]], ['category', 'value'])
df.show()
+--------+-----+
|category|value|
+--------+-----+
| a| 1|
| b| 2|
| a| 3|
+--------+-----+
I want to count the number of items in each category and provide a percentage of total for each count, like so
+--------+-----+----------+
|category|count|percentage|
+--------+-----+----------+
| b| 1| 0.333|
| a| 2| 0.667|
+--------+-----+----------+
You can obtain the count and percentage/ratio of totals with the following
import pyspark.sql.functions as f
from pyspark.sql.window import Window
df.groupBy('category').count()\
.withColumn('percentage', f.round(f.col('count') / f.sum('count')\
.over(Window.partitionBy()),3)).show()
+--------+-----+----------+
|category|count|percentage|
+--------+-----+----------+
| b| 1| 0.333|
| a| 2| 0.667|
+--------+-----+----------+
The previous statement can be divided in steps. df.groupBy('category').count() produces the count:
+--------+-----+
|category|count|
+--------+-----+
| b| 1|
| a| 2|
+--------+-----+
then by applying window functions we can obtain the total count on each row:
df.groupBy('category').count().withColumn('total', f.sum('count').over(Window.partitionBy())).show()
+--------+-----+-----+
|category|count|total|
+--------+-----+-----+
| b| 1| 3|
| a| 2| 3|
+--------+-----+-----+
where the total column is calculated by adding together all the counts in the partition (a single partition that includes all rows).
Once we have count and total for each row we can calculate the ratio:
df.groupBy('category')\
.count()\
.withColumn('total', f.sum('count').over(Window.partitionBy()))\
.withColumn('percentage',f.col('count')/f.col('total'))\
.show()
+--------+-----+-----+------------------+
|category|count|total| percentage|
+--------+-----+-----+------------------+
| b| 1| 3|0.3333333333333333|
| a| 2| 3|0.6666666666666666|
+--------+-----+-----+------------------+
You can groupby and aggregate with agg:
import pyspark.sql.functions as F
df.groupby('category').agg(F.count('value') / df.count()).show()
Output:
+--------+------------------+
|category|(count(value) / 3)|
+--------+------------------+
| b|0.3333333333333333|
| a|0.6666666666666666|
+--------+------------------+
To make it nicer you can use:
df.groupby('category').agg(
(
F.round(F.count('value') / df.count(), 2)
).alias('ratio')
).show()
Output:
+--------+-----+
|category|ratio|
+--------+-----+
| b| 0.33|
| a| 0.67|
+--------+-----+
You can also use SQL:
df.createOrReplaceTempView('df')
spark.sql(
"""
SELECT category, COUNT(*) / (SELECT COUNT(*) FROM df) AS ratio
FROM df
GROUP BY category
"""
).show()

create column in pyspark based on conditons [duplicate]

I have a PySpark Dataframe with two columns:
+---+----+
| Id|Rank|
+---+----+
| a| 5|
| b| 7|
| c| 8|
| d| 1|
+---+----+
For each row, I'm looking to replace Id column with "other" if Rank column is larger than 5.
If I use pseudocode to explain:
For row in df:
if row.Rank > 5:
then replace(row.Id, "other")
The result should look like this:
+-----+----+
| Id|Rank|
+-----+----+
| a| 5|
|other| 7|
|other| 8|
| d| 1|
+-----+----+
Any clue how to achieve this? Thanks!!!
To create this Dataframe:
df = spark.createDataFrame([('a', 5), ('b', 7), ('c', 8), ('d', 1)], ['Id', 'Rank'])
You can use when and otherwise like -
from pyspark.sql.functions import *
df\
.withColumn('Id_New',when(df.Rank <= 5,df.Id).otherwise('other'))\
.drop(df.Id)\
.select(col('Id_New').alias('Id'),col('Rank'))\
.show()
this gives output as -
+-----+----+
| Id|Rank|
+-----+----+
| a| 5|
|other| 7|
|other| 8|
| d| 1|
+-----+----+
Starting with #Pushkr solution couldn't you just use the following ?
from pyspark.sql.functions import *
df.withColumn('Id',when(df.Rank <= 5,df.Id).otherwise('other')).show()

How to add columns in pyspark dataframe dynamically

I am trying to add few columns based on input variable vIssueCols
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
from pyspark.sql.window import Window
vIssueCols=['jobid','locid']
vQuery1 = 'vSrcData2= vSrcData'
vWindow1 = Window.partitionBy("vKey").orderBy("vOrderBy")
for x in vIssueCols:
Query1=vQuery1+'.withColumn("'+x+'_prev",F.lag(vSrcData.'+x+').over(vWindow1))'
exec(vQuery1)
now above query will generate vQuery1 as below, and it is working, but
vSrcData2= vSrcData.withColumn("jobid_prev",F.lag(vSrcData.jobid).over(vWindow1)).withColumn("locid_prev",F.lag(vSrcData.locid).over(vWindow1))
Cant I write a query something like
vSrcData2= vSrcData.withColumn(x+"_prev",F.lag(vSrcData.x).over(vWindow1))for x in vIssueCols
and generate the columns with the loop statement. Some blog has suggested to add a udf and call that, But instead using udf I will use above executing string method.
You can build your query using reduce.
from pyspark.sql.functions import lag
from pyspark.sql.window import Window
from functools import reduce
#sample data
df = sc.parallelize([[1, 200, '1234', 'asdf'],
[1, 50, '2345', 'qwerty'],
[1, 100, '4567', 'xyz'],
[2, 300, '123', 'prem'],
[2, 10, '000', 'ankur']]).\
toDF(["vKey","vOrderBy","jobid","locid"])
df.show()
vWindow1 = Window.partitionBy("vKey").orderBy("vOrderBy")
#your existing processing
df1= df.\
withColumn("jobid_prev",lag(df.jobid).over(vWindow1)).\
withColumn("locid_prev",lag(df.locid).over(vWindow1))
df1.show()
#to-be processing
vIssueCols=['jobid','locid']
df2 = (reduce(
lambda r_df, col_name: r_df.withColumn(col_name+"_prev", lag(r_df[col_name]).over(vWindow1)),
vIssueCols,
df
))
df2.show()
Sample data:
+----+--------+-----+------+
|vKey|vOrderBy|jobid| locid|
+----+--------+-----+------+
| 1| 200| 1234| asdf|
| 1| 50| 2345|qwerty|
| 1| 100| 4567| xyz|
| 2| 300| 123| prem|
| 2| 10| 000| ankur|
+----+--------+-----+------+
Output:
+----+--------+-----+------+----------+----------+
|vKey|vOrderBy|jobid| locid|jobid_prev|locid_prev|
+----+--------+-----+------+----------+----------+
| 1| 50| 2345|qwerty| null| null|
| 1| 100| 4567| xyz| 2345| qwerty|
| 1| 200| 1234| asdf| 4567| xyz|
| 2| 10| 000| ankur| null| null|
| 2| 300| 123| prem| 000| ankur|
+----+--------+-----+------+----------+----------+
Hope this helps!