Unsure how to apply row-wise normalization on pyspark dataframe - pyspark

Disclaimer: I'm a beginner when it comes to Pyspark.
For each cell in a row, I'd like to apply the following function
new_col_i = col_i / max(col_1,col_2,col_3,...,col_n)
At the very end, I'd like the range of values to go from 0.0 to 1.0.
Here are the details of my dataframe:
Dimensions: (6.5M, 2905)
Dtypes: Double
Initial DF:
+-----+-------+-------+-------+
|   id|  col_1|  col_2|  col_n|
+-----+-------+-------+-------+
|    1|    7.5|    0.1|    2.0|
|    2|    0.3|    3.5|   10.5|
+-----+-------+-------+-------+
Updated DF:
+-----+-------+-------+-------+
|   id|  col_1|  col_2|  col_n|
+-----+-------+-------+-------+
|    1|    1.0|  0.013|   0.26|
|    2|  0.028|   0.33|    1.0|
+-----+-------+-------+-------+
Any help would be appreciated.

You can find the row-wise maximum from an array of the columns and then loop over the columns to replace each one with its normalized value.
from pyspark.sql.functions import array, array_max, col

cols = df.columns[1:]
df2 = df.withColumn('max', array_max(array(*[col(c) for c in cols])))
for c in cols:
    df2 = df2.withColumn(c, col(c) / col('max'))
df2.show()
+---+-------------------+--------------------+-------------------+----+
| id| col_1| col_2| col_n| max|
+---+-------------------+--------------------+-------------------+----+
| 1| 1.0|0.013333333333333334|0.26666666666666666| 7.5|
| 2|0.02857142857142857| 0.3333333333333333| 1.0|10.5|
+---+-------------------+--------------------+-------------------+----+
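With 2905 columns, a variation that avoids adding a max column and looping is to build the whole projection in a single select with greatest; this is only a sketch based on the code above, assuming every column except id is numeric:
from pyspark.sql.functions import col, greatest

cols = df.columns[1:]                                   # every column except id
row_max = greatest(*[col(c) for c in cols])             # row-wise maximum across the columns
df2 = df.select("id", *[(col(c) / row_max).alias(c) for c in cols])
df2.show()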

Related

How to add a list of data into a dataframe in pyspark?

There is a dataframe that contains two columns, one is a key and the other is a value, like below:
+--------------------+--------------------+
|                 key|               value|
+--------------------+--------------------+
|a                   |abcde               |
+--------------------+--------------------+
I want to slice the value into multiple values with their positions and generate a new dataframe keyed by the original key, like below:
+--------------------+--------------------+
| key| value|
+--------------------+--------------------+
|a |[a, 0] |
|a |[b, 1] |
|a |[c, 2] |
|a |[d, 3] |
|a |[e, 4] |
I have tried to use join() and StructType() but I failed. Is there any possible method to do that? Thanks!
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("a", "abcd", 3000)]
df = spark.createDataFrame(data, ["Name", "Department", "Salary"])
df.select(f.expr("posexplode( filter (split(Department,''), x -> x != '') ) as (value, index)"), f.col("Name")).show()
+-----+-----+----+
|value|index|Name|
+-----+-----+----+
| 0| a| a|
| 1| b| a|
| 2| c| a|
| 3| d| a|
+-----+-----+----+
# to explain the above functions more clearly:
f.expr( "..." )                     # write a SQL expression
"posexplode( [..] )"                # turn an array into rows, each element with its index
"filter( [..] , x -> x != '' )"     # filter the empty strings out of an array
"split( Department, '' )"           # split the column on the empty string to extract the characters; this also adds an empty string to our array, which we then filter out
Here's an update to fit your exact request; just a little more manipulation puts it into the required format:
df.select( f.expr( "posexplode( filter (split(Department,''), x -> x != '' ) ) as (myvalue, index)") , f.col("Name"),f.expr( "array(myvalue,index) as value ")).drop("index","myvalue").show()
+----+------+
|Name| value|
+----+------+
| a|[0, a]|
| a|[1, b]|
| a|[2, c]|
| a|[3, d]|
+----+------+
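If you need the characters first, as in the [a, 0] layout from the question, the same expression works with the aliases swapped; a small sketch along the lines of the snippet above (the column names index and letter are my own):
import pyspark.sql.functions as f

exploded = df.select(f.col("Name"), f.expr("posexplode( filter (split(Department,''), x -> x != '') ) as (index, letter)"))
exploded.select("Name", f.expr("array(letter, index) as value")).show()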
The following snippet transforms your data into your specified format:
import pyspark.sql.functions as F
df = spark.createDataFrame([("a", "abcde",)], ["key", "value"])
df_split = df.withColumn("split", F.array_remove(F.split("value", ""), ""))
df_split.show()
df_exploded = df_split.select("key", F.posexplode("split"))
df_exploded.show()
df_array = df_exploded.select("key", F.array("col", "pos").alias("value"))
df_array.show()
Output:
+---+-----+---------------+
|key|value| split|
+---+-----+---------------+
| a|abcde|[a, b, c, d, e]|
+---+-----+---------------+
+---+---+---+
|key|pos|col|
+---+---+---+
| a| 0| a|
| a| 1| b|
| a| 2| c|
| a| 3| d|
| a| 4| e|
+---+---+---+
+---+------+
|key| value|
+---+------+
| a|[a, 0]|
| a|[b, 1]|
| a|[c, 2]|
| a|[d, 3]|
| a|[e, 4]|
+---+------+
First, the string is split into an array, where the split pattern is the empty string; this leaves an empty-string element that needs to be removed.
Then each element of the split column array is turned into a row, with its position in the array in the column pos.
Lastly, the columns are combined into an array.
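The same three steps can also be chained without the intermediate DataFrames; a compact sketch of the equivalent pipeline, reusing the df defined above:
import pyspark.sql.functions as F

df.select("key", F.posexplode(F.array_remove(F.split("value", ""), ""))) \
  .select("key", F.array("col", "pos").alias("value")) \
  .show()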

PySpark: Pandas UDF for scipy statistical transformations

I'm trying to create a column of standardized (z-score) values of a column x on a Spark dataframe, but I'm missing something because none of it is working.
Here's my example:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from scipy.stats import zscore

@pandas_udf('float')
def zscore_udf(x: pd.Series) -> pd.Series:
    return zscore(x)

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
columns = ["id","x"]
data = [("a", 81.0),
("b", 36.2),
("c", 12.0),
("d", 81.0),
("e", 36.3),
("f", 12.0),
("g", 111.7)]
df = spark.createDataFrame(data=data,schema=columns)
df.show()
df = df.withColumn('y', zscore_udf(df.x))
df.show()
Which results in obviously wrong calculations:
+---+-----+----+
| id| x| y|
+---+-----+----+
| a| 81.0|null|
| b| 36.2| 1.0|
| c| 12.0|-1.0|
| d| 81.0| 1.0|
| e| 36.3|-1.0|
| f| 12.0|-1.0|
| g|111.7| 1.0|
+---+-----+----+
Thank you for your help.
How to fix:
Instead of using a UDF, calculate the stddev_pop and the avg of the dataframe and compute the z-score manually.
I suggest using a window function over the entire dataframe for the first step and then simple arithmetic to get the z-score.
See the suggested code:
from pyspark.sql.functions import avg, col, stddev_pop
from pyspark.sql.window import Window
df2 = df \
.select(
"*",
avg("x").over(Window.partitionBy()).alias("avg_x"),
stddev_pop("x").over(Window.partitionBy()).alias("stddev_x"),
) \
.withColumn("manual_z_score", (col("x") - col("avg_x")) / col("stddev_x"))
Why the UDF didn't work?
Spark is used for distributed computation. When you perform operations on a DataFrame, Spark distributes the workload into partitions on the available executors/workers.
pandas_udf is no different. When running a UDF of type pd.Series -> pd.Series, some rows are sent to partition X and some to partition Y; when zscore is then run, it calculates the mean and std of the data in that partition and writes the z-score based on that data only.
I'll use spark_partition_id to "prove" this.
Rows a, b, c were mapped to partition 0 while d, e, f, g went to partition 1. I manually calculated the mean/stddev_pop of both the entire set and the partitioned data and then calculated the z-score; the UDF z-score was equal to the z-score of the partition.
from pyspark.sql.functions import pandas_udf, spark_partition_id, avg, stddev, col, stddev_pop
from pyspark.sql.window import Window
df2 = df \
.select(
"*",
zscore_udf(df.x).alias("z_score"),
spark_partition_id().alias("partition"),
avg("x").over(Window.partitionBy(spark_partition_id())).alias("avg_partition_x"),
stddev_pop("x").over(Window.partitionBy(spark_partition_id())).alias("stddev_partition_x"),
) \
.withColumn("partition_z_score", (col("x") - col("avg_partition_x")) / col("stddev_partition_x"))
df2.show()
+---+-----+-----------+---------+-----------------+------------------+--------------------+
| id| x| z_score|partition| avg_partition_x|stddev_partition_x| partition_z_score|
+---+-----+-----------+---------+-----------------+------------------+--------------------+
| a| 81.0| 1.327058| 0|43.06666666666666|28.584533502500186| 1.3270579815484989|
| b| 36.2|-0.24022315| 0|43.06666666666666|28.584533502500186|-0.24022314955974558|
| c| 12.0| -1.0868348| 0|43.06666666666666|28.584533502500186| -1.0868348319887526|
| d| 81.0| 0.5366879| 1| 60.25|38.663063768925504| 0.5366879387524718|
| e| 36.3|-0.61945426| 1| 60.25|38.663063768925504| -0.6194542714757446|
| f| 12.0| -1.2479612| 1| 60.25|38.663063768925504| -1.247961110593097|
| g|111.7| 1.3307275| 1| 60.25|38.663063768925504| 1.3307274433163698|
+---+-----+-----------+---------+-----------------+------------------+--------------------+
I also added df.repartition(8) prior to the calculation and managed to get results similar to those in the original question:
partitions with 0 stddev --> null z-score; partitions with 2 rows --> (-1, 1) z-scores.
+---+-----+-------+---------+---------------+------------------+-----------------+
| id| x|z_score|partition|avg_partition_x|stddev_partition_x|partition_z_score|
+---+-----+-------+---------+---------------+------------------+-----------------+
| a| 81.0| null| 0| 81.0| 0.0| null|
| d| 81.0| null| 0| 81.0| 0.0| null|
| f| 12.0| null| 1| 12.0| 0.0| null|
| b| 36.2| -1.0| 6| 73.95| 37.75| -1.0|
| g|111.7| 1.0| 6| 73.95| 37.75| 1.0|
| c| 12.0| -1.0| 7| 24.15|12.149999999999999| -1.0|
| e| 36.3| 1.0| 7| 24.15|12.149999999999999| 1.0|
+---+-----+-------+---------+---------------+------------------+-----------------+
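If you still want the pandas/scipy route, a hedged sketch (my own, not from the answer) is to use applyInPandas with a single constant group, so zscore sees the whole column at once; note this funnels all rows through one worker, so it only makes sense when the data fits there:
import pandas as pd
import pyspark.sql.functions as F
from scipy.stats import zscore

def add_zscore(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf["y"] = zscore(pdf["x"])          # full-column z-score inside one pandas DataFrame
    return pdf

df_z = df.withColumn("grp", F.lit(1)) \
    .groupBy("grp") \
    .applyInPandas(add_zscore, schema="id string, x double, grp int, y double") \
    .drop("grp")
df_z.show()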

Calculate running total in Pyspark dataframes and break the loop when a condition occurs

I have a Spark dataframe where I need to calculate a running total based on the sum of the current and previous rows' amounts, per col_x. Whenever a negative amount occurs in col_y, I should break the running total of the previous records and restart the running total from the current row.
Sample dataset:
The expected output should be like:
How to achieve this with dataframes using pyspark?
Another way
Create Index
import sys
import pyspark.sql.functions as f
from pyspark.sql.window import Window

df = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
Regenerate Columns
df = df.select('index', 'value.*')
Create groups bounded by negative values
w = Window.partitionBy().orderBy('index').rowsBetween(-sys.maxsize, 0)
df = df.withColumn('cat', f.min('Col_y').over(w))
Cumsum within groups
y = Window.partitionBy('cat').orderBy(f.asc('index')).rowsBetween(Window.unboundedPreceding, 0)
df.withColumn('cumsum', f.round(f.sum('Col_y').over(y), 2)).sort('index').drop('cat', 'index').show()
Outcome
+-----+-------------------+------+
|Col_x| Col_y|cumsum|
+-----+-------------------+------+
| ID1|-17.899999618530273| -17.9|
| ID1| 21.899999618530273| 4.0|
| ID1| 236.89999389648438| 240.9|
| ID1| 4.989999771118164|245.89|
| ID1| 610.2000122070312|856.09|
| ID1| -35.79999923706055| -35.8|
| ID1| 21.899999618530273| -13.9|
| ID1| 17.899999618530273| 4.0|
+-----+-------------------+------+
I am hoping that in a real scenario you will have a timestamp column to order the data; here I am ordering the data by line number with zipWithIndex for the purposes of the explanation.
from pyspark.sql.window import Window
import pyspark.sql.functions as f
from pyspark.sql.functions import *
from pyspark.sql.types import *
data = [
("ID1", -17.9),
("ID1", 21.9),
("ID1", 236.9),
("ID1", 4.99),
("ID1", 610.2),
("ID1", -35.8),
("ID1",21.9),
("ID1",17.9)
]
schema = StructType([
    StructField('Col_x', StringType(), True),
    StructField('Col_y', FloatType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)
df_1 = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
df_1.createOrReplaceTempView("valuewithorder")
w = Window.partitionBy('Col_x').orderBy('index')
w1 = Window.partitionBy('Col_x','group').orderBy('index')
df_final=spark.sql("select value.Col_x,round(value.Col_y,1) as Col_y, index from valuewithorder")
"""Group The data into different groups based on the negative value existance"""
df_final = df_final.withColumn("valueChange",(f.col('Col_y')<0).cast("int")) \
.fillna(0,subset=["valueChange"])\
.withColumn("indicator",(~((f.col("valueChange") == 0))).cast("int"))\
.withColumn("group",f.sum(f.col("indicator")).over(w.rangeBetween(Window.unboundedPreceding, 0)))
"""Cumlative sum with idfferent parititon of group and col_x"""
df_cum_sum = df_final.withColumn("Col_z", sum('Col_y').over(w1))
df_cum_sum.createOrReplaceTempView("FinalCumSum")
df_cum_sum = spark.sql("select Col_x , Col_y ,round(Col_z,1) as Col_z from FinalCumSum")
df_cum_sum.show()
Results of the intermediate data set and the final results:
>>> df_cum_sum.show()
+-----+-----+-----+
|Col_x|Col_y|Col_z|
+-----+-----+-----+
| ID1|-17.9|-17.9|
| ID1| 21.9| 4.0|
| ID1|236.9|240.9|
| ID1| 5.0|245.9|
| ID1|610.2|856.1|
| ID1|-35.8|-35.8|
| ID1| 21.9|-13.9|
| ID1| 17.9| 4.0|
+-----+-----+-----+
>>> df_final.show()
+-----+-----+-----+-----------+---------+-----+
|Col_x|Col_y|index|valueChange|indicator|group|
+-----+-----+-----+-----------+---------+-----+
| ID1|-17.9| 0| 1| 1| 1|
| ID1| 21.9| 1| 0| 0| 1|
| ID1|236.9| 2| 0| 0| 1|
| ID1| 5.0| 3| 0| 0| 1|
| ID1|610.2| 4| 0| 0| 1|
| ID1|-35.8| 5| 1| 1| 2|
| ID1| 21.9| 6| 0| 0| 2|
| ID1| 17.9| 7| 0| 0| 2|
+-----+-----+-----+-----------+---------+-----+
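A hedged variation of the indexing step (my own sketch, not part of either answer), reusing the df built in the snippet above: monotonically_increasing_id avoids the round trip through the RDD API, but the generated IDs only reflect the current partitioning and row order, so on real data you should order by a timestamp or another explicit ordering column instead.
import pyspark.sql.functions as f
from pyspark.sql.window import Window

df_idx = df.withColumn("index", f.monotonically_increasing_id())
w = Window.partitionBy("Col_x").orderBy("index")
# running count of negatives == group number, as in the indicator/group logic above
df_grp = df_idx.withColumn("group", f.sum((f.col("Col_y") < 0).cast("int")).over(w))
w_grp = Window.partitionBy("Col_x", "group").orderBy("index")
df_grp.withColumn("Col_z", f.round(f.sum("Col_y").over(w_grp), 1)) \
    .drop("index", "group").show()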

Pyspark filter where value is in another dataframe

I have two data frames. I need to filter one to only show values that are contained in the other.
table_a:
+---+----+
|AID| foo|
+---+----+
| 1 | bar|
| 2 | bar|
| 3 | bar|
| 4 | bar|
+---+----+
table_b:
+---+
|BID|
+---+
| 1 |
| 2 |
+---+
In the end I want to filter table_a down to only the IDs that are in table_b, like this:
+--+----+
|ID| foo|
+--+----+
| 1| bar|
| 2| bar|
+--+----+
Here is what I'm trying to do
result_table = table_a.filter(table_b.BID.contains(table_a.AID))
But this doesn't seem to be working. It looks like I'm getting ALL values.
NOTE: I can't add any imports other than from pyspark.sql.functions import col.
You can join the two tables and specify how='left_semi'.
A left semi-join returns the rows from the left side of the relation that have a match on the right, and it only keeps the left side's columns.
result_table = table_a.join(table_b, table_a.AID == table_b.BID, how="left_semi")
result_table.show()
+---+---+
|AID|foo|
+---+---+
| 1|bar|
| 2|bar|
+---+---+
In case you have duplicates or multiple values in the second dataframe and you want to take only distinct values, the approach below can be useful for such use cases:
Create the Dataframe
df = spark.createDataFrame([(1,"bar"),(2,"bar"),(3,"bar"),(4,"bar")],[ "col1","col2"])
df_lookup = spark.createDataFrame([(1,1),(1,2)],[ "id","val"])
df.show(truncate=True)
df_lookup.show()
+----+----+
|col1|col2|
+----+----+
| 1| bar|
| 2| bar|
| 3| bar|
| 4| bar|
+----+----+
+---+---+
| id|val|
+---+---+
| 1| 1|
| 1| 2|
+---+---+
Get all the unique values of the val column in the second dataframe into a list variable, then use it to filter the first dataframe:
import pyspark.sql.functions as F

df_lookup_var = df_lookup.groupBy("id").agg(F.collect_set("val").alias("val")).collect()[0][1]
print(df_lookup_var)
df = df.withColumn("case_col", F.when(F.col("col1").isin(df_lookup_var), F.lit("1")).otherwise(F.lit("0")))
df = df.filter(F.col("case_col") == F.lit("1"))
df.show()
+----+----+--------+
|col1|col2|case_col|
+----+----+--------+
| 1| bar| 1|
| 2| bar| 1|
+----+----+--------+
This should work too, since it only needs col: collect the BID values into a Python list and filter with isin:
table_a.where(col("AID").isin([row.BID for row in table_b.select("BID").collect()]))

PySpark DataFrame multiply columns based on values in other columns

Pyspark newbie here. I have a dataframe, say,
+------+----+-----+
|    id|mode|count|
+------+----+-----+
|146360| DOS|   30|
|423541| UNO|    3|
+------+----+-----+
I want a dataframe with a new column aggregate equal to count * 2 when mode is 'DOS' and count * 1 when mode is 'UNO':
+------+----+-----+---------+
|    id|mode|count|aggregate|
+------+----+-----+---------+
|146360| DOS|   30|       60|
|423541| UNO|    3|        3|
+------+----+-----+---------+
Appreciate your inputs and also some pointers to best practices :)
Method 1: using pyspark.sql.functions with when:
from pyspark.sql.functions import when, col
df = df.withColumn('aggregate', when(col('mode') == 'DOS', col('count') * 2).when(col('mode') == 'UNO', col('count') * 1).otherwise(col('count')))
Method 2: using SQL CASE expression with selectExpr:
df = df.selectExpr("*","CASE WHEN mode == 'DOS' THEN count*2 WHEN mode == 'UNO' THEN count*1 ELSE count END AS aggregate")
The result:
+------+----+-----+---------+
| id|mode|count|aggregate|
+------+----+-----+---------+
|146360| DOS| 30| 60|
|423541| UNO| 3| 3|
+------+----+-----+---------+
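On the best-practices pointer: if the mode-to-multiplier mapping grows, one hedged option (my own sketch, not from the answers; multipliers is a made-up name) is to keep it in a small dictionary and build the when chain from it:
from pyspark.sql.functions import when, col

multipliers = {"DOS": 2, "UNO": 1}                  # hypothetical lookup of mode -> factor
items = list(multipliers.items())
expr = when(col("mode") == items[0][0], col("count") * items[0][1])
for mode_value, factor in items[1:]:
    expr = expr.when(col("mode") == mode_value, col("count") * factor)
df = df.withColumn("aggregate", expr.otherwise(col("count")))   # unknown modes keep count as-is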