How do you use aggregated values within a PySpark SQL when() clause? - pyspark

I am trying to learn PySpark and how to use SQL when() clauses to better categorize my data. (See here: https://sparkbyexamples.com/spark/spark-case-when-otherwise-example/) What I can't figure out is how to insert actual scalar values into the when() conditions for an explicit comparison. The aggregate functions seem to return tabular values rather than plain float values.
I keep getting this error message:
unsupported operand type(s) for -: 'method' and 'method'
When I tried aggregating another column of the original data frame, I noticed the result was not a flat scalar but a table: agg(select(f.stddev("Col")) gives a result like "DataFrame[stddev_samp(TAXI_OUT): double]".
Here is a sample of what I am trying to accomplish, if you want to replicate it. How can I get aggregate values like the standard deviation and mean into the when() clause so that I can use them to categorize a new column?
from pyspark.sql import functions as f

samp = spark.createDataFrame(
    [("A","A1",4,1.25),("B","B3",3,2.14),("C","C2",7,4.24),("A","A3",4,1.25),("B","B1",3,2.14),("C","C1",7,4.24)],
    ["Category","Sub-cat","quantity","cost"])
psMean = samp.agg({'quantity':'mean'})
psStDev = samp.agg({'quantity':'stddev'})
psCatVect = samp.withColumn('quant_category', f.when(samp['quantity'] <= (psMean - psStDev), 'small').otherwise('not small'))

psMean and psStDev in your example are DataFrames; you need to call the collect() method to extract the scalar values:
psMean = samp.agg({'quantity':'mean'}).collect()[0][0]
psStDev = samp.agg({'quantity':'stddev'}).collect()[0][0]

You could also collect all the statistics into a single pandas DataFrame and refer to it later in the PySpark code:
from pyspark.sql import functions as F
stats = (
    samp.select(
        F.mean("quantity").alias("mean"),
        F.stddev("quantity").alias("std")
    ).toPandas()
)
(
    samp.withColumn('quant_category',
        F.when(
            samp['quantity'] <= stats["mean"].item() - stats["std"].item(),
            'small')
        .otherwise('not small')
    )
    .toPandas()
)
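To see what the when() condition actually compares against, the same numbers can be reproduced in plain Python. This is only a sanity-check sketch, not part of either answer: it assumes Spark's stddev is the sample standard deviation (the question's own output shows stddev_samp), which corresponds to statistics.stdev.

```python
import statistics

# Quantities from the question's sample data.
quantities = [4, 3, 7, 4, 3, 7]

mean = statistics.mean(quantities)
std = statistics.stdev(quantities)  # sample standard deviation (n - 1)
threshold = mean - std              # the scalar used inside when()

# Same rule as when(quantity <= threshold, 'small').otherwise('not small').
categories = ["small" if q <= threshold else "not small" for q in quantities]
print(round(threshold, 3), categories)
```

With this sample the threshold is roughly 2.80 and the smallest quantity is 3, so every row ends up as 'not small'. Seeing only one category in the Spark result is therefore expected for this data rather than a sign that the collect() approach failed.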

Related

How to create this function in PySpark?

I have a large data frame, with 400+ columns and 14000+ records, that I need to clean.
I have written Python code to do this but, because of the size of my dataset, I need to use PySpark to clean it. However, I am very unfamiliar with PySpark and don't know how to write the equivalent function in PySpark.
This is the function in Python:
unwanted_characters = ['[', ',', '-', '#', '#', ' ']
cols = df.columns.to_list()
def clean_col(item):
    column = str(item.loc[col])
    for character in unwanted_characters:
        if character in column:
            character_index = column.find(character)
            column = column[:character_index]
    return column
for x in cols:
    df[x] = lrndf.apply(clean_col, axis=1)
This function works in Python, but I cannot apply it to 400+ columns.
I have tried to convert this function to PySpark:
clean_colUDF = udf(lambda z: clean_col(z))
df.select(col("Name"), \
    clean_colUDF(col("Name")).alias("Name") ) \
    .show(truncate=False)
But when I run it I get the error:
AttributeError: 'str' object has no attribute 'loc'
Does anyone know how I would modify this so that it works in PySpark?
My columns contain both integers and strings, so I need it to work on both.
Use the built-in pyspark.sql.functions wherever possible. They provide a ready-made, performant toolkit that should cover about 95% of data transformation requirements without you having to implement your own custom UDFs.
pyspark.sql.functions docs: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html
For what you want to do, I would start with regexp_replace():
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.regexp_replace.html#pyspark.sql.functions.regexp_replace
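The question's function keeps only the text before the first unwanted character, which can be written as a single regular expression. The sketch below checks such a pattern in plain Python before it is handed to Spark. PATTERN, clean_value and the sample values are made up for illustration and are not from the original post.

```python
import re

# Character class built from the question's unwanted_characters list.
# The pattern matches the first unwanted character and everything after it.
PATTERN = r"[\[,\-# ].*"

def clean_value(value):
    """Plain-Python stand-in for replacing PATTERN with an empty string."""
    # str() mirrors casting integer columns to string before cleaning.
    return re.sub(PATTERN, "", str(value))

# Hypothetical sample values, including an integer.
samples = ["abc-def", "12#34", "hello world", "x[1]", 42, "clean"]
cleaned = [clean_value(s) for s in samples]
print(cleaned)
```

In PySpark the same pattern could presumably be applied per column with regexp_replace(col(name).cast("string"), PATTERN, "") inside a select over df.columns. For a pattern this simple, Java's regex syntax (which Spark uses) and Python's re should agree, but that is worth confirming on a few real values.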

pyspark add int column to a fixed date

I have a fixed date "2000/01/01" and a dataframe:
data1 = [{'index':1,'offset':50}]
data_p = sc.parallelize(data1)
df = spark.createDataFrame(data_p)
I want to create a new column by adding the offset column to this fixed date
I tried different methods, but I cannot pass the column in directly, and expr fails with an error like:
function is neither a registered temporary function nor a permanent function registered in the database 'default'
The only solution I can think of is
df = df.withColumn("zero",lit(datetime.strptime('2000/01/01', '%Y/%m/%d')))
df.withColumn("date_offset",expr("date_add(zero,offset)")).drop("zero")
Since I cannot use lit and datetime.strptime in the expr, I have to use this approach which creates a redundant column and redundant operations.
Any better way to do it?
Since you have tagged this as a PySpark question, in Python you can do the following (with pyspark.sql.functions imported as F):
df.withColumn("date_offset",F.lit("2000-01-01").cast("date") + F.col("offset").cast("int")).show()
Edit: per the comment below, suppose there is an extra column named type; the following code then picks the operation based on it:
df.withColumn("date_offset",F.expr("case when type ='month' then add_months(cast('2000-01-01' as date),offset) else date_add(cast('2000-01-01' as date),cast(offset as int)) end ")).show()
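To know what value to expect in date_offset for the question's single row, the day-based case can be checked in plain Python. This is only a reference calculation under the assumption that the offset counts days; it does not cover the add_months branch.

```python
from datetime import date, timedelta

# Fixed date and offset taken from the question.
base = date(2000, 1, 1)
offset = 50

# Adding a day count to a date, as date_add does for the day-based case.
expected = base + timedelta(days=offset)
print(expected.isoformat())
```

If the Spark column shows a different date for offset 50, the offset is probably being interpreted in a unit other than days.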

Call function on Dataframe's columns has error TypeError: Column is not iterable

I am using Databricks with Spark 2.4, and I am coding in Python.
I have created this function to convert null to an empty string:
def xstr(s):
    if s is None:
        return ""
    return str(s)
Then I have the code below:
from pyspark.sql.functions import *
lv_query = """
    SELECT
        SK_ID_Site, Designation_Site
    FROM db_xxx.t_xxx
    ORDER BY SK_ID_Site
    limit 2"""
lvResult1 = spark.sql(lv_query)
a = lvResult1.select(map(xstr, col("Designation_Site")))
display(a)
I get this error: TypeError: Column is not iterable
What I need to do here is call a function for each row in my dataframe. I would like to pass columns as parameters and get a result back.
That's not how Spark works. You cannot apply plain Python code directly to the contents of a Spark dataframe.
There are already built-in functions that do the job for you:
from pyspark.sql import functions as F
a = lvResult1.select(
    F.when(F.col("Designation_Site").isNull(), "").otherwise(
        F.col("Designation_Site").cast("string")
    )
)
If you need more complex logic that the built-in functions cannot express, you can use a UDF, but it may hurt performance considerably (check for an existing built-in function before building your own UDF).

PySpark - iterate rows of a Data Frame

I need to iterate over the rows of a pyspark.sql.dataframe.DataFrame.
I have done this in pandas in the past with the function iterrows(), but I need something similar for PySpark that does not use pandas.
If I do for row in myDF: it iterates over the columns.
Thanks
You can use the select method to operate on your dataframe with a user-defined function, something like this:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
columns = myDF.columns
my_udf = F.udf(lambda data: "do whatever you want here", StringType())
myDF.select(*[my_udf(F.col(c)) for c in columns])
Then, inside the select, you can choose what you want to do with each column.

finding the count, pass and fail percentage of the multiple values from single column while aggregating another column using pyspark

Data
I want to group by Column1 and calculate, for each value, the percentage of passed and failed entries as well as their counts.
Example output I am looking for
Using PySpark, I wrote the code below, but I am only getting the percentages:
levels = ["passed", "failed","blocked"]
exprs = [avg((col("Column2") == level).cast("double")*100).alias(level)
         for level in levels]
df1 = sparkSession.read.json(hdfsPath)
result1 = df1.select('Column1','Column2').groupBy("Column1").agg(*exprs)
You would need to explicitly calculate the counts, and then do some string formatting to combine the percentages and the counts into a single column:
# Spark's sum is imported under an alias so that Python's built-in sum is not used by mistake
from pyspark.sql.functions import avg, col, concat, lit, sum as spark_sum
levels = ["passed", "failed","blocked"]
# percentage aggregations
pct_exprs = [avg((col("Column2") == level).cast("double")*100).alias('{}_pct'.format(level))
             for level in levels]
# count aggregations
count_exprs = [spark_sum((col("Column2") == level).cast("int")).alias('{}_count'.format(level))
               for level in levels]
# combine all aggregations
exprs = pct_exprs + count_exprs
# string formatting select expressions
select_exprs = [
    concat(
        col('{}_pct'.format(level)).cast('string'),
        lit('('),
        col('{}_count'.format(level)).cast('string'),
        lit(')')
    ).alias('{}_viz'.format(level))
    for level in levels
]
df1 = sparkSession.read.json(hdfsPath)
result1 = (
    df1
    .select('Column1','Column2')
    .groupBy("Column1")
    .agg(*exprs)
    .select('Column1', *select_exprs)
)
NB: it seems like you are trying to use Spark to make a nice visualization of the results of your calculations, but I don't think Spark is well-suited for this task. If you have few enough records that you can see all of them at once, you might as well work locally in Pandas or something similar. And if you have enough records that using Spark makes sense, then you can't see all of them at once anyway so it doesn't matter too much whether they look nice.
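For a small sample, the expected "percentage(count)" strings can be worked out locally and compared with the Spark result. The sketch below is plain Python; the rows and the summarize helper are hypothetical, made up purely for illustration.

```python
from collections import Counter

levels = ["passed", "failed", "blocked"]

# Hypothetical (Column1, Column2) rows, not taken from the original data.
rows = [
    ("A", "passed"), ("A", "passed"), ("A", "failed"), ("A", "blocked"),
    ("B", "failed"), ("B", "passed"),
]

def summarize(rows):
    """Return {group: {level: 'pct(count)'}}, mirroring the *_viz columns."""
    totals = Counter(group for group, _ in rows)  # rows per group
    counts = Counter(rows)                        # rows per (group, level)
    return {
        group: {
            level: "{}({})".format(
                100.0 * counts[(group, level)] / total,  # same as avg(indicator) * 100
                counts[(group, level)],                  # same as sum(indicator)
            )
            for level in levels
        }
        for group, total in totals.items()
    }

result = summarize(rows)
print(result)
```

The percentage is the count divided by the group size, which is what averaging the 0/1 indicator computes in the aggregation, so the two columns should always be consistent with each other.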