How to create this function in PySpark? - pyspark

I have a large data frame, consisting of 400+ columns and 14000+ records, that I need to clean.
I have written a Python function to do this, but because of the size of my dataset I need to use PySpark to clean it. However, I am very unfamiliar with PySpark and don't know how to write the equivalent function in PySpark.
This is the function in Python:
unwanted_characters = ['[', ',', '-', '#', '#', ' ']
cols = df.columns.to_list()
def clean_col(item):
    column = str(item.loc[col])
    for character in unwanted_characters:
        if character in column:
            character_index = column.find(character)
            column = column[:character_index]
    return column
for x in cols:
    df[x] = lrndf.apply(clean_col, axis=1)
This function works in Python but I cannot apply it to 400+ columns.
I have tried to convert this function to PySpark:
clean_colUDF = udf(lambda z: clean_col(z))
df.select(col("Name"), \
    clean_colUDF(col("Name")).alias("Name") ) \
    .show(truncate=False)
But when I run it I get the error:
AttributeError: 'str' object has no attribute 'loc'
Does anyone know how I would modify this so that it works in PySpark?
My columns' data types include both integers and strings, so I need it to work on both.

Use the built-in pyspark.sql.functions wherever possible. They provide a ready-made, performant toolkit that should be able to cover 95% of data transformation requirements without you having to implement your own custom UDFs.
pyspark.sql.functions docs: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html
For what you want to do, I would start with regexp_replace():
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.regexp_replace.html#pyspark.sql.functions.regexp_replace
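A minimal sketch of the regexp_replace approach, assuming the goal matches the pandas function in the question: keep only the text before the first unwanted character. The names TRUNCATE_PATTERN and clean_all_columns are illustrative, not part of PySpark. Every column is cast to string first so that integer columns are handled as well.

```python
import re

# Matches the first unwanted character ([ , - # or space) and everything after it.
# (?s) lets "." also match newlines; this syntax is valid in both Java and Python regex.
TRUNCATE_PATTERN = r"(?s)[\[,\-# ].*"


def clean_all_columns(df):
    """Cast every column to string and cut it at the first unwanted character."""
    # Imported here so the pattern above can be checked without a Spark installation.
    from pyspark.sql import functions as F

    return df.select(
        *[
            F.regexp_replace(F.col(c).cast("string"), TRUNCATE_PATTERN, "").alias(c)
            for c in df.columns
        ]
    )


# Quick local check of the pattern with Python's own regex engine.
print(re.sub(TRUNCATE_PATTERN, "", "abc-123 x"))  # abc
```

With a Spark DataFrame named df, calling clean_all_columns(df) would return every column as a cleaned string; columns that should stay numeric can be cast back afterwards.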

Related

pyspark add int column to a fixed date

I have a fixed date "2000/01/01" and a dataframe:
data1 = [{'index':1,'offset':50}]
data_p = sc.parallelize(data1)
df = spark.createDataFrame(data_p)
I want to create a new column by adding the offset column to this fixed date.
I tried different methods, but I cannot pass the column as an argument, and expr fails with the error:
function is neither a registered temporary function nor a permanent function registered in the database 'default'
The only solution I can think of is
df = df.withColumn("zero",lit(datetime.strptime('2000/01/01', '%Y/%m/%d')))
df.withColumn("date_offset",expr("date_add(zero,offset)")).drop("zero")
Since I cannot use lit and datetime.strptime in the expr, I have to use this approach which creates a redundant column and redundant operations.
Any better way to do it?
Since you have tagged this as a PySpark question, in Python you can do the following:
from pyspark.sql import functions as F
df.withColumn("date_offset", F.lit("2000-01-01").cast("date") + F.col("offset").cast("int")).show()
Edit: as per the comment below, let's assume there is an extra column named type; the following code then picks the operation based on it:
df.withColumn("date_offset", F.expr("case when type = 'month' then add_months(cast('2000-01-01' as date), offset) else date_add(cast('2000-01-01' as date), cast(offset as int)) end")).show()

How do you use aggregated values within PySpark SQL when() clause?

I am trying to learn PySpark and how to use SQL when() clauses to better categorize my data (see here: https://sparkbyexamples.com/spark/spark-case-when-otherwise-example/). What I can't figure out is how to insert actual scalar values into the when() conditions for explicit comparison. It seems the aggregate functions return tabular values rather than actual float values.
I keep getting this error message: unsupported operand type(s) for -: 'method' and 'method'. When I tried running functions to aggregate another column in the original data frame, I noticed the result was not a flat scalar but a table (agg(select(f.stddev("Col")) gives a result like "DataFrame[stddev_samp(TAXI_OUT): double]"). Here is a sample of what I am trying to accomplish, if you want to replicate it. How can I get aggregate values like the standard deviation and mean inside the when() clause, so that I can use them to categorize my new column?
samp = spark.createDataFrame(
    [("A","A1",4,1.25),("B","B3",3,2.14),("C","C2",7,4.24),("A","A3",4,1.25),("B","B1",3,2.14),("C","C1",7,4.24)],
    ["Category","Sub-cat","quantity","cost"])
psMean = samp.agg({'quantity':'mean'})
psStDev = samp.agg({'quantity':'stddev'})
psCatVect = samp.withColumn('quant_category', f.when(samp['quantity']<=(psMean-psStDev),'small').otherwise('not small'))
psMean and psStDev in your example are DataFrames; you need to use the collect() method to extract the scalar values:
psMean = samp.agg({'quantity':'mean'}).collect()[0][0]
psStDev = samp.agg({'quantity':'stddev'}).collect()[0][0]
You could also collect all the stats into a single pandas DataFrame and refer to it later in the PySpark code:
from pyspark.sql import functions as F
stats = (
    samp.select(
        F.mean("quantity").alias("mean"),
        F.stddev("quantity").alias("std")
    ).toPandas()
)
(
    samp.withColumn('quant_category',
        F.when(
            samp['quantity'] <= stats["mean"].item() - stats["std"].item(),
            'small')
        .otherwise('not small')
    )
    .toPandas()
)

Call function on Dataframe's columns has error TypeError: Column is not iterable

I am using Databricks with Spark 2.4 and I am coding in Python.
I have created this function to convert null to an empty string:
def xstr(s):
    if s is None:
        return ""
    return str(s)
Then I have the code below:
from pyspark.sql.functions import *
lv_query = """
SELECT
SK_ID_Site, Designation_Site
FROM db_xxx.t_xxx
ORDER BY SK_ID_Site
limit 2"""
lvResult = spark.sql(lv_query)
a = lvResult1.select(map(xstr, col("Designation_Site")))
display(a)
I get this error: TypeError: Column is not iterable
What I need to do here is call a function for each row in my DataFrame. I would like to pass columns as parameters and get a result back.
That's not how Spark works. You cannot apply plain Python code directly to the contents of a Spark DataFrame.
There are already built-in functions that do the job for you:
from pyspark.sql import functions as F
a = lvResult1.select(
    F.when(F.col("Designation_Site").isNull(), "").otherwise(
        F.col("Designation_Site").cast("string")
    )
)
If you need more complex logic that the built-in functions cannot express, you can use a UDF, but it may hurt performance considerably, so check for an existing built-in function before writing your own UDF.
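If a UDF really is needed, a sketch could look like the following. It reuses the xstr logic from the question; add_clean_column is a hypothetical helper name, and the imports sit inside the function only so that the plain Python part can be tried without Spark.

```python
def xstr(s):
    """Plain Python logic: None becomes an empty string, anything else is stringified."""
    return "" if s is None else str(s)


def add_clean_column(df, column_name):
    """Apply xstr row by row through a UDF (slower than built-in column functions)."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Wrap the Python function so Spark can call it on each value of the column.
    xstr_udf = F.udf(xstr, StringType())
    return df.withColumn(column_name, xstr_udf(F.col(column_name)))
```

The key point is that the UDF is built once and then called on a Column, rather than passing a Column to Python's map().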

pyspark udf with parameter

I need to convert a PySpark DataFrame column, checkin_time, from milliseconds to a timezone-adjusted timestamp. The timezone information is in another column, tz_info.
I tried the following:
def tz_adjust(x,tz_info):
    if tz_info:
        y = col(x)+ col(tz_info)
        return from_unixtime(col(y)/1000)
    else:
        return from_unixtime(col(x)/1000)
def udf_tz_adjust(tz_info):
    return udf(lambda l: tz_adjust(l, tz_info))
When applying this UDF to the column:
df.withColumn('checkin_time',udf_tz_adjust('time_zone')(col('checkin_time')))
I got this error:
AttributeError: 'NoneType' object has no attribute '_jvm'
Any idea how to pass the second column as a parameter to the UDF?
Thanks.
IMHO, what you are doing is a combination of a UDF and a partial function, which can get tricky. I don't think you need a UDF at all for this purpose. You can do the following:
# not tested; assumes tz_info holds a numeric offset in milliseconds, as in the question's code
from pyspark.sql.functions import col, from_unixtime, when
df.withColumn('checkin_time', when(col("tz_info").isNotNull(), from_unixtime((col("checkin_time") + col("tz_info")) / 1000)).otherwise(from_unixtime(col("checkin_time") / 1000)))
UDFs have their own serde inefficiencies, which are even worse with Python because of the extra overhead of converting Scala data types into Python data types.

PySpark - iterate rows of a Data Frame

I need to iterate over the rows of a pyspark.sql.dataframe.DataFrame.
I have done it in pandas in the past with the function iterrows(), but I need to find something similar for PySpark without using pandas.
If I do for row in myDF: it iterates over the columns.
Thanks
You can use the select method to operate on your DataFrame with a user-defined function, something like this:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
columns = myDF.columns
my_udf = F.udf(lambda data: "do whatever you want here", StringType())
myDF.select(*[my_udf(F.col(c)) for c in columns])
Then, inside the select, you can choose what to do with each column.
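If the rows really have to be visited one by one on the driver, closer to what iterrows() does in pandas, a sketch along these lines may help. It relies on DataFrame.toLocalIterator() and Row.asDict(); the helper name iter_rows_as_dicts is made up for illustration.

```python
def iter_rows_as_dicts(df):
    """Yield each row of a Spark DataFrame as a plain dict, one at a time."""
    # toLocalIterator() streams rows to the driver partition by partition,
    # so the whole DataFrame does not have to fit in driver memory at once.
    for row in df.toLocalIterator():
        yield row.asDict()
```

A loop such as for row in iter_rows_as_dicts(myDF) then gives access to values by column name. This runs on the driver only, so for large data the column-wise select approach scales much better.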