PySpark - iterate rows of a DataFrame

I need to iterate over the rows of a pyspark.sql.dataframe.DataFrame.
I have done this in pandas in the past with the function iterrows(), but I need to find something similar for PySpark without using pandas.
If I do for row in myDF: it iterates over the columns.
Thanks

You can use the select method to operate on your dataframe with a user-defined function, something like this:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
columns = myDF.columns
my_udf = F.udf(lambda data: "do whatever you want here", StringType())
myDF.select(*[my_udf(col(c)) for c in columns])
Then inside the select you can choose what you want to do with each column.
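For example, here is a minimal runnable sketch (the upper-casing logic is just a hypothetical placeholder, assuming the columns you touch are strings):

from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

# Hypothetical per-value transformation: upper-case each value, keeping nulls as nulls.
upper_udf = F.udf(lambda data: data.upper() if data is not None else None, StringType())

# Apply the UDF to every column while keeping the original column names.
result = myDF.select(*[upper_udf(col(c)).alias(c) for c in myDF.columns])
result.show()

The UDF is evaluated per value on the executors, so this scales to large dataframes, unlike collecting rows to the driver.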

Related

How to create this function in PySpark?

I have a large data frame, consisting of 400+ columns and 14000+ records, that I need to clean.
I have defined a python code to do this, but due to the size of my dataset, I need to use PySpark to clean it. However, I am very unfamiliar with PySpark and don't know how I would create the python function in PySpark.
This is the function in python:
unwanted_characters = ['[', ',', '-', '#', '#', ' ']
cols = df.columns.to_list()

def clean_col(item):
    column = str(item.loc[col])
    for character in unwanted_characters:
        if character in column:
            character_index = column.find(character)
            column = column[:character_index]
    return column

for x in cols:
    df[x] = lrndf.apply(clean_col, axis=1)
This function works in Python, but I cannot apply it to 400+ columns.
I have tried to convert this function to PySpark:
clean_colUDF = udf(lambda z: clean_col(z))

df.select(col("Name"), \
    clean_colUDF(col("Name")).alias("Name")) \
  .show(truncate=False)
But when I run it I get the error:
AttributeError: 'str' object has no attribute 'loc'
Does anyone know how I would modify this so that it works in pyspark?
My columns' datatypes are both integers and strings, so I need it to work on both.
Use built-in pyspark.sql.functions wherever possible, as they provide a ready-made, performant toolkit that should cover 95% of any data transformation requirement without you having to implement your own custom UDFs.
pyspark.sql.functions docs: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html
For what you want to do I would start with regexp_replace():
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.regexp_replace.html#pyspark.sql.functions.regexp_replace
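As a minimal sketch (assuming df is the DataFrame from the question, and that the goal is the same as the pandas function: truncate each value at the first unwanted character), the whole clean-up can be done in one select with regexp_replace:

from pyspark.sql import functions as F

# Drop everything from the first [, comma, -, # or space onwards.
# Casting to string lets the same expression run on integer columns as well.
pattern = r"[\[,\-# ].*"
df_clean = df.select(
    [F.regexp_replace(F.col(c).cast("string"), pattern, "").alias(c) for c in df.columns]
)

Because this stays in native Spark SQL expressions, it avoids the per-row Python overhead of a UDF across 400+ columns.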

Databricks Flatten Nested JSON to Dataframe with PySpark

I am trying to convert nested JSON to a flattened DataFrame.
I have read in the JSON as follows:
df = spark.read.json("/mnt/ins/duedil/combined.json")
The resulting dataframe looks like the following:
I have made a start on flattening the dataframe as follows
display(df.select("companyId", "countryCode"))
The above will display the following
I would like to select 'fiveYearCAGR' under the following path: "financials:element:amortisationOfIntangibles:fiveYearCAGR"
Can someone let me know how to add to the select statement to retrieve the fiveYearCAGR?
Your financials is an array, so if you want to extract something from within it, you need some array transformations.
One example is to use transform.
from pyspark.sql import functions as F

df.select(
    "companyId",
    "countryCode",
    F.transform('financials', lambda x: x['amortisationOfIntangibles']['fiveYearCAGR']).alias('fiveYearCAGR')
)
This will return the fiveYearCAGR in an array. If you need to flatten it further, you can use explode/explode_outer.
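For instance, a hedged sketch of the explode_outer variant (same assumed schema as above), which puts each fiveYearCAGR value on its own row:

from pyspark.sql import functions as F

# explode_outer keeps companies whose financials array is empty or null (as a null row).
df.select(
    "companyId",
    "countryCode",
    F.explode_outer(
        F.transform('financials', lambda x: x['amortisationOfIntangibles']['fiveYearCAGR'])
    ).alias('fiveYearCAGR')
)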

From pyspark agg function to int

I am counting rows that match a condition in PySpark:
df.agg(count(when((col("my_value")==0),True))).show()
It works as I expected. But how can I extract the value shown in the table and store it in a Python variable?
If you just want to count the Trues (the zeros), it is better to do this:
from pyspark.sql import functions as F
pythonVariable = df.where(F.col('my_value') == 0).count()
As you can see, there is no need to change the zeros to True to count them.
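If you would rather keep the original agg expression, the scalar can also be pulled out of its one-row result, for example (a small sketch reusing the column from the question):

from pyspark.sql import functions as F

# first() returns a single Row; index it by the alias to get a plain Python int.
zero_count = df.agg(
    F.count(F.when(F.col('my_value') == 0, True)).alias('zero_count')
).first()['zero_count']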

How to maintain sort order in PySpark collect_list and collect multiple lists

I want to maintain the date sort-order, using collect_list for multiple columns, all with the same date order. I'll need them in the same dataframe so I can use them to create a time series model input. Below is a sample of the "train_data":
I'm using a Window with PartitionBy to ensure sort order by tuning_evnt_start_dt for each Syscode_Stn. I can create one column with this code:
from pyspark.sql import functions as F
from pyspark.sql import Window
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = train_data\
    .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))\
    .groupBy('Syscode_Stn')\
    .agg(F.max('spp_imp_daily').alias('spp_imp_daily'))
but how do I create two columns in the same new dataframe?
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = train_data\
    .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))\
    .withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))\
    .groupBy('Syscode_Stn')\
    .agg(F.max('spp_imp_daily').alias('spp_imp_daily'))
Note that MarchMadInd is not shown in the screenshot, but is included in train_data. Explanation of how I got to where I am: https://stackoverflow.com/a/49255498/8691976
Yes, the correct way is to add successive .withColumn statements, followed by a .agg statement that removes the duplicates for each array.
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')

sorted_list_df = train_data\
    .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))\
    .withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))\
    .groupBy('Syscode_Stn')\
    .agg(F.max('spp_imp_daily').alias('spp_imp_daily'),
         F.max('MarchMadInd').alias('MarchMadInd'))

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items = []
for j in range(len(cat_col)):
    var = cat_col[j]
    count_unique_items.append(data.select(var).distinct().rdd.map(lambda r: r[0]).count())
cat_col contains the column names of all the categorical variables.
Is there any way to optimize this?
Try using approxCountDistinct or countDistinct:
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
but counting distinct elements is expensive.
You can do something like this, but as stated above, distinct element counting is expensive. The single * unpacks the generator so each per-column aggregation is passed as a separate argument, and the return value is 1 row x N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct
distvals = df.agg(*(approxCountDistinct(col(c), rsd=0.01).alias(c) for c in df.columns))
You can get the frequent items of each column with
df.stat.freqItems([list with column names], [support fraction (default = 1%)])
This returns a dataframe with the frequent values, but if you want a dataframe with just the distinct count of each column, use this:
from pyspark.sql.functions import countDistinct
df.select([countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns]).show()
The counting part was taken from here: check number of unique values in each column of a matrix in spark