From pyspark agg function to int - pyspark

I am counting rows by a condition on pyspark
df.agg(count(when((col("my_value")==0),True))).show()
It works as I expected. Then how can I extract the value shown in the table and store it in a Python variable?

If you just want to count the zeros, it is better to do this:
from pyspark.sql import functions as F
pythonVariable = df.where(F.col('my_value') == 0).count()
As you can see, there is no need to convert the zeros to True in order to count them.
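If you prefer to keep the original aggregation, the single aggregated value can also be pulled straight out of the result; a minimal sketch, reusing the column name from the question:
from pyspark.sql import functions as F

# first() returns the only Row of the aggregated result; [0] picks its single column
pythonVariable = df.agg(F.count(F.when(F.col("my_value") == 0, True))).first()[0]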

Related

Pivot by year and also get the sum of all amounts Pyspark

I have data like this
I want output like this
How do I achieve this?
One way of doing this is to pivot, create an array, and sum the values within the array:
from pyspark.sql.functions import *

s = df.groupby('id').pivot('year').agg(sum('amount'))  # pivot
(s.withColumn('x', array(*[x for x in s.columns if x != 'id']))  # create array
 .withColumn('x', expr("reduce(x, cast(0 as bigint), (c, i) -> c + i)"))  # sum
 ).show()
Or use Spark SQL's built-in aggregate function:
s = df.groupby('id').pivot('year').agg(sum('amount'))  # pivot
(s.withColumn('x', array(*[x for x in s.columns if x != 'id']))  # create array
 .withColumn('x', expr("aggregate(x, cast(0 as bigint), (c, i) -> c + i)"))  # sum
 ).show()
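As a side note, the same row-wise total can be built without the SQL higher-order functions by summing the pivoted columns on the Python side; a minimal sketch, assuming the pivoted year columns may contain nulls:
from functools import reduce
from pyspark.sql import functions as F

s = df.groupby('id').pivot('year').agg(F.sum('amount'))

# add the pivoted year columns together, treating missing years as 0
year_cols = [c for c in s.columns if c != 'id']
total = reduce(lambda a, b: a + b, [F.coalesce(F.col(c), F.lit(0)) for c in year_cols])
s.withColumn('x', total).show()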

Call function on Dataframe's columns has error TypeError: Column is not iterable

I am using Databricks with Spark 2.4, and I am coding in Python.
I have created this function to convert null to an empty string:
def xstr(s):
    if s is None:
        return ""
    return str(s)
Then I have the code below:
from pyspark.sql.functions import *
lv_query = """
SELECT
SK_ID_Site, Designation_Site
FROM db_xxx.t_xxx
ORDER BY SK_ID_Site
limit 2"""
lvResult1 = spark.sql(lv_query)
a = lvResult1.select(map(xstr, col("Designation_Site")))
display(a)
I get this error: TypeError: Column is not iterable
What I need to do here is to call a function for each row in my DataFrame. I would like to pass columns as parameters and get a result.
That's not how Spark works. You cannot apply plain Python code directly to the contents of a Spark DataFrame.
There are built-in functions that already do the job for you:
from pyspark.sql import functions as F

a = lvResult1.select(
    F.when(F.col("Designation_Site").isNull(), "").otherwise(
        F.col("Designation_Site").cast("string")
    )
)
If you need more complex logic that you cannot express with the built-in functions, you can use a UDF, but it may significantly hurt performance (check for an existing built-in function before writing your own UDF).
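For completeness, a minimal sketch of the UDF route, reusing the asker's xstr logic; it is only an illustration and is usually slower than the when/otherwise version above:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# wrap the null-to-empty-string logic in a UDF (slower than built-in column functions)
xstr_udf = F.udf(lambda s: "" if s is None else str(s), StringType())

a = lvResult1.select(xstr_udf(F.col("Designation_Site")).alias("Designation_Site"))
display(a)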

How to maintain sort order in PySpark collect_list and collect multiple lists

I want to maintain the date sort order, using collect_list for multiple columns, all with the same date order. I'll need them in the same DataFrame so I can use them to create a time series model input. Below is a sample of the train_data:
I'm using a Window with partitionBy to ensure the sort order by tuning_evnt_start_dt for each Syscode_Stn. I can create one column with this code:
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')

sorted_list_df = train_data\
    .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))\
    .groupBy('Syscode_Stn')\
    .agg(F.max('spp_imp_daily').alias('spp_imp_daily'))
But how do I create two columns in the same new DataFrame?
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')

sorted_list_df = train_data\
    .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))\
    .withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))\
    .groupBy('Syscode_Stn')\
    .agg(F.max('spp_imp_daily').alias('spp_imp_daily'))
Note that MarchMadInd is not shown in the screenshot, but is included in train_data. Explanation of how I got to where I am: https://stackoverflow.com/a/49255498/8691976
Yes, the correct way is to add successive .withColumn statements, followed by a .agg statement that removes the duplicates for each array.
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')

sorted_list_df = train_data\
    .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))\
    .withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))\
    .groupBy('Syscode_Stn')\
    .agg(F.max('spp_imp_daily').alias('spp_imp_daily'),
         F.max('MarchMadInd').alias('MarchMadInd'))
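An alternative pattern, along the lines of the linked answer, is to collect a single sorted array of structs and then split it back into aligned columns; a sketch, assuming the column names from the question:
from pyspark.sql import functions as F

# collect one array of structs per Syscode_Stn, sorted by the date field,
# then pull each field back out as its own (identically ordered) array
sorted_struct_df = (
    train_data
    .groupBy('Syscode_Stn')
    .agg(F.sort_array(F.collect_list(F.struct('tuning_evnt_start_dt',
                                              'spp_imp_daily',
                                              'MarchMadInd'))).alias('events'))
    .select('Syscode_Stn',
            F.col('events.spp_imp_daily').alias('spp_imp_daily'),
            F.col('events.MarchMadInd').alias('MarchMadInd'))
)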

PySpark - iterate rows of a Data Frame

I need to iterate over the rows of a pyspark.sql.dataframe.DataFrame.
I have done it in pandas in the past with the function iterrows(), but I need to find something similar for pyspark without using pandas.
If I do for row in myDF: it iterates over the columns.
Thanks
You can use the select method to operate on your dataframe with a user-defined function, something like this:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

columns = myDF.columns
my_udf = F.udf(lambda data: "do whatever you want here", StringType())
myDF.select(*[my_udf(F.col(c)) for c in columns])
Then, inside the select, you can choose what you want to do with each column.
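If you genuinely need to loop over rows on the driver (small results only), collect() or toLocalIterator() give you Row objects; a minimal sketch, where some_column is a placeholder column name:
# iterate Row objects on the driver; suitable only for small DataFrames
for row in myDF.toLocalIterator():
    print(row['some_column'])  # 'some_column' is a placeholder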

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items = []
for j in range(len(cat_col)):
    var = cat_col[j]
    count_unique_items.append(
        data.select(var).distinct().rdd.map(lambda r: r[0]).count()
    )
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?
Try using approxCountDistinct or countDistinct:
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
but counting distinct elements is expensive.
You can do something like this, but as stated above, distinct element counting is expensive. The single * passes in each value as an argument, so the return value will be 1 row X N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct

distvals = df.agg(*(approxCountDistinct(col(c), rsd=0.01).alias(c) for c in df.columns))
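As a follow-up to the .toPandas() remark above, the one-row result can also be turned into a plain Python dict for quick inspection; a small sketch:
# map each column name to its approximate distinct count
unique_counts = distvals.first().asDict()
print(unique_counts)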
You can get every distinct element of each column with
df.stat.freqItems([list of column names], [support threshold (default = 1%)])
This returns a dataframe with the distinct values, but if you want a dataframe with just the distinct count of each column, use this:
from pyspark.sql.functions import countDistinct

df.select([countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns]).show()
The counting part was taken from here: check number of unique values in each column of a matrix in spark