Pivot by year and also get the sum of all amounts Pyspark

Pivot by year and also get the sum of all amounts Pyspark - pyspark

I have data like this
I want output like this
How do I achieve this?

One way of doing is: pivot, create an array and sum values within the array
from pyspark.sql.functions import *
s =df.groupby('id').pivot('year').agg(sum('amount'))#Pivot
(s.withColumn('x', array(*[x for x in s.columns if x!='id']))#create array
.withColumn('x', expr("reduce(x,cast(0 as bigint),(c,i)-> c+i)"))#sum
).show()
OR use pysparks inbuilt aggregate function
s =df.groupby('id').pivot('year').agg(sum('amount'))#Pivot
(s.withColumn('x', array(*[x for x in s.columns if x!='id']))#create array
.withColumn('x', expr("aggregate(x,cast(0 as bigint),(c,i)-> c+i)"))#sum
).show()

Related

Pyspark DataFrame - Discretize the selected numerical column and then apply groupby and crosstab function

I have dataframe which has 100+ numerical columns. I want to descretize some columns from it and then apply groupby function and crosstab function on these discretized columns.
Currently, I am using a loop to iterate over all selected numerical columns. But it is very time-consuming. is there any better and cleaner solution? My code looks like below:
from pyspark.ml.feature import QuantileDiscretizer
df_num = spark.createDataFrame(data = [],schema = StructType([]))
for name in number_columns:
steps = QuantileDiscretizer(numBuckets=10,inputCol=name,outputCol=name+'Bin')
Selected_data=steps.fit(Selected_data).transform(Selected_data)
tmp=Selected_data.groupBy(name+'Bin').agg(mean("ABC"),mean("XYZ"),count("ABC"),count("XYZ")).withColumnRenamed(name+'Bin','Category')
temp=Selected_data.crosstab(name+'Bin', 'code').withColumnRenamed(name+'Bin_code','Category')
temp=temp.join(tmp, 'Category','inner')
df_num=df_num.unionByName(temp,allowMissingColumns=True)

From pyspark agg function to int

I am counting rows by a condition on pyspark
df.agg(count(when((col("my_value")==0),True))).show()
It works as I expected. Then how can I extract the value showed in the table to store to a Python variable?

If you just want to count the Trues (ceros), you should better do this:
from pyspark.sql import functions as F
pythonVariable = df.where(F.col('my_value') == 0).collect[0][0]
As you can see, there is no need to change the ceros to True to count them.

How to maintain sort order in PySpark collect_list and collect multiple lists

I want to maintain the date sort-order, using collect_list for multiple columns, all with the same date order. I'll need them in the same dataframe so I can utilize to create a time series model input. Below is a sample of the "train_data":
I'm using a Window with PartitionBy to ensure sort order by tuning_evnt_start_dt for each Syscode_Stn. I can create one column with this code:
from pyspark.sql import functions as F
from pyspark.sql import Window
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = train_data
.withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w)
)\
.groupBy('Syscode_Stn')\
.agg(F.max('spp_imp_daily').alias('spp_imp_daily'))
but how do I create two columns in the same new dataframe?
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = train_data
.withColumn('spp_imp_daily',F.collect_list('spp_imp_daily').over(w))
.withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))
.groupBy('Syscode_Stn')
.agg(F.max('spp_imp_daily').alias('spp_imp_daily')))
Note that MarchMadInd is not shown in the screenshot, but is included in train_data. Explanation of how I got to where I am: https://stackoverflow.com/a/49255498/8691976

Yes, the correct way is to add successive .withColumn statements, followed by a .agg statement that removes the duplicates for each array.
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = train_data.withColumn('spp_imp_daily',
F.collect_list('spp_imp_daily').over(w)
)\
.withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))\
.groupBy('Syscode_Stn')\
.agg(F.max('spp_imp_daily').alias('spp_imp_daily'),
F.max('MarchMadInd').alias('MarchMadInd')
)

PySpark - iterate rows of a Data Frame

I need to iterate rows of a pyspark.sql.dataframe.DataFrame.DataFrame.
I have done it in pandas in the past with the function iterrows() but I need to find something similar for pyspark without using pandas.
If I do for row in myDF: it iterates columns.DataFrame
Thanks

You can use select method to operate on your dataframe using a user defined function something like this :
columns = header.columns
my_udf = F.udf(lambda data: "do what ever you want here " , StringType())
myDF.select(*[my_udf(col(c)) for c in columns])
then inside the select you can choose what you want to do with each column .

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items=[]
for j in range(len(cat_col)):
var=cat_col[j]
count_unique_items.append(data.select(var).distinct().rdd.map(lambda r:r[0]).count())
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?

Try using approxCountDistinct or countDistinct:
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
but counting distinct elements is expensive.

You can do something like this, but as stated above, distinct element counting is expensive. The single * passes in each value as an argument, so the return value will be 1 row X N columns. I frequently do a .toPandas() call to make it easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct
distvals = df.agg(*(approxCountDistinct(col(c), rsd = 0.01).alias(c) for c in
df.columns))

You can use get every different element of each column with
df.stats.freqItems([list with column names], [percentage of frequency (default = 1%)])
This returns you a dataframe with the different values, but if you want a dataframe with just the count distinct of each column, use this:
from pyspark.sql.functions import countDistinct
df.select( [ countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()
The part of the count, taken from here: check number of unique values in each column of a matrix in spark