Pyspark groupby and count null values

PySpark Dataframe Groupby and Count Null Values
Referring to the solution linked above, I am trying to apply the same logic, but grouping by "country" and getting the null count of another column, and I am getting a "Column is not iterable" failure. Can someone help with this?
df7.groupby("country").agg(*(sum(col(c).isNull().cast("int")).alias(c) for c in columns))

from pyspark.sql import functions as funcs

# For every column, count the rows that are NaN or NULL.
covid_india_df.select(
    [
        funcs.count(
            funcs.when(funcs.isnan(clm) | funcs.col(clm).isNull(), clm)
        ).alias(clm)
        for clm in covid_india_df.columns
    ]
).show()
The above approach may help you to get correct results. Check here for a complete example.
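For reference, the grouped version from the question can also be made to work: the "Column is not iterable" error typically comes from Python's built-in sum being used instead of pyspark.sql.functions.sum. A minimal sketch, assuming df7 and the columns list from the question:

from pyspark.sql.functions import col, sum as spark_sum

# 'columns' is assumed to be the list of columns to check, as in the question.
# Per country, count nulls by summing 1/0 null flags.
df7.groupby("country").agg(
    *(spark_sum(col(c).isNull().cast("int")).alias(c) for c in columns)
).show()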

Related

Pyspark to get the column names of the columns that contains null values

I have a DataFrame and I want to get the names of the columns that contain one or more null values.
So far, this is what I've done:
df.select([c for c in tbl_columns_list if df.filter(F.col(c).isNull()).count() > 0]).columns
I have almost 500 columns in my DataFrame, and when I execute that code it becomes incredibly slow for a reason I don't know. Do you have any idea how I can make it work and how I can optimize it? I need an optimized solution in PySpark. Thanks in advance.
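One common way to speed this up is to count the nulls for all columns in a single aggregation and then filter the result on the driver, instead of running one filter/count job per column. A minimal sketch, assuming df and tbl_columns_list as in the question:

from pyspark.sql import functions as F

# One pass over the data: count the nulls in every column.
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in tbl_columns_list]
).first().asDict()

# Keep only the columns that contain at least one null.
cols_with_nulls = [c for c, n in null_counts.items() if n > 0]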

How to do a groupby rank and add it as a column to existing dataframe in spark scala?

Currently this is what I'm doing:
val new_df = old_df.groupBy("column1").count().withColumnRenamed("count", "column1_count")
val new_df_rankings = new_df.withColumn(
  "column1_count_rank",
  dense_rank().over(Window.orderBy($"column1_count".desc))
).select("column1_count", "column1_count_rank")
But really all I'm looking to do is add a column to the original df (old_df) called "column1_count_rank" without going through all these intermediate steps and merging back.
Is there a way to do this?
Thanks and have a great day!
When you apply an aggregation, the result is computed into a new DataFrame.
Can you give a sample input and expected output?
old_df.groupBy("column1")
  .agg(count("*").alias("column1_count"))
  .withColumn("column1_count_rank", dense_rank().over(Window.orderBy($"column1_count".desc)))
  .select("column1_count", "column1_count_rank")
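For what it's worth, the count and the rank can also be attached to old_df directly with two window functions, avoiding the groupBy-and-join round trip. A sketch of the idea in PySpark (the Scala version follows the same pattern), using the names from the question:

from pyspark.sql import Window
from pyspark.sql import functions as F

count_window = Window.partitionBy("column1")
rank_window = Window.orderBy(F.col("column1_count").desc())

new_df = (
    old_df
    .withColumn("column1_count", F.count(F.lit(1)).over(count_window))   # rows per column1 value
    .withColumn("column1_count_rank", F.dense_rank().over(rank_window))  # rank of that count
)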

PYSPARK : Finding a Mean of a variables excluding the top 1 percentile of data

I have a dataset that is grouped by multiple variables, for which we are finding aggregates like mean, standard deviation, etc. Now I want to find the mean of a variable excluding the top 1 percentile of the data.
I am trying something like
df_final = df.groupby(groupbyElement).agg(
    mean('value').alias('Mean'),
    stddev('value').alias('Stddev'),
    expr('percentile(value, array(0.99))')[0].alias('99_percentile'),
    mean(when(col('value') <= col('99_percentile'), col('value')))
)
But it seems Spark cannot reference an aggregate alias defined in the same agg statement.
I even tried this,
df_final = df.groupby(groupbyElement).agg(
    mean('value').alias('Mean'),
    stddev('value').alias('Stddev'),
    mean(when(col('value') <= expr('percentile(value, array(0.99))')[0], col('value')))
)
But it throws the error below:
pyspark.sql.utils.AnalysisException: 'It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.
I hope someone will be able to answer this.
Update:
I tried doing it the other way.
Here's a straightforward modification of your code. It will aggregate df twice. As far as I can tell, that's what is required.
from pyspark.sql.functions import col, expr, mean, stddev, when

# First compute the per-group 99th percentile, join it back onto df, then aggregate.
df_final = (
    df.join(
        df.groupby(groupbyElement)
          .agg(expr('percentile(value, array(0.99))')[0].alias('99_percentile')),
        on=groupbyElement, how="left"  # join on the grouping columns
    )
    .groupby(groupbyElement)
    .agg(
        mean('value').alias('Mean'),
        stddev('value').alias('Stddev'),
        mean(when(col('value') <= col('99_percentile'), col('value'))).alias('Mean_99')
    )
)
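If the self-join is undesirable, the same per-group percentile can be attached with a window function instead. A sketch, under the assumption that groupbyElement is the list of grouping columns and that percentile is available as an aggregate function in your Spark version:

from pyspark.sql import Window
from pyspark.sql.functions import col, expr, mean, stddev, when

# Attach the per-group 99th percentile to every row, then aggregate once.
w = Window.partitionBy(*groupbyElement)

df_final = (
    df.withColumn('99_percentile', expr('percentile(value, 0.99)').over(w))
      .groupby(groupbyElement)
      .agg(
          mean('value').alias('Mean'),
          stddev('value').alias('Stddev'),
          mean(when(col('value') <= col('99_percentile'), col('value'))).alias('Mean_99')
      )
)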

Handling NULL values in Pyspark in Column expression

I have been scratching my head with a problem in pyspark.
I want to conditionally apply a UDF on a column depending on whether it is NULL or not. One constraint is that I do not have access to the DataFrame at the location where I am writing the code; I only have access to a Column object.
Thus, I cannot simply do:
df.where(my_col.isNull()).select(my_udf(my_col)).toPandas()
Therefore, having only access to a Column object, I was writing the following:
my_res_col = F.when(my_col.isNull(), F.lit(0.0)) \
    .otherwise(my_udf(my_col))
And then later do:
df.select(my_res_col).toPandas()
Unfortunately, for some reason that I do not know, I still receive NULLs in my UDF, forcing me to check for NULL values directly in my UDF.
I do not understand why the isNull() is not preventing rows with NULL values from calling the UDF.
Any insight on this matter would be greatly appreciated.
I thank you in advance for your help.
Antoine
I am not sure about your data. Does it contain NaN? Spark handles null and NaN differently.
Differences between null and NaN in spark? How to deal with it?
So can you try the below and check if it solves the issue?
import pyspark.sql.functions as F

my_res_col = F.when(my_col.isNull() | F.isnan(my_col), F.lit(0.0)) \
    .otherwise(my_udf(my_col))
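As a side note, Spark does not guarantee the evaluation order of subexpressions in general, so (as the question shows) a when/otherwise guard may not always stop a Python UDF from seeing nulls. The most robust workaround is to make the UDF itself null-aware. A minimal sketch, where my_udf is a hypothetical stand-in for the real UDF:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

@F.udf(returnType=DoubleType())
def my_udf(value):
    # Handle the null inside the UDF instead of relying only on when/otherwise.
    if value is None:
        return 0.0
    return float(value)  # placeholder for the real transformation

my_res_col = my_udf(my_col)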

groupBy on dataframe in scala

I am trying to do a groupBy on a DataFrame with two columns, where the second column holds a category value. Please help me with the correct syntax in Scala.
I tried this way, but it's wrong.
df.groupBy("col1", "col2" == "Buy").count
Thanks.
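One way to express this is to group by an expression rather than a bare column name. A sketch of the idea in PySpark (the Scala version uses the same pattern with === and .as):

from pyspark.sql import functions as F

# Group by col1 and by whether col2 equals "Buy", then count each group.
df.groupBy("col1", (F.col("col2") == "Buy").alias("is_buy")).count().show()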