Handling NULL values in a PySpark Column expression

I have been scratching my head with a problem in pyspark.
I want to conditionally apply a UDF to a column depending on whether it is NULL or not. One constraint is that I do not have access to the DataFrame at the location where I am writing the code; I only have access to a Column object.
Thus, I cannot simply do:
df.where(my_col.isNull()).select(my_udf(my_col)).toPandas()
Therefore, having only access to a Column object, I was writing the following:
my_res_col = F.when(my_col.isNull(), F.lit(0.0)) \
    .otherwise(my_udf(my_col))
And then later do:
df.select(my_res_col).toPandas()
Unfortunately, for some reason that I do not know, I still receive NULLs in my UDF, forcing me to check for NULL values directly in my UDF.
I do not understand why the isNull() is not preventing rows with NULL values from calling the UDF.
Any insight on this matter would be greatly appreciated.
I thank you in advance for your help.
Antoine

I am not sure about your data. Does it contain NaN? Spark handles null and NaN differently.
Differences between null and NaN in spark? How to deal with it?
So can you try the code below and check if it solves the issue?
import pyspark.sql.functions as F
my_res_col = F.when((my_col.isNull()) | (F.isnan(my_col)), F.lit(0.0)).otherwise(my_udf(my_col))
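For reference, here is a minimal, self-contained sketch of that suggestion; the toy DataFrame and my_udf are illustrative only, and the None guard inside the UDF mirrors the workaround the asker mentions:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
# Toy data covering the three cases: a value, a null, and a NaN
df = spark.createDataFrame([(1.0,), (None,), (float("nan"),)], ["my_col"])

# Toy stand-in for my_udf; the None check mirrors the asker's in-UDF workaround
my_udf = F.udf(lambda x: None if x is None else x * 2.0, DoubleType())
my_col = F.col("my_col")

my_res_col = F.when(my_col.isNull() | F.isnan(my_col), F.lit(0.0)) \
              .otherwise(my_udf(my_col))

df.select(my_res_col.alias("result")).show()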

Related

How to fill NA in PySpark like in Pandas

I want to replace NA values in PySpark, and basically I can
df.na.fill(0) #it works
BUT, I want to replace these values permanently, meaning something like using inplace in pandas.
When I check for nulls after executing the code below, there are no changes at all.
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show() #I use this code to check null in data set
Do you know how to permanently replace NAs in PySpark? I want to achieve the same effect as with inplace in pandas (permanently replaced values).
I've used this code to replace the values:
df.na.fill(0)
and I've used the code below to check the effect:
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
Unfortunately, I can only display the newly replaced values; the nulls in my data set are not permanently replaced.
To fill the null values in Spark, you can just use .fillna():
df = df.fillna(0)
You can check the docs: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.fillna.html. Note that fillna() returns a new DataFrame rather than modifying df in place, so you need to assign the result back as shown above; there is no inplace in Spark. Also, if the fill value's type does not match a column's data type, that column's values will still be null even if you use .fillna().
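For illustration, a hedged sketch that combines the reassignment above with the null check from the question (column handling depends on your schema; isnan only applies to numeric columns):
from pyspark.sql.functions import col, count, isnan, when

# fillna() returns a new DataFrame; assign it back to keep the change
df = df.fillna(0)

# Re-run the null check from the question on the reassigned DataFrame
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()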

Avoid loading an empty dataframe into a table

I am creating a process in Spark Scala within an ETL that checks for certain events that occurred during the ETL process. I start with an empty dataframe, and if events occur this dataframe is filled with information (a dataframe can't be filled in place; it can only be joined with other dataframes with the same structure). The thing is that at the end of the process the generated dataframe is loaded into a table, but it can happen that the dataframe ends up empty because no event has occurred, and I don't want to load an empty dataframe because that makes no sense. So I'm wondering if there is an elegant way to load the dataframe into the table only if it is not empty, without using an if condition. Thanks!!
I recommend creating the dataframe anyway; if you don't create a table with the same schema, even if it's empty, your operations/transformations on the DF could fail because they could refer to columns that are not present.
To handle this, you should always create a DataFrame with the same schema, which means the same column names and data types, regardless of whether the data exists or not. You might want to populate it with data later.
If you still want to do it your way, I can point out a few approaches for Spark 2.1.0 and above:
df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty
These are equivalent.
I don't recommend using df.count > 0, because count is linear in the size of the data and you would still have to do a check like df != null first.
A much better solution would be:
df.rdd.isEmpty
Or since Spark 2.4.0 there is also Dataset.isEmpty.
As you can see, whatever you decide to do, there is a check you need to perform somewhere, so you can't really get rid of the if condition if you want to avoid loading an empty dataframe into the table.
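For illustration, a minimal sketch of that guarded write, shown in PySpark for consistency with the rest of this page (the equivalent calls exist in Scala); events_df, the write mode, and the table name are placeholders:
# Only write when at least one event was recorded during the ETL run
if not events_df.rdd.isEmpty():  # or events_df.isEmpty() on Spark >= 3.3
    events_df.write.mode("append").saveAsTable("etl_events")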

PySpark withColumn not keeping new column

I am trying to change a column from string to integer and save it, but I am getting a NoneType error as shown. I am not sure what is going wrong here.
In cell #113 you ended with printSchema(), which returns None. Remove that call and your data_use will be a DataFrame.
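A hedged reconstruction of what likely happened (the column name is illustrative, since the actual cell is not shown):
import pyspark.sql.functions as F

# printSchema() returns None, so chaining it after withColumn() leaves data_use as None
data_use = df.withColumn("my_col", F.col("my_col").cast("int")).printSchema()

# Keep the transformation and the schema print separate instead
data_use = df.withColumn("my_col", F.col("my_col").cast("int"))
data_use.printSchema()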

PySpark: get the column names of the columns that contain null values

I have a DataFrame and I want to get the names of the columns that contain one or more null values.
So far, this is what I've done:
df.select([c for c in tbl_columns_list if df.filter(F.col(c).isNull()).count() > 0]).columns
I have almost 500 columns in my dataframe, and when I execute that code it becomes incredibly slow for a reason I don't know. Do you have any clue how I can make it work, and how I can optimize it? I need an optimized solution in PySpark, please. Thanks in advance.
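One hedged sketch of a faster approach, reusing the single-pass count/when/isNull aggregation shown in the fill-NA question above, so that Spark runs one job instead of one filter-and-count job per column:
import pyspark.sql.functions as F

# Count the nulls of every column in a single aggregation
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).collect()[0].asDict()

# Keep only the columns that contain at least one null
cols_with_nulls = [c for c, n in null_counts.items() if n > 0]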

df.withColumn is too slow when I iterate through the column data in a PySpark dataframe

I am doing AES encryption on a PySpark dataframe column.
I am iterating over the column data and replacing each column value with its encrypted value using df.withColumn, but it is too slow.
I am looking for an alternative approach, but I have not found one.
for i in column_data:
    obj = AES.new(key, AES.MODE_CBC, v)
    ciphertext = obj.encrypt(i)
    df = df.withColumn(col, F.when(df[col] == i, str(ciphertext)).otherwise(df[col]))
return df
But it's taking a long time.
Could you please suggest an alternative?
Your code is slow because of your for-loop, as it forces Spark to run only on one thread.
Please provide an example of input and expected output and someone might be able to help you with rewriting your code.
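In the meantime, here is a hedged sketch of one common alternative: move the encryption into a UDF so Spark applies it per row instead of rebuilding the query plan in a driver-side loop. The padding helper and hex encoding are assumptions (CBC mode encrypts fixed-size blocks), and key, v, and col are taken from the question:
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
from Crypto.Cipher import AES
from Crypto.Util.Padding import pad  # assumption: pycryptodome is available

def encrypt_value(value):
    # Null values pass through unencrypted, as nothing can be padded/encrypted
    if value is None:
        return None
    cipher = AES.new(key, AES.MODE_CBC, v)  # key and v as in the question
    ciphertext = cipher.encrypt(pad(str(value).encode("utf-8"), AES.block_size))
    return ciphertext.hex()

encrypt_udf = F.udf(encrypt_value, StringType())
df = df.withColumn(col, encrypt_udf(F.col(col)))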