I want to replace NA values in PySpark, and basically I can do:
df.na.fill(0) #it works
BUT, I want to replace these values permanently, like using inplace in pandas.
When I check for nulls after executing the code below, there are no changes at all.
from pyspark.sql.functions import col, count, isnan, when

df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show() # I use this code to check nulls in the data set
Do you know how to replace NAs in PySpark for good? I want to achieve the same effect as inplace in pandas (permanently replaced values).
I've used this code to replace values:
df.na.fill(0)
and I've used the code below to check the effect:
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
Unfortunately, this only displays the replaced values; the nulls in my data set are not permanently replaced.
To fill null values in Spark, you can just use .fillna():
df = df.fillna(0)
You can check the docs: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.fillna.html. Spark DataFrames are immutable, so there is no inplace option; fillna() returns a new DataFrame, which is why you assign the result back to df. Also note that if the fill value does not match a column's data type, that column will stay null even when you use .fillna().
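As a minimal sketch of that caveat (the DataFrame and column names here are made up), filling mixed-type columns usually means passing a dict so every column gets a value of the right type:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical DataFrame: one numeric column and one string column, both with nulls
df = spark.createDataFrame(
    [(1, "a"), (None, None)],
    ["amount", "label"],
)

# fillna(0) only fills the numeric column; the string column stays null
df = df.fillna(0)

# a dict keyed by column name lets each column get a type-matching value
df = df.fillna({"amount": 0, "label": "unknown"})

df.show()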
Related
I need to assign one DataFrame's column values using another DataFrame's column. For this I wrote the code below:
DF1.withColumn("hic_num", lit(DF2.select("hic_num")))
And got this error:
SparkRuntimeException: the feature is not supported: literal for [HICN: string] of class org.apache.spark.sql.Dataset.
Please help me with the above.
lit stands for literal and is, as the name suggests, a literal, i.e. a constant. A column is not a constant.
You can do .withColumn("hic_num2", col("hic_num")); you do not have to wrap this in a lit().
Also, in your example you are trying to create a new column called hic_num with the value of hic_num, which does not make sense.
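As a small sketch of the two cases (the second column name is made up, and this assumes hic_num already exists on the DataFrame you call withColumn on):
from pyspark.sql.functions import col, lit

# copying an existing column needs col(), not lit()
df = df.withColumn("hic_num2", col("hic_num"))

# lit() is only for constants, e.g. stamping every row with the same value
df = df.withColumn("source_flag", lit("migrated"))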
I have been scratching my head over a problem in PySpark.
I want to conditionally apply a UDF to a column depending on whether it is NULL or not. One constraint is that I do not have access to the DataFrame at the location where I am writing the code; I only have access to a Column object.
Thus, I cannot simply do:
df.where(my_col.isNull()).select(my_udf(my_col)).toPandas()
Therefore, having access only to a Column object, I wrote the following:
my_res_col = F.when(my_col.isNull(), F.lit(0.0)) \
    .otherwise(my_udf(my_col))
And then later do:
df.select(my_res_col).toPandas()
Unfortunately, for some reason that I do not know, I still receive NULLs in my UDF, forcing me to check for NULL values directly inside the UDF.
I do not understand why isNull() is not preventing rows with NULL values from reaching the UDF.
Any insight on this matter would be greatly appreciated.
I thank you in advance for your help.
Antoine
I am not sure about your data. Does it contain NaN? Spark handles null and NaN differently.
Differences between null and NaN in spark? How to deal with it?
So can you just try the code below and check if it solves the issue?
import pyspark.sql.functions as F
my_res_col = F.when(my_col.isNull() | F.isnan(my_col), F.lit(0.0)).otherwise(my_udf(my_col))
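To see why the distinction matters, here is a small self-contained example (toy data, not your DataFrame) showing that isNull() and isnan() each catch only one of the two cases:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy column with a regular value, a null, and a NaN
df = spark.createDataFrame([(1.0,), (None,), (float("nan"),)], ["my_col"])

df.select(
    F.count(F.when(F.col("my_col").isNull(), 1)).alias("nulls"),
    F.count(F.when(F.isnan("my_col"), 1)).alias("nans"),
).show()
# nulls = 1 and nans = 1: isNull() does not catch NaN, and isnan() does not catch null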
I have a function to which I am passing a DataFrame and a list of columns ("listofcolumns") that should not contain NULL values. If any of the columns in listofcolumns has a null value, I need to take an action.
Now, I have to use a when clause here, but the columns passed to it will vary based on the DataFrame and the listofcolumns passed in. So I want to be able to generate the when clause dynamically from the columns passed. The when clause could be checking for a NULL value in just one column or in multiple columns of the DataFrame, so I cannot hard-code one condition or multiple conditions.
I have tried generating the when-clause string dynamically and passing it as a variable, but I get the error "TypeError: condition should be a Column".
Can someone please advise how I can achieve this?
This can be achieved by resolving the selection logic on your columns ahead of time and then using functools.reduce and operator, such as:
import functools
import operator
import pyspark.sql.functions as f
# conditional selection of columns - your logic on selecting
# which columns to check for null goes here
my_cols = [col for col in df.columns if "condition" in col]
# now I want to create my condition on these columns
# since it can be any of them, I use operator.or_
# but your logic may vary here - apply to my_cols created above
cond_expr = functools.reduce(operator.or_, [f.col(c).isNull() for c in my_cols])
# now you apply your action
df.withColumn(
    "output_column",
    f.when(cond_expr, TRUE_ACTION).otherwise(FALSE_ACTION),
)
Where TRUE_ACTION is the value applied when the condition (any of the columns being null) is satisfied. If you want the condition to require that all of the columns are null, replace operator.or_ with operator.and_ and build your logic from there. Hope this helps!
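If it helps, here is one way the whole thing could be wrapped into the kind of function you describe (the flag values "Y"/"N" and the column name has_null are just placeholders for your own action):
import functools
import operator
import pyspark.sql.functions as f

def flag_nulls(df, listofcolumns, flag_col="has_null"):
    # the True branch fires when ANY of the given columns is null
    cond_expr = functools.reduce(
        operator.or_, [f.col(c).isNull() for c in listofcolumns]
    )
    return df.withColumn(flag_col, f.when(cond_expr, f.lit("Y")).otherwise(f.lit("N")))

# usage (assuming df has columns "name" and "age"):
# df = flag_nulls(df, ["name", "age"])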
I want to use the values in t5 to replace some missing values in t4. I searched for code, but it doesn't work for me.
Current:
example of current
Goal:
example of target
df is a Spark DataFrame. Code:
pdf = df.toPandas()
from pyspark.sql.functions import coalesce
pdf.withColumn("t4", coalesce(pdf.t4, pdf.t5))
Error: 'DataFrame' object has no attribute 'withColumn'
I also tried the following code previously; it didn't work either.
new_pdf=pdf['t4'].fillna(method='bfill', axis="columns")
Error: No axis named columns for object type
As the error indicates, .withColumn() is not a method of pandas DataFrames but of Spark DataFrames. Note that when using .toPandas(), your pdf becomes a pandas DataFrame, so if you want to use .withColumn(), skip the conversion.
UPDATE:
If pdf is a pandas DataFrame, you can do:
pdf['t4'] = pdf['t4'].fillna(pdf['t5'])
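And if you would rather stay on the Spark side instead of converting to pandas, the same fill can be done with coalesce on the original df (a sketch, assuming df still has both t4 and t5):
from pyspark.sql.functions import coalesce, col

# coalesce keeps t4 where it is not null and falls back to t5 otherwise
df = df.withColumn("t4", coalesce(col("t4"), col("t5")))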
I have two dataframes: tr is a training set and ts is a test set.
They contain the columns uid (a user id), categ (a categorical), and response.
response is the dependent variable I'm trying to predict in ts.
I am trying to compute the mean of response in tr, broken out by columns uid and categ:
avg_response_uid_categ = tr.groupby(['uid','categ']).response.mean()
This gives the result, but (unwantedly) the DataFrame index is a MultiIndex (this is the default groupby(..., as_index=True) behavior):
MultiIndex[--5hzxWLz5ozIg6OMo6tpQ SomeValueOfCateg, --65q1FpAL_UQtVZ2PTGew AnotherValueofCateg, ...
But instead I want the result to keep 'uid' and 'categ' as two separate columns.
Should I use aggregate() instead of groupby()?
Trying groupby(as_index=False) is useless.
The result seems to differ depending on whether you do:
tr.groupby(['uid','categ']).response.mean()
or:
tr.groupby(['uid','categ'])['response'].mean() # RIGHT
i.e. whether you slice out a single Series or a DataFrame containing a single Series. Related: Pandas selecting by label sometimes returns Series, sometimes returns DataFrame
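If the goal is simply to end up with uid and categ as ordinary columns, one option (a sketch against your tr DataFrame) is to flatten the index afterwards with reset_index():
avg_response_uid_categ = (
    tr.groupby(['uid', 'categ'])['response']
    .mean()
    .reset_index()  # turns the MultiIndex levels back into the columns 'uid' and 'categ'
)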