How do I replace a string with 0 in multiple columns in PySpark?

As in the title: I have a list of columns and need to replace a certain string with 0 in those columns. I can do that using a select statement with nested when functions, but I want to preserve my original dataframe and only change the columns in question. df.replace(string, 0, list_of_columns) doesn't work because of a data type mismatch.

So I ended up with something like this, which worked for me:
from pyspark.sql import functions as F
for column in column_list:
    # keep the existing value unless it equals the target string
    df = df.withColumn(column, F.when(F.col(column) == "string", "0").otherwise(F.col(column)))
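The loop above adds one projection per column. As a sketch of the same replacement done in a single pass, the hypothetical helpers below build one SQL expression per column and hand them all to selectExpr, so untouched columns pass through unchanged. This assumes plain column names (no backticks) and a target string without single quotes; the helper names are illustrative and not part of the original post.

```python
def replace_exprs(all_columns, target_columns, target="string", replacement="0"):
    """Build selectExpr strings: CASE WHEN for target columns, pass-through otherwise."""
    targets = set(target_columns)
    exprs = []
    for name in all_columns:
        if name in targets:
            exprs.append(
                f"CASE WHEN `{name}` = '{target}' THEN '{replacement}' "
                f"ELSE `{name}` END AS `{name}`"
            )
        else:
            # columns outside the list are selected as they are
            exprs.append(f"`{name}`")
    return exprs


def replace_in_columns(df, target_columns, target="string", replacement="0"):
    """Return a new DataFrame; `df` is assumed to be a pyspark.sql.DataFrame."""
    return df.selectExpr(*replace_exprs(df.columns, target_columns, target, replacement))
```

Usage would look like df = replace_in_columns(df, column_list), which keeps the column order of the original dataframe.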

Related

Pyspark : How to take Minimum in the timestamp column?

In PySpark, I tried to do this:
df = df.select(
    F.col("id"),
    F.col("mp_code"),
    F.col("mp_def"),
    F.col("mp_desc"),
    F.col("mp_code_desc"),
    F.col("zdmtrt06_zstation").alias("station"),
    F.to_timestamp(F.col("date_time"), "yyyyMMddHHmmss").alias("date_time_utc"),
)
df = df.groupBy("id", "mp_code", "mp_def", "mp_desc", "mp_code_desc", "station").min(F.col("date_time_utc"))
But I get this error:
raise TypeError("Column is not iterable")
TypeError: Column is not iterable
Here is an extract of the pyspark documentation
GroupedData.min(*cols)[source]
Computes the min value for each numeric column for each group.
New in version 1.3.0.
Parameters: cols : str
In other words, GroupedData.min does not accept Column objects. It only works with column names (strings), like this:
df.groupBy("x").min("y")
# you can also specify several column names
df.groupBy("x").min("y", "z")
Note that GroupedData.min only handles numeric columns, as the documentation extract says. For a timestamp column such as date_time_utc, or whenever you want to pass a Column object, use agg:
df.groupBy("x").agg(F.min(F.col("date_time_utc")))

Pyspark dynamic column name

I have a dataframe which contains months and will change quite frequently. I am saving the dataframe's values as a list, e.g. months = ['202111', '202112', '202201']. I am using a for loop to iterate through all the list elements and trying to provide dynamic column names with the following code:
for i in months:
    df = (
        adjustment_1_prepared_df.select("product", "mnth", "col1", "col2")
        .groupBy("product")
        .agg(
            f.min(f.when(condition, f.col("col1")).otherwise(9999999)).alias(
                concat("col3_"), f.lit(i.col)
            )
        )
    )
So basically, in alias I am trying to give the column name as a combination of a constant (minInv_) and a variable (e.g. 202111), but I am getting an error. How can I give a column name as a combination of a fixed string and a variable?
Thanks in advance!
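alias takes an ordinary Python string, and each i is already a string from months, so concatenate it directly:
.alias("col3_" + str(i))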
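Because each element of months is a plain string, the alias can be built with ordinary string formatting. The sketch below also collects one aggregate per month into a single agg call instead of overwriting df on every iteration. Here condition_for is a hypothetical callback standing in for the condition that the question does not show, and pyspark is imported inside the function only so that the naming helper can be used without a Spark installation.

```python
def alias_for(month, prefix="col3_"):
    """Combine a fixed prefix with a variable part, e.g. 'col3_202111'."""
    return f"{prefix}{month}"


def min_per_month(df, months, condition_for):
    """Add one conditional-min column per month, grouped by product.

    `condition_for(month)` is a hypothetical callback that must return a
    boolean Column; the original condition is not shown in the question.
    """
    from pyspark.sql import functions as F  # needs an active Spark session

    aggregates = [
        F.min(F.when(condition_for(m), F.col("col1")).otherwise(9999999)).alias(alias_for(m))
        for m in months
    ]
    return df.groupBy("product").agg(*aggregates)
```

If the condition happens to be a comparison against the mnth column (an assumption, not something the question states), a call could look like min_per_month(adjustment_1_prepared_df, months, lambda m: F.col("mnth") == m).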

Adding empty columns to dataframe with empty values (by type) pyspark

I have the following list:
columns = [('url','string'),('count','bigint'),('isindex','boolean')]
I want to add these columns to my df with empty values:
for column in columns:
    df = df.withColumn(column[0], f.lit(?).cast(?))
I am not sure what I need to put in the lit function and in the cast in order to get a suitable empty value for each type.
Thank you!
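One way to fill in the placeholders, offered as a sketch rather than a definitive answer: a NULL literal cast to each target type gives an empty column of that type, so f.lit(None).cast(column[1]) should fit the loop above. The same idea can be written as SQL expressions and applied in one selectExpr call, which is what the hypothetical helpers below do. Column names are assumed to contain no backticks.

```python
columns = [("url", "string"), ("count", "bigint"), ("isindex", "boolean")]


def null_column_exprs(columns):
    """Build one 'CAST(NULL AS <type>) AS `<name>`' expression per (name, type) pair."""
    return [f"CAST(NULL AS {dtype}) AS `{name}`" for name, dtype in columns]


def add_empty_columns(df, columns):
    """Keep every existing column and append the typed, all-NULL columns.

    `df` is assumed to be a pyspark.sql.DataFrame.
    """
    return df.selectExpr("*", *null_column_exprs(columns))
```

Calling add_empty_columns(df, columns) returns a new dataframe and leaves the original untouched.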

PySpark - iterate rows of a Data Frame

I need to iterate over the rows of a pyspark.sql.dataframe.DataFrame.
I have done it in pandas in the past with the function iterrows(), but I need to find something similar for PySpark without using pandas.
If I do for row in myDF: it iterates over columns.
Thanks
You can use the select method to operate on your dataframe with a user-defined function, something like this:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
columns = myDF.columns
my_udf = F.udf(lambda data: "do whatever you want here", StringType())
myDF.select(*[my_udf(F.col(c)) for c in columns])
Then, inside the select, you can choose what you want to do with each column.
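If the goal is to loop over rows on the driver rather than to transform columns, DataFrame.toLocalIterator() and DataFrame.collect() both return Row objects. A minimal sketch, assuming the rows can reasonably be processed on the driver; iter_rows is a hypothetical helper name.

```python
def iter_rows(df):
    """Yield each row of `df` as a plain dict.

    `df` is assumed to be a pyspark.sql.DataFrame. toLocalIterator() streams
    rows to the driver partition by partition, whereas collect() would
    materialise the whole result at once.
    """
    for row in df.toLocalIterator():
        yield row.asDict()
```

A loop such as for record in iter_rows(myDF): print(record["some_column"]) then behaves much like iterrows() in pandas, where some_column is a placeholder name.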

Removing Blank Strings from a Spark Dataframe

Attempting to remove rows in which a Spark dataframe column contains blank strings. Originally did val df2 = df1.na.drop() but it turns out many of these values are being encoded as "".
I'm stuck using Spark 1.3.1 and also cannot rely on DSL. (Importing spark.implicit_ isn't working.)
Removing things from a dataframe requires filter().
newDF = oldDF.filter("colName != ''")
or am I misunderstanding your question?
In case someone doesn't want to drop the records with blank strings, but just wants to convert the blank strings to some constant value:
val newdf = df.na.replace(df.columns,Map("" -> "0")) // to convert blank strings to zero
newdf.show()
You can use this:
df.filter(!($"col_name"===""))
It filters out the rows where the value of "col_name" is "", i.e. a blank string. I'm using the equality filter and then inverting it with "!".
I am also new to Spark, so I don't know whether the code below is more complex than necessary, but it works.
Here we create a UDF that converts blank values to null.
sqlContext.udf().register("convertToNull",(String abc) -> (abc.trim().length() > 0 ? abc : null),DataTypes.StringType);
After the code above, you can use "convertToNull" (it works on strings) in the select clause to turn all blank fields into null, and then use .na().drop().
crimeDataFrame.selectExpr("C0","convertToNull(C1)","C2","C3").na().drop()
Note: You can use the same approach in Scala.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-udfs.html
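For readers on PySpark, the string-expression form of filter shown earlier carries over unchanged. The sketch below builds such a condition for several columns at once. non_blank_condition is a hypothetical helper, the column list is assumed to be non-empty with no backticks in the names, and trim() is added so that whitespace-only values count as blank. Rows where a listed column is NULL are dropped as well, because the comparison then evaluates to NULL rather than true.

```python
def non_blank_condition(columns):
    """Build a SQL filter string that is true only when every column is non-blank."""
    return " AND ".join(f"trim(`{name}`) != ''" for name in columns)


def drop_blank_rows(df, columns):
    """Return `df` without rows whose listed string columns are blank.

    `df` is assumed to be a pyspark.sql.DataFrame.
    """
    return df.filter(non_blank_condition(columns))
```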