Replace empty values with nulls in a Spark DataFrame - Scala

I have a DataFrame with n columns and I want to replace the empty strings in all of these columns with nulls.
I tried using
val ReadDf = rawDF.na.replace("columnA", Map( "" -> null));
and
val ReadDf = rawDF.withColumn("columnA", if($"columnA"=="") lit(null) else $"columnA" );
Neither of them worked.
Any leads would be highly appreciated. Thanks.

Your first approach seems to fail due to a bug that prevents replace from replacing values with nulls; see here.
Your second approach fails because you're confusing driver-side Scala code with executor-side DataFrame instructions: your if-else expression would be evaluated once on the driver (and not once per record). You'd want to replace it with a call to the when function. Moreover, to compare a column's value you need the === operator, not Scala's ==, which just compares the driver-side Column objects:
import org.apache.spark.sql.functions._
rawDF.withColumn("columnA", when($"columnA" === "", lit(null)).otherwise($"columnA"))
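Since the question asks about all n columns, here is a minimal sketch that folds the same expression over every column (emptyToNull is a hypothetical helper name, and all columns are assumed to be of string type):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
// Hypothetical helper: turn empty strings into nulls in every column.
// Assumes every column is a string; non-string columns would need to be
// skipped by inspecting the schema first.
def emptyToNull(df: DataFrame): DataFrame =
  df.columns.foldLeft(df) { (acc, colName) =>
    acc.withColumn(colName, when(col(colName) === "", lit(null)).otherwise(col(colName)))
  }
val readDf = emptyToNull(rawDF)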

Related

pyspark add int column to a fixed date

I have a fixed date "2000/01/01" and a dataframe:
data1 = [{'index':1,'offset':50}]
data_p = sc.parallelize(data1)
df = spark.createDataFrame(data_p)
I want to create a new column by adding the offset column to this fixed date
I tried different methods, but I cannot pass the column into the date function; expr raises an error such as:
function is neither a registered temporary function nor a permanent function registered in the database 'default'
The only solution I can think of is
df = df.withColumn("zero",lit(datetime.strptime('2000/01/01', '%Y/%m/%d')))
df.withColumn("date_offset",expr("date_add(zero,offset)")).drop("zero")
Since I cannot use lit and datetime.strptime inside expr, I have to use this approach, which creates a redundant column and redundant operations.
Any better way to do it?
Since you have tagged this as a PySpark question, in Python you can do the following:
import pyspark.sql.functions as F
df_a3.withColumn("date_offset", F.lit("2000-01-01").cast("date") + F.col("offset").cast("int")).show()
Edit: as per the comment below, let's assume there is an extra column called type; based on it, the code below can be used:
df_a3.withColumn("date_offset",F.expr("case when type ='month' then add_months(cast('2000-01-01' as date),offset) else date_add(cast('2000-01-01' as date),cast(offset as int)) end ")).show()

Spark-Scala: Get Dataframe Variable by concatenating two String Variables

I have a scenario where I need to form a DataFrame name from two string variables, which is pretty easy and can be done by concatenation.
Example: "df_" + "part1324"
The above code returns a String. I want this to be a DataFrame variable through which I can perform further operations on the DataFrame.
A Map can be used to assign names to DataFrames:
val df = List(("df_value")).toDF()
val stringVariable = "part1324"
// assign name to dataframe
val namedDataFrames = Map("df_" + stringVariable -> df)
// get dataframe by name
namedDataFrames("df_part1324").show(false)
Your question is confusing. What do you mean by a DataFrame variable? Concatenating two strings will always return a String. In order to create a DataFrame, you need to apply one of the methods available for creating a DataFrame.
A val df: DataFrame cannot be equal to df_part1234 (a String) as per your example, but to use it as a DataFrame you need to do something like below:
val df_part1234 = spark.range(1000).toDF("number") // where spark is your SparkSession
If you need to generate this variable dynamically, place the statement that creates the DataFrame inside the logic that generates the variable, such as a loop.
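A minimal sketch of that pattern, assuming a SparkSession named spark (the part identifiers and the range data are placeholders for your real inputs):
// Hypothetical part identifiers; replace with your own.
val parts = Seq("part1324", "part1325")
// Build each DataFrame inside the loop and key it by the generated name.
val namedDataFrames = parts.map { p =>
  ("df_" + p) -> spark.range(1000).toDF("number")
}.toMap
namedDataFrames("df_part1324").show(false)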
Please rewrite your question if you are trying to achieve something else (along with a code snippet to reproduce the issue), or accept the answer if you are clear on the issue.

Capture and write string inside of dataframe using foreach row

I'm trying to capture and write a string value, after substituting contents obtained from specific fields of each row of a DataFrame, using Scala. But since this is deployed on a cluster I am not able to capture any records. Can anyone provide a solution?
Assuming TEST_DB.finalresult has 2 fields input1 and input2:
val finalResult = spark.sql("select * from TEST_DB.finalresult")
finalResult.foreach { row =>
  val param1 = row.getAs[String]("input1")
  val param2 = row.getAs[String]("input2")
  val string = """new values of param1 and param2 are -> """ + param1 + """,""" + param2
  // how to append the modified string to a csv file continuously for each microbatch in hdfs ??
}
In your code you create the string variable you want, but it is not saved anywhere, hence you can't see the result.
You could potentially open the target CSV file and append the new string in each foreach execution, but I'd like to propose a different solution.
If you can, try to always use built-in functionality of Spark, since it is (usually) more optimised and better in handling null inputs. You can achieve the same by:
import org.apache.spark.sql.functions.{lit, concat, col}
val modifiedFinalResult = finalResult.select(
  concat(
    lit("new values of param1 and param2 are -> "),
    col("input1"),
    lit(","),
    col("input2")
  ).alias("string")
)
In the variable modifiedFinalResult you will have a Spark DataFrame with a single column named string, which represents exactly the same output as the variable string in your code. Afterwards you can save the DataFrame directly as a single CSV file (using the repartition functionality):
modifiedFinalResult.repartition(1).write.format("csv").save("path/to/your/csv/output")
PS: Also a suggestion for the future, try to avoid naming variables after data types.
UPDATE: Fixed the empty rows issue by using concat_ws instead of concat and applying coalesce to each field. It seems some null values were turning the entire concatenated string to null after the transformation. Nevertheless, this solution works for now!
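A sketch of what that fix might look like (the empty-string fallbacks inside coalesce are an assumption based on the description above):
import org.apache.spark.sql.functions.{coalesce, col, concat_ws, lit}
// concat_ws skips null inputs instead of propagating them; coalesce
// additionally substitutes an empty string for any null field.
val modifiedFinalResult = finalResult.select(
  concat_ws("",
    lit("new values of param1 and param2 are -> "),
    coalesce(col("input1"), lit("")),
    lit(","),
    coalesce(col("input2"), lit(""))
  ).alias("string")
)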

Create new column based on equality between existing columns

While it seems a trivial task, I haven't been able to find a tidy solution for it. I want to add a new (integer) column, nCol, to a DataFrame, the value of which is determined by comparing two existing columns (both of String type) of the DataFrame, eCol1 and eCol2,
something like:
df(nCol) = {
if df(eCol1) == df(eCol2) then 1
else 0
}
I believe it could be done with the help of user-defined functions (UDFs). But isn't there a tidier way for such a trivial task?
You need to work with the DataFrame DSL's when/otherwise; to test equality, use ===:
import org.apache.spark.sql.functions.when
df.withColumn("newCol", when(df("eCol1") === df("eCol2"), 1).otherwise(0))

Removing Blank Strings from a Spark Dataframe

I'm attempting to remove rows in which a Spark DataFrame column contains blank strings. I originally did val df2 = df1.na.drop(), but it turns out many of these values are being encoded as "".
I'm stuck using Spark 1.3.1 and also cannot rely on the DSL. (Importing spark.implicits._ isn't working.)
Removing things from a dataframe requires filter().
val newDF = oldDF.filter("colName != ''")
or am I misunderstanding your question?
In case someone doesn't want to drop the records with blank strings, but just convert the blank strings to some constant value:
val newdf = df.na.replace(df.columns, Map("" -> "0")) // convert blank strings to zero
newdf.show()
You can use this:
df.filter(!($"col_name"===""))
It filters out the rows where the value of "col_name" is "", i.e. nothing/a blank string. I'm using the equality filter and then inverting it with !.
I am also new to Spark, so I don't know if the code below is more complex or not, but it works.
Here we are creating a UDF (in Java) that converts blank values to null.
sqlContext.udf().register("convertToNull", (String abc) -> (abc != null && abc.trim().length() > 0 ? abc : null), DataTypes.StringType);
After the above code you can use convertToNull (works on strings) in a select clause to make all blank fields null, and then use .na().drop().
crimeDataFrame.selectExpr("C0","convertToNull(C1)","C2","C3").na().drop()
Note: you can use the same approach in Scala.
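A minimal sketch of the Scala equivalent (assuming the same sqlContext and crimeDataFrame as above):
// Register the same blank-to-null UDF in Scala; Scala's type inference
// replaces the explicit DataTypes.StringType return type.
sqlContext.udf.register("convertToNull",
  (abc: String) => if (abc != null && abc.trim.nonEmpty) abc else null)
crimeDataFrame.selectExpr("C0", "convertToNull(C1)", "C2", "C3").na.drop()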
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-udfs.html