How to impute NULL values to zero in Spark/Scala - scala

I have a DataFrame in which some columns are of type String and contain "NULL" as a String value (not an actual NULL). I want to impute them with zero. Apparently df.na.fill(0) doesn't work. How can I impute them with zero?

You can use replace() from DataFrameNaFunctions, which is accessed via the .na prefix:
val df1 = df.na.replace("*", Map("NULL" -> "0"))
You could also create your own UDF that replicates this behaviour:
import org.apache.spark.sql.functions.{col, udf}
val nullReplacer = udf((x: String) => {
  if (x == "NULL") "0"
  else x
})
val df1 = df.select(df.columns.map(c => nullReplacer(col(c)).alias(c)): _*)
However, this is superfluous: it does the same as the above at the cost of more code than necessary.
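For completeness, here is a minimal, self-contained sketch of the na.replace approach; the sample column names and values are made up for illustration, and a SparkSession named spark is assumed to be in scope:
import spark.implicits._ // assuming a SparkSession named spark

val df = Seq(("a", "NULL"), ("b", "3"), ("NULL", "7")).toDF("key", "value")

// Replace the literal string "NULL" with "0" in every applicable column ("*" = all columns)
val df1 = df.na.replace("*", Map("NULL" -> "0"))
df1.show() // "NULL" strings become "0"; actual nulls are left untouched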

Related

Spark DataFrame Get Null Count For All Columns

I have a DataFrame for which I would like to get the total null-value count, and I have the following code that does this generically across all the columns:
First my DataFrame that just contains one column (for simplicity):
val recVacDate = dfRaw.select("STATE")
When I print using a simple filter, I get to see the following:
val filtered = recVacDate.filter("STATE is null")
println(filtered.count()) // Prints 94051
But when I use the code below, I get just 1 as a result, and I do not understand why:
val nullCount = recVacDate.select(recVacDate.columns.map(c => count(col(c).isNull || col(c) === "" || col(c).isNaN).alias(c)): _*)
println(nullCount.count()) // Prints 1
Any ideas as to what is wrong with the nullCount? The DataType of the column is a String.
This kind of fixed it:
df.select(df.columns.map(c => count(when(col(c).isNull || col(c) === "" || col(c).isNaN, c)).alias(c)): _*)
Notice the use of the when clause inside the count.
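As a hedged, self-contained illustration of why the two counts differ (sample data and column names are made up), note that an aggregation built only from aggregate expressions always yields exactly one row, so calling count() on it returns 1; the per-column totals live inside that single row:
import org.apache.spark.sql.functions.{col, count, when}
import spark.implicits._ // assuming a SparkSession named spark

val df = Seq(("CA", "x"), (null, "y"), ("", "z")).toDF("STATE", "OTHER")

val nullCounts = df.select(df.columns.map(c =>
  count(when(col(c).isNull || col(c) === "", c)).alias(c)
): _*)

nullCounts.count() // 1 -- the aggregation produces a single row
nullCounts.show()  // that row holds the null/empty count for each column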

Convert udf over multiple columns in scala spark

I have the following code in PySpark, which works fine.
from pyspark.sql.types import IntegerType, DoubleType
from pyspark.sql.functions import udf, array
prod_cols = udf(lambda arr: float(arr[0])*float(arr[1]), DoubleType())
finalDf = finalDf.withColumn('click_factor', prod_cols(array('rating', 'score')))
Now I tried similar code in Scala.
val prod_cols = udf((rating: Double, score: Double) => {rating.toDouble*score.toDouble})
finalDf = finalDf.withColumn("cl_rate", prod_cols(finalDf("rating"), finalDf("score")))
Somehow the second code doesn't give the right answers; the result is always null or zero.
Can you help me get the right Scala code? Essentially I just need code to multiply two columns, considering that score or rating may be null.
Pass only not-null values to the UDF.
Change the code below
val prod_cols = udf((rating: Double, score: Double) => {rating.toDouble*score.toDouble})
finalDf.withColumn("cl_rate", prod_cols(finalDf("rating"), finalDf("score")))
to
import org.apache.spark.sql.functions.{lit, udf, when}
import spark.implicits._ // assuming a SparkSession named spark, for the $"..." syntax

val prod_cols = udf((rating: Double, score: Double) => rating * score)
finalDf
  .withColumn("rating", $"rating".cast("double")) // ignore this line if the column is already double
  .withColumn("score", $"score".cast("double"))   // ignore this line if the column is already double
  .withColumn("cl_rate",
    when(
      $"rating".isNotNull && $"score".isNotNull,
      prod_cols($"rating", $"score")
    ).otherwise(lit(null).cast("double"))
  )
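As a hypothetical usage sketch (sample data made up, SparkSession named spark assumed), this shows the guarded UDF producing null whenever either input is null:
import org.apache.spark.sql.functions.{lit, udf, when}
import spark.implicits._ // assuming a SparkSession named spark

val finalDf = Seq(
  (Some(4.0), Some(0.5)),
  (None, Some(0.9)),
  (Some(3.0), None)
).toDF("rating", "score")

val prod_cols = udf((rating: Double, score: Double) => rating * score)

finalDf.withColumn("cl_rate",
  when($"rating".isNotNull && $"score".isNotNull, prod_cols($"rating", $"score"))
    .otherwise(lit(null).cast("double"))
).show()
// The (4.0, 0.5) row gets cl_rate 2.0; rows with a null rating or score get a null cl_rate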

Combining When clause with tail: _* on Spark Scala Dataframe

Given this statement 1:
val aggDF3 = aggDF2.select(cols.map { col =>
  (when(size(aggDF2(col)) === 0, lit(null)).otherwise(aggDF2(col))).as(s"$col")
}: _*)
Given this statement 2:
aggDF.select(colsToSelect.head, colsToSelect.tail: _*).show()
Can I combine the when logic... on statement 1 with the colsToSelect.tail: _* in a single statement, so that the first field is just selected, and the logic only applies to the tail of the dataframe columns? I've tried various approaches, but I'm on thin ice here.
This should work:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, size, when}

val aggDF: DataFrame = ???
val colsToSelect: Seq[String] = ???

aggDF
  .select((col(colsToSelect.head) +: colsToSelect.tail.map(
    c => when(size(aggDF(c)) === 0, lit(null)).otherwise(aggDF(c)).as(s"$c")
  )): _*)
  .show()
Remember that select is overloaded and works differently with String and Column: with cols: Seq[String] you need select(cols.head, cols.tail: _*); with cols: Seq[Column] you need select(cols: _*). The solution above uses the second variant.
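A short hedged sketch of the two overloads, using a placeholder DataFrame and made-up column names:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val df: DataFrame = ???
val cols: Seq[String] = Seq("id", "name", "score")

df.select(cols.head, cols.tail: _*) // String overload: select(col: String, cols: String*)
df.select(cols.map(col): _*)        // Column overload: select(cols: Column*), used in the answer above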

Call the function n times by passing 1 to n as a parameter with dataframe as the output of the function

I have a dataframe and a lookup dataframe. I want to join my inputDf with my lookupDf N times by passing N as a parameter to the function joinByColumn, as N becomes one of the join conditions. The output should be a combination of inputDf and the selected columns from lookupDf.
I can achieve this with the foldLeft function, but I want to do it using a map or iterator function:
val result = (0 to n).foldLeft(inputDf) {
  case (df, colName) => joinByColumn(colName.toString, df).toDF()
}

def joinByColumn(value: String, inputDf: DataFrame): DataFrame = {
  val lookupDF = readFromCassandraTableAsDataFrame(sqlContext, "keyspace", "table")
  inputDf.as("src")
    .join(lookupDF, lookupDF("a").equalTo(inputDf("input_a")) && lookupDF("b").equalTo((value.toInt + 1).toString), "left")
    .select("src.*", "c")
    .withColumnRenamed("c", value)
}
I want the output to be a dataframe with all the joined columns.

How to change column type for a list of dataframe columns

I'm trying to change the type of a list of columns for a Dataframe in Spark 1.6.0.
All the examples I have found so far, however, only cast a single column (df.withColumn) or all the columns in the dataframe:
val castedDF = filteredDf.columns.foldLeft(filteredDf)((filteredDf, c) => filteredDf.withColumn(c, col(c).cast("String")))
Is there any efficient, batch way of doing this for a list of columns in the dataframe?
There is nothing wrong with withColumn* but you can use select if you prefer:
import org.apache.spark.sql.functions.col

val columnsToCast: Set[String] = ???
val outputType: String = "string"

df.select(df.columns.map(
  c => if (columnsToCast.contains(c)) col(c).cast(outputType) else col(c)
): _*)
* The execution plan will be the same for a single select as for chained withColumn calls.
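A hedged usage sketch of the same approach, with made-up column names and a SparkSession named spark assumed:
import org.apache.spark.sql.functions.col
import spark.implicits._ // assuming a SparkSession named spark

val df = Seq((1, 2.5, "x"), (2, 3.5, "y")).toDF("id", "amount", "label")

val columnsToCast: Set[String] = Set("id", "amount")
val outputType: String = "string"

val casted = df.select(df.columns.map(
  c => if (columnsToCast.contains(c)) col(c).cast(outputType) else col(c)
): _*)

casted.printSchema() // id and amount are now strings; label keeps its original type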