Spark DataFrame Get Null Count For All Columns - scala

I have a DataFrame for which I would like to get the total null value count, and I have the following that does this generically over all the columns:
First, my DataFrame, which contains just one column (for simplicity):
val recVacDate = dfRaw.select("STATE")
When I print the count after a simple filter, I see the following:
val filtered = recVacDate.filter("STATE is null")
println(filtered.count()) // Prints 94051
But when I use the code below, I get just 1 as the result, and I do not understand why:
val nullCount = recVacDate.select(recVacDate.columns.map(c => count(col(c).isNull || col(c) === "" || col(c).isNaN).alias(c)): _*)
println(nullCount.count()) // Prints 1
Any ideas as to what is wrong with the nullCount? The DataType of the column is a String.

This kind of fixed it:
df.select(df.columns.map(c => count(when(col(c).isNull || col(c) === "" || col(c).isNaN, c)).alias(c)): _*)
Notice the use of the when clause inside the count. Also note that this select is an aggregation, so it always produces exactly one row; that is why nullCount.count() prints 1. Use show(), or collect the row, to see the per-column totals.
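For reference, a minimal sketch of the fixed aggregation against the recVacDate frame from the question, showing how to read the per-column totals (the NaN check is left out here since STATE is a String):
import org.apache.spark.sql.functions.{col, count, when}
// This select is an aggregation, so it always yields exactly ONE row
// (hence nullCount.count() == 1); each column of that row holds a total.
val nullCounts = recVacDate.select(
  recVacDate.columns.map(c => count(when(col(c).isNull || col(c) === "", c)).alias(c)): _*
)
nullCounts.show()                               // displays the per-column totals
val stateNulls = nullCounts.first().getLong(0)  // or pull the STATE total back as a Long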

Related

Adding elements to an ArrayBuffer from a for loop in Scala

import scala.collection.mutable.ArrayBuffer
spark.sql("set db=test_script")
spark.sql("set table=member_test")
val colDF = spark.sql("show columns from ${table} from ${db}")
var tempArray = new ArrayBuffer[String]()
var temp: String = ""
colDF.foreach { row => row.toSeq.foreach { col =>
  temp = "count(case when " + col + " = 'X' then 1 else NULL END) AS count" + col
  tempArray += temp
}}
println(tempArray) // getting empty array
println(temp) // getting blank string
Hi, I am new to Scala programming. I am trying to loop through a DataFrame and append the formatted String data to my ArrayBuffer.
When I put the print statement inside the loop, everything seems to be fine, whereas if I try to access the ArrayBuffer outside the loop, it is empty.
Is it something related to the scope of the variable?
I am using ArrayBuffer because I learned that List is immutable in Scala.
Please suggest a better way if you have one.
Thanks in advance.
The issue you are having is that Spark is a distributed system, which means copies of your buffer are sent to each executor (and never come back to the driver), which is why it is empty on the driver.
Also note that colDF is a DataFrame. This means that when you do
row => row.toSeq
The result of this is a Seq[Any] (working with values typed as Any is not good practice). A better way of doing this would be:
val dataFrame: DataFrame = spark.sql("select * from test_script.member_test")
val columns: Array[String] = dataFrame.columns
val sqlStatement = columns.map(c => s"count(case when $c = 'X' then 1 else NULL END) as count$c")
However, even better is to skip the SQL strings entirely and use the Spark DataFrame API!
val dataFrame: DataFrame = spark.sql("select * from test_script.member_test")
val columns: Array[String] = dataFrame.columns
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, count, lit, when}
val selectStatement: List[Column] = columns.map { c =>
  count(when(col(c) === "X", lit(1))).as(s"count$c")
}.toList
dataFrame.select(selectStatement: _*)
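If the goal is to get those counts back into a local collection on the driver (what the ArrayBuffer was trying to do), here is a minimal sketch, assuming the dataFrame, columns and selectStatement defined above:
// Run the aggregation and bring the single result row back to the driver
val resultRow = dataFrame.select(selectStatement: _*).first()
// Pair each generated alias with its count; this runs on the driver,
// so the values really are visible here (unlike mutating a buffer inside foreach)
val counts: Seq[(String, Long)] =
  columns.zipWithIndex.map { case (c, i) => (s"count$c", resultRow.getLong(i)) }.toSeq
counts.foreach(println)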

Dynamically generate code having filter, withColumnRenamed and coalesce condition Scala Spark

I have a piece of code that I want to generate dynamically. I want to take the columns below as a list or Seq and perform a filter operation (with coalesce inside), followed by drop and withColumnRenamed statements.
Here is the list of columns that I want to accept dynamically (here as a string):
val cols = "a|tmp_a,b|tmp_b"
The code looks something like this:
val df1 = df2.filter(!(coalesce(col("a"), lit(0)) === coalesce(col("tmp_a"), lit(0))) || !(upper(col("b")) === upper(col("tmp_b"))))
.drop("a")
.drop("b")
.withColumnRenamed("tmp_a", "a")
.withColumnRenamed("tmp_b", "b")
If more columns are added to cols, how can the code be adapted dynamically? New column pairs should use the same filter condition as the "b|tmp_b" above.
Given an input with the pairs of column names, you can create the two types of filter conditions (below, the first column pair uses the first filter pattern and the remaining pairs use the second). After the DataFrame is filtered, the drop and withColumnRenamed steps can be applied with a foldLeft.
val cols = "a|tmp_a,b|tmp_b,c|tmp_c".split(",").map(_.split("\\|"))
val filterCondHead = !(coalesce(col(cols.head(0)), lit(0)) === coalesce(col(cols.head(1)), lit(0)))
val filterCondTail = cols.tail.map(c => !(upper(col(c(0))) === upper(col(c(1))))).reduce(_ || _)
val df2 = df.filter(filterCondHead || filterCondTail)
val df3 = cols.foldLeft(df2){ case(df, c) =>
df.drop(c(0)).withColumnRenamed(c(1), c(0))
}
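As a quick sanity check, a minimal sketch of an input DataFrame you could run the snippet above against (the data and the c|tmp_c pair are made up):
import spark.implicits._
val df = Seq(
  (1, 2, "x", "y", "p", "p"),
  (3, 3, "z", "z", "q", "q")
).toDF("a", "tmp_a", "b", "tmp_b", "c", "tmp_c")
// After the filter + foldLeft above, only rows where at least one pair differs remain,
// and tmp_a/tmp_b/tmp_c have been renamed back to a/b/c:
// df3.printSchema()   // a, b, c
// df3.show()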

Check every column in a spark dataframe has a certain value

Can we check whether every column in a Spark DataFrame contains a certain string (for example "Y"), using Spark SQL or Scala?
I have tried the following, but I don't think it works properly.
df.select(df.col("*")).filter("'*' =='Y'")
Thanks,
Sai
You can do something like this to keep the rows where all columns contain 'Y':
// Get all columns
val columns: Array[String] = df.columns
// For each column, keep the rows with 'Y'
val seqDfs: Seq[DataFrame] = columns.map(name => df.filter(s"$name == 'Y'"))
// Intersect the dataframes so only rows that pass every column's filter remain
val output: DataFrame = seqDfs.reduce(_ intersect _)
You can use the DataFrame method columns to get all of the column names
val columnNames: Array[String] = df.columns
and then add all filters in a loop
var filteredDf = df.select(df.col("*"))
for (name <- columnNames) {
  filteredDf = filteredDf.filter(s"$name == 'Y'")
}
or you can build a single SQL-style filter expression using the same approach, as in the sketch below.
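For example, a minimal sketch of that SQL-string variant (assuming the column names are plain identifiers):
// Build a single WHERE-style expression such as "c1 == 'Y' AND c2 == 'Y' AND ..."
val condition = columnNames.map(name => s"$name == 'Y'").mkString(" AND ")
val filteredBySql = df.filter(condition)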
If you want to keep every row in which any of the columns is equal to 1 (or anything else), you can dynamically create a query like this (this answer is in PySpark):
cols = [col(c) == lit(1) for c in df.columns]
query = cols[0]
for c in cols[1:]:
    query |= c
df.filter(query).show()
It's a bit verbose, but it is very clear what is happening. A more elegant version would be:
from functools import reduce
res = df.filter(reduce(lambda x, y: x | y, (col(c) == lit(1) for c in df.columns)))
res.show()
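Since the question asked for Scala, the same reduce-based idea can be sketched in Scala as well (checking for "Y" to match the original question; df and its columns are assumptions):
import org.apache.spark.sql.functions.{col, lit}
// "Any column equals Y": OR the per-column conditions together.
// Swap || for && if you instead want rows where every column equals "Y".
val anyColumnIsY = df.columns.map(c => col(c) === lit("Y")).reduce(_ || _)
df.filter(anyColumnIsY).show()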

Count empty values in dataframe column in Spark (Scala)

I'm trying to count the empty values in a DataFrame column like this:
df.filter((df(colname) === null) || (df(colname) === "")).count()
colname holds the name of the column. This works fine if the column type is string, but if the column type is integer and there are some nulls, this code always returns 0. Why is this so, and how can I change it to make it work?
As mentioned in the question, df.filter((df(colname) === null) || (df(colname) === "")).count() works for String data types, but testing shows that nulls are not handled.
@Psidom's answer handles both null and empty, but does not handle NaN.
Adding a check with .isNaN should handle all three cases:
df.filter(df(colName).isNull || df(colName) === "" || df(colName).isNaN).count()
You can use isNull to test the null condition; comparing with === null evaluates to null rather than true, and filter drops rows whose predicate is null, which is why your count was always 0:
val df = Seq((Some("a"), Some(1)), (null, null), (Some(""), Some(2))).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: string, B: int]
df.filter(df("A").isNull || df("A") === "").count
// res7: Long = 2
df.filter(df("B").isNull || df("B") === "").count
// res8: Long = 1

How to impute NULL values to zero in Spark/Scala

I have a DataFrame in which some columns are of type String and contain "NULL" as a String value (not as an actual NULL). I want to impute them with zero. Apparently df.na.fill(0) doesn't work. How can I impute them with zero?
You can use replace() from DataFrameNaFunctions, which is accessed via the .na prefix:
val df1 = df.na.replace("*", Map("NULL" -> "0"))
You could also create your own UDF that replicates this behaviour:
import org.apache.spark.sql.functions.{col, udf}
val nullReplacer = udf((x: String) => {
if (x == "NULL") "0"
else x
})
val df1 = df.select(df.columns.map(c => nullReplacer(col(c)).alias(c)): _*)
However, this would be superfluous, given that it does the same as the above at the cost of more lines of code than necessary.
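As a quick illustration of the replace() approach, a minimal sketch with made-up data (the column names are hypothetical):
import spark.implicits._
// Toy DataFrame where "NULL" appears as a literal string
val df = Seq(("NULL", "5"), ("3", "NULL")).toDF("amount", "score")
// Replace the string "NULL" with "0" in every string column
val cleaned = df.na.replace("*", Map("NULL" -> "0"))
cleaned.show()   // amount and score now contain "0" where the literal "NULL" was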