Count empty values in dataframe column in Spark (Scala)

I'm trying to count empty values in column in DataFrame like this:
df.filter((df(colname) === null) || (df(colname) === "")).count()
In colname there is a name of the column. This works fine if column type is string but if column type is integer and there are some nulls this code always returns 0. Why is this so? How to change it to make it work?

As mentioned in the question, df.filter((df(colname) === null) || (df(colname) === "")).count() works for String data types, but testing shows that nulls are not handled.
@Psidom's answer handles both null and empty, but does not handle NaN.
Checking for .isNaN as well should handle all three cases:
df.filter(df(colName).isNull || df(colName) === "" || df(colName).isNaN).count()
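For example, a minimal spark-shell sketch with a hypothetical numeric column d (assuming spark.implicits._ is in scope for toDF):
val sample = Seq(Some(1.0), None, Some(Double.NaN)).toDF("d")
sample.filter(sample("d").isNull || sample("d") === "" || sample("d").isNaN).count()
// Long = 2 -- the null row and the NaN row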

You can use isNull to test the null condition:
val df = Seq((Some("a"), Some(1)), (null, null), (Some(""), Some(2))).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: string, B: int]
df.filter(df("A").isNull || df("A") === "").count
// res7: Long = 2
df.filter(df("B").isNull || df("B") === "").count
// res8: Long = 1

Related

Spark DataFrame Get Null Count For All Columns

I have a DataFrame for which I would like to get the total null-value count, and I have the following code that does this generically across all the columns:
First my DataFrame that just contains one column (for simplicity):
val recVacDate = dfRaw.select("STATE")
When I print using a simple filter, I get to see the following:
val filtered = recVacDate.filter("STATE is null")
println(filtered.count()) // Prints 94051
But when I use the code below, I get just 1 as a result, and I do not understand why:
val nullCount = recVacDate.select(recVacDate.columns.map(c => count(col(c).isNull || col(c) === "" || col(c).isNaN).alias(c)): _*)
println(nullCount.count()) // Prints 1
Any ideas as to what is wrong with the nullCount? The DataType of the column is a String.
This kind of fixed it:
df.select(df.columns.map(c => count(when(col(c).isNull || col(c) === "" || col(c).isNaN, c)).alias(c)): _*)
Notice the use of the when clause inside the count.
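The when matters because count over a bare boolean expression counts every non-null result, true or false alike, whereas count(when(cond, c)) only counts the rows where the condition holds (and nullCount.count() is always 1 simply because the aggregation produces a single row). A minimal sketch on a made-up single-column frame (the real dfRaw is not shown in the question; isNaN is dropped here for brevity):
import org.apache.spark.sql.functions.{col, count, when}
val sample = Seq(Some("CA"), None, Some(""), Some("NY")).toDF("STATE")
sample.select(sample.columns.map(c => count(when(col(c).isNull || col(c) === "", c)).alias(c)): _*).show()
// +-----+
// |STATE|
// +-----+
// |    2|
// +-----+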

Compare timestamp columns - Spark scala

I am trying to compare two timestamp columns and return the minimum of the two, and I am wondering whether there is a better way than what I have. Note that both columns might have values, or one might have a value while the other is null. I know about when/otherwise.
import org.apache.spark.sql.functions._
import java.sql.Timestamp
val compareTime = udf((t1: Timestamp, t2: Timestamp) => {
  if (t1 != null && t2 != null && t1.before(t2)) {
    Some(t1)
  } else if (t1 != null && t2 != null && t2.before(t1)) {
    Some(t2)
  } else if (t1 != null) {
    Some(t1)
  } else if (t2 != null) {
    Some(t2)
  } else {
    None
  }
})
var df = Seq((1L, "2021-01-04 16:10:00","2021-01-04 15:20:00")).toDF("id","t1","t2")
df = (df.withColumn("t1",to_timestamp($"t1","yyyy-MM-dd HH:mm:ss"))
.withColumn("t2",to_timestamp($"t2","yyyy-MM-dd HH:mm:ss")))
df = df.withColumn("t3",compareTime($"t1",$"t2"))
df.show()
UDF is probably unnecessary here - you can use the Spark SQL function least:
var df = Seq((1L, "2021-01-04 16:10:00","2021-01-04 15:20:00")).toDF("id","t1","t2")
df = (df.withColumn("t1",to_timestamp($"t1","yyyy-MM-dd HH:mm:ss"))
.withColumn("t2",to_timestamp($"t2","yyyy-MM-dd HH:mm:ss")))
df = df.withColumn("t3",least($"t1",$"t2"))
df.show()
+---+-------------------+-------------------+-------------------+
| id| t1| t2| t3|
+---+-------------------+-------------------+-------------------+
| 1|2021-01-04 16:10:00|2021-01-04 15:20:00|2021-01-04 15:20:00|
+---+-------------------+-------------------+-------------------+
The opposite of least is greatest, if you want to get the larger one of the two columns.
Note that both least and greatest will ignore null values, but they will return null if all input columns are null.
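To see the null behaviour, a small sketch reusing the same column names, with t1 left null:
val withNull = (Seq((2L, null.asInstanceOf[String], "2021-01-04 15:20:00")).toDF("id", "t1", "t2")
  .withColumn("t1", to_timestamp($"t1", "yyyy-MM-dd HH:mm:ss"))
  .withColumn("t2", to_timestamp($"t2", "yyyy-MM-dd HH:mm:ss")))
withNull.select(least($"t1", $"t2").as("t3")).show()
// t3 is 2021-01-04 15:20:00 -- the null t1 is ignored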
Try this:
(Option(t1) ++ Option(t2)).minOption
It should do the same job as your if..else if..else stack.
Oops, my bad: Spark doesn't run on Scala 2.13.x (where minOption was introduced). Try this instead:
util.Try((Option(t1) ++ Option(t2)).minBy(_.getTime())).toOption
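For reference, a sketch of how that expression might be dropped into the same UDF shape as the question (assuming the question's imports are in scope, and reusing its df):
val compareTime = udf((t1: Timestamp, t2: Timestamp) =>
  util.Try((Option(t1) ++ Option(t2)).minBy(_.getTime)).toOption
)
df.withColumn("t3", compareTime($"t1", $"t2")).show()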

How can I filter spark Dataframe according to the value that column contains?

In the dataset I have "None" or "NA" values in some string columns, and I want to count how many of these null-like values the dataset contains. Based on that I will decide which approach to use for the missing values.
I tried filtering the DataFrame column by column with an or expression, which is very inefficient. I want to filter them in a more efficient and neat way. It would be better to do this without converting to an RDD, but if that is not possible with a DataFrame, the RDD way is also acceptable.
I found the thread Spark SQL filter multiple fields, which is similar to my question, but I want a neater and more elegant way to write this, because I have so many columns:
// trainDataFull is my dataframe
val nullValues = Array("NA", "None")
val filtered = trainDataFull.filter(trainDataFull("Alley").isin(nullValues:_*) ||
trainDataFull("MSZoning").isin(nullValues:_*) ||
trainDataFull("Street").isin(nullValues:_*) ||
trainDataFull("LotShape").isin(nullValues:_*) ||
trainDataFull("LandContour").isin(nullValues:_*) ||
trainDataFull("Utilities").isin(nullValues:_*) ||
trainDataFull("LotConfig").isin(nullValues:_*) ||
trainDataFull("LandSlope").isin(nullValues:_*) ||
trainDataFull("Neighborhood").isin(nullValues:_*) ||
trainDataFull("Condition1").isin(nullValues:_*) ||
trainDataFull("Condition2").isin(nullValues:_*) ||
trainDataFull("BldgType").isin(nullValues:_*) ||
trainDataFull("HouseStyle").isin(nullValues:_*) ||
trainDataFull("RoofStyle").isin(nullValues:_*) ||
trainDataFull("RoofMatl").isin(nullValues:_*) ||
trainDataFull("Exterior1st").isin(nullValues:_*) ||
trainDataFull("Exterior2nd").isin(nullValues:_*) ||
trainDataFull("MasVnrType").isin(nullValues:_*) ||
trainDataFull("MasVnrArea").isin(nullValues:_*) ||
trainDataFull("ExterQual").isin(nullValues:_*) ||
trainDataFull("MasVnrArea").isin(nullValues:_*) ||
trainDataFull("ExterQual").isin(nullValues:_*) ||
trainDataFull("ExterCond").isin(nullValues:_*) ||
trainDataFull("Foundation").isin(nullValues:_*) ||
trainDataFull("BsmtQual").isin(nullValues:_*) ||
trainDataFull("BsmtCond").isin(nullValues:_*) ||
trainDataFull("BsmtExposure").isin(nullValues:_*)
)
I want to see which column has how many null values.
You can always generate the query programmatically:
val nullValues = Array("NA", "None")
val df = Seq(("NA", "Foo"), ("None", "NA")).toDF("MSZoning", "Street")
import org.apache.spark.sql.functions.{col, sum, when}
import org.apache.spark.sql.types.{StringType, StructField}
val columns = df.schema.collect {
  case StructField(name, StringType, _, _) =>
    sum(when(col(name).isInCollection(nullValues), 1)).as(name)
}
df.select(columns:_*).show()
Output:
+--------+------+
|MSZoning|Street|
+--------+------+
| 2| 1|
+--------+------+
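If you also want genuine nulls counted alongside the "NA"/"None" strings, the same pattern extends naturally; a hedged sketch, still restricted to the string columns:
val columnsWithNulls = df.schema.collect {
  case StructField(name, StringType, _, _) =>
    sum(when(col(name).isNull || col(name).isInCollection(nullValues), 1).otherwise(0)).as(name)
}
df.select(columnsWithNulls: _*).show()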

SPARK SQL: How to filter records by multiple columns and use groupBy too

//dataset
michael,swimming,silve,2016,USA
usha,running,silver,2014,IND
lisa,javellin,gold,2014,USA
michael,swimming,silver,2017,USA
Questions --
1) How many silver medals have been won by the USA in each sport? The code below throws the error value === is not a member of String.
val rdd = sc.textFile("/home/training/mydata/sports.txt")
val text = rdd.map(lines => lines.split(",")).map(arrays => (arrays(0), arrays(1), arrays(2), arrays(3), arrays(4))).toDF("first_name","sports","medal_type","year","country")
text.filter(text("medal_type")==="silver" && ("country")==="USA").groupBy("year").count().show
2) What is the difference between === and ==
When I use filter and select with === and just one condition in it (no && or ||), they show me the string result and the boolean result respectively, but when I use select and filter with ==, errors are thrown.
using this:
text.filter(text("medal_type")==="silver" && text("country")==="USA").groupBy("year").count().show
+----+-----+
|year|count|
+----+-----+
|2017| 1|
+----+-----+
I'll just answer your first question (note that there is a typo, "silve", in the first data line).
About the second question:
== and === are just methods in Scala.
In Spark, === uses the equalTo method, which builds an equality test over the column values:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Column.html#equalTo-java.lang.Object-
// Scala:
df.filter( df("colA") === df("colB") )
// Java
import static org.apache.spark.sql.functions.*;
df.filter( col("colA").equalTo(col("colB")) );
and == uses the equals method, which compares the two Column objects themselves rather than the values in the rows.
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Column.html#equals-java.lang.Object-
Notice the return types of each function: == (equals) returns a plain Boolean, while === (equalTo) returns a Column holding the result of the comparison.
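A small sketch of the difference in return types (illustrative only):
val df = Seq(("a", "a")).toDF("colA", "colB")
val asBoolean: Boolean = df("colA") == df("colB")                      // compares the Column objects themselves
val asColumn: org.apache.spark.sql.Column = df("colA") === df("colB")  // an expression over the row values
df.filter(asColumn).show()                                             // usable inside filter/select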

How to impute NULL values to zero in Spark/Scala

I have a DataFrame in which some columns are of type String and contain "NULL" as a string value (not as an actual null). I want to impute them with zero. Apparently df.na.fill(0) doesn't work. How can I impute them with zero?
You can use replace() from DataFrameNaFunctions, which can be accessed via the .na prefix:
val df1 = df.na.replace("*", Map("NULL" -> "0"))
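For example, with a hypothetical two-column frame containing the literal string "NULL":
val sample = Seq(("NULL", "1"), ("2", "NULL")).toDF("a", "b")
sample.na.replace("*", Map("NULL" -> "0")).show()
// +---+---+
// |  a|  b|
// +---+---+
// |  0|  1|
// |  2|  0|
// +---+---+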
You could also create your own udf that replicates this behaviour:
import org.apache.spark.sql.functions.{col, udf}
val nullReplacer = udf((x: String) => {
  if (x == "NULL") "0"
  else x
})
val df1 = df.select(df.columns.map(c => nullReplacer(col(c)).alias(c)): _*)
However, this would be superfluous, given that it does the same as the above at the cost of more lines of code than necessary.