Scala/Spark: Checking for null elements in an array column, but IntelliJ suggests not to use null?

I have a column called responseTimes which is of arrayType:
ArrayType(IntegerType,true)
I'm trying to add another column to count the number of null or not-set values in this array:
val contains_null = udf((xs: Seq[Integer]) => xs.contains(null))
df.withColumn("totalNulls", when(contains_null(col("responseTimes")),
lit(1)).otherwise(0))
Although this gives me the right output, IntelliJ keeps telling me to avoid the use of null in my UDF which makes me think this is bad. Is there any other way to do it? Also, is it possible without using UDFs?

The reason is simple: it comes down to the rules for Spark UDFs, since Spark handles null in its own distributed way. You may also want to look at the built-in array_contains function in Spark SQL.
If UDFs are needed, follow these rules:
Scala code should deal with null values gracefully and shouldn’t error out if there are null values.
Scala code should return None (or null) for values that are unknown, missing, or irrelevant. DataFrames should also use null for values that are unknown, missing, or irrelevant.
Use Option in Scala code and fall back on null if Option becomes a performance bottleneck.
Please refer to this link if you'd like to read more: https://mungingdata.com/apache-spark/dealing-with-null/
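To make those rules concrete, here is a minimal sketch (my own illustration, not from the linked article) of the original UDF rewritten to count nulls in a null-safe way with Option; countNulls is a hypothetical name:
import org.apache.spark.sql.functions.{col, udf}
// Sketch only: wrapping the possibly-null array in Option and returning Option[Int]
// means a missing array yields a null count instead of a NullPointerException.
val countNulls = udf { (xs: Seq[Integer]) => Option(xs).map(_.count(_ == null)) }
df.withColumn("totalNulls", countNulls(col("responseTimes")))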

You can rewrite your UDF to use Option. In Scala, Option(null) gives None, so you can do:
val contains_null = udf((xs: Seq[Integer]) => xs.exists(e => Option(e).isEmpty))
However, if you are using Spark 2.4+, it is better to use Spark's built-in functions for this. To check whether an array column contains null elements, use exists, as suggested in mck's answer below.
If you want to get the count of nulls in the array, you can combine the filter and size functions:
df.withColumn("totalNulls", size(expr("filter(responseTimes, x -> x is null)")))

A better way is probably to use higher order function exists to check isnull for each array element:
// sample dataframe
val df = spark.sql("select array(1,null,2) responseTimes union all select array(3,4)")
df.show
+-------------+
|responseTimes|
+-------------+
|      [1,, 2]|
|       [3, 4]|
+-------------+
// check whether there exists null elements in the array
val df2 = df.withColumn("totalNulls", expr("int(exists(responseTimes, x -> isnull(x)))"))
df2.show
+-------------+----------+
|responseTimes|totalNulls|
+-------------+----------+
|      [1,, 2]|         1|
|       [3, 4]|         0|
+-------------+----------+
You can also use array_max together with transform:
val df2 = df.withColumn("totalNulls", expr("int(array_max(transform(responseTimes, x -> isnull(x))))"))
df2.show
+-------------+----------+
|responseTimes|totalNulls|
+-------------+----------+
|      [1,, 2]|         1|
|       [3, 4]|         0|
+-------------+----------+
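Similarly, from Spark 3.0 onwards the exists higher-order function is exposed in the Scala API as well, so the expr() string is optional (a sketch, assuming Spark 3.0+):
import org.apache.spark.sql.functions.{col, exists}
// Spark 3.0+ only: exists takes a Column => Column predicate directly
val df3 = df.withColumn("totalNulls", exists(col("responseTimes"), x => x.isNull).cast("int"))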

Related

unable to convert null value to 0

I'm working with Databricks and I don't understand why I'm not able to convert null values to 0 in what seems like a regular integer column.
I've tried these two options:
@udf(IntegerType())
def null_to_zero(x):
    """
    Helper function to transform Null values to zeros
    """
    return 0 if x == 'null' else x
and later:
.withColumn("col_test", null_to_zero(col("col")))
and everything is returned as null.
The second option, .na.fill(value=0, subset=["col"]), simply doesn't have any impact.
What am I missing here? Is this a specific behavior of null values with Databricks?
The nulls are represented as None, not as the string 'null'. For your case it's better to use the coalesce function instead, like this (example based on the docs):
from pyspark.sql.functions import coalesce, lit, when
cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDf.withColumn("col_test", coalesce(cDf["a"], lit(0.0))).show()
will give you desired behavior:
+----+----+--------+
|   a|   b|col_test|
+----+----+--------+
|null|null|     0.0|
|   1|null|     1.0|
|null|   2|     0.0|
+----+----+--------+
If you need more complex logic, you can use when/otherwise with a condition on null:
cDf.withColumn("col_test", when(cDf["a"].isNull(), lit(0.0)).otherwise(cDf["a"])).show()

Join 2 DataFrames based on lookup within a Column of collections - Spark, Scala

I have 2 dataframes as below,
val x = Seq((Seq(4,5),"XXX"),(Seq(7),"XYX")).toDF("X","NAME")
val y = Seq((5)).toDF("Y")
I want to join the two dataframes by looking up the value from y and searching the Seq/Array in x.select("X"); if it exists, then join the complete row of x with y.
How can I achieve this in Spark?
Cheers!
With Spark 2.4.3 you can use the built-in array_contains function:
scala> val x = Seq((Seq(4,5),"XXX"),(Seq(7),"XYX")).toDF("X","NAME")
scala> val y = Seq((5)).toDF("Y")
scala> x.join(y,expr("array_contains(X, y)"),"left").show
+------+----+----+
|     X|NAME|   Y|
+------+----+----+
|[4, 5]| XXX|   5|
|   [7]| XYX|null|
+------+----+----+
please confirm that's what you want to achieve?
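The same join can also be written with the DataFrame API instead of a SQL expression string, since array_contains accepts a Column as the value to look up (a sketch of that variant):
import org.apache.spark.sql.functions.array_contains
// same condition as the expr() version above, just built with the Scala API
x.join(y, array_contains($"X", $"Y"), "left").show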
You can use a UDF for the join; this works for all Spark versions:
val array_contains = udf((arr: Seq[Int], element: Int) => arr.contains(element))
x.join(y, array_contains($"X", $"Y"), "left")
  .show()
Another approach is to explode your array into rows with a new temporary column. If you run the following code:
x.withColumn("temp", explode('X)).show()
it would show:
+------+----+----+
|     X|NAME|temp|
+------+----+----+
|[4, 5]| XXX|   4|
|[4, 5]| XXX|   5|
|   [7]| XYX|   7|
+------+----+----+
As you can see, you can now just do the join using the temp and Y columns (and then drop temp):
x.withColumn("temp", explode('X))
  .join(y, 'temp === 'Y)
  .drop('temp)
This may fail by creating duplicate rows if X contains duplicates. In this case, you'd have to additionally call distinct:
x.withColumn("temp", explode('X))
.distinct()
.join(y, 'temp === 'Y, "left")
.drop('temp)
Since this approach uses Spark-native methods it will be a little faster than the one using a UDF, but it is arguably less elegant.

Select row by value in set after collect_set with pyspark

Using
from pyspark.sql import functions as f
and methods f.agg and f.collect_set I have created a column colSet within a dataFrame as follows:
+-------+--------+
| index | colSet |
+-------+--------+
|      1|[11, 13]|
|      2|  [3, 6]|
|      3|  [3, 7]|
|      4|  [2, 7]|
|      5|  [2, 6]|
+-------+--------+
Now, how is it possible, using Python/PySpark, to select only those rows where, for instance, 3 is an element of the array in the colSet entry (where in general there can be far more than two entries)?
I have tried using a udf function like this:
isInSet = f.udf( lambda vcol, val: val in vcol, BooleanType())
being called via
dataFrame.where(isInSet(f.col('colSet'), 3))
I also tried removing f.col from the caller and using it in the definition of isInSet instead, but neither worked; I am getting an exception:
AnalysisException: cannot resolve '3' given input columns: [index, colSet]
Any help is appreciated on how to select rows with a certain entry (or even better subset!!!) given a row with a collect_set result.
Your original UDF is fine, but to use it you need to pass the value 3 as a literal:
dataFrame.where(isInSet(f.col('colSet'), f.lit(3)))
But as jxc points out in a comment, using array_contains is probably a better choice:
dataFrame.where(f.array_contains(f.col('colSet'), 3))
I have not done any benchmarking, but in general using UDFs in PySpark is slower than using built-in functions because of the back-and-forth communication between the JVM and the Python interpreter.
I found a solution today (after failing Friday evening) without using a UDF:
[3 in x[0] for x in list(dataFrame.select(['colSet']).collect())]
Hope this helps someone else in the future.

Spark (scala) dataframes - Check whether strings in column contain any items from a set

I'm pretty new to scala and spark and I've been trying to find a solution for this issue all day - it's doing my head in. I've tried 20 different variations of the following code and keep getting type mismatch errors when I try to perform calculations on a column.
I have a spark dataframe, and I wish to check whether each string in a particular column contains any number of words from a pre-defined List (or Set) of words.
Here is some example data for replication:
// sample data frame
val df = Seq(
  (1, "foo"),
  (2, "barrio"),
  (3, "gitten"),
  (4, "baa")).toDF("id", "words")
// dictionary Set of words to check
val dict = Set("foo","bar","baaad")
Now, I am trying to create a third column with the results of a comparison, to see whether the strings in the $"words" column contain any of the words in the dict Set. So the result should be:
+---+-----------+-------------+
| id|      words|   word_check|
+---+-----------+-------------+
|  1|        foo|         true|
|  2|     barrio|         true|
|  3|     gitten|        false|
|  4|        baa|        false|
+---+-----------+-------------+
First, I tried to see if I could do it natively without using UDFs, since the dict Set will actually be a large dictionary of > 40K words, and as I understand it this would be more efficient than a UDF:
df.withColumn("word_check", dict.exists(d => $"words".contains(d)))
But i get the error:
type mismatch;
found : org.apache.spark.sql.Column
required: Boolean
I have also tried to create a UDF to do this (using both mutable.Set and mutable.WrappedArray to describe the Set - not sure which is correct but neither work):
val checker: ((String, scala.collection.mutable.Set[String]) => Boolean) = (col: String, array: scala.collection.mutable.Set[String] ) => array.exists(d => col.contains(d))
val udf1 = udf(checker)
df.withColumn("word_check", udf1($"words", dict )).show()
But get another type mismatch:
found : scala.collection.immutable.Set[String]
required: org.apache.spark.sql.Column
If the set were a single fixed value, I should be able to use lit() in the expression? But I don't really understand how performing more complex functions on a column, mixing different data types, works in Scala.
Any help greatly appreciated, especially if it can be done efficiently (it is a large df of > 5m rows).
Regardless of efficiency, this seems to work:
df.withColumn("word_check", dict.foldLeft(lit(false))((a, b) => a || locate(b, $"words") > 0)).show
+---+------+----------+
| id| words|word_check|
+---+------+----------+
|  1|   foo|      true|
|  2|barrio|      true|
|  3|gitten|     false|
|  4|   baa|     false|
+---+------+----------+
Here's how you'd do it with a UDF:
val checkerUdf = udf { (s: String) => dict.exists(s.contains(_)) }
df.withColumn("word_check", checkerUdf($"words")).show()
The mistake in your implementation is that you've created a UDF expecting two arguments, which means you'd have to pass two Columns when applying it - but dict isn't a Column in your DataFrame, it's a local variable.
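If you do want to keep the two-argument shape, the local Set has to be turned into a literal array column first, e.g. with typedLit (available since Spark 2.2); here is a sketch with a hypothetical checker2:
import org.apache.spark.sql.functions.{udf, typedLit}
// Sketch: the dictionary is passed as an array literal, so the UDF receives it as a Seq
val checker2 = udf { (word: String, dict: Seq[String]) => dict.exists(word.contains(_)) }
df.withColumn("word_check", checker2($"words", typedLit(dict.toSeq))).show()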
If your dict is large, you should not just reference it in your UDF, because the entire dict is sent over the network for every task. I would broadcast your dict in combination with a UDF:
import org.apache.spark.broadcast.Broadcast
def udf_check(words: Broadcast[scala.collection.immutable.Set[String]]) = {
  udf { (s: String) => words.value.exists(s.contains(_)) }
}
df.withColumn("word_check", udf_check(sparkContext.broadcast(dict))($"words"))
Alternatively, you could also use a join:
val dict_df = dict.toList.toDF("word")
df
  .join(broadcast(dict_df), $"words".contains($"word"), "left")
  .withColumn("word_check", $"word".isNotNull)
  .drop($"word")

Count the number of non-null values in a Spark DataFrame

I have a data frame with some columns, and before doing analysis, I'd like to understand how complete the data frame is. So I want to filter the data frame and count for each column the number of non-null values, possibly returning a dataframe back.
Basically, I am trying to achieve the same result as expressed in this question but using Scala instead of Python.
Say you have:
val row = Row("x", "y", "z")
val df = sc.parallelize(Seq(row(0, 4, 3), row(None, 3, 4), row(None, None, 5))).toDF()
How can you summarize the number of non-null values for each column and return a dataframe with the same number of columns and just a single row with the answer?
One straightforward option is to use the .describe() function to get a summary of your data frame; its count row contains the number of non-null values per column:
df.describe().filter($"summary" === "count").show
+-------+---+---+---+
|summary|  x|  y|  z|
+-------+---+---+---+
|  count|  1|  2|  3|
+-------+---+---+---+
Although I like Psidom's answer, I'm often more interested in the fraction of null values, because the number of non-null values alone doesn't tell much...
You can do something like:
import org.apache.spark.sql.functions.{sum, when, count}
df.agg(
  (sum(when($"x".isNotNull, 0).otherwise(1)) / count("*")).as("x : fraction null"),
  (sum(when($"y".isNotNull, 0).otherwise(1)) / count("*")).as("y : fraction null"),
  (sum(when($"z".isNotNull, 0).otherwise(1)) / count("*")).as("z : fraction null")
).show()
EDIT: note that count($"x") counts only the non-null values of x, so count("*") - count($"x") gives the same null count as sum(when($"x".isNotNull, 0).otherwise(1)). As I find this less obvious, I tend to use the sum notation, which is clearer.
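A minimal sketch of that count-based variant for the x column (my own illustration, same denominator as above):
import org.apache.spark.sql.functions.count
// count($"x") skips nulls, count("*") counts every row, so the difference is the null count
df.agg(((count("*") - count($"x")) / count("*")).as("x : fraction null")).show()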
Here's how I did it in Scala 2.11, Spark 2.3.1:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
df.agg(
  count("x").divide(count(lit(1)))
    .as("x: percent non-null")
  // ...copy paste that for columns y and z
).head()
count("x") counts only the non-null values in x, while count(lit(1)) counts every row.
If you instead want the percent null, take the complement of the count-based expression above:
lit(1).minus(
  count("x").divide(count(lit(1)))
).as("x: percent null")
It's also worth knowing that you can cast nullness to an integer, then sum it.
But it's probably less performant:
// cast null-ness to an integer
sum(col("x").isNull.cast(IntegerType))
.divide(count(lit(1)))
.as("x: percent null")
Here is the simplest query, for a single column:
df.filter($"x".isNotNull).count
To get the non-null count of every column at once:
df.select(df.columns map count: _*)
or, keeping the original column names:
df.select(df.columns map count: _*).toDF(df.columns: _*)
In Spark 2.3+ you can use summary (for string and numeric columns):
df.summary("count").show()
+-------+---+---+---+
|summary|  x|  y|  z|
+-------+---+---+---+
|  count|  1|  2|  3|
+-------+---+---+---+