SparkSQL: How to deal with null values in user defined function? - scala

Given Table 1 with one column "x" of type String.
I want to create Table 2 with a column "y" that is an integer representation of the date strings given in "x".
It is essential to keep null values in column "y".
Table 1 (Dataframe df1):
+----------+
|         x|
+----------+
|2015-09-12|
|2015-09-13|
|      null|
|      null|
+----------+
root
 |-- x: string (nullable = true)
Table 2 (Dataframe df2):
+----------+--------+
|         x|       y|
+----------+--------+
|      null|    null|
|      null|    null|
|2015-09-12|20150912|
|2015-09-13|20150913|
+----------+--------+
root
 |-- x: string (nullable = true)
 |-- y: integer (nullable = true)
The user-defined function (udf) that converts values from column "x" into those of column "y" is:
val extractDateAsInt = udf[Int, String] (
  (d:String) => d.substring(0, 10)
    .filterNot( "-".toSet)
    .toInt )
It works for non-null input, but it cannot handle null values.
Although I can do something like
val extractDateAsIntWithNull = udf[Int, String] (
  (d:String) =>
    if (d != null) d.substring(0, 10).filterNot( "-".toSet).toInt
    else 1 )
I have found no way to "produce" null values via udfs (naturally, since an Int cannot be null).
My current solution for creation of df2 (Table 2) is as follows:
// holds data of table 1
val df1 = ...
// filter entries from df1, that are not null
val dfNotNulls = df1.filter(df1("x")
.isNotNull)
.withColumn("y", extractDateAsInt(df1("x")))
.withColumnRenamed("x", "right_x")
// create df2 via a left join on df1 and dfNotNull having
val df2 = df1.join( dfNotNulls, df1("x") === dfNotNulls("right_x"), "leftouter" ).drop("right_x")
Questions:
The current solution seems cumbersome (and probably not efficient wrt. performance). Is there a better way?
@Spark-developers: Is there a type NullableInt planned / available, such that the following udf is possible (see code excerpt)?
Code excerpt
val extractDateAsNullableInt = udf[NullableInt, String] (
(d:String) =>
if (d != null) d.substring(0, 10).filterNot( "-".toSet).toInt
else null )

This is where Option comes in handy:
val extractDateAsOptionInt = udf((d: String) => d match {
case null => None
case s => Some(s.substring(0, 10).filterNot("-".toSet).toInt)
})
or, to make it slightly more robust in the general case:
import scala.util.Try
val extractDateAsOptionInt = udf((d: String) => Try(
d.substring(0, 10).filterNot("-".toSet).toInt
).toOption)
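To see how the two variants differ, the UDF bodies can be exercised as plain Scala functions, without a Spark session. This is only a minimal sketch: the names parseStrict and parseLenient and the malformed sample value are made up for illustration and do not appear in the answer above.

```scala
import scala.util.Try

// Body of the pattern-matching variant: only null is mapped to None.
def parseStrict(d: String): Option[Int] = d match {
  case null => None
  case s    => Some(s.substring(0, 10).filterNot("-".toSet).toInt)
}

// Body of the Try variant: any non-fatal exception becomes None.
def parseLenient(d: String): Option[Int] =
  Try(d.substring(0, 10).filterNot("-".toSet).toInt).toOption

val valid = "2015-09-12"
val malformed = "n/a" // hypothetical bad input, too short for substring(0, 10)

println(parseStrict(valid))                    // Some(20150912)
println(parseStrict(null))                     // None
println(parseLenient(malformed))               // None
println(Try(parseStrict(malformed)).isFailure) // true: the strict variant throws
```

Note that the Try variant also turns genuinely malformed input into null in the output column, which may hide data problems you would rather see.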
All credit goes to Dmitriy Selivanov, who pointed out this solution as a (missing?) edit here.
An alternative is to handle null outside the UDF:
import org.apache.spark.sql.functions.{lit, udf, when}
import org.apache.spark.sql.types.IntegerType
val extractDateAsInt = udf(
(d: String) => d.substring(0, 10).filterNot("-".toSet).toInt
)
df.withColumn("y",
when($"x".isNull, lit(null))
.otherwise(extractDateAsInt($"x"))
.cast(IntegerType)
)

Scala actually has a nice factory function, Option(), that can make this even more concise:
val extractDateAsOptionInt = udf((d: String) =>
Option(d).map(_.substring(0, 10).filterNot("-".toSet).toInt))
Internally the Option object's apply method is just doing the null check for you:
def apply[A](x: A): Option[A] = if (x == null) None else Some(x)

Supplementary code
Building on the nice answer of @zero323, I created the following code to provide user-defined functions that handle null values as described. I hope it is helpful for others!
/**
* Set of methods to construct [[org.apache.spark.sql.UserDefinedFunction]]s that
* handle `null` values.
*/
object NullableFunctions {
import org.apache.spark.sql.functions._
import scala.reflect.runtime.universe.{TypeTag}
import org.apache.spark.sql.UserDefinedFunction
/**
 * Given a function A1 => RT, create a [[org.apache.spark.sql.UserDefinedFunction]] such that
 * * if the function input is null, None is returned. This will create a null value in the output Spark column.
 * * if the input is non-null, Some(f(input)) is returned, thus creating f(input) as value in the output column.
 * @param f function from A1 => RT
 * @tparam RT return type
 * @tparam A1 input parameter type
 * @return a [[org.apache.spark.sql.UserDefinedFunction]] with the behaviour described above
 */
def nullableUdf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction = {
udf[Option[RT],A1]( (i: A1) => i match {
case null => None
case s => Some(f(i))
})
}
/**
 * Given a function (A1, A2) => RT, create a [[org.apache.spark.sql.UserDefinedFunction]] such that
 * * if one of the function input parameters is null, None is returned.
 * This will create a null value in the output Spark column.
 * * if both input parameters are non-null, Some(f(input1, input2)) is returned, thus creating f(input1, input2)
 * as value in the output column.
 * @param f function from (A1, A2) => RT
 * @tparam RT return type
 * @tparam A1 input parameter type
 * @tparam A2 input parameter type
 * @return a [[org.apache.spark.sql.UserDefinedFunction]] with the behaviour described above
 */
def nullableUdf[RT: TypeTag, A1: TypeTag, A2: TypeTag](f: Function2[A1, A2, RT]): UserDefinedFunction = {
udf[Option[RT], A1, A2]( (i1: A1, i2: A2) => (i1, i2) match {
case (null, _) => None
case (_, null) => None
case (s1, s2) => Some((f(s1,s2)))
} )
}
}
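The null-lifting step at the heart of nullableUdf can also be written with Option(...), as in the earlier answer. The sketch below shows only that wrapping step as plain Scala, with no Spark involved; nullSafe and nullSafe2 are hypothetical names used for illustration, not part of the object above.

```scala
// Lift a one-argument function so that a null input yields None.
def nullSafe[A, R](f: A => R): A => Option[R] =
  (a: A) => Option(a).map(f)

// Two-argument version: None if either input is null.
def nullSafe2[A, B, R](f: (A, B) => R): (A, B) => Option[R] =
  (a: A, b: B) => for { x <- Option(a); y <- Option(b) } yield f(x, y)

val safeLength = nullSafe((s: String) => s.length)
val safeConcat = nullSafe2((a: String, b: String) => a + b)

println(safeLength("abc"))     // Some(3)
println(safeLength(null))      // None
println(safeConcat("a", null)) // None
```

In the Spark setting, the returned Option is what ends up as a value or a null in the output column.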

Related

Spark scala dataframe get value for each row and assign to variables

I have a dataframe like below :
val df=spark.sql("select * from table")
row1|row2|row3
A1,B1,C1
A2,B2,C2
A3,B3,C3
I want to iterate over the rows in a loop to get values like this:
val value1="A1"
val value2="B1"
val value3="C1"
function(value1,value2,value3)
Please help me.
You have two options:
Solution 1: If your data is big, you must stick with DataFrames. To apply a function to every row, you define a UDF.
Solution 2: If your data is small, you can collect it to the driver machine and then iterate with a map.
Example:
val df = Seq((1,2,3), (4,5,6)).toDF("a", "b", "c")
def sum(a: Int, b: Int, c: Int) = a+b+c
// Solution 1
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{struct, udf}
val myUDF = udf((r: Row) => sum(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2)))
df.select(myUDF(struct($"a", $"b", $"c")).as("sum")).show
//Solution 2
df.collect.map(r=> sum(r.getAs[Int](0), r.getAs[Int](1), r.getAs[Int](2)))
Output of Solution 1 (Solution 2 returns the same values as a local Array(6, 15)):
+---+
|sum|
+---+
|  6|
| 15|
+---+
EDIT:
val myUDF = udf((r: Row) => {
val value1 = r.getAs[Int](0)
val value2 = r.getAs[Int](1)
val value3 = r.getAs[Int](2)
myFunction(value1, value2, value3)
})

Filtering dataframe array items based on an external array with intersection

I'm trying to define a way to filter elements from WrappedArrays in DataFrames. The filter is based on an external list of elements.
Looking for a solution, I found this question. It is very similar, but it does not seem to work for me. I'm using Spark 2.4.0. This is my code:
val df = sc.parallelize(Array((1, Seq("s", "v", "r")),(2, Seq("r", "a", "v")),(3, Seq("s", "r", "t")))).toDF("foo","bar")
def filterItems(flist: Seq[String]) = udf {
(recs: Seq[String]) => recs match {
case null => Seq.empty[String]
case recs => recs.intersect(flist)
}}
df.withColumn("filtercol", filterItems(Seq("s", "v"))(col("bar"))).show(5)
My expected result would be:
+---+---------+---------+
|foo|      bar|filtercol|
+---+---------+---------+
|  1|[s, v, r]|   [s, v]|
|  2|[r, a, v]|      [v]|
|  3|[s, r, t]|      [s]|
+---+---------+---------+
But I'm getting this error:
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
You can actually use the built-in array_intersect function in Spark 2.4 without much effort:
import org.apache.spark.sql.functions.{array_intersect, array, lit}
val df = sc.parallelize(Array((1, Seq("s", "v", "r")),(2, Seq("r", "a", "v")),(3, Seq("s", "r", "t")))).toDF("foo","bar")
val ar = Seq("s", "v").map(lit(_))
df.withColumn("filtercol", array_intersect($"bar", array(ar:_*))).show
Output:
+---+---------+---------+
|foo|      bar|filtercol|
+---+---------+---------+
|  1|[s, v, r]|   [s, v]|
|  2|[r, a, v]|      [v]|
|  3|[s, r, t]|      [s]|
+---+---------+---------+
The only tricky part is Seq("s", "v").map(lit(_)), which maps each string to a literal column lit(s). array_intersect accepts two array columns: the first is the bar column, and the second is created on the fly with array(ar:_*), which holds the lit(...) values.
If you pass an attribute of ArrayType into a UDF, it arrives as an instance of WrappedArray, which is not a List. So you should declare recs as Seq, IndexedSeq or WrappedArray; normally I just use plain Seq:
def filterItems(flist: List[String]) = udf {
(recs: Seq[String]) => recs match {
case null => Seq.empty[String]
case recs => recs.intersect(flist)
}}
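The body of this UDF is ordinary Scala collection code, so its behaviour can be checked without Spark. A minimal sketch follows, using a hypothetical helper named intersectWith and the sample rows from the question:

```scala
// Plain-function version of the UDF body: a null array yields an empty result.
def intersectWith(flist: Seq[String])(recs: Seq[String]): Seq[String] =
  Option(recs).map(_.intersect(flist)).getOrElse(Seq.empty[String])

val keep = Seq("s", "v")

println(intersectWith(keep)(Seq("s", "v", "r"))) // List(s, v)
println(intersectWith(keep)(Seq("r", "a", "v"))) // List(v)
println(intersectWith(keep)(Seq("s", "r", "t"))) // List(s)
println(intersectWith(keep)(null))               // List()
```

The result keeps the element order of the array being filtered, not the order of the external list.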

Split text and find the common words in a Spark Dataframe

I am working on Scala with Spark and I have a dataframe including two columns with text.
Those columns are with the format of "term1, term2, term3,..." and I want to create a third column with the common terms of the two of them.
For example
Col1
orange, apple, melon
party, clouds, beach
Col2
apple, apricot, watermelon
black, yellow, white
The result would be
Col3
1
0
What I have done until now is to create a udf that splits the text and get the intersection of the two columns.
val common_terms = udf((a: String, b: String) => if (a.isEmpty || b.isEmpty) {
0
} else {
split(a, ",").intersect(split(b, ",")).length
})
And then on my dataframe
val results = termsDF.withColumn("col3", common_terms(col("col1"), col("col2")))
But I have the following error
Error:(96, 13) type mismatch;
found : String
required: org.apache.spark.sql.Column
split(a, ",").intersect(split(b, ",")).length
I would appreciate any help, since I am new to Scala and just trying to learn from online tutorials.
EDIT:
val common_authors = udf((a: String, b: String) => if (a != null || b != null) {
0
} else {
val tempA = a.split( ",")
val tempB = b.split(",")
if ( tempA.isEmpty || tempB.isEmpty ) {
0
} else {
tempA.intersect(tempB).length
}
})
After the edit, termsDF.show() runs. But if I do something like termsDF.orderBy(desc("col3")), I get a java.lang.NullPointerException.
Try
val common_terms = udf((a: String, b: String) => if (a.isEmpty || b.isEmpty) {
0
} else {
  val tmp1 = a.split(",")
  val tmp2 = b.split(",")
  tmp1.intersect(tmp2).length
})
val results = termsDF.withColumn("col3", common_terms($"col1", $"col2"))
results.show
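split(a, ",") is a Spark column function.
Inside a UDF you are working with plain strings, so you need to use String.split(), which is an ordinary method available in Scala.
After the edit: change the null check from != to ==, so that 0 is returned when a == null || b == null.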
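One more detail is worth checking: the sample data in the question has a space after each comma, so a bare split(",") produces " apple" on one side and "apple" on the other, and the two do not match. The sketch below is a plain Scala version of the UDF body that handles nulls and trims each term; commonTermCount and terms are hypothetical names, and whether trimming is wanted is an assumption about the data.

```scala
// Split a comma-separated string into trimmed, non-empty terms; null yields no terms.
def terms(s: String): List[String] =
  if (s == null) Nil
  else s.split(",").map(_.trim).filter(_.nonEmpty).toList

// Number of terms that the two strings have in common.
def commonTermCount(a: String, b: String): Int =
  terms(a).intersect(terms(b)).length

println(commonTermCount("orange, apple, melon", "apple, apricot, watermelon")) // 1
println(commonTermCount("party, clouds, beach", "black, yellow, white"))       // 0
println(commonTermCount(null, "apple"))                                        // 0
```

Wrapped as udf(commonTermCount _), the same function could be applied to the two columns.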
In Spark 2.4 SQL, you can get the same results without a UDF:
scala> val df = Seq(("orange,apple,melon","apple,apricot,watermelon"),("party,clouds,beach","black,yellow,white"), ("orange,apple,melon","apple,orange,watermelon")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala>
scala> df.createOrReplaceTempView("tasos")
scala> spark.sql(""" select col1,col2, filter(split(col1,','), x -> array_contains(split(col2,','),x) ) a1 from tasos """).show(false)
+------------------+------------------------+---------------+
|col1              |col2                    |a1             |
+------------------+------------------------+---------------+
|orange,apple,melon|apple,apricot,watermelon|[apple]        |
|party,clouds,beach|black,yellow,white      |[]             |
|orange,apple,melon|apple,orange,watermelon |[orange, apple]|
+------------------+------------------------+---------------+
If you want the size, then
scala> spark.sql(""" select col1,col2, filter(split(col1,','), x -> array_contains(split(col2,','),x) ) a1 from tasos """).withColumn("a1_size",size('a1)).show(false)
+------------------+------------------------+---------------+-------+
|col1              |col2                    |a1             |a1_size|
+------------------+------------------------+---------------+-------+
|orange,apple,melon|apple,apricot,watermelon|[apple]        |1      |
|party,clouds,beach|black,yellow,white      |[]             |0      |
|orange,apple,melon|apple,orange,watermelon |[orange, apple]|2      |
+------------------+------------------------+---------------+-------+
scala>

How to run udf on every column in a dataframe?

I have a UDF:
val TrimText = (s: AnyRef) => {
//does logic returns string
}
And a dataframe:
var df = spark.read.option("sep", ",").option("header", "true").csv(root_path + "/" + file)
I would like to perform TrimText on every value in every column in the dataframe.
However, the problem is, I have a dynamic number of columns. I know I can get the list of columns by df.columns. But I am unsure on how this will help me with my issue. How can I solve this problem?
TLDR Issue - Performing a UDF on every column in a dataframe, when the dataframe has an unknown number of columns
Attempting to use:
df.columns.foldLeft( df )( (accDF, c) =>
accDF.withColumn(c, TrimText(col(c)))
)
Throws this error:
error: type mismatch;
found : String
required: org.apache.spark.sql.Column
accDF.withColumn(c, TrimText(col(c)))
TrimText is supposed to return a string and expects the input to be a value in a column. So it is going to standardize every value in every row of the entire dataframe.
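You can use foldLeft to traverse the column list and iteratively apply withColumn to the DataFrame using your UDF. Note that TrimText must be a real UDF (wrapped in udf(...)) so that applying it to col(c) yields a Column; a plain Scala function returns a String, which causes the type mismatch shown above: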
df.columns.foldLeft( df )( (accDF, c) =>
accDF.withColumn(c, TrimText(col(c)))
)
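If foldLeft is unfamiliar, the pattern is easier to see on an ordinary collection. The sketch below uses a Map as a stand-in for a single row (column name to value) and applies trim once per column; the data and names are invented for illustration, and no Spark is involved.

```scala
// Stand-in for one row: column name -> value.
val row = Map("name" -> "  Ada ", "city" -> " London")
val columns = row.keys.toList

// Same shape as df.columns.foldLeft(df)(...):
// the accumulator starts as the original and is rewritten once per column.
val trimmed = columns.foldLeft(row) { (acc, c) =>
  acc.updated(c, acc(c).trim)
}

println(trimmed("name")) // Ada
println(trimmed("city")) // London
```

With a DataFrame the accumulator is the DataFrame itself, and each step replaces one column via withColumn.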
>> I would like to perform TrimText on every value in every column in the dataframe.
>> I have a dynamic number of columns.
When a SQL function is available for trimming, why use a UDF? See whether the code below fits your needs:
import org.apache.spark.sql.functions._
spark.udf.register("TrimText", (x:String) => ..... )
val df2 = sc.parallelize(List(
(26, true, 60000.00),
(32, false, 35000.00)
)).toDF("age", "education", "income")
val cols2 = df2.columns.toSet
df2.createOrReplaceTempView("table1")
val query = "select " + buildcolumnlst(cols2) + " from table1 "
println(query)
val dfresult = spark.sql(query)
dfresult.show()
def buildcolumnlst(myCols: Set[String]) = {
myCols.map(x => "TrimText(" + x + ")" + " as " + x).mkString(",")
}
Result:
select trim(age) as age,trim(education) as education,trim(income) as income from table1
+---+---------+-------+
|age|education| income|
+---+---------+-------+
| 26|     true|60000.0|
| 32|    false|35000.0|
+---+---------+-------+
val a = sc.parallelize(Seq(("1 "," 2"),(" 3","4"))).toDF()
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def TrimText(s: Column): Column = {
//does logic returns string
trim(s)
}
a.select(a.columns.map(c => TrimText(col(c))):_*).show

Spark DataFrame - drop null values from column

Given a dataframe :
val df = sc.parallelize(Seq(("foo", ArrayBuffer(null,"bar",null)), ("bar", ArrayBuffer("one","two",null)))).toDF("key", "value")
df.show
+---+--------------------------+
|key|                     value|
+---+--------------------------+
|foo|ArrayBuffer(null,bar,null)|
|bar|ArrayBuffer(one, two,null)|
+---+--------------------------+
I'd like to drop null from column value. After removal the dataframe should look like this :
+---+--------------------------+
|key|                     value|
+---+--------------------------+
|foo|ArrayBuffer(bar)          |
|bar|ArrayBuffer(one, two)     |
+---+--------------------------+
Any suggestions are welcome. Thanks.
You'll need a UDF here. For example, with a flatMap:
val filterOutNull = udf((xs: Seq[String]) =>
Option(xs).map(_.flatMap(Option(_))))
df.withColumn("value", filterOutNull($"value"))
where external Option with map handles NULL columns:
Option(null: Seq[String]).map(identity)
Option[Seq[String]] = None
Option(Seq("foo", null, "bar")).map(identity)
Option[Seq[String]] = Some(List(foo, null, bar))
and ensures we don't fail with NPE when input is NULL / null by mapping
NULL -> null -> None -> None -> NULL
where null is a Scala null and NULL is a SQL NULL.
The internal flatMap flattens a sequence of Options effectively filtering nulls:
Seq("foo", null, "bar").flatMap(Option(_))
Seq[String] = List(foo, bar)
A more imperative equivalent could be something like this:
val imperativeFilterOutNull = udf((xs: Seq[String]) =>
if (xs == null) xs
else for {
x <- xs
if x != null
} yield x)
Option 1: using UDF:
val filterNull = udf((arr : Seq[String]) => arr.filter((x: String) => x != null))
df.withColumn("value", filterNull($"value")).show()
Option 2: no UDF
df.withColumn("value", explode($"value")).filter($"value".isNotNull).groupBy("key").agg(collect_list($"value")).show()
Note that this is less efficient...
You can also use spark-daria, which provides com.github.mrpowers.spark.daria.sql.functions.arrayExNull.
From the documentation:
"Like array but doesn't include null element"