Spark/Scala - Failed to execute user defined function

I have the following code that works.
val locList = Source.fromInputStream(getClass.getResourceAsStream("/locations.txt")).getLines().toList
def locCheck(col: String): Boolean = locList.contains(col)
def locUDF = udf[Boolean, String](locCheck)
But when I add a toUpperCase to make it
val locList = Source.fromInputStream(getClass.getResourceAsStream("/locations.txt")).getLines().toList
def locCheck(col: String): Boolean = locList.contains(col.toUpperCase)
def locUDF = udf[Boolean, String](locCheck)
I run into a "Failed to execute user defined function" error caused by java.lang.NullPointerException.
I am using the udf as df.filter(locUDF('location)).count()
What am I doing wrong here, and how do I fix it?

There is nothing wrong with the function or the udf. The problem is with the data that comes into the udf.
In your case, if the column location contains null values, those nulls are passed into the udf, so col is null.
You then get a NullPointerException when col.toUpperCase is called on a null col.
You can simply check for null values in the function:
def locCheck(col: String): Boolean = if (col == null) false else locList.contains(col.toUpperCase)
Or you can use Option to handle this:
def locCheck(col: String): Boolean = Option(col).map(_.toUpperCase).exists(locList.contains)
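To see why Option helps here, a quick sketch of how Option.apply treats null:
Option(null: String).map(_.toUpperCase) // None
Option("berlin").map(_.toUpperCase)     // Some(BERLIN)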
Hope this helps!

Related

Chained Dataframe transformations in Scala with arguments and conditions

I have a dataframe in Scala for which I want to add transformations and filters depending on conditions passed as arguments to the function.
For example, I'm trying to do something like this:
val lst_conditions = List("condition1","condition2",..., "conditionN")
for (condition_string <- lst_conditions) {
var new_df = df.transform(FilterOrNot(condition_string))
}
But the way I'm defining the function below doesn't work:
def FilterOrNot(c: String) (df: DataFrame): DataFrame = {
if (c == "condition1") df.filter($"price" >= $"avg_price")
else if (c == "condition2") df.filter($"price" >= $"median_price")
// If the condition is different do nothing.
}
The error I get is:
<console>:73: error: type mismatch;
found : Unit
required: org.apache.spark.sql.DataFrame
(which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
else if ...
^
How can I implement this?
The function fails to compile because the if / else if chain has no final else: when neither condition matches, the expression falls through to an implicit else (), which has type Unit and does not conform to the declared result type DataFrame.
The fix is to add a final default to your custom transformation that returns the DataFrame unchanged:
def FilterOrNot(c: String) (df: DataFrame): DataFrame = {
if (c == "condition1") df.filter($"price" >= $"avg_price")
else if (c == "condition2") df.filter($"price" >= $"median_price")
else df
}
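As a side note, the loop in the question rebuilds new_df from the original df on every iteration instead of chaining the conditions. A foldLeft sketch (assuming the lst_conditions list from the question) threads the result of one transform into the next:
val finalDf = lst_conditions.foldLeft(df)((acc, c) => acc.transform(FilterOrNot(c)))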

rdd.mapPartitions to return a Boolean from udf in Spark Scala

I am using Scala 2.11 with Spark 2.1
I have a MutableList[String] defined as variable objectKeys
I am trying to use Spark parallelize as follows:
val numPartitioning = 10
val rdd = sc.parallelize(objectKeys, numPartitioning);
val x = rdd.mapPartitions(read_files_from_list(objectKeys))
def read_files_from_list(keys:MutableList[String]): Boolean = {
// my logic to iterate over keys
if success
return true;
else
return false;
}
However I am getting the error type mismatch; found : Boolean required: Iterator[String] ⇒ Iterator[?] Error occurred in an application involving default arguments.
What changes do I need to make so that my function 'read_files_from_list' accepts a MutableList[String] and returns a Boolean?
mapPartitions expects an iterator-to-iterator transformation, but you are returning a single constant Boolean.
Here is how you can write the function:
def read_files_from_list(keys: Iterator[String]): Iterator[Boolean] = keys.map { key =>
  // your per-key logic goes here; make the success Boolean the last expression
  // instead of using `return`, which does not behave as expected inside a lambda
  val success = true // placeholder for the real check
  success
}
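Note also that mapPartitions expects the function itself, not the result of calling it, so the call from the question becomes:
val x = rdd.mapPartitions(read_files_from_list)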

Spark UDF as function parameter, UDF is not in function scope

I have a few UDFs that I'd like to pass along as a function argument along with data frames.
One way to do this might be to create the UDF within the function, but that would create and destroy several instances of the UDF without reusing it, which might not be the best way to approach this problem.
Here's a sample piece of code -
val lkpUDF = udf{(i: Int) => if (i > 0) 1 else 0}
val df = inputDF1
.withColumn("new_col", lkpUDF(col("c1")))
val df2 = inputDF2
.withColumn("new_col", lkpUDF(col("c1")))
Instead of doing the above, I'd ideally want to do something like this -
val lkpUDF = udf{(i: Int) => if (i > 0) 1 else 0}
def appendCols(df: DataFrame, lkpUDF: ?): DataFrame = {
df
.withColumn("new_col", lkpUDF(col("c1")))
}
val df = appendCols(inputDF, lkpUDF)
The above UDF is pretty simple, but in my case it can return a primitive type or a user defined case class type. Any thoughts/ pointers would be much appreciated. Thanks.
Your function with the appropriate signature needs to be this:
import org.apache.spark.sql.{DataFrame, UserDefinedFunction}
import org.apache.spark.sql.functions.col
def appendCols(df: DataFrame, func: UserDefinedFunction): DataFrame = {
df.withColumn("new_col", func(col("c1")))
}
The Scala REPL is quite helpful for showing the type of an initialized value:
scala> val lkpUDF = udf{(i: Int) => if (i > 0) 1 else 0}
lkpUDF: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,List(IntegerType))
Also, if the function you pass into the udf wrapper has a return type of Any (which will be the case if it can return either a primitive or a user-defined case class), the UDF will fail to compile with an exception like:
java.lang.UnsupportedOperationException: Schema for type Any is not supported
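One way around that restriction (a sketch, not from the original answer) is to give the udf a concrete case class return type instead of Any; Spark then encodes the result as a struct column:
case class Flag(value: Int) // hypothetical wrapper type, just for illustration
val lkpStructUDF = udf { (i: Int) => if (i > 0) Flag(1) else Flag(0) }
// the resulting column is a struct; read the field back as col("new_col.value")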

Try / Option with null

I'm searching for a possibility in Scala to call a function and get an Option as the result which is None iff either an exception is raised when calling the method or the method returns null. Otherwise the Option should hold the value of the result.
I know that Try can be used for the first part, but I don't know how to handle the second part:
val result = Try(myFunction).toOption
If the method now returns null (because it is not a scala function but a Java function), result is Some(null) instead of None.
As far as I know, there is only one method in the Scala standard library to convert null to None - Option.apply(x) - so you have to use it manually:
val result = Try(myFunction).toOption.flatMap{Option(_)}
// or
val result = Try(Option(myFunction)).toOption.flatten
You could create your own helper method like this:
implicit class NotNullOption[T](val t: Try[T]) extends AnyVal {
def toNotNullOption = t.toOption.flatMap{Option(_)}
}
scala> Try(null: String).toNotNullOption
res0: Option[String] = None
scala> Try("a").toNotNullOption
res1: Option[String] = Some(a)
You can also do this:
val result = Try(myFunction).toOption.filter(_ != null)
which looks and feels better than .flatten or .flatMap(Option(_))
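For example, assuming the function may return null:
Try(null: String).toOption.filter(_ != null) // None
Try("a").toOption.filter(_ != null)          // Some(a)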
You can also do pattern matching:
val result = myFunction() match {
case null => None
case x => Some(x)
}
but @senia's answer looks more "Scala style"

How to check to see if a string is a decimal number in Scala

I'm still fairly new to Scala, and I'm discovering new and interesting ways for doing things on an almost daily basis, but they're not always sensible, and sometimes already exist within the language as a construct and I just don't know about them. So, with that preamble, I'm checking to see if a given string is comprised entirely of digits, so I'm doing:
def isAllDigits(x: String) = x.map(Character.isDigit(_)).reduce(_&&_)
is this sensible or just needlessly silly? Is there a better way? Is it better just to call x.toInt and catch the exception, or is that less idiomatic? Is there a performance benefit/drawback to either?
Try this:
def isAllDigits(x: String) = x forall Character.isDigit
forall takes a function (in this case Character.isDigit) that takes an argument that is of the type of the elements of the collection and returns a Boolean; it returns true if the function returns true for all elements in the collection, and false otherwise.
Do you want to know if the string is an integer? Then .toInt it and catch the exception. Do you instead want to know if the string is all digits? Then ask one of:
s.forall(_.isDigit)
s matches """\d+"""
You may also consider something like this:
import scala.util.control.Exception.allCatch
def isLongNumber(s: String): Boolean = (allCatch opt s.toLong).isDefined
// or
def isDoubleNumber(s: String): Boolean = (allCatch opt s.toDouble).isDefined
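For example (hypothetical inputs, just to show the behaviour):
isLongNumber("12345")    // true
isDoubleNumber("12.5e3") // true
isDoubleNumber("12x")    // false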
You could simply use a regex for this.
val onlyDigitsRegex = "^\\d+$".r
def isAllDigits(x: String) = x match {
case onlyDigitsRegex() => true
case _ => false
}
Or simply
def isAllDigits(x: String) = x.matches("^\\d+$")
And to improve this a little bit, you can use the pimp my library pattern to make it a method on your string:
implicit def AllDigits(x: String) = new { def isAllDigits = x.matches("^\\d+$") }
"12345".isAllDigits // => true
"12345foobar".isAllDigits // => false
Starting with Scala 2.13, we can use String::toDoubleOption to determine whether a String is a decimal number or not:
"324.56".toDoubleOption.isDefined // true
"4.06e3".toDoubleOption.isDefined // true
"9w01.1".toDoubleOption.isDefined // false
Similar option to determine if a String is a simple Int:
"324".toIntOption.isDefined // true
"à32".toIntOption.isDefined // false
"024".toIntOption.isDefined // true
import scala.util.Try
object NumCruncher {
def isShort(aString: String): Boolean = Try(aString.toShort).isSuccess
def isInt(aString: String): Boolean = Try(aString.toInt).isSuccess
def isLong(aString: String): Boolean = Try(aString.toLong).isSuccess
def isDouble(aString: String): Boolean = Try(aString.toDouble).isSuccess
def isFloat(aString: String): Boolean = Try(aString.toFloat).isSuccess
/**
 *
 * @param x the string to check
 * @return true if the parameter passed is a Java primitive number
 */
def isNumber(x: String): Boolean = {
List(isShort(x), isInt(x), isLong(x), isDouble(x), isFloat(x))
.foldLeft(false)(_ || _)
}
}
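For example:
NumCruncher.isNumber("3.14")    // true
NumCruncher.isNumber("3.14.15") // false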
Try might not be the optimal choice performance-wise, but otherwise it's neat:
scala> import scala.util.Try
scala> Try{ "123x".toInt }
res4: scala.util.Try[Int] = Failure(java.lang.NumberFormatException: For input string: "123x")
scala> Try{ "123x".toInt }.isSuccess
res5: Boolean = false
@Jesper's answer is spot on.
Do NOT do what I'm suggesting below (explanation follows)
Since you are checking if a given string is numeric (title states you want a decimal), the assumption is that you intend to make a conversion if the forall guard passes.
A simple implicit in scope will save a whopping 9 key strokes ;-)
implicit def str2Double(x: String) = x.toDouble
Why this is dangerous
def takesDouble(x: Double) = x
The compiler will now allow takesDouble("runtime fail") since the implicit tries to convert whatever string you use to Double, with zero guarantee of success, yikes.
Implicit conversions, then, seem better suited to situations where an acceptable default value can be supplied on conversion failure (which is not always the case), so use implicits with caution.
Here is one more:
import scala.util.Try
val doubleConverter: (String => Try[Double]) = (s: String) => Try { s.map(c => if (Character.isDigit(c) || c == '.') Some(c) else None).flatten.mkString.toDouble }
val d1: Try[Double] = doubleConverter("+ 1234.0%")
val d2: Try[Double] = doubleConverter("+ 1234..0%")
Based on @Jesper's brilliant solution, in this piece of code I take care of the NullPointerException using Option:
def isValidPositiveNumber(baseString: Option[String]): Boolean = baseString match {
case Some(code) => !code.isEmpty && (code forall Character.isDigit)
case None => false
}