The following Scala (Spark 1.6) code for reading a value from a Row fails with a NullPointerException when the value is null.
val test = row.getAs[Int]("ColumnName").toString
while this works fine
val test1 = row.getAs[Int]("ColumnName") // returns 0 for null
val test2 = test1.toString // converts to String fine
What is causing the NullPointerException, and what is the recommended way to handle such cases?
PS: I am getting the row from a DataFrame as follows:
val myRDD = myDF.repartition(partitions)
  .mapPartitions { rows =>
    rows.flatMap { row =>
      functionWithRows(row) //has above logic to read null column which fails
    }
  }
functionWithRows then throws the above-mentioned NullPointerException.
myDF schema:
root
|-- LDID: string (nullable = true)
|-- KTAG: string (nullable = true)
|-- ColumnName: integer (nullable = true)
getAs is defined as:
def getAs[T](i: Int): T = get(i).asInstanceOf[T]
When we call toString we invoke Object.toString, which does not depend on the type, so the asInstanceOf[T] gets dropped by the compiler, i.e.
row.getAs[Int](0).toString -> row.get(0).toString
We can confirm this by compiling a small Scala snippet:
import org.apache.spark.sql._

object Test {
  val row = Row(null)
  row.getAs[Int](0).toString
}
and then compiling it:
$ scalac -classpath $SPARK_HOME/jars/'*' -print test.scala
[[syntax trees at end of cleanup]] // test.scala
package <empty> {
  object Test extends Object {
    private[this] val row: org.apache.spark.sql.Row = _;
    <stable> <accessor> def row(): org.apache.spark.sql.Row = Test.this.row;
    def <init>(): Test.type = {
      Test.super.<init>();
      Test.this.row = org.apache.spark.sql.Row.apply(scala.this.Predef.genericWrapArray(Array[Object]{null}));
      Test.this.row().getAs(0).toString();
      ()
    }
  }
}
So the proper way would be:
String.valueOf(row.getAs[Int](0))
To handle null values safely, it is better practice to check with isNullAt before reading the value, as the documentation suggests:
getAs
<T> T getAs(int i)
Returns the value at position i. For primitive types if value is null it
returns 'zero value' specific for primitive ie. 0 for Int - use isNullAt to ensure that value is not null
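For example, a null-safe read could look like the following minimal sketch (the column name is taken from the question; the Option-based variant is my own suggestion, not part of the answer above):

val test =
  if (row.isNullAt(row.fieldIndex("ColumnName"))) "" // pick whatever default suits your case
  else row.getAs[Int]("ColumnName").toString

// Alternatively, go through Option so that null becomes None instead of a "zero value":
val testOpt: Option[String] = Option(row.getAs[Any]("ColumnName")).map(_.toString)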
I agree the behaviour is confusing, though.
Related
I'm trying to offload some computations from Python to Scala when using Apache Spark. I would like to use the class interface from Java to be able to use a persistent variable, like so (this is a nonsensical MWE based on my more complex use case):
package mwe
import org.apache.spark.sql.api.java.UDF1
class SomeFun extends UDF1[Int, Int] {
  private var prop: Int = 0

  override def call(input: Int): Int = {
    if (prop == 0) {
      prop = input
    }
    prop + input
  }
}
Now I'm attempting to use this class from within pyspark:
import pyspark
from pyspark.sql import SQLContext
from pyspark import SparkContext
conf = pyspark.SparkConf()
conf.set("spark.jars", "mwe.jar")
sc = SparkContext.getOrCreate(conf)
sqlContext = SQLContext.getOrCreate(sc)
sqlContext.registerJavaFunction("fun", "mwe.SomeFun")
df0 = sc.parallelize((i,) for i in range(6)).toDF(["num"])
df1 = df0.selectExpr("fun(num) + 3 as new_num")
df1.show()
And get the following exception:
pyspark.sql.utils.AnalysisException: u"cannot resolve '(UDF:fun(num) + 3)' due to data type mismatch: differing types in '(UDF:fun(num) + 3)' (struct<> and int).; line 1 pos 0;\n'Project [(UDF:fun(num#0L) + 3) AS new_num#2]\n+- AnalysisBarrier\n +- LogicalRDD [num#0L], false\n"
What is the correct way to implement this? Will I have to resort to Java itself for the class? I'd greatly appreciate hints!
The source of the exception is the use of incompatible types:
First of all, o.a.s.sql.api.java.UDF* objects require external Java types (not Scala types), so a UDF expecting integers should take a boxed Integer (java.lang.Integer), not Int.
class SomeFun extends UDF1[Integer, Integer] {
...
override def call(input: Integer): Integer = {
...
Also, unless you use legacy Python, the num column uses LongType, not IntegerType:
df0.printSchema()
root
|-- num: long (nullable = true)
So the actual signature should be
class SomeFun extends UDF1[java.lang.Long, java.lang.Long] {
...
override def call(input: java.lang.Long): java.lang.Long = {
...
or the data should be cast before applying the UDF:
df0.selectExpr("fun(cast(num as integer)) + 3 as new_num")
Finally, mutable state is not allowed in UDFs. It won't cause an exception, but the overall behavior will be non-deterministic.
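Putting the type fix together, a complete corrected class could look like the sketch below. The body is a stateless placeholder (it just adds 1), since the original prop-based logic relies on mutable state, which, as noted above, is not allowed; the null check is my addition, because a boxed java.lang.Long can be null.

package mwe

import org.apache.spark.sql.api.java.UDF1

// Stateless UDF taking and returning boxed java.lang.Long,
// matching the LongType column coming from the Python side.
class SomeFun extends UDF1[java.lang.Long, java.lang.Long] {
  override def call(input: java.lang.Long): java.lang.Long = {
    if (input == null) null
    else java.lang.Long.valueOf(input.longValue + 1L)
  }
}

With this signature, the existing registerJavaFunction call and the fun(num) expression from the question should resolve without the cast.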
I would like to create a DataSet (or DataStream) from a collection of case classes that contain Option values.
In the created table columns resulting from Option values should either contain NULL or the actual primitive value.
This is what I tried:
import java.sql.Timestamp

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row

object OptionExample {

  case class Event(time: Timestamp, id: String, value: Option[Int])

  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val tEnv = TableEnvironment.getTableEnvironment(env)

    val data = env.fromCollection(Seq(
      Event(Timestamp.valueOf("2018-01-01 00:01:00"), "a", Some(3)),
      Event(Timestamp.valueOf("2018-01-01 00:03:00"), "a", None),
      Event(Timestamp.valueOf("2018-01-01 00:03:00"), "b", Some(7)),
      Event(Timestamp.valueOf("2018-01-01 00:02:00"), "a", Some(5))
    ))

    val table = tEnv.fromDataSet(data)

    table.printSchema()
    // root
    //  |-- time: Timestamp
    //  |-- id: String
    //  |-- value: Option[Integer]

    val result = table
      .groupBy('id)
      .select('id, 'value.avg as 'averageValue)

    // Print results
    val ds: DataSet[Row] = result.toDataSet
    ds.print()
  }
}
But this causes an Exception in the aggregation part...
org.apache.flink.table.api.ValidationException: Expression avg('value) failed on input check: avg requires numeric types, get Option[Integer] here
...so with this approach Option does not get converted into a numeric type with NULLs as described above.
How can I achieve this with Flink?
(I'm coming from Apache Spark, there Datasets created from case classes with Options have this behaviour. I would like to achieve something similar with Flink)
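I have not verified this against Flink, but one workaround to sketch would be to unwrap the Option into a boxed, nullable java.lang.Integer before handing the data to the Table API. The FlatEvent class below is hypothetical, and whether Flink then treats the column as a nullable numeric type is exactly the open question, so treat it as an experiment rather than a known answer:

// Hypothetical variant of Event with a boxed, nullable field instead of Option[Int].
case class FlatEvent(time: Timestamp, id: String, value: java.lang.Integer)

// None becomes null, Some(x) becomes the boxed value.
val flatData = data.map(e => FlatEvent(e.time, e.id, e.value.map(Int.box).orNull))
val flatTable = tEnv.fromDataSet(flatData)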
Imagine that you have the following case classes:
case class B(key: String, value: Int)
case class A(name: String, data: B)
Given an instance of A, how do I create a Spark Row? e.g.
val a = A("a", B("b", 0))
val row = ???
NOTE: Given row I need to be able to get data with:
val name: String = row.getAs[String]("name")
val b: Row = row.getAs[Row]("data")
The following seems to match what you're looking for.
scala> spark.version
res0: String = 2.3.0
scala> val a = A("a", B("b", 0))
a: A = A(a,B(b,0))
import org.apache.spark.sql.Encoders
val schema = Encoders.product[A].schema
scala> schema.printTreeString
root
 |-- name: string (nullable = true)
 |-- data: struct (nullable = true)
 |    |-- key: string (nullable = true)
 |    |-- value: integer (nullable = false)
val values = a.productIterator.toSeq.toArray
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
val row: Row = new GenericRowWithSchema(values, schema)
scala> val name: String = row.getAs[String]("name")
name: String = a
// the following won't work since B =!= Row
scala> val b: Row = row.getAs[Row]("data")
java.lang.ClassCastException: B cannot be cast to org.apache.spark.sql.Row
... 55 elided
Very short, but probably not the fastest, as it first creates a DataFrame and then collects it again:
import session.implicits._
val row = Seq(a).toDF().first()
@Jacek Laskowski's answer is great!
To complete it, here is some syntactic sugar:
val row = Row(a.productIterator.toSeq: _*)
And a recursive method, if you happen to have nested case classes:
def productToRow(product: Product): Row = {
  val sequence = product.productIterator.toSeq.map {
    case product: Product => productToRow(product)
    case e => e
  }
  Row(sequence: _*)
}
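For example (my own illustration, not part of the original answer), applied to the instance from the question this produces a nested Row. Note that a Row built this way carries no schema, so nested values have to be read by position rather than by name:

val row = productToRow(A("a", B("b", 0))) // Row("a", Row("b", 0))

val name = row.getString(0)   // "a"
val data = row.getAs[Row](1)  // Row("b", 0)
val value = data.getInt(1)    // 0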
I don't think there is a public API that can do this directly. Internally, Spark uses the Encoder.toRow method to convert objects to org.apache.spark.sql.catalyst.expressions.UnsafeRow, but this method is private. You could try to:
Obtain Encoder for the class:
val enc: Encoder[A] = ExpressionEncoder()
Use reflection to access toRow method and set it to accessible.
Call it to convert object to UnsafeRow.
Obtain RowEncoder for the expected schema (enc.schema).
Convert UnsafeRow to Row.
I haven't tried this, so I cannot guarantee whether it will work.
I am trying to use Spark UDAF to summarize two existing columns into a new column. Most of the tutorials on Spark UDAF out there use indices to get the values in each column of the input Row. Like this:
input.getAs[String](1)
This is used in my update method (override def update(buffer: MutableAggregationBuffer, input: Row): Unit), and it works in my case as well. However, I want to use the field name of that column to get the value, like this:
input.getAs[String](ColumnNames.BehaviorType)
Here, ColumnNames.BehaviorType is a String constant defined in an object:
/**
 * Column names in the original dataset
 */
object ColumnNames {
  val JobSeekerID = "JobSeekerID"
  val JobID = "JobID"
  val Date = "Date"
  val BehaviorType = "BehaviorType"
}
This time it does not work. I got the following exception:
java.lang.IllegalArgumentException: Field "BehaviorType" does not exist.
  at org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:292)
  ...
  at org.apache.spark.sql.Row$class.getAs(Row.scala:333)
  at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:165)
  at com.recsys.UserBehaviorRecordsUDAF.update(UserBehaviorRecordsUDAF.scala:44)
Some relevant code segments:
This is part of my UDAF:
class UserBehaviorRecordsUDAF extends UserDefinedAggregateFunction {

  override def inputSchema: StructType = StructType(
    StructField("JobID", IntegerType) ::
    StructField("BehaviorType", StringType) :: Nil)

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    println("XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX")
    println(input.schema.treeString)
    println
    println(input.mkString(","))
    println
    println(this.inputSchema.treeString)
    // println
    // println(bufferSchema.treeString)

    input.getAs[String](ColumnNames.BehaviorType) match { //ColumnNames.BehaviorType //1 //TODO WHY??
      case BehaviourTypes.viewed_job =>
        buffer(0) =
          buffer.getAs[Seq[Int]](0) :+ //Array[Int] //TODO WHY??
            input.getAs[Int](0) //ColumnNames.JobID
      case BehaviourTypes.bookmarked_job =>
        buffer(1) =
          buffer.getAs[Seq[Int]](1) :+ //Array[Int]
            input.getAs[Int](0) //ColumnNames.JobID
      case BehaviourTypes.applied_job =>
        buffer(2) =
          buffer.getAs[Seq[Int]](2) :+ //Array[Int]
            input.getAs[Int](0) //ColumnNames.JobID
    }
  }
The following is the part of the code that calls the UDAF:
val ubrUDAF = new UserBehaviorRecordsUDAF

val userProfileDF = userBehaviorDS
  .groupBy(ColumnNames.JobSeekerID)
  .agg(
    ubrUDAF(
      userBehaviorDS.col(ColumnNames.JobID),       //userBehaviorDS.col(ColumnNames.JobID)
      userBehaviorDS.col(ColumnNames.BehaviorType) //userBehaviorDS.col(ColumnNames.BehaviorType)
    ).as("profile str"))
It seems the field names in the schema of the input Row are not passed into the UDAF:
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
root
|-- input0: integer (nullable = true)
|-- input1: string (nullable = true)
30917,viewed_job
root
|-- JobID: integer (nullable = true)
|-- BehaviorType: string (nullable = true)
What is the problem in my code?
I also want to use the field names from my inputSchema in my update method to create maintainable code.
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

class MyUDAF extends UserDefinedAggregateFunction {
  def update(buffer: MutableAggregationBuffer, input: Row) = {
    // Re-attach the UDAF's inputSchema to the incoming row so fields can be read by name.
    val inputWSchema = new GenericRowWithSchema(input.toSeq.toArray, inputSchema)
    // ... then e.g. inputWSchema.getAs[String]("BehaviorType") works as expected.
  }
}
Ultimately, I switched to an Aggregator, which ran in half the time.
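As an alternative sketch (my addition, not part of the answer above): because inputSchema is a plain StructType, you can also resolve the field positions by name once and then read by index inside update. This fragment is meant to live inside the UserBehaviorRecordsUDAF class from the question:

// Resolve positions once from the UDAF's own inputSchema; the field names are known
// there even though incoming rows only carry generated names like input0, input1.
private lazy val jobIdIdx = inputSchema.fieldIndex("JobID")
private lazy val behaviorTypeIdx = inputSchema.fieldIndex("BehaviorType")

override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
  val behaviorType = input.getString(behaviorTypeIdx)
  val jobId = input.getInt(jobIdIdx)
  // ... dispatch on behaviorType and append jobId to the buffer as before ...
}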